From 62c13faafafaee7b087e007ffae4b09ff32db341 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=EC=9C=A0=EC=B0=BD=EC=9A=B1?= Date: Sat, 20 Jun 2026 18:18:54 +0900 Subject: [PATCH] docs: implementation plan for project-review remediation --- .../2026-06-20-project-review-remediation.md | 526 ++++++++++++++++++ 1 file changed, 526 insertions(+) create mode 100644 docs/superpowers/plans/2026-06-20-project-review-remediation.md diff --git a/docs/superpowers/plans/2026-06-20-project-review-remediation.md b/docs/superpowers/plans/2026-06-20-project-review-remediation.md new file mode 100644 index 0000000..79b7f71 --- /dev/null +++ b/docs/superpowers/plans/2026-06-20-project-review-remediation.md @@ -0,0 +1,526 @@ +# Project Review Remediation Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans (inline, batched with checkpoints) to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Fix all HIGH/MEDIUM/LOW findings from the 2026-06-20 multi-agent project review, including splitting the 5119-line `sqlite_store.py` god object. + +**Architecture:** Mechanical, low-risk fixes first (security, concurrency, deps), each guarded by the existing 367-test suite. The god-file split is last and incremental: move a cohesive method group into a new module, re-export/delegate, run the full suite, commit — repeat. No behavior changes during the split. + +**Tech Stack:** Python 3.13, stdlib `http.server` + `sqlite3` + `urllib`, OpenCV/Pillow/NumPy (lazy), vanilla-JS frontend. Tests: pytest 8.4 + pytest-cov. + +## Global Constraints + +- **Air-gapped (폐쇄망) target:** every artifact installs/runs with NO internet for its OWN deps. External *search* APIs (Google/Naver/Vision) at runtime are an ALLOWED exception. +- **Do not "fix" allowed design:** external search API calls; operator-efficiency tradeoffs; relaxed `google_search` manual-query rule. Leave them. +- **Regression gate:** `python -m pytest -q` (367 tests, ~3.4 min) must stay green after every task. Run targeted tests during a task, full suite before each commit. +- **Immutability / style:** match existing code style; new objects over mutation where reasonable; functions <50 lines, files <800 lines (the split exists to enforce this). +- **No new runtime network deps.** Stdlib only for new code paths. +- **Commit format:** `: ` (feat/fix/refactor/test/chore/docs). Branch `fix/project-review-remediation`. + +--- + +## Phase 0: Safety net + +### Task 0: Branch + baseline + +- [ ] **Step 1:** Create branch +```bash +git checkout -b fix/project-review-remediation +``` +- [ ] **Step 2:** Baseline the suite (record green) +```bash +python -m pytest -q +``` +Expected: `367 passed` +- [ ] **Step 3:** No commit (branch only). + +--- + +## Phase 1: HIGH — HTTP server hardening (CORS, auth, body cap) + +**Files:** Modify `src/rights_filter/server/http_app.py`; Test `tests/rights_filter/server/test_http_app.py`. + +### Task 1: Remove wildcard CORS + +The operator GUI is served same-origin from this server, so it needs no CORS header at all. `Access-Control-Allow-Origin: *` lets any site the operator visits read biometric/case data from `localhost`. + +- [ ] **Step 1: Failing test** — add to `test_http_app.py`: +```python +def test_json_responses_do_not_send_wildcard_cors(http_server): + base = f"http://127.0.0.1:{http_server.server_port}" + response = urlopen(f"{base}/api/bootstrap") + assert response.headers.get("Access-Control-Allow-Origin") is None +``` +- [ ] **Step 2:** Run → FAIL (`assert '*' is None`). +- [ ] **Step 3:** In `http_app.py:207-214` `_json`, delete the line: +```python + self.send_header("Access-Control-Allow-Origin", "*") +``` +- [ ] **Step 4:** Run targeted test → PASS. Run full `test_http_app.py`. +- [ ] **Step 5:** Commit `fix: remove wildcard CORS from API responses`. + +### Task 2: Optional shared-token auth gate + +In 폐쇄망 a single shared bearer token (from env `COPYRIGHTER_AUTH_TOKEN`) is enough. If unset, server stays open (back-compat for dev) but logs a one-time warning; if set, every `/api/*`, `/media/*`, `/knowledge-media/*`, `/collected-media/*`, `/face-crop-media/*` request must carry `Authorization: Bearer ` (or `?token=` for `` src which cannot send headers). + +**Interfaces produced:** `build_server(..., auth_token: str | None = None)`. + +- [ ] **Step 1: Failing test**: +```python +def test_media_requires_token_when_configured(tmp_path): + store, image_store, static_dir = _make_server_deps(tmp_path) # existing helper pattern + server = build_server("127.0.0.1", 0, store, image_store, static_dir, auth_token="secret") + port = server.server_port + thread = threading.Thread(target=server.serve_forever, daemon=True); thread.start() + try: + with pytest.raises(HTTPError) as exc: + urlopen(f"http://127.0.0.1:{port}/api/bootstrap") + assert exc.value.code == 401 + req = Request(f"http://127.0.0.1:{port}/api/bootstrap", headers={"Authorization": "Bearer secret"}) + assert urlopen(req).status == 200 + finally: + server.shutdown() +``` +- [ ] **Step 2:** Run → FAIL (`build_server() got unexpected kwarg auth_token`). +- [ ] **Step 3:** Implement. In `build_server` signature add `auth_token: str | None = None`. At top of `do_GET`/`do_POST`/`do_PATCH`, before routing, call a guard: +```python + def _authorized(self) -> bool: + if not auth_token: + return True + header = self.headers.get("Authorization", "") + if header == f"Bearer {auth_token}": + return True + query_token = parse_qs(urlparse(self.path).query).get("token", [""])[0] + return query_token == auth_token + + def _require_auth(self) -> bool: + if self._authorized(): + return True + self._json({"error": "unauthorized"}, HTTPStatus.UNAUTHORIZED) + return False +``` +Add at the start of each handler body (after `path = _path(self.path)`): +```python + if not self._require_auth(): + return +``` +Import `parse_qs` from `urllib.parse`. Thread `auth_token` from `__main__.py` env (`os.environ.get("COPYRIGHTER_AUTH_TOKEN")`). +- [ ] **Step 4:** Run targeted + full `test_http_app.py` → PASS (existing tests pass `auth_token=None`). +- [ ] **Step 5:** Add `COPYRIGHTER_AUTH_TOKEN=` to `.env.example`. Commit `feat: optional shared-token auth gate for operator server`. + +### Task 3: Cap request body size + +`_body()` reads `Content-Length` unbounded. + +- [ ] **Step 1: Failing test**: POST with `Content-Length` over cap returns 413. +```python +def test_oversized_body_rejected(http_server): + base = f"http://127.0.0.1:{http_server.server_port}" + big = b"{" + b'"x":"' + b"a" * (11 * 1024 * 1024) + b'"}' + req = Request(f"{base}/api/knowledge/manual", data=big, headers={"Content-Type": "application/json"}) + with pytest.raises(HTTPError) as exc: + urlopen(req) + assert exc.value.code == 413 +``` +- [ ] **Step 2:** Run → FAIL. +- [ ] **Step 3:** In `_body()` (line 201-205), add constant `_MAX_BODY_BYTES = 10 * 1024 * 1024` at module level, and: +```python + def _body(self) -> dict[str, object]: + length = int(self.headers.get("Content-Length", "0") or "0") + if length > _MAX_BODY_BYTES: + raise _PayloadTooLarge() + if not length: + return {} + return json.loads(self.rfile.read(length).decode("utf-8")) +``` +Add `class _PayloadTooLarge(Exception): pass` and catch it in each handler → `self._json({"error": "payload too large"}, HTTPStatus.REQUEST_ENTITY_TOO_LARGE)`. +- [ ] **Step 4:** Run → PASS. +- [ ] **Step 5:** Commit `fix: cap HTTP request body size`. + +--- + +## Phase 2: HIGH — SQLite concurrency + atomic writes + +**Files:** Modify `src/rights_filter/server/sqlite_store.py`, `src/rights_filter/server/http_app.py`; Test `tests/rights_filter/server/test_sqlite_store.py`, `test_http_app.py`. + +### Task 4: WAL + busy_timeout in `_connect` + +- [ ] **Step 1: Failing test** (`test_sqlite_store.py`): +```python +def test_connect_uses_wal_and_busy_timeout(tmp_path): + store = CopyrighterStore(tmp_path / "c.sqlite3"); store.initialize() + with store._connect() as conn: + assert conn.execute("pragma journal_mode").fetchone()[0].lower() == "wal" + assert int(conn.execute("pragma busy_timeout").fetchone()[0]) >= 5000 +``` +- [ ] **Step 2:** Run → FAIL (`delete` != `wal`). +- [ ] **Step 3:** Edit `_connect` (3248-3252): +```python + def _connect(self) -> sqlite3.Connection: + conn = sqlite3.connect(self.db_path) + conn.execute("pragma foreign_keys = on") + conn.execute("pragma busy_timeout = 5000") + conn.row_factory = sqlite3.Row + return conn +``` +In `initialize()` (after `mkdir`, before/inside the first `_connect`), set WAL once: +```python + with self._connect() as conn: + conn.execute("pragma journal_mode = wal") + conn.executescript(...) # existing +``` +- [ ] **Step 4:** Run → PASS. Full suite (WAL persists; verify no test asserts `delete` mode). +- [ ] **Step 5:** Commit `fix: enable SQLite WAL and busy_timeout for concurrent operators`. + +### Task 5: Map `sqlite3.OperationalError` to 503 + +- [ ] **Step 1: Failing test** (`test_http_app.py`): monkeypatch a store method to raise `sqlite3.OperationalError("database is locked")`, assert handler returns 503 JSON not a crash. +```python +def test_operational_error_returns_503(http_server, monkeypatch): + monkeypatch.setattr(http_server.store_for_test, "bootstrap", + lambda: (_ for _ in ()).throw(sqlite3.OperationalError("database is locked"))) + base = f"http://127.0.0.1:{http_server.server_port}" + with pytest.raises(HTTPError) as exc: + urlopen(f"{base}/api/bootstrap") + assert exc.value.code == 503 +``` +(If the existing `http_server` fixture doesn't expose the store, add a minimal one matching the existing fixture pattern.) +- [ ] **Step 2:** Run → FAIL (500/connection abort). +- [ ] **Step 3:** In all three handlers, add before the generic catches: +```python + except sqlite3.OperationalError: + self._json({"error": "service busy, retry"}, HTTPStatus.SERVICE_UNAVAILABLE) +``` +Import `sqlite3` in `http_app.py`. +- [ ] **Step 4:** Run → PASS. +- [ ] **Step 5:** Commit `fix: surface SQLite lock contention as retryable 503`. + +### Task 6: Atomic decision + audit (transaction threading) + +Make `record_decision`'s state change + audit event commit in ONE transaction so the audit trail can't diverge. + +**Interfaces:** `_put(table, id, payload, *, conn=None)` and `add_audit_event(..., *, conn=None)` and `_create_or_update_watchlist_entry(..., *, conn=None)` accept an optional connection; when given, they reuse it (no own commit). Add a private `_transaction()` contextmanager. + +- [ ] **Step 1: Failing test**: simulate audit failure mid-decision, assert submission status is NOT committed (rolled back). +```python +def test_record_decision_rolls_back_when_audit_fails(tmp_path, monkeypatch, seeded_store): + store, sid = seeded_store + def boom(*a, **k): raise RuntimeError("audit down") + monkeypatch.setattr(store, "add_audit_event", boom) + with pytest.raises(RuntimeError): + store.record_decision(sid, "held", "m") + assert store._get("submissions", sid)["decisionStatus"] == "unreviewed" +``` +- [ ] **Step 2:** Run → FAIL (status persisted because `_put` committed independently). +- [ ] **Step 3:** Implement: + - Add helper: +```python + @contextmanager + def _transaction(self): + conn = self._connect() + try: + yield conn + conn.commit() + except Exception: + conn.rollback() + raise + finally: + conn.close() +``` + - Add `conn: sqlite3.Connection | None = None` param to `_put`, `add_audit_event`, `_create_or_update_watchlist_entry`, `_get`. When `conn` is provided, use it directly instead of `with self._connect()`; do not commit (the `_transaction` owns commit). + - Rewrite `record_decision` body (1004-1018) to wrap the three writes: +```python + with self._transaction() as conn: + submission = self._get("submissions", submission_id, conn=conn) + submission["decisionStatus"] = decision + submission["applicantStatus"] = (...same...) + self._put("submissions", submission_id, submission, conn=conn) + if decision in {"held", "rejected"}: + self._create_or_update_watchlist_entry(submission_id, decision, memo, image_store, conn=conn) + self.add_audit_event("rights.ops", "Operator decision created", submission_id, + f"{decision} · {memo or 'memo optional'}", conn=conn) + return self.review(submission_id) +``` + - `_connect()` in autocommit-less mode: pass `isolation_level` default keeps implicit transactions; ensure the helper functions that do `with self._connect() as conn:` are NOT affected (they keep their own path when `conn is None`). +- [ ] **Step 4:** Run targeted + full suite → PASS. +- [ ] **Step 5:** Commit `fix: make record_decision + audit atomic via single transaction`. + +> NOTE: apply the same `_transaction` wrapping to `mark_evidence_status` (state + audit) as a follow-up micro-task with the same pattern; keep each as its own commit. + +--- + +## Phase 3: HIGH — Air-gapped dependency manifest + +**Files:** Create `requirements.txt`, `requirements-lock.txt`, `docs/operations/offline-install.md`. + +### Task 7: Pin runtime deps + +- [ ] **Step 1:** Capture the known-good versions (from current working env): +``` +fastapi==0.115.0 +flask==3.1.2 +httpx==0.27.0 +numpy==2.3.5 +opencv-python-headless==4.13.0.92 +pillow==12.2.0 +requests==2.32.5 +``` +Write `requirements.txt` with these (the project's actual imports — verify each is imported in `src/`; drop any not imported, e.g. confirm `fastapi`/`flask` usage with `grep -rn "import fastapi\|import flask" src`). +- [ ] **Step 2:** Generate a full transitive lock on the build machine: +```bash +pip freeze > requirements-lock.txt +``` +- [ ] **Step 3:** Document offline install in `docs/operations/offline-install.md`: +``` +# Build machine (has internet): +pip download -r requirements-lock.txt -d wheelhouse/ +# Transfer wheelhouse/ + repo across air gap, then on target: +pip install --no-index --find-links wheelhouse/ -r requirements-lock.txt +``` +- [ ] **Step 4:** Verify the manifest installs offline (build a wheelhouse, `pip install --no-index` into a fresh venv). Record result. +- [ ] **Step 5:** Commit `chore: pin runtime dependencies for offline air-gapped install`. + +--- + +## Phase 4: HIGH — SSRF guard on remote fetches + +**Files:** Modify `src/rights_filter/server/sqlite_store.py`; Test `test_sqlite_store.py`. + +### Task 8: Block private/loopback/link-local hosts + disable redirects + +- [ ] **Step 1: Failing tests**: +```python +@pytest.mark.parametrize("url", [ + "http://127.0.0.1/x", "http://169.254.169.254/latest/meta-data", + "http://10.0.0.1/a", "http://192.168.1.1/a", "http://[::1]/a", "http://localhost/a", +]) +def test_fetch_rejects_internal_hosts(url): + from rights_filter.server.sqlite_store import _fetch_url_bytes + with pytest.raises(ValueError): + _fetch_url_bytes(url) +``` +- [ ] **Step 2:** Run → FAIL (attempts real fetch / no guard). +- [ ] **Step 3:** Add a guard used by `_fetch_url_bytes_with_headers` before `urlopen`: +```python +import ipaddress, socket +from urllib.parse import urlparse + +def _assert_public_http_url(url: str) -> None: + parsed = urlparse(url) + if parsed.scheme not in {"http", "https"} or not parsed.hostname: + raise ValueError("only public http(s) URLs may be fetched") + host = parsed.hostname + try: + infos = socket.getaddrinfo(host, parsed.port or (443 if parsed.scheme == "https" else 80), proto=socket.IPPROTO_TCP) + except OSError as exc: + raise ValueError("could not resolve fetch host") from exc + for *_, sockaddr in infos: + ip = ipaddress.ip_address(sockaddr[0]) + if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved or ip.is_multicast or ip.is_unspecified: + raise ValueError("refusing to fetch internal address") +``` +Call `_assert_public_http_url(url)` at the top of `_fetch_url_bytes_with_headers` (line 3771). Disable redirects: build an opener with a no-redirect handler, OR re-validate on redirect. Minimal: use a custom `HTTPRedirectHandler` that re-runs `_assert_public_http_url` on `redirect_request`, via `build_opener`. Replace `urlopen(request, ...)` with that opener's `.open(...)`. +- [ ] **Step 4:** Run → PASS. Verify the existing happy-path fetch test (monkeypatched `urlopen`) still passes — it may need to also patch `_assert_public_http_url` or use a public-looking host; adjust the existing test's monkeypatch to bypass DNS (patch `_assert_public_http_url` to no-op in that test). +- [ ] **Step 5:** Commit `fix: block SSRF to internal addresses in remote fetchers`. + +--- + +## Phase 5: HIGH — PII / governance retention enforcement + +**Files:** Modify `src/rights_filter/server/sqlite_store.py`, `src/rights_filter/governance/policies.py`; Create `src/rights_filter/jobs/retention_sweeper.py`; Test `tests/rights_filter/jobs/test_retention_sweeper.py`, `test_sqlite_store.py`. + +### Task 9: Face-crop retention sweep + deletion path + +Add a retention TTL (env `COPYRIGHTER_FACE_CROP_RETENTION_DAYS`, default e.g. 90; 0 = keep) and a sweep that deletes expired face-crop files + clears their DB references, writing an audit event. + +- [ ] **Step 1: Failing test**: create a face-crop file with an old mtime + a submission referencing it; run sweep; assert file deleted, reference cleared, audit event present. +- [ ] **Step 2:** Run → FAIL (no sweeper). +- [ ] **Step 3:** Implement `FaceCropRetentionSweeper(store, retention_days)` in `jobs/retention_sweeper.py`: + - List `store.face_crop_image_dir` files older than cutoff; delete; for each submission whose `faceCrops` referenced a deleted file, update payload to drop it; `add_audit_event(actor="system", event="Face crop purged", ...)`. + - Add `store.purge_expired_face_crops(now_epoch)` method doing the DB+file work in one `_transaction`. +- [ ] **Step 4:** Run → PASS. +- [ ] **Step 5:** Commit `feat: face-crop retention sweep with audit trail`. + +### Task 10: Wire it to run + audit external personal-image transmission + +- [ ] **Step 1:** Call the sweeper at server startup (`__main__.py`) and at the start of `seed_from_image_store` (cheap, idempotent). Add `add_audit_event(actor="system", event="Personal image sent to external provider", object_id=submission_id, change="face crop → Google Vision")` at the face-crop external-transmission site (`sqlite_store.py:~2127`, gated on the crop actually being sent). +- [ ] **Step 2:** Test: assert the audit event is written when a face crop is transmitted (use the existing fake Vision client path). +- [ ] **Step 3:** Run → PASS. +- [ ] **Step 4:** Commit `feat: audit personal-image external transmission; run retention sweep at startup`. + +> NOTE: Full `GovernancePolicyRegistry` enforcement on read/serve routes is large; this phase enforces the highest-impact rule (retention/deletion of biometric data) + makes transmission auditable. Deeper role enforcement is folded into Task 2's auth gate. Document remaining registry wiring as a follow-up in the plan's "Deferred" section. + +--- + +## Phase 6: MEDIUM — correctness / idempotency + +**Files:** Modify `sqlite_store.py`, `analysis/evidence_enrichment.py`; Tests alongside. + +### Task 11: Atomic provider usage increment + +Replace read-modify-write counter updates with one SQL `UPDATE`. + +- [ ] **Step 1: Failing test**: two interleaved increments don't lose an update. Simulate by calling a new `store._increment_provider_usage(provider, n)` twice and asserting cumulative. +- [ ] **Step 2:** Run → FAIL (method missing). +- [ ] **Step 3:** Add: +```python + def _increment_provider_usage(self, provider: str, count: int, *, conn=None) -> None: + owns = conn is None + conn = conn or self._connect() + try: + conn.execute("update providers set usage = usage + ? where id = ?", (int(count), provider)) + if owns: conn.commit() + finally: + if owns: conn.close() +``` +Replace the 5 sites (`1361`, `1415`, `2076`, `2301`, `2535`) `provider_payload["usage"] += n; ...; self._put("providers", ...)` so the `usage` field is updated via `_increment_provider_usage` and the remaining `lastSuccess/lastFailure` go through `_put` (or fold into the UPDATE). Keep behavior identical for single-threaded tests. +- [ ] **Step 4:** Run → PASS + full suite. +- [ ] **Step 5:** Commit `fix: atomic provider usage increment to prevent lost updates`. + +### Task 12: Idempotent candidate promotion + +- [ ] **Step 1: Failing test**: promoting the same candidate twice yields ONE knowledge entry; second call returns existing id. +- [ ] **Step 2:** Run → FAIL (two entries). +- [ ] **Step 3:** In `promote_collection_candidate` (1670): if `candidate.get("status") == "promoted"`, return bootstrap (or the existing entry) without creating a new one. Derive `entry_id = _stable_id("kb-candidate", candidate_id)` (deterministic; `_stable_id` exists at ~3697) so `_put`'s upsert collapses retries. Wrap the entry put + candidate flip in `_transaction`. +- [ ] **Step 4:** Run → PASS. +- [ ] **Step 5:** Commit `fix: make collection candidate promotion idempotent`. + +### Task 13: Short-circuit LLM re-invocation + +- [ ] **Step 1: Failing test** (`test_evidence_enrichment.py`): a run that already has an LLM summary does NOT call `llm_assistant.summarize` (spy/mock asserts not called). +- [ ] **Step 2:** Run → FAIL (currently always called at line 93). +- [ ] **Step 3:** Edit `enrich_latest` lines 93-97: +```python + if not _has_existing_llm_summary(run.evidence): + llm_summary = self.llm_assistant.summarize(submission_id, source_evidence) + if llm_summary.source == EvidenceSource.ENRICHMENT_FAILURE: + summary.summary_failures += 1 + new_evidence.append(llm_summary) +``` +- [ ] **Step 4:** Run → PASS + full suite. +- [ ] **Step 5:** Commit `fix: skip LLM summarize when a summary already exists`. + +### Task 14: Migration crash-safety + +- [ ] **Step 1: Failing test**: seed a legacy table with a row violating the new CHECK so `insert...select` raises; assert original data still readable and no orphan `__copyrighter_*_new` table remains. +- [ ] **Step 2:** Run → FAIL (orphan temp table / partial state). +- [ ] **Step 3:** In `_ensure_constrained_schema`/`_rebuild_constrained_table` (308-362): on exception, drop any leftover `__copyrighter_*_new` temp tables in the `finally`/`except`; rebuild all pending tables into temp tables BEFORE any `drop table`, then drop+rename in a tight loop (minimize the irreversible window). Add cleanup of stale temp tables at startup. +- [ ] **Step 4:** Run → PASS. +- [ ] **Step 5:** Commit `fix: crash-safe constraint migration with temp-table cleanup`. + +--- + +## Phase 7: MEDIUM — test coverage for security-sensitive paths + +**Files:** Tests only. + +### Task 15: Path-traversal guard tests + +- [ ] **Step 1:** Add to `test_sqlite_store.py`: for each of `knowledge_media_path`, `collected_media_path`, `face_crop_media_path`, assert `ValueError` on `"../../etc/passwd"` and url-encoded `"..%2f..%2fx"` (decode handled by caller, so test the method with `../`). +- [ ] **Step 2:** Run → PASS immediately (guards exist). These are regression pins. +- [ ] **Step 3:** Add one HTTP-level test: `GET /collected-media/../../secret` → 400/404. +- [ ] **Step 4:** Commit `test: pin media path-traversal guards`. + +### Task 16: Risk-band boundary tests + +- [ ] **Step 1:** Add table tests to `test_risk_scoring.py` asserting bands at scores 29→low, 30→medium, 69→medium, 70→high, and an input whose raw sum >100 → score==100. Craft evidence: FINGERPRINT similarity≥0.9 (+80) + WEB_DETECTION full (+45) = 125 → expect 100/high. +- [ ] **Step 2:** Run → PASS (logic is correct; pins boundaries). +- [ ] **Step 3:** Commit `test: pin risk-band thresholds and 100 cap`. + +### Task 17: Upload/import validation tests + +- [ ] **Step 1:** Add HTTP tests for: unsupported suffix, non-base64 data, empty data → 400; import-folder missing path → 400; body `not json` → 400 (not 500). +- [ ] **Step 2:** Run → mostly PASS; if `not json` yields 500, fix `_body` to raise `ValueError` on `JSONDecodeError`: +```python + try: + return json.loads(...) + except json.JSONDecodeError as exc: + raise ValueError("invalid JSON body") from exc +``` +- [ ] **Step 3:** Commit `test: cover upload/import boundary validation; 400 on malformed JSON`. + +--- + +## Phase 8: LOW — frontend defense-in-depth + misc + +**Files:** Modify `web/operator-gui/app.js`; `integrations/google_custom_search.py`; `analysis/face_person_detection.py`; `integrations/env_clients.py`. Test `tests/operator_gui/test_static_workbench.py`. + +### Task 18: Client-side `safeUrl` allowlist + `apiJson` ok-check + img onerror + +- [ ] **Step 1:** Add `safeUrl(u)` in `app.js` returning `u` only if it starts `http://`/`https://`/`/`, else `"#"`. Apply at href/src construction (`~1722`, `~1159`). In `apiJson` (`~116`), check `response.ok` BEFORE parsing JSON; on non-ok, read text and throw with status. Add `onerror` to evidence/candidate `` (hide or placeholder). +- [ ] **Step 2:** Static test asserts `safeUrl` exists and the JS parses (existing `test_static_workbench.py` pattern — string/structure checks). +- [ ] **Step 3:** Commit `fix: frontend URL scheme allowlist, fetch ok-check, image onerror`. + +### Task 19: Complete ARIA tab semantics + +- [ ] **Step 1:** Add `role="tablist"/"tab"/"tabpanel"`, `aria-selected`, `aria-controls`, keyboard arrow nav to the tab control (`index.html:~187` + handler in `app.js`). +- [ ] **Step 2:** Commit `fix: complete ARIA tab pattern for keyboard nav`. + +### Task 20: Misc hardening + +- [ ] **Step 1:** `google_custom_search.py:65` — move API key from query string to a header if the API supports it; else document why query param is required (Google CSE requires `key=` query param → document, don't log full URL). Ensure the URL is never logged with the key. +- [ ] **Step 2:** `face_person_detection.py:52` — raise/log a clear warning if the OpenCV cascade asset is empty instead of silent no-op. +- [ ] **Step 3:** `env_clients.py:187` — emit a startup provider-readiness summary (which providers enabled/disabled and why). +- [ ] **Step 4:** Commit `fix: provider readiness logging, cascade-load warning, avoid key in logs`. + +--- + +## Phase 9: HIGH — split `sqlite_store.py` god object (the long pole) + +**Approach:** Pure mechanical extraction, behavior-preserving. The public class `CopyrighterStore` stays; internals move to focused modules. After each extraction: full suite green, commit. Target: no file >800 lines. + +**Target module map** (under `src/rights_filter/server/store/`): +- `url_utils.py` — module-level URL helpers (`_is_http_url`, `_decoded_nested_url`, `_url_path_has_image_suffix`, `_url_has_image_format_hint`, `_url_looks_like_image`, normalization helpers ~4660-4772). Also import these in `naver_search.py`/`google_custom_search.py` to kill triplication. +- `remote_fetch.py` — `_fetch_url_bytes*`, headers, `_PageImageParser`, css/page extraction, `_assert_public_http_url` (Task 8). +- `schema.py` — `_ensure_constrained_schema`, `_rebuild_constrained_table`, `_table_*`, `CONSTRAINED_TABLE_SCHEMAS`, `LEGACY_REBUILD_ORDER`, the `executescript` DDL. +- `serialization.py` — `_submission_payload`, `_evidence_payload`, `_evidence_id`, `_stable_id`, `_timestamp_id`, `_now_label`, `_text_list`, `_unique_texts`. +- `media_paths.py` — the three `*_media_path` methods (as functions taking dirs) + path-traversal guard. +- `enrichment_orchestrator.py` — `seed_from_image_store`, `rerun_enrichment`, `run_auto_search`, `_auto_*_search`, `_rerun_*` (the heavy business orchestration). +- `sqlite_store.py` — slim `CopyrighterStore`: `__init__`, `_connect`, `_transaction`, `_put`/`_get`/`_all`/`_count`, and thin methods delegating to the orchestrator. + +### Task 21..27: one module per task + +For EACH module above, repeat this micro-cycle: +- [ ] **Step 1:** Create the new module; cut the target functions verbatim into it (fix imports). +- [ ] **Step 2:** In `sqlite_store.py`, import from the new module; where instance methods moved, keep a thin delegating method OR call the free function. Do NOT change logic. +- [ ] **Step 3:** Run full `pytest -q` → must stay `367 passed`. +- [ ] **Step 4:** `wc -l` the touched files; confirm shrinking. +- [ ] **Step 5:** Commit `refactor: extract from sqlite_store god object`. + +**Order (low-risk first):** `url_utils` → `serialization` → `media_paths` → `remote_fetch` → `schema` → `enrichment_orchestrator`. Each is independently testable via the existing suite; if any extraction reddens the suite, revert that single commit and re-slice smaller. + +### Task 28: De-duplicate URL helpers across integrations + +- [ ] **Step 1:** Point `naver_search.py:629` and `google_custom_search.py` URL helpers at `store/url_utils.py` (or move shared helpers to a neutral `rights_filter/common/url_utils.py` if `server` import direction is wrong from `integrations`). Respect layering: `integrations` must NOT import `server`. → put shared `url_utils` under `rights_filter/common/`. +- [ ] **Step 2:** Run full suite → PASS. +- [ ] **Step 3:** Commit `refactor: share URL helpers via common/url_utils`. + +--- + +## Phase 10: Verification & finish + +### Task 29: Full verification + +- [ ] **Step 1:** `python -m pytest -q` → `367 passed` (+ new tests; count grows). +- [ ] **Step 2:** `python -m pytest --cov=rights_filter --cov-report=term-missing -q` → coverage ≥ prior 89%. +- [ ] **Step 3:** Confirm no source file >800 lines: `find src -name '*.py' | xargs wc -l | sort -rn | head`. +- [ ] **Step 4:** Offline-install dry run (Phase 3 wheelhouse). +- [ ] **Step 5:** Update `docs/operations/` with the auth token + retention env vars. + +### Task 30: PR + +- [ ] **Step 1:** `git diff main...HEAD` review. +- [ ] **Step 2:** Push branch, open PR with the review findings → fixes mapping table. + +--- + +## Deferred (explicitly out of this plan, log as follow-ups) +- Full `GovernancePolicyRegistry` role enforcement on every read/serve path (Task 2 + Task 9 cover the high-impact subset). +- `rerun_enrichment` per-submission in-progress lock (LOW; cosmetic `lastRerunDiff` only). +- `seed_from_image_store` per-row claim lock (MEDIUM; converges on data, dup side-effects only) — mitigated once auth limits operators. +- Optimistic-concurrency `version` column on submissions (LOW). + +## Self-review notes +- Spec coverage: every HIGH (#1 SSRF→T8, #2 CORS/auth→T1/T2, #3 concurrency→T4/T5, #4 governance→T9/T10, #5 deps→T7, #6 god-file→T21-28, #7 atomic→T6), every MEDIUM (T11-T17), every LOW (T18-T20) maps to a task. +- Test-first for all behavioral changes; pure refactor (Phase 9) gated by existing suite. +- No placeholders: each behavioral task has concrete test + exact code/site.