# Project Review Remediation Implementation Plan > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans (inline, batched with checkpoints) to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. **Goal:** Fix all HIGH/MEDIUM/LOW findings from the 2026-06-20 multi-agent project review, including splitting the 5119-line `sqlite_store.py` god object. **Architecture:** Mechanical, low-risk fixes first (security, concurrency, deps), each guarded by the existing 367-test suite. The god-file split is last and incremental: move a cohesive method group into a new module, re-export/delegate, run the full suite, commit — repeat. No behavior changes during the split. **Tech Stack:** Python 3.13, stdlib `http.server` + `sqlite3` + `urllib`, OpenCV/Pillow/NumPy (lazy), vanilla-JS frontend. Tests: pytest 8.4 + pytest-cov. ## Global Constraints - **Air-gapped (폐쇄망) target:** every artifact installs/runs with NO internet for its OWN deps. External *search* APIs (Google/Naver/Vision) at runtime are an ALLOWED exception. - **Do not "fix" allowed design:** external search API calls; operator-efficiency tradeoffs; relaxed `google_search` manual-query rule. Leave them. - **Regression gate:** `python -m pytest -q` (367 tests, ~3.4 min) must stay green after every task. Run targeted tests during a task, full suite before each commit. - **Immutability / style:** match existing code style; new objects over mutation where reasonable; functions <50 lines, files <800 lines (the split exists to enforce this). - **No new runtime network deps.** Stdlib only for new code paths. - **Commit format:** `: ` (feat/fix/refactor/test/chore/docs). Branch `fix/project-review-remediation`. --- ## Phase 0: Safety net ### Task 0: Branch + baseline - [ ] **Step 1:** Create branch ```bash git checkout -b fix/project-review-remediation ``` - [ ] **Step 2:** Baseline the suite (record green) ```bash python -m pytest -q ``` Expected: `367 passed` - [ ] **Step 3:** No commit (branch only). --- ## Phase 1: HIGH — HTTP server hardening (CORS, auth, body cap) **Files:** Modify `src/rights_filter/server/http_app.py`; Test `tests/rights_filter/server/test_http_app.py`. ### Task 1: Remove wildcard CORS The operator GUI is served same-origin from this server, so it needs no CORS header at all. `Access-Control-Allow-Origin: *` lets any site the operator visits read biometric/case data from `localhost`. - [ ] **Step 1: Failing test** — add to `test_http_app.py`: ```python def test_json_responses_do_not_send_wildcard_cors(http_server): base = f"http://127.0.0.1:{http_server.server_port}" response = urlopen(f"{base}/api/bootstrap") assert response.headers.get("Access-Control-Allow-Origin") is None ``` - [ ] **Step 2:** Run → FAIL (`assert '*' is None`). - [ ] **Step 3:** In `http_app.py:207-214` `_json`, delete the line: ```python self.send_header("Access-Control-Allow-Origin", "*") ``` - [ ] **Step 4:** Run targeted test → PASS. Run full `test_http_app.py`. - [ ] **Step 5:** Commit `fix: remove wildcard CORS from API responses`. ### Task 2: Optional shared-token auth gate In 폐쇄망 a single shared bearer token (from env `COPYRIGHTER_AUTH_TOKEN`) is enough. If unset, server stays open (back-compat for dev) but logs a one-time warning; if set, every `/api/*`, `/media/*`, `/knowledge-media/*`, `/collected-media/*`, `/face-crop-media/*` request must carry `Authorization: Bearer ` (or `?token=` for `` src which cannot send headers). **Interfaces produced:** `build_server(..., auth_token: str | None = None)`. - [ ] **Step 1: Failing test**: ```python def test_media_requires_token_when_configured(tmp_path): store, image_store, static_dir = _make_server_deps(tmp_path) # existing helper pattern server = build_server("127.0.0.1", 0, store, image_store, static_dir, auth_token="secret") port = server.server_port thread = threading.Thread(target=server.serve_forever, daemon=True); thread.start() try: with pytest.raises(HTTPError) as exc: urlopen(f"http://127.0.0.1:{port}/api/bootstrap") assert exc.value.code == 401 req = Request(f"http://127.0.0.1:{port}/api/bootstrap", headers={"Authorization": "Bearer secret"}) assert urlopen(req).status == 200 finally: server.shutdown() ``` - [ ] **Step 2:** Run → FAIL (`build_server() got unexpected kwarg auth_token`). - [ ] **Step 3:** Implement. In `build_server` signature add `auth_token: str | None = None`. At top of `do_GET`/`do_POST`/`do_PATCH`, before routing, call a guard: ```python def _authorized(self) -> bool: if not auth_token: return True header = self.headers.get("Authorization", "") if header == f"Bearer {auth_token}": return True query_token = parse_qs(urlparse(self.path).query).get("token", [""])[0] return query_token == auth_token def _require_auth(self) -> bool: if self._authorized(): return True self._json({"error": "unauthorized"}, HTTPStatus.UNAUTHORIZED) return False ``` Add at the start of each handler body (after `path = _path(self.path)`): ```python if not self._require_auth(): return ``` Import `parse_qs` from `urllib.parse`. Thread `auth_token` from `__main__.py` env (`os.environ.get("COPYRIGHTER_AUTH_TOKEN")`). - [ ] **Step 4:** Run targeted + full `test_http_app.py` → PASS (existing tests pass `auth_token=None`). - [ ] **Step 5:** Add `COPYRIGHTER_AUTH_TOKEN=` to `.env.example`. Commit `feat: optional shared-token auth gate for operator server`. ### Task 3: Cap request body size `_body()` reads `Content-Length` unbounded. - [ ] **Step 1: Failing test**: POST with `Content-Length` over cap returns 413. ```python def test_oversized_body_rejected(http_server): base = f"http://127.0.0.1:{http_server.server_port}" big = b"{" + b'"x":"' + b"a" * (11 * 1024 * 1024) + b'"}' req = Request(f"{base}/api/knowledge/manual", data=big, headers={"Content-Type": "application/json"}) with pytest.raises(HTTPError) as exc: urlopen(req) assert exc.value.code == 413 ``` - [ ] **Step 2:** Run → FAIL. - [ ] **Step 3:** In `_body()` (line 201-205), add constant `_MAX_BODY_BYTES = 10 * 1024 * 1024` at module level, and: ```python def _body(self) -> dict[str, object]: length = int(self.headers.get("Content-Length", "0") or "0") if length > _MAX_BODY_BYTES: raise _PayloadTooLarge() if not length: return {} return json.loads(self.rfile.read(length).decode("utf-8")) ``` Add `class _PayloadTooLarge(Exception): pass` and catch it in each handler → `self._json({"error": "payload too large"}, HTTPStatus.REQUEST_ENTITY_TOO_LARGE)`. - [ ] **Step 4:** Run → PASS. - [ ] **Step 5:** Commit `fix: cap HTTP request body size`. --- ## Phase 2: HIGH — SQLite concurrency + atomic writes **Files:** Modify `src/rights_filter/server/sqlite_store.py`, `src/rights_filter/server/http_app.py`; Test `tests/rights_filter/server/test_sqlite_store.py`, `test_http_app.py`. ### Task 4: WAL + busy_timeout in `_connect` - [ ] **Step 1: Failing test** (`test_sqlite_store.py`): ```python def test_connect_uses_wal_and_busy_timeout(tmp_path): store = CopyrighterStore(tmp_path / "c.sqlite3"); store.initialize() with store._connect() as conn: assert conn.execute("pragma journal_mode").fetchone()[0].lower() == "wal" assert int(conn.execute("pragma busy_timeout").fetchone()[0]) >= 5000 ``` - [ ] **Step 2:** Run → FAIL (`delete` != `wal`). - [ ] **Step 3:** Edit `_connect` (3248-3252): ```python def _connect(self) -> sqlite3.Connection: conn = sqlite3.connect(self.db_path) conn.execute("pragma foreign_keys = on") conn.execute("pragma busy_timeout = 5000") conn.row_factory = sqlite3.Row return conn ``` In `initialize()` (after `mkdir`, before/inside the first `_connect`), set WAL once: ```python with self._connect() as conn: conn.execute("pragma journal_mode = wal") conn.executescript(...) # existing ``` - [ ] **Step 4:** Run → PASS. Full suite (WAL persists; verify no test asserts `delete` mode). - [ ] **Step 5:** Commit `fix: enable SQLite WAL and busy_timeout for concurrent operators`. ### Task 5: Map `sqlite3.OperationalError` to 503 - [ ] **Step 1: Failing test** (`test_http_app.py`): monkeypatch a store method to raise `sqlite3.OperationalError("database is locked")`, assert handler returns 503 JSON not a crash. ```python def test_operational_error_returns_503(http_server, monkeypatch): monkeypatch.setattr(http_server.store_for_test, "bootstrap", lambda: (_ for _ in ()).throw(sqlite3.OperationalError("database is locked"))) base = f"http://127.0.0.1:{http_server.server_port}" with pytest.raises(HTTPError) as exc: urlopen(f"{base}/api/bootstrap") assert exc.value.code == 503 ``` (If the existing `http_server` fixture doesn't expose the store, add a minimal one matching the existing fixture pattern.) - [ ] **Step 2:** Run → FAIL (500/connection abort). - [ ] **Step 3:** In all three handlers, add before the generic catches: ```python except sqlite3.OperationalError: self._json({"error": "service busy, retry"}, HTTPStatus.SERVICE_UNAVAILABLE) ``` Import `sqlite3` in `http_app.py`. - [ ] **Step 4:** Run → PASS. - [ ] **Step 5:** Commit `fix: surface SQLite lock contention as retryable 503`. ### Task 6: Atomic decision + audit (transaction threading) Make `record_decision`'s state change + audit event commit in ONE transaction so the audit trail can't diverge. **Interfaces:** `_put(table, id, payload, *, conn=None)` and `add_audit_event(..., *, conn=None)` and `_create_or_update_watchlist_entry(..., *, conn=None)` accept an optional connection; when given, they reuse it (no own commit). Add a private `_transaction()` contextmanager. - [ ] **Step 1: Failing test**: simulate audit failure mid-decision, assert submission status is NOT committed (rolled back). ```python def test_record_decision_rolls_back_when_audit_fails(tmp_path, monkeypatch, seeded_store): store, sid = seeded_store def boom(*a, **k): raise RuntimeError("audit down") monkeypatch.setattr(store, "add_audit_event", boom) with pytest.raises(RuntimeError): store.record_decision(sid, "held", "m") assert store._get("submissions", sid)["decisionStatus"] == "unreviewed" ``` - [ ] **Step 2:** Run → FAIL (status persisted because `_put` committed independently). - [ ] **Step 3:** Implement: - Add helper: ```python @contextmanager def _transaction(self): conn = self._connect() try: yield conn conn.commit() except Exception: conn.rollback() raise finally: conn.close() ``` - Add `conn: sqlite3.Connection | None = None` param to `_put`, `add_audit_event`, `_create_or_update_watchlist_entry`, `_get`. When `conn` is provided, use it directly instead of `with self._connect()`; do not commit (the `_transaction` owns commit). - Rewrite `record_decision` body (1004-1018) to wrap the three writes: ```python with self._transaction() as conn: submission = self._get("submissions", submission_id, conn=conn) submission["decisionStatus"] = decision submission["applicantStatus"] = (...same...) self._put("submissions", submission_id, submission, conn=conn) if decision in {"held", "rejected"}: self._create_or_update_watchlist_entry(submission_id, decision, memo, image_store, conn=conn) self.add_audit_event("rights.ops", "Operator decision created", submission_id, f"{decision} · {memo or 'memo optional'}", conn=conn) return self.review(submission_id) ``` - `_connect()` in autocommit-less mode: pass `isolation_level` default keeps implicit transactions; ensure the helper functions that do `with self._connect() as conn:` are NOT affected (they keep their own path when `conn is None`). - [ ] **Step 4:** Run targeted + full suite → PASS. - [ ] **Step 5:** Commit `fix: make record_decision + audit atomic via single transaction`. > NOTE: apply the same `_transaction` wrapping to `mark_evidence_status` (state + audit) as a follow-up micro-task with the same pattern; keep each as its own commit. --- ## Phase 3: HIGH — Air-gapped dependency manifest **Files:** Create `requirements.txt`, `requirements-lock.txt`, `docs/operations/offline-install.md`. ### Task 7: Pin runtime deps - [ ] **Step 1:** Capture the known-good versions (from current working env): ``` fastapi==0.115.0 flask==3.1.2 httpx==0.27.0 numpy==2.3.5 opencv-python-headless==4.13.0.92 pillow==12.2.0 requests==2.32.5 ``` Write `requirements.txt` with these (the project's actual imports — verify each is imported in `src/`; drop any not imported, e.g. confirm `fastapi`/`flask` usage with `grep -rn "import fastapi\|import flask" src`). - [ ] **Step 2:** Generate a full transitive lock on the build machine: ```bash pip freeze > requirements-lock.txt ``` - [ ] **Step 3:** Document offline install in `docs/operations/offline-install.md`: ``` # Build machine (has internet): pip download -r requirements-lock.txt -d wheelhouse/ # Transfer wheelhouse/ + repo across air gap, then on target: pip install --no-index --find-links wheelhouse/ -r requirements-lock.txt ``` - [ ] **Step 4:** Verify the manifest installs offline (build a wheelhouse, `pip install --no-index` into a fresh venv). Record result. - [ ] **Step 5:** Commit `chore: pin runtime dependencies for offline air-gapped install`. --- ## Phase 4: HIGH — SSRF guard on remote fetches **Files:** Modify `src/rights_filter/server/sqlite_store.py`; Test `test_sqlite_store.py`. ### Task 8: Block private/loopback/link-local hosts + disable redirects - [ ] **Step 1: Failing tests**: ```python @pytest.mark.parametrize("url", [ "http://127.0.0.1/x", "http://169.254.169.254/latest/meta-data", "http://10.0.0.1/a", "http://192.168.1.1/a", "http://[::1]/a", "http://localhost/a", ]) def test_fetch_rejects_internal_hosts(url): from rights_filter.server.sqlite_store import _fetch_url_bytes with pytest.raises(ValueError): _fetch_url_bytes(url) ``` - [ ] **Step 2:** Run → FAIL (attempts real fetch / no guard). - [ ] **Step 3:** Add a guard used by `_fetch_url_bytes_with_headers` before `urlopen`: ```python import ipaddress, socket from urllib.parse import urlparse def _assert_public_http_url(url: str) -> None: parsed = urlparse(url) if parsed.scheme not in {"http", "https"} or not parsed.hostname: raise ValueError("only public http(s) URLs may be fetched") host = parsed.hostname try: infos = socket.getaddrinfo(host, parsed.port or (443 if parsed.scheme == "https" else 80), proto=socket.IPPROTO_TCP) except OSError as exc: raise ValueError("could not resolve fetch host") from exc for *_, sockaddr in infos: ip = ipaddress.ip_address(sockaddr[0]) if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved or ip.is_multicast or ip.is_unspecified: raise ValueError("refusing to fetch internal address") ``` Call `_assert_public_http_url(url)` at the top of `_fetch_url_bytes_with_headers` (line 3771). Disable redirects: build an opener with a no-redirect handler, OR re-validate on redirect. Minimal: use a custom `HTTPRedirectHandler` that re-runs `_assert_public_http_url` on `redirect_request`, via `build_opener`. Replace `urlopen(request, ...)` with that opener's `.open(...)`. - [ ] **Step 4:** Run → PASS. Verify the existing happy-path fetch test (monkeypatched `urlopen`) still passes — it may need to also patch `_assert_public_http_url` or use a public-looking host; adjust the existing test's monkeypatch to bypass DNS (patch `_assert_public_http_url` to no-op in that test). - [ ] **Step 5:** Commit `fix: block SSRF to internal addresses in remote fetchers`. --- ## Phase 5: HIGH — PII / governance retention enforcement **Files:** Modify `src/rights_filter/server/sqlite_store.py`, `src/rights_filter/governance/policies.py`; Create `src/rights_filter/jobs/retention_sweeper.py`; Test `tests/rights_filter/jobs/test_retention_sweeper.py`, `test_sqlite_store.py`. ### Task 9: Face-crop retention sweep + deletion path Add a retention TTL (env `COPYRIGHTER_FACE_CROP_RETENTION_DAYS`, default e.g. 90; 0 = keep) and a sweep that deletes expired face-crop files + clears their DB references, writing an audit event. - [ ] **Step 1: Failing test**: create a face-crop file with an old mtime + a submission referencing it; run sweep; assert file deleted, reference cleared, audit event present. - [ ] **Step 2:** Run → FAIL (no sweeper). - [ ] **Step 3:** Implement `FaceCropRetentionSweeper(store, retention_days)` in `jobs/retention_sweeper.py`: - List `store.face_crop_image_dir` files older than cutoff; delete; for each submission whose `faceCrops` referenced a deleted file, update payload to drop it; `add_audit_event(actor="system", event="Face crop purged", ...)`. - Add `store.purge_expired_face_crops(now_epoch)` method doing the DB+file work in one `_transaction`. - [ ] **Step 4:** Run → PASS. - [ ] **Step 5:** Commit `feat: face-crop retention sweep with audit trail`. ### Task 10: Wire it to run + audit external personal-image transmission - [ ] **Step 1:** Call the sweeper at server startup (`__main__.py`) and at the start of `seed_from_image_store` (cheap, idempotent). Add `add_audit_event(actor="system", event="Personal image sent to external provider", object_id=submission_id, change="face crop → Google Vision")` at the face-crop external-transmission site (`sqlite_store.py:~2127`, gated on the crop actually being sent). - [ ] **Step 2:** Test: assert the audit event is written when a face crop is transmitted (use the existing fake Vision client path). - [ ] **Step 3:** Run → PASS. - [ ] **Step 4:** Commit `feat: audit personal-image external transmission; run retention sweep at startup`. > NOTE: Full `GovernancePolicyRegistry` enforcement on read/serve routes is large; this phase enforces the highest-impact rule (retention/deletion of biometric data) + makes transmission auditable. Deeper role enforcement is folded into Task 2's auth gate. Document remaining registry wiring as a follow-up in the plan's "Deferred" section. --- ## Phase 6: MEDIUM — correctness / idempotency **Files:** Modify `sqlite_store.py`, `analysis/evidence_enrichment.py`; Tests alongside. ### Task 11: Atomic provider usage increment Replace read-modify-write counter updates with one SQL `UPDATE`. - [ ] **Step 1: Failing test**: two interleaved increments don't lose an update. Simulate by calling a new `store._increment_provider_usage(provider, n)` twice and asserting cumulative. - [ ] **Step 2:** Run → FAIL (method missing). - [ ] **Step 3:** Add: ```python def _increment_provider_usage(self, provider: str, count: int, *, conn=None) -> None: owns = conn is None conn = conn or self._connect() try: conn.execute("update providers set usage = usage + ? where id = ?", (int(count), provider)) if owns: conn.commit() finally: if owns: conn.close() ``` Replace the 5 sites (`1361`, `1415`, `2076`, `2301`, `2535`) `provider_payload["usage"] += n; ...; self._put("providers", ...)` so the `usage` field is updated via `_increment_provider_usage` and the remaining `lastSuccess/lastFailure` go through `_put` (or fold into the UPDATE). Keep behavior identical for single-threaded tests. - [ ] **Step 4:** Run → PASS + full suite. - [ ] **Step 5:** Commit `fix: atomic provider usage increment to prevent lost updates`. ### Task 12: Idempotent candidate promotion - [ ] **Step 1: Failing test**: promoting the same candidate twice yields ONE knowledge entry; second call returns existing id. - [ ] **Step 2:** Run → FAIL (two entries). - [ ] **Step 3:** In `promote_collection_candidate` (1670): if `candidate.get("status") == "promoted"`, return bootstrap (or the existing entry) without creating a new one. Derive `entry_id = _stable_id("kb-candidate", candidate_id)` (deterministic; `_stable_id` exists at ~3697) so `_put`'s upsert collapses retries. Wrap the entry put + candidate flip in `_transaction`. - [ ] **Step 4:** Run → PASS. - [ ] **Step 5:** Commit `fix: make collection candidate promotion idempotent`. ### Task 13: Short-circuit LLM re-invocation - [ ] **Step 1: Failing test** (`test_evidence_enrichment.py`): a run that already has an LLM summary does NOT call `llm_assistant.summarize` (spy/mock asserts not called). - [ ] **Step 2:** Run → FAIL (currently always called at line 93). - [ ] **Step 3:** Edit `enrich_latest` lines 93-97: ```python if not _has_existing_llm_summary(run.evidence): llm_summary = self.llm_assistant.summarize(submission_id, source_evidence) if llm_summary.source == EvidenceSource.ENRICHMENT_FAILURE: summary.summary_failures += 1 new_evidence.append(llm_summary) ``` - [ ] **Step 4:** Run → PASS + full suite. - [ ] **Step 5:** Commit `fix: skip LLM summarize when a summary already exists`. ### Task 14: Migration crash-safety - [ ] **Step 1: Failing test**: seed a legacy table with a row violating the new CHECK so `insert...select` raises; assert original data still readable and no orphan `__copyrighter_*_new` table remains. - [ ] **Step 2:** Run → FAIL (orphan temp table / partial state). - [ ] **Step 3:** In `_ensure_constrained_schema`/`_rebuild_constrained_table` (308-362): on exception, drop any leftover `__copyrighter_*_new` temp tables in the `finally`/`except`; rebuild all pending tables into temp tables BEFORE any `drop table`, then drop+rename in a tight loop (minimize the irreversible window). Add cleanup of stale temp tables at startup. - [ ] **Step 4:** Run → PASS. - [ ] **Step 5:** Commit `fix: crash-safe constraint migration with temp-table cleanup`. --- ## Phase 7: MEDIUM — test coverage for security-sensitive paths **Files:** Tests only. ### Task 15: Path-traversal guard tests - [ ] **Step 1:** Add to `test_sqlite_store.py`: for each of `knowledge_media_path`, `collected_media_path`, `face_crop_media_path`, assert `ValueError` on `"../../etc/passwd"` and url-encoded `"..%2f..%2fx"` (decode handled by caller, so test the method with `../`). - [ ] **Step 2:** Run → PASS immediately (guards exist). These are regression pins. - [ ] **Step 3:** Add one HTTP-level test: `GET /collected-media/../../secret` → 400/404. - [ ] **Step 4:** Commit `test: pin media path-traversal guards`. ### Task 16: Risk-band boundary tests - [ ] **Step 1:** Add table tests to `test_risk_scoring.py` asserting bands at scores 29→low, 30→medium, 69→medium, 70→high, and an input whose raw sum >100 → score==100. Craft evidence: FINGERPRINT similarity≥0.9 (+80) + WEB_DETECTION full (+45) = 125 → expect 100/high. - [ ] **Step 2:** Run → PASS (logic is correct; pins boundaries). - [ ] **Step 3:** Commit `test: pin risk-band thresholds and 100 cap`. ### Task 17: Upload/import validation tests - [ ] **Step 1:** Add HTTP tests for: unsupported suffix, non-base64 data, empty data → 400; import-folder missing path → 400; body `not json` → 400 (not 500). - [ ] **Step 2:** Run → mostly PASS; if `not json` yields 500, fix `_body` to raise `ValueError` on `JSONDecodeError`: ```python try: return json.loads(...) except json.JSONDecodeError as exc: raise ValueError("invalid JSON body") from exc ``` - [ ] **Step 3:** Commit `test: cover upload/import boundary validation; 400 on malformed JSON`. --- ## Phase 8: LOW — frontend defense-in-depth + misc **Files:** Modify `web/operator-gui/app.js`; `integrations/google_custom_search.py`; `analysis/face_person_detection.py`; `integrations/env_clients.py`. Test `tests/operator_gui/test_static_workbench.py`. ### Task 18: Client-side `safeUrl` allowlist + `apiJson` ok-check + img onerror - [ ] **Step 1:** Add `safeUrl(u)` in `app.js` returning `u` only if it starts `http://`/`https://`/`/`, else `"#"`. Apply at href/src construction (`~1722`, `~1159`). In `apiJson` (`~116`), check `response.ok` BEFORE parsing JSON; on non-ok, read text and throw with status. Add `onerror` to evidence/candidate `` (hide or placeholder). - [ ] **Step 2:** Static test asserts `safeUrl` exists and the JS parses (existing `test_static_workbench.py` pattern — string/structure checks). - [ ] **Step 3:** Commit `fix: frontend URL scheme allowlist, fetch ok-check, image onerror`. ### Task 19: Complete ARIA tab semantics - [ ] **Step 1:** Add `role="tablist"/"tab"/"tabpanel"`, `aria-selected`, `aria-controls`, keyboard arrow nav to the tab control (`index.html:~187` + handler in `app.js`). - [ ] **Step 2:** Commit `fix: complete ARIA tab pattern for keyboard nav`. ### Task 20: Misc hardening - [ ] **Step 1:** `google_custom_search.py:65` — move API key from query string to a header if the API supports it; else document why query param is required (Google CSE requires `key=` query param → document, don't log full URL). Ensure the URL is never logged with the key. - [ ] **Step 2:** `face_person_detection.py:52` — raise/log a clear warning if the OpenCV cascade asset is empty instead of silent no-op. - [ ] **Step 3:** `env_clients.py:187` — emit a startup provider-readiness summary (which providers enabled/disabled and why). - [ ] **Step 4:** Commit `fix: provider readiness logging, cascade-load warning, avoid key in logs`. --- ## Phase 9: HIGH — split `sqlite_store.py` god object (the long pole) **Approach:** Pure mechanical extraction, behavior-preserving. The public class `CopyrighterStore` stays; internals move to focused modules. After each extraction: full suite green, commit. Target: no file >800 lines. **Target module map** (under `src/rights_filter/server/store/`): - `url_utils.py` — module-level URL helpers (`_is_http_url`, `_decoded_nested_url`, `_url_path_has_image_suffix`, `_url_has_image_format_hint`, `_url_looks_like_image`, normalization helpers ~4660-4772). Also import these in `naver_search.py`/`google_custom_search.py` to kill triplication. - `remote_fetch.py` — `_fetch_url_bytes*`, headers, `_PageImageParser`, css/page extraction, `_assert_public_http_url` (Task 8). - `schema.py` — `_ensure_constrained_schema`, `_rebuild_constrained_table`, `_table_*`, `CONSTRAINED_TABLE_SCHEMAS`, `LEGACY_REBUILD_ORDER`, the `executescript` DDL. - `serialization.py` — `_submission_payload`, `_evidence_payload`, `_evidence_id`, `_stable_id`, `_timestamp_id`, `_now_label`, `_text_list`, `_unique_texts`. - `media_paths.py` — the three `*_media_path` methods (as functions taking dirs) + path-traversal guard. - `enrichment_orchestrator.py` — `seed_from_image_store`, `rerun_enrichment`, `run_auto_search`, `_auto_*_search`, `_rerun_*` (the heavy business orchestration). - `sqlite_store.py` — slim `CopyrighterStore`: `__init__`, `_connect`, `_transaction`, `_put`/`_get`/`_all`/`_count`, and thin methods delegating to the orchestrator. ### Task 21..27: one module per task For EACH module above, repeat this micro-cycle: - [ ] **Step 1:** Create the new module; cut the target functions verbatim into it (fix imports). - [ ] **Step 2:** In `sqlite_store.py`, import from the new module; where instance methods moved, keep a thin delegating method OR call the free function. Do NOT change logic. - [ ] **Step 3:** Run full `pytest -q` → must stay `367 passed`. - [ ] **Step 4:** `wc -l` the touched files; confirm shrinking. - [ ] **Step 5:** Commit `refactor: extract from sqlite_store god object`. **Order (low-risk first):** `url_utils` → `serialization` → `media_paths` → `remote_fetch` → `schema` → `enrichment_orchestrator`. Each is independently testable via the existing suite; if any extraction reddens the suite, revert that single commit and re-slice smaller. ### Task 28: De-duplicate URL helpers across integrations - [ ] **Step 1:** Point `naver_search.py:629` and `google_custom_search.py` URL helpers at `store/url_utils.py` (or move shared helpers to a neutral `rights_filter/common/url_utils.py` if `server` import direction is wrong from `integrations`). Respect layering: `integrations` must NOT import `server`. → put shared `url_utils` under `rights_filter/common/`. - [ ] **Step 2:** Run full suite → PASS. - [ ] **Step 3:** Commit `refactor: share URL helpers via common/url_utils`. --- ## Phase 10: Verification & finish ### Task 29: Full verification - [ ] **Step 1:** `python -m pytest -q` → `367 passed` (+ new tests; count grows). - [ ] **Step 2:** `python -m pytest --cov=rights_filter --cov-report=term-missing -q` → coverage ≥ prior 89%. - [ ] **Step 3:** Confirm no source file >800 lines: `find src -name '*.py' | xargs wc -l | sort -rn | head`. - [ ] **Step 4:** Offline-install dry run (Phase 3 wheelhouse). - [ ] **Step 5:** Update `docs/operations/` with the auth token + retention env vars. ### Task 30: PR - [ ] **Step 1:** `git diff main...HEAD` review. - [ ] **Step 2:** Push branch, open PR with the review findings → fixes mapping table. --- ## Deferred (explicitly out of this plan, log as follow-ups) - Full `GovernancePolicyRegistry` role enforcement on every read/serve path (Task 2 + Task 9 cover the high-impact subset). - `rerun_enrichment` per-submission in-progress lock (LOW; cosmetic `lastRerunDiff` only). - `seed_from_image_store` per-row claim lock (MEDIUM; converges on data, dup side-effects only) — mitigated once auth limits operators. - Optimistic-concurrency `version` column on submissions (LOW). ## Self-review notes - Spec coverage: every HIGH (#1 SSRF→T8, #2 CORS/auth→T1/T2, #3 concurrency→T4/T5, #4 governance→T9/T10, #5 deps→T7, #6 god-file→T21-28, #7 atomic→T6), every MEDIUM (T11-T17), every LOW (T18-T20) maps to a task. - Test-first for all behavioral changes; pure refactor (Phase 9) gated by existing suite. - No placeholders: each behavioral task has concrete test + exact code/site.