POSA_Copyrighter/docs/superpowers/plans/2026-06-20-project-review-remediation.md

569 lines
32 KiB
Markdown

# Project Review Remediation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans (inline, batched with checkpoints) to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Fix all HIGH/MEDIUM/LOW findings from the 2026-06-20 multi-agent project review, including splitting the 5119-line `sqlite_store.py` god object.
**Architecture:** Mechanical, low-risk fixes first (security, concurrency, deps), each guarded by the existing 367-test suite. The god-file split is last and incremental: move a cohesive method group into a new module, re-export/delegate, run the full suite, commit — repeat. No behavior changes during the split.
**Tech Stack:** Python 3.13, stdlib `http.server` + `sqlite3` + `urllib`, OpenCV/Pillow/NumPy (lazy), vanilla-JS frontend. Tests: pytest 8.4 + pytest-cov.
## Global Constraints
- **Air-gapped (폐쇄망) target:** every artifact installs/runs with NO internet for its OWN deps. External *search* APIs (Google/Naver/Vision) at runtime are an ALLOWED exception.
- **Do not "fix" allowed design:** external search API calls; operator-efficiency tradeoffs; relaxed `google_search` manual-query rule. Leave them.
- **Regression gate:** `python -m pytest -q` (367 tests, ~3.4 min) must stay green after every task. Run targeted tests during a task, full suite before each commit.
- **Immutability / style:** match existing code style; new objects over mutation where reasonable; functions <50 lines, files <800 lines (the split exists to enforce this).
- **No new runtime network deps.** Stdlib only for new code paths.
- **Commit format:** `<type>: <desc>` (feat/fix/refactor/test/chore/docs). Branch `fix/project-review-remediation`.
---
## Phase 0: Safety net
### Task 0: Branch + baseline
- [ ] **Step 1:** Create branch
```bash
git checkout -b fix/project-review-remediation
```
- [ ] **Step 2:** Baseline the suite (record green)
```bash
python -m pytest -q
```
Expected: `367 passed`
- [ ] **Step 3:** No commit (branch only).
---
## Phase 1: HIGH — HTTP server hardening (CORS, auth, body cap)
**Files:** Modify `src/rights_filter/server/http_app.py`; Test `tests/rights_filter/server/test_http_app.py`.
### Task 1: Remove wildcard CORS
The operator GUI is served same-origin from this server, so it needs no CORS header at all. `Access-Control-Allow-Origin: *` lets any site the operator visits read biometric/case data from `localhost`.
- [ ] **Step 1: Failing test** add to `test_http_app.py`:
```python
def test_json_responses_do_not_send_wildcard_cors(http_server):
base = f"http://127.0.0.1:{http_server.server_port}"
response = urlopen(f"{base}/api/bootstrap")
assert response.headers.get("Access-Control-Allow-Origin") is None
```
- [ ] **Step 2:** Run FAIL (`assert '*' is None`).
- [ ] **Step 3:** In `http_app.py:207-214` `_json`, delete the line:
```python
self.send_header("Access-Control-Allow-Origin", "*")
```
- [ ] **Step 4:** Run targeted test PASS. Run full `test_http_app.py`.
- [ ] **Step 5:** Commit `fix: remove wildcard CORS from API responses`.
### Task 2: Optional shared-token auth gate
In 폐쇄망 a single shared bearer token (from env `COPYRIGHTER_AUTH_TOKEN`) is enough. If unset, server stays open (back-compat for dev) but logs a one-time warning; if set, every `/api/*`, `/media/*`, `/knowledge-media/*`, `/collected-media/*`, `/face-crop-media/*` request must carry `Authorization: Bearer <token>` (or `?token=` for `<img>` src which cannot send headers).
**Interfaces produced:** `build_server(..., auth_token: str | None = None)`.
- [ ] **Step 1: Failing test**:
```python
def test_media_requires_token_when_configured(tmp_path):
store, image_store, static_dir = _make_server_deps(tmp_path) # existing helper pattern
server = build_server("127.0.0.1", 0, store, image_store, static_dir, auth_token="secret")
port = server.server_port
thread = threading.Thread(target=server.serve_forever, daemon=True); thread.start()
try:
with pytest.raises(HTTPError) as exc:
urlopen(f"http://127.0.0.1:{port}/api/bootstrap")
assert exc.value.code == 401
req = Request(f"http://127.0.0.1:{port}/api/bootstrap", headers={"Authorization": "Bearer secret"})
assert urlopen(req).status == 200
finally:
server.shutdown()
```
- [ ] **Step 2:** Run FAIL (`build_server() got unexpected kwarg auth_token`).
- [ ] **Step 3:** Implement. In `build_server` signature add `auth_token: str | None = None`. At top of `do_GET`/`do_POST`/`do_PATCH`, before routing, call a guard:
```python
def _authorized(self) -> bool:
if not auth_token:
return True
header = self.headers.get("Authorization", "")
if header == f"Bearer {auth_token}":
return True
query_token = parse_qs(urlparse(self.path).query).get("token", [""])[0]
return query_token == auth_token
def _require_auth(self) -> bool:
if self._authorized():
return True
self._json({"error": "unauthorized"}, HTTPStatus.UNAUTHORIZED)
return False
```
Add at the start of each handler body (after `path = _path(self.path)`):
```python
if not self._require_auth():
return
```
Import `parse_qs` from `urllib.parse`. Thread `auth_token` from `__main__.py` env (`os.environ.get("COPYRIGHTER_AUTH_TOKEN")`).
- [ ] **Step 4:** Run targeted + full `test_http_app.py` PASS (existing tests pass `auth_token=None`).
- [ ] **Step 5:** Add `COPYRIGHTER_AUTH_TOKEN=` to `.env.example`. Commit `feat: optional shared-token auth gate for operator server`.
### Task 3: Cap request body size
`_body()` reads `Content-Length` unbounded.
- [ ] **Step 1: Failing test**: POST with `Content-Length` over cap returns 413.
```python
def test_oversized_body_rejected(http_server):
base = f"http://127.0.0.1:{http_server.server_port}"
big = b"{" + b'"x":"' + b"a" * (11 * 1024 * 1024) + b'"}'
req = Request(f"{base}/api/knowledge/manual", data=big, headers={"Content-Type": "application/json"})
with pytest.raises(HTTPError) as exc:
urlopen(req)
assert exc.value.code == 413
```
- [ ] **Step 2:** Run FAIL.
- [ ] **Step 3:** In `_body()` (line 201-205), add constant `_MAX_BODY_BYTES = 10 * 1024 * 1024` at module level, and:
```python
def _body(self) -> dict[str, object]:
length = int(self.headers.get("Content-Length", "0") or "0")
if length > _MAX_BODY_BYTES:
raise _PayloadTooLarge()
if not length:
return {}
return json.loads(self.rfile.read(length).decode("utf-8"))
```
Add `class _PayloadTooLarge(Exception): pass` and catch it in each handler `self._json({"error": "payload too large"}, HTTPStatus.REQUEST_ENTITY_TOO_LARGE)`.
- [ ] **Step 4:** Run PASS.
- [ ] **Step 5:** Commit `fix: cap HTTP request body size`.
---
## Phase 2: HIGH — SQLite concurrency + atomic writes
**Files:** Modify `src/rights_filter/server/sqlite_store.py`, `src/rights_filter/server/http_app.py`; Test `tests/rights_filter/server/test_sqlite_store.py`, `test_http_app.py`.
### Task 4: WAL + busy_timeout in `_connect`
- [ ] **Step 1: Failing test** (`test_sqlite_store.py`):
```python
def test_connect_uses_wal_and_busy_timeout(tmp_path):
store = CopyrighterStore(tmp_path / "c.sqlite3"); store.initialize()
with store._connect() as conn:
assert conn.execute("pragma journal_mode").fetchone()[0].lower() == "wal"
assert int(conn.execute("pragma busy_timeout").fetchone()[0]) >= 5000
```
- [ ] **Step 2:** Run FAIL (`delete` != `wal`).
- [ ] **Step 3:** Edit `_connect` (3248-3252):
```python
def _connect(self) -> sqlite3.Connection:
conn = sqlite3.connect(self.db_path)
conn.execute("pragma foreign_keys = on")
conn.execute("pragma busy_timeout = 5000")
conn.row_factory = sqlite3.Row
return conn
```
In `initialize()` (after `mkdir`, before/inside the first `_connect`), set WAL once:
```python
with self._connect() as conn:
conn.execute("pragma journal_mode = wal")
conn.executescript(...) # existing
```
- [ ] **Step 4:** Run PASS. Full suite (WAL persists; verify no test asserts `delete` mode).
- [ ] **Step 5:** Commit `fix: enable SQLite WAL and busy_timeout for concurrent operators`.
### Task 5: Map `sqlite3.OperationalError` to 503
- [ ] **Step 1: Failing test** (`test_http_app.py`): monkeypatch a store method to raise `sqlite3.OperationalError("database is locked")`, assert handler returns 503 JSON not a crash.
```python
def test_operational_error_returns_503(http_server, monkeypatch):
monkeypatch.setattr(http_server.store_for_test, "bootstrap",
lambda: (_ for _ in ()).throw(sqlite3.OperationalError("database is locked")))
base = f"http://127.0.0.1:{http_server.server_port}"
with pytest.raises(HTTPError) as exc:
urlopen(f"{base}/api/bootstrap")
assert exc.value.code == 503
```
(If the existing `http_server` fixture doesn't expose the store, add a minimal one matching the existing fixture pattern.)
- [ ] **Step 2:** Run FAIL (500/connection abort).
- [ ] **Step 3:** In all three handlers, add before the generic catches:
```python
except sqlite3.OperationalError:
self._json({"error": "service busy, retry"}, HTTPStatus.SERVICE_UNAVAILABLE)
```
Import `sqlite3` in `http_app.py`.
- [ ] **Step 4:** Run PASS.
- [ ] **Step 5:** Commit `fix: surface SQLite lock contention as retryable 503`.
### Task 6: Atomic decision + audit (transaction threading)
Make `record_decision`'s state change + audit event commit in ONE transaction so the audit trail can't diverge.
**Interfaces:** `_put(table, id, payload, *, conn=None)` and `add_audit_event(..., *, conn=None)` and `_create_or_update_watchlist_entry(..., *, conn=None)` accept an optional connection; when given, they reuse it (no own commit). Add a private `_transaction()` contextmanager.
- [ ] **Step 1: Failing test**: simulate audit failure mid-decision, assert submission status is NOT committed (rolled back).
```python
def test_record_decision_rolls_back_when_audit_fails(tmp_path, monkeypatch, seeded_store):
store, sid = seeded_store
def boom(*a, **k): raise RuntimeError("audit down")
monkeypatch.setattr(store, "add_audit_event", boom)
with pytest.raises(RuntimeError):
store.record_decision(sid, "held", "m")
assert store._get("submissions", sid)["decisionStatus"] == "unreviewed"
```
- [ ] **Step 2:** Run FAIL (status persisted because `_put` committed independently).
- [ ] **Step 3:** Implement:
- Add helper:
```python
@contextmanager
def _transaction(self):
conn = self._connect()
try:
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
finally:
conn.close()
```
- Add `conn: sqlite3.Connection | None = None` param to `_put`, `add_audit_event`, `_create_or_update_watchlist_entry`, `_get`. When `conn` is provided, use it directly instead of `with self._connect()`; do not commit (the `_transaction` owns commit).
- Rewrite `record_decision` body (1004-1018) to wrap the three writes:
```python
with self._transaction() as conn:
submission = self._get("submissions", submission_id, conn=conn)
submission["decisionStatus"] = decision
submission["applicantStatus"] = (...same...)
self._put("submissions", submission_id, submission, conn=conn)
if decision in {"held", "rejected"}:
self._create_or_update_watchlist_entry(submission_id, decision, memo, image_store, conn=conn)
self.add_audit_event("rights.ops", "Operator decision created", submission_id,
f"{decision} · {memo or 'memo optional'}", conn=conn)
return self.review(submission_id)
```
- `_connect()` in autocommit-less mode: pass `isolation_level` default keeps implicit transactions; ensure the helper functions that do `with self._connect() as conn:` are NOT affected (they keep their own path when `conn is None`).
- [ ] **Step 4:** Run targeted + full suite PASS.
- [ ] **Step 5:** Commit `fix: make record_decision + audit atomic via single transaction`.
> NOTE: apply the same `_transaction` wrapping to `mark_evidence_status` (state + audit) as a follow-up micro-task with the same pattern; keep each as its own commit.
---
## Phase 3: HIGH — Air-gapped dependency manifest
**Files:** Create `requirements.txt`, `requirements-lock.txt`, `docs/operations/offline-install.md`.
### Task 7: Pin runtime deps
- [ ] **Step 1:** Capture the known-good versions (from current working env):
```
fastapi==0.115.0
flask==3.1.2
httpx==0.27.0
numpy==2.3.5
opencv-python-headless==4.13.0.92
pillow==12.2.0
requests==2.32.5
```
Write `requirements.txt` with these (the project's actual imports verify each is imported in `src/`; drop any not imported, e.g. confirm `fastapi`/`flask` usage with `grep -rn "import fastapi\|import flask" src`).
- [ ] **Step 2:** Generate a full transitive lock on the build machine:
```bash
pip freeze > requirements-lock.txt
```
- [ ] **Step 3:** Document offline install in `docs/operations/offline-install.md`:
```
# Build machine (has internet):
pip download -r requirements-lock.txt -d wheelhouse/
# Transfer wheelhouse/ + repo across air gap, then on target:
pip install --no-index --find-links wheelhouse/ -r requirements-lock.txt
```
- [ ] **Step 4:** Verify the manifest installs offline (build a wheelhouse, `pip install --no-index` into a fresh venv). Record result.
- [ ] **Step 5:** Commit `chore: pin runtime dependencies for offline air-gapped install`.
---
## Phase 4: HIGH — SSRF guard on remote fetches
**Files:** Modify `src/rights_filter/server/sqlite_store.py`; Test `test_sqlite_store.py`.
### Task 8: Block private/loopback/link-local hosts + disable redirects
- [ ] **Step 1: Failing tests**:
```python
@pytest.mark.parametrize("url", [
"http://127.0.0.1/x", "http://169.254.169.254/latest/meta-data",
"http://10.0.0.1/a", "http://192.168.1.1/a", "http://[::1]/a", "http://localhost/a",
])
def test_fetch_rejects_internal_hosts(url):
from rights_filter.server.sqlite_store import _fetch_url_bytes
with pytest.raises(ValueError):
_fetch_url_bytes(url)
```
- [ ] **Step 2:** Run FAIL (attempts real fetch / no guard).
- [ ] **Step 3:** Add a guard used by `_fetch_url_bytes_with_headers` before `urlopen`:
```python
import ipaddress, socket
from urllib.parse import urlparse
def _assert_public_http_url(url: str) -> None:
parsed = urlparse(url)
if parsed.scheme not in {"http", "https"} or not parsed.hostname:
raise ValueError("only public http(s) URLs may be fetched")
host = parsed.hostname
try:
infos = socket.getaddrinfo(host, parsed.port or (443 if parsed.scheme == "https" else 80), proto=socket.IPPROTO_TCP)
except OSError as exc:
raise ValueError("could not resolve fetch host") from exc
for *_, sockaddr in infos:
ip = ipaddress.ip_address(sockaddr[0])
if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved or ip.is_multicast or ip.is_unspecified:
raise ValueError("refusing to fetch internal address")
```
Call `_assert_public_http_url(url)` at the top of `_fetch_url_bytes_with_headers` (line 3771). Disable redirects: build an opener with a no-redirect handler, OR re-validate on redirect. Minimal: use a custom `HTTPRedirectHandler` that re-runs `_assert_public_http_url` on `redirect_request`, via `build_opener`. Replace `urlopen(request, ...)` with that opener's `.open(...)`.
- [ ] **Step 4:** Run PASS. Verify the existing happy-path fetch test (monkeypatched `urlopen`) still passes it may need to also patch `_assert_public_http_url` or use a public-looking host; adjust the existing test's monkeypatch to bypass DNS (patch `_assert_public_http_url` to no-op in that test).
- [ ] **Step 5:** Commit `fix: block SSRF to internal addresses in remote fetchers`.
---
## Phase 5: HIGH — PII / governance retention enforcement
**Files:** Modify `src/rights_filter/server/sqlite_store.py`, `src/rights_filter/governance/policies.py`; Create `src/rights_filter/jobs/retention_sweeper.py`; Test `tests/rights_filter/jobs/test_retention_sweeper.py`, `test_sqlite_store.py`.
### Task 9: Face-crop retention sweep + deletion path
Add a retention TTL (env `COPYRIGHTER_FACE_CROP_RETENTION_DAYS`, default e.g. 90; 0 = keep) and a sweep that deletes expired face-crop files + clears their DB references, writing an audit event.
- [ ] **Step 1: Failing test**: create a face-crop file with an old mtime + a submission referencing it; run sweep; assert file deleted, reference cleared, audit event present.
- [ ] **Step 2:** Run FAIL (no sweeper).
- [ ] **Step 3:** Implement `FaceCropRetentionSweeper(store, retention_days)` in `jobs/retention_sweeper.py`:
- List `store.face_crop_image_dir` files older than cutoff; delete; for each submission whose `faceCrops` referenced a deleted file, update payload to drop it; `add_audit_event(actor="system", event="Face crop purged", ...)`.
- Add `store.purge_expired_face_crops(now_epoch)` method doing the DB+file work in one `_transaction`.
- [ ] **Step 4:** Run PASS.
- [ ] **Step 5:** Commit `feat: face-crop retention sweep with audit trail`.
### Task 10: Wire it to run + audit external personal-image transmission
- [ ] **Step 1:** Call the sweeper at server startup (`__main__.py`) and at the start of `seed_from_image_store` (cheap, idempotent). Add `add_audit_event(actor="system", event="Personal image sent to external provider", object_id=submission_id, change="face crop → Google Vision")` at the face-crop external-transmission site (`sqlite_store.py:~2127`, gated on the crop actually being sent).
- [ ] **Step 2:** Test: assert the audit event is written when a face crop is transmitted (use the existing fake Vision client path).
- [ ] **Step 3:** Run PASS.
- [ ] **Step 4:** Commit `feat: audit personal-image external transmission; run retention sweep at startup`.
> NOTE: Full `GovernancePolicyRegistry` enforcement on read/serve routes is large; this phase enforces the highest-impact rule (retention/deletion of biometric data) + makes transmission auditable. Deeper role enforcement is folded into Task 2's auth gate. Document remaining registry wiring as a follow-up in the plan's "Deferred" section.
---
## Phase 6: MEDIUM — correctness / idempotency
**Files:** Modify `sqlite_store.py`, `analysis/evidence_enrichment.py`; Tests alongside.
### Task 11: Atomic provider usage increment
Replace read-modify-write counter updates with one SQL `UPDATE`.
- [ ] **Step 1: Failing test**: two interleaved increments don't lose an update. Simulate by calling a new `store._increment_provider_usage(provider, n)` twice and asserting cumulative.
- [ ] **Step 2:** Run FAIL (method missing).
- [ ] **Step 3:** Add:
```python
def _increment_provider_usage(self, provider: str, count: int, *, conn=None) -> None:
owns = conn is None
conn = conn or self._connect()
try:
conn.execute("update providers set usage = usage + ? where id = ?", (int(count), provider))
if owns: conn.commit()
finally:
if owns: conn.close()
```
Replace the 5 sites (`1361`, `1415`, `2076`, `2301`, `2535`) `provider_payload["usage"] += n; ...; self._put("providers", ...)` so the `usage` field is updated via `_increment_provider_usage` and the remaining `lastSuccess/lastFailure` go through `_put` (or fold into the UPDATE). Keep behavior identical for single-threaded tests.
- [ ] **Step 4:** Run PASS + full suite.
- [ ] **Step 5:** Commit `fix: atomic provider usage increment to prevent lost updates`.
### Task 12: Idempotent candidate promotion
- [ ] **Step 1: Failing test**: promoting the same candidate twice yields ONE knowledge entry; second call returns existing id.
- [ ] **Step 2:** Run FAIL (two entries).
- [ ] **Step 3:** In `promote_collection_candidate` (1670): if `candidate.get("status") == "promoted"`, return bootstrap (or the existing entry) without creating a new one. Derive `entry_id = _stable_id("kb-candidate", candidate_id)` (deterministic; `_stable_id` exists at ~3697) so `_put`'s upsert collapses retries. Wrap the entry put + candidate flip in `_transaction`.
- [ ] **Step 4:** Run PASS.
- [ ] **Step 5:** Commit `fix: make collection candidate promotion idempotent`.
### Task 13: Short-circuit LLM re-invocation
- [ ] **Step 1: Failing test** (`test_evidence_enrichment.py`): a run that already has an LLM summary does NOT call `llm_assistant.summarize` (spy/mock asserts not called).
- [ ] **Step 2:** Run FAIL (currently always called at line 93).
- [ ] **Step 3:** Edit `enrich_latest` lines 93-97:
```python
if not _has_existing_llm_summary(run.evidence):
llm_summary = self.llm_assistant.summarize(submission_id, source_evidence)
if llm_summary.source == EvidenceSource.ENRICHMENT_FAILURE:
summary.summary_failures += 1
new_evidence.append(llm_summary)
```
- [ ] **Step 4:** Run PASS + full suite.
- [ ] **Step 5:** Commit `fix: skip LLM summarize when a summary already exists`.
### Task 14: Migration crash-safety
- [ ] **Step 1: Failing test**: seed a legacy table with a row violating the new CHECK so `insert...select` raises; assert original data still readable and no orphan `__copyrighter_*_new` table remains.
- [ ] **Step 2:** Run FAIL (orphan temp table / partial state).
- [ ] **Step 3:** In `_ensure_constrained_schema`/`_rebuild_constrained_table` (308-362): on exception, drop any leftover `__copyrighter_*_new` temp tables in the `finally`/`except`; rebuild all pending tables into temp tables BEFORE any `drop table`, then drop+rename in a tight loop (minimize the irreversible window). Add cleanup of stale temp tables at startup.
- [ ] **Step 4:** Run PASS.
- [ ] **Step 5:** Commit `fix: crash-safe constraint migration with temp-table cleanup`.
---
## Phase 7: MEDIUM — test coverage for security-sensitive paths
**Files:** Tests only.
### Task 15: Path-traversal guard tests
- [ ] **Step 1:** Add to `test_sqlite_store.py`: for each of `knowledge_media_path`, `collected_media_path`, `face_crop_media_path`, assert `ValueError` on `"../../etc/passwd"` and url-encoded `"..%2f..%2fx"` (decode handled by caller, so test the method with `../`).
- [ ] **Step 2:** Run PASS immediately (guards exist). These are regression pins.
- [ ] **Step 3:** Add one HTTP-level test: `GET /collected-media/../../secret` 400/404.
- [ ] **Step 4:** Commit `test: pin media path-traversal guards`.
### Task 16: Risk-band boundary tests
- [ ] **Step 1:** Add table tests to `test_risk_scoring.py` asserting bands at scores 29low, 30medium, 69medium, 70high, and an input whose raw sum >100 → score==100. Craft evidence: FINGERPRINT similarity≥0.9 (+80) + WEB_DETECTION full (+45) = 125 → expect 100/high.
- [ ] **Step 2:** Run → PASS (logic is correct; pins boundaries).
- [ ] **Step 3:** Commit `test: pin risk-band thresholds and 100 cap`.
### Task 17: Upload/import validation tests
- [ ] **Step 1:** Add HTTP tests for: unsupported suffix, non-base64 data, empty data → 400; import-folder missing path → 400; body `not json` → 400 (not 500).
- [ ] **Step 2:** Run → mostly PASS; if `not json` yields 500, fix `_body` to raise `ValueError` on `JSONDecodeError`:
```python
try:
return json.loads(...)
except json.JSONDecodeError as exc:
raise ValueError("invalid JSON body") from exc
```
- [ ] **Step 3:** Commit `test: cover upload/import boundary validation; 400 on malformed JSON`.
---
## Phase 8: LOW — frontend defense-in-depth + misc
**Files:** Modify `web/operator-gui/app.js`; `integrations/google_custom_search.py`; `analysis/face_person_detection.py`; `integrations/env_clients.py`. Test `tests/operator_gui/test_static_workbench.py`.
### Task 18: Client-side `safeUrl` allowlist + `apiJson` ok-check + img onerror
- [ ] **Step 1:** Add `safeUrl(u)` in `app.js` returning `u` only if it starts `http://`/`https://`/`/`, else `"#"`. Apply at href/src construction (`~1722`, `~1159`). In `apiJson` (`~116`), check `response.ok` BEFORE parsing JSON; on non-ok, read text and throw with status. Add `onerror` to evidence/candidate `<img>` (hide or placeholder).
- [ ] **Step 2:** Static test asserts `safeUrl` exists and the JS parses (existing `test_static_workbench.py` pattern — string/structure checks).
- [ ] **Step 3:** Commit `fix: frontend URL scheme allowlist, fetch ok-check, image onerror`.
### Task 19: Complete ARIA tab semantics
- [ ] **Step 1:** Add `role="tablist"/"tab"/"tabpanel"`, `aria-selected`, `aria-controls`, keyboard arrow nav to the tab control (`index.html:~187` + handler in `app.js`).
- [ ] **Step 2:** Commit `fix: complete ARIA tab pattern for keyboard nav`.
### Task 20: Misc hardening
- [ ] **Step 1:** `google_custom_search.py:65` — move API key from query string to a header if the API supports it; else document why query param is required (Google CSE requires `key=` query param → document, don't log full URL). Ensure the URL is never logged with the key.
- [ ] **Step 2:** `face_person_detection.py:52` — raise/log a clear warning if the OpenCV cascade asset is empty instead of silent no-op.
- [ ] **Step 3:** `env_clients.py:187` — emit a startup provider-readiness summary (which providers enabled/disabled and why).
- [ ] **Step 4:** Commit `fix: provider readiness logging, cascade-load warning, avoid key in logs`.
---
## Phase 9: HIGH — split `sqlite_store.py` god object (the long pole)
**Approach:** Pure mechanical extraction, behavior-preserving. The public class `CopyrighterStore` stays; internals move to focused modules. After each extraction: full suite green, commit. Target: no file >800 lines.
**Target module map** (under `src/rights_filter/server/store/`):
- `url_utils.py` — module-level URL helpers (`_is_http_url`, `_decoded_nested_url`, `_url_path_has_image_suffix`, `_url_has_image_format_hint`, `_url_looks_like_image`, normalization helpers ~4660-4772). Also import these in `naver_search.py`/`google_custom_search.py` to kill triplication.
- `remote_fetch.py``_fetch_url_bytes*`, headers, `_PageImageParser`, css/page extraction, `_assert_public_http_url` (Task 8).
- `schema.py``_ensure_constrained_schema`, `_rebuild_constrained_table`, `_table_*`, `CONSTRAINED_TABLE_SCHEMAS`, `LEGACY_REBUILD_ORDER`, the `executescript` DDL.
- `serialization.py``_submission_payload`, `_evidence_payload`, `_evidence_id`, `_stable_id`, `_timestamp_id`, `_now_label`, `_text_list`, `_unique_texts`.
- `media_paths.py` — the three `*_media_path` methods (as functions taking dirs) + path-traversal guard.
- `enrichment_orchestrator.py``seed_from_image_store`, `rerun_enrichment`, `run_auto_search`, `_auto_*_search`, `_rerun_*` (the heavy business orchestration).
- `sqlite_store.py` — slim `CopyrighterStore`: `__init__`, `_connect`, `_transaction`, `_put`/`_get`/`_all`/`_count`, and thin methods delegating to the orchestrator.
### Task 21..27: one module per task
For EACH module above, repeat this micro-cycle:
- [ ] **Step 1:** Create the new module; cut the target functions verbatim into it (fix imports).
- [ ] **Step 2:** In `sqlite_store.py`, import from the new module; where instance methods moved, keep a thin delegating method OR call the free function. Do NOT change logic.
- [ ] **Step 3:** Run full `pytest -q` → must stay `367 passed`.
- [ ] **Step 4:** `wc -l` the touched files; confirm shrinking.
- [ ] **Step 5:** Commit `refactor: extract <module> from sqlite_store god object`.
**Order (low-risk first):** `url_utils``serialization``media_paths``remote_fetch``schema``enrichment_orchestrator`. Each is independently testable via the existing suite; if any extraction reddens the suite, revert that single commit and re-slice smaller.
### Task 28: De-duplicate URL helpers across integrations
- [ ] **Step 1:** Point `naver_search.py:629` and `google_custom_search.py` URL helpers at `store/url_utils.py` (or move shared helpers to a neutral `rights_filter/common/url_utils.py` if `server` import direction is wrong from `integrations`). Respect layering: `integrations` must NOT import `server`. → put shared `url_utils` under `rights_filter/common/`.
- [ ] **Step 2:** Run full suite → PASS.
- [ ] **Step 3:** Commit `refactor: share URL helpers via common/url_utils`.
---
## Phase 10: Verification & finish
### Task 29: Full verification
- [ ] **Step 1:** `python -m pytest -q``367 passed` (+ new tests; count grows).
- [ ] **Step 2:** `python -m pytest --cov=rights_filter --cov-report=term-missing -q` → coverage ≥ prior 89%.
- [ ] **Step 3:** Confirm no source file >800 lines: `find src -name '*.py' | xargs wc -l | sort -rn | head`.
- [ ] **Step 4:** Offline-install dry run (Phase 3 wheelhouse).
- [ ] **Step 5:** Update `docs/operations/` with the auth token + retention env vars.
### Task 30: PR
- [ ] **Step 1:** `git diff main...HEAD` review.
- [ ] **Step 2:** Push branch, open PR with the review findings → fixes mapping table.
---
## Implementation status (2026-06-20)
All HIGH/MEDIUM/LOW findings implemented, tested, and committed on branch
`fix/project-review-remediation`; full suite **400 passed**. Commits:
- Phase 1 (HTTP hardening) → `e9a15e8`; auth reworked to HttpOnly cookie after
security review → `1abb110`.
- Phase 2 (WAL/busy_timeout/atomic decision) → `20a6f55`.
- Phase 3 (deps manifest + offline runbook) → `8958dd1`.
- Phase 4 (SSRF guard) → `62e2d18`.
- Phase 5 + 6 (PII retention purge, transmission audit, provider-counter
atomicity, promote idempotency, seed serialization, LLM short-circuit,
migration crash-safety) + Phase 7 tests + Task 20 readiness → `7f5799e`.
- Phase 8 (frontend safeUrl/apiJson/onerror) → `f8aa10f`; protocol-relative
URL fix after security review → `7317bfb`.
### Phase 9 (god-file split) — complete
Behavior-preserving, each step gated on the full suite (400 passed). The
5119-line `sqlite_store.py` god file is now **724 lines** (the `CopyrighterStore`
facade: `__init__`, `initialize`, seed, bootstrap, `_search_coverage`, review,
decision, providers, media, audit) composed from focused modules + mixins:
Helper modules (stateless): `store_url_utils` (52), `store_text` (27),
`store_constants` (113), `store_remote_fetch` (125, fetch + SSRF opener),
`store_schema` (257, DDL/migration), `store_serialization` (597, payload/domain
helpers), `store_page_scrape` (976, HTML/CSS/JSON image-URL extraction).
Class mixins (composed into `CopyrighterStore` via inheritance):
`StorePersistenceMixin` (313, connection/transaction/_put/_get),
`StoreQueueMixin` (170), `StoreSearchCandidatesMixin` (743, similarity +
candidate storage + rescore), `StoreEnrichmentMixin` (789, provider sync + face
+ rerun + auto-search), `StoreOperationsMixin` (761, knowledge lifecycle +
manual/rerun search + collection).
All modules are <800 lines except `store_page_scrape` (976) a single cohesive
HTML/CSS/JSON/srcset image-URL parser whose helpers are tightly coupled
(parser css json srcset predicates); splitting it further would fragment
one responsibility, so it is intentionally left whole. Tests that monkeypatch
`HeuristicFacePersonDetector` were updated to also patch `store_enrichment`
(where face detection now runs). Cross-file URL-helper dedup (Task 28) is
intentionally NOT done the integration adapters' suffix policy diverges from
the store's (`.svg`).
## Deferred (explicitly out of this plan, log as follow-ups)
- Full `GovernancePolicyRegistry` role enforcement on every read/serve path (Task 2 + Task 9 cover the high-impact subset).
- `mark_evidence_status` atomic state+audit transaction (same pattern as `record_decision`; deferred per Phase 2 note).
- `rerun_enrichment` per-submission in-progress lock (LOW; cosmetic `lastRerunDiff` only).
- Optimistic-concurrency `version` column on submissions (LOW).
## Self-review notes
- Spec coverage: every HIGH (#1 SSRFT8, #2 CORS/authT1/T2, #3 concurrencyT4/T5, #4 governanceT9/T10, #5 depsT7, #6 god-fileT21-28, #7 atomicT6), every MEDIUM (T11-T17), every LOW (T18-T20) maps to a task.
- Test-first for all behavioral changes; pure refactor (Phase 9) gated by existing suite.
- No placeholders: each behavioral task has concrete test + exact code/site.