30 KiB
Project Review Remediation Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:executing-plans (inline, batched with checkpoints) to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.
Goal: Fix all HIGH/MEDIUM/LOW findings from the 2026-06-20 multi-agent project review, including splitting the 5119-line sqlite_store.py god object.
Architecture: Mechanical, low-risk fixes first (security, concurrency, deps), each guarded by the existing 367-test suite. The god-file split is last and incremental: move a cohesive method group into a new module, re-export/delegate, run the full suite, commit — repeat. No behavior changes during the split.
Tech Stack: Python 3.13, stdlib http.server + sqlite3 + urllib, OpenCV/Pillow/NumPy (lazy), vanilla-JS frontend. Tests: pytest 8.4 + pytest-cov.
Global Constraints
- Air-gapped (폐쇄망) target: every artifact installs/runs with NO internet for its OWN deps. External search APIs (Google/Naver/Vision) at runtime are an ALLOWED exception.
- Do not "fix" allowed design: external search API calls; operator-efficiency tradeoffs; relaxed
google_searchmanual-query rule. Leave them. - Regression gate:
python -m pytest -q(367 tests, ~3.4 min) must stay green after every task. Run targeted tests during a task, full suite before each commit. - Immutability / style: match existing code style; new objects over mutation where reasonable; functions <50 lines, files <800 lines (the split exists to enforce this).
- No new runtime network deps. Stdlib only for new code paths.
- Commit format:
<type>: <desc>(feat/fix/refactor/test/chore/docs). Branchfix/project-review-remediation.
Phase 0: Safety net
Task 0: Branch + baseline
- Step 1: Create branch
git checkout -b fix/project-review-remediation
- Step 2: Baseline the suite (record green)
python -m pytest -q
Expected: 367 passed
- Step 3: No commit (branch only).
Phase 1: HIGH — HTTP server hardening (CORS, auth, body cap)
Files: Modify src/rights_filter/server/http_app.py; Test tests/rights_filter/server/test_http_app.py.
Task 1: Remove wildcard CORS
The operator GUI is served same-origin from this server, so it needs no CORS header at all. Access-Control-Allow-Origin: * lets any site the operator visits read biometric/case data from localhost.
- Step 1: Failing test — add to
test_http_app.py:
def test_json_responses_do_not_send_wildcard_cors(http_server):
base = f"http://127.0.0.1:{http_server.server_port}"
response = urlopen(f"{base}/api/bootstrap")
assert response.headers.get("Access-Control-Allow-Origin") is None
- Step 2: Run → FAIL (
assert '*' is None). - Step 3: In
http_app.py:207-214_json, delete the line:
self.send_header("Access-Control-Allow-Origin", "*")
- Step 4: Run targeted test → PASS. Run full
test_http_app.py. - Step 5: Commit
fix: remove wildcard CORS from API responses.
Task 2: Optional shared-token auth gate
In 폐쇄망 a single shared bearer token (from env COPYRIGHTER_AUTH_TOKEN) is enough. If unset, server stays open (back-compat for dev) but logs a one-time warning; if set, every /api/*, /media/*, /knowledge-media/*, /collected-media/*, /face-crop-media/* request must carry Authorization: Bearer <token> (or ?token= for <img> src which cannot send headers).
Interfaces produced: build_server(..., auth_token: str | None = None).
- Step 1: Failing test:
def test_media_requires_token_when_configured(tmp_path):
store, image_store, static_dir = _make_server_deps(tmp_path) # existing helper pattern
server = build_server("127.0.0.1", 0, store, image_store, static_dir, auth_token="secret")
port = server.server_port
thread = threading.Thread(target=server.serve_forever, daemon=True); thread.start()
try:
with pytest.raises(HTTPError) as exc:
urlopen(f"http://127.0.0.1:{port}/api/bootstrap")
assert exc.value.code == 401
req = Request(f"http://127.0.0.1:{port}/api/bootstrap", headers={"Authorization": "Bearer secret"})
assert urlopen(req).status == 200
finally:
server.shutdown()
- Step 2: Run → FAIL (
build_server() got unexpected kwarg auth_token). - Step 3: Implement. In
build_serversignature addauth_token: str | None = None. At top ofdo_GET/do_POST/do_PATCH, before routing, call a guard:
def _authorized(self) -> bool:
if not auth_token:
return True
header = self.headers.get("Authorization", "")
if header == f"Bearer {auth_token}":
return True
query_token = parse_qs(urlparse(self.path).query).get("token", [""])[0]
return query_token == auth_token
def _require_auth(self) -> bool:
if self._authorized():
return True
self._json({"error": "unauthorized"}, HTTPStatus.UNAUTHORIZED)
return False
Add at the start of each handler body (after path = _path(self.path)):
if not self._require_auth():
return
Import parse_qs from urllib.parse. Thread auth_token from __main__.py env (os.environ.get("COPYRIGHTER_AUTH_TOKEN")).
- Step 4: Run targeted + full
test_http_app.py→ PASS (existing tests passauth_token=None). - Step 5: Add
COPYRIGHTER_AUTH_TOKEN=to.env.example. Commitfeat: optional shared-token auth gate for operator server.
Task 3: Cap request body size
_body() reads Content-Length unbounded.
- Step 1: Failing test: POST with
Content-Lengthover cap returns 413.
def test_oversized_body_rejected(http_server):
base = f"http://127.0.0.1:{http_server.server_port}"
big = b"{" + b'"x":"' + b"a" * (11 * 1024 * 1024) + b'"}'
req = Request(f"{base}/api/knowledge/manual", data=big, headers={"Content-Type": "application/json"})
with pytest.raises(HTTPError) as exc:
urlopen(req)
assert exc.value.code == 413
- Step 2: Run → FAIL.
- Step 3: In
_body()(line 201-205), add constant_MAX_BODY_BYTES = 10 * 1024 * 1024at module level, and:
def _body(self) -> dict[str, object]:
length = int(self.headers.get("Content-Length", "0") or "0")
if length > _MAX_BODY_BYTES:
raise _PayloadTooLarge()
if not length:
return {}
return json.loads(self.rfile.read(length).decode("utf-8"))
Add class _PayloadTooLarge(Exception): pass and catch it in each handler → self._json({"error": "payload too large"}, HTTPStatus.REQUEST_ENTITY_TOO_LARGE).
- Step 4: Run → PASS.
- Step 5: Commit
fix: cap HTTP request body size.
Phase 2: HIGH — SQLite concurrency + atomic writes
Files: Modify src/rights_filter/server/sqlite_store.py, src/rights_filter/server/http_app.py; Test tests/rights_filter/server/test_sqlite_store.py, test_http_app.py.
Task 4: WAL + busy_timeout in _connect
- Step 1: Failing test (
test_sqlite_store.py):
def test_connect_uses_wal_and_busy_timeout(tmp_path):
store = CopyrighterStore(tmp_path / "c.sqlite3"); store.initialize()
with store._connect() as conn:
assert conn.execute("pragma journal_mode").fetchone()[0].lower() == "wal"
assert int(conn.execute("pragma busy_timeout").fetchone()[0]) >= 5000
- Step 2: Run → FAIL (
delete!=wal). - Step 3: Edit
_connect(3248-3252):
def _connect(self) -> sqlite3.Connection:
conn = sqlite3.connect(self.db_path)
conn.execute("pragma foreign_keys = on")
conn.execute("pragma busy_timeout = 5000")
conn.row_factory = sqlite3.Row
return conn
In initialize() (after mkdir, before/inside the first _connect), set WAL once:
with self._connect() as conn:
conn.execute("pragma journal_mode = wal")
conn.executescript(...) # existing
- Step 4: Run → PASS. Full suite (WAL persists; verify no test asserts
deletemode). - Step 5: Commit
fix: enable SQLite WAL and busy_timeout for concurrent operators.
Task 5: Map sqlite3.OperationalError to 503
- Step 1: Failing test (
test_http_app.py): monkeypatch a store method to raisesqlite3.OperationalError("database is locked"), assert handler returns 503 JSON not a crash.
def test_operational_error_returns_503(http_server, monkeypatch):
monkeypatch.setattr(http_server.store_for_test, "bootstrap",
lambda: (_ for _ in ()).throw(sqlite3.OperationalError("database is locked")))
base = f"http://127.0.0.1:{http_server.server_port}"
with pytest.raises(HTTPError) as exc:
urlopen(f"{base}/api/bootstrap")
assert exc.value.code == 503
(If the existing http_server fixture doesn't expose the store, add a minimal one matching the existing fixture pattern.)
- Step 2: Run → FAIL (500/connection abort).
- Step 3: In all three handlers, add before the generic catches:
except sqlite3.OperationalError:
self._json({"error": "service busy, retry"}, HTTPStatus.SERVICE_UNAVAILABLE)
Import sqlite3 in http_app.py.
- Step 4: Run → PASS.
- Step 5: Commit
fix: surface SQLite lock contention as retryable 503.
Task 6: Atomic decision + audit (transaction threading)
Make record_decision's state change + audit event commit in ONE transaction so the audit trail can't diverge.
Interfaces: _put(table, id, payload, *, conn=None) and add_audit_event(..., *, conn=None) and _create_or_update_watchlist_entry(..., *, conn=None) accept an optional connection; when given, they reuse it (no own commit). Add a private _transaction() contextmanager.
- Step 1: Failing test: simulate audit failure mid-decision, assert submission status is NOT committed (rolled back).
def test_record_decision_rolls_back_when_audit_fails(tmp_path, monkeypatch, seeded_store):
store, sid = seeded_store
def boom(*a, **k): raise RuntimeError("audit down")
monkeypatch.setattr(store, "add_audit_event", boom)
with pytest.raises(RuntimeError):
store.record_decision(sid, "held", "m")
assert store._get("submissions", sid)["decisionStatus"] == "unreviewed"
- Step 2: Run → FAIL (status persisted because
_putcommitted independently). - Step 3: Implement:
- Add helper:
@contextmanager
def _transaction(self):
conn = self._connect()
try:
yield conn
conn.commit()
except Exception:
conn.rollback()
raise
finally:
conn.close()
- Add
conn: sqlite3.Connection | None = Noneparam to_put,add_audit_event,_create_or_update_watchlist_entry,_get. Whenconnis provided, use it directly instead ofwith self._connect(); do not commit (the_transactionowns commit). - Rewrite
record_decisionbody (1004-1018) to wrap the three writes:
with self._transaction() as conn:
submission = self._get("submissions", submission_id, conn=conn)
submission["decisionStatus"] = decision
submission["applicantStatus"] = (...same...)
self._put("submissions", submission_id, submission, conn=conn)
if decision in {"held", "rejected"}:
self._create_or_update_watchlist_entry(submission_id, decision, memo, image_store, conn=conn)
self.add_audit_event("rights.ops", "Operator decision created", submission_id,
f"{decision} · {memo or 'memo optional'}", conn=conn)
return self.review(submission_id)
_connect()in autocommit-less mode: passisolation_leveldefault keeps implicit transactions; ensure the helper functions that dowith self._connect() as conn:are NOT affected (they keep their own path whenconn is None).- Step 4: Run targeted + full suite → PASS.
- Step 5: Commit
fix: make record_decision + audit atomic via single transaction.
NOTE: apply the same
_transactionwrapping tomark_evidence_status(state + audit) as a follow-up micro-task with the same pattern; keep each as its own commit.
Phase 3: HIGH — Air-gapped dependency manifest
Files: Create requirements.txt, requirements-lock.txt, docs/operations/offline-install.md.
Task 7: Pin runtime deps
- Step 1: Capture the known-good versions (from current working env):
fastapi==0.115.0
flask==3.1.2
httpx==0.27.0
numpy==2.3.5
opencv-python-headless==4.13.0.92
pillow==12.2.0
requests==2.32.5
Write requirements.txt with these (the project's actual imports — verify each is imported in src/; drop any not imported, e.g. confirm fastapi/flask usage with grep -rn "import fastapi\|import flask" src).
- Step 2: Generate a full transitive lock on the build machine:
pip freeze > requirements-lock.txt
- Step 3: Document offline install in
docs/operations/offline-install.md:
# Build machine (has internet):
pip download -r requirements-lock.txt -d wheelhouse/
# Transfer wheelhouse/ + repo across air gap, then on target:
pip install --no-index --find-links wheelhouse/ -r requirements-lock.txt
- Step 4: Verify the manifest installs offline (build a wheelhouse,
pip install --no-indexinto a fresh venv). Record result. - Step 5: Commit
chore: pin runtime dependencies for offline air-gapped install.
Phase 4: HIGH — SSRF guard on remote fetches
Files: Modify src/rights_filter/server/sqlite_store.py; Test test_sqlite_store.py.
Task 8: Block private/loopback/link-local hosts + disable redirects
- Step 1: Failing tests:
@pytest.mark.parametrize("url", [
"http://127.0.0.1/x", "http://169.254.169.254/latest/meta-data",
"http://10.0.0.1/a", "http://192.168.1.1/a", "http://[::1]/a", "http://localhost/a",
])
def test_fetch_rejects_internal_hosts(url):
from rights_filter.server.sqlite_store import _fetch_url_bytes
with pytest.raises(ValueError):
_fetch_url_bytes(url)
- Step 2: Run → FAIL (attempts real fetch / no guard).
- Step 3: Add a guard used by
_fetch_url_bytes_with_headersbeforeurlopen:
import ipaddress, socket
from urllib.parse import urlparse
def _assert_public_http_url(url: str) -> None:
parsed = urlparse(url)
if parsed.scheme not in {"http", "https"} or not parsed.hostname:
raise ValueError("only public http(s) URLs may be fetched")
host = parsed.hostname
try:
infos = socket.getaddrinfo(host, parsed.port or (443 if parsed.scheme == "https" else 80), proto=socket.IPPROTO_TCP)
except OSError as exc:
raise ValueError("could not resolve fetch host") from exc
for *_, sockaddr in infos:
ip = ipaddress.ip_address(sockaddr[0])
if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved or ip.is_multicast or ip.is_unspecified:
raise ValueError("refusing to fetch internal address")
Call _assert_public_http_url(url) at the top of _fetch_url_bytes_with_headers (line 3771). Disable redirects: build an opener with a no-redirect handler, OR re-validate on redirect. Minimal: use a custom HTTPRedirectHandler that re-runs _assert_public_http_url on redirect_request, via build_opener. Replace urlopen(request, ...) with that opener's .open(...).
- Step 4: Run → PASS. Verify the existing happy-path fetch test (monkeypatched
urlopen) still passes — it may need to also patch_assert_public_http_urlor use a public-looking host; adjust the existing test's monkeypatch to bypass DNS (patch_assert_public_http_urlto no-op in that test). - Step 5: Commit
fix: block SSRF to internal addresses in remote fetchers.
Phase 5: HIGH — PII / governance retention enforcement
Files: Modify src/rights_filter/server/sqlite_store.py, src/rights_filter/governance/policies.py; Create src/rights_filter/jobs/retention_sweeper.py; Test tests/rights_filter/jobs/test_retention_sweeper.py, test_sqlite_store.py.
Task 9: Face-crop retention sweep + deletion path
Add a retention TTL (env COPYRIGHTER_FACE_CROP_RETENTION_DAYS, default e.g. 90; 0 = keep) and a sweep that deletes expired face-crop files + clears their DB references, writing an audit event.
- Step 1: Failing test: create a face-crop file with an old mtime + a submission referencing it; run sweep; assert file deleted, reference cleared, audit event present.
- Step 2: Run → FAIL (no sweeper).
- Step 3: Implement
FaceCropRetentionSweeper(store, retention_days)injobs/retention_sweeper.py:- List
store.face_crop_image_dirfiles older than cutoff; delete; for each submission whosefaceCropsreferenced a deleted file, update payload to drop it;add_audit_event(actor="system", event="Face crop purged", ...). - Add
store.purge_expired_face_crops(now_epoch)method doing the DB+file work in one_transaction.
- List
- Step 4: Run → PASS.
- Step 5: Commit
feat: face-crop retention sweep with audit trail.
Task 10: Wire it to run + audit external personal-image transmission
- Step 1: Call the sweeper at server startup (
__main__.py) and at the start ofseed_from_image_store(cheap, idempotent). Addadd_audit_event(actor="system", event="Personal image sent to external provider", object_id=submission_id, change="face crop → Google Vision")at the face-crop external-transmission site (sqlite_store.py:~2127, gated on the crop actually being sent). - Step 2: Test: assert the audit event is written when a face crop is transmitted (use the existing fake Vision client path).
- Step 3: Run → PASS.
- Step 4: Commit
feat: audit personal-image external transmission; run retention sweep at startup.
NOTE: Full
GovernancePolicyRegistryenforcement on read/serve routes is large; this phase enforces the highest-impact rule (retention/deletion of biometric data) + makes transmission auditable. Deeper role enforcement is folded into Task 2's auth gate. Document remaining registry wiring as a follow-up in the plan's "Deferred" section.
Phase 6: MEDIUM — correctness / idempotency
Files: Modify sqlite_store.py, analysis/evidence_enrichment.py; Tests alongside.
Task 11: Atomic provider usage increment
Replace read-modify-write counter updates with one SQL UPDATE.
- Step 1: Failing test: two interleaved increments don't lose an update. Simulate by calling a new
store._increment_provider_usage(provider, n)twice and asserting cumulative. - Step 2: Run → FAIL (method missing).
- Step 3: Add:
def _increment_provider_usage(self, provider: str, count: int, *, conn=None) -> None:
owns = conn is None
conn = conn or self._connect()
try:
conn.execute("update providers set usage = usage + ? where id = ?", (int(count), provider))
if owns: conn.commit()
finally:
if owns: conn.close()
Replace the 5 sites (1361, 1415, 2076, 2301, 2535) provider_payload["usage"] += n; ...; self._put("providers", ...) so the usage field is updated via _increment_provider_usage and the remaining lastSuccess/lastFailure go through _put (or fold into the UPDATE). Keep behavior identical for single-threaded tests.
- Step 4: Run → PASS + full suite.
- Step 5: Commit
fix: atomic provider usage increment to prevent lost updates.
Task 12: Idempotent candidate promotion
- Step 1: Failing test: promoting the same candidate twice yields ONE knowledge entry; second call returns existing id.
- Step 2: Run → FAIL (two entries).
- Step 3: In
promote_collection_candidate(1670): ifcandidate.get("status") == "promoted", return bootstrap (or the existing entry) without creating a new one. Deriveentry_id = _stable_id("kb-candidate", candidate_id)(deterministic;_stable_idexists at ~3697) so_put's upsert collapses retries. Wrap the entry put + candidate flip in_transaction. - Step 4: Run → PASS.
- Step 5: Commit
fix: make collection candidate promotion idempotent.
Task 13: Short-circuit LLM re-invocation
- Step 1: Failing test (
test_evidence_enrichment.py): a run that already has an LLM summary does NOT callllm_assistant.summarize(spy/mock asserts not called). - Step 2: Run → FAIL (currently always called at line 93).
- Step 3: Edit
enrich_latestlines 93-97:
if not _has_existing_llm_summary(run.evidence):
llm_summary = self.llm_assistant.summarize(submission_id, source_evidence)
if llm_summary.source == EvidenceSource.ENRICHMENT_FAILURE:
summary.summary_failures += 1
new_evidence.append(llm_summary)
- Step 4: Run → PASS + full suite.
- Step 5: Commit
fix: skip LLM summarize when a summary already exists.
Task 14: Migration crash-safety
- Step 1: Failing test: seed a legacy table with a row violating the new CHECK so
insert...selectraises; assert original data still readable and no orphan__copyrighter_*_newtable remains. - Step 2: Run → FAIL (orphan temp table / partial state).
- Step 3: In
_ensure_constrained_schema/_rebuild_constrained_table(308-362): on exception, drop any leftover__copyrighter_*_newtemp tables in thefinally/except; rebuild all pending tables into temp tables BEFORE anydrop table, then drop+rename in a tight loop (minimize the irreversible window). Add cleanup of stale temp tables at startup. - Step 4: Run → PASS.
- Step 5: Commit
fix: crash-safe constraint migration with temp-table cleanup.
Phase 7: MEDIUM — test coverage for security-sensitive paths
Files: Tests only.
Task 15: Path-traversal guard tests
- Step 1: Add to
test_sqlite_store.py: for each ofknowledge_media_path,collected_media_path,face_crop_media_path, assertValueErroron"../../etc/passwd"and url-encoded"..%2f..%2fx"(decode handled by caller, so test the method with../). - Step 2: Run → PASS immediately (guards exist). These are regression pins.
- Step 3: Add one HTTP-level test:
GET /collected-media/../../secret→ 400/404. - Step 4: Commit
test: pin media path-traversal guards.
Task 16: Risk-band boundary tests
- Step 1: Add table tests to
test_risk_scoring.pyasserting bands at scores 29→low, 30→medium, 69→medium, 70→high, and an input whose raw sum >100 → score==100. Craft evidence: FINGERPRINT similarity≥0.9 (+80) + WEB_DETECTION full (+45) = 125 → expect 100/high. - Step 2: Run → PASS (logic is correct; pins boundaries).
- Step 3: Commit
test: pin risk-band thresholds and 100 cap.
Task 17: Upload/import validation tests
- Step 1: Add HTTP tests for: unsupported suffix, non-base64 data, empty data → 400; import-folder missing path → 400; body
not json→ 400 (not 500). - Step 2: Run → mostly PASS; if
not jsonyields 500, fix_bodyto raiseValueErroronJSONDecodeError:
try:
return json.loads(...)
except json.JSONDecodeError as exc:
raise ValueError("invalid JSON body") from exc
- Step 3: Commit
test: cover upload/import boundary validation; 400 on malformed JSON.
Phase 8: LOW — frontend defense-in-depth + misc
Files: Modify web/operator-gui/app.js; integrations/google_custom_search.py; analysis/face_person_detection.py; integrations/env_clients.py. Test tests/operator_gui/test_static_workbench.py.
Task 18: Client-side safeUrl allowlist + apiJson ok-check + img onerror
- Step 1: Add
safeUrl(u)inapp.jsreturninguonly if it startshttp:///https:////, else"#". Apply at href/src construction (~1722,~1159). InapiJson(~116), checkresponse.okBEFORE parsing JSON; on non-ok, read text and throw with status. Addonerrorto evidence/candidate<img>(hide or placeholder). - Step 2: Static test asserts
safeUrlexists and the JS parses (existingtest_static_workbench.pypattern — string/structure checks). - Step 3: Commit
fix: frontend URL scheme allowlist, fetch ok-check, image onerror.
Task 19: Complete ARIA tab semantics
- Step 1: Add
role="tablist"/"tab"/"tabpanel",aria-selected,aria-controls, keyboard arrow nav to the tab control (index.html:~187+ handler inapp.js). - Step 2: Commit
fix: complete ARIA tab pattern for keyboard nav.
Task 20: Misc hardening
- Step 1:
google_custom_search.py:65— move API key from query string to a header if the API supports it; else document why query param is required (Google CSE requireskey=query param → document, don't log full URL). Ensure the URL is never logged with the key. - Step 2:
face_person_detection.py:52— raise/log a clear warning if the OpenCV cascade asset is empty instead of silent no-op. - Step 3:
env_clients.py:187— emit a startup provider-readiness summary (which providers enabled/disabled and why). - Step 4: Commit
fix: provider readiness logging, cascade-load warning, avoid key in logs.
Phase 9: HIGH — split sqlite_store.py god object (the long pole)
Approach: Pure mechanical extraction, behavior-preserving. The public class CopyrighterStore stays; internals move to focused modules. After each extraction: full suite green, commit. Target: no file >800 lines.
Target module map (under src/rights_filter/server/store/):
url_utils.py— module-level URL helpers (_is_http_url,_decoded_nested_url,_url_path_has_image_suffix,_url_has_image_format_hint,_url_looks_like_image, normalization helpers ~4660-4772). Also import these innaver_search.py/google_custom_search.pyto kill triplication.remote_fetch.py—_fetch_url_bytes*, headers,_PageImageParser, css/page extraction,_assert_public_http_url(Task 8).schema.py—_ensure_constrained_schema,_rebuild_constrained_table,_table_*,CONSTRAINED_TABLE_SCHEMAS,LEGACY_REBUILD_ORDER, theexecutescriptDDL.serialization.py—_submission_payload,_evidence_payload,_evidence_id,_stable_id,_timestamp_id,_now_label,_text_list,_unique_texts.media_paths.py— the three*_media_pathmethods (as functions taking dirs) + path-traversal guard.enrichment_orchestrator.py—seed_from_image_store,rerun_enrichment,run_auto_search,_auto_*_search,_rerun_*(the heavy business orchestration).sqlite_store.py— slimCopyrighterStore:__init__,_connect,_transaction,_put/_get/_all/_count, and thin methods delegating to the orchestrator.
Task 21..27: one module per task
For EACH module above, repeat this micro-cycle:
- Step 1: Create the new module; cut the target functions verbatim into it (fix imports).
- Step 2: In
sqlite_store.py, import from the new module; where instance methods moved, keep a thin delegating method OR call the free function. Do NOT change logic. - Step 3: Run full
pytest -q→ must stay367 passed. - Step 4:
wc -lthe touched files; confirm shrinking. - Step 5: Commit
refactor: extract <module> from sqlite_store god object.
Order (low-risk first): url_utils → serialization → media_paths → remote_fetch → schema → enrichment_orchestrator. Each is independently testable via the existing suite; if any extraction reddens the suite, revert that single commit and re-slice smaller.
Task 28: De-duplicate URL helpers across integrations
- Step 1: Point
naver_search.py:629andgoogle_custom_search.pyURL helpers atstore/url_utils.py(or move shared helpers to a neutralrights_filter/common/url_utils.pyifserverimport direction is wrong fromintegrations). Respect layering:integrationsmust NOT importserver. → put sharedurl_utilsunderrights_filter/common/. - Step 2: Run full suite → PASS.
- Step 3: Commit
refactor: share URL helpers via common/url_utils.
Phase 10: Verification & finish
Task 29: Full verification
- Step 1:
python -m pytest -q→367 passed(+ new tests; count grows). - Step 2:
python -m pytest --cov=rights_filter --cov-report=term-missing -q→ coverage ≥ prior 89%. - Step 3: Confirm no source file >800 lines:
find src -name '*.py' | xargs wc -l | sort -rn | head. - Step 4: Offline-install dry run (Phase 3 wheelhouse).
- Step 5: Update
docs/operations/with the auth token + retention env vars.
Task 30: PR
- Step 1:
git diff main...HEADreview. - Step 2: Push branch, open PR with the review findings → fixes mapping table.
Deferred (explicitly out of this plan, log as follow-ups)
- Full
GovernancePolicyRegistryrole enforcement on every read/serve path (Task 2 + Task 9 cover the high-impact subset). rerun_enrichmentper-submission in-progress lock (LOW; cosmeticlastRerunDiffonly).seed_from_image_storeper-row claim lock (MEDIUM; converges on data, dup side-effects only) — mitigated once auth limits operators.- Optimistic-concurrency
versioncolumn on submissions (LOW).
Self-review notes
- Spec coverage: every HIGH (#1 SSRF→T8, #2 CORS/auth→T1/T2, #3 concurrency→T4/T5, #4 governance→T9/T10, #5 deps→T7, #6 god-file→T21-28, #7 atomic→T6), every MEDIUM (T11-T17), every LOW (T18-T20) maps to a task.
- Test-first for all behavioral changes; pure refactor (Phase 9) gated by existing suite.
- No placeholders: each behavioral task has concrete test + exact code/site.