cw2007/POSA_Copyrighter

유창욱 2eb7bd3b8b docs: mark god-file split complete in remediation plan

2026-06-20 22:36:02 +09:00

32 KiB

Raw Blame History

Project Review Remediation Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:executing-plans (inline, batched with checkpoints) to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Fix all HIGH/MEDIUM/LOW findings from the 2026-06-20 multi-agent project review, including splitting the 5119-line sqlite_store.py god object.

Architecture: Mechanical, low-risk fixes first (security, concurrency, deps), each guarded by the existing 367-test suite. The god-file split is last and incremental: move a cohesive method group into a new module, re-export/delegate, run the full suite, commit — repeat. No behavior changes during the split.

Tech Stack: Python 3.13, stdlib http.server + sqlite3 + urllib, OpenCV/Pillow/NumPy (lazy), vanilla-JS frontend. Tests: pytest 8.4 + pytest-cov.

Global Constraints

Air-gapped (폐쇄망) target: every artifact installs/runs with NO internet for its OWN deps. External search APIs (Google/Naver/Vision) at runtime are an ALLOWED exception.
Do not "fix" allowed design: external search API calls; operator-efficiency tradeoffs; relaxed google_search manual-query rule. Leave them.
Regression gate: python -m pytest -q (367 tests, ~3.4 min) must stay green after every task. Run targeted tests during a task, full suite before each commit.
Immutability / style: match existing code style; new objects over mutation where reasonable; functions <50 lines, files <800 lines (the split exists to enforce this).
No new runtime network deps. Stdlib only for new code paths.
Commit format: <type>: <desc> (feat/fix/refactor/test/chore/docs). Branch fix/project-review-remediation.

Phase 0: Safety net

Task 0: Branch + baseline

Step 1: Create branch

git checkout -b fix/project-review-remediation

Step 2: Baseline the suite (record green)

python -m pytest -q

Expected: 367 passed

Step 3: No commit (branch only).

Phase 1: HIGH — HTTP server hardening (CORS, auth, body cap)

Files: Modify src/rights_filter/server/http_app.py; Test tests/rights_filter/server/test_http_app.py.

Task 1: Remove wildcard CORS

The operator GUI is served same-origin from this server, so it needs no CORS header at all. Access-Control-Allow-Origin: * lets any site the operator visits read biometric/case data from localhost.

Step 1: Failing test — add to test_http_app.py:

def test_json_responses_do_not_send_wildcard_cors(http_server):
    base = f"http://127.0.0.1:{http_server.server_port}"
    response = urlopen(f"{base}/api/bootstrap")
    assert response.headers.get("Access-Control-Allow-Origin") is None

Step 2: Run → FAIL (assert '*' is None).
Step 3: In http_app.py:207-214 _json, delete the line:

            self.send_header("Access-Control-Allow-Origin", "*")

Step 4: Run targeted test → PASS. Run full test_http_app.py.
Step 5: Commit fix: remove wildcard CORS from API responses.

Task 2: Optional shared-token auth gate

In 폐쇄망 a single shared bearer token (from env COPYRIGHTER_AUTH_TOKEN) is enough. If unset, server stays open (back-compat for dev) but logs a one-time warning; if set, every /api/*, /media/*, /knowledge-media/*, /collected-media/*, /face-crop-media/* request must carry Authorization: Bearer <token> (or ?token= for <img> src which cannot send headers).

Interfaces produced: build_server(..., auth_token: str | None = None).

Step 1: Failing test:

def test_media_requires_token_when_configured(tmp_path):
    store, image_store, static_dir = _make_server_deps(tmp_path)  # existing helper pattern
    server = build_server("127.0.0.1", 0, store, image_store, static_dir, auth_token="secret")
    port = server.server_port
    thread = threading.Thread(target=server.serve_forever, daemon=True); thread.start()
    try:
        with pytest.raises(HTTPError) as exc:
            urlopen(f"http://127.0.0.1:{port}/api/bootstrap")
        assert exc.value.code == 401
        req = Request(f"http://127.0.0.1:{port}/api/bootstrap", headers={"Authorization": "Bearer secret"})
        assert urlopen(req).status == 200
    finally:
        server.shutdown()

Step 2: Run → FAIL (build_server() got unexpected kwarg auth_token).
Step 3: Implement. In build_server signature add auth_token: str | None = None. At top of do_GET/do_POST/do_PATCH, before routing, call a guard:

        def _authorized(self) -> bool:
            if not auth_token:
                return True
            header = self.headers.get("Authorization", "")
            if header == f"Bearer {auth_token}":
                return True
            query_token = parse_qs(urlparse(self.path).query).get("token", [""])[0]
            return query_token == auth_token

        def _require_auth(self) -> bool:
            if self._authorized():
                return True
            self._json({"error": "unauthorized"}, HTTPStatus.UNAUTHORIZED)
            return False

Add at the start of each handler body (after path = _path(self.path)):

            if not self._require_auth():
                return

Import parse_qs from urllib.parse. Thread auth_token from __main__.py env (os.environ.get("COPYRIGHTER_AUTH_TOKEN")).

Step 4: Run targeted + full test_http_app.py → PASS (existing tests pass auth_token=None).
Step 5: Add COPYRIGHTER_AUTH_TOKEN= to .env.example. Commit feat: optional shared-token auth gate for operator server.

Task 3: Cap request body size

_body() reads Content-Length unbounded.

Step 1: Failing test: POST with Content-Length over cap returns 413.

def test_oversized_body_rejected(http_server):
    base = f"http://127.0.0.1:{http_server.server_port}"
    big = b"{" + b'"x":"' + b"a" * (11 * 1024 * 1024) + b'"}'
    req = Request(f"{base}/api/knowledge/manual", data=big, headers={"Content-Type": "application/json"})
    with pytest.raises(HTTPError) as exc:
        urlopen(req)
    assert exc.value.code == 413

Step 2: Run → FAIL.
Step 3: In _body() (line 201-205), add constant _MAX_BODY_BYTES = 10 * 1024 * 1024 at module level, and:

        def _body(self) -> dict[str, object]:
            length = int(self.headers.get("Content-Length", "0") or "0")
            if length > _MAX_BODY_BYTES:
                raise _PayloadTooLarge()
            if not length:
                return {}
            return json.loads(self.rfile.read(length).decode("utf-8"))

Add class _PayloadTooLarge(Exception): pass and catch it in each handler → self._json({"error": "payload too large"}, HTTPStatus.REQUEST_ENTITY_TOO_LARGE).

Step 4: Run → PASS.
Step 5: Commit fix: cap HTTP request body size.

Phase 2: HIGH — SQLite concurrency + atomic writes

Files: Modify src/rights_filter/server/sqlite_store.py, src/rights_filter/server/http_app.py; Test tests/rights_filter/server/test_sqlite_store.py, test_http_app.py.

Task 4: WAL + busy_timeout in `_connect`

Step 1: Failing test (test_sqlite_store.py):

def test_connect_uses_wal_and_busy_timeout(tmp_path):
    store = CopyrighterStore(tmp_path / "c.sqlite3"); store.initialize()
    with store._connect() as conn:
        assert conn.execute("pragma journal_mode").fetchone()[0].lower() == "wal"
        assert int(conn.execute("pragma busy_timeout").fetchone()[0]) >= 5000

Step 2: Run → FAIL (delete != wal).
Step 3: Edit _connect (3248-3252):

    def _connect(self) -> sqlite3.Connection:
        conn = sqlite3.connect(self.db_path)
        conn.execute("pragma foreign_keys = on")
        conn.execute("pragma busy_timeout = 5000")
        conn.row_factory = sqlite3.Row
        return conn

In initialize() (after mkdir, before/inside the first _connect), set WAL once:

        with self._connect() as conn:
            conn.execute("pragma journal_mode = wal")
            conn.executescript(...)  # existing

Step 4: Run → PASS. Full suite (WAL persists; verify no test asserts delete mode).
Step 5: Commit fix: enable SQLite WAL and busy_timeout for concurrent operators.

Task 5: Map `sqlite3.OperationalError` to 503

Step 1: Failing test (test_http_app.py): monkeypatch a store method to raise sqlite3.OperationalError("database is locked"), assert handler returns 503 JSON not a crash.

def test_operational_error_returns_503(http_server, monkeypatch):
    monkeypatch.setattr(http_server.store_for_test, "bootstrap",
                        lambda: (_ for _ in ()).throw(sqlite3.OperationalError("database is locked")))
    base = f"http://127.0.0.1:{http_server.server_port}"
    with pytest.raises(HTTPError) as exc:
        urlopen(f"{base}/api/bootstrap")
    assert exc.value.code == 503

(If the existing http_server fixture doesn't expose the store, add a minimal one matching the existing fixture pattern.)

Step 2: Run → FAIL (500/connection abort).
Step 3: In all three handlers, add before the generic catches:

            except sqlite3.OperationalError:
                self._json({"error": "service busy, retry"}, HTTPStatus.SERVICE_UNAVAILABLE)

Import sqlite3 in http_app.py.

Step 4: Run → PASS.
Step 5: Commit fix: surface SQLite lock contention as retryable 503.

Task 6: Atomic decision + audit (transaction threading)

Make record_decision's state change + audit event commit in ONE transaction so the audit trail can't diverge.

Interfaces: _put(table, id, payload, *, conn=None) and add_audit_event(..., *, conn=None) and _create_or_update_watchlist_entry(..., *, conn=None) accept an optional connection; when given, they reuse it (no own commit). Add a private _transaction() contextmanager.

Step 1: Failing test: simulate audit failure mid-decision, assert submission status is NOT committed (rolled back).

def test_record_decision_rolls_back_when_audit_fails(tmp_path, monkeypatch, seeded_store):
    store, sid = seeded_store
    def boom(*a, **k): raise RuntimeError("audit down")
    monkeypatch.setattr(store, "add_audit_event", boom)
    with pytest.raises(RuntimeError):
        store.record_decision(sid, "held", "m")
    assert store._get("submissions", sid)["decisionStatus"] == "unreviewed"

Step 2: Run → FAIL (status persisted because _put committed independently).
Step 3: Implement:
- Add helper:

    @contextmanager
    def _transaction(self):
        conn = self._connect()
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()

Add conn: sqlite3.Connection | None = None param to _put, add_audit_event, _create_or_update_watchlist_entry, _get. When conn is provided, use it directly instead of with self._connect(); do not commit (the _transaction owns commit).
Rewrite record_decision body (1004-1018) to wrap the three writes:

        with self._transaction() as conn:
            submission = self._get("submissions", submission_id, conn=conn)
            submission["decisionStatus"] = decision
            submission["applicantStatus"] = (...same...)
            self._put("submissions", submission_id, submission, conn=conn)
            if decision in {"held", "rejected"}:
                self._create_or_update_watchlist_entry(submission_id, decision, memo, image_store, conn=conn)
            self.add_audit_event("rights.ops", "Operator decision created", submission_id,
                                 f"{decision} · {memo or 'memo optional'}", conn=conn)
        return self.review(submission_id)

_connect() in autocommit-less mode: pass isolation_level default keeps implicit transactions; ensure the helper functions that do with self._connect() as conn: are NOT affected (they keep their own path when conn is None).
Step 4: Run targeted + full suite → PASS.
Step 5: Commit fix: make record_decision + audit atomic via single transaction.

NOTE: apply the same _transaction wrapping to mark_evidence_status (state + audit) as a follow-up micro-task with the same pattern; keep each as its own commit.

Phase 3: HIGH — Air-gapped dependency manifest

Files: Create requirements.txt, requirements-lock.txt, docs/operations/offline-install.md.

Task 7: Pin runtime deps

Step 1: Capture the known-good versions (from current working env):

fastapi==0.115.0
flask==3.1.2
httpx==0.27.0
numpy==2.3.5
opencv-python-headless==4.13.0.92
pillow==12.2.0
requests==2.32.5

Write requirements.txt with these (the project's actual imports — verify each is imported in src/; drop any not imported, e.g. confirm fastapi/flask usage with grep -rn "import fastapi\|import flask" src).

Step 2: Generate a full transitive lock on the build machine:

pip freeze > requirements-lock.txt

Step 3: Document offline install in docs/operations/offline-install.md:

# Build machine (has internet):
pip download -r requirements-lock.txt -d wheelhouse/
# Transfer wheelhouse/ + repo across air gap, then on target:
pip install --no-index --find-links wheelhouse/ -r requirements-lock.txt

Step 4: Verify the manifest installs offline (build a wheelhouse, pip install --no-index into a fresh venv). Record result.
Step 5: Commit chore: pin runtime dependencies for offline air-gapped install.

Phase 4: HIGH — SSRF guard on remote fetches

Files: Modify src/rights_filter/server/sqlite_store.py; Test test_sqlite_store.py.

Task 8: Block private/loopback/link-local hosts + disable redirects

Step 1: Failing tests:

@pytest.mark.parametrize("url", [
    "http://127.0.0.1/x", "http://169.254.169.254/latest/meta-data",
    "http://10.0.0.1/a", "http://192.168.1.1/a", "http://[::1]/a", "http://localhost/a",
])
def test_fetch_rejects_internal_hosts(url):
    from rights_filter.server.sqlite_store import _fetch_url_bytes
    with pytest.raises(ValueError):
        _fetch_url_bytes(url)

Step 2: Run → FAIL (attempts real fetch / no guard).
Step 3: Add a guard used by _fetch_url_bytes_with_headers before urlopen:

import ipaddress, socket
from urllib.parse import urlparse

def _assert_public_http_url(url: str) -> None:
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"} or not parsed.hostname:
        raise ValueError("only public http(s) URLs may be fetched")
    host = parsed.hostname
    try:
        infos = socket.getaddrinfo(host, parsed.port or (443 if parsed.scheme == "https" else 80), proto=socket.IPPROTO_TCP)
    except OSError as exc:
        raise ValueError("could not resolve fetch host") from exc
    for *_, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved or ip.is_multicast or ip.is_unspecified:
            raise ValueError("refusing to fetch internal address")

Call _assert_public_http_url(url) at the top of _fetch_url_bytes_with_headers (line 3771). Disable redirects: build an opener with a no-redirect handler, OR re-validate on redirect. Minimal: use a custom HTTPRedirectHandler that re-runs _assert_public_http_url on redirect_request, via build_opener. Replace urlopen(request, ...) with that opener's .open(...).

Step 4: Run → PASS. Verify the existing happy-path fetch test (monkeypatched urlopen) still passes — it may need to also patch _assert_public_http_url or use a public-looking host; adjust the existing test's monkeypatch to bypass DNS (patch _assert_public_http_url to no-op in that test).
Step 5: Commit fix: block SSRF to internal addresses in remote fetchers.

Phase 5: HIGH — PII / governance retention enforcement

Files: Modify src/rights_filter/server/sqlite_store.py, src/rights_filter/governance/policies.py; Create src/rights_filter/jobs/retention_sweeper.py; Test tests/rights_filter/jobs/test_retention_sweeper.py, test_sqlite_store.py.

Task 9: Face-crop retention sweep + deletion path

Add a retention TTL (env COPYRIGHTER_FACE_CROP_RETENTION_DAYS, default e.g. 90; 0 = keep) and a sweep that deletes expired face-crop files + clears their DB references, writing an audit event.

Step 1: Failing test: create a face-crop file with an old mtime + a submission referencing it; run sweep; assert file deleted, reference cleared, audit event present.
Step 2: Run → FAIL (no sweeper).
Step 3: Implement FaceCropRetentionSweeper(store, retention_days) in jobs/retention_sweeper.py:
- List store.face_crop_image_dir files older than cutoff; delete; for each submission whose faceCrops referenced a deleted file, update payload to drop it; add_audit_event(actor="system", event="Face crop purged", ...).
- Add store.purge_expired_face_crops(now_epoch) method doing the DB+file work in one _transaction.
Step 4: Run → PASS.
Step 5: Commit feat: face-crop retention sweep with audit trail.

Task 10: Wire it to run + audit external personal-image transmission

Step 1: Call the sweeper at server startup (__main__.py) and at the start of seed_from_image_store (cheap, idempotent). Add add_audit_event(actor="system", event="Personal image sent to external provider", object_id=submission_id, change="face crop → Google Vision") at the face-crop external-transmission site (sqlite_store.py:~2127, gated on the crop actually being sent).
Step 2: Test: assert the audit event is written when a face crop is transmitted (use the existing fake Vision client path).
Step 3: Run → PASS.
Step 4: Commit feat: audit personal-image external transmission; run retention sweep at startup.

NOTE: Full GovernancePolicyRegistry enforcement on read/serve routes is large; this phase enforces the highest-impact rule (retention/deletion of biometric data) + makes transmission auditable. Deeper role enforcement is folded into Task 2's auth gate. Document remaining registry wiring as a follow-up in the plan's "Deferred" section.

Phase 6: MEDIUM — correctness / idempotency

Files: Modify sqlite_store.py, analysis/evidence_enrichment.py; Tests alongside.

Task 11: Atomic provider usage increment

Replace read-modify-write counter updates with one SQL UPDATE.

Step 1: Failing test: two interleaved increments don't lose an update. Simulate by calling a new store._increment_provider_usage(provider, n) twice and asserting cumulative.
Step 2: Run → FAIL (method missing).
Step 3: Add:

    def _increment_provider_usage(self, provider: str, count: int, *, conn=None) -> None:
        owns = conn is None
        conn = conn or self._connect()
        try:
            conn.execute("update providers set usage = usage + ? where id = ?", (int(count), provider))
            if owns: conn.commit()
        finally:
            if owns: conn.close()

Replace the 5 sites (1361, 1415, 2076, 2301, 2535) provider_payload["usage"] += n; ...; self._put("providers", ...) so the usage field is updated via _increment_provider_usage and the remaining lastSuccess/lastFailure go through _put (or fold into the UPDATE). Keep behavior identical for single-threaded tests.

Step 4: Run → PASS + full suite.
Step 5: Commit fix: atomic provider usage increment to prevent lost updates.

Task 12: Idempotent candidate promotion

Step 1: Failing test: promoting the same candidate twice yields ONE knowledge entry; second call returns existing id.
Step 2: Run → FAIL (two entries).
Step 3: In promote_collection_candidate (1670): if candidate.get("status") == "promoted", return bootstrap (or the existing entry) without creating a new one. Derive entry_id = _stable_id("kb-candidate", candidate_id) (deterministic; _stable_id exists at ~3697) so _put's upsert collapses retries. Wrap the entry put + candidate flip in _transaction.
Step 4: Run → PASS.
Step 5: Commit fix: make collection candidate promotion idempotent.

Task 13: Short-circuit LLM re-invocation

Step 1: Failing test (test_evidence_enrichment.py): a run that already has an LLM summary does NOT call llm_assistant.summarize (spy/mock asserts not called).
Step 2: Run → FAIL (currently always called at line 93).
Step 3: Edit enrich_latest lines 93-97:

        if not _has_existing_llm_summary(run.evidence):
            llm_summary = self.llm_assistant.summarize(submission_id, source_evidence)
            if llm_summary.source == EvidenceSource.ENRICHMENT_FAILURE:
                summary.summary_failures += 1
            new_evidence.append(llm_summary)

Step 4: Run → PASS + full suite.
Step 5: Commit fix: skip LLM summarize when a summary already exists.

Task 14: Migration crash-safety

Step 1: Failing test: seed a legacy table with a row violating the new CHECK so insert...select raises; assert original data still readable and no orphan __copyrighter_*_new table remains.
Step 2: Run → FAIL (orphan temp table / partial state).
Step 3: In _ensure_constrained_schema/_rebuild_constrained_table (308-362): on exception, drop any leftover __copyrighter_*_new temp tables in the finally/except; rebuild all pending tables into temp tables BEFORE any drop table, then drop+rename in a tight loop (minimize the irreversible window). Add cleanup of stale temp tables at startup.
Step 4: Run → PASS.
Step 5: Commit fix: crash-safe constraint migration with temp-table cleanup.

Phase 7: MEDIUM — test coverage for security-sensitive paths

Files: Tests only.

Task 15: Path-traversal guard tests

Step 1: Add to test_sqlite_store.py: for each of knowledge_media_path, collected_media_path, face_crop_media_path, assert ValueError on "../../etc/passwd" and url-encoded "..%2f..%2fx" (decode handled by caller, so test the method with ../).
Step 2: Run → PASS immediately (guards exist). These are regression pins.
Step 3: Add one HTTP-level test: GET /collected-media/../../secret → 400/404.
Step 4: Commit test: pin media path-traversal guards.

Task 16: Risk-band boundary tests

Step 1: Add table tests to test_risk_scoring.py asserting bands at scores 29→low, 30→medium, 69→medium, 70→high, and an input whose raw sum >100 → score==100. Craft evidence: FINGERPRINT similarity≥0.9 (+80) + WEB_DETECTION full (+45) = 125 → expect 100/high.
Step 2: Run → PASS (logic is correct; pins boundaries).
Step 3: Commit test: pin risk-band thresholds and 100 cap.

Task 17: Upload/import validation tests

Step 1: Add HTTP tests for: unsupported suffix, non-base64 data, empty data → 400; import-folder missing path → 400; body not json → 400 (not 500).
Step 2: Run → mostly PASS; if not json yields 500, fix _body to raise ValueError on JSONDecodeError:

            try:
                return json.loads(...)
            except json.JSONDecodeError as exc:
                raise ValueError("invalid JSON body") from exc

Step 3: Commit test: cover upload/import boundary validation; 400 on malformed JSON.

Phase 8: LOW — frontend defense-in-depth + misc

Files: Modify web/operator-gui/app.js; integrations/google_custom_search.py; analysis/face_person_detection.py; integrations/env_clients.py. Test tests/operator_gui/test_static_workbench.py.

Task 18: Client-side `safeUrl` allowlist + `apiJson` ok-check + img onerror

Step 1: Add safeUrl(u) in app.js returning u only if it starts http:///https:////, else "#". Apply at href/src construction (~1722, ~1159). In apiJson (~116), check response.ok BEFORE parsing JSON; on non-ok, read text and throw with status. Add onerror to evidence/candidate <img> (hide or placeholder).
Step 2: Static test asserts safeUrl exists and the JS parses (existing test_static_workbench.py pattern — string/structure checks).
Step 3: Commit fix: frontend URL scheme allowlist, fetch ok-check, image onerror.

Task 19: Complete ARIA tab semantics

Step 1: Add role="tablist"/"tab"/"tabpanel", aria-selected, aria-controls, keyboard arrow nav to the tab control (index.html:~187 + handler in app.js).
Step 2: Commit fix: complete ARIA tab pattern for keyboard nav.

Task 20: Misc hardening

Step 1: google_custom_search.py:65 — move API key from query string to a header if the API supports it; else document why query param is required (Google CSE requires key= query param → document, don't log full URL). Ensure the URL is never logged with the key.
Step 2: face_person_detection.py:52 — raise/log a clear warning if the OpenCV cascade asset is empty instead of silent no-op.
Step 3: env_clients.py:187 — emit a startup provider-readiness summary (which providers enabled/disabled and why).
Step 4: Commit fix: provider readiness logging, cascade-load warning, avoid key in logs.

Phase 9: HIGH — split `sqlite_store.py` god object (the long pole)

Approach: Pure mechanical extraction, behavior-preserving. The public class CopyrighterStore stays; internals move to focused modules. After each extraction: full suite green, commit. Target: no file >800 lines.

Target module map (under src/rights_filter/server/store/):

url_utils.py — module-level URL helpers (_is_http_url, _decoded_nested_url, _url_path_has_image_suffix, _url_has_image_format_hint, _url_looks_like_image, normalization helpers ~4660-4772). Also import these in naver_search.py/google_custom_search.py to kill triplication.
remote_fetch.py — _fetch_url_bytes*, headers, _PageImageParser, css/page extraction, _assert_public_http_url (Task 8).
schema.py — _ensure_constrained_schema, _rebuild_constrained_table, _table_*, CONSTRAINED_TABLE_SCHEMAS, LEGACY_REBUILD_ORDER, the executescript DDL.
serialization.py — _submission_payload, _evidence_payload, _evidence_id, _stable_id, _timestamp_id, _now_label, _text_list, _unique_texts.
media_paths.py — the three *_media_path methods (as functions taking dirs) + path-traversal guard.
enrichment_orchestrator.py — seed_from_image_store, rerun_enrichment, run_auto_search, _auto_*_search, _rerun_* (the heavy business orchestration).
sqlite_store.py — slim CopyrighterStore: __init__, _connect, _transaction, _put/_get/_all/_count, and thin methods delegating to the orchestrator.

Task 21..27: one module per task

For EACH module above, repeat this micro-cycle:

Step 1: Create the new module; cut the target functions verbatim into it (fix imports).
Step 2: In sqlite_store.py, import from the new module; where instance methods moved, keep a thin delegating method OR call the free function. Do NOT change logic.
Step 3: Run full pytest -q → must stay 367 passed.
Step 4: wc -l the touched files; confirm shrinking.
Step 5: Commit refactor: extract <module> from sqlite_store god object.

Order (low-risk first): url_utils → serialization → media_paths → remote_fetch → schema → enrichment_orchestrator. Each is independently testable via the existing suite; if any extraction reddens the suite, revert that single commit and re-slice smaller.

Task 28: De-duplicate URL helpers across integrations

Step 1: Point naver_search.py:629 and google_custom_search.py URL helpers at store/url_utils.py (or move shared helpers to a neutral rights_filter/common/url_utils.py if server import direction is wrong from integrations). Respect layering: integrations must NOT import server. → put shared url_utils under rights_filter/common/.
Step 2: Run full suite → PASS.
Step 3: Commit refactor: share URL helpers via common/url_utils.

Phase 10: Verification & finish

Task 29: Full verification

Step 1: python -m pytest -q → 367 passed (+ new tests; count grows).
Step 2: python -m pytest --cov=rights_filter --cov-report=term-missing -q → coverage ≥ prior 89%.
Step 3: Confirm no source file >800 lines: find src -name '*.py' | xargs wc -l | sort -rn | head.
Step 4: Offline-install dry run (Phase 3 wheelhouse).
Step 5: Update docs/operations/ with the auth token + retention env vars.

Task 30: PR

Step 1: git diff main...HEAD review.
Step 2: Push branch, open PR with the review findings → fixes mapping table.

Implementation status (2026-06-20)

All HIGH/MEDIUM/LOW findings implemented, tested, and committed on branch fix/project-review-remediation; full suite 400 passed. Commits:

Phase 1 (HTTP hardening) → e9a15e8; auth reworked to HttpOnly cookie after security review → 1abb110.
Phase 2 (WAL/busy_timeout/atomic decision) → 20a6f55.
Phase 3 (deps manifest + offline runbook) → 8958dd1.
Phase 4 (SSRF guard) → 62e2d18.
Phase 5 + 6 (PII retention purge, transmission audit, provider-counter atomicity, promote idempotency, seed serialization, LLM short-circuit, migration crash-safety) + Phase 7 tests + Task 20 readiness → 7f5799e.
Phase 8 (frontend safeUrl/apiJson/onerror) → f8aa10f; protocol-relative URL fix after security review → 7317bfb.

Phase 9 (god-file split) — complete

Behavior-preserving, each step gated on the full suite (400 passed). The 5119-line sqlite_store.py god file is now 724 lines (the CopyrighterStore facade: __init__, initialize, seed, bootstrap, _search_coverage, review, decision, providers, media, audit) composed from focused modules + mixins:

Helper modules (stateless): store_url_utils (52), store_text (27), store_constants (113), store_remote_fetch (125, fetch + SSRF opener), store_schema (257, DDL/migration), store_serialization (597, payload/domain helpers), store_page_scrape (976, HTML/CSS/JSON image-URL extraction).

Class mixins (composed into CopyrighterStore via inheritance): StorePersistenceMixin (313, connection/transaction/_put/_get), StoreQueueMixin (170), StoreSearchCandidatesMixin (743, similarity + candidate storage + rescore), StoreEnrichmentMixin (789, provider sync + face

rerun + auto-search), StoreOperationsMixin (761, knowledge lifecycle + manual/rerun search + collection).

All modules are <800 lines except store_page_scrape (976) — a single cohesive HTML/CSS/JSON/srcset image-URL parser whose helpers are tightly coupled (parser ↔ css ↔ json ↔ srcset predicates); splitting it further would fragment one responsibility, so it is intentionally left whole. Tests that monkeypatch HeuristicFacePersonDetector were updated to also patch store_enrichment (where face detection now runs). Cross-file URL-helper dedup (Task 28) is intentionally NOT done — the integration adapters' suffix policy diverges from the store's (.svg).

Deferred (explicitly out of this plan, log as follow-ups)

Full GovernancePolicyRegistry role enforcement on every read/serve path (Task 2 + Task 9 cover the high-impact subset).
mark_evidence_status atomic state+audit transaction (same pattern as record_decision; deferred per Phase 2 note).
rerun_enrichment per-submission in-progress lock (LOW; cosmetic lastRerunDiff only).
Optimistic-concurrency version column on submissions (LOW).

Self-review notes

Spec coverage: every HIGH (#1 SSRF→T8, #2 CORS/auth→T1/T2, #3 concurrency→T4/T5, #4 governance→T9/T10, #5 deps→T7, #6 god-file→T21-28, #7 atomic→T6), every MEDIUM (T11-T17), every LOW (T18-T20) maps to a task.
Test-first for all behavioral changes; pure refactor (Phase 9) gated by existing suite.
No placeholders: each behavioral task has concrete test + exact code/site.

32 KiB Raw Blame History