POSA_Copyrighter/docs/project-introduction-and-technical-implementation.md

# Copyrighter Project Introduction and Technical Implementation

## 1. Project Overview

Copyrighter is an operator-facing image rights review system. It helps a review team identify potentially risky image submissions before they are approved for use. The system does not automatically make a legal copyright decision. Instead, it gathers evidence, computes a triage risk score, summarizes source-linked findings, and presents the case to a human operator for final approval, hold, rejection, or correction.

The core product idea is evidence-led review:

- collect internal image signals from the submitted file;
- enrich the case with approved external search and computer vision sources;
- compare images against known reference material and collected candidates;
- generate a source-grounded LLM summary for operator readability;
- preserve all evidence, actions, and changes in an auditable local database.

## 2. What Problem It Solves

Image review teams often need to decide whether a submitted image may be associated with a celebrity, character, broadcast, webtoon, game, brand asset, copied source image, or previously rejected internal reference. A manual-only workflow is slow because the operator has to inspect the image, search the web, remember prior decisions, compare similar images, and document the reason.

Copyrighter reduces that effort by turning a raw submission into a structured review case:

- a risk score and risk band for queue prioritization;
- top evidence explaining why the case may be risky;
- external search tool status for Google, Naver, and local LLM summarization;
- source URLs, image candidates, query history, and provider failures;
- a knowledge database for confirmed, watchlist, and excluded references;
- an audit log for operational traceability.

## 3. Main User Experience

The operator console is a local web application with these primary work areas:

- Review Queue: board-style list of submissions with image thumbnail, risk, top evidence, external search tool status, applicant status, operator decision, and timestamp.
- Case Review: detailed case screen where evidence and judgment controls are reviewed in one flow. The operator can mark evidence as used or unused and make the final decision.
- Knowledge DB: confirmed references and watchlist candidates used for future internal similarity checks.
- External Search Tool Usage: provider status, quota, recent success/failure state, and emergency disable controls.
- Audit Log: event history covering provider changes, analysis runs, manual searches, knowledge entry updates, and operator decisions.

## 4. High-Level Architecture

Copyrighter is implemented as a local Python backend with a static operator GUI.

Key components:

- `src/rights_filter/server/http_app.py`: local HTTP API server.
- `src/rights_filter/server/sqlite_store.py`: SQLite-backed application store, evidence persistence, provider state synchronization, enrichment orchestration, and audit events.
- `src/rights_filter/server/image_store.py`: local submission image loading.
- `web/operator-gui/`: static HTML, CSS, and JavaScript operator console.
- `data/copyrighter.sqlite3`: local SQLite database.
- `data/submissions/`: local submission image source directory.

The system can run locally at `http://127.0.0.1:9500/`.

## 5. End-to-End Processing Flow

1. A submission image is loaded from the local image store.
2. The backend creates or refreshes a submission record in SQLite.
3. Internal analysis generates local evidence:
   - SHA-256 exact fingerprint;
   - pHash perceptual fingerprint;
   - face/person presence signal;
   - known reference similarity matches.
4. Approved external enrichment may run:
   - Google Cloud Vision Web Detection for web entities, matching images, visually similar images, and pages with matching images;
   - Naver text-query search for Korean image, blog, and web evidence;
   - Google Custom Search, only when configured; this is treated as an optional legacy path that can be disabled.
5. Search result images and page images can be compared against the submitted image using pHash similarity.
6. Ollama local LLM summarizes only the stored source evidence.
7. `RiskScorer` computes a rule-based risk score and band.
8. The operator reviews the evidence and makes the final decision manually.

## 6. AI, ML, and Algorithmic Components

Copyrighter uses several AI/ML or algorithmic techniques. They have different roles and should not be described as one generic "AI score."

### 6.1 Google Cloud Vision Web Detection

Google Cloud Vision Web Detection is the strongest external ML-based computer vision component. It analyzes an approved image derivative and returns:

- web entities;
- full matching images;
- partial matching images;
- visually similar images;
- pages with matching images;
- best guess labels.

These results are stored as evidence with source, URL, image URL, page title, match type, provider score, and confidence.

### 6.2 Local LLM Summarization with Ollama

The local LLM assistant uses Ollama's Generate API through `src/rights_filter/analysis/llm_assistance.py`.

Its prompt explicitly restricts the model:

- summarize only the provided source evidence;
- do not make a final decision;
- do not add claims that are not grounded in source evidence.

The LLM output is stored as a source-linked summary evidence item. It helps operators read the case faster, but it does not directly add to the risk score.
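
As a minimal sketch of how such a restricted prompt could be sent, the example below builds a request for Ollama's `/api/generate` endpoint with streaming turned off. The prompt wording, the `llama3` model name, the evidence dictionary keys (`id`, `source`, `reason`), and the helper names are assumptions for illustration; they are not taken from `llm_assistance.py`.

```python
import json
import urllib.request

# Hypothetical prompt rules that mirror the three restrictions listed above.
PROMPT_RULES = (
    "Summarize only the source evidence listed below.\n"
    "Do not make a final decision.\n"
    "Do not add claims that are not grounded in the source evidence.\n"
    "Cite evidence by its ID in square brackets."
)


def build_prompt(evidence):
    """Render each stored evidence item as one line the model may cite."""
    lines = [f"[{e['id']}] ({e['source']}) {e['reason']}" for e in evidence]
    return PROMPT_RULES + "\n\nSource evidence:\n" + "\n".join(lines)


def build_generate_payload(evidence, model="llama3"):
    """Build the JSON body for Ollama's Generate API (non-streaming)."""
    return {"model": model, "prompt": build_prompt(evidence), "stream": False}


def summarize(evidence, base_url="http://127.0.0.1:11434", model="llama3", timeout=60.0):
    """Call a local Ollama server and return a source-linked summary item."""
    body = json.dumps(build_generate_payload(evidence, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        text = json.loads(response.read())["response"]
    # The summary keeps the IDs of the evidence it was built from, so it
    # never stands alone as unlinked evidence.
    return {
        "source": "llm",
        "summary": text,
        "sourceEvidenceIds": [e["id"] for e in evidence],
    }
```

Keeping payload construction separate from the network call makes the prompt restrictions easy to check without a running Ollama server.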

### 6.3 Face and Person Presence Detection

`src/rights_filter/analysis/face_person_detection.py` uses OpenCV Haar cascades for presence-only face/person detection. It detects whether a face-like/person-like signal exists in the image.

Important boundary:

- it does not identify a person;
- it does not store face embeddings;
- it does not perform biometric matching;
- it is used only as a review-priority signal.

### 6.4 Image Fingerprints and pHash Similarity

`src/rights_filter/analysis/fingerprints.py` generates:

- SHA-256 exact file fingerprint;
- 64-bit perceptual hash from an 8x8 grayscale thumbnail.

pHash similarity is computed from the Hamming distance between two 64-bit hashes: the fewer bits that differ, the higher the similarity. This is not a learned ML model, but it is a key algorithmic image-comparison feature. A similarity score close to `1.0` means the images are visually very similar by this hash method.
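
The two fingerprints and the similarity calculation can be sketched as follows. This is an illustration under stated assumptions, not the code in `fingerprints.py`: the hash bit rule (pixel brighter than the thumbnail mean) and the normalization `1 - distance / 64` are assumed, and the input is taken to be an already-resized 8x8 grayscale grid so the example needs no imaging library.

```python
import hashlib
from typing import Sequence


def sha256_fingerprint(data: bytes) -> str:
    """Exact fingerprint: identical bytes always give the same hex digest."""
    return hashlib.sha256(data).hexdigest()


def perceptual_hash(gray8x8: Sequence[Sequence[int]]) -> int:
    """64-bit hash from an 8x8 grayscale grid.

    Assumed rule: a bit is 1 where the pixel is brighter than the mean.
    """
    pixels = [p for row in gray8x8 for p in row]
    if len(pixels) != 64:
        raise ValueError("expected an 8x8 grid (64 pixels)")
    mean = sum(pixels) / 64
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hash_similarity(a: int, b: int) -> float:
    """Map Hamming distance (0..64 differing bits) to a 0.0..1.0 score."""
    distance = bin(a ^ b).count("1")
    return 1.0 - distance / 64
```

Under this normalization, a `0.9` similarity threshold corresponds to at most 6 differing bits out of 64.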

## 7. Risk Score and Confidence Model

The risk score is not an LLM-generated probability. It is a rule-based triage score implemented in `src/rights_filter/analysis/risk_scoring.py`.

The scorer adds points based on evidence type:

- pHash similarity `>= 0.9`: strong image similarity signal.
- Face/person presence: additional review signal.
- Google full image match: strong external match.
- Google partial/page match: medium external match.
- Google visual match: weaker supporting signal.
- Promoted Naver search result: score based on confidence.
- LLM summary: no direct score contribution.

The final score is capped at 100 and mapped into bands:

- `high`: 70 or above;
- `medium`: 30 to 69;
- `low`: below 30.

Therefore, `riskScore = 100` means "highest review priority under the rule set." It does not mean "100% legally infringing."
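
The additive scoring, the cap, and the band mapping can be sketched as below. Only the `0.9` pHash threshold, the cap at 100, and the band boundaries come from this document; every point value, the evidence `kind` strings, and the function names are hypothetical and do not reflect the actual weights in `risk_scoring.py`.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

# Hypothetical point values, chosen only to respect the ordering above
# (strong > medium > weaker); the real weights are not documented here.
FIXED_POINTS = {
    "google_full_match": 40,
    "google_partial_match": 25,
    "google_page_match": 25,
    "google_visual_match": 10,
    "face_person_presence": 10,
    "llm_summary": 0,  # the summary never contributes to the score
}
PHASH_POINTS = 40
PHASH_THRESHOLD = 0.9
NAVER_MAX_POINTS = 20


@dataclass(frozen=True)
class Evidence:
    kind: str
    confidence: float = 0.0
    similarity: float = 0.0


def points_for(item: Evidence) -> int:
    """Points contributed by a single evidence item."""
    if item.kind == "phash_similarity":
        return PHASH_POINTS if item.similarity >= PHASH_THRESHOLD else 0
    if item.kind == "naver_promoted":
        clamped = max(0.0, min(1.0, item.confidence))
        return round(NAVER_MAX_POINTS * clamped)
    return FIXED_POINTS.get(item.kind, 0)


def risk_band(score: int) -> str:
    """Map a capped score to the documented bands."""
    if score >= 70:
        return "high"
    if score >= 30:
        return "medium"
    return "low"


def compute_risk(items: Iterable[Evidence]) -> Tuple[int, str]:
    """Sum the points, cap at 100, and attach the band."""
    score = min(100, sum(points_for(i) for i in items))
    return score, risk_band(score)
```

Because the score is a capped sum of rule points, two cases that both reach 100 are not necessarily equally strong; the operator still needs the evidence list to tell them apart.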

## 8. Evidence Model

Evidence is the central unit of the system. Each evidence item can include:

- source, such as fingerprint, face, Google, Naver, LLM, or failure;
- reason/title;
- confidence;
- query and query strategy;
- URL, image URL, thumbnail URL, source page URL;
- match type and provider score;
- source evidence IDs;
- contribution status;
- operator status.

Operators can mark evidence as used or unused for judgment. This keeps the final decision explainable without allowing raw automation to become the final decision maker.
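
One possible shape for an evidence record and the used/unused marking is sketched below. The field names, the `unreviewed` default, and the helper functions are assumptions for illustration; the actual schema lives in the SQLite store and may differ.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceItem:
    evidence_id: str
    source: str                        # e.g. "fingerprint", "google", "llm"
    reason: str
    confidence: float
    url: Optional[str] = None
    image_url: Optional[str] = None
    match_type: Optional[str] = None
    provider_score: Optional[float] = None
    source_evidence_ids: Tuple[str, ...] = ()  # links a summary to its sources
    contributes_to_score: bool = True
    operator_status: str = "unreviewed"        # hypothetical default state


def mark_evidence(item: EvidenceItem, status: str) -> EvidenceItem:
    """Return a copy with the operator's used/unused marking applied."""
    if status not in {"used", "unused"}:
        raise ValueError(f"unsupported operator status: {status}")
    return replace(item, operator_status=status)


def judgment_basis(items: Iterable[EvidenceItem]) -> List[EvidenceItem]:
    """Evidence the operator explicitly relied on for the final decision."""
    return [i for i in items if i.operator_status == "used"]
```

Returning a new record instead of mutating the old one keeps each marking a discrete change, which fits a workflow where changes are also written to an audit log.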

## 9. External Search Tool Usage

Each external integration is managed as a provider with its own status, quota, and enable/disable control. The UI exposes them as "External Search Tool Usage" rather than generic "providers."

Supported provider paths include:

- Google Cloud Vision for image/web detection;
- Naver Search API for text-query based Korean evidence;
- Google Custom Search when configured, treated as an optional legacy path that can be disabled;
- Ollama local LLM for evidence summarization.

Provider state is calculated per submission:

- `covered`: evidence exists;
- `empty`: the tool ran but returned no useful result;
- `not_run`: no run has happened;
- `failed`: the tool attempted execution and failed;
- `disabled`: the tool is configured off.

This distinction avoids misleading queue states, such as showing every enabled tool as merely pending when it has actually failed or returned nothing.
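
A minimal sketch of how the five states could be derived for one tool on one submission is shown below. The `ProviderRun` type and the precedence order (disabled first, then not run, then failed, then covered or empty) are assumptions; the store may resolve overlapping conditions differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderRun:
    """Hypothetical record of the latest run of one tool for one submission."""
    succeeded: bool
    evidence_count: int = 0


def provider_state(enabled: bool, last_run: Optional[ProviderRun]) -> str:
    """Derive the per-submission state for a single external tool."""
    if not enabled:
        return "disabled"
    if last_run is None:
        return "not_run"
    if not last_run.succeeded:
        return "failed"
    return "covered" if last_run.evidence_count > 0 else "empty"
```

Separating `failed` from `empty` matters for triage: an empty result is a weak signal that nothing was found, while a failure means the tool contributed no information at all.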

## 10. Knowledge DB and Feedback Loop

Copyrighter includes a knowledge database for reusable review references:

- confirmed entries: accepted reusable references;
- watchlist entries: derived from held/rejected cases but not yet confirmed;
- excluded entries: false positives or stale references.

The knowledge DB helps the internal analyzer detect future similar submissions. It is deliberately operator-controlled: automated evidence can suggest candidates, but operators decide what becomes a reusable reference.
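
The operator-controlled rule can be sketched as a status change that only an operator may perform, plus a filter that keeps excluded entries out of similarity checks. The `KnowledgeEntry` fields, the `actor_role` parameter, and both function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List

# Confirmed and watchlist entries take part in similarity checks;
# excluded entries (false positives, stale references) do not.
MATCHABLE_STATUSES = {"confirmed", "watchlist"}
ALL_STATUSES = MATCHABLE_STATUSES | {"excluded"}


@dataclass(frozen=True)
class KnowledgeEntry:
    entry_id: str
    phash: int
    status: str


def change_status(entry: KnowledgeEntry, new_status: str, actor_role: str) -> KnowledgeEntry:
    """Only an operator may change what counts as a reusable reference."""
    if actor_role != "operator":
        raise PermissionError("knowledge entries are operator-controlled")
    if new_status not in ALL_STATUSES:
        raise ValueError(f"unknown status: {new_status}")
    return replace(entry, status=new_status)


def matchable_entries(entries: Iterable[KnowledgeEntry]) -> List[KnowledgeEntry]:
    """Entries the internal analyzer would compare new submissions against."""
    return [e for e in entries if e.status in MATCHABLE_STATUSES]
```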

## 11. Persistence and Auditability

SQLite is used as the local persistence layer. The store manages:

- submissions;
- evidence;
- external search tool records;
- knowledge entries;
- collection candidates;
- corrections;
- audit events.

Audit events are created for important actions such as analysis runs, manual provider calls, provider setting changes, knowledge entry creation, and submission pruning. This makes the review process traceable and easier to inspect after a decision.
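
As an illustration of append-only audit storage in SQLite, the sketch below records events and reads back the trail for one submission. The `audit_events` table, its columns, and the event type strings are assumptions; they are not the schema used by `sqlite_store.py`.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open a SQLite database and ensure the hypothetical audit table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS audit_events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at TEXT NOT NULL,
            event_type TEXT NOT NULL,
            submission_id TEXT,
            detail_json TEXT NOT NULL
        )
        """
    )
    return conn


def record_audit_event(conn, event_type, submission_id=None, detail=None):
    """Append one audit event and return its row ID."""
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO audit_events (created_at, event_type, submission_id, detail_json) "
            "VALUES (?, ?, ?, ?)",
            (created_at, event_type, submission_id, json.dumps(detail or {}, sort_keys=True)),
        )
    return cursor.lastrowid


def audit_trail(conn, submission_id):
    """Return (event_type, detail) pairs for one submission, oldest first."""
    rows = conn.execute(
        "SELECT event_type, detail_json FROM audit_events "
        "WHERE submission_id = ? ORDER BY id",
        (submission_id,),
    ).fetchall()
    return [(event_type, json.loads(detail)) for event_type, detail in rows]
```

The sketch exposes only insert and read helpers; leaving out update and delete paths is one simple way to keep an audit trail trustworthy.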

## 12. Operational Boundaries

The project intentionally keeps several strict boundaries:

- Do not automate Google Image Search, Google Lens, Naver web UI, or scrape result pages.
- Do not send original images to Naver; Naver is used through text queries.
- Do not store biometric templates, face embeddings, or celebrity identity matches from faces.
- Do not expose internal risk scores or evidence details to applicants.
- Do not let automated analysis change final review status.
- Do not treat LLM output as standalone evidence unless it links back to source evidence.

These boundaries make the system more defensible: AI/ML is used to support review, not to replace accountable human judgment.

## 13. Configuration

Runtime behavior is configured through environment variables and provider runtime construction in `src/rights_filter/integrations/env_clients.py`.

Examples of configurable areas:

- Google Cloud Vision API key and request limits;
- Naver client ID/secret and query limits;
- Google Custom Search key/CX when used;
- Ollama base URL and model;
- daily limits and provider-specific policies;
- automatic Naver query limits;
- search result image comparison thresholds.

When a required external credential is missing, the corresponding tool can be disabled while the internal workflow continues.
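
The missing-credential behavior can be sketched as below: a provider whose required variables are absent is returned as disabled with a reason instead of raising an error. The environment variable names (`NAVER_CLIENT_ID`, `GOOGLE_VISION_API_KEY`, `<NAME>_DAILY_LIMIT`), the default limit, and the `ProviderConfig` type are hypothetical and are not taken from `env_clients.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class ProviderConfig:
    name: str
    enabled: bool
    daily_limit: int
    disabled_reason: Optional[str] = None


def load_provider(
    name: str,
    required_keys: Sequence[str],
    env: Mapping[str, str] = os.environ,
    default_daily_limit: int = 100,
) -> ProviderConfig:
    """Build one provider's config; missing credentials disable it quietly."""
    missing = [key for key in required_keys if not env.get(key)]
    limit = int(env.get(f"{name.upper()}_DAILY_LIMIT", default_daily_limit))
    if missing:
        reason = "missing " + ", ".join(missing)
        return ProviderConfig(name, False, limit, reason)
    return ProviderConfig(name, True, limit)
```

Passing the environment as a mapping keeps the loader testable with a plain dictionary, and the recorded `disabled_reason` gives the UI something concrete to show.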

## 14. Why This Can Be Described as an AI/ML System

The project can accurately be described as using AI/ML because it includes:

- ML-based computer vision through Google Cloud Vision Web Detection;
- local generative AI summarization through Ollama;
- classical computer vision face/person presence detection;
- algorithmic image similarity through perceptual hashing;
- rule-based evidence scoring for review triage.

The strongest and most accurate positioning is:

> Copyrighter is an AI/ML-assisted image rights review platform that automatically collects, compares, and summarizes source-linked evidence, while keeping final rights decisions under human operator control.

## 15. Current Limitations

The system is a review-assistance platform, not a legal decision engine. Known limitations include:

- image similarity does not prove copyright ownership;
- search results can be incomplete, duplicated, stale, or misleading;
- weak labels and visually similar images are low-confidence signals;
- LLM summaries can only be trusted to the extent that the source evidence is complete and correctly linked;
- provider failures and quota limits must be visible to operators rather than silently treated as low risk.

These limitations are handled by surfacing evidence, source links, provider status, and manual operator controls in the UI.