OCR Architecture Review
Scope
This document describes the OCR pipeline that is currently implemented in Kototoro on this branch.
It focuses on:
- the active runtime architecture
- what has already been corrected
- which quality issues still remain
- how the current branch differs from the stricter target documented in OCR Pipeline
This document intentionally describes the code that exists now, not an aspirational design.
Executive Summary
The OCR stack is no longer bubble-first.
The active branch now follows a detector-first, recognizer-second pipeline, with bubble logic demoted to optional post-OCR assistance:
page image
-> text detection
-> region recognition
-> text merge
-> optional bubble-assisted grouping
-> translation
-> renderThe main branch-level improvements are:
ComicTextDetectorOnnxis now a standalone detector implementation instead of being mounted under the Paddle engine.- OCR routing in
ReaderPageTranslationProcessornow explicitly models detector and recognizer backends. - CTD is a first-class
OCR_DETECTORmodel and can be paired withMangaOCRor Paddle recognition. - Bubble detection is no longer the default OCR entrance. It is retained only as an optional downstream assistant.
- Render sizing for detector-anchored groups has been widened, and the reader now exposes a render debug overlay with explicit diagnoses.
The current architecture is much closer to manga-translator-ui than the earlier bubble-ROI-centric design, but it is still not identical.
Current Runtime Behavior
OCR routing is now detector/recognizer-based
The real runtime behavior is no longer well described by a single "OCR engine" label.
At the page level, ReaderPageTranslationProcessor now resolves a PageOcrRoute with:
- detector backend:
MLKITPADDLECTDBUBBLE_DETECTORas an optional bubble-first Japanese fallback path
- recognizer backend:
MLKITPADDLEMANGA_OCR
This is the most important architecture correction on the branch. The pipeline is now expressed in terms of:
detector -> recognizerinstead of a monolithic OCR engine abstraction.
CTD is now a real standalone detector
ComicTextDetectorOnnx.kt is now a dedicated ReaderTextDetector.
It is no longer treated as a Paddle detector variant.
Its current behavior is:
- ONNX Runtime inference
1024x1024letterbox preprocessing- primary decoding from the
detscore map - secondary recovery from
seg - fallback region recovery from
blk - quad generation for rotated regions
That means CTD now contributes detector geometry directly, including non-axis-aligned quads that can later be used for crop warping.
Supported active page OCR routes
The current active routes in ReaderPageTranslationProcessor.kt include:
MLKIT -> MLKITMLKIT -> MANGA_OCRMLKIT -> PADDLEPADDLE -> PADDLEPADDLE -> MANGA_OCRCTD -> PADDLECTD -> MANGA_OCR
There is also an optional Japanese-only BUBBLE_DETECTOR -> MANGA_OCR path when the configured pipeline strategy prefers bubble-first behavior.
This bubble-detector route still exists, but it is no longer the default architecture and should be treated as an escape hatch rather than the primary manga OCR path.
Merge and grouping are downstream from OCR
The active reader flow now behaves like:
- Detect text regions.
- Recognize text from those regions.
- Merge fragments into translation units.
- Optionally use bubble detection to attach groups to bubbles.
- Translate and render.
This is a material improvement over the old "bubble detector decides the OCR search space" design.
Rendering diagnostics now exist in the runtime
The branch now contains a dedicated render-debug layer:
BubbleDebugOverlayin ReaderTranslationBubbleModels.kt- debug drawing and diagnosis in ReaderPageTranslationProcessor.kt
When translation debug logging is enabled, the runtime can now visualize:
- source rect
- source content rect
- prepared render rect
- final content area
It also emits a diagnosis string:
渲染框偏小内容区偏小渲染框偏大基本匹配无内容框
This has turned render debugging from guesswork into an inspectable runtime feature.
What Has Been Corrected
1. CTD is no longer mounted under Paddle
Earlier versions mixed CTD into the Paddle engine path, which violated separation of concerns.
That is now fixed:
- Paddle handles Paddle detection and recognition.
- CTD handles CTD detection.
- recognizers consume detector regions through shared contracts.
This is a clear SRP improvement.
2. OCR model taxonomy is more truthful
comic_text_detector_onnx is now registered as OCR_DETECTOR, not as bubble detection.
That means model selection and runtime routing now agree with the actual role of the model.
3. Detector-anchored render rects are less aggressively compressed
The render path used to produce regions that were visibly too small even when OCR text and translation were complete.
The main correction in prepareTranslatedBubble(...) is:
- merge detector rect with
sourceContentRectwhen available - widen detector-anchored stabilization
- allow larger detector-anchored expansion scales
This corrected a real rendering bug rather than an OCR bug.
4. UI naming is more user-facing
The translation settings and model-management pages have been renamed to reflect user intent:
- text detection / recognition
- local translation
- online translation
- bubble detection
instead of exposing too many Paddle / ONNX implementation names at the top level.
This does not change the core architecture, but it makes the runtime model more truthful for users.
What Still Limits Quality
1. Bubble-first fallback still exists
BUBBLE_DETECTOR -> MANGA_OCR is still present as a strategy-dependent route.
That means the codebase has not fully committed to the rule:
OCR core path = detector -> recognizer -> mergeas the only runtime architecture.
The branch is much closer to that rule now, but not completely exclusive yet.
2. Brightness-based speech-bubble heuristics still exist
isLikelySpeechBubbleRegion(...) still uses luminance thresholds.
This is no longer the dominant OCR gate, which is good, but it still affects downstream bubble-like rendering behavior.
That heuristic remains brittle for:
- dark bubbles
- colored bubbles
- dense screentones
- narration boxes
- text outside classic white speech bubbles
3. Render still relies on rectangle-centric sizing
CTD now provides quads, and recognizers can warp from quads, but the final render-preparation stage still works mostly with rectangles.
This means the branch has improved:
- crop quality
- region alignment
more than it has improved:
- final translated overlay geometry
For vertical or rotated dialogue, this is still a visible limitation.
4. Merge and bubble assignment are still tightly coupled in some downstream behavior
The branch has already demoted bubble logic relative to OCR, but bubble grouping still influences how merged units are anchored for rendering.
That is acceptable for now, but it means merge is not yet fully independent from downstream bubble-placement assumptions.
Comparison With manga-translator-ui
The closest shared architectural shape is now:
page image
-> text detection
-> text-region recognition
-> merge
-> translation
-> renderKototoro is now aligned with manga-translator-ui on these core principles:
- text detection is a first-class stage
- recognition consumes detector regions
- merge happens after recognition
- bubble logic is not the default OCR gate
The main remaining gap is not the detector/recognizer split anymore. The main remaining gap is that Kototoro still preserves more downstream bubble-aware rendering and optional bubble-first fallback behavior than manga-translator-ui.
Current Architectural Assessment
The branch is no longer in the "wrong architecture" state.
The current state is better described as:
- the core OCR direction is correct
- the detector/recognizer split is implemented
- CTD has been promoted to a real detector
- render sizing and bubble-assisted downstream behavior still need refinement
In other words:
Main risk has moved from OCR entrance design
to detector quality, merge quality, and render geometry quality.Recommended Next Focus
The highest-value next steps are:
- Keep
detector -> recognizer -> mergeas the default and preferred route for all mainstream cases. - Continue reducing the architectural importance of bubble-first fallback.
- Push quad-aware geometry further into render preparation, not just crop preparation.
- Use the new debug overlay and diagnosis output to tune render sizing with real page evidence instead of heuristics alone.
The stricter target design is documented in OCR Pipeline.