Canonical File Naming
Documents on disk are named by their DB auto-increment ID plus a semantic suffix, and are organized in a tiered directory structure that scales from 10 documents to roughly 10^51.
Directory Structure
Each document is placed under a tier letter (A–Z) followed by zero or more 2-digit directory levels. The directory path is derived from a single global leaf index:
leaf = (id - 1) / 10
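The division is integer division, so IDs 1–10 all share leaf 0. The leaf value is decomposed into base-100 pairs (two decimal digits each). The number of pairs is the directory depth and determines the tier letter: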
| Tier | Depth | ID Range | Leaf Range | Example Path |
|---|---|---|---|---|
| A | 0 | 1–10 | 0 | A/7.orig.pdf |
| B | 1 | 11–1,000 | 1–99 | B/04/42.orig.pdf |
| C | 2 | 1,001–100,000 | 100–9,999 | C/01/23/1234.orig.pdf |
| D | 3 | 100,001–10,000,000 | 10,000–999,999 | D/12/34/56/1234567.orig.pdf |
| ... | ... | ... | ... | ... |
| Z | 25 | ... | ... | Z/01/.../99/{id}.orig.pdf |
Each leaf directory holds ~10 documents (~30–50 files including sidecars). Tier boundaries fall at clean powers: 10, 1000, 100,000, 10,000,000, ...
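As a rough illustration of how the tier boundaries in the table arise, the sketch below derives the depth and tier letter from an ID. The helper names `tierDepth`, `tierLetter`, and `tierMaxID` are hypothetical and not part of the documented API; the code assumes IDs start at 1 and fit in a `uint64`.

```go
package main

// tierDepth returns the number of 2-digit directory levels for a document ID.
// Hypothetical helper; assumes id >= 1.
func tierDepth(id uint64) int {
	leaf := (id - 1) / 10 // integer division: IDs 1-10 share leaf 0
	depth := 0
	for leaf > 0 {
		leaf /= 100 // each base-100 pair adds one directory level
		depth++
	}
	return depth
}

// tierLetter maps a depth to its tier letter: 0 -> 'A', 1 -> 'B', and so on.
func tierLetter(depth int) byte {
	return byte('A' + depth)
}

// tierMaxID returns the largest ID in the tier of the given depth: 10 * 100^depth.
// With uint64 arithmetic this is only valid for depth 0 through 9.
func tierMaxID(depth int) uint64 {
	limit := uint64(10)
	for i := 0; i < depth; i++ {
		limit *= 100
	}
	return limit
}
```

For example, `tierDepth(1000)` is 1 (tier B) while `tierDepth(1001)` is 2 (tier C), matching the ID ranges in the table.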
File Naming
C/01/23/1234.orig.pdf # Original document
C/01/23/1234.ocr.txt # OCR/extracted text
C/01/23/1234.thumb.png # Thumbnail
C/01/23/1234.tags.json # Tags metadata
Rules
- Root documents: `.pdf`, `.jpg`, `.jpeg`, `.png`, `.tiff`, `.doc`, `.docx`, `.odf`, `.rtf`, `.text`
- `.orig.` unambiguously marks the primary document
- `.ocr.` marks extracted/OCR text
- `.thumb.png` for thumbnails
- `.tags.json` for tag metadata
- DB `Name` field preserves the original filename for display
- `filepath.Ext("1234.orig.pdf")` returns `.pdf` so content-type serving works
Path Computation
- Compute leaf index: `leaf = (id - 1) / 10`
- Decompose into base-100 pairs: divide `leaf` repeatedly by 100, collecting 2-digit remainders right-to-left
- Determine letter: `letter = 'A' + number_of_pairs` (leaf 0 → A, 1–99 → B, 100–9999 → C, ...)
- Assemble path: `{root}/{letter}/{pair1}/{pair2}/.../{id}.orig.{ext}`
Worked examples
ID 7:
- leaf = (7-1)/10 = 0
- No pairs needed (leaf is 0)
- Letter = A (depth 0)
- Path: `A/7.orig.pdf`
ID 42:
- leaf = (42-1)/10 = 4
- One pair: `04`
- Letter = B (depth 1)
- Path: `B/04/42.orig.pdf`
Leaf directory B/04/ contains IDs 41–50 (10 documents).
ID 1234:
- leaf = (1234-1)/10 = 123
- Two pairs: 123 → `01`, `23`
- Letter = C (depth 2)
- Path: `C/01/23/1234.orig.pdf`
Leaf directory C/01/23/ contains IDs 1231–1240 (10 documents).
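The leaf ranges quoted in the worked examples follow directly from the leaf formula. A small sketch, using a hypothetical helper `leafIDRange` that is not part of the documented API:

```go
package main

// leafIDRange returns the first and last document IDs that share a leaf
// directory with id (hypothetical helper; assumes id >= 1).
func leafIDRange(id uint64) (first, last uint64) {
	leaf := (id - 1) / 10 // same leaf index as in the path computation
	first = leaf*10 + 1
	last = first + 9
	return first, last
}
```

`leafIDRange(42)` returns 41 and 50, and `leafIDRange(1234)` returns 1231 and 1240, which matches the leaf directory contents stated above.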
Legacy Structure (L/K/J)
The previous scheme used reverse-alphabet tier letters with the padded ID as both the directory path and filename:
L/00/12/34/001234.orig.pdf # L tier: IDs 1–99,999 (6-digit, 3 levels)
K/01/23/45/67/01234567.orig.pdf # K tier: IDs 100,000–9,999,999 (8-digit, 4 levels)
J/00/12/34/56/78/0012345678.orig.pdf # J tier: IDs 10,000,000+ (10-digit, 5 levels)
This put exactly 1 document per leaf directory (~4 files with sidecars). The new A–Z scheme groups ~10 documents per leaf (~30–50 files) and scales to 26 tiers.
The clean DB job migrates documents from legacy to canonical paths automatically.
Key Functions
- `ComputeNestedPath(id, ext, root)`: full canonical path
- `CanonicalDocName(id, ext)`: e.g. `"1234.orig.pdf"`
- `SidecarBasePath(docPath)`: strips `.orig.{ext}` to get sidecar base
- `getOCRPath(docPath)`, `getThumbPath(docPath)`, `getTagsPath(docPath)`: sidecar paths
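To illustrate how the sidecar helpers relate to the document path, here is a minimal sketch. The names `sidecarBase`, `ocrPath`, `thumbPath`, and `tagsPath` are stand-ins for the functions listed above, whose actual implementations are not shown in this document. Returning the input unchanged when no `.orig.` marker is present is an assumption.

```go
package main

import (
	"path/filepath"
	"strings"
)

// sidecarBase strips the ".orig.{ext}" suffix from a canonical document path.
// Assumption: a path without the ".orig" marker is returned unchanged.
func sidecarBase(docPath string) string {
	ext := filepath.Ext(docPath)             // e.g. ".pdf"
	stem := strings.TrimSuffix(docPath, ext) // e.g. "C/01/23/1234.orig"
	if !strings.HasSuffix(stem, ".orig") {
		return docPath
	}
	return strings.TrimSuffix(stem, ".orig") // e.g. "C/01/23/1234"
}

// Sidecar paths append the fixed suffixes from the File Naming section.
func ocrPath(docPath string) string   { return sidecarBase(docPath) + ".ocr.txt" }
func thumbPath(docPath string) string { return sidecarBase(docPath) + ".thumb.png" }
func tagsPath(docPath string) string  { return sidecarBase(docPath) + ".tags.json" }
```

Given `C/01/23/1234.orig.pdf`, these return `C/01/23/1234.ocr.txt`, `C/01/23/1234.thumb.png`, and `C/01/23/1234.tags.json`.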