Document Lifecycle
Documents in godocs move through three phases: ingestion, active editing, and archival. This document describes the full lifecycle and the archival design.
Phases
Ingress folder     Document folder             Archive folder
┌─────────────┐    ┌──────────────────────┐    ┌──────────────────────┐
│ New files   │───>│ Active documents     │───>│ Archive pending      │
│             │    │ (view, tag, edit)    │    │ (files + metadata)   │
└─────────────┘    └──────────────────────┘    └──────────────────────┘
   Ingestion           Active phase                        │
   (existing)          (existing)                          │
                                                           v
                                                    External backup
                                                   tool moves files
                                                           │
                                                           v
                                                 ┌──────────────────┐
                                                 │ Archived         │
                                                 │ (metadata only   │
                                                 │ in DB, frozen)   │
                                                 └──────────────────┘
1. Ingestion (existing)
Files arrive in the ingress folder and are processed in three steps:
- Hash & deduplicate: MD5 hash calculated, checked against DB
- Move to canonical path: file copied to nested document folder (L/00/12/34/001234.orig.pdf), hash verified, source deleted
- Extract & enrich: OCR text extracted, thumbnail generated, .tags.json applied, search index updated
No changes needed here.
2. Active editing (existing)
Documents in the document folder can be:
- Viewed, searched, tagged, assigned to stories
- Rotated (destructive, rehashes)
- Date and metadata edited
- Bulk-edited via multi-select
No changes needed here.
3. Archival (new)
Archival removes documents from day-to-day use while preserving all metadata for audit and recovery. It is a two-stage process: archive pending then archived.
Archive design
Archive folder
A new config value, ARCHIVE_PATH, sets the archive location (default: an archive/ directory that is a sibling of DOCUMENT_PATH). The archive folder mirrors the nested directory structure of the document folder:
documents/L/00/12/34/001234.orig.pdf → archive/L/00/12/34/001234.orig.pdf
documents/L/00/12/34/001234.ocr.txt → archive/L/00/12/34/001234.ocr.txt
documents/L/00/12/34/001234.thumb.png → archive/L/00/12/34/001234.thumb.png
documents/L/00/12/34/001234.tags.json → archive/L/00/12/34/001234.tags.json
archive/L/00/12/34/001234.lifecycle.json (new)
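The mirroring rule above can be sketched as a path rewrite that swaps the root and keeps the nested part. This is a minimal illustration, not the godocs implementation: the function name archivePathFor and the literal roots "documents" and "archive" are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// archivePathFor maps a file under documentRoot to the mirrored location
// under archiveRoot, keeping the nested directories unchanged.
// The name and signature are hypothetical.
func archivePathFor(documentRoot, archiveRoot, file string) (string, error) {
	rel, err := filepath.Rel(documentRoot, file)
	if err != nil {
		return "", err
	}
	// Reject paths that would escape the document root.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("%s is not inside %s", file, documentRoot)
	}
	return filepath.Join(archiveRoot, rel), nil
}

func main() {
	p, err := archivePathFor("documents", "archive", "documents/L/00/12/34/001234.orig.pdf")
	if err != nil {
		panic(err)
	}
	fmt.Println(filepath.ToSlash(p))
}
```

Because only the root changes, the same function covers every sidecar (.ocr.txt, .thumb.png, .tags.json) without knowing how the nested path was derived.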
Lifecycle metadata file
A new .lifecycle.json sidecar is written at archive time. This keeps the .tags.json file unchanged (frozen) and records archive-specific metadata separately:
{
"archived_at": "2026-02-25T14:30:00Z",
"archived_by": "godocs",
"archive_reason": "user-initiated",
"original_path": "L/00/12/34/001234.orig.pdf",
"hash": "d41d8cd98f00b204e9800998ecf8427e",
"ulid": "01JFXYZ...",
"db_id": 1234,
"schema_version": "1"
}
This means:
- .tags.json is copied as-is (frozen at archive time)
- .lifecycle.json records when, why, and the document identity
- An external backup tool can read .lifecycle.json to verify integrity (hash) and track provenance
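For a tool that reads or writes the sidecar, the JSON example maps naturally onto a typed struct. The sketch below is one possible shape: the Go type, field names, and helper functions are assumptions, and only the JSON keys come from the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Lifecycle mirrors the keys of the .lifecycle.json example.
// The Go names are hypothetical; the JSON tags follow the design.
type Lifecycle struct {
	ArchivedAt    time.Time `json:"archived_at"`
	ArchivedBy    string    `json:"archived_by"`
	ArchiveReason string    `json:"archive_reason"`
	OriginalPath  string    `json:"original_path"`
	Hash          string    `json:"hash"`
	ULID          string    `json:"ulid"`
	DBID          int64     `json:"db_id"`
	SchemaVersion string    `json:"schema_version"`
}

// encodeLifecycle renders the sidecar as indented JSON.
func encodeLifecycle(lc Lifecycle) ([]byte, error) {
	return json.MarshalIndent(lc, "", "  ")
}

// decodeLifecycle parses a sidecar read from disk.
func decodeLifecycle(data []byte) (Lifecycle, error) {
	var lc Lifecycle
	err := json.Unmarshal(data, &lc)
	return lc, err
}

func main() {
	lc := Lifecycle{
		ArchivedAt:    time.Date(2026, time.February, 25, 14, 30, 0, 0, time.UTC),
		ArchivedBy:    "godocs",
		ArchiveReason: "user-initiated",
		OriginalPath:  "L/00/12/34/001234.orig.pdf",
		Hash:          "d41d8cd98f00b204e9800998ecf8427e",
		ULID:          "01JFXYZ...", // elided in the design; kept as written
		DBID:          1234,
		SchemaVersion: "1",
	}
	out, err := encodeLifecycle(lc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Keeping schema_version as a string, as in the example, lets a reader reject sidecars it does not understand before trusting the other fields.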
Archive states
Archival uses two states tracked via a dedicated archive_status column on the documents table (not a tag — see rationale below):
| State | archive_status | Files on disk | Visible in UI | Editable |
|---|---|---|---|---|
| Active | NULL | document folder | Yes | Yes |
| Archive pending | 'pending' | archive folder | Only in archive view | No (frozen) |
| Archived | 'archived' | removed from archive folder (by external tool) | No | No (frozen) |
Why a column, not a tag? The "Archive Pending" concept needs to:
- Prevent edits (tags can't enforce this)
- Filter documents from all default queries (tags require subquery exclusion in every query)
- Track a timestamp (archived_at)
- Be queryable without joins
However, an "Archive Pending" system tag is also created (like the existing "Hide" tag) so the archive state is visible in the tag UI and in .tags.json exports. The tag is applied automatically when archival begins and is the mechanism by which users can select documents for archival via the existing bulk-edit multi-select.
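The state table can be expressed as a small type whose methods answer the questions each code path asks (show it? allow edits? expect files?). This is a sketch only: the type and method names are assumptions, and the empty string stands in for SQL NULL.

```go
package main

import (
	"errors"
	"fmt"
)

// ArchiveStatus models the nullable archive_status column.
// The empty string represents NULL (an assumption of this sketch).
type ArchiveStatus string

const (
	StatusActive   ArchiveStatus = ""
	StatusPending  ArchiveStatus = "pending"
	StatusArchived ArchiveStatus = "archived"
)

// ErrArchived carries the rejection message quoted in the design.
var ErrArchived = errors.New("document is archived")

// InDefaultViews reports whether home and search queries should return the document.
func (s ArchiveStatus) InDefaultViews() bool { return s == StatusActive }

// FilesAvailable reports whether godocs still expects files on disk.
func (s ArchiveStatus) FilesAvailable() bool { return s != StatusArchived }

// CheckEditable rejects edits for anything that is not active.
func (s ArchiveStatus) CheckEditable() error {
	if s != StatusActive {
		return ErrArchived
	}
	return nil
}

func main() {
	for _, s := range []ArchiveStatus{StatusActive, StatusPending, StatusArchived} {
		fmt.Printf("%-10q default=%v files=%v editErr=%v\n",
			string(s), s.InDefaultViews(), s.FilesAvailable(), s.CheckEditable())
	}
}
```

Centralising these checks on the column value is what a tag alone cannot do: every handler can call CheckEditable without joining against the tag tables.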
Workflow
Selecting documents for archival
Uses the existing multi-select system:
- User visits home page with ?select=1
- Selects documents via checkboxes
- Clicks "Archive Selected" button on the bulk-edit page
- Confirmation dialog: "Archive N documents? This will move files to the archive folder and freeze metadata."
Archive pending stage
When the user confirms:
- For each document:
  - Set archive_status = 'pending', archived_at = NOW() in DB
  - Add the "Archive Pending" system tag
  - Export final .tags.json (includes the Archive Pending tag)
  - Write .lifecycle.json sidecar
  - Move all files (.orig.*, .ocr.txt, .thumb.png, .tags.json, .lifecycle.json) to the archive folder, preserving nested structure
  - Update documents.path to point to archive location
- Documents disappear from default views (rows with archive_status IS NOT NULL are filtered out)
- Documents are frozen: tag/date/metadata edits are rejected with "document is archived"
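The file-move step above could look roughly like the following. It covers only the filesystem part (no DB update, tag, or sidecar writing), and the function name moveToArchive, the relBase parameter, and the demo layout are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// moveToArchive moves every file that shares a document's base name
// (001234.orig.pdf, 001234.ocr.txt, ...) into the mirrored archive directory.
// relBase is the nested path without extension, e.g. "L/00/12/34/001234".
func moveToArchive(documentRoot, archiveRoot, relBase string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(documentRoot, relBase+".*"))
	if err != nil {
		return nil, err
	}
	destDir := filepath.Join(archiveRoot, filepath.Dir(relBase))
	if err := os.MkdirAll(destDir, 0o755); err != nil {
		return nil, err
	}
	var moved []string
	for _, src := range matches {
		dst := filepath.Join(destDir, filepath.Base(src))
		// os.Rename only works within one filesystem; crossing devices
		// would need a copy-verify-delete fallback.
		if err := os.Rename(src, dst); err != nil {
			return moved, err
		}
		moved = append(moved, dst)
	}
	return moved, nil
}

// demo builds a throwaway document folder and archives one document from it.
func demo() (int, error) {
	root, err := os.MkdirTemp("", "godocs-move")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	docs, arch := filepath.Join(root, "documents"), filepath.Join(root, "archive")
	relBase := filepath.Join("L", "00", "12", "34", "001234")
	if err := os.MkdirAll(filepath.Dir(filepath.Join(docs, relBase)), 0o755); err != nil {
		return 0, err
	}
	for _, ext := range []string{".orig.pdf", ".ocr.txt", ".thumb.png", ".tags.json", ".lifecycle.json"} {
		if err := os.WriteFile(filepath.Join(docs, relBase)+ext, []byte("x"), 0o644); err != nil {
			return 0, err
		}
	}
	moved, err := moveToArchive(docs, arch, relBase)
	return len(moved), err
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("files moved:", n)
}
```

Returning the list of files already moved on failure gives the caller enough information to roll back a partially archived document.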
Archived stage
An external program (backup tool, rsync script, cloud uploader) is responsible for moving files from the archive folder to long-term storage. Once files are moved:
- External tool calls: PUT /api/document/{ulid}/archive-confirm
- godocs sets archive_status = 'archived'
- Physical files are now gone from the archive folder
- DB record retained indefinitely as a metadata-only tombstone
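From the external tool's side, the confirmation is a single PUT with the ULID in the path. The sketch below assumes a base URL such as http://localhost:8000 and assumes any 2xx response means success; the design fixes only the method and path, so authentication and response format are left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newConfirmRequest builds the archive-confirm call for one document.
// baseURL is an assumption; only the path and method come from the design.
func newConfirmRequest(baseURL, ulid string) (*http.Request, error) {
	endpoint, err := url.JoinPath(baseURL, "api", "document", ulid, "archive-confirm")
	if err != nil {
		return nil, err
	}
	return http.NewRequest(http.MethodPut, endpoint, nil)
}

// confirmArchived sends the request and treats any non-2xx status as a failure
// (the success criterion is assumed, not specified).
func confirmArchived(client *http.Client, baseURL, ulid string) error {
	req, err := newConfirmRequest(baseURL, ulid)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("archive-confirm for %s failed: %s", ulid, resp.Status)
	}
	return nil
}

func main() {
	// Only builds and prints the request; no network call is made here.
	req, err := newConfirmRequest("http://localhost:8000", "01EXAMPLE")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```

A tool would call confirmArchived only after its own copy has been verified, so a failed confirmation leaves the document in the pending state and can simply be retried.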
Viewing archived documents
- Default queries exclude documents where archive_status IS NOT NULL
- A dedicated /archive page lists archived documents (metadata only, no file access)
- The /archive page shows: name, date, tags, hash, archived_at, archive_status
- No download/view/thumbnail, because the files are gone
Unarchiving (optional, manual)
If files need recovery before the external tool has moved them (i.e. still in archive folder):
- PUT /api/document/{ulid}/unarchive
- Moves files back from archive folder to document folder
- Clears archive_status, archived_at
- Removes "Archive Pending" tag
- Fails if archive_status = 'archived' (files already gone)
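Taken together, the workflow allows three transitions: active to pending, pending to archived, and pending back to active. A guard for the last two might be sketched as below; the names are hypothetical, and the behaviour for requests made in an unexpected state (for example, unarchiving an active document) is an assumption the design does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// Status models archive_status; the empty string stands in for NULL.
type Status string

const (
	Active   Status = ""
	Pending  Status = "pending"
	Archived Status = "archived"
)

var (
	// ErrFilesGone mirrors the rule that unarchive fails once files have left the archive folder.
	ErrFilesGone = errors.New("cannot unarchive: files already moved out of the archive folder")
	// ErrNotPending covers requests made in any other state (assumed behaviour).
	ErrNotPending = errors.New("document is not archive-pending")
)

// Unarchive returns the status after an unarchive request.
func Unarchive(s Status) (Status, error) {
	switch s {
	case Pending:
		return Active, nil
	case Archived:
		return s, ErrFilesGone
	default:
		return s, ErrNotPending
	}
}

// ConfirmArchive returns the status after the external tool confirms the move.
func ConfirmArchive(s Status) (Status, error) {
	if s != Pending {
		return s, ErrNotPending
	}
	return Archived, nil
}

func main() {
	s, err := Unarchive(Pending)
	fmt.Printf("unarchive pending  -> %q, err=%v\n", string(s), err)
	s, err = Unarchive(Archived)
	fmt.Printf("unarchive archived -> %q, err=%v\n", string(s), err)
}
```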
Database changes
documents table
ALTER TABLE documents ADD COLUMN archive_status TEXT; -- NULL, 'pending', 'archived'
ALTER TABLE documents ADD COLUMN archived_at TIMESTAMP; -- when archival began
System tag
Migration creates the "Archive Pending" tag:
INSERT INTO tags (name, color, description, tag_group, sort_order, created_at, updated_at)
VALUES ('Archive Pending', '#95a5a6', 'Document queued for archival', 'System', 10,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP)
ON CONFLICT (name) DO NOTHING;
Config changes
ARCHIVE_PATH=archive # relative or absolute; default: sibling of DOCUMENT_PATH
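One way to resolve this setting is sketched below. It assumes that a relative ARCHIVE_PATH is resolved against the parent of DOCUMENT_PATH, which is what makes the default a sibling directory; the comment above does not state the base for relative values, so treat that rule and the function name as assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveArchivePath turns the ARCHIVE_PATH setting into a usable directory.
// Assumption: relative values are resolved against DOCUMENT_PATH's parent,
// so the default "archive" ends up next to the document folder.
func resolveArchivePath(archivePath, documentPath string) string {
	if archivePath == "" {
		archivePath = "archive"
	}
	if filepath.IsAbs(archivePath) {
		return filepath.Clean(archivePath)
	}
	parent := filepath.Dir(filepath.Clean(documentPath))
	return filepath.Join(parent, archivePath)
}

func main() {
	fmt.Println(resolveArchivePath("", "/data/documents"))
	fmt.Println(resolveArchivePath("/mnt/backup/archive", "/data/documents"))
}
```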
API endpoints (new)
| Method | Path | Purpose |
|---|---|---|
| POST | /api/documents/archive | Archive documents (body: {"ulids": [...]}) |
| PUT | /api/document/{ulid}/archive-confirm | External tool confirms files moved |
| PUT | /api/document/{ulid}/unarchive | Undo archive-pending (if files still exist) |
| GET | /api/documents/archived | List archived document metadata |
UI changes
| Page | Change |
|---|---|
| Bulk edit | Add "Archive Selected" button |
| Home/search | Filter out archive_status IS NOT NULL (like Hide filtering) |
| New /archive page | Read-only list of archived documents with metadata |
| Document edit | Reject edits if archive_status is set; show "Archived" banner |
Implementation order
- Add archive_status and archived_at columns (migration)
- Add "Archive Pending" system tag (migration)
- Add ARCHIVE_PATH to config
- Filter archived documents from default queries
- Implement archive operation (move files, write .lifecycle.json, update DB)
- Add "Archive Selected" to bulk-edit page
- Add /archive list page
- Add archive-confirm endpoint for external tools
- Add unarchive endpoint
- Freeze edits on archived documents
Interaction with existing features
- Clean DB: Skip documents with archive_status IS NOT NULL during orphan scanning. Do not delete archive-pending files from the archive folder.
- Hide tag: Orthogonal. A document can be hidden (excluded from default view) without being archived. Archival is permanent removal; hiding is temporary suppression.
- Stories: Archived documents remain associated with stories in the DB but won't appear in story document lists.
- Search: Archived documents excluded from search results by default. The /archive page could have its own search.
- Ingestion: No interaction. Ingestion only adds new active documents.
External backup tool contract
The external tool is expected to:
- Scan the archive folder for .lifecycle.json files
- Read .lifecycle.json to get hash, ULID, and document identity
- Copy/move all sibling files (.orig.*, .ocr.txt, .thumb.png, .tags.json, .lifecycle.json) to backup storage
- Verify hash of .orig.* matches .lifecycle.json hash
- Call PUT /api/document/{ulid}/archive-confirm to mark as archived
- Delete files from archive folder (or let godocs clean DB do it)
The tool never needs to understand nested paths or canonical naming — it just processes whatever it finds in the archive folder.
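The scan and verify steps of this contract can be sketched as follows. The sketch assumes the recorded hash is an MD5 of the .orig.* file (ingestion uses MD5), reads the original into memory for brevity, and uses hypothetical function names; copying to backup storage and the confirm call are left out.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

const sidecarSuffix = ".lifecycle.json"

// findSidecars walks the archive folder and returns every .lifecycle.json path.
func findSidecars(archiveRoot string) ([]string, error) {
	var found []string
	err := filepath.WalkDir(archiveRoot, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && strings.HasSuffix(p, sidecarSuffix) {
			found = append(found, p)
		}
		return nil
	})
	return found, err
}

// verifySidecar checks the sibling .orig.* file against the recorded hash and
// returns the ULID to use for archive-confirm. MD5 is an assumption.
func verifySidecar(sidecar string) (string, error) {
	data, err := os.ReadFile(sidecar)
	if err != nil {
		return "", err
	}
	var meta struct {
		Hash string `json:"hash"`
		ULID string `json:"ulid"`
	}
	if err := json.Unmarshal(data, &meta); err != nil {
		return "", err
	}
	origs, err := filepath.Glob(strings.TrimSuffix(sidecar, sidecarSuffix) + ".orig.*")
	if err != nil {
		return "", err
	}
	if len(origs) != 1 {
		return "", fmt.Errorf("expected one .orig.* file next to %s, found %d", sidecar, len(origs))
	}
	content, err := os.ReadFile(origs[0]) // a real tool would stream large files
	if err != nil {
		return "", err
	}
	sum := md5.Sum(content)
	if got := hex.EncodeToString(sum[:]); !strings.EqualFold(got, meta.Hash) {
		return "", fmt.Errorf("hash mismatch for %s: got %s, want %s", origs[0], got, meta.Hash)
	}
	return meta.ULID, nil
}

// demo builds a one-document archive folder in a temp dir and verifies it.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "godocs-archive")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	base := filepath.Join(root, "001234")
	// An empty file has the MD5 d41d8cd98f00b204e9800998ecf8427e.
	if err := os.WriteFile(base+".orig.pdf", nil, 0o644); err != nil {
		return "", err
	}
	meta := `{"hash":"d41d8cd98f00b204e9800998ecf8427e","ulid":"01EXAMPLE"}`
	if err := os.WriteFile(base+sidecarSuffix, []byte(meta), 0o644); err != nil {
		return "", err
	}
	sidecars, err := findSidecars(root)
	if err != nil || len(sidecars) != 1 {
		return "", fmt.Errorf("scan failed: %v (found %d)", err, len(sidecars))
	}
	return verifySidecar(sidecars[0])
}

func main() {
	ulid, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("verified:", ulid)
}
```

Verifying the copy in backup storage, rather than the file still sitting in the archive folder, is the stronger check; the same function works if pointed at the backup destination.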