Yusef Mosiah Nathanson

Founder of Choir

Build a Local-First Extraction Ladder for Agentic Publishing

Mosiah.org · article artifact

If you are building an agentic publishing system, do not make a paid web extraction API your first move. Start with the cheapest, most inspectable method that can faithfully capture the thing the user actually read. Escalate only when the page proves it needs a heavier tool.

That is the point of a **local-first extraction ladder**: a sequence of increasingly capable retrieval methods, each with explicit provenance, cost, and failure modes.

The immediate use case is Mosiah.org and Choir-style publishing. When Yusef reads an article, watches a video, pastes a thread, or uploads a document, the system should preserve a durable source artifact: raw input where possible, normalized Markdown, metadata, citations, warnings, and a clear record of how the source was acquired. The final article can then cite clean source tiddlers while the local archive keeps operational details.

Why a ladder?

Web extraction fails in different ways. Responding to every failure with “use the expensive extractor” is wasteful and makes the publishing system harder to trust.

A normal public page may only need a plain HTTP fetch plus readability extraction. A JavaScript-heavy documentation page may need a browser. A hostile Medium or DataDome page may be better handled by a user-supplied text copy. Twitter/X may need a specialized path. Each class of source deserves a different rung.

The ladder also protects the editorial boundary. A source artifact should say whether it came from:

  • a user-uploaded/manual copy,
  • direct HTTP,
  • search-assisted alternate discovery,
  • a self-hosted proxy/ruleset,
  • a browser runtime,
  • an anti-detection browser,
  • or a paid hosted fallback.

That provenance matters later when an article is edited, challenged, republished, or turned into a citation graph.
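One way to make that provenance explicit is to give each acquisition path a stable identifier. The sketch below is illustrative only: the `Rung` enum, its member names, and the `provenance_label` helper are assumptions, not part of any existing Mosiah.org or Choir code.

```python
from enum import Enum


class Rung(Enum):
    """Acquisition paths, numbered from cheapest to heaviest (hypothetical names)."""
    MANUAL = 1
    DIRECT_HTTP = 2
    SEARCH_DISCOVERY = 3
    PROXY_RULESET = 4
    BROWSER = 5
    ANTI_DETECTION_BROWSER = 6
    PAID_FALLBACK = 7


def provenance_label(rung: Rung) -> str:
    """Return a stable, human-readable label to store in a source artifact."""
    return f"rung-{rung.value}-{rung.name.lower().replace('_', '-')}"
```

Storing the label rather than a free-text note keeps later queries simple, for example listing every artifact that came from the paid fallback.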

The rungs

1. Manual or local artifact

Use this when the user uploads text, a PDF, an EPUB, screenshots, copied excerpts, or notes from another app.

This is often the best source, not a fallback. It captures what the user actually read and avoids wasting time on hostile pages. For example, if a Medium article blocks automation but the user can paste or upload the text, ingest the local text instead of burning paid extraction credits proving Medium is hostile.

Save the original local artifact, compute a hash, record the source URL if known, then normalize to Markdown.
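A minimal sketch of that step, assuming the caller has already read the uploaded file into bytes. The function name, the record keys, and the choice of SHA-256 are assumptions made for illustration; writing the bytes to disk and normalizing to Markdown are left to the caller.

```python
import hashlib
import os
from datetime import datetime, timezone
from typing import Optional


def describe_local_artifact(
    raw: bytes,
    original_name: str,
    source_url: Optional[str] = None,
    retrieved_at: Optional[str] = None,
) -> dict:
    """Build a provenance record for a user-supplied artifact (hypothetical schema)."""
    digest = hashlib.sha256(raw).hexdigest()
    extension = os.path.splitext(original_name)[1]
    return {
        "rung": "manual",
        "original_name": original_name,
        # Content-addressed name, so re-uploading the same file is idempotent.
        "stored_name": f"{digest[:16]}{extension}",
        "source_url": source_url,  # may be unknown for pasted text
        "sha256": digest,
        "retrieved_at": retrieved_at or datetime.now(timezone.utc).isoformat(),
    }
```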

2. Direct HTTP plus readability extraction

For ordinary public pages, start simple:

fetch URL → save raw HTML → extract readable content → save Markdown + metadata

This rung should catch static blogs, documentation pages, press releases, many news posts, and personal websites. It is cheap, debuggable, and easy to reproduce.
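The "extract readable content" step can be sketched with the standard library alone. This is a crude stand-in for a real readability library, not a replacement for one: it keeps headings, paragraphs, list items, and block quotes, and drops scripts and page chrome. The class and function names and the tag lists are assumptions.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextCollector(HTMLParser):
    BLOCK_TAGS = {"p", "h1", "h2", "h3", "li", "blockquote"}
    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "blockquote": "> "}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: List[str] = []
        self.title = ""
        self._skip_depth = 0
        self._current: Optional[List[str]] = None
        self._current_tag = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag in self.BLOCK_TAGS and self._skip_depth == 0:
            self._current, self._current_tag = [], tag

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == self._current_tag and self._current is not None:
            # Collapse whitespace and emit the block with a Markdown prefix.
            text = " ".join("".join(self._current).split())
            if text:
                self.blocks.append(self.PREFIX.get(tag, "") + text)
            self._current, self._current_tag = None, ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._current is not None and self._skip_depth == 0:
            self._current.append(data)


def html_to_markdown(html: str) -> dict:
    """Return the page title and a rough Markdown rendering of its text blocks."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "markdown": "\n\n".join(parser.blocks)}
```

Nested block elements and inline links are not handled here. The point is only that this rung can start with something small and inspectable, with a dedicated extractor swapped in later.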

Do not trust 200 OK by itself. Bot walls often return HTML with titles like “Just a moment...” or tiny script shells. Classify the acquired HTML before treating it as source content.
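A classification pass might look like the sketch below. The marker strings other than "Just a moment", the 500-character threshold, and the label names are assumptions chosen for illustration and should be tuned against a real corpus.

```python
import re

# Illustrative challenge-page title fragments; only the first comes from observed pages here.
CHALLENGE_TITLES = ("just a moment", "attention required", "access denied")
MIN_TEXT_CHARS = 500  # assumed threshold for "too little visible text"


def classify_html(status: int, html: str) -> str:
    """Label an HTTP response as 'blocked', 'bot_wall', 'tiny_shell', or 'content'."""
    if status in (401, 403, 429):
        return "blocked"
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip().lower() if match else ""
    if any(marker in title for marker in CHALLENGE_TITLES):
        return "bot_wall"
    # Remove script/style bodies, then all remaining tags, to estimate visible text.
    no_scripts = re.sub(
        r"<(script|style)\b.*?</\1\s*>", " ", html, flags=re.IGNORECASE | re.DOTALL
    )
    visible = " ".join(re.sub(r"<[^>]+>", " ", no_scripts).split())
    if len(visible) < MIN_TEXT_CHARS:
        return "tiny_shell"
    return "content"
```

Anything other than `content` becomes a warning on the artifact and a reason to consider the next rung.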

3. Search-assisted discovery

If the original URL is blocked, stale, duplicated, or ambiguous, use SearXNG to find alternate copies, canonical URLs, archives, syndicated versions, or corroborating pages.^searxng

SearXNG is not the extractor. It is the discovery layer. It helps answer: “Is there a better URL to extract?”
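As a sketch of the discovery step, the helpers below build a query URL for a local SearXNG instance and pull candidate URLs out of a JSON response. They assume the instance has the JSON output format enabled in its settings, that it listens on a loopback address of your choosing, and that the response carries a `results` list whose items have a `url` field. The function names are hypothetical, and no network call is made here.

```python
from typing import List
from urllib.parse import urlencode


def searxng_search_url(base_url: str, query: str) -> str:
    """Build a JSON search URL for a SearXNG instance (base_url is an assumption)."""
    params = {"q": query, "format": "json"}
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"


def candidate_urls(response: dict, blocked_url: str) -> List[str]:
    """Return unique result URLs, in order, excluding the URL that already failed."""
    seen = set()
    candidates = []
    for item in response.get("results", []):
        url = item.get("url")
        if url and url != blocked_url and url not in seen:
            seen.add(url)
            candidates.append(url)
    return candidates
```

Each candidate then goes back to the direct HTTP rung; the search layer never produces source content on its own.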

4. Ladder proxy/ruleset experiments

Use Ladder when the problem looks like headers, paywall/content-delivery behavior, or a domain-specific rule that can be studied with a self-hosted proxy.^ladder

Ladder is not magic anti-bot technology. Treat it as a research and ruleset rung, useful for reproducible experiments and certain paywall/content-delivery cases.^ladder-rules

5. Lightweight browser rendering

Some pages require JavaScript. For these, try a browser runtime before jumping to paid extraction.

Obscura is interesting here because it advertises a Rust engine, V8 execution, Chrome DevTools Protocol compatibility, and lower memory use than full Chrome.^obscura Those claims should be tested empirically rather than taken on trust.

Playwright remains the reliable baseline for local browser rendering.

6. Anti-detection browser

If a page specifically targets automation, try a heavier browser rung such as Camoufox.^camoufox

This should not be the default. It is heavier, more operationally complex, and more ethically sensitive. Use it when the source is public, valuable, and ordinary browser rendering fails.

7. Paid hosted fallback

Use Firecrawl or a similar hosted extractor only when local methods have failed and the source is worth the cost, or when the target is a platform where a paid extractor is currently the practical path, such as Twitter/X.

When this rung is used, record it. The source artifact should say that a paid fallback produced the content.
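The escalation across the rungs above can be sketched as a loop that tries each extractor in order, stops at the first usable result, and logs every attempt. The `Attempt` record, the `climb` function, and the convention that an extractor returns Markdown or `None` are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# An extractor takes a URL and returns Markdown, or None if it got nothing usable.
Extractor = Callable[[str], Optional[str]]


@dataclass
class Attempt:
    rung: str
    ok: bool
    warning: Optional[str] = None


def climb(
    url: str, rungs: List[Tuple[str, Extractor]]
) -> Tuple[Optional[str], List[Attempt]]:
    """Try rungs cheapest-first; return the content and the full attempt log."""
    attempts: List[Attempt] = []
    for name, extract in rungs:
        try:
            content = extract(url)
        except Exception as exc:  # a failed rung is data, not a crash
            attempts.append(Attempt(name, False, f"{type(exc).__name__}: {exc}"))
            continue
        if content:
            # Flag paid acquisition explicitly so the artifact records it.
            warning = "paid fallback" if name == "paid_fallback" else None
            attempts.append(Attempt(name, True, warning))
            return content, attempts
        attempts.append(Attempt(name, False, "no usable content"))
    return None, attempts
```

If every rung fails, the empty result plus the attempt log is the cue to ask the user for a manual artifact.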

Container feasibility on this machine

This VM is Debian 13 on ARM64 under Apple Virtualization. That is good enough for local extraction services. Containers do not require nested virtualization here; Docker or Podman can use the Linux kernel inside the guest.

The live checks were encouraging:

arch: aarch64
kernel: 6.12.86+deb13-arm64
cgroup: cgroup v2 mounted rw with nsdelegate
unprivileged user namespaces: enabled
subuid/subgid for choir: present
container runtimes: docker/podman/docker-compose not installed yet
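Checks like these can be repeated with a short script. The sketch below is an assumption about how to gather them, not the commands that produced the output above: it treats the presence of `/sys/fs/cgroup/cgroup.controllers` as the sign of a cgroup v2 mount, reads `/proc/sys/user/max_user_namespaces` as a rough signal for user namespaces, and looks for the user's entry in `/etc/subuid`.

```python
import platform
import shutil
from pathlib import Path


def _read(path: str) -> str:
    """Read a small system file, returning '' if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return ""


def container_preflight(user: str) -> dict:
    """Collect basic facts relevant to running rootless containers on this host."""
    subuid_lines = _read("/etc/subuid").splitlines()
    return {
        "arch": platform.machine(),
        "kernel": platform.release(),
        "cgroup_v2": Path("/sys/fs/cgroup/cgroup.controllers").exists(),
        "max_user_namespaces": _read("/proc/sys/user/max_user_namespaces"),
        "subuid_present": any(line.split(":")[0] == user for line in subuid_lines),
        # shutil.which returns None for runtimes that are not installed.
        "runtimes": {
            name: shutil.which(name) for name in ("docker", "podman", "docker-compose")
        },
    }
```

Running `container_preflight("choir")` before and after installing a runtime gives a simple record of what changed.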

The practical plan is:

  • start with Podman if rootless containers behave well;
  • fall back to Docker if Compose-based docs create friction;
  • bind SearXNG and Ladder to 127.0.0.1, not the public internet;
  • check ARM64 image support before assuming browser tools will run locally;
  • test one trivial container before installing the whole stack.

SearXNG and Ladder both have straightforward container stories. Ladder publishes amd64 and arm64 Docker images, which matters for this VM. Obscura and Camoufox need separate ARM64 checks.

What each source artifact should record

Every extraction attempt should leave behind enough information to debug and cite it later:

  • original URL or local path,
  • retrieval timestamp,
  • rung used,
  • command or service used,
  • HTTP status and final URL,
  • parser/extractor name,
  • content hash,
  • warnings such as bot wall, login wall, tiny shell, partial extraction, or paid fallback,
  • raw input when legally and technically appropriate,
  • normalized Markdown,
  • short summary.
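The fields above map naturally onto a small record type. The following is a sketch under assumptions: the `SourceArtifact` class, its field names, and the JSON serialization are hypothetical, and the hash is taken over the normalized Markdown rather than the raw input.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


def content_hash(markdown: str) -> str:
    """SHA-256 of the normalized Markdown, used to detect changed or duplicate content."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


@dataclass
class SourceArtifact:
    source: str                 # original URL or local path
    retrieved_at: str           # ISO 8601 timestamp
    rung: str                   # which rung produced the content
    tool: str                   # command or service used
    extractor: str              # parser/extractor name
    markdown: str               # normalized Markdown
    summary: str
    http_status: Optional[int] = None
    final_url: Optional[str] = None
    raw_path: Optional[str] = None   # where the raw input was saved, if kept
    warnings: List[str] = field(default_factory=list)
    content_sha256: str = ""

    def __post_init__(self) -> None:
        # Fill in the hash when the caller did not supply one.
        if not self.content_sha256:
            self.content_sha256 = content_hash(self.markdown)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Keeping `warnings` as a list lets a single artifact carry several flags at once, such as partial extraction and paid fallback.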

The published Mosiah article does not need all of this operational noise inline. It should link to clean source tiddlers. The local artifact keeps the retrieval history.

Starter test corpus

Do not test the ladder on one URL. Use a small corpus of page classes:

  • easy static public page, such as Pluralistic;
  • manual text copy of a hostile source;
  • Medium as a bot-wall behavior probe;
  • mainstream soft/paywalled pages such as NYTimes, Washington Post, Bloomberg, or Economist;
  • bot/JS-friction news pages such as Reuters;
  • Cloudflare/DataDome style anti-bot pages;
  • JavaScript-heavy app or documentation pages;
  • Twitter/X public post/thread URLs.

Initial cheap probes already showed useful separation:

pluralistic.net: 200, normal HTML
Doctorow Medium URL: 403, Cloudflare
NYTimes Cloudflare AI article: 403, DataDome
Reuters AI page: 401, CloudFront
X/AOC post: 200 HTML shell, Cloudflare/envoy

That is enough to justify the ladder. The goal is not to defeat every site. The goal is to know which rung acquired the content, when to stop, when to ask for a manual artifact, and when a paid fallback is actually worth it.

Minimal implementation checklist

1. Implement direct HTTP + readability extraction and HTML classification.
2. Save raw, normalized, metadata, summary, and warnings for every artifact.
3. Add SearXNG only for alternate/canonical/archive discovery.
4. Add Ladder as a reproducible proxy/ruleset experiment rung.
5. Add Playwright or Obscura for JavaScript-rendered pages.
6. Add Camoufox only for measured bot-detection cases.
7. Keep Firecrawl as the logged paid fallback.
8. Expose the result to Mosiah.org as source tiddlers, not as operational clutter in the article body.

The invariant is simple: **local, cheap, inspectable first; heavy, hosted, or paid last.**