# A Local-First Extraction Ladder for Agentic Publishing

Canonical: https://mosiah.org/articles/extraction-ladder-research/
Interactive: https://mosiah.org/#Articles%2Fextraction-ladder-research

Related: [[sources|Article Sources/extraction-ladder-research]] · [[notes|Article Notes/extraction-ladder-research]] · [[metadata|Article Metadata/extraction-ladder-research]] · [[Published Pieces|Published Pieces]]

! A Local-First Extraction Ladder for Agentic Publishing

Mosiah.org should not depend on expensive hosted extraction as the default path. The publishing system needs an extraction ladder: cheap, local, inspectable methods first; heavier browser/proxy systems only when the target actually requires them; and Firecrawl last, reserved for cases where it is clearly the right paid tool, such as Twitter/X scraping.

The immediate trigger is Medium. A Cory Doctorow article behind Medium’s bot wall is exactly the kind of page that can burn paid extraction credits without producing a better artifact. The right move is what Yusef suggested: if the user can upload a text copy, ingest the text copy. Do not spend ten Firecrawl credits proving Medium is hostile to automation.

!! Proposed ladder

1. **Manual/local artifact** when the user uploads text, PDF, EPUB, screenshots, or copied source material. This is the highest-fidelity path because it reflects what the user actually read.
2. **Simple HTTP fetch + readability extraction** for ordinary public pages. This should be the default: requests/curl, trafilatura/readability, metadata capture, canonical URL, and raw HTML saved locally.
3. **Search-assisted discovery with SearXNG** when the URL is missing, stale, blocked, duplicated, or when we need alternate copies. SearXNG is not an extractor by itself; it is the private metasearch layer that helps find accessible versions, canonical URLs, archives, and corroborating sources.[[^searxng|Sources/extraction-ladder-research/03-searxng-docker-installation-docs]]
4. **Ladder** for sites where the issue is headers, crawler presentation, CORS/CSP, or paywall/content-delivery behavior that can be studied with a self-hosted proxy and rulesets.[[^ladder|Sources/extraction-ladder-research/01-everywall-ladder-self-hosted-http-proxy-for-paywall-content-deli]] Ladder is not a magic anti-bot tool; it is a configurable proxy and research tool. That makes it useful, but it should be treated as a middle rung, not a universal bypass machine.[[^ladder-rules|Sources/extraction-ladder-research/01-everywall-ladder-self-hosted-http-proxy-for-paywall-content-deli]]
5. **Obscura** for JavaScript-rendered pages where a lightweight headless engine is enough. The project claims a Rust engine, V8 execution, CDP compatibility, lower memory than Chrome, and Playwright/Puppeteer integration.[[^obscura|Sources/extraction-ladder-research/05-obscura-rust-headless-browser-for-ai-agents-and-scraping]] It looks worth testing, but it should be evaluated empirically because young scraping tools can overclaim.
6. **Camoufox** for adversarial bot-detection pages where ordinary Playwright/Chromium fails. Camoufox is a Firefox fork aimed at AI-agent automation and anti-detection, with fingerprint injection and Playwright-compatible APIs.[[^camoufox|Sources/extraction-ladder-research/04-camoufox-anti-detect-firefox-fork-for-ai-agents]] It is probably the heavy local browser rung.
7. **Firecrawl** as a paid fallback, not the default. Use it when local methods fail, when the value of the source justifies the cost, or when the target is Twitter/X and Firecrawl is currently the practical path.
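Encoded as data, the ladder can drive an agent loop directly. The names and helper below are an illustrative sketch, not an existing API; only the ordering comes from the rungs above.

```python
# Rungs in cost order, cheapest first. The agent tries a rung, and on
# failure escalates to the next one; FIRECRAWL is the only paid rung.
from enum import IntEnum
from typing import Optional

class Rung(IntEnum):
    MANUAL = 1              # user-supplied text/PDF/EPUB/screenshots
    HTTP_READABILITY = 2    # requests/curl + trafilatura/readability
    SEARXNG_DISCOVERY = 3   # find alternates/canonical/archives, not extract
    LADDER = 4              # header/ruleset proxy
    OBSCURA = 5             # lightweight headless JS rendering
    CAMOUFOX = 6            # anti-detect Firefox for adversarial pages
    FIRECRAWL = 7           # paid fallback

PAID = {Rung.FIRECRAWL}

def next_rung(current: Rung) -> Optional[Rung]:
    """Return the next (more expensive) rung, or None at the top."""
    return Rung(current + 1) if current < Rung.FIRECRAWL else None
```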

!! Container feasibility on this machine

The VM is Debian 13 on `aarch64`, running under Apple Virtualization rather than QEMU. That is good news for the container plan: from inside Linux, containers should mostly behave like normal Linux containers. We are not asking for nested hardware virtualization; Docker or Podman will use the Linux kernel in the guest. The relevant questions are therefore ordinary Linux container questions: cgroups, namespaces, overlay filesystems, network forwarding, DNS, ports, and ARM64 image availability.

Current live checks:

```text
arch: aarch64
kernel: 6.12.86+deb13-arm64
cgroup: cgroup v2 mounted rw with nsdelegate
unprivileged user namespaces: enabled
subuid/subgid for choir: present
container runtimes: docker/podman/docker-compose not installed yet
```
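The checks above can be reproduced with ordinary tools. This is Linux-only; `choir` is the local user, written here as `$USER`, and the userns sysctl path is Debian-specific and may be absent on other kernels.

```shell
# Re-run the container-readiness checks reported above.
uname -m && uname -r                            # architecture and kernel
stat -fc %T /sys/fs/cgroup 2>/dev/null          # "cgroup2fs" => cgroup v2
grep -h . /proc/sys/kernel/unprivileged_userns_clone 2>/dev/null \
  || echo "userns sysctl absent (kernel default applies)"
grep "$USER" /etc/subuid /etc/subgid 2>/dev/null \
  || echo "no subuid/subgid ranges for $USER"
command -v docker podman docker-compose \
  || echo "no container runtime installed yet"
```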

That points toward containers being feasible. The main issues to navigate are:

- **Runtime choice.** Docker is familiar and has the broadest docs. Podman is attractive for rootless containers, daemonless operation, and a more Linux-native feel. Other things being equal, start with Podman, but keep Docker as the fallback if SearXNG/Ladder docs or compose files create friction.
- **Rootless vs rootful.** Rootless Podman is safer and likely enough for localhost extraction services. If networking or compose behavior becomes annoying, rootful Podman or Docker can be a pragmatic second rung.
- **Compose compatibility.** SearXNG’s official docs assume Compose. Podman has `podman compose`/Docker Compose compatibility, but this is a common friction point. The first win should be a single trivial container, not SearXNG.
- **Networking.** Bind services to `127.0.0.1` first. Do not expose Ladder or SearXNG publicly. Apple Virtualization may add host/guest networking details later, but agent access from inside the VM only needs localhost.
- **ARM64 images.** Ladder advertises `arm64` Docker support. SearXNG has container images and should be fine. Obscura/Camoufox need empirical checks for Linux ARM64 support; if they are x86-only, running them inside this VM may require emulation or a different host strategy.
- **Storage drivers.** Rootless Podman may need `fuse-overlayfs` depending on Debian’s defaults. This is fixable, but it is exactly why the plan should accumulate small wins.
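The runtime and storage concerns above can be probed cheaply before installing anything heavy. These are Debian package and command names; adjust elsewhere.

```shell
# Pre-install probes for runtime choice and rootless storage.
command -v podman || echo "podman not installed yet"
command -v fuse-overlayfs \
  || echo "fuse-overlayfs not installed (rootless Podman may need it)"
podman info --format '{{.Store.GraphDriverName}}' 2>/dev/null \
  || echo "cannot query storage driver until podman is installed"
```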

SearXNG and Ladder both have straightforward container stories. SearXNG’s docs recommend Docker/Compose deployment and provide templates. Ladder publishes Docker images for `amd64` and `arm64`, which matters because this VM is ARM. Obscura may be simpler as a binary if Linux ARM builds exist; otherwise it may need a source build or run on the Mac host. Camoufox is heavier and should be treated as a later browser-runtime integration.

!! Recommended implementation plan

Build this as a local extraction service with explicit provenance and cost boundaries:

- `extract(url)` starts with local fetch/readability and saves raw HTML, normalized Markdown, metadata, and warnings.
- If local fetch fails, it asks SearXNG for alternate/canonical/archive candidates.
- If the target looks like a Ladder-supported domain or a header/ruleset problem, try Ladder.
- If JavaScript rendering is required, try Obscura or Playwright.
- If bot detection blocks ordinary automation, try Camoufox.
- Only then use Firecrawl, and record that paid fallback was used.
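The dispatch above can be sketched as a loop over rung functions. The stubs here stand in for the real fetchers (local readability, SearXNG lookup, Ladder, Obscura/Playwright, Camoufox, Firecrawl); all names and the result shape are illustrative.

```python
# extract(url): try each rung in cost order, recording warnings and
# whether the paid fallback was used.
from typing import Callable, Optional

Result = dict  # e.g. {"markdown": ..., "rung": ..., "paid": ..., "warnings": ...}

def extract(url: str,
            rungs: list[tuple[str, Callable[[str], Optional[Result]]]]) -> Result:
    warnings: list[str] = []
    for name, attempt in rungs:
        try:
            result = attempt(url)
        except Exception as exc:     # record the failure, escalate to next rung
            warnings.append(f"{name}: {exc}")
            continue
        if result is not None:
            result["rung"] = name
            result["paid"] = name == "firecrawl"
            result["warnings"] = warnings
            return result
        warnings.append(f"{name}: no content")
    return {"rung": None, "paid": False, "warnings": warnings}
```

A caller would pass `[("http", fetch_readability), ("searxng", find_alternates), ...]` in ladder order, so the paid rung is structurally last.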

Every artifact should record the rung used, commands/services involved, source URL, retrieval timestamp, content hash, and whether extraction was complete, partial, manually supplied, or paid-fallback. This is the important publishing boundary: Mosiah.org gets packaged source tiddlers and clean citations; the local system keeps the operational provenance.
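A provenance record shaped after the fields listed above could look like this; the field names and the `record` helper are illustrative, not an existing schema.

```python
# One artifact-level provenance record per extraction.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import hashlib

@dataclass
class Provenance:
    source_url: str
    rung: str                     # "manual" | "http" | "searxng" | ... | "firecrawl"
    status: str                   # "complete" | "partial" | "manual" | "paid-fallback"
    commands: list[str] = field(default_factory=list)  # commands/services involved
    retrieved_at: str = ""        # UTC timestamp
    content_sha256: str = ""      # hash of the raw artifact

def record(url: str, rung: str, status: str,
           content: bytes, commands: list[str]) -> dict:
    return asdict(Provenance(
        source_url=url,
        rung=rung,
        status=status,
        commands=commands,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        content_sha256=hashlib.sha256(content).hexdigest(),
    ))
```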

!! Test corpus

We need a small corpus of content classes, not just one URL. The first probe set should include:

- an easy static public page such as Pluralistic,
- a manually supplied text copy for a hostile source such as the Doctorow/Medium article,
- Medium itself as a bot-wall behavior probe, not a paid extraction target,
- soft/paywalled mainstream pages such as NYTimes/Washington Post/Bloomberg/Economist,
- bot/JS-friction news pages such as Reuters or similar,
- a Cloudflare/DataDome style anti-bot page,
- a JavaScript-heavy app/documentation page,
- Twitter/X public post/thread URLs, where Firecrawl is allowed as a fallback.

Initial cheap probes, using only local HTTP and no paid extraction, already show useful class separation:

```text
pluralistic.net: 200, normal HTML
Doctorow Medium URL: 403, Cloudflare
NYTimes Cloudflare AI article: 403, DataDome
Reuters AI page: 401, CloudFront
X/AOC post: 200 HTML shell, Cloudflare/envoy
```
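The probe results above already suggest a crude classifier: status code plus server header is often enough to pick the next rung to try. This is a heuristic sketch, not a robust detector.

```python
# Map a cheap HTTP probe to the rung worth trying next.
def classify(status: int, server: str, body_has_content: bool = True) -> str:
    server = server.lower()
    if status == 200 and body_has_content:
        return "http"            # plain fetch worked (pluralistic.net case)
    if status == 200:
        return "javascript"      # HTML shell only (X/AOC case): Obscura/Playwright
    if status in (401, 403) and ("cloudflare" in server or "datadome" in server):
        return "adversarial"     # Medium/NYTimes case: Camoufox or paid fallback
    if status in (401, 403):
        return "proxy"           # Reuters/CloudFront case: try Ladder rulesets first
    return "search"              # otherwise ask SearXNG for alternate copies
```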

A working corpus file lives with the local report as `test_corpus.md`. The corpus should evolve as experiments reveal better examples.

!! Experiment ladder: accumulate wins

The build should progress from simplest to most complex, with each rung leaving behind a working capability and a published note.

1. **Runtime smoke test.** Install Podman first if possible. Run `hello-world` or `alpine uname -a`. Verify rootless mode, DNS, outbound HTTPS, volume mounts, and localhost port binding. If this is annoying, try rootful Podman before abandoning Podman for Docker. **Status: passed.** Podman 5.4.2 installed from Debian 13 ARM64 packages. Rootless Podman reports cgroup v2, systemd cgroup manager, netavark/aardvark DNS, crun, pasta/slirp4netns, and working subuid/subgid mappings.
2. **Static service test.** Run a tiny HTTP container bound to `127.0.0.1`, fetch it from the agent, and verify restart/cleanup commands. This proves service lifecycle before adding app complexity. **Status: passed.** `nginx:alpine` bound to `127.0.0.1:18080`, returned `HTTP/1.1 200 OK`, then was removed cleanly. Alpine also verified DNS/outbound HTTPS and host/container volume mounts.
3. **SearXNG.** Run SearXNG locally. Query it from Python. Save JSON/HTML search results as an artifact. Confirm it helps find alternate copies and canonical pages, not just search the web for the user. **Status: passed with a note.** SearXNG runs in Podman at `127.0.0.1:18081`. HTML search works. JSON search returned 403 under the default bot-detection/API settings, so the first integration should parse HTML or tune formats/limiter config. Useful result: searching for the Doctorow/Medium title surfaced the canonical Medium URL, a Mamot/Mastodon copy, a Chinwag discussion, and the direct Pluralistic URL `https://pluralistic.net/2025/02/18/pikettys-productivity/`.
4. **Ladder.** Run Ladder with basic auth and localhost-only binding. Test an easy article and one known annoying article. Record when Ladder helps and when it does not. **Status: passed as a proxy, limited as an anti-bot tool.** Ladder runs in Podman at `127.0.0.1:18082` with the public ruleset loaded: 16 rules for 41 domains. It successfully proxied Pluralistic. For Medium, NYTimes/DataDome, and Reuters/CloudFront, it returned the bot-wall/interstitial HTML rather than article text. That is still a useful result: Ladder belongs in the header/ruleset/proxy rung, not the adversarial-browser rung.
5. **Extractor baseline.** Build a plain local extractor using HTTP fetch + readability/trafilatura-style parsing. Make this the default path before any browser runtime.
6. **JavaScript rendering.** Test Obscura or ordinary Playwright against a JS-heavy page. Compare output quality and resource use.
7. **Adversarial browser.** Test Camoufox only after the normal browser path fails. Treat it as a heavy rung, not the default.
8. **Paid fallback.** Use Firecrawl only when the previous rungs fail and the source is worth the cost, or for Twitter/X where it is explicitly allowed.
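Step 3 noted that JSON search returned 403 under the default limiter settings, so a first SearXNG integration can parse the HTML results page instead. A minimal stdlib sketch; SearXNG's actual markup varies by version and theme, so treating "external links on the results page" as candidates is deliberately loose.

```python
# Collect absolute result links from a SearXNG HTML results page.
from html.parser import HTMLParser

class ResultLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("http"):   # skip internal/relative links
                self.links.append(href)

def result_urls(html: str) -> list[str]:
    parser = ResultLinks()
    parser.feed(html)
    return parser.links
```

In use, the HTML would come from `urllib.request` against `http://127.0.0.1:18081/search?q=...`, and the candidate URLs would feed back into the ladder as alternate copies.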

The goal is not to build a pirate paywall machine. The goal is a cost-aware, local-first, artifact-native extraction system for agentic publishing.
