Articles/extraction-ladder-research

9th May 2026 at 5:08am


A Local-First Extraction Ladder for Agentic Publishing

Mosiah.org should not depend on expensive hosted extraction as the default path. The publishing system needs an extraction ladder: cheap, local, inspectable methods first; heavier browser/proxy systems only when the target actually requires them; Firecrawl last, except for cases where it is clearly the right paid tool, such as Twitter/X scraping.

The immediate trigger is Medium. A Cory Doctorow article behind Medium’s bot wall is exactly the kind of page that can burn paid extraction credits without producing a better artifact. The right move is what Yusef suggested: if the user can upload a text copy, ingest the text copy. Do not spend ten Firecrawl credits proving Medium is hostile to automation.

Proposed ladder

1. **Manual/local artifact** when the user uploads text, PDF, EPUB, screenshots, or copied source material. This is the highest-fidelity path because it reflects what the user actually read.
2. **Simple HTTP fetch + readability extraction** for ordinary public pages. This should be the default: requests/curl, trafilatura/readability, metadata capture, canonical URL, and raw HTML saved locally.
3. **Search-assisted discovery with SearXNG** when the URL is missing, stale, blocked, duplicated, or when we need alternate copies. SearXNG is not an extractor by itself; it is the private metasearch layer that helps find accessible versions, canonical URLs, archives, and corroborating sources.^searxng
4. **Ladder** for sites where the issue is headers, crawler presentation, CORS/CSP, or paywall/content-delivery behavior that can be studied with a self-hosted proxy and rulesets.^ladder Ladder is not magic anti-bot; it is a configurable proxy and research tool. That makes it useful, but it should be treated as a middle rung, not a universal bypass machine.^ladder-rules
5. **Obscura** for JavaScript-rendered pages where a lightweight headless engine is enough. The project claims a Rust engine, V8 execution, CDP compatibility, lower memory than Chrome, and Playwright/Puppeteer integration.^obscura It looks worth testing, but it should be evaluated empirically because young scraping tools can overclaim.
6. **Camoufox** for adversarial bot-detection pages where ordinary Playwright/Chromium fails. Camoufox is a Firefox fork aimed at AI-agent automation and anti-detection, with fingerprint injection and Playwright-compatible APIs.^camoufox It is probably the heavy local browser rung.
7. **Firecrawl** as a paid fallback, not the default. Use it when local methods fail, when the value of the source justifies the cost, or when the target is Twitter/X and Firecrawl is currently the practical path.
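The cost boundary around the last rung can be written down as a small policy check. This is a minimal sketch, not the project's implementation: `Rung`, `PAID_OK_HOSTS`, `paid_fallback_allowed`, and `local_rungs` are hypothetical names, and the host list is an assumption drawn from the Twitter/X exception above.

```python
from enum import IntEnum
from urllib.parse import urlsplit


class Rung(IntEnum):
    """Ladder rungs, numbered as in the list above (cheapest first)."""
    MANUAL_ARTIFACT = 1
    HTTP_READABILITY = 2
    SEARXNG_DISCOVERY = 3
    LADDER_PROXY = 4
    OBSCURA = 5
    CAMOUFOX = 6
    FIRECRAWL = 7


# Assumption: hosts where the paid rung is explicitly allowed.
PAID_OK_HOSTS = ("twitter.com", "x.com")


def paid_fallback_allowed(url: str, worth_the_cost: bool = False) -> bool:
    """True if Firecrawl may be used for this URL once local rungs have failed."""
    host = (urlsplit(url).hostname or "").lower()
    on_allowed_host = any(host == h or host.endswith("." + h) for h in PAID_OK_HOSTS)
    return worth_the_cost or on_allowed_host


def local_rungs() -> list:
    """Every rung that costs no credits, in the order it should be tried."""
    return [r for r in Rung if r < Rung.FIRECRAWL]
```

Under these assumptions, `paid_fallback_allowed("https://x.com/user/status/1")` is true, while a Medium URL only qualifies when the caller passes `worth_the_cost=True`.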

Container feasibility on this machine

The VM is Debian 13 on aarch64, running under Apple Virtualization rather than QEMU. That is good news for the container plan: from inside Linux, containers should mostly behave like normal Linux containers. We are not asking for nested hardware virtualization; Docker or Podman will use the Linux kernel in the guest. The relevant questions are therefore ordinary Linux container questions: cgroups, namespaces, overlay filesystems, network forwarding, DNS, ports, and ARM64 image availability.

Current live checks:

arch: aarch64
kernel: 6.12.86+deb13-arm64
cgroup: cgroup v2 mounted rw with nsdelegate
unprivileged user namespaces: enabled
subuid/subgid for choir: present
container runtimes: docker/podman/docker-compose not installed yet

That points toward containers being feasible. The main issues to navigate are:

- **Runtime choice.** Docker is familiar and has the broadest docs. Podman is attractive for rootless containers, daemonless operation, and a more Linux-native feel. Other things equal, start with Podman, but keep Docker as the fallback if SearXNG/Ladder docs or compose files create friction.
- **Rootless vs rootful.** Rootless Podman is safer and likely enough for localhost extraction services. If networking or compose behavior becomes annoying, rootful Podman or Docker can be a pragmatic second rung.
- **Compose compatibility.** SearXNG’s official docs assume Compose. Podman has podman compose/Docker Compose compatibility, but this is a common friction point. The first win should be a single trivial container, not SearXNG.
- **Networking.** Bind services to 127.0.0.1 first. Do not expose Ladder or SearXNG publicly. Apple Virtualization may add host/guest networking details later, but agent access from inside the VM only needs localhost.
- **ARM64 images.** Ladder advertises arm64 Docker support. SearXNG has container images and should be fine. Obscura/Camoufox need empirical checks for Linux ARM64 support; if they are x86-only, running them inside this VM may require emulation or a different host strategy.
- **Storage drivers.** Rootless Podman may need fuse-overlayfs depending on Debian’s defaults. This is fixable, but it is exactly why the plan should accumulate small wins.

SearXNG and Ladder both have straightforward container stories. SearXNG’s docs recommend Docker/Compose deployment and provide templates. Ladder publishes Docker images for amd64 and arm64, which matters because this VM is ARM. Obscura may be simpler as a binary if Linux ARM builds exist; otherwise it may need a source build or run on the Mac host. Camoufox is heavier and should be treated as a later browser-runtime integration.

Recommended implementation plan

Build this as a local extraction service with explicit provenance and cost boundaries:

- extract(url) starts with local fetch/readability and saves raw HTML, normalized Markdown, metadata, and warnings.
- If local fetch fails, it asks SearXNG for alternate/canonical/archive candidates.
- If the target looks like a Ladder-supported domain or a header/ruleset problem, try Ladder.
- If JavaScript rendering is required, try Obscura or Playwright.
- If bot detection blocks ordinary automation, try Camoufox.
- Only then use Firecrawl, and record that paid fallback was used.
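The fall-through behavior described above can be sketched as a loop over ordered handlers. This is an illustration under stated assumptions, not the service's actual code: each rung is represented by a hypothetical callable that returns extracted text or nothing, and the rung names (`"local_fetch"`, `"firecrawl"`, and so on) are placeholders.

```python
from typing import Callable, List, Optional, Tuple

# A handler takes a URL and returns extracted text, or None/"" on failure.
Handler = Callable[[str], Optional[str]]


def extract(url: str, rungs: List[Tuple[str, Handler]]) -> dict:
    """Try each (name, handler) in order and stop at the first that yields text."""
    warnings = []
    for name, handler in rungs:
        try:
            text = handler(url)
        except Exception as exc:  # a failing rung must not abort the ladder
            warnings.append(f"{name}: {exc}")
            continue
        if text:
            return {
                "url": url,
                "rung": name,
                "text": text,
                # Assumption: the paid rung is registered under this name.
                "paid_fallback": name == "firecrawl",
                "warnings": warnings,
            }
        warnings.append(f"{name}: no content")
    return {"url": url, "rung": None, "text": None,
            "paid_fallback": False, "warnings": warnings}
```

Because the warnings from earlier rungs travel with the result, a caller can see why a cheaper rung was skipped without re-running it.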

Every artifact should record the rung used, commands/services involved, source URL, retrieval timestamp, content hash, and whether extraction was complete, partial, manually supplied, or paid-fallback. This is the important publishing boundary: Mosiah.org gets packaged source tiddlers and clean citations; the local system keeps the operational provenance.
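One way to make that record concrete is a small dataclass serialized next to each artifact. The field names and the four status strings below are assumptions chosen to mirror the sentence above; `Provenance` and `make_provenance` are hypothetical names, and SHA-256 is one reasonable choice of content hash rather than a stated requirement.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Assumption: status labels mirroring "complete, partial, manually supplied, paid-fallback".
STATUSES = ("complete", "partial", "manually_supplied", "paid_fallback")


@dataclass
class Provenance:
    source_url: str
    rung: str
    status: str
    content_sha256: str
    retrieved_at: str
    services: list = field(default_factory=list)


def make_provenance(source_url: str, rung: str, status: str,
                    content: bytes, services=()) -> Provenance:
    """Build a provenance record for one extracted artifact."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return Provenance(
        source_url=source_url,
        rung=rung,
        status=status,
        content_sha256=hashlib.sha256(content).hexdigest(),
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        services=list(services),
    )


def to_json(record: Provenance) -> str:
    """Serialize the record for storage beside the artifact."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Keeping this file on the local side and publishing only the source tiddler and citation preserves the boundary described above.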

Test corpus

We need a small corpus of content classes, not just one URL. The first probe set should include:

- an easy static public page such as Pluralistic,
- a manually supplied text copy for a hostile source such as the Doctorow/Medium article,
- Medium itself as a bot-wall behavior probe, not a paid extraction target,
- soft/paywalled mainstream pages such as NYTimes/Washington Post/Bloomberg/Economist,
- bot/JS-friction news pages such as Reuters or similar,
- a Cloudflare/DataDome style anti-bot page,
- a JavaScript-heavy app/documentation page,
- Twitter/X public post/thread URLs, where Firecrawl is allowed as a fallback.

Initial cheap probes, using only local HTTP and no paid extraction, already show useful class separation:

pluralistic.net: 200, normal HTML
Doctorow Medium URL: 403, Cloudflare
NYTimes Cloudflare AI article: 403, DataDome
Reuters AI page: 401, CloudFront
X/AOC post: 200 HTML shell, Cloudflare/envoy
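The class separation in these probes can be approximated from the status code and a few response headers. The sketch below is a heuristic, not a detector: `classify_probe` and `MIN_TEXT_CHARS` are hypothetical, and the header fingerprints (`Server: cloudflare`, `CF-RAY`, `X-DataDome`, `Via`/`X-Cache` mentioning cloudfront) are assumptions about what these vendors typically send.

```python
# Assumption: below this many characters of visible text, a 200 is just a shell.
MIN_TEXT_CHARS = 500


def classify_probe(status: int, headers: dict, text_chars: int = 0) -> str:
    """Label a cheap HTTP probe as ok, js-shell, blocked:<vendor>, or other:<status>."""
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    server = h.get("server", "")

    # Check DataDome first: a site can sit behind a CDN and DataDome at once.
    if "x-datadome" in h or "datadome" in server:
        vendor = "datadome"
    elif "cloudflare" in server or "cf-ray" in h:
        vendor = "cloudflare"
    elif "cloudfront" in h.get("via", "") or "cloudfront" in h.get("x-cache", ""):
        vendor = "cloudfront"
    else:
        vendor = "unknown"

    if status in (401, 403, 429):
        return f"blocked:{vendor}"
    if status == 200 and text_chars < MIN_TEXT_CHARS:
        return "js-shell"
    if status == 200:
        return "ok"
    return f"other:{status}"
```

A label like `blocked:datadome` is a routing hint for the next rung, not proof of what the site is doing.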

A working corpus file lives with the local report as test_corpus.md. The corpus should evolve as experiments reveal better examples.

Experiment ladder: accumulate wins

The build should progress from simplest to most complex, with each rung leaving behind a working capability and a published note.

1. **Runtime smoke test.** Install Podman first if possible. Run hello-world or alpine uname -a. Verify rootless mode, DNS, outbound HTTPS, volume mounts, and localhost port binding. If this is annoying, try rootful Podman before abandoning Podman for Docker. **Status: passed.** Podman 5.4.2 installed from Debian 13 ARM64 packages. Rootless Podman reports cgroup v2, systemd cgroup manager, netavark/aardvark DNS, crun, pasta/slirp4netns, and working subuid/subgid mappings.
2. **Static service test.** Run a tiny HTTP container bound to 127.0.0.1, fetch it from the agent, and verify restart/cleanup commands. This proves service lifecycle before adding app complexity. **Status: passed.** nginx:alpine bound to 127.0.0.1:18080, returned HTTP/1.1 200 OK, then was removed cleanly. Alpine also verified DNS/outbound HTTPS and host/container volume mounts.
3. **SearXNG.** Run SearXNG locally. Query it from Python. Save JSON/HTML search results as an artifact. Confirm it helps find alternate copies and canonical pages, not just search the web for the user. **Status: passed with a note.** SearXNG runs in Podman at 127.0.0.1:18081. HTML search works. JSON search returned 403 under the default bot-detection/API settings, so the first integration should parse HTML or tune formats/limiter config. Useful result: searching for the Doctorow/Medium title surfaced the canonical Medium URL, a Mamot/Mastodon copy, a Chinwag discussion, and the direct Pluralistic URL https://pluralistic.net/2025/02/18/pikettys-productivity/.
4. **Ladder.** Run Ladder with basic auth and localhost-only binding. Test an easy article and one known annoying article. Record when Ladder helps and when it does not. **Status: passed as a proxy, limited as an anti-bot tool.** Ladder runs in Podman at 127.0.0.1:18082 with the public ruleset loaded: 16 rules for 41 domains. It successfully proxied Pluralistic. For Medium, NYTimes/DataDome, and Reuters/CloudFront, it returned the bot-wall/interstitial HTML rather than article text. That is still a useful result: Ladder belongs in the header/ruleset/proxy rung, not the adversarial-browser rung.
5. **Extractor baseline.** Build a plain local extractor using HTTP fetch + readability/trafilatura-style parsing. Make this the default path before any browser runtime.
6. **JavaScript rendering.** Test Obscura or ordinary Playwright against a JS-heavy page. Compare output quality and resource use.
7. **Adversarial browser.** Test Camoufox only after the normal browser path fails. Treat it as a heavy rung, not the default.
8. **Paid fallback.** Use Firecrawl only when the previous rungs fail and the source is worth the cost, or for Twitter/X where it is explicitly allowed.
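Since the SearXNG experiment found HTML search working while JSON returned 403, a first integration could pull candidate URLs out of the HTML results page. The sketch below uses only the standard library and deliberately avoids assuming SearXNG's result markup: it collects every absolute external link, so it will also pick up non-result links such as archive or engine links. `search_url` and `candidate_urls` are hypothetical helpers, and `SEARXNG_BASE` is the local address reported in the SearXNG status above.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urlsplit

SEARXNG_BASE = "http://127.0.0.1:18081"  # local instance from the SearXNG test


def search_url(query: str) -> str:
    """Build the HTML search URL for the local SearXNG instance."""
    return f"{SEARXNG_BASE}/search?{urlencode({'q': query})}"


class _LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def candidate_urls(html: str) -> list:
    """Return de-duplicated absolute links that point away from SearXNG itself."""
    parser = _LinkCollector()
    parser.feed(html)
    own_host = urlsplit(SEARXNG_BASE).hostname
    seen, out = set(), []
    for href in parser.hrefs:
        parts = urlsplit(href)
        is_external = parts.scheme in ("http", "https") and parts.hostname != own_host
        if is_external and href not in seen:
            seen.add(href)
            out.append(href)
    return out
```

The page itself could be fetched with `urllib.request.urlopen(search_url(title))`, though whether the instance's limiter accepts a scripted client is something to verify rather than assume. The returned list is a set of candidates to probe with the cheap local rung, not a ranking.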

The goal is not to build a pirate paywall machine. The goal is a cost-aware, local-first, artifact-native extraction system for agentic publishing.

Article Metadata/extraction-ladder-research
Article Notes/extraction-ladder-research
Article Sources/extraction-ladder-research
Extraction Bakeoff Hub
Sources/extraction-ladder-research/01-everywall-ladder-self-hosted-http-proxy-for-paywall-content-deli
Sources/extraction-ladder-research/02-everywall-ladder-rules-rulesets-for-ladder
Sources/extraction-ladder-research/03-searxng-docker-installation-docs
Sources/extraction-ladder-research/04-camoufox-anti-detect-firefox-fork-for-ai-agents
Sources/extraction-ladder-research/05-obscura-rust-headless-browser-for-ai-agents-and-scraping