{
  "title": "Articles/extraction-ladder-research",
  "caption": "A Local-First Extraction Ladder for Agentic Publishing",
  "slug": "extraction-ladder-research",
  "tags": [
    "article",
    "hermes-published",
    "published"
  ],
  "canonical_url": "https://mosiah.org/articles/extraction-ladder-research/",
  "interactive_url": "https://mosiah.org/#Articles%2Fextraction-ladder-research",
  "markdown_url": "https://mosiah.org/articles/extraction-ladder-research.md",
  "json_url": "https://mosiah.org/json/extraction-ladder-research.json",
  "fields": {
    "caption": "A Local-First Extraction Ladder for Agentic Publishing",
    "created": "20260509090827311",
    "modified": "20260509090827311",
    "tags": "article hermes-published published",
    "title": "Articles/extraction-ladder-research",
    "type": "text/vnd.tiddlywiki"
  },
  "text": "Related: [[sources|Article Sources/extraction-ladder-research]] · [[notes|Article Notes/extraction-ladder-research]] · [[metadata|Article Metadata/extraction-ladder-research]] · [[Published Pieces|Published Pieces]]\n\n! A Local-First Extraction Ladder for Agentic Publishing\n\nMosiah.org should not depend on expensive hosted extraction as the default path. The publishing system needs an extraction ladder: cheap, local, inspectable methods first; heavier browser/proxy systems only when the target actually requires them; Firecrawl last, except for cases where it is clearly the right paid tool, such as Twitter/X scraping.\n\nThe immediate trigger is Medium. A Cory Doctorow article behind Medium’s bot wall is exactly the kind of page that can burn paid extraction credits without producing a better artifact. The right move is what Yusef suggested: if the user can upload a text copy, ingest the text copy. Do not spend ten Firecrawl credits proving Medium is hostile to automation.\n\n!! Proposed ladder\n\n1. **Manual/local artifact** when the user uploads text, PDF, EPUB, screenshots, or copied source material. This is the highest-fidelity path because it reflects what the user actually read.\n2. **Simple HTTP fetch + readability extraction** for ordinary public pages. This should be the default: requests/curl, trafilatura/readability, metadata capture, canonical URL, and raw HTML saved locally.\n3. **Search-assisted discovery with SearXNG** when the URL is missing, stale, blocked, duplicated, or when we need alternate copies. SearXNG is not an extractor by itself; it is the private metasearch layer that helps find accessible versions, canonical URLs, archives, and corroborating sources.[[^searxng|Sources/extraction-ladder-research/03-searxng-docker-installation-docs]]\n4. 
**Ladder** for sites where the issue is headers, crawler presentation, CORS/CSP, or paywall/content-delivery behavior that can be studied with a self-hosted proxy and rulesets.[[^ladder|Sources/extraction-ladder-research/01-everywall-ladder-self-hosted-http-proxy-for-paywall-content-deli]] Ladder is not a magic anti-bot bypass; it is a configurable proxy and research tool. That makes it useful, but it should be treated as a middle rung, not a universal bypass machine.[[^ladder-rules|Sources/extraction-ladder-research/01-everywall-ladder-self-hosted-http-proxy-for-paywall-content-deli]]\n5. **Obscura** for JavaScript-rendered pages where a lightweight headless engine is enough. The project claims a Rust engine, V8 execution, CDP compatibility, a lower memory footprint than Chrome, and Playwright/Puppeteer integration.[[^obscura|Sources/extraction-ladder-research/05-obscura-rust-headless-browser-for-ai-agents-and-scraping]] It looks worth testing, but it should be evaluated empirically because young scraping tools can overclaim.\n6. **Camoufox** for adversarial bot-detection pages where ordinary Playwright/Chromium fails. Camoufox is a Firefox fork aimed at AI-agent automation and anti-detection, with fingerprint injection and Playwright-compatible APIs.[[^camoufox|Sources/extraction-ladder-research/04-camoufox-anti-detect-firefox-fork-for-ai-agents]] It is probably the heavy local browser rung.\n7. **Firecrawl** as a paid fallback, not the default. Use it when local methods fail, when the value of the source justifies the cost, or when the target is Twitter/X and Firecrawl is currently the practical path.\n\n!! Container feasibility on this machine\n\nThe VM is Debian 13 on `aarch64`, running under Apple Virtualization rather than QEMU. That is good news for the container plan: from inside the guest, containers should mostly behave like they do on any Linux host. We are not asking for nested hardware virtualization; Docker or Podman will use the Linux kernel in the guest. 
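These kernel-level prerequisites can be checked before installing any runtime. The sketch below reads standard Linux interfaces only; the paths are the usual Debian locations, and nothing in it is specific to this VM:

```python
# Probe container prerequisites from inside the guest (a sketch).
import os
import platform
import pwd

print('arch:', platform.machine())    # expect aarch64 on this VM
print('kernel:', platform.release())

# cgroup v2 appears as a 'cgroup2' filesystem type in /proc/mounts
try:
    with open('/proc/mounts') as f:
        mounts = f.read()
except OSError:
    mounts = ''
print('cgroup2 mounted:', ' cgroup2 ' in mounts)

# Rootless runtimes need subuid/subgid ranges for the current user.
user = pwd.getpwuid(os.getuid()).pw_name
for path in ('/etc/subuid', '/etc/subgid'):
    try:
        with open(path) as f:
            present = any(line.startswith(user + ':') for line in f)
    except OSError:
        present = False
    print(path + ':', 'present' if present else 'missing')
```

If all of these come back positive, the remaining risk lives in the runtime and images, not the kernel.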
The relevant questions are therefore ordinary Linux container questions: cgroups, namespaces, overlay filesystems, network forwarding, DNS, ports, and ARM64 image availability.\n\nCurrent live checks:\n\n```text\narch: aarch64\nkernel: 6.12.86+deb13-arm64\ncgroup: cgroup v2 mounted rw with nsdelegate\nunprivileged user namespaces: enabled\nsubuid/subgid for choir: present\ncontainer runtimes: docker/podman/docker-compose not installed yet\n```\n\nThat points toward containers being feasible. The main issues to navigate are:\n\n- **Runtime choice.** Docker is familiar and has the broadest docs. Podman is attractive for rootless containers, daemonless operation, and a more Linux-native feel. Other things being equal, start with Podman, but keep Docker as the fallback if SearXNG/Ladder docs or compose files create friction.\n- **Rootless vs rootful.** Rootless Podman is safer and likely sufficient for localhost extraction services. If networking or compose behavior becomes annoying, rootful Podman or Docker can be a pragmatic second rung.\n- **Compose compatibility.** SearXNG’s official docs assume Compose. Podman has `podman compose`/Docker Compose compatibility, but this is a common friction point. The first win should be a single trivial container, not SearXNG.\n- **Networking.** Bind services to `127.0.0.1` first. Do not expose Ladder or SearXNG publicly. Apple Virtualization may introduce host/guest networking wrinkles later, but agent access from inside the VM only needs localhost.\n- **ARM64 images.** Ladder advertises `arm64` Docker support. SearXNG has container images and should be fine. Obscura/Camoufox need empirical checks for Linux ARM64 support; if they are x86-only, running them inside this VM may require emulation or a different host strategy.\n- **Storage drivers.** Rootless Podman may need `fuse-overlayfs` depending on Debian’s defaults. 
This is fixable, but it is exactly why the plan should accumulate small wins.\n\nSearXNG and Ladder both have straightforward container stories. SearXNG’s docs recommend Docker/Compose deployment and provide templates. Ladder publishes Docker images for `amd64` and `arm64`, which matters because this VM is ARM. Obscura may be simpler as a binary if Linux ARM builds exist; otherwise it may need a source build, or it may have to run on the Mac host. Camoufox is heavier and should be treated as a later browser-runtime integration.\n\n!! Recommended implementation plan\n\nBuild this as a local extraction service with explicit provenance and cost boundaries:\n\n- `extract(url)` starts with local fetch/readability and saves raw HTML, normalized Markdown, metadata, and warnings.\n- If local fetch fails, it asks SearXNG for alternate/canonical/archive candidates.\n- If the target looks like a Ladder-supported domain or a header/ruleset problem, try Ladder.\n- If JavaScript rendering is required, try Obscura or Playwright.\n- If bot detection blocks ordinary automation, try Camoufox.\n- Only then use Firecrawl, and record that paid fallback was used.\n\nEvery artifact should record the rung used, commands/services involved, source URL, retrieval timestamp, content hash, and whether extraction was complete, partial, manually supplied, or paid-fallback. This is the important publishing boundary: Mosiah.org gets packaged source tiddlers and clean citations; the local system keeps the operational provenance.\n\n!! Test corpus\n\nWe need a small corpus of content classes, not just one URL. 
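Classifying a corpus needs nothing beyond rung 2: one local HTTP request per URL, recording status and server banner. A minimal sketch, using only the stdlib `urllib`; the corpus label and probe user agent here are illustrative, not the project's final schema:

```python
# One cheap HTTP probe per corpus URL: status + server banner, nothing else.
import urllib.error
import urllib.request

# Illustrative single entry; the full probe set is described in the text.
CORPUS = {'static-public': 'https://pluralistic.net/'}

def probe(url, timeout=10):
    req = urllib.request.Request(url, headers={'User-Agent': 'mosiah-probe/0.1'})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.headers.get('Server', '?')
    except urllib.error.HTTPError as e:
        # 401/403 identify the bot wall; that is data, not an error
        return e.code, e.headers.get('Server', '?')
    except (urllib.error.URLError, TimeoutError) as e:
        return None, str(e)

for label, url in CORPUS.items():
    status, server = probe(url)
    print(f'{label}: {status} ({server})')
```

The 401/403 responses are signal rather than failure: the status plus server banner is what separates the Cloudflare, DataDome, and CloudFront classes from ordinary pages.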
The first probe set should include:\n\n- an easy static public page such as Pluralistic,\n- a manually supplied text copy for a hostile source such as the Doctorow/Medium article,\n- Medium itself as a bot-wall behavior probe, not a paid extraction target,\n- soft/paywalled mainstream pages such as NYTimes/Washington Post/Bloomberg/Economist,\n- bot/JS-friction news pages such as Reuters or similar,\n- a Cloudflare/DataDome style anti-bot page,\n- a JavaScript-heavy app/documentation page,\n- Twitter/X public post/thread URLs, where Firecrawl is allowed as a fallback.\n\nInitial cheap probes, using only local HTTP and no paid extraction, already show useful class separation:\n\n```text\npluralistic.net: 200, normal HTML\nDoctorow Medium URL: 403, Cloudflare\nNYTimes Cloudflare AI article: 403, DataDome\nReuters AI page: 401, CloudFront\nX/AOC post: 200 HTML shell, Cloudflare/envoy\n```\n\nA working corpus file lives with the local report as `test_corpus.md`. The corpus should evolve as experiments reveal better examples.\n\n!! Experiment ladder: accumulate wins\n\nThe build should progress from simplest to most complex, with each rung leaving behind a working capability and a published note.\n\n1. **Runtime smoke test.** Install Podman first if possible. Run `hello-world` or `alpine uname -a`. Verify rootless mode, DNS, outbound HTTPS, volume mounts, and localhost port binding. If this is annoying, try rootful Podman before abandoning Podman for Docker. **Status: passed.** Podman 5.4.2 installed from Debian 13 ARM64 packages. Rootless Podman reports cgroup v2, systemd cgroup manager, netavark/aardvark DNS, crun, pasta/slirp4netns, and working subuid/subgid mappings.\n2. **Static service test.** Run a tiny HTTP container bound to `127.0.0.1`, fetch it from the agent, and verify restart/cleanup commands. This proves service lifecycle before adding app complexity. 
**Status: passed.** `nginx:alpine` bound to `127.0.0.1:18080`, returned `HTTP/1.1 200 OK`, then was removed cleanly. Alpine also verified DNS/outbound HTTPS and host/container volume mounts.\n3. **SearXNG.** Run SearXNG locally. Query it from Python. Save JSON/HTML search results as an artifact. Confirm it helps find alternate copies and canonical pages, not just search the web for the user. **Status: passed with a note.** SearXNG runs in Podman at `127.0.0.1:18081`. HTML search works. JSON search returned 403 under the default bot-detection/API settings, so the first integration should parse HTML or tune formats/limiter config. Useful result: searching for the Doctorow/Medium title surfaced the canonical Medium URL, a Mamot/Mastodon copy, a Chinwag discussion, and the direct Pluralistic URL `https://pluralistic.net/2025/02/18/pikettys-productivity/`.\n4. **Ladder.** Run Ladder with basic auth and localhost-only binding. Test an easy article and one known annoying article. Record when Ladder helps and when it does not. **Status: passed as a proxy, limited as an anti-bot tool.** Ladder runs in Podman at `127.0.0.1:18082` with the public ruleset loaded: 16 rules for 41 domains. It successfully proxied Pluralistic. For Medium, NYTimes/DataDome, and Reuters/CloudFront, it returned the bot-wall/interstitial HTML rather than article text. That is still a useful result: Ladder belongs in the header/ruleset/proxy rung, not the adversarial-browser rung.\n5. **Extractor baseline.** Build a plain local extractor using HTTP fetch + readability/trafilatura-style parsing. Make this the default path before any browser runtime.\n6. **JavaScript rendering.** Test Obscura or ordinary Playwright against a JS-heavy page. Compare output quality and resource use.\n7. **Adversarial browser.** Test Camoufox only after the normal browser path fails. Treat it as a heavy rung, not the default.\n8. 
**Paid fallback.** Use Firecrawl only when the previous rungs fail and the source is worth the cost, or for Twitter/X where it is explicitly allowed.\n\nThe goal is not to build a pirate paywall machine. The goal is a cost-aware, local-first, artifact-native extraction system for agentic publishing.\n"
}