The LLM-as-ETL Architecture: Building a Data Pipeline Around Claude Instead of Bolting It On

Most people add an LLM to a pipeline they already built. I built the pipeline around the LLM, and that inversion changes everything about how the system behaves.

Key takeaways

LLM-as-ETL replaces the transformation layer: Instead of deterministic rules, a language model handles ambiguous input classification at scale, changing where engineering trust lives.
Front-load deterministic checks: NowServingTO runs a free chain denylist before any LLM call, cutting costs and reserving expensive steps for genuinely hard cases.
One call, three outputs: A single Haiku web_search call returns cuisine tag, operating status, and website URL simultaneously, reducing cost and context incoherence.
The gates are the product: Of 2,016 food licences, only 258 (12.8%) reach the public directory. The filtering, not the raw data, is the value.
Self-healing via re-check intervals: Every cached entry carries a verified_at timestamp; stale entries are auto-queued by source type, from 7 days (unclear verdicts) to 180 days (own-website links).
Silent success is a failure mode: The worst production bug was a cron job that exited zero but stopped writing the output file. Always verify file modification time after each run.

What Is an LLM-as-ETL Architecture?

Traditional ETL pipelines extract structured data, run deterministic transformations, and load the result into a destination. LLM-as-ETL replaces the transformation layer with a language model call. The model does what no regex or lookup table can do cleanly: reads ambiguous inputs, reasons about them, and returns structured judgments at scale. The distinction matters because it changes where you put your trust. In a classical pipeline you trust your transformation code. In an LLM-as-ETL pipeline you trust the model’s reasoning, and you design the rest of the system to catch where that reasoning fails. Those are different engineering problems, and conflating them is why most LLM integrations feel fragile.

How Does the NowServingTO Pipeline Work?

The pipeline runs in five stages. Input is a City of Toronto business licence CSV. Output is a static JSON file served by Apache to a vanilla HTML front end. Claude Haiku is the intelligence layer at stages two and three. No database, no backend, no build step, no React.

Stage 1, Chain denylist (deterministic, zero LLM cost). Every business name runs through a substring match against ~140 known chains: Popeyes, KFC, Tim Hortons, Boston Pizza, and equivalents. A match returns unknown immediately and skips every downstream step. This is not an LLM job. Chains are chains. Deterministic rules are cheaper, faster, and more reliable for facts that do not change.

Stage 2, Name-only Haiku classification. Entries that clear the denylist go to a name-only classification call. Haiku reads the business name and returns a cuisine tag from a fixed 44-category taxonomy. No web search, no external context. Cheap first pass, local cache so the same name is never classified twice.

Stage 3, Web-search verification (one call, three outputs). A single Haiku call with web_search tool access reads live search results and returns all three judgments simultaneously: cuisine tag (refined or confirmed), operating status (open / unclear / closed), and website URL if found. One call for three outputs was a deliberate cost decision. It also means the cuisine and operating status judgments share the same context, which improves coherence.

Stage 4, Google Places match gate. Every verified entry must match a Google Places record. No place_id means the entry does not render. A row that links nowhere is worse than no row: it wastes the user’s time and erodes trust in every correct entry on the page. The flag in the codebase: ALLOW_WEB_SEARCH_ONLY = False.

Stage 5, Static output. The inject script writes a single data/corridors.json file. Apache serves it. The LLM did the hard work upstream. The rendering layer is trivially simple because the intelligence layer already ran.

What Makes This Self-Healing?

Every cached entry carries a verified_at timestamp. The daily cron compares that timestamp against a re-check interval specific to the entry’s link type and queues stale entries for a fresh web_search call. The intervals are calibrated to how quickly each source type changes:

Own website (.com / .ca): 180 days
Instagram / Facebook / TikTok: 30 days
blogTO / Yelp / Google Maps / TripAdvisor: 14 days
No-link yes verdicts (open confirmed, no URL found): 14 days
Unclear verdicts: 7 days
Confirmed closed: 60 days

As Google and Bing index new places, the next cron run silently upgrades entries. A restaurant with no web presence on opening day gets a confirmed-open verdict and a website link within weeks, without any manual intervention. The system heals itself as the internet catches up to the physical world.

What Is the Verified-Open Gate and Why Does It Exist?

A City of Toronto food-service licence is not evidence that a restaurant is open. Licences get issued months before opening day. They persist for restaurants that have quietly closed. They exist for catering operations, ghost kitchens, and institutional food services that will never serve a walk-in customer. Raw licence data is a terrible restaurant directory. The verified-open gate is what turns it into a useful one.

The combined effect of the Places gate and the OPERATIONAL status check is dramatic. Of 2,016 active food licences issued in the last 365 days, 258 entries make it through to the public directory, a 12.8% yield. The other 87.2% are chains, unverifiable, status-uncertain, or lacking a Places match. Surfacing all 2,016 would make the directory useless. The gates are the product.

When Should You Build LLM-as-ETL vs Bolt an LLM Onto Existing Infrastructure?

Approach	Use when	Without the LLM	Error handling focus
Build around the LLM	Transformation is the core problem: ambiguous inputs, 30+ languages, no deterministic rule covers it cleanly	Pipeline must be rebuilt from scratch	Catch model failures across the whole system design
Bolt the LLM on	Transformation already solved; LLM adds summaries, copy drafts, or ticket routing on top of existing logic	Feature turns off; rest of pipeline still runs	Isolate to the feature wrapper only

Cuisine classification from business names in 30+ languages is a build-around problem. No lookup table handles it reliably. An LLM handles it cheaply and accurately. Generating editorial copy from existing structured records is a bolt-on problem. Neither approach is wrong, but knowing which you are building changes how you design error handling, cost controls, and cache invalidation strategy.

What Are the Failure Modes?

LLM refusal text in live HTML. Early in the project, Haiku occasionally returned a refusal instead of a structured verdict. The output parser was not strict enough to catch these, and the raw refusal text propagated into the JSON wire file and rendered to users. The fix: explicit output validation before any response is written to cache. If the response does not parse into the expected schema, flag it as unclear and re-queue.

Grocery store false positives. Large grocery chains with in-store food counters hold food-service licences. Several made it through early pipeline versions tagged as restaurants. The denylist now covers major chains explicitly, and web_search verification catches the rest because a grocery store’s web presence makes its nature obvious. But new international chains the denylist does not know yet will occasionally slip through until the next cache refresh.

Silent build failures. The most frustrating failure: a three-day period where the nightly cron completed without errors but wrote a stale output file. All exit codes were zero. A logic error in the inject script was skipping the write step when the input delta fell below a threshold added as an optimization. The directory was not updating. Nothing was alerting. The fix: an explicit output file modification-time check after every cron run, with an alert if the file is older than 26 hours. Silent success is a failure mode.

NowServingTO is running in production at nowservingto.com, Toronto’s daily-refresh directory of newly licensed restaurants across 45 cuisine categories, verified open, chains excluded.

Frequently asked questions

What is the difference between LLM-as-ETL and a traditional ETL pipeline?

A traditional ETL pipeline uses deterministic transformations: regex, lookups, and rule-based logic that produce predictable outputs from predictable inputs. LLM-as-ETL replaces the transformation step with a language model call, enabling the system to handle ambiguous, multilingual, or unstructured inputs that no lookup table covers well. The tradeoff is that you shift trust from code you fully control to a model whose reasoning you must validate through output schema checks and downstream gates.

How much does running Claude Haiku as an ETL engine cost at production scale?

NowServingTO's full nightly pipeline runs at approximately $0.35 per day for a directory covering 2,000+ licence entries. That cost is achieved by front-loading free deterministic checks (the chain denylist), caching every classification result with a verified_at timestamp, and batching re-checks by source-type interval rather than re-checking everything nightly. The biggest cost driver is web_search calls, which are reserved for entries that actually need fresh verification rather than running on every record.

What happens when Claude returns a refusal instead of a structured verdict?

Early in the project, Haiku occasionally returned a refusal string instead of a structured JSON verdict. Without strict output validation, those refusals propagated into the wire file and rendered raw text to users. The fix is explicit schema validation before any response is written to cache: if the response does not parse into the expected structure, the entry is flagged as unclear and re-queued for the next run. Never trust model output as valid without parsing it first.

Why does NowServingTO require a Google Places match before showing an entry?

A City of Toronto food-service licence is not proof a restaurant is open or serves walk-in customers. Licences cover ghost kitchens, catering operations, and institutional food services that will never have a public-facing Google Places listing. Requiring a Places match with OPERATIONAL status acts as a final-layer reality check: if Google's index cannot confirm the business exists at that address as a functioning restaurant, neither can NowServingTO. The Places gate is what transforms a raw licence dump into a useful directory.

How does the self-healing re-check system prevent stale data?

Every cached entry stores a verified_at timestamp alongside its verdict. The nightly cron compares that timestamp against a re-check interval calibrated to how volatile each source type is: unclear verdicts re-check every 7 days, social media links every 30 days, and stable own-website links every 180 days. As new restaurants build a web presence, the system silently upgrades their entries from unclear to confirmed-open without manual intervention. The directory heals itself as the internet catches up to the physical world.