Most people add an LLM to a pipeline they already built. I built the pipeline around the LLM, and that inversion changes everything about how the system behaves.
What Is an LLM-as-ETL Architecture?
Traditional ETL pipelines extract structured data, run deterministic transformations, and load the result into a destination. LLM-as-ETL replaces the transformation layer with a language model call. The model does what no regex or lookup table can do cleanly: reads ambiguous inputs, reasons about them, and returns structured judgments at scale. The distinction matters because it changes where you put your trust. In a classical pipeline you trust your transformation code. In an LLM-as-ETL pipeline you trust the model’s reasoning, and you design the rest of the system to catch where that reasoning fails. Those are different engineering problems, and conflating them is why most LLM integrations feel fragile.
How Does the NowServingTO Pipeline Work?
The pipeline runs in five stages. Input is a City of Toronto business licence CSV. Output is a static JSON file served by Apache to a vanilla HTML front end. Claude Haiku is the intelligence layer at stages two and three. No database, no backend, no build step, no React.
Stage 1 – Chain denylist (deterministic, zero LLM cost). Every business name runs through a substring match against ~140 known chains: Popeyes, KFC, Tim Hortons, Boston Pizza, and equivalents. A match returns unknown immediately and skips every downstream step. This is not an LLM job. Chains are chains. Deterministic rules are cheaper, faster, and more reliable for facts that do not change.
Stage 2 – Name-only Haiku classification. Entries that clear the denylist go to a name-only classification call. Haiku reads the business name and returns a cuisine tag from a fixed 44-category taxonomy. No web search, no external context. Cheap first pass, local cache so the same name is never classified twice.
Stage 3 – Web-search verification (one call, three outputs). A single Haiku call with web_search tool access reads live search results and returns all three judgments simultaneously: cuisine tag (refined or confirmed), operating status (open / unclear / closed), and website URL if found. One call for three outputs was a deliberate cost decision. It also means the cuisine and operating status judgments share the same context, which improves coherence.
Stage 4 – Google Places match gate. Every verified entry must match a Google Places record. No place_id means the entry does not render. A row that links nowhere is worse than no row: it wastes the user’s time and erodes trust in every correct entry on the page. The flag in the codebase: ALLOW_WEB_SEARCH_ONLY = False.
Stage 5 – Static output. The inject script writes a single data/corridors.json file. Apache serves it. The LLM did the hard work upstream. The rendering layer is trivially simple because the intelligence layer already ran.
What Makes This Self-Healing?
Every cached entry carries a verified_at timestamp. The daily cron compares that timestamp against a re-check interval specific to the entry’s link type and queues stale entries for a fresh web_search call. The intervals are calibrated to how quickly each source type changes:
- Own website (.com / .ca): 180 days
- Instagram / Facebook / TikTok: 30 days
- blogTO / Yelp / Google Maps / TripAdvisor: 14 days
- No-link yes verdicts (open confirmed, no URL found): 14 days
- Unclear verdicts: 7 days
- Confirmed closed: 60 days
As Google and Bing index new places, the next cron run silently upgrades entries. A restaurant with no web presence on opening day gets a confirmed-open verdict and a website link within weeks, without any manual intervention. The system heals itself as the internet catches up to the physical world.
What Is the Verified-Open Gate and Why Does It Exist?
A City of Toronto food-service licence is not evidence that a restaurant is open. Licences get issued months before opening day. They persist for restaurants that have quietly closed. They exist for catering operations, ghost kitchens, and institutional food services that will never serve a walk-in customer. Raw licence data is a terrible restaurant directory. The verified-open gate is what turns it into a useful one.
The combined effect of the Places gate and the OPERATIONAL status check is dramatic. Of 2,016 active food licences issued in the last 365 days, 258 entries make it through to the public directory – a 12.8% yield. The other 87.2% are chains, unverifiable, status-uncertain, or lacking a Places match. Surfacing all 2,016 would make the directory useless. The gates are the product.
When Should You Build LLM-as-ETL vs Bolt an LLM Onto Existing Infrastructure?
Build around the LLM when the transformation step is the hard problem – when your pipeline’s core challenge is making structured sense of ambiguous, unstructured, or semi-structured input at scale. Cuisine classification from business names in 30+ languages is exactly this. No lookup table handles it well. An LLM handles it cheaply and accurately. Bolt an LLM on when the transformation step is already solved and you are adding a new capability: generating summaries of structured records, drafting templated copy from database fields, classifying support tickets in an existing routing system.
The practical test: if removing the LLM would require rebuilding the pipeline from scratch, you built around it. If removing it would just turn off a feature, you bolted it on. Neither is inherently wrong. But knowing which one you are doing matters for how you design error handling, cost controls, and cache invalidation.
What Are the Failure Modes?
LLM refusal text in live HTML. Early in the project, Haiku occasionally returned a refusal instead of a structured verdict. The output parser was not strict enough to catch these, and the raw refusal text propagated into the JSON wire file and rendered to users. The fix: explicit output validation before any response is written to cache. If the response does not parse into the expected schema, flag it as unclear and re-queue.
Grocery store false positives. Large grocery chains with in-store food counters hold food-service licences. Several made it through early pipeline versions tagged as restaurants. The denylist now covers major chains explicitly, and web_search verification catches the rest because a grocery store’s web presence makes its nature obvious. But new international chains the denylist does not know yet will occasionally slip through until the next cache refresh.
Silent build failures. The most frustrating failure: a three-day period where the nightly cron completed without errors but wrote a stale output file. All exit codes were zero. A logic error in the inject script was skipping the write step when the input delta fell below a threshold added as an optimization. The directory was not updating. Nothing was alerting. The fix: an explicit output file modification-time check after every cron run, with an alert if the file is older than 26 hours. Silent success is a failure mode.
NowServingTO is running in production at nowservingto.com – Toronto’s daily-refresh directory of newly licensed restaurants across 45 cuisine categories, verified open, chains excluded.
