February 2026 · 6 min read

LLMs Aren't Magic: What I Learned Building a Production AI Pipeline

There's a gap between "I got GPT-4 to do something impressive in a demo" and "this runs reliably on 10,000 products without failing, hallucinating, or timing out." I spent several months in that gap building HS Code Autopilot — a Shopify app that classifies product catalogs using the same legal framework customs brokers use.

The problem with keyword matching

Most automated HS code tools are glorified lookup tables. Type "leather handbag" → return 4202.21. The problem is that HS classification is legally defined through a hierarchy of rules called the General Rules for the Interpretation of the Harmonised System (GRI). Under GRI, a bag's classification depends on its outer surface material, its primary function, whether it's designed to be carried in the hand or on the body, and sometimes the predominant component by value.

A keyword lookup can't reason through that. A well-prompted LLM, with the right domain context injected, can.
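To make the difference concrete, here is a deliberately simplified sketch, not the app's actual code and not classification advice. The `BagAttributes` type, the keyword table, and the single-attribute rule are assumptions made for illustration. Real classification weighs far more than outer surface material, and the subheading descriptions in the comments are abbreviated.

```typescript
// Illustration only: hypothetical types and a one-attribute rule.
type OuterSurface = "leather" | "plastic_sheeting" | "textile" | "other";

interface BagAttributes {
  outerSurface: OuterSurface;
}

// Keyword approach: the first keyword found in the title wins.
const KEYWORD_TABLE: Array<[string, string]> = [
  ["leather handbag", "4202.21"],
];

function classifyByKeyword(title: string): string | undefined {
  const lower = title.toLowerCase();
  const hit = KEYWORD_TABLE.find(([keyword]) => lower.includes(keyword));
  return hit?.[1];
}

// Attribute approach: decide on what the outer surface actually is.
function classifyHandbag(attrs: BagAttributes): string {
  switch (attrs.outerSurface) {
    case "leather":
      return "4202.21"; // handbags, outer surface of leather
    case "plastic_sheeting":
    case "textile":
      return "4202.22"; // handbags, outer surface of plastic sheeting or textile
    default:
      return "4202.29"; // other handbags
  }
}

const title = "Vegan leather handbag";
const byKeyword = classifyByKeyword(title);
// Assumes extraction recognized "vegan leather" as plastic sheeting.
const byAttributes = classifyHandbag({ outerSurface: "plastic_sheeting" });

console.log({ byKeyword, byAttributes });
```

The keyword table matches "leather handbag" inside "Vegan leather handbag" and returns the leather subheading, while the attribute-based rule lands on a different one. The hard part, of course, is getting `outerSurface` right in the first place.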

The extraction layer

The first thing I built wasn't a classifier — it was an extraction pipeline. For each product, GPT-4o reads the title, description, and Shopify tags, then extracts structured attributes: primary material, function, construction type, outer surface material, whether it's a finished good or a component.

The key design decision was adding a confidence score and an "ambiguity note" to the output schema. When the model isn't sure — because the product description says "high-quality material" without specifying what — it flags it for merchant review rather than guessing. This turned out to be more valuable than forcing a confident wrong answer.
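A sketch of what such an output schema could look like, with the review routing next to it. The field names, the 0 to 1 confidence scale, and the `REVIEW_THRESHOLD` value are all assumptions for illustration, not the schema the app ships with.

```typescript
// Hypothetical extraction schema; every field name here is illustrative.
interface ExtractedAttributes {
  primaryMaterial: string | null;
  outerSurfaceMaterial: string | null;
  productFunction: string | null;
  constructionType: string | null;
  isFinishedGood: boolean | null;
  confidence: number;           // assumed scale: 0 (no idea) to 1 (certain)
  ambiguityNote: string | null; // why the model was unsure, if it was
}

// Assumed cutoff; in practice this would be tuned against reviewed samples.
const REVIEW_THRESHOLD = 0.8;

// Flag for merchant review instead of guessing.
function needsMerchantReview(attrs: ExtractedAttributes): boolean {
  return attrs.confidence < REVIEW_THRESHOLD || attrs.ambiguityNote !== null;
}

const clearProduct: ExtractedAttributes = {
  primaryMaterial: "cotton",
  outerSurfaceMaterial: "cotton",
  productFunction: "t-shirt",
  constructionType: "knitted",
  isFinishedGood: true,
  confidence: 0.95,
  ambiguityNote: null,
};

const vagueProduct: ExtractedAttributes = {
  primaryMaterial: null,
  outerSurfaceMaterial: null,
  productFunction: "handbag",
  constructionType: null,
  isFinishedGood: true,
  confidence: 0.4,
  ambiguityNote: 'Description says "high-quality material" without naming it.',
};

console.log(needsMerchantReview(clearProduct), needsMerchantReview(vagueProduct));
```

Treating a non-null ambiguity note as a review trigger on its own is a design choice in this sketch: a model can report high confidence and still mention something a human should look at.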

The clustering insight

A merchant with 2,000 products can't review every single one. They shouldn't have to. Most of those products fall into 40–50 natural clusters: all the cotton t-shirts, all the leather wallets, all the stainless steel tools. If you group similar products together, the merchant reviews a cluster — approves the shared attributes — and the whole group is classified at once.

This was a UX insight that made the AI pipeline actually usable at scale. The clustering uses embedding similarity on extracted attributes, not raw product text. Two products described completely differently might share the same extracted profile — and should share a classification.
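One simple way to group by embedding similarity is a greedy pass with a cosine threshold, sketched below. This is a stand-in technique chosen for brevity, not necessarily the algorithm the app uses. The vectors are toy values standing in for embeddings of extracted attribute profiles, and the `0.9` threshold is an assumption.

```typescript
type Vector = number[];

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface ProductEmbedding {
  productId: string;
  vector: Vector; // embedding of the extracted attributes, not the raw text
}

interface Cluster {
  representative: Vector; // vector of the first product placed in the cluster
  productIds: string[];
}

// Greedy clustering: join the first cluster whose representative is similar
// enough, otherwise start a new cluster.
function clusterByThreshold(items: ProductEmbedding[], threshold: number): Cluster[] {
  const clusters: Cluster[] = [];
  for (const item of items) {
    const match = clusters.find(
      (c) => cosineSimilarity(c.representative, item.vector) >= threshold,
    );
    if (match) {
      match.productIds.push(item.productId);
    } else {
      clusters.push({ representative: item.vector, productIds: [item.productId] });
    }
  }
  return clusters;
}

// Toy vectors: two t-shirts with near-identical profiles, one wallet.
const clusters = clusterByThreshold(
  [
    { productId: "tee-1", vector: [1, 0, 0.1] },
    { productId: "tee-2", vector: [0.9, 0.1, 0.1] },
    { productId: "wallet-1", vector: [0, 1, 0.1] },
  ],
  0.9,
);

console.log(clusters.map((c) => c.productIds));
```

The two t-shirts end up in one cluster and the wallet in its own, so a merchant approving the first cluster would classify both t-shirts in one step. A greedy pass is order-dependent, which is one reason a production system might prefer a proper clustering algorithm.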

Running it reliably

LLM calls fail: rate limits, network timeouts, model errors. A synchronous request/response design would mean catalog syncs timing out and a bad experience for users. Instead, every job (catalog sync, extraction, clustering) runs through BullMQ queues backed by Redis, with automatic retries and exponential backoff. The merchant sees a progress dashboard that updates as jobs complete.
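BullMQ configures retries through the `attempts` and `backoff` job options, so none of this has to be hand-written. To show the idea itself, here is a dependency-free sketch of retrying with exponential backoff. The helper names (`backoffDelayMs`, `withRetries`) and the delay values are assumptions for illustration, not BullMQ internals.

```typescript
interface RetryOptions {
  attempts: number;    // total tries, including the first one
  baseDelayMs: number; // delay before the first retry
}

// Delay doubles after each failed attempt: base, 2x base, 4x base, ...
function backoffDelayMs(attemptsMade: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, attemptsMade - 1);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run a task, retrying on failure with growing delays between attempts.
async function withRetries<T>(task: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < opts.attempts) {
        await sleep(backoffDelayMs(attempt, opts.baseDelayMs));
      }
    }
  }
  throw lastError;
}

// Usage: a hypothetical flaky call that fails twice, then succeeds.
let calls = 0;
const flakyExtraction = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) throw new Error("rate limited");
  return "extracted";
};

withRetries(flakyExtraction, { attempts: 5, baseDelayMs: 10 })
  .then((result) => console.log(result, "after", calls, "calls"))
  .catch((err) => console.error("gave up:", err));
```

A queue adds what this sketch lacks: jobs survive process restarts, retries don't block a web request, and progress can be reported to a dashboard as each job completes.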

What I'd do differently

I'd separate the extraction model from the classification model earlier. GPT-4o is overkill for attribute extraction — a smaller model fine-tuned on product descriptions would be faster and more consistent. Keeping them in one prompt made iteration harder than it needed to be.

I'd also build the merchant review UI before the AI pipeline. Understanding exactly what decision the merchant needs to make shapes how you structure the extraction schema. Going the other way — building the pipeline first and retrofitting UI — led to one complete schema redesign.

The honest summary: LLMs are genuinely useful for structured reasoning tasks with high domain complexity. But the value isn't in the model — it's in the prompt design, the output schema, the confidence flagging, the clustering layer, and the reliability infrastructure. The model is a component. The product is everything around it.

Enjoyed this?

Let's talk about what you're building.

I'm happy to go deeper on architecture, decisions, and tradeoffs — whether it's a follow-up on this post or a product problem you're working through.

akshaymalu1@gmail.com