May 12, 2026

Why General AI Fails the Customs Broker License Exam and Specialized AI Passes

When Gaia Dynamics' AI sat the fall 2025 US Customs Broker License Exam, it scored 100%. On the same exam, ChatGPT, Google Gemini, and Anthropic's Claude scored 38%, 63%, and 25%, respectively. None of the three general-purpose models would have cleared the 75% pass threshold. For context, only 12% of the most recent cohort of human test-takers passed. That spread is the reason customs compliance teams need to care which kind of AI they put in front of classification work. The gap is the subject of Episode 2 of Gaia Dynamics' Trade and Tech podcast, featuring Gaia CTO Carlos Alzate alongside CEO Emil Stefanutti and CSCO Tom Gould.

Alzate built the technical foundations for roughly 40 AI startups as CTO at Andrew Ng's AI Fund, was a research scientist at IBM Research, and holds a PhD from the University of Leuven. The hour with him produced a usable map of where AI in customs is real, where it is hype, and where the regulators are headed.

How General LLMs Work and Why Trade Breaks Them

Most confusion about AI in customs starts with a misunderstanding of how models like ChatGPT generate answers. Alzate's explanation was direct.

"These models are trained by processing an enormous amount of text. Through this process, they learn the patterns in language. So when you ask ChatGPT something, it is not searching a database or looking up an answer. It's predicting word by word what is the most likely helpful response based on the patterns it learned during pre-training."

A general LLM is a sophisticated pattern matcher, not a fact retriever. "These models don't know things the way humans do," Alzate said. "This is why they can be very wrong, but they are very confident about it."
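
The "pattern matcher, not a fact retriever" point can be illustrated with a toy next-word predictor. This is a minimal sketch, not how production LLMs are built: it counts word pairs instead of training a neural network, and the training sentences and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count which word follows which across a list of sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Made-up training text for illustration; not real classification guidance.
corpus = [
    "cotton shirts are classified in chapter 61",
    "cotton shirts are classified in chapter 61",
    "cotton shirts are classified in chapter 62",
]
model = train_bigrams(corpus)
print(predict_next(model, "chapter"))  # prints "61"
```

The toy model answers "61" because that pattern was most frequent in its training text, not because it checked anything. It would give the same answer for a product where "62" is correct, and it would give it with the same confidence.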

That problem compounds in trade compliance. Pre-training data on customs classification is sparse, unverifiable, and frozen at the model's training cutoff date. In a year when the HTS schedule reached version 31, that cutoff alone disqualifies general models from production classification work.

The exam confirmed the gap. A competing specialized trade AI tool scored 93%. Gaia scored 100%. The three general models all fell below the pass threshold. Alzate identified three structural reasons specialized AI wins:

  • Specialized knowledge: General LLMs lack direct access to the official USITC HTS database with 18,000+ codes, chapter notes, exclusions, and legal interpretations. Specialized tools connect to official tariff schedules and update as regulations change.

  • Verification: A general LLM may confidently produce an HTS code that does not exist. Specialized systems validate every code against the live database before returning it.

  • Structured workflow: "Our system doesn't just ask AI once and return the answer. It's a multi-stage process. We avoid hallucinations, we have up-to-date information."
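
The verification and multi-stage ideas above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not Gaia's implementation: `KNOWN_CODES` stands in for a live tariff lookup, `propose_codes` stands in for a model call, the codes are placeholders rather than real HTS entries, and the 4.2.4 digit format is assumed for the example.

```python
import re

# Hypothetical stand-in for a live tariff database lookup.
# Codes and descriptions are placeholders, not real HTS entries.
KNOWN_CODES = {
    "0000.00.0001": "placeholder description A",
    "0000.00.0002": "placeholder description B",
}

# Assumed display format for a 10-digit code in this sketch.
CODE_FORMAT = re.compile(r"^\d{4}\.\d{2}\.\d{4}$")

def propose_codes(description):
    """Stage 1: stand-in for a model call. One proposal does not exist."""
    return ["0000.00.0002", "9999.99.9999"]

def verify(code):
    """Stage 2: accept a code only if it is well-formed and in the database."""
    return bool(CODE_FORMAT.match(code)) and code in KNOWN_CODES

def classify(description):
    """Run both stages and report what survived verification."""
    candidates = propose_codes(description)
    verified = [c for c in candidates if verify(c)]
    rejected = [c for c in candidates if c not in verified]
    # Stage 3: if nothing survives, escalate instead of guessing.
    return {"verified": verified, "rejected": rejected,
            "needs_review": not verified}

result = classify("example product description")
print(result)
```

The design point is that the model's output is treated as a proposal. A code that fails the lookup never reaches the user, and an empty result is routed to a person rather than papered over.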

What AI Does Well in Customs Today and Where It Falls Short

Gould described pre-AI classification as a process of elimination, not lookup. Before opening the tariff schedule, he gathers technical documents, marketing materials, and comparable import records. The classification work involves identifying all possible codes and then narrowing them down. Even after landing on a code, the work continues. PGA flags, Section 232, Section 301, IEEPA layers, and AD/CVD duties all stack on top. For a straightforward product, that process takes a minimum of an hour. For a complex product, it can take several days.

Where AI excels today:

  • Narrowing 18,000+ HTS codes to the 40 most relevant in seconds

  • Reaching the four- to six-digit HTS level quickly with high accuracy

  • Locating historical rulings that would take a human hours

  • Handling product information across multiple languages
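
The narrowing step in the first bullet is essentially a ranking problem. A minimal sketch, assuming a simple word-overlap score in place of whatever retrieval method a real tool uses; the catalog entries, codes, and function names are hypothetical.

```python
import heapq

def score(query, description):
    """Fraction of query words that also appear in the description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0

def shortlist(query, catalog, k=40):
    """Return up to k codes ranked by score, dropping zero-score entries."""
    scored = ((score(query, desc), code) for code, desc in catalog.items())
    best = heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))
    return [code for _, code in best]

# Placeholder catalog; a real schedule has 18,000+ entries.
catalog = {
    "A-001": "men's knitted cotton shirts",
    "A-002": "women's woven silk blouses",
    "A-003": "steel bolts and screws",
    "A-004": "knitted cotton pullovers",
}
print(shortlist("knitted cotton shirts", catalog, k=2))  # ['A-001', 'A-004']
```

The shortlist is a starting point for the elimination work Gould describes, not a classification. The expert still decides among the candidates.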

Where it still falls short:

  • Full 10-digit accuracy, which requires composition percentages and technical specifications

  • Producing legally defensible explanations end-to-end without expert review

  • Full autonomous agency across complex workflows

  • Genuinely novel products with no historical pattern to learn from

Even with those limits, going from an hour per classification to a few seconds is enough to change the operating model. Alzate was precise about what "agent" means in AI. An agent combines three components: a brain (the LLM), memory (to refer back to prior decisions), and tools (the ability to query a database or call an API). Agency proper, the ability to make and execute decisions autonomously, is still developing. In some compliance contexts, that limit on autonomy is by design.
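
Alzate's three components map naturally onto a small data structure. The sketch below is one possible reading of that description, with a review gate added to reflect the "by design" point; the class, field names, and the lambda stand-ins are assumptions, and no real LLM or database is called.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    brain: Callable[[str, List[str]], str]   # stand-in for the LLM: picks a tool
    tools: Dict[str, Callable[[str], str]]   # e.g. database or API lookups
    memory: List[str] = field(default_factory=list)  # prior decisions
    autonomous: bool = False                 # kept off by design in this sketch

    def handle(self, request: str) -> dict:
        tool_name = self.brain(request, self.memory)  # brain chooses a tool
        result = self.tools[tool_name](request)       # tool does the lookup
        self.memory.append(f"{request} -> {result}")  # memory keeps the outcome
        status = "executed" if self.autonomous else "awaiting human review"
        return {"proposal": result, "status": status}

# Hypothetical stand-ins for a model and a lookup tool.
agent = Agent(
    brain=lambda request, memory: "lookup",
    tools={"lookup": lambda request: f"candidate for {request}"},
)
outcome = agent.handle("widget")
print(outcome)
```

With `autonomous` left at its default, every result comes back as a proposal awaiting review, which is the posture the episode recommends for compliance work.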

Reasonable Care, Regulators, and the Risk of Waiting

The right compliance question is not "Will CBP let me use AI?" It is "Am I using AI in a way that demonstrates reasonable care?" CBP itself uses AI internally. Gould framed what reasonable care means in practice: "What I'm going to do as a human is go do all kinds of research and create a body of work that becomes the description. I can do that using AI far more efficiently."

Stefanutti's analogy captures the posture cleanly. AI today is like a junior analyst on the team. It can do the research and present findings. A licensed broker would not let a junior analyst submit filings without review, and should not let AI do so either.

On integration, Alzate advised starting standalone. Validate the value on a real workload before investing in TMS, ERP, or ABI integration. Most teams get value from standalone tools first, integrated APIs second.

The risk of not starting is no longer theoretical. Gould already sees brokers who have not adopted the technology struggling to keep up.

The Risk of Waiting and Where AI Is Going Next

The flip side of "AI as compliance enabler" is the risk of not adopting it. Gould was direct about what he is starting to see in the broker community. "Some customs brokers that are not adopting the technology are getting overwhelmed. I'm concerned that some of them are going to drop things, miss things, and suffer because of that."

The concrete near-term scenario Gould cited is the pending Supreme Court ruling on IEEPA constitutionality. When that ruling arrives, importers will flood their brokers with refund and reclassification questions. Brokers already overwhelmed by the day-to-day will not survive that surge. "Companies that are prepared by implementing the technology today and becoming more efficient and more compliant, I see them as being in a better position to handle whatever unknown is coming down the road. We all know based on the last year that there's going to be a lot more unknowns."

On information security, Alzate flagged five areas any company evaluating an AI vendor should pressure-test:

  • Data residency and handling: Where does product data go on submission? Is it stored, for how long, and is it used to train models? Reputable vendors should be transparent.

  • Multi-tenancy isolation: Multiple customers will share infrastructure. Their data should be physically and logically segregated.

  • Audit logging: Trade compliance requires knowing what was classified, when, by whom, and why, end to end.

  • On-premise and private deployment options: For high-sensitivity contexts, the option to run AI within the customer's own environment matters.

  • Prompt injection and input validation: Wherever a user can input free text, an attacker can attempt to manipulate the model's behavior, including extracting sensitive information from system prompts. This is a new class of threat that did not exist in pre-LLM software.
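
For the audit logging item, it helps to see what a record that answers "what, when, by whom, and why" might contain. This is a sketch under assumptions: the field names are illustrative, the values are placeholders, and the hash chaining is one common tamper-evidence technique added here for the example, not something described in the episode.

```python
import hashlib
import json
from datetime import datetime, timezone

def _digest(record):
    """SHA-256 over the record's fields, serialized in a stable key order."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def audit_entry(tenant_id, user, product, code, rationale, prev_hash=""):
    """Build one audit record covering what, when, by whom, and why."""
    entry = {
        "tenant_id": tenant_id,  # keeps each customer's records separable
        "what": {"product": product, "code": code},
        "when": datetime.now(timezone.utc).isoformat(),
        "by_whom": user,
        "why": rationale,
        "prev_hash": prev_hash,  # links this record to the previous one
    }
    entry["hash"] = _digest(entry)
    return entry

def is_intact(entry):
    """Recompute the hash to detect edits made after the record was written."""
    body = {k: v for k, v in entry.items() if k != "hash"}
    return _digest(body) == entry["hash"]

# Placeholder values throughout.
first = audit_entry("tenant-a", "analyst-1", "sample product",
                    "0000.00.0001", "matched on material and use")
second = audit_entry("tenant-a", "broker-1", "sample product",
                     "0000.00.0001", "reviewed and approved",
                     prev_hash=first["hash"])
print(is_intact(first), second["prev_hash"] == first["hash"])
```

When evaluating a vendor, the useful question is whether their logs can answer all four questions for any past classification, and whether a record could be altered after the fact without anyone noticing.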

Looking five years out, Alzate flagged four trends for compliance teams to track. 

  1. Multimodal AI (processing text, image, and video together)

  2. Reasoning models that think before answering

  3. Continued agency improvements

  4. Private on-premise AI 

On private deployment, he was specific: "Right now we cannot try to have private AI because there is still a huge gap with respect to the major providers. But the gap is getting closer and closer. In five years, I anticipate that you have everything running on your private cloud, your system built entirely on premise, and customer data never leaving your premises."

Gould's parting advice for brokers and importers still on the sidelines was the simplest line of the episode. "Pick something and start. You have to do it right now and you have to do it quickly because if you don't, you're going to be left behind. There are a lot of tools out there that will help you become far more efficient without a tremendous amount of investment of time or money."

Conclusion

The exam scores tell the structural story. General AI is a pattern matcher trained on internet-scale data. Specialized trade AI connects to official tariff schedules, validates every code against live databases, and runs multi-stage workflows that catch hallucinations before they reach a compliance file. The 75% pass threshold is not an arbitrary number. It is the point at which the architecture earns or loses the right to be trusted.

AI used to gather information and build defensible documentation fits within the reasonable care standard. AI used as a black box that auto-files without human review does not. The brokers and compliance teams prepared with the right tools today will handle whatever comes next. The brokers waiting will not.

See What Specialized Trade AI Can Do for Your Operations

The gap between general AI and specialized trade AI is architectural, not a matter of prompt engineering or model version.

Gaia Dynamics builds AI specifically for customs and trade compliance. The platform connects to the official USITC HTS database, validates every classification against live tariff schedules, and runs the multi-stage verification workflow that general LLMs cannot replicate:

  • HTS classification from 18,000+ codes narrowed to the most relevant candidates in seconds, with full audit trails documenting every decision

  • Automatic flag review for PGA requirements, Section 232, Section 301, IEEPA, and AD/CVD duties

  • Historical ruling lookup that surfaces relevant CBP rulings faster than manual Federal Register searches

  • SOC 2-compliant data handling with multi-tenancy isolation and prompt injection protection

Standalone deployment gets value in days. API integration with your TMS or ERP follows once ROI is validated on your real workload.

Explore Gaia Dynamics and see how the platform performs on your actual classification workload.

Listen to the Full Episode

Listen to Episode 2 of the Trade and Tech podcast for the full discussion, including Carlos Alzate's explanation of agentic AI, multimodal models, and the road to private on-premise deployment for trade compliance.

Frequently Asked Questions

How can a specialized AI score 100% on the customs broker exam when ChatGPT scores 38%? 

The gap is architectural. General LLMs predict text from internet-scale pre-training with no connection to the official HTS database or current exclusions. Specialized tools validate every code against the live database and run multi-stage workflows that catch hallucinations before returning an answer.

Will CBP accept HTS classifications produced with AI assistance? 

CBP uses AI internally and has not published a policy prohibiting AI-assisted classification. The standard is reasonable care. AI used to gather information and build defensible documentation fits that standard. AI used as a black box that auto-files without human review does not.

Should a customs broker integrate AI with their TMS or ERP from day one? 

Start standalone. Validate the value on a real workload first. Integration unlocks benefits like automatic classification at catalog entry and unified audit trails, but it is multi-month engineering work. Validate first, then invest in integration once ROI is clear.

Can prompt engineering alone improve ChatGPT’s accuracy for customs classification?

Prompt engineering can improve how clearly a general model responds, but it does not fix the underlying limitation. General LLMs still lack access to the live HTS database, legal notes, and verification layers required for defensible classification. Accuracy in customs depends on structured workflows, validated data sources, and audit trails, which sit outside the model itself.