Understanding Frontier Models Means Understanding Their Limitations

Justin T. Huang, May 20, 2026

Recently I asked a frontier model why a one-minute TikTok was funny.

The model went to some lengths to answer. It transcribed the audio, sampled the background track to identify the song, and pulled frames every three seconds to read the visuals. On its maximum reasoning setting it spent about twenty minutes working through all of this before returning a careful, structured analysis of the clip.

The analysis was wrong. The humor turned on a pun and a deadpan reply to a ridiculous line, the kind of thing a child would catch on the first watch, and the model never got there.

How a chatbot analyzes a video

It helps to know what the model was actually doing in those twenty minutes. A chatbot does not watch a video the way we do. It breaks the file into channels it can handle separately, so the speech runs through transcription, frames are sampled every few seconds and passed to an image model, and the background audio is matched against a song identification service. Each of those channels returns text or structured data, and the model reasons over that material the way it reasons over any other prompt. Every step works well on its own, so by the time the model began to think, it held an accurate and fairly complete record of what the clip contained.
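To make the shape of that pipeline concrete, here is a minimal sketch of its last step, where the separate channels are folded into a single text prompt. It is an illustration under stated assumptions, not the code any particular chatbot runs: `ClipRecord`, `sample_timestamps`, and `build_prompt` are hypothetical names, and the transcript, captions, and song title are placeholders standing in for what real transcription, image, and song identification services would return.

```python
from dataclasses import dataclass


@dataclass
class ClipRecord:
    """Text-only record of a clip. All field names are hypothetical."""
    transcript: str
    frame_captions: list[tuple[float, str]]  # (timestamp in seconds, caption)
    song: str


def sample_timestamps(duration_s: float, every_s: float) -> list[float]:
    """Timestamps at which frames are pulled: 0, every_s, 2 * every_s, ...

    Only timestamps strictly before the end of the clip are returned.
    """
    if every_s <= 0:
        raise ValueError("every_s must be positive")
    stamps = []
    i = 0
    while i * every_s < duration_s:
        stamps.append(i * every_s)
        i += 1
    return stamps


def build_prompt(record: ClipRecord, question: str) -> str:
    """Flatten every channel into the plain text the model reasons over."""
    lines = [
        f"Question: {question}",
        "",
        "Transcript:",
        record.transcript,
        "",
        f"Background song: {record.song}",
        "",
        "Frames:",
    ]
    for t, caption in record.frame_captions:
        lines.append(f"[{t:.0f}s] {caption}")
    return "\n".join(lines)


# A one-minute clip sampled every three seconds, with placeholder content.
stamps = sample_timestamps(duration_s=60, every_s=3)
record = ClipRecord(
    transcript="(transcript text goes here)",
    frame_captions=[(t, "(caption goes here)") for t in stamps],
    song="(song title goes here)",
)
prompt = build_prompt(record, "Why is this clip funny?")
```

In this sketch a one-minute clip becomes twenty captioned stills, a transcript, and a song title. Whatever happened between two timestamps is simply absent from the prompt.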

Seen that way, the failure stops being mysterious. The model never experienced the clip; it studied a description of one. A pun depends on holding two meanings at once and feeling them collide; the timing of a deadpan only works if you are carried along in real time. Neither survives translation into a transcript and a grid of screenshots. A child laughs because she watched the clip; the model wrote an essay because a description was the only access it had. The lesson generalizes: when you understand how a model actually works, its limits become something you can see coming rather than something that surprises you.

A second limit, in the training

The same logic points to a second limitation, one that matters more for marketing, and it comes from what the model is trained to do. A foundation language model learns by predicting the next token in a sequence, given everything that came before. The fine-tuning that follows, reinforcement learning from human feedback, rewards it for being helpful and for following instructions. Together these produce a system that is fluent, well-organized, and useful across an enormous range of tasks.
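The next-token objective is simple enough to write down. The sketch below is a toy version, assuming the model's predictions arrive as small dictionaries of token probabilities. The function name `next_token_loss` and the two-step example are invented for illustration; real training computes the same quantity over vocabularies of many thousands of tokens.

```python
import math


def next_token_loss(predicted: list[dict[str, float]], actual: list[str]) -> float:
    """Average negative log-probability the model gave to the token that came next.

    predicted[i] is the (toy) probability distribution the model produced at
    step i; actual[i] is the token that really followed in the training text.
    """
    if not actual or len(predicted) != len(actual):
        raise ValueError("need one prediction per actual token")
    total = 0.0
    for dist, token in zip(predicted, actual):
        total += -math.log(dist[token])
    return total / len(actual)


# Toy example: two prediction steps with made-up probabilities.
predicted = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
]
actual = ["the", "cat"]
loss = next_token_loss(predicted, actual)
```

The only inputs are the model's probabilities and the text that actually followed. Training pushes this number down, which is why lower loss means more plausible text.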

Neither objective has anything to do with persuasion. Predicting a plausible next word is not the same as moving a particular person, and an answer that annotators rate as helpful is not necessarily one that changes what a customer does. A frontier model will write you a polished subject line in seconds, but whether that line outperforms another one with your audience is a question its training never asked it to answer. The gap is not a flaw waiting to be patched in the next release; it sits outside what the objective was ever pointed at.

This is why generating copy with AI is only half the job. The model supplies fluency, range, and a strong first draft, but the evidence of which lines move which segments cannot come from the model itself. It has to come from putting the copy in front of real customers and measuring what they do.
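As one illustration of what that measurement can look like, here is a sketch of a standard two-proportion z-test comparing two subject lines. It is a common textbook method, assumed here for the example rather than taken from any particular platform. The function name `compare_variants` and the counts in the usage lines are made up.

```python
import math


def compare_variants(conv_a: int, sent_a: int, conv_b: int, sent_b: int):
    """Two-proportion z-test for variant B against variant A.

    conv_* is the number of recipients who acted (for example, opened or
    clicked); sent_* is the number who received that variant.
    Returns (rate_a, rate_b, z, two_sided_p_value).
    """
    if sent_a <= 0 or sent_b <= 0:
        raise ValueError("each variant needs at least one recipient")
    rate_a = conv_a / sent_a
    rate_b = conv_b / sent_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (conv_a + conv_b) / (sent_a + sent_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    if se == 0:
        # Nobody acted, or everybody did: the data show no difference at all.
        return rate_a, rate_b, 0.0, 1.0
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Made-up counts, for illustration only: 5,000 sends per subject line.
rate_a, rate_b, z, p_value = compare_variants(200, 5000, 260, 5000)
```

The point of the sketch is where its inputs come from. Every number passed to `compare_variants` is a count of what recipients did, and none of it can be produced by the model that wrote the lines.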

Cromulent is built around that division of labor. The platform generates copy from explicit persuasion hypotheses, ships each version as a live campaign variant, and updates its picture of which framings work for which audience segments as the results come in.