A week ago we said our chat model would be ready for an investor preview. It finished training on schedule: all metrics looked healthy, the loss curve descended cleanly, and the test suite stayed green. We loaded the weights, gave it a prompt, and got back an output that was structurally English-shaped but semantically nothing: every word a real word, every word independent of the previous one. Word salad made of real words.
This is the kind of failure that, if you only watched the numbers, you'd never see. So we spent yesterday and today doing a real forensic exam of why it happened, and we caught two genuine bugs in our training pipeline that combined to produce exactly this outcome. We're retraining with fixes right now. Both the failure and the fix are worth writing about.
What broke
The model finished training with the loss numbers we expected. Cross-entropy descended from 4.5 to 2.2 over a few hundred thousand steps. No alarms in our diagnostic logs. Train/inference parity checks passed throughout. By every internal metric we had, the run was healthy.
When we sampled outputs, we got things like:
"subsidies hooks qatar masculine sound 1947 heather hans..."
The words are real. The lengths are sensible. There's no token repetition, no mode collapse, no obvious sign of the model learning a single anchor and refusing to leave it. It just... doesn't form sentences. The word in position one has no relationship to the word in position two.
This was the first clue. Pure undertraining usually produces local coherence — three or four grammatical words then drift. We got zero local coherence. The model wasn't undertrained in the obvious sense; something more structural was happening.
The investigation
We wrote a small forensic tool that loads the saved checkpoint and looks at it across three dimensions: per-token statistics, per-layer statistics, and a comparison across "brains," our name for the architecture's multiple parallel sub-networks. Comparing them turned out to be the diagnostic that caught the real bug.
Three findings, each surprising:
Finding 1: every token's embedding had nonzero magnitude, but they all had nearly the same magnitude.
Across all 16,000 tokens in the vocabulary, the embedding L2 norm was crushed into a band of [0.13, 0.20]. The most frequent words, such as "the" and "of," had only marginally larger embeddings than rare words. Normally, a healthy model differentiates tokens by both direction (which dimensions are active) and magnitude (how much). Ours had collapsed the magnitude signal almost entirely. The model knew which tokens existed; it didn't know which were special.
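A check like this can be sketched in a few lines. This is a minimal illustration rather than our forensic tool: the function names and the toy embedding values are assumptions made up for the example.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of one embedding vector."""
    return math.sqrt(sum(x * x for x in vec))

def norm_band(embeddings):
    """Return (min, max, max/min) of the per-token L2 norms.

    `embeddings` is a list of vectors, one per vocabulary token.
    A max/min ratio close to 1 means magnitude carries almost no
    per-token information.
    """
    norms = [l2_norm(v) for v in embeddings]
    lo, hi = min(norms), max(norms)
    spread = hi / lo if lo > 0 else math.inf
    return lo, hi, spread

# Toy embeddings (hypothetical values, chosen to mimic the collapsed band).
toy_embeddings = [[0.13, 0.0], [0.0, 0.20], [0.1, 0.1]]
lo, hi, spread = norm_band(toy_embeddings)
```

In this toy case the largest norm is only about 1.5 times the smallest, which is the same kind of narrow band described above.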
Finding 2: half the model's middle layers were near-zero.
We broke out the per-layer weight statistics for the only sub-network that gets meaningful gradient. Layers 0 and 7 had roughly 5-8% of their weights at meaningful magnitudes. Layer 5 had 1.1%. The middle layers had effectively been pulled toward zero.
The shape of this told us something specific: gradients were vanishing through the middle of the network. Layer 0 gets strong signal from the embedding gradient path; layer 7 is adjacent to the output projection. Layers in the middle had to carry signal through the recurrent stack, and that signal was dying along the way.
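The per-layer measurement can be sketched as below. The threshold of 1e-3 for a "meaningful" weight, the function names, and the toy layers are all assumptions for illustration; the right threshold depends on the initialization scale.

```python
def fraction_meaningful(weights, threshold):
    """Share of weights whose absolute value exceeds `threshold`."""
    if not weights:
        return 0.0
    return sum(1 for w in weights if abs(w) > threshold) / len(weights)

def layer_report(layers, threshold=1e-3):
    """Map layer index -> fraction of weights above `threshold`.

    `layers` is a list of flattened weight lists, one per layer.
    The default threshold is a hypothetical choice, not a standard.
    """
    return {i: fraction_meaningful(w, threshold) for i, w in enumerate(layers)}

# Toy data: edge layers keep most of their weights, the middle one does not.
toy_layers = [
    [0.5, -0.4, 0.0002, 0.3],     # first layer
    [0.0001, -0.0003, 0.0, 0.2],  # middle layer
    [0.6, 0.0004, -0.5, 0.1],     # last layer
]
report = layer_report(toy_layers)
```

A dip in the middle of the report, with higher values at both ends, is the pattern that points toward signal dying inside the stack rather than uniform undertraining.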
Finding 3: the comparison that broke the case open.
We compared the trained sub-network to its parallel siblings. Same starter weights, same training data, very different training schedules. The siblings — which received much less training — sat at their initialization values to four decimal places, with healthy weight magnitudes (mean L2 ≈ 1.0, full dynamic range used). The one we actually trained had mean L2 ≈ 0.15.
That's a 6.6× shrinkage during training. The model wasn't learning toward something; it was being pulled toward zero.
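The sibling comparison reduces to a ratio of mean norms. The sketch below assumes "mean L2" means the average L2 norm over the rows of a weight matrix, which is one plausible reading; the helper names and toy matrices are hypothetical.

```python
import math

def mean_row_norm(matrix):
    """Mean L2 norm over the rows of a weight matrix (a list of lists)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in matrix]
    return sum(norms) / len(norms)

def shrinkage_vs_reference(reference, trained):
    """How many times smaller the trained weights are than the reference."""
    return mean_row_norm(reference) / mean_row_norm(trained)

# Toy matrices using the rounded figures from the text (1.0 and 0.15).
sibling = [[1.0, 0.0], [0.0, 1.0]]
trained = [[0.15, 0.0], [0.0, 0.15]]
ratio = shrinkage_vs_reference(sibling, trained)  # about 6.67
```

With the rounded inputs the ratio comes out near 6.7, close to the figure quoted above; the small gap is presumably rounding in the reported norms.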
The two bugs
The cross-brain comparison made the cause visible. Our optimizer applies decoupled weight decay: every step, every weight gets multiplied by (1 - lr × wd) regardless of its gradient. That's a normal regularization tool when configured carefully. We had it configured at wd = 0.01, which at our learning rate compounds over 450,000 training steps into a multiplicative shrinkage of about 10×. The gradient signal has to be strong enough to push back against that shrinkage on every update, or weights drift toward zero.
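The compounding arithmetic can be sketched as follows. The learning rate is not stated in this post, so `lr_for_shrinkage` is an illustrative helper that backs out the constant learning rate at which decay alone would produce a given shrinkage. It ignores warmup, schedules, and the gradient term, so treat the number as a rough order of magnitude only.

```python
def decay_factor(lr, wd, steps):
    """Fraction of a weight's magnitude left after `steps` updates of
    decoupled weight decay, assuming zero gradient and constant lr."""
    return (1.0 - lr * wd) ** steps

def lr_for_shrinkage(shrink, wd, steps):
    """Constant learning rate at which decay alone shrinks weights
    by a factor of `shrink` over `steps` updates (hypothetical helper)."""
    return (1.0 - shrink ** (-1.0 / steps)) / wd

# Figures from the text: wd = 0.01, 450,000 steps, about 10x shrinkage.
implied_lr = lr_for_shrinkage(10.0, 0.01, 450_000)
remaining = decay_factor(implied_lr, 0.01, 450_000)
```

Under these assumptions the implied learning rate is on the order of 5e-4. The per-step factor is tiny, but multiplied 450,000 times it removes about 90% of any weight that gradients do not actively hold up.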
Our gradient signal wasn't strong enough.
The reason: a separate bug in our cross-entropy "funnel" code path. Funnel CE is a technique where, instead of computing softmax over the full vocabulary on every step, you compute it over a smaller candidate set (in our case, 200 tokens). Done right, this is a nice optimization for early training when the model can't yet use full-vocab gradient anyway. Done wrong, it permanently restricts what the model can learn.
We had it done wrong. The annealing direction in our code was reversed: instead of starting narrow and widening to full vocabulary as training progresses, it started wide and narrowed down to 200 candidates, then locked there. So 99% of training computed cross-entropy on a 200-token subset. The per-step gradient was correspondingly small.
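The two directions can be contrasted with a small schedule sketch. The linear interpolation, the function names, and the 4,500-step anneal length (1% of a 450,000-step run, inferred from the 99% figure above) are assumptions for illustration, not the actual trainer code.

```python
def funnel_size(step, anneal_steps, narrow=200, full=16000):
    """Intended schedule: start at `narrow` candidates, widen linearly
    to the full vocabulary, then stay fully open."""
    if step >= anneal_steps:
        return full
    frac = step / anneal_steps
    return round(narrow + frac * (full - narrow))

def funnel_size_reversed(step, anneal_steps, narrow=200, full=16000):
    """Buggy direction: start wide, narrow linearly to `narrow`
    candidates, then lock there for the rest of training."""
    if step >= anneal_steps:
        return narrow
    frac = step / anneal_steps
    return round(full - frac * (full - narrow))

total_steps, anneal_steps = 450_000, 4_500  # hypothetical anneal length
locked_fraction = (total_steps - anneal_steps) / total_steps
```

With the reversed schedule, every step after the anneal sees only 200 candidates; with these assumed numbers that is 99% of the run.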
Small gradient + constant weight decay over 450,000 steps = the weights drift toward zero. Mid-layers, which receive the smallest gradient signal due to depth, drift fastest.
Two independent bugs, each survivable on its own, lethal together.
What we changed
We made three concrete fixes:
Fix 1: weight decay is now configurable per-run, defaulting to 0.0 for fine-tuning runs and 0.01 only for from-scratch pretraining. The previous code path hardcoded it at 0.01 and ignored the user's setting entirely. This was a bug we caused ourselves and never noticed because everything before this run was either short enough that decay didn't matter or healthy enough to absorb it.
Fix 2: the funnel annealing direction is corrected. It now starts narrow (focus on a small candidate set when the model is dumb) and widens to the full vocabulary as training progresses. After the anneal period, the funnel is fully open, with no permanent restriction.
Fix 3: we're shipping a forensic tool alongside the trainer that can examine any saved checkpoint and surface this class of bug automatically. If we'd had this tool last week, we'd have caught the weight collapse around step 50,000 and saved most of the run. Going forward, every checkpoint gets inspected before we declare it shipped.
What we're doing right now
Retraining. The fixed binary is running on a fresh GPU instance as we write this. Same architecture, same data, same compute budget. The only differences are the two bug fixes plus the explicit decision not to enable the funnel for this fine-tuning phase at all.
We're being careful with the budget this time — we have an exact projection of what each phase will cost, a sanity-check eval scheduled between phases so we can abort early if something looks wrong, and the forensic tool ready to inspect the result before we declare it shipped.
If the diagnosis is right, we should see:
- The trained sub-network's mean weight magnitude staying near initialization rather than collapsing.
- Free-prompt generation that produces at least sentence-shaped output, not word salad.
- A meaningful difference between predicted distributions for different prefixes (a model that's actually conditioning on context).
If we see all three, the chat model ships. If we see one or two but not all three, we have a different bug to chase, but at least we know the obvious one is gone.
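The third check, whether predictions actually depend on the prefix, can be quantified with a distance between next-token distributions. The sketch below uses total variation distance as one reasonable choice; the function names and toy distributions are assumptions, and in practice the distributions would come from the model's softmax output for several different prompts.

```python
from itertools import combinations

def total_variation(p, q):
    """Total variation distance between two probability distributions
    over the same vocabulary (0 = identical, 1 = disjoint support)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mean_pairwise_tv(distributions):
    """Average total variation distance over all pairs of distributions."""
    pairs = list(combinations(distributions, 2))
    if not pairs:
        raise ValueError("need at least two distributions to compare")
    return sum(total_variation(p, q) for p, q in pairs) / len(pairs)

# Toy next-token distributions for three different prefixes.
context_blind = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
context_aware = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
blind_score = mean_pairwise_tv(context_blind)
aware_score = mean_pairwise_tv(context_aware)
```

A score near zero across varied prefixes means the model is ignoring its context, which matches the word-salad symptom; a healthy model should score clearly above zero.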
Why we're writing this
A week ago we said the chat bot was coming. It's coming a week later than we said. The honest version of why is what you just read.
We could have hidden the failed run, fixed it quietly, and shipped only when it worked. We didn't, because the failure is more interesting than the success would have been. Every model release that lands smoothly hides a hundred decisions that could have gone the other way. When you watch only the wins, you learn nothing about what the work actually looks like. When you watch a careful postmortem of a failure, you learn how the team thinks.
We think this matters more than looking impressive on the first try. We'd rather ship a real picture of what building a small-team LLM looks like than a polished narrative.
The retry is running. We'll know in about 18 hours whether it worked.
What's next
- Finish the retry; eval and ship the chat model if the fixes hold.
- Publish the forensic tool as part of our open-source release once weights ship.
- Write up the architecture in a proper preprint — the negative result and the methodology that caught it both go in the methods section.
- Hold the public injection-resistance demo we promised, on the new weights.
If you're following along, want to test the chat bot once it ships, or are interested in custom-architecture LLM research at consumer scale, reach out — vicente@vinte.app.
— Vinte team