Flint. A model built for inspiration.
While industrial AI races toward reasoning and accuracy, we realised that for creative industries, the 'correct' answer is often the least interesting one. So today, we're introducing Flint: a model designed specifically to inspire, not to give you the answers.
We started Springboards with the mission to spark better ideas in people. But after building for three years, one thing has become impossible to ignore: frontier models are getting smarter, faster, and more polished, while their outputs are getting eerily similar and more repetitive.
For a lawyer or an accountant, convergence can be a feature. For a strategist, writer, marketer, comedian or creative team, it is a bug.
So we built the model we needed ourselves.
Flint is a small model with big implications. As the first language model designed specifically around inspiration, it achieves a dramatic increase in output diversity on creative tasks WITHOUT degrading performance in other areas.
In other words, it’s a model with entropy in the right places.
We are releasing an alpha version of Flint today via the Springboards app.
Divergent and convergent thinking are fundamental elements of the creative process.
Divergent thinking is the act of going wide and exploring possibilities, while convergent thinking narrows those options down to a single solution. Both are important, but frontier LLMs are particularly poor tools for divergent thinking due to their limited output diversity. By design or by accident, nearly every LLM in the world converges on the same small set of answers, even for open-ended questions, a phenomenon known as “mode collapse”.
As a result, if used as a tool for brainstorming or ideation, LLMs are likely to lead us all to the same place, and make the world a lot less interesting.
With successive releases, convergence amongst the frontier models has only gotten worse. AI companies are optimising for accuracy across domains like science, mathematics and coding. Hallucination is treated as failure. But there is a whole class of creative and open-ended tasks for which divergence is much more important than accuracy. Flint is built for these tasks, so we have dubbed it a divergence model.
Concretely, Flint is trained to have higher entropy at the key moments in a generation that lead to substantively different answers. Instead of consistently reinforcing the highest-probability path, Flint learns to produce a higher-entropy probability distribution wherever multiple valid generation paths exist. This allows less obvious ideas and answers to emerge.
The result is structured variation. Less repetition, less slop and more range.
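To make "entropy in the right places" concrete, here is a toy numpy sketch. It is our illustration, not Flint's training code, and temperature here is just a stand-in for the flatter distributions Flint learns to produce at training time:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher = more spread-out distribution."""
    return -np.sum(p * np.log(p + 1e-12))

# Toy next-token logits at a "branch point", where several
# continuations are all plausible.
logits = np.array([3.0, 1.2, 1.0, 0.5, -1.0])

peaked = softmax(logits, temperature=0.3)  # convergent: mass piles onto one token
flat = softmax(logits, temperature=1.5)    # divergent: mass spread across tokens

print("peaked:", np.round(peaked, 3), "entropy:", round(entropy(peaked), 2))
print("flat:  ", np.round(flat, 3), "entropy:", round(entropy(flat), 2))

# Sampling ten times from each makes the difference tangible: the peaked
# distribution almost always picks the same token, the flat one explores.
for name, p in (("peaked", peaked), ("flat", flat)):
    print(name, "picks tokens:", rng.choice(len(p), size=10, p=p))
```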
NoveltyBench is a benchmark that measures how many meaningfully distinct responses a model generates across ten samples of the same open-ended prompt.
On NoveltyBench, Flint α produces a mean of 7.47 distinct responses out of 10.
In comparison, the SOTA models perform significantly worse.
Gemini 3.1 Pro scores 3.19.
GPT-5.4 scores 2.54.
Claude 4.6 Sonnet scores 1.83.
Flint also more than doubles the NoveltyBench score of its base model, Qwen3-30B-A3B, which scores 3.11. This shows Flint is not just a lightly remixed Qwen. It behaves like a different creative instrument.
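For intuition, here is a rough sketch of how a mean-distinct score can be approximated: greedily cluster the 10 responses by embedding similarity and count the clusters. Two caveats: NoveltyBench uses its own equivalence judgment rather than this heuristic, and the encoder name and 0.8 threshold below are arbitrary choices of ours.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def count_distinct(responses, threshold=0.8):
    """Greedy clustering: a response joins an existing cluster if its
    cosine similarity to that cluster's representative exceeds `threshold`;
    otherwise it starts a new cluster. Returns the cluster count."""
    embeddings = encoder.encode(responses, normalize_embeddings=True)
    representatives = []
    for emb in embeddings:
        if not any(util.cos_sim(emb, rep).item() > threshold
                   for rep in representatives):
            representatives.append(emb)
    return len(representatives)

# Ten samples of the same open-ended prompt from one model:
samples = [
    "Launch a pop-up museum of failed ideas.",
    "Open a pop-up museum showcasing ideas that failed.",  # same idea, reworded
    "Let customers name the product via a public vote.",
]
print(count_distinct(samples))  # closer to 10 = more diverse
```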
[Chart: NoveltyBench mean distinct responses · average number of meaningfully distinct responses out of 10 samples · higher = more diverse · source: novelty-bench.github.io]
Browse 10 generations per prompt per model to see how diversity manifests in actual outputs. Explore the difference for yourself!
Another test we ran looked at how similar a model’s responses are when given the exact same prompt repeatedly.
We used prompts from the NeurIPS 2025 Artificial Hivemind paper and measured the similarity of outputs using cosine similarity, a standard metric that compares how closely related two responses are.
The scale runs from 0 to 1: 0 means completely different, 1 means effectively identical.
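The metric itself is simple. Assuming standard pairwise averaging (our reconstruction of the computation, not published code), the intra-model score for one prompt is just the mean cosine similarity over all pairs of the 50 responses:

```python
import numpy as np

def mean_pairwise_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of responses.
    `embeddings` is an (n_responses, dim) array, e.g. 50 x 768."""
    X = np.asarray(embeddings, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sims = X @ X.T                                 # full cosine-similarity matrix
    upper = np.triu_indices(len(X), k=1)           # each pair once, no diagonal
    return float(sims[upper].mean())
```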
When we sample 50 responses to the same query, most models stay within a tight, repetitive band.
Flint does not.
Its mean intra-model similarity is 0.721. Lower is better.
For comparison:
GPT-5.4: 0.864
Gemini 3.1 Pro: 0.871
Claude 4.6 Sonnet: 0.905
Same prompt. Far more range.
[Chart: Intra-model similarity · mean similarity across all queries · lower = more diverse · same prompt, 50 times · Flint: 0.721, everyone else: 0.864–0.905]
The per-query distribution view makes the difference even clearer.
Other models cluster tightly in the high-similarity zone, mostly around 0.8 to 1.0. Flint spreads much wider, with some prompts dropping as low as 0.12 similarity across samples.
On prompts like writing a confession from a mathematician, inventing a new emotion, generating a manifesto from an unusual perspective, or imagining what gravity would feel like in reverse, Flint keeps exploring. Other models tend to return the same answer in slightly different clothes.
This is the heart of the model. Not synthetic randomness. Real divergence on open-ended creative tasks.
[Chart: Per-query distribution · each point = one query's mean similarity score across 50 responses · other models cluster, Flint ranges]
Across 100 queries, the overall mean inter-model similarity is 0.740.
Flint’s average similarity to other models is just 0.672, making it the most distinctive model in the set.
The most similar pair in the comparison is Claude 4.6 Sonnet and Gemini 3.1 Pro at 0.759.
[Chart: Inter-model similarity · mean cosine similarity between model pairs across 100 queries · Flint averages 0.672, the most distinctive model in the comparison]
This is where Flint gets really interesting.
On MMLU-STEM, Flint scores 78.9% overall, identical to Qwen3-30B-A3B.
What matters is the finding underneath it: divergence tuning does not have to be a tax on capability. You can train a model to range more widely without gutting what it already knows.
[Chart: Benchmark accuracy · grouped bars sorted by Flint α accuracy · divergence without collapse: Flint 81.5% vs 82.0% for Qwen3-30B-A3B]
Flint's divergence training also preserves its responsible-AI performance. On TruthfulQA MC1, Flint scores 34.4% versus 34.0% for Qwen3-30B-A3B, and on ToxiGen standard accuracy, Flint leads with 59.6% compared to 58.1%.
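For readers who want to run the same style of check against an open model, EleutherAI's lm-evaluation-harness covers all three benchmarks; a minimal sketch (task names as of v0.4, and since Flint's weights aren't public, its base model stands in below):

```python
# pip install lm-eval
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-30B-A3B",  # swap in your own checkpoint
    tasks=["mmlu_stem", "truthfulqa_mc1", "toxigen"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```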
Frontier models are converging. That makes them powerful across reasoning, coding, planning, and knowledge work. But it also makes them predictable. The same patterns reappear. The same shapes repeat. Over time, consistency starts to flatten range.
Flint shows there is another path. We have dramatically increased output diversity without significantly degrading performance on other benchmarks. A small model can expand the space of possible responses, reduce repetition, and stay coherent enough to be useful. That is the breakthrough.
It also points to a better creative workflow. Flint is not a replacement for frontier models. It is a multiplier. Flint generates range. Larger models provide depth, knowledge, and reasoning. Humans apply taste and judgment. Instead of one polished answer, you get multiple starting points worth pursuing.
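As a sketch of that workflow in code (everything here is hypothetical: Flint currently lives only in the Springboards app and has no public API, so stub clients stand in for real ones):

```python
from dataclasses import dataclass

# Stand-in clients: neither of these is a real API. `StubModel`
# just makes the sketch runnable end to end.
@dataclass
class StubModel:
    name: str
    def generate(self, prompt: str) -> str:
        return f"[{self.name} response to: {prompt[:40]}...]"

flint = StubModel("flint")        # divergence model: generates range
frontier = StubModel("frontier")  # frontier model: adds depth and reasoning

def ideate(prompt: str, n: int = 10) -> list[str]:
    """Step 1: sample the divergence model many times to get range."""
    return [flint.generate(prompt) for _ in range(n)]

def develop(idea: str) -> str:
    """Step 2: hand a promising starting point to a frontier model for depth."""
    return frontier.generate(f"Develop this idea into a full concept:\n{idea}")

starting_points = ideate("Campaign ideas for a bank that wants people to spend less")
shortlist = starting_points[:3]   # step 3 is human: taste and judgment choose
print([develop(idea) for idea in shortlist])
```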
Flint is available now.
Only in the Springboards app. Give it a go and let us know what you think.