How we built Flint: training AI for higher entropy

July 14, 2026

Insights

Kieran Browne

Ask any AI for creative ideas and you'll get the same handful of answers. Every time. Across every model. Researchers call this the Artificial Hivemind effect. Springboards built Flint to fix it.

For a high-level overview of the model, visit Flint alpha model.

The convergence problem

The current class of state-of-the-art LLMs all show the same tendency to repeat a narrow set of answers, even when asked an open-ended query. Such repetitive suggestions make LLMs a poor tool for a broad class of tasks that includes creativity and brainstorming.

While the importance of "variation" in AI outputs is something Springboards has been championing since we got started back in 2023, the lack of diversity in LLM outputs is increasingly recognised as a significant challenge in AI research. The paper "Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)" (2025) by Liwei Jiang and colleagues discusses the extent of what they term the "Artificial Hivemind effect" [1]. Their paper demonstrates both the repetitiveness of individual models and, more surprisingly, a pattern of "inter-model homogeneity" whereby different models produce strikingly similar responses to the same queries.

A motivating experiment presented in their paper has 25 different AI models each generate 50 responses to the query "Write a metaphor about time." Despite the openness of the task, out of all 1250 responses generated, the vast majority centred on the notion that "time is a river", with a smaller cluster comparing time to a weaver.

The same phenomenon occurs reliably for other open-ended questions. When prompted to "generate a random number between 1 and 10", major models generally answer "7". When asked for travel recommendations, or band name suggestions, or tagline ideas for running shoes, the same small set of answers tends to arise regardless of the model or the user. This catastrophic loss of diversity in machine learning models is known as mode collapse (think "mode" as in "most common value in a dataset"). Though models are trained on archives of data as large and broad as the internet, they seem incapable of reproducing the diversity of their source material.

What exactly causes different models with different architectures, trained by different teams in different countries, to converge on the same small set of ideas remains a topic of active research. It is clear, though, that the lack of diversity in LLMs is transmissible to users. A 2024 study found that users of ChatGPT produce less semantically distinct answers in creative ideation tasks [2]. Similar effects have been observed in several studies with respect to homogenisation of creative and formal writing [3]. With the rapid adoption of LLM-based chat assistants, there are growing concerns about the long-term homogenisation of culture and thought, as users are only exposed to the same narrow set of ideas.

Flint: a divergence model

The convergence problem has been a long-term focus at Springboards. Our product is built for the early stages of creative exploration and our goal is to help users produce more novel and interesting work by helping them generate a broader set of ideas at the start. For this use case, divergence is key. Regardless of how "good" an idea might be, it is only useful as a thought-starter once. Moreover, because novelty is an important quality of creative work, a model which offers the same ideas to everyone is only likely to help produce work that feels generic.

Existing models, having collapsed to the average of the internet, are a poor tool for questions where the best answer is not the most common one, so in late 2024 we established a research team to try to crack this nut. The result of nearly two years of work is Flint: a language model built for divergence.

Flint alpha is a fine-tuned Qwen3-30B-A3B model, which dramatically increases output diversity compared to both the base model and current state-of-the-art models. Importantly, we have achieved the increased novelty without degrading the performance of the base model on closed-ended question-answering benchmarks, overcoming the perceived tradeoff between generalisation and diversity in existing fine-tuning methods [4].

Understanding convergence

LLMs are next-token predictors. That is, at every step of a generation they produce a probability distribution that assigns each token in the model's vocabulary a likelihood of occurring next based on the tokens that have come before. Text generation involves iteratively sampling from these probabilities.
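The loop described above can be sketched in a few lines. This is a minimal illustration only: `toy_next_token_probs` is a hypothetical lookup table standing in for a real model's forward pass, and its probabilities are made up for the example.

```python
import random

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for an LLM: maps the tokens so far to
    a probability for each candidate next token."""
    table = {
        "<start>": {"Sure": 0.99, "Okay": 0.01},
        "Sure": {"!": 1.0},
        "Okay": {"!": 1.0},
        "!": {"<end>": 1.0},
    }
    # A real model conditions on the whole context; this toy only
    # looks at the most recent token.
    return table[tokens[-1]]

def generate(rng, max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = toy_next_token_probs(tokens)
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        # Sample one token from the predicted distribution.
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":
            break
        # Append it and feed the longer sequence back in.
        tokens.append(next_token)
    return tokens[1:]

print(generate(random.Random(0)))
```

Each pass through the loop is one prediction step; the generated text is simply the sequence of sampled tokens.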

To understand why LLMs skew so heavily towards a few answers, it is useful to examine the probability distributions generated by an LLM during inference. Token-level predictions are essentially deterministic, apart from small fluctuations caused by floating-point precision errors. Actual LLM generations, however, tend to be probabilistic, because most sampling methods draw the next token at random from the predicted distribution; greedy decoding, which always takes the most likely token, is the main exception.

When a user makes a request like "Give me a number between 1 and 10", this text is tokenised and fed into the model to produce a probability distribution for the first token of the model's response. As a worked example, we will examine the distributions generated by the open-weight model Llama 2. For the first step of the generation, Llama 2 assigns the token "Sure" 99.66% likelihood and "Okay" 0.32%; all other tokens receive some infinitesimal probability. At this point, sampling methods usually use some metric (e.g. Top K, Min P) to exclude unlikely tokens, then rebalance the remaining probabilities and sample from the final distribution. In this case, given the >99% probability assigned, "Sure" becomes the first response token almost every time. This token is appended to the list of tokens, which is fed into the model again to predict the next token.
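As a rough sketch of the filter, rebalance and sample step, the snippet below applies Top K and Min P as they are commonly described to the first-token probabilities quoted above. The `<other>` entry is a placeholder for the remaining 0.02% of probability mass, and the values `k=2` and `min_p=0.05` are arbitrary choices for illustration, not settings used by any particular system.

```python
import random

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top one."""
    threshold = min_p * max(probs.values())
    return {t: p for t, p in probs.items() if p >= threshold}

def renormalise(probs):
    """Rebalance the surviving probabilities so they sum to 1."""
    total = sum(probs.values())
    return {t: p / total for t, p in probs.items()}

def sample(probs, rng):
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Figures quoted in the text; "<other>" lumps together every other token.
first_token = {"Sure": 0.9966, "Okay": 0.0032, "<other>": 0.0002}

filtered = renormalise(min_p_filter(first_token, min_p=0.05))
rng = random.Random(0)
draws = [sample(filtered, rng) for _ in range(1000)]
print(filtered, set(draws))
```

With these numbers the Min P filter leaves only "Sure", so every draw is the same token. Even with no filtering at all, "Sure" would be drawn in roughly 997 of every 1,000 samples.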

Over several iterations, the model produces the text "Sure! The number I am thinking of is…" and is now poised to tell us a number between 1 and 10, the crux of the user's request. However, though the top token predictions are mostly numbers in the appropriate range, the model assigns 96.4% likelihood to "7", only 3.5% to "5", and <0.1% to the third most likely token, "6". Subscript 7 and "seven" are both predicted higher than 3 and 9, and 1, 2 and 10 do not appear at all in the top ten tokens. Even if we were to sample from the probability distribution without any processing, 7 would still be the selected token almost every time, which is what we observe empirically.

Figure 1: Top ten tokens predicted by Llama 2 as continuations to "[INST] Give me a number between 1 and 10 [/INST] Sure! The number I am thinking of is…"

This example is illustrative of what we observe throughout LLM inference: even at stages in a generation where many possible valid continuations exist, we still find that LLMs assign >90% probability to very few tokens, or even just one. These low-entropy distributions are symptomatic of mode collapse. LLM generations may be technically stochastic, but the low entropy of the assigned probabilities makes most generations highly predictable.
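To put a number on "low entropy", the sketch below computes the Shannon entropy of the number-picking distribution and compares it with a uniform choice among ten numbers. The reported figures are rounded, so the third entry (0.001) is an assumption that lumps together "6" and all remaining tokens.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# "7" at 96.4%, "5" at 3.5%, everything else lumped into the last entry
# (an assumption made so the probabilities sum to 1).
collapsed = [0.964, 0.035, 0.001]

# What an even choice among the ten numbers would look like.
uniform = [0.1] * 10

print(f"collapsed: {entropy_bits(collapsed):.2f} bits")
print(f"uniform:   {entropy_bits(uniform):.2f} bits")
```

The collapsed distribution comes out at well under one bit, against log2(10), or about 3.32 bits, for the uniform case.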


Entropy and temperature

LLMs produce low-entropy probability distributions even when, intuitively, probabilities ought to be more balanced. In some sense, because LLM sampling methods already tend to be probabilistic, increasing the entropy of the LLM's predictions is enough to increase output diversity. At this point, people with some familiarity with LLMs are shouting "temperature!" at their screens. This is understandable given that received wisdom states that temperature is the go-to parameter for increasing "randomness" and "creativity" [5].

Temperature does indeed increase the entropy of LLM generations, but it is a blunt instrument, applied uniformly to each token in a generation. We have found when increasing temperature that generations tend to lose coherence before genuinely novel ideas emerge. This is consistent with what others have reported [6].

There are two key issues with simply using temperature to drive novelty. The first is that, even in open-ended contexts, there are still some points in the generation where there is only one reasonable next token, such as when the generation is partway through a word or must abide by the rules of grammar. In these circumstances the entropy needs to remain low to preserve coherence. The second is that, due to the shape and structure of the raw probabilities, it is not possible to raise the probability of reasonable continuation tokens without also significantly raising the probability of unreasonable tokens.
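The second issue can be seen in a toy calculation. Temperature T divides the logits by T before the softmax, which is equivalent to raising each probability to the power 1/T and renormalising. The distribution below is invented for illustration (it is not taken from Llama 2 or Flint): three plausible answers plus a long tail of 1,000 hypothetical junk tokens that together hold 0.5% of the probability.

```python
def apply_temperature(probs, temperature):
    """Equivalent to softmax(logits / T): p_i ** (1 / T), renormalised."""
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}

def junk_mass(dist):
    """Total probability sitting on the hypothetical junk tokens."""
    return sum(p for t, p in dist.items() if t.startswith("<junk"))

# Made-up distribution: a few sensible answers and a long unreasonable tail.
probs = {"7": 0.96, "5": 0.03, "3": 0.005}
probs.update({f"<junk{i}>": 0.000005 for i in range(1000)})

hot = apply_temperature(probs, 2.0)

print(f"P('5'):    {probs['5']:.3f} -> {hot['5']:.3f}")
print(f"junk mass: {junk_mass(probs):.3f} -> {junk_mass(hot):.3f}")
```

In this example, doubling the temperature lifts "5" only modestly while the junk tail grows from half a percent to the majority of the probability mass, because flattening helps every unlikely token at once and there are far more unreasonable tokens than reasonable ones.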


Training Flint

To drive useful divergence without losing coherence, higher entropy needs to be encouraged selectively, at moments with high potential for semantic branching. Our approach with Flint has been to optimise the model's raw probability distributions via training, rather than transforming them post-hoc via temperature or some other sampling method. This allows us to incentivise high or low entropy distributions at different stages as appropriate, and to work on improving the quality of distributions, not just the entropy.

Our training approach focuses on open-ended queries and tasks, and particularly on those tokens with a high potential for semantic branching: ones which are likely to change the meaning of the answer or lead a generation down a different path. We call these "critical tokens". As an example, in response to the query "which cities should I visit in Europe?", an LLM might begin with some preamble that has little impact on the semantics of the answer ("Sure, why not go to…"), then a token representing part or all of a place name (e.g. "Paris", "Rome", "Barcelona"), followed by some justification for why you ought to go there. Here the first token of the place name is a critical token in the generation, changing all that comes after. An LLM is perfectly capable of telling you why you should visit Cork, but because of mode collapse, a normal model is essentially incapable of recommending anything other than "Paris", "Rome" and "Barcelona" in response to that question. By focusing our training on the critical tokens, Flint is able to unlock latent information already present in the base model that is otherwise unreachable via open-ended querying.

How does Flint perform?

Flint alpha significantly increases output diversity without degrading the general capabilities of the model in closed-ended tasks. On NoveltyBench, Flint scores 7.47 mean distinct responses out of 10, more than double the score of its base model, Qwen3-30B-A3B, which scores 3.11. This also significantly improves on the scores of state-of-the-art models: Gemini 3.1 Pro scores 3.19, GPT-5.4 scores 2.54 and Claude 4.6 Sonnet scores just 1.83.
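To make "mean distinct responses" concrete, here is a simplified sketch of how such a score could be tallied: group each prompt's responses into sets that express the same idea, count the groups, and average across prompts. The benchmark's own equivalence judgement is not reproduced here; `same_idea` is a crude, hypothetical placeholder that only ignores case and surrounding punctuation, and the sample responses are invented.

```python
import string

def normalise(text):
    return text.lower().strip().strip(string.punctuation).strip()

def same_idea(a, b):
    """Placeholder equivalence check (an assumption): a real evaluation
    would need a far better judge of whether two answers are distinct."""
    return normalise(a) == normalise(b)

def distinct_count(responses, equivalent=same_idea):
    """Count responses that are not equivalent to any earlier one."""
    representatives = []
    for response in responses:
        if not any(equivalent(response, rep) for rep in representatives):
            representatives.append(response)
    return len(representatives)

def mean_distinct(responses_per_prompt, equivalent=same_idea):
    counts = [distinct_count(rs, equivalent) for rs in responses_per_prompt]
    return sum(counts) / len(counts)

# Invented responses for two prompts, five samples each.
samples = [
    ["7", "7.", "Seven", "7", "3"],
    ["Paris", "paris", "Rome", "Cork", "Paris!"],
]
print(mean_distinct(samples))
```

Note that the placeholder treats "7" and "Seven" as different answers, which is exactly the kind of surface-level variation a proper equivalence check should collapse.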

Flint also displays lower inter-model similarity; i.e. its responses are less similar to those of other models than those models' responses are to each other. What this means in practice is that when asked an open-ended question, Flint is able to surface many more distinct answers than leading models, and the answers it produces are less likely to be the same as those suggested by leading models.

Flint increases divergence and diversity in open-ended questions without compromising existing capabilities on closed-ended tasks. On MMLU-STEM, Flint scores 78.9% overall, matching the 78.9% of Qwen3-30B-A3B. On TruthfulQA MC1, Flint scores 34.4% versus 34.0% for Qwen3-30B-A3B, and on ToxiGen standard accuracy, Flint leads with 59.6% compared to 58.1%. While these scores are not competitive with much larger state-of-the-art models, staying on par with the base model suggests that training for divergence is not incompatible with general model capabilities.

For a more in-depth presentation of benchmark results, visit Springboards Flint alpha model.


Footnotes

[1] Jiang, Liwei, et al. "Artificial hivemind: The open-ended homogeneity of language models (and beyond)." Advances in Neural Information Processing Systems 38 (2026).

[2] Anderson, Barrett R., Jash Hemant Shah, and Max Kreminski. "Homogenization effects of large language models on human creative ideation." Proceedings of the 16th conference on creativity & cognition. 2024.

[3] Doshi, Anil R., and Oliver P. Hauser. "Generative AI enhances individual creativity but reduces the collective diversity of novel content." Science Advances 10.28 (2024): eadn5290.

Padmakumar, Vishakh, and He He. "Does writing with language models reduce content diversity?" International Conference on Learning Representations. 2024.

Sourati, Zhivar, et al. "The shrinking landscape of linguistic diversity in the age of large language models." arXiv preprint arXiv:2502.11266 (2025).

[4] Kirk, Robert, et al. "Understanding the effects of RLHF on LLM generalisation and diversity." International Conference on Learning Representations. 2024.

[5] OpenAI. "Best practices for prompt engineering with the OpenAI API."

[6] Janus. "Mysteries of mode collapse." LessWrong. (2022)


Springboards launches ‘Flint’ to break AI's habit of predictable, boring answers

SYDNEY, Australia and NEW YORK, NY, April 13, 2026: Springboards today announced the alpha launch of Flint, an AI tool for marketers and creatives designed to generate high-variance options and break out of predictable outputs.

Ask any LLM to pick a number between 1 and 10 and you will get a 7 followed by a 3 (or a 4) followed by a 9. This is because all LLMs tend to converge on a narrow set of predictable answers, even for open-ended queries. This makes them good at utility tasks like telling you the capital of France but terrible at creativity and brainstorming, where diverse ideas are essential. Flint has the opposite instincts. It is tuned to explore the model's latent knowledge and surface non-obvious directions quickly, repeatedly, and on demand to inspire better creative thinking. For creatives and marketers using the Springboards platform, where the model will be available exclusively, it means they are able to produce a wider spread of ideas and inspiration at the earliest stage of thinking.

“We never set out to become a model company. We set out to help people have better ideas.

But after three years building Springboards, one thing became impossible to ignore: frontier models were getting smarter, faster, and more polished, while their outputs were getting eerily similar and more repetitive,” said Pip Bingemann, Co-Founder and CEO of Springboards. “For a lawyer or an accountant, convergence can be a feature. For a strategist, writer, marketer, comedian or creative team, it is a bug. So we built Flint, the model we needed ourselves.”

A tiny but mighty creative inspiration model

Based on a lightweight, open-source foundation model, Flint favors speed and iteration over heavyweight “smartness.” In testing, it significantly outperformed leading LLMs on creative diversity, scoring 7/10 on the independent NoveltyBench compared to an average of 2.88. This means that when prompted ten times, Flint generates seven functionally distinct responses, rather than just offering surface-level paraphrases of the same idea.

“Flint is a tiny but mighty model that is significantly outperforming the world’s largest LLMs on the one metric that actually matters for the future of the creative industries: novelty,” said Kieran Browne, Chief Technology Officer of Springboards. “The reality is that frontier models are prioritising accuracy and correctness over originality and entropy. Flint is built on the belief that human taste and creativity must be at the core of good creative work; we are optimising for variation rather than automation. And what’s particularly exciting is that we have been able to achieve all of this without degrading the base model’s general capabilities, proving that you can train a model to range more widely without gutting what it already knows.”

A global standard for creative ideation

The launch of Flint marks a significant evolution for Springboards. Over the past three years, the company has transitioned from a specialized agency tool to a global platform, seeing massive momentum with hundreds of PR, media, creative, experiential and in-house client agencies across the US, UK and Australia, including TRG and BMF. With Flint, Springboards is upgrading its offering with an engine that provides the efficiency of AI without sacrificing the friction and unpredictability that make human ideas great.

"We're seeing a clear shift in the market from generalised AI and 'one model to rule them all' to models purpose-built in scale, cost, and design for specific capabilities—and creativity is one of the hardest specialties to crack. Pip and Amy understand the alchemy of a great idea from the inside—they're agency veterans who built the thing they wished existed—and Kieran is assembling one of the most capable AI research teams in Australia. Flint isn't AI as decoration. It's the engine the whole software product is built around. That's the kind of conviction we back." said Thomas Humphrey, Investments Partner at Blackbird.

New flexible tiers to suit all kinds of creatives

Alongside Flint, Springboards is also expanding its service tiers for the platform, opening up direct access to the model and a suite of tools through flexible plans, including free and paid tiers, for freelancers, small teams and boutique agencies for the first time. The addition of these flexible licensing options makes the platform more accessible to a global audience of strategists, creatives and marketers, lowering the barrier to entry while accelerating adoption at scale.

“Since day one, our customers have been at the centre of our innovation. Our goal has always been to build tools that enable advertisers and marketers to do their best work, and Flint is the culmination of that,” said Amy Tucker, Co-founder of Springboards. “We’re so excited to finally open this up to everyone, from solo freelancers to global agency teams. Whether you’re a strategist, a creative or marketer, you can now use our platform and model to explore your best ideas.”

“What if your imaginary strategy friend didn’t have to be imaginary? Springboards gets you to more curious places faster and helps sharpen your sense of what good, better, best looks like. Surrender to it,” said Christopher Owens, Head of Brand Strategy, TRG.

“Springboards is an incredible ideation platform and creative strategy partner. It surfaces ideas and insights that other models ignore and, in doing so, takes you down the most unexpected and refreshing creative paths,” said Anna Bollinger, Executive Planning Director, BMF.

As concerns grow around AI-driven sameness and over-automation, Springboards offers an antidote: a platform designed to enhance human creativity, not automate it away.

Flint is available globally from today.

To learn more or sign up, visit: springboards.ai or springboards.ai/models/flint-alpha

‍

About Springboards:

LLMs are built to be right. Springboards is built to be interesting.

Springboards is an AI platform for advertising and marketing teams who want better ideas, not just faster answers.

While most AI models converge on a single "correct" output, Springboards is built to expand the range of thinking.

It combines the world's leading AI models with Flint, its own model for creative divergence, to help teams explore more directions, without replacing human judgment or craft.

Founded by Pip Bingemann, Amy Tucker, and Kieran Browne, Springboards works with 100+ companies globally.

‍

For media inquiries, please contact: press@springboards.ai

Deep in the heart of Texas. Creativity is raging

As many have sung, “the stars at night are big and bright, deep in the heart of Texas” and let me say, the creative stars were big and bright in Deep Ellum, home to TRG. TRG hosted us for a night of Raging WITH the Machines, where Springboards co-founder and CEO Pip Bingemann opened the audience's eyes to the risks AI can pose to creative thinking and Dustin Ballard, TRG Creative Director and the mind behind There I Ruined It, got the crowd laughing and pondering what AI means for music.

‍

At Springboards, we proudly call ourselves a self-loathing AI company. Not because we hate AI, but because there can be such a negative connotation about it. Certain people with big platforms love to pitch AI as a magic button: run all your campaigns, replace strategists, no need for production or media teams. And that just isn’t what we believe. Yes, Springboards uses artificial intelligence, but it really only comes to life when human intelligence is in the driver's seat, pushing back and steering it.

‍

TRG’s Chief Technology Officer, Randy Bradshaw, spoke of the importance of keeping “humans in the loop”. Due to the nature of today’s industry, agencies don’t always have time to experiment and play; however, Randy shared that AI tools allow them to fail faster, which in turn allows them to learn faster and iterate quicker. Randy and Pip both hit the same key point: using AI comes with responsibility. We need to bring our critical thinking, lived experiences, and the knowledge of what is happening in the world around us, to whatever AI tool we are using.

Pip shared something Springboards’ research has found again and again: traditional LLMs are so good at bringing everyone to the same place (for example, they love recommending pepperoni as a pizza topping - what about eggs? Or pineapple?!).

Which is exactly what marketers need to watch out for.

He challenged the audience to input their favourite song into an LLM and ask it to “make it better”. Notice what happens. It sands off the edges. It will strip the emotion, the story telling, the rage, from the song. Who wants a dull song? Not me.

‍

Then Dustin took the stage and there certainly were no boring songs (P.S. if you aren’t following There I Ruined It, you need to). He reprised his well-known TED Talk, “Is AI Ruining Music” (yes, he confirmed Sir Richard Dawkins was in the audience listening to Baby Got Back), and challenged us to think about what music truly is. Much like the synthesiser was criticised when it was first used in popular music, the question now is: are musicians still “musicians” if they use AI?

Dustin’s takeaways were simple and to the point: consider the intent (is it additive, or just more content to try to increase stream counts?), consider whether the artist is trying to be deceptive about the use of AI, and then consider what the original musician might think. There are ways to leverage AI in music; you just need to be responsible about it.

The night wasn’t all talk though. The whole crowd helped rage with the machines as we sparked campaign ideas for the Deep Ellum neighbourhood of Dallas. Deep Ellum has a rich history of music and culture, but is struggling due to major infrastructure projects. Stores are closing and foot traffic is dying down. So we worked together as a group to spark some ideas of how we could hype up the neighbourhood during this challenging time. From celebrating the grit, to scavenger hunts finding the vibrant murals around the neighbourhood, all the way to robotic shoes that help you explore the history of Deep Ellum - the ideas were flowing. Ideas sparked with AI and brought to life by the people.

In the end, we all agreed: AI can, and sometimes should, be used to increase creativity - so long as humans are in the loop, of course. So let’s rage on!

Missed the night? Watch the full recording here.

What happens when AI makes creativity too effortless?

Recently our team ran an experiment to see how quickly we could go from concept to finished work. The experiment started inside our own creative process. We were playing in Springboards and explored nearly 20 different ways to talk about our brand. But we kept coming back to the elephant in the room – that these models are all giving people the same answers.

We wanted to lean into this and decided to dramatise the problem instead of explaining it.

We picked an ad, a recent spot from OpenAI, and flipped the ending to make a point about what happens when everyone uses the same tools – they end up getting sent to the same destination both in real life and creatively.

Our original thought was to see how quickly we could conceptualise this approach and to mock up the concept. What we didn’t expect was how quickly the work would become uncomfortably close to the original.

The acceleration of LLM adoption across marketing and creative industries over the past couple of years has been remarkable. These tools are being woven into workflows everywhere – from concepting to copywriting to production.

When deployed thoughtfully, generative AI can push creative boundaries and help teams explore territory they may never reach by themselves, or help to short-circuit work that could take days or weeks in that creative exploration to help teams move more quickly.

But LLMs are converging – and not enough of us are paying attention.

Recent research from MIT and other institutions — published as the “Artificial Hivemind” study — documents something many of us have felt but struggled to quantify: these models are gravitating toward remarkably similar outputs, even in open-ended scenarios where countless valid answers should exist.

The simple test is to ask your LLM to generate a random number between one and 10. More than 95% of the time you will get a seven, regardless of the model, where you live or your chat history. And while there are parallels with humans, who also pick seven most often (about 28% of the time), LLMs are amplifying the average, from 28% probability to 98%. Doesn’t that tell you everything you need to know?

This isn’t about people using the technology incorrectly. It’s about how the models themselves are designed. They’re trained on patterns and they optimise for coherence and probability. They deliver what’s most likely, not what’s most interesting or unexpected.

And when everyone’s drawing from the same well, standing out becomes exponentially harder.

Which brings me back to our experiment.

Firstly, credit should go where it’s due — our production partners absolutely nailed the brief. Frame composition, lighting, movement – all of it was eerily accurate. Too accurate.

The result raised immediate questions about likeness, intellectual property and how effortlessly these systems can blur ethical lines without anyone deliberately trying to.

We wanted to get close to the original and the technology made it almost effortless to get there.

It crystallised the convergence problem in a way that felt impossible to ignore. If we could recreate a high-production advertisement this accurately, this quickly, with relatively little iteration – what does that mean for originality across the board? What does that mean for our craft? And how busy are copyright lawyers (or their bots) going to be in the years to come?

So we kept playing and pulled the work right back into a safer zone. We needed to make it less perfect. And as anyone in the industry knows, adjusting the brief halfway through a campaign means deadlines and costs often get blown out.

The final version we ended up with was a bit rougher around the edges, but it was necessary.

‍

Another thing that became obvious through this process is how easy it’s become to mistake polish for purpose.

AI-generated content now rivals or exceeds human-created work across massive portions of the web. By late 2024, the balance had already tipped in many categories. That stat alone isn’t the problem – the problem is why.

Speed and frictionless production are replacing deliberation. Teams are shipping work that looks finished even if it took minutes to create instead of days. The question has now shifted from “Is this the right direction?” to “Is this ready to publish?”.

When creating something that looks finished, that takes minutes instead of days, we risk conflating output with outcome, volume with value and “good enough” with genuinely good.

Here’s where it gets tricky for our industry specifically.

Marketing has always been about standing out and saying something in a way that cuts through. That’s the craft.

But if the tools we’re using to generate ideas are all trained on the same corpus, optimised for similar outputs and rewarding safe, predictable thinking – how do we avoid becoming indistinguishable from each other?

The answer isn’t to abandon AI. That ship has sailed, and frankly, I think it’d be the wrong move anyway. The answer is to fundamentally change how we interact with what these systems give us.

Our experiment forced us to reckon with something uncomfortable: the first thing AI gives you is almost never the right thing to run with. It’s a spark but it’s not the answer.

Here’s what that means in practice:

Interrogate everything.

The moment something looks finished, that’s when you need to push harder. Ask what’s been smoothed away, what assumptions the model made and what directions got optimised out in favour of coherence. The rough edges are often where the truth lives.

Resist the path of least resistance.

Just because you can generate a hundred options in ten minutes doesn’t mean you should use the first one that’s good enough. Speed is valuable but only if it’s pointed in an interesting direction.

Make imperfection intentional.

We deliberately pulled our final version back from perfection because perfection wasn’t the goal – purpose was. Sometimes the most polished version is the least honest one.

The advertising industry has always been vulnerable to trends, templates and formulas. We’ve dealt with this before – when everything looked like an Apple ad, when every brand tried to sound like Dove, when “purpose-driven” became a checkbox instead of a commitment.

AI accelerates that tendency. It makes it easier to drift toward the middle. But when used with intent, it can help us generate unexpected combinations and surface connections we’d miss.

So yes, we used AI to recreate an LLM ad to criticise how LLMs create sameness. The irony isn’t lost on us. In fact, through play, it became the point. But, sometimes, to make people aware of the danger, you need to take them there. Because if we let convenience override craft, if we confuse ease with excellence, we won’t just end up with boring work – we’ll end up in a boring industry.

And none of us got into this business for that.

‍

This article first appeared on Mumbrella, one of Australia’s leading media and marketing industry publications. Read the original piece by Pip here.