
AI and the age of probabilistic programming

  • Writer: Abraham Marin-Perez
  • 2 days ago
  • 11 min read

Every new tool opens the door to new practices that were previously impossible, and LLMs have brought us Vibe Coding. The concept is simple: you describe to an LLM what you want, and the LLM produces code. Then you play with your application: if it seems to work, you crack open a beer (or your beverage of choice); if it doesn't, you tell the LLM what the problem is and ask it to fix the code. It's a loop of refining the requirements over and over until the LLM gets it right, without you ever looking at the code.


This is made possible by the way LLMs work, which can be bluntly summarised as autocomplete on steroids: after a lengthy statistical analysis of a very large corpus of data, the LLM identifies patterns that it can use to generate sequences of words that sound plausible to a human.
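To make that "autocomplete on steroids" idea a bit more tangible, here is a deliberately crude sketch: a toy bigram model that only learns which word tends to follow which in a tiny corpus. Real LLMs are transformers, not frequency tables, and operate at an entirely different scale, but the core trick of predicting the next token from previously observed patterns is the same.

# A toy bigram model: a crude illustration of "autocomplete on steroids".
# Real LLMs are far more sophisticated, but the idea of generating plausible
# continuations from observed frequencies is the same.
import random
from collections import defaultdict

corpus = (
    "the function returns a value . "
    "the function returns an error . "
    "the method returns a value . "
).split()

# Count which words follow which; duplicates act as frequency weights.
follows = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current].append(nxt)

def generate(start, length=6):
    """Produce a plausible-sounding sequence by sampling frequent continuations."""
    word, output = start, [start]
    for _ in range(length):
        if word not in follows:
            break
        word = random.choice(follows[word])  # more frequent -> more likely
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the function returns a value ."

Sound plausible? Sure. Understand what a function or an error is? Not in the slightest.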


Is "Vibe Coding" a joke?

It certainly sounds like it. In fact, if you read Andrej Karpathy's tweet where he coined the term, it does sound like he's not taking it particularly seriously.


Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.
It's not too bad for throwaway weekend projects
it's not really coding

Not particularly reassuring, if you ask me; any sane person would reject the idea immediately, but, as we have witnessed time and again, the world of software developers is not the sanest of all. When I wrote Real-World Maintainable Software, I highlighted the fact that Dr. Winston W. Royce introduced the world to the concept of Waterfall Project Development with a stark warning:

the implementation described is risky and invites failure

Did the world heed this warning? No, we went full-on Waterfall for decades, suffering failed project after failed project, never wondering if anything had to change. We do this because, when there is a brand-new idea that sounds appealing, our motivated reasoning encourages us to adopt it at full throttle without regard for consequences. This is why we have JFDI and YOLO Mode.



Defenders of the trend will argue that these are just teething issues, that the technology is young but that, once it improves, it will change the way we code forever. But will it change it for the better?


Code quality: regression to the mean

Quality is not a measurable entity (although you can try to measure a proxy for quality like code survival), but we can imagine it following a pattern. Consider a large amount of code: the vast majority will be considered OK, some bits will be considered inspired, and some other parts will be considered awful. If we could attach a number to quality and plot it, it would probably follow a Gaussian distribution. Now, assuming that you start from the high-quality side (the right-hand side) and gather code samples, the more code you consider, the lower the quality bar.

Focusing on high-quality code yields a data corpus that is too small to be usable
Having a larger corpus implies adding code of lower quality

Because you need large amounts of data to train LLMs, you need to go heavy towards the lower-quality side. This creates a trade-off: each additional piece of data adds expressivity and variety to your LLM, but it also lowers the quality bar in your model. Remember, LLMs produce output by mimicking the most probable (i.e. frequent) patterns in the analysed data, which means that the code they produce will sit around the average of the input.
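As a back-of-the-envelope illustration (with the admittedly heroic assumption that every code sample has a quality score drawn from a standard normal distribution), here is what happens to the average quality of your corpus as you lower the bar to gather more data:

# Simulated quality scores; higher is better. The normal distribution is an
# assumption for illustration, not a measurement of real code.
import numpy as np

rng = np.random.default_rng(42)
quality = np.sort(rng.normal(size=1_000_000))[::-1]  # best code first

for fraction in (0.01, 0.10, 0.50, 1.00):
    corpus = quality[: int(len(quality) * fraction)]
    print(f"top {fraction:>4.0%} of code -> {len(corpus):>9,} samples, "
          f"mean quality {corpus.mean():+.2f}")

# Approximate output: mean quality drops from about +2.7 (top 1% only)
# to +0.8 (top half) to 0.0 (everything).

The bigger the corpus, the closer its average gets to plain average.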


Responses generated by the LLM will mimic the most common patterns, which coincide with the most common code, which is of average quality

Ramón López de Mántaras, former director of the Artificial Intelligence Research Institute in Barcelona, Spain, and one of Europe's pioneers in the field of AI (he defended his PhD in 1977), gives us a clue about the limitations of code-generating tools: LLMs don't have a model of the world, just a model of the language. They don't have any knowledge of what they are producing, much like a large-scale version of the Chinese Room thought experiment. Since the LLM itself has no actual knowledge with which to assess the correctness of the code it's producing, other than the fact that it follows patterns it has observed before, the code it produces is only ever probably, never definitely, correct. Or, as Simon Wardley put it, LLMs are optimised for coherence, not for truth. The inevitable consequence is that the LLM will produce bugs, and you will never know where. Welcome to probabilistic programming.


Humans make mistakes too

Aye, humans make mistakes, which means that they also produce bugs. You know that, I know that. But humans also have other characteristics that set them apart from LLMs:


  • Humans understand that there are sensitive areas or operations, aspects that are difficult to get right or with a higher potential impact if done wrong; they can pay more attention to those areas.

  • Barring Dunning-Kruger effects, humans know when they are not having a good day or when a topic goes beyond their level of expertise, meaning they know when to ask for help.

  • Humans have certain instincts and pattern-recognition capabilities that AI hasn't replicated and perhaps will never be able to replicate. This is what in the industry we call "smells".


The last point merits a more exhaustive explanation. Try to recall the last time you were debugging a piece of code, and try to analyse what is required for the very act of debugging to take place. On one side, you have the program that you're debugging; on the other, you need to "wrap" that program in another program that can control it: you can pause execution, inspect variables, modify values or even code on the fly. You have the complexity of the program that you're working with, plus the complexity of the wrapper that allows you to debug it. In other words, in order to analyse and understand something, you need a higher level of complexity that allows you to contain what you are trying to analyse and understand. This idea has been described in many ways by many people, from Von Neumann to Asimov, but my favourite is Ashby’s Law of Requisite Variety: “The complexity of a control system must be equal to or greater than the complexity of the system it controls.”


This implies that, in order to fully understand the human brain, you need something that is more complex than the human brain. In other words, humans will never be fully capable of understanding how the human brain works and, therefore, will never be able to fully replicate it. We will not be able to program an AI to think the way we think because, quite simply, we don't know how we do it, which is something that Moravec's paradox has been warning us about for almost 40 years.


Because of all of the above, humans aren't bug-free, but at least they have the capacity to reduce the risk of bugs in high-risk areas, while code generated by an LLM has a probability of error that is evenly distributed across all of the code. That alone makes AI-generated code riskier than human-generated code.
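To put some entirely made-up numbers on that argument: imagine a codebase with one high-impact area and one low-impact one, and compare a human who concentrates their attention where it matters against a model whose error rate is spread evenly. The area names and probabilities below are hypothetical, chosen only so that both have the same average bug rate.

# Hypothetical bug rates and impacts, purely to illustrate the argument.
areas = {
    # name: (human bug probability, LLM bug probability, impact if it goes wrong)
    "payment processing":   (0.01, 0.03, 100),
    "logging / formatting": (0.05, 0.03, 1),
}

human_risk = sum(p_h * impact for p_h, _, impact in areas.values())
llm_risk = sum(p_l * impact for _, p_l, impact in areas.values())

print(f"expected impact, human: {human_risk:.2f}")  # 0.01*100 + 0.05*1 = 1.05
print(f"expected impact, LLM:   {llm_risk:.2f}")    # 0.03*100 + 0.03*1 = 3.03

Same average bug rate, roughly three times the expected damage, simply because the errors land indiscriminately on code that matters and code that doesn't.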


"Wait a minute" I hear you say. "If we can review sensitive areas or operations of human-generated code, why can't we review sensitive areas or operations of AI-generated code?" You can, provided you have the expertise to a) identify the areas or operations that are sensitive and b) ascertain whether they are correct or incorrect. Which leads me to my next point: how do you build such expertise?


It's the effort, stupid!

When I started university 25 years ago, laptops were a luxury few people could afford, and university students certainly weren't the type of people who had them. Heck, as I recall, only a few of my teachers had a laptop. Attending lectures meant bringing pen and paper and furiously taking notes for hours on end; at the end of each day, my hand ached just as much as my head. Today, with the commoditisation of electronics, any university classroom has as many laptops as people, if not more.


The ubiquity of electronic devices has led to a decline in handwriting, and a number of researchers have set out to understand how this is affecting us. As early as 2014, researchers found that students who take notes on a laptop take more notes than those writing longhand, but remember less of the topic at hand. The mechanism that drives this is not fully understood, but one strong theory points to the associated effort: humans are energy-conservation machines honed over millions of years of evolution, and the brain will only spend the energy needed to remember a fact if the cost of remembering it is less than the cost of not remembering it. That may sound weird, so let's analyse a more intuitive, physical counterpart.


Visualise a gym full of sweaty people struggling to lift heavy (or not so heavy) weights. You could say that they are being pretty ridiculous since there are more efficient ways to lift weights. I mean, if it's so heavy, use a forklift, or a pulley, or some other device that would make the task easier. But that would defeat the point because the objective is not moving weights to a higher position, the objective is stressing the muscles so as to encourage their development and thus becoming stronger.


Gainz... the easy way! Lift weights with a forklift for a quick task completion without effort.

The same applies to the brain and note-taking: if you're just transcribing something for posterity, purely for archival reasons, then taking notes on a laptop is the most efficient way. However, if your objective is learning, then you want to stress your brain so as to encourage it to record facts, to learn. You want the hard way.


The advent of LLM-generated code has come at a time when we have plenty of people who developed their careers without these tools; they had to do things the hard way (growing their expertise while at it). Yes, these people can inspect LLM-generated code and assess whether it is correct or not. However, a newer generation of developers, those who are not doing things by hand because asking an LLM is so much easier, aren't going through the same effort and therefore are not developing the same expertise. Their learning will be limited to what the LLM can produce which, as we discussed earlier, cannot go above average for structural reasons. This lack of effort is why some are already discovering that AI is a Floor Raiser, not a Ceiling Raiser; we will get to average faster, but we will impede further learning.


The future: a Gaussian Witch Hat

The natural conclusion of this is that, in terms of code production, we are effectively going to have fewer junior and senior developers; most people will produce output equivalent to that of a mid-level developer. This means that the code quality distribution we discussed earlier is going to become more and more pronounced: flatter tails, pointier middle. The famous Gaussian bell will transform into a Gaussian Witch Hat.


The future distribution of code quality: the Gaussian Witch Hat
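If you want to play with that shape yourself, a rough sketch is enough; the narrower sigma below is an arbitrary choice, just to make the pointier middle and thinner tails visible.

# Compare today's bell with tomorrow's witch hat (narrower spread, same mean).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 400)
plt.plot(x, norm.pdf(x, scale=1.0), label="today: the Gaussian bell")
plt.plot(x, norm.pdf(x, scale=0.4), label="tomorrow: the Gaussian witch hat")
plt.xlabel("code quality")
plt.ylabel("density")
plt.legend()
plt.show()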

The consequences of this are hard to foresee. On one side, with more and more developers using the code that an LLM has produced instead of writing it from scratch, we will stop producing new original content to train new LLMs. We could try to use AI-generated data to train AI, but research suggests that this leads to model collapse (AI inbreeding, if you will). There will still be some people writing fully original code, but those will be few and far between. Since progress in the last decade has mainly come from using larger and larger corpora of data, future progress will stagnate unless we find a different avenue, like entirely new model architectures; this is not as simple as throwing in more GPUs, and progress from this strategy will plateau.


On the other side, there are other avenues where further progress could come from. Prompt engineering is quickly becoming a science in its own right, with guidelines on how to phrase and even structure instructions becoming so precise that they're almost a new kind of programming language; as an example, OpenAI's recommendations for creating custom GPTs include things like the following (I'll sketch what this looks like in practice after the list):

  • Separate paragraphs with a blank line to distinguish different ideas or instructions.

  • Incorporate “take your time,” “take a deep breath,” and “check your work” techniques to encourage the model to be thorough.

  • Break down multi-step instructions into simpler, more manageable steps to ensure the model can follow them accurately.
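To make that concrete, here is an illustrative (and entirely unofficial) system prompt that applies those guidelines: blank lines between ideas, a "take your time" nudge, and multi-step instructions broken into small steps.

# An illustrative system prompt, not an official OpenAI example.
SYSTEM_PROMPT = """You are a code review assistant for Python services.

Take your time, take a deep breath, and check your work before answering.

Follow these steps:
1. Read the diff and summarise what it changes in one sentence.
2. List any bugs or risky changes, quoting the relevant lines.
3. Suggest at most three concrete improvements, ordered by impact.

If you are unsure about something, say so explicitly instead of guessing."""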


Incidentally, I find it remarkably interesting that instructing a computer to "take a deep breath" would actually change the quality of its output. This further proves that LLMs don't learn the good, but the frequent: we humans are prone to giving a sloppy answer unless we're asked, pretty please, to make an effort, and LLMs are doing just the same. That's what you get from mimicking human behaviour after a massive statistical analysis of existing code and user forums.


AI is the new plastic

It should be very obvious by now that, while AI is not going to increase the quality of production, it is going to make it dramatically cheaper. Drawing conclusions from previous automation revolutions, we can infer that this will be a game changer both for good and for bad.


On one side, automation has always led to larger volumes, lower per-unit cost, lower quality, and an increase in throwaway culture. This trend, inevitably, ends in some form of waste. Take the fashion industry as an example, where the cost reduction in garment production has led to the concept of fast fashion and the creation of clothes dumping grounds all over the world. We're beginning to see these effects in programming too: technical debt grows faster with AI, productivity is decreasing due to workslop, etc. You may have heard of the tactical tornado before; AI is going to supercharge this kind of programmer. Beware of the tactical sharknado.


The Tactical Sharknado is coming. Time to sh*t your pants.

But, on the other side, it is going to bring programming to areas where it currently doesn't reach because the cost isn't justified. In this sense, the appropriate comparison is that of ultra-processed, industrialised food: less nutritious, full of additives, and leading to all sorts of health issues in the long term, but cheap, very cheap. Now, if the choice is between good and bad food, you obviously choose good food, but if the choice is between bad or no food, you settle for bad. That's what AI will bring: an unbearable amount of AI slop, but at such low prices that for many use cases it will be the best if not the only alternative. If the use case is not critical, then a cheap system that probably works can be good enough. Like plastic.


Centaur programmers: the best of both worlds

Despite its name, AI is dumb, really dumb. It excels at producing the kind of code that other people have already written a million times before. The key thing to keep in mind is that AI lacks any creativity, and that's where we humans excel. That may change in time if we combine LLMs with other techniques like genetic algorithms, but that requires a lot of domain-specific knowledge and goes against today's generalist trend, so I don't see any breakthroughs happening soon.


The way forward is combining AI's processing power with human creativity, the so-called centaur teams. If you think that AI-generated code is good because it saves you time typing all that boilerplate, think instead that you should be looking for ways to remove that boilerplate. And here is the kicker: you can use AI for that! Once you identify the repetition, you can ask AI for tips on how to remove it. On the other hand, there are repetitive but necessary tasks that AI can help with; we just need to remain in control. For instance, you can set up an AI bot to post review comments on your PRs; these aren't a substitute for your own clinical eye, but they can be useful for catching the kind of thing a human may easily miss. You can't replace a mathematician with a calculator, but a mathematician equipped with a calculator is incredibly more productive.
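As a rough sketch of that review-bot idea (the model name, environment variables, and helper functions below are illustrative assumptions, not a recommendation of any particular setup), something like this is enough to get comments flowing into a pull request while the human keeps the final say:

# A minimal sketch: ask a model for review comments and post them to the PR.
# The GitHub endpoint is the standard "create an issue comment" API, which
# also works for pull requests; wire it into CI however suits your team.
import os
import requests
from openai import OpenAI

def review_diff(diff: str) -> str:
    """Ask the model for review comments on a diff; the human still decides."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: use whatever model you actually have
        messages=[
            {"role": "system", "content": "You are a careful code reviewer."},
            {"role": "user", "content": f"Review this diff:\n\n{diff}"},
        ],
    )
    return response.choices[0].message.content

def post_comment(repo: str, pr_number: int, body: str) -> None:
    """Post the review as a regular comment on the pull request."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    response = requests.post(url, headers=headers, json={"body": body}, timeout=30)
    response.raise_for_status()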


In summary, AI is an incredible tool that can bring many benefits to our daily lives, but we must be careful not to abdicate our responsibilities to it. Understand its strengths and limitations and make sure you leverage the former without succumbing to the latter. Today more than ever, creativity, excellence, and quality will be the differentiating factors.


Just remember: AI isn't here to do your job, AI is here to help you do your job. But it's still your job.

 
 
 
