AI That Learns After Deployment

Published 26 May 2026

The biggest limitation of modern AI is not that it lacks knowledge. It is that it stops learning once it is deployed.

Today's systems can remember context, retrieve documents, store preferences, and be fine-tuned later. But the model itself is usually not changing inside the live loop of experience.

That means most AI can look adaptive without actually adapting.

If AGI means a system that can survive novelty, recover from mistakes, form new behaviours, and keep improving in the world, then live learning is not an optional feature. It is the core problem.

This essay is about one possible building block: plastic neural networks whose structure changes while they run.

How neural nets work today, and their problems

Simply put, numbers go in, numbers come out.

A simple neural net consists of three parts:

The input layer
The hidden layer
The output layer

The hidden part can hold millions to billions of parameters spread across layer after layer, a bit like this:

[0.5, -0.2, 0.8, 0.1, ...]
[0.7, -0.4, 0.3, 0.2, ...]
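As a minimal sketch of "numbers go in, numbers come out", assume one small dense layer with a tanh activation. The layer size, the input values, and the choice of tanh are illustrative assumptions, not taken from any particular model; the two weight rows reuse the example numbers above, truncated to four values.

```python
import math

def forward(inputs, weights):
    """One dense layer: each output is tanh of a weighted sum of the inputs."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]

# Two rows of weights, echoing the example numbers above (truncated to 4 values).
hidden_weights = [
    [0.5, -0.2, 0.8, 0.1],
    [0.7, -0.4, 0.3, 0.2],
]
inputs = [1.0, 0.0, 1.0, 0.0]            # hypothetical input numbers
hidden = forward(inputs, hidden_weights)  # two hidden values, one per weight row
```

Each output depends on every weight in its row, so changing a single weight shifts that output for every input that touches it.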

The big problem is that, after training, it is very difficult and expensive to update these hidden numbers without a cascading impact on the others. This can lead to catastrophic forgetting, where learning something new overwrites what the network already knew.

Fine-tuning is possible, but it is usually another training run. It needs data, compute, validation, and safety checks. That is not the same as an organism learning from experience as it moves through the world.

We want the learning to be live.

Why random synapse branching for learning?

The constraint I imposed on myself was to avoid fixed numeric weight matrices and traditional backpropagation, because neither is how the brain works. The brain does not use matrix multiplication. The brain does not operate on values between -1 and 1.

At first I tried to understand how human neurons work and what causes a neuron to fire. I couldn't understand it.

So I flipped it. What if the bulk of the logic is in the synapses?

Synapses are what connect neurons in the brain and they are not static. They sprout, grow, weaken, and decay. There is debate around how random this can be. For example, in the brain of a fly the majority of synapses are very short, indicating a highly optimised and not random path being locked in. A fly is not smart though.

For our purposes, the question is: what happens if the branching is purely random? Over time, with sufficient locking and decay of those synapses, can a functioning neural net establish itself?

The best analogy I can use to describe this is that the neural net is a group of tributaries. Numbers come in and those numbers flow downstream through the easiest and strongest routes. This results in outputs that lead to muscle movement. Over time the feedback carves channels.

Demo

I started with a worm.

Whilst reading "A Brief History of Intelligence" I came across the nematode C. elegans, a worm roughly 1 mm long that is usually studied under a microscope and has just over 300 neurons. This felt like a tractable number of neurons. It also has limited outputs, and the sim operates in a 2D world. The task is simple: find the food.

The demo shows basic steering and, once synapses reinforce, faster movement toward future food.

How it works

There is a deep dive on the maths at the end of this post.

The worm receives inputs from external stimuli, such as smell, and internal states such as hunger and movement. These are the input neurons and input numbers into the neural net.
Random synapses start branching with a health of 25% (the stronger the health, the more of the original input is forwarded).
The numbers start flowing from left to right and result in the output drive. This is basic steering. Nothing is being learnt yet.
The worm eats the food and the synapses are reinforced by increasing their health and reducing the rate of decay.
The worm senses new food and these inputs travel faster and stronger down the strengthened synapses, resulting in faster movement.

I consider this a very narrow form of learning, but it is a live loop:

input -> movement -> feedback -> structural change -> changed future movement
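The loop can be sketched for a single route from one smell input to one motor output. This is a toy under stated assumptions, not the demo's code: `run_encounter`, the reinforce rate of 0.5, and the rule that any movement reaches the food are all hypothetical.

```python
def run_encounter(health, smell, reinforce_rate=0.5):
    """One pass of the loop: input -> movement -> feedback -> structural change."""
    drive = smell * health        # the synapse forwards the input in proportion to its health
    ate_food = drive > 0.0        # toy assumption: any movement eventually reaches the food
    if ate_food:
        health += reinforce_rate * (1.0 - health)  # feedback strengthens the route
    return drive, health

health = 0.25                     # a new synapse starts at 25% health
first_drive, health = run_encounter(health, smell=1.0)
second_drive, health = run_encounter(health, smell=1.0)
```

With these numbers the same smell produces a drive of 0.25 on the first encounter and 0.625 on the second. The only thing that changed between the two encounters is the health of the synapse.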

Ablations

Steering can be very basic

It is possible, with a lot of tuning, to directly wire a left input to a left motor output to achieve steering. Reinforce this route and the next time the inputs come in the worm will move faster. Is this a form of learning similar to my claim above? What is the point of the internal neurons then? At this time I do not have a great answer except an intuition that a larger net is needed to retain more advanced behaviour and memory (see the iteration examples below).

Non-monotonic performance when adjusting neuron sizes

Adjusting the numbers of input, internal, and output neurons does not lead to a predictable improvement in performance. This is partly coupled to the heavy impact the initial random branching has on how quickly the worm gets the food. Some random branching leads to a good worm. Other worms require dropping the food in front of them to build up the network. This suggests that structure, not just scale, is important: which paths exist, how they are reinforced, and how quickly useless structure is removed.

Iterations

Dopamine-style predictive reinforcement/decay as delayed credit assignment

There is a global neuron that contains traces of all inputs up until the food, or reward, is eaten. After repeated food encounters, the worm can begin treating earlier inputs as predictors of reward. This means the system is no longer only reacting to food contact; it can start strengthening paths when it detects a cue that has previously led to food.

Disappointment / anti-lock

Prediction creates a new problem: false confidence. If the worm expects food and no food arrives, the network needs a way to weaken that expectation. The disappointment iteration adds negative feedback when an expected reward fails to appear. This prevents the system from permanently locking onto a once-useful but now-wrong path.

Pain and stress routing

Food is only half of embodied learning. The next iteration added pain and stress. Pain creates an aversive signal that inverts synapse routes when the worm encounters something harmful. At first I thought just weakening the synapses would be enough, but that just causes the worm to get stuck in the pain area. Inverting means the creation of a second network that gets strengthened, leading to different movement, ideally the opposite direction. This lets the network learn not only what to approach, but what to avoid.

Hunger as internal drive forcing exploration

A big problem has been encouraging exploration when there are no reward cues such as smell. A worm with no inputs will collapse into circling. Hunger and other internal states act as inputs, spiking activity through the network to encourage movement. Currently the hunger input is too strong, and I am trying to work out how to rebalance the network once smell inputs start being received. One idea is to keep some rough memory of the inputs, so that hunger is a very strong input only while smell has not been seen recently.

3D

I have some early 3D versions working, but it is considerably harder. There is a lot of overshooting and I suspect there need to be more types of input neurons to make it more accurate, coupled with solving the life cycle saturation problem.

The next task: demonstrating real-time learning

The cleanest test is to reverse the inputs and rewards.

On the first encounter:

Smell A = Food
Smell B = Pain

On the second encounter:

Smell A = Pain
Smell B = Food

On the second encounter, the expectation is that the agent will first hit the pain. This should cause a live structural change in the neural net, one that avoids Smell A and forces enough exploration to test Smell B.
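One way to score this test is to log which smell the agent approaches on each trial and find the point from which it only picks the smell that currently means food. The sketch below is a hypothetical scoring helper, not part of the sim; `ENCOUNTERS`, `trials_to_switch`, and the example choice log are made up for illustration.

```python
ENCOUNTERS = [
    {"A": "food", "B": "pain"},   # first encounter
    {"A": "pain", "B": "food"},   # second encounter: mapping reversed
]

def trials_to_switch(choices, encounter):
    """Index of the first trial from which every later choice yields food, or None."""
    for start in range(len(choices)):
        if all(encounter[c] == "food" for c in choices[start:]):
            return start
    return None

# Made-up log of smells approached during the second encounter.
choices = ["A", "A", "B", "A", "B", "B", "B"]
switched_at = trials_to_switch(choices, ENCOUNTERS[1])
```

A smaller `switched_at` means a faster reversal. A result of `None` means the agent never settled on the new food smell.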

Open questions

Does this scale?
Would the non-monotonic nature change once there are more balancing capabilities such as fear, simulation, and long-term memory?
Have other people done this before? Why didn't it work in the past?
Advanced animals have a neocortex. This has a very similar synapse structure and is not random per se like the worm. Is this perhaps a feedback loop mechanism that allows for diffs to be created and hence superior intelligence?
What other brain structure / network topology is needed?
Can long-term memory be created by scaling this network? Is sufficient size enough to encode different memories?
How else to demonstrate learning? One idea is Thorndike puzzles. If an agent finishes a puzzle faster and faster, it demonstrates memory/learning.
How much of intelligence is a result of the hard outer loop of evolution? This is the nature vs. nurture debate.

Conclusion

The innovation is not a single trick. It is a live selection loop over structure.

Random branching creates variation. Reward, disappointment, hunger, pain, and decay create selection. The network learns because its physical graph changes.

This simple plasticity is not AGI by itself. But it could be one of the building blocks for true AGI: systems that keep changing, adapting, and learning after they are deployed.

Appendix: Deep dive on the maths

How the simple version works

At any tick, each neuron has a value.

For an input neuron, that value might represent smell, hunger, touch, pain, or movement.

Each synapse has a health value.

High-health synapses carry more signal. Low-health synapses carry less signal. If health gets too low, the synapse can be pruned.

The rough signal equation is:

$$ \mathrm{signal}_{i,j}(t) = \mathrm{value}_i(t) \cdot \mathrm{health}_{i,j}(t) $$

The receiving neuron combines incoming signals:

$$ \mathrm{value}_j(t+1) = \operatorname{clamp}\left(\mathrm{baseline}_j + \sum_i \mathrm{signal}_{i,j}(t), 0, 1\right) $$

This is the river analogy in simple maths. Signals flow most strongly through branches that already have higher health.
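The two equations above can be sketched as one propagation step. The dictionary layout and the neuron names are assumptions made for illustration; only the arithmetic comes from the equations.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def propagate(values, synapses, baselines):
    """values: {neuron: value}, synapses: {(src, dst): health}, baselines: {neuron: baseline}.
    Returns the next-tick value of every receiving neuron."""
    incoming = {j: 0.0 for j in baselines}
    for (i, j), health in synapses.items():
        # signal_ij = value_i * health_ij
        incoming[j] = incoming.get(j, 0.0) + values.get(i, 0.0) * health
    return {j: clamp(baselines.get(j, 0.0) + total) for j, total in incoming.items()}

# Hypothetical two-input, two-motor example.
values = {"smell_left": 0.8, "smell_right": 0.2}
synapses = {
    ("smell_left", "motor_left"): 0.75,
    ("smell_right", "motor_left"): 0.25,
    ("smell_right", "motor_right"): 0.5,
}
baselines = {"motor_left": 0.0, "motor_right": 0.0}
next_values = propagate(values, synapses, baselines)
```

Here the left motor receives about 0.65 and the right motor about 0.1, so most of the signal ends up on the side with the healthier route.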

At each tick, new candidate branches can appear:

$$ P(\mathrm{newBranch}) = \mathrm{branchingRate} $$

A new branch starts weak:

$$ \mathrm{health}_{\mathrm{new}} = \mathrm{initialHealth} $$

In many experiments, that initial health is around 25 percent.

Most new branches are useless. They are just variation.

The learning question is whether feedback can preserve the rare useful branches before they decay.
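A minimal sketch of random branching, assuming at most one candidate branch per tick for the whole network and a uniform choice of source and target. The equations above do not say whether the branching rate applies per network or per neuron, so that granularity is an assumption, as is the function name.

```python
import random

def grow_branches(synapses, neurons, branching_rate, initial_health=0.25, rng=random):
    """With probability branching_rate, add one random synapse at initial_health."""
    if rng.random() < branching_rate:
        src, dst = rng.sample(neurons, 2)                 # two distinct neurons
        synapses.setdefault((src, dst), initial_health)   # keep an existing branch as-is
    return synapses

rng = random.Random(0)   # seeded only to make the sketch repeatable
network = grow_branches({}, ["smell", "inter", "motor"], branching_rate=1.0, rng=rng)
```

Nothing in this step looks at reward. It only supplies variation for the feedback rules to select from.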

Reward and decay

When the worm eats food, it receives reward.

For recently active branches, reward can increase health:

$$ \mathrm{health}_{i,j}(t+1) = \mathrm{health}_{i,j}(t) + \mathrm{reinforceRate} \cdot \mathrm{feedback}(t) \cdot \left(1 - \mathrm{health}_{i,j}(t)\right) $$

The "remaining room to grow" term, (1 - health), matters. A weak branch can strengthen quickly. A strong branch approaches its ceiling more slowly.
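As a worked example with assumed values reinforceRate = 0.1 and feedback = 1 (neither number is from the experiments), compare a weak branch at health 0.25 with a strong branch at health 0.9:

$$ 0.25 + 0.1 \cdot 1 \cdot \left(1 - 0.25\right) = 0.325 $$

$$ 0.9 + 0.1 \cdot 1 \cdot \left(1 - 0.9\right) = 0.91 $$

The weak branch gains 0.075 from one reward while the strong branch gains only 0.01. As long as reinforceRate times feedback stays at or below 1, health moves towards 1 without overshooting it.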

If a branch is not protected, it decays:

$$ \mathrm{health}_{i,j}(t+1) = \mathrm{health}_{i,j}(t) \cdot \left(1 - \mathrm{effectiveDecay}_{i,j}(t)\right) $$

If health falls below a threshold, the branch is removed.
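Decay and pruning can be sketched together. The threshold value of 0.05 and the per-synapse `effective_decay` dictionary are assumptions for illustration; the text above only says that a threshold exists.

```python
def decay_and_prune(synapses, effective_decay, prune_threshold=0.05):
    """Apply health *= (1 - effectiveDecay) and drop branches below the threshold."""
    survivors = {}
    for key, health in synapses.items():
        health = health * (1.0 - effective_decay[key])
        if health >= prune_threshold:
            survivors[key] = health
    return survivors

synapses = {("smell", "motor"): 0.8, ("touch", "motor"): 0.06}
effective_decay = {("smell", "motor"): 0.0, ("touch", "motor"): 0.2}  # first route fully protected
remaining = decay_and_prune(synapses, effective_decay)
```

In this example the protected route keeps its health of 0.8, while the unprotected one drops to about 0.048 and is removed from the graph.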

So the simplest version of the system is:

random growth + reward protection + decay = structural selection

That is the core idea.

Learning is not just updating a number in a fixed graph. Learning is deciding which routes continue to exist.

Delayed credit

Reward usually arrives late.

The worm smells food before it eats food. It turns before it reaches food. It moves through many intermediate states before the reward arrives.

So the system needs a short memory of what was active before reward.

For each input cue, the worm can keep a trace. A bounded version is:

$$ \mathrm{trace}_i(t+1) = \mathrm{traceMemory} \cdot \mathrm{trace}_i(t) + \left(1 - \mathrm{traceMemory}\right) \cdot \mathrm{cue}_i(t) $$

I like to experiment with unclamping as much as possible. Conventional networks usually keep values normalised to a small range such as -1 to 1, but in a lot of cases here we don't need to. Perhaps this allows for a greater range of behaviour. An unclamped version is:

$$ \mathrm{trace}_i(t+1) = \mathrm{traceMemory} \cdot \mathrm{trace}_i(t) + \mathrm{cue}_i(t) $$

The bounded version stays on the same scale as the input cue. The unclamped version lets repeated cue evidence build up beyond 1.0.

This lets the system say:

Food arrived now, but these signals were active shortly before it happened.

Those recent signals can then receive credit.
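The difference between the two trace rules shows up when a cue stays on. A minimal sketch, assuming a constant cue of 1.0 and a traceMemory of 0.9 (both values are illustrative):

```python
def bounded_trace(trace, cue, trace_memory):
    return trace_memory * trace + (1.0 - trace_memory) * cue

def unclamped_trace(trace, cue, trace_memory):
    return trace_memory * trace + cue

bounded = unclamped = 0.0
for _ in range(50):                      # the cue stays at 1.0 for 50 ticks
    bounded = bounded_trace(bounded, 1.0, 0.9)
    unclamped = unclamped_trace(unclamped, 1.0, 0.9)
```

For a constant cue, the bounded trace converges to the cue value itself, while the unclamped trace converges to cue / (1 - traceMemory), which is 10 in this example. When the cue switches off, both traces shrink by a factor of traceMemory each tick.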

Prediction and surprise

Over time, the worm can learn that some traces predict reward.

$$ \mathrm{prediction}(t) = \sum_i \mathrm{weight}_i(t) \cdot \mathrm{trace}_i(t) $$

Then the important value is the difference between what happened and what was expected:

$$ \delta(t) = \mathrm{reward}(t) - \mathrm{prediction}(t) $$

If delta is positive, reward was better than expected.

If delta is negative, reward was worse than expected.

Positive surprise means: protect what just worked.
Negative surprise means: stop trusting what just predicted reward.
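Prediction and surprise can be sketched as below. The equations above do not say how the cue weights themselves are learned, so `update_weights` uses a standard delta rule purely as a stand-in assumption, with a made-up learning rate.

```python
def predict(weights, traces):
    """prediction = sum_i weight_i * trace_i"""
    return sum(w * traces.get(k, 0.0) for k, w in weights.items())

def surprise(reward, weights, traces):
    """delta = reward - prediction"""
    return reward - predict(weights, traces)

def update_weights(weights, traces, delta, learning_rate=0.1):
    # ASSUMPTION: delta-rule update; the essay does not specify the weight update.
    return {k: w + learning_rate * delta * traces.get(k, 0.0) for k, w in weights.items()}

weights = {"smell": 0.0}
traces = {"smell": 1.0}
first_delta = surprise(1.0, weights, traces)        # food arrives, nothing was predicted
weights = update_weights(weights, traces, first_delta)
second_delta = surprise(1.0, weights, traces)       # food again, now partly predicted
missed_delta = surprise(0.0, weights, traces)       # cue present, no food
```

In this run the surprise shrinks from 1.0 to about 0.9 once the cue starts predicting food, and it turns negative (about -0.1) when the predicted food fails to arrive.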

Dopamine-style reinforcement

The dopamine-like signal is the positive part of surprise:

$$ \mathrm{dopamine}(t) = \operatorname{clamp}\left(\max(\delta(t), 0), 0, 1\right) $$

Dopamine can protect and reinforce recently active branches.

That protection can be stored as a lock trace on the synapse:

$$ \mathrm{lockTrace}_{i,j}(t+1) = \operatorname{clamp}\left(\mathrm{lockMemory} \cdot \mathrm{lockTrace}_{i,j}(t) + \mathrm{dopamine}(t), 0, 1\right) $$

That lock trace reduces branch decay:

$$ \mathrm{effectiveDecay}_{i,j}(t) = \mathrm{decayRate} \cdot \left(1 - \mathrm{lockTrace}_{i,j}(t)\right) $$

This is how a successful route survives long enough to become part of the worm's behaviour.
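The three equations in this section chain together as follows. The function names and the example values (lockMemory 0.9, decayRate 0.05) are assumptions for illustration.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def dopamine(delta):
    """Positive part of surprise, kept in [0, 1]."""
    return clamp(max(delta, 0.0))

def update_lock(lock_trace, dopamine_level, lock_memory):
    return clamp(lock_memory * lock_trace + dopamine_level)

def effective_decay(decay_rate, lock_trace):
    return decay_rate * (1.0 - lock_trace)

lock = update_lock(0.0, dopamine(0.8), lock_memory=0.9)   # one surprising reward
protected_decay = effective_decay(0.05, lock)
unprotected_decay = effective_decay(0.05, 0.0)
```

After a single reward with surprise 0.8, the branch decays at about 0.01 per tick instead of 0.05. A lock trace of 1.0 would stop decay entirely.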

Predictive dopamine

Reward is not the only possible dopamine trigger.

Once a cue has reliably predicted reward, the cue itself can release dopamine before the reward arrives.

First convert the learned cue weight into a trusted weight:

$$ \mathrm{trustedWeight}_i(t) = \operatorname{clamp}\left(\frac{\mathrm{weight}_i(t) - \mathrm{threshold}}{1 - \mathrm{threshold}}, 0, 1\right) $$

Then look for a rising cue:

$$ \mathrm{cueRise}_i(t) = \max\left(\mathrm{cue}_i(t) - \mathrm{cue}_i(t-1), 0\right) $$

The predictive drive is:

$$ \mathrm{predictiveDrive}(t) = \operatorname{clamp}\left(\sum_i \mathrm{trustedWeight}_i(t) \cdot \mathrm{cueRise}_i(t), 0, 1\right) $$

Then dopamine can combine reward surprise with trusted cue onset:

$$ \mathrm{dopamine}(t) = \operatorname{clamp}\left(\max\left(\max(\delta(t), 0), \mathrm{predictiveDrive}(t)\right), 0, 1\right) $$

This matters because the system is no longer only reacting to food contact. It can start reinforcing paths when it detects a cue that has previously led to food.

That is the shift from reward reaction to reward prediction.
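A minimal sketch of the predictive path, following the four equations above. The cue names and numbers are made up, and the threshold is assumed to be strictly below 1 so the division is defined.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def trusted_weight(weight, threshold):
    return clamp((weight - threshold) / (1.0 - threshold))   # assumes threshold < 1

def cue_rise(cue_now, cue_prev):
    return max(cue_now - cue_prev, 0.0)

def predictive_drive(weights, cues_now, cues_prev, threshold):
    total = sum(
        trusted_weight(w, threshold) * cue_rise(cues_now.get(k, 0.0), cues_prev.get(k, 0.0))
        for k, w in weights.items()
    )
    return clamp(total)

def combined_dopamine(delta, drive):
    return clamp(max(max(delta, 0.0), drive))

# A trusted smell cue rises from 0.1 to 0.6 while no reward is present (delta = 0).
drive = predictive_drive({"smell": 0.8}, {"smell": 0.6}, {"smell": 0.1}, threshold=0.5)
level = combined_dopamine(0.0, drive)
```

Here the dopamine level is about 0.3 even though no food has been eaten. A cue whose weight is at or below the threshold contributes nothing, and a cue that is steady or falling contributes nothing either.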

Disappointment and anti-lock

Prediction creates a new failure mode.

The worm can become confident in the wrong cue.

If the worm expects food and food does not arrive, the system needs a way to weaken that expectation.

A disappointment event can be written as:

$$ \mathrm{disappointment}(t) = \operatorname{clamp}\left(\mathrm{expectationStrength} \cdot \max(-\delta(t), 0), 0, 1\right) $$

The conservative version is not to destroy the branch directly. Instead, disappointment removes protection:

$$ \mathrm{lockTrace}_{i,j}(t+1) = \mathrm{lockTrace}_{i,j}(t) \cdot \left(1 - \mathrm{disappointmentGain} \cdot \mathrm{disappointment}(t)\right) $$

Protected branches become unprotected. Then ordinary decay can act again.

This is the anti-lock idea. The system should not permanently lock onto a once-useful but now-wrong route.
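The disappointment path can be sketched in two small functions. The example values are illustrative, and the sketch assumes disappointmentGain stays between 0 and 1 so the lock trace cannot go negative.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def disappointment(delta, expectation_strength):
    """Negative surprise, scaled by how strongly the reward was expected."""
    return clamp(expectation_strength * max(-delta, 0.0))

def release_lock(lock_trace, disappointment_level, disappointment_gain):
    # Assumes 0 <= disappointment_gain <= 1.
    return lock_trace * (1.0 - disappointment_gain * disappointment_level)

level = disappointment(delta=-0.8, expectation_strength=1.0)   # expected food, got none
lock = release_lock(0.9, level, disappointment_gain=0.5)
```

In this example a lock trace of 0.9 drops to about 0.54 after one missed reward. The branch's health is untouched; it simply loses protection and decays faster from the next tick.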

Pain and stress

Food is only half of embodied learning. An animal also needs to avoid harm.

Pain is different from disappointment. Disappointment says: the expected good thing did not happen. Pain says: this route is actively bad.

One simple way to model pain is to weaken active routes:

$$ \mathrm{health}_{i,j}(t+1) = \mathrm{health}_{i,j}(t) \cdot \left(1 - \mathrm{pain}(t)\right) $$

But in the experiments, simply weakening painful routes can leave the worm stuck. A more interesting idea is route inversion.

First maintain a stress trace:

$$ \mathrm{stressTrace}(t+1) = \operatorname{clamp}\left(\mathrm{stressMemory} \cdot \mathrm{stressTrace}(t) + \mathrm{pain}(t), 0, 1\right) $$

Then use it to invert the live transfer path:

$$ \mathrm{effectiveHealth}_{i,j}(t) = \operatorname{lerp}\left(\mathrm{health}_{i,j}(t), 1 - \mathrm{health}_{i,j}(t), \mathrm{stressTrace}(t)\right) $$

When stress is low, the branch behaves normally. When stress is high, strong routes become weak and weak routes become strong.

The stored network is not erased. The live transfer path changes temporarily. That gives the worm a chance to do something different immediately.
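Route inversion can be sketched as below, with `lerp` as ordinary linear interpolation. The stressMemory value and the example healths are assumptions.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def lerp(a, b, t):
    return a + (b - a) * t

def update_stress(stress_trace, pain, stress_memory):
    return clamp(stress_memory * stress_trace + pain)

def effective_health(health, stress_trace):
    """Blend stored health with its inverse; the stored health itself is not modified."""
    return lerp(health, 1.0 - health, stress_trace)

calm = effective_health(0.9, 0.0)       # no stress: the strong route stays strong
stressed = effective_health(0.9, 1.0)   # full stress: the strong route becomes weak
halfway = effective_health(0.9, 0.5)    # half stress
```

One property of this formula is worth knowing when tuning: at a stress trace of exactly 0.5, every synapse transfers 0.5 regardless of its stored health, so the network only actually inverts once stress rises above 0.5.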

Hunger as internal drive

Hunger is not meant as a claim that the machine suffers. It is an internal drive signal.

$$ \mathrm{hunger}(t) = \operatorname{clamp}\left(\frac{\mathrm{timeSinceFood}}{\mathrm{hungerWindow}}, 0, 1\right) $$

If there is no smell, no reward, and no obvious direction, the worm can collapse into circling or flat movement. Hunger can push activity through the network so the worm keeps exploring.

The hard part is rebalancing:

$$ \mathrm{drive}(t) = \operatorname{combine}\left(\mathrm{smell}(t), \mathrm{hunger}(t), \mathrm{pain}(t), \mathrm{movement}(t), \mathrm{touch}(t)\right) $$

If hunger is too weak, the worm does not explore. If hunger is too strong, it drowns out smell when food appears.

The right mechanism probably needs hunger to dominate when there are no external cues, then fade once useful sensory evidence returns.
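The hunger equation, plus one possible gating rule, can be sketched as below. The `gated_hunger` function is a hypothetical illustration of "dominate without cues, fade with them", not the mechanism used in the experiments, and the combine step is still an open question.

```python
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def hunger(time_since_food, hunger_window):
    return clamp(time_since_food / hunger_window)

def gated_hunger(hunger_level, smell):
    # HYPOTHETICAL rule: hunger is scaled down as the smell evidence gets stronger.
    return hunger_level * (1.0 - clamp(smell))

no_cue = gated_hunger(hunger(500, 100), smell=0.0)      # starving, nothing to smell
weak_cue = gated_hunger(hunger(500, 100), smell=0.25)   # starving, faint smell
strong_cue = gated_hunger(hunger(500, 100), smell=1.0)  # starving, strong smell
```

With this rule a starving worm explores at full drive when nothing is sensed, and the hunger input steps aside as the smell gets stronger. Whether a simple multiplicative gate is enough would need testing in the sim.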

A version of this post was published on https://x.com/HughHopkins/status/2059222630788432359.