Essay № 032 · Line: Matter · 13 min read
States of a Machine. The Fluctuation We Call Stability

DRAM suffers correctable errors (spontaneous bit changes in memory) far more often than the industry assumed, according to the field study of Google's fleet that Schroeder, Pinheiro and Weber published in 2009. A machine doesn't get tired, but it fails. It heats up, loses precision, drifts in its clock, accumulates errors. To say it's stable because it "has no emotions" is to ignore that it has its own variability; we just don't call it that. To compare a tired human with an AI under intensive use is already to compare two systems with states, not one with a state and another without.

The Axiom Nobody Verifies

The claim is repeated with the cadence of an axiom. Machines are consistent and humans aren't. They don't get tired, don't have a bad day, don't get distracted, don't need sleep, don't fight with their partner before heading to the office. That's why, they say, you should put machines where consistency matters.

The sentence has the air of the obvious.

And like almost everything with the air of the obvious, it doesn't survive half an hour of technical inspection.

The Silicon That Flips a Bit Without Anyone Touching It

Let's start with memory. In 2009 Schroeder, Pinheiro and Weber published a field study of Google's fleet, tens of thousands of machines over two and a half years, and found memory error rates much higher than the industry had been assuming: orders of magnitude above the range the manufacturers gave as a reference. In a non-trivial fraction of the machines, a single DIMM (a physical memory module) logged thousands of correctable errors a year.

Correctable errors are bits that flip on their own, in memory, without anyone touching them. ECC (Error-Correcting Code) repairs them on the fly and the software running on top never notices, although the hardware keeps count. When ECC fails, what goes down is the machine. When ECC works, what you have is an apparently consistent machine that's actually correcting garbage constantly.
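To make the mechanism concrete, here is a minimal sketch of single-bit correction using a Hamming(7,4) code. It is a toy: real ECC memory typically protects 64-bit words with a larger code, and the function names below are hypothetical, chosen only for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, flipped_position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the flipped position
    if position:
        c[position - 1] ^= 1  # repair the bit in place
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a spontaneous flip at codeword position 5
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

The caller gets the original word back and, unless it asks for the position, has no way of knowing a repair took place. That is the apparent consistency the paragraph describes.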

Sridharan and Liberty refined the picture three years later at SC12. Most errors aren't random, but concentrated. There are modules that fail little and modules that fail a lot, and the ones that fail a lot fail in specific spots. A specific cell, a specific row, a specific bank. The machine ages, but it ages unevenly, like a car that wears out one brake disc before the other. This doesn't show up on the spec sheet. It shows up later, in production, when someone has a hundred thousand servers over several years and bothers to look.

Temperature Rules

Any modern processor has a thermal-protection mechanism called throttling: when the silicon exceeds a certain temperature, the chip automatically lowers its clock frequency so it doesn't burn out.

That means, in plain terms, that the same machine doing the same task will take more or less time depending on how hot the room is, how well ventilated the case is, how clogged with dust the fan is. A server in August at three in the afternoon performs worse than the same server in February at four in the morning. The difference is measurable, it's documented in the data sheets from NVIDIA, AMD and Intel, and reliability engineers know it perfectly well.
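A toy simulation shows the shape of the effect. Everything in it is an assumption made for illustration: the 90 °C limit, the heating and cooling coefficients, the 0.1 GHz steps and the function name come from no data sheet, and real governors are considerably more elaborate.

```python
def average_clock_ghz(ambient_c, steps=600, limit_c=90.0, f_max=3.0, f_min=1.0):
    """Toy governor: step the clock down above limit_c, back up once 5 degrees below it."""
    temp, freq, total = ambient_c, f_max, 0.0
    for _ in range(steps):
        if temp > limit_c:
            freq = max(f_min, freq - 0.1)
        elif temp < limit_c - 5.0:
            freq = min(f_max, freq + 0.1)
        total += freq
        # Heating grows with clock speed; cooling grows with the gap to ambient.
        temp += 4.0 * freq - 0.2 * (temp - ambient_c)
    return total / steps


print(average_clock_ghz(18.0))  # cool room: the limit is never reached
print(average_clock_ghz(35.0))  # hot room: the governor has to step the clock down
```

Same chip, same workload, same code. The only input that changes between the two calls is the temperature of the room.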

But the public discourse on machine consistency works as if that variability didn't exist. Or as if it were so small it's not worth mentioning.

It isn't small. A GPU training a large model can lose appreciable sustained performance to thermal throttling if the cooling doesn't keep up. In inference loads with peaks, latency varies much more. A user asking the same model the same question twice at different hours won't get the answer in the same time. Sometimes they won't even get the same answer.

The Software Fluctuates Too

Language models aren't pure functions. When an LLM (Large Language Model) produces text, at each step it chooses the next word (strictly, a token) from a set of candidates according to a probability distribution. That choice isn't deterministic unless it's forced to be, and it's rarely forced, because fully deterministic decoding tends to produce worse outputs.

Controlled Randomness, a Textbook Euphemism

The temperature parameter (the sampling temperature, which regulates how much freedom the model has when choosing the next word) controls that openness. Top-p and top-k are two ways of trimming the candidate set before sampling. In 2020 Holtzman and colleagues published a fairly complete analysis of what changes when you move those parameters, and the short answer is that everything changes.

With the exact same input, the exact same model, the exact same hardware, two consecutive calls can produce different outputs. Not slightly different. Structurally different. A line of reasoning that works in one call and derails from the third sentence in another.
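A minimal sketch of that sampling step, in plain Python, makes the point visible. The scores in `logits` are invented for the example and come from no real model, and `sample_next` is a hypothetical helper rather than any library's API.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token from a dict {token: score} of toy, made-up scores."""
    if temperature == 0:
        return max(logits, key=logits.get)  # greedy: always the top candidate
    # Softmax with temperature: lower values sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most probable candidates
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:  # keep the smallest prefix reaching top_p
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [tok for tok, _ in ranked]
    probs = [prob for _, prob in ranked]
    return rng.choices(tokens, weights=probs, k=1)[0]


logits = {"stable": 2.0, "noisy": 1.5, "broken": 0.5, "alive": 0.1}
print({sample_next(logits, temperature=0) for _ in range(20)})
print({sample_next(logits, temperature=1.0, top_p=0.9) for _ in range(200)})
```

With temperature at zero the set collapses to a single word. With sampling on, the same input yields several different continuations, and each of them becomes the context for every word that follows.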

This is called, within the industry, controlled randomness. It's a kind term. What's happening is that the system fluctuates, that the fluctuation is part of the product, and that without that fluctuation the product would be worse in other dimensions. The machine isn't deterministic in production even if its hardware is on paper.

Consistency, again, is a story told in sales presentations.

The Silicon Ages, Though Not Like Us

There's a process called electromigration (the electrical current literally pushes the metal atoms along the chip's tracks, displacing them over time). There's another called NBTI (Negative Bias Temperature Instability, a slow degradation of a certain type of transistor under sustained voltage and temperature). There's another called hot-carrier injection. Each has its physical mechanism and its time scale, and they all do the same thing for practical purposes.

A chip that meets its specifications today will meet them with less margin in five years, and in ten may have stopped meeting them. Constantinescu published an overview of these trends in IEEE Micro years ago, and the picture it painted hasn't improved. Hennessy and Patterson's own textbook, Computer Architecture: A Quantitative Approach, has for several editions devoted whole chapters to reliability, transient errors and the quantitative model of failure, the exact opposite of treating the machine as a fixed point.

The miniaturization of modern processes, the seven nanometers, the five, the three, aggravates almost all of those phenomena. The smaller the transistor, the fewer atoms compose it, and the more drastically losing a few of them affects it.

Storage has its own version of the problem: bit rot (the slow degradation of stored data without anyone touching it). A magnetic hard drive, untouched, unread, unwritten, gradually loses orientation in its magnetic domains over the years. An SSD loses charge in its cells. Serious file systems now incorporate checksumming (periodic computation of fingerprints to detect changes) and scrubbing (preventive re-reading of all data) precisely because their designers know stored data rots. That is exactly the word used in the technical literature. Rot. The silicon rots, slowly, in silence, while we think of it as a reliable medium.
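The pairing of checksumming and scrubbing can be sketched in a few lines. This is an illustration of the idea under simplifying assumptions, not how any particular file system implements it: the in-memory `store`, the SHA-256 fingerprint and the function names are all choices made for the example.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Checksum recorded when the block is written."""
    return hashlib.sha256(block).hexdigest()


def write_blocks(blocks):
    """Store each block next to the fingerprint computed at write time."""
    return [{"data": bytearray(b), "sum": fingerprint(b)} for b in blocks]


def scrub(store):
    """Re-read every block and return the indices that no longer match."""
    return [i for i, record in enumerate(store)
            if fingerprint(bytes(record["data"])) != record["sum"]]


store = write_blocks([b"alpha", b"bravo", b"charlie"])
store[1]["data"][0] ^= 0x01  # simulate one bit rotting silently on the medium
print(scrub(store))
```

Nothing reported the damaged block until the scrub went looking for it. A system that never scrubs doesn't have less rot, only less knowledge of it.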

A Cloud of States, Not a Fixed Point

This whole litany serves one purpose: to hold that the machine, compared to the human, isn't a stateless system. It's a system with different states, modulated by different variables, manifesting on different scales. But states.

It has temperature. It has wear. It has internal fluctuation. It has degradation. It has transient errors and permanent errors.

When someone says they prefer an AI because it doesn't get tired, what they're actually saying is that they prefer a variability that doesn't present itself as such to a variability that does. The comfortable thing is that the machine doesn't cry, doesn't protest, doesn't complain about the boss. Its fluctuation comes in the format of logs almost nobody reads.

The Nuance That Avoids Demagoguery

Here it's worth introducing the nuance, because without it the comparison turns cheap.

Human fluctuation and machine fluctuation aren't the same. And not because of the magnitude, but because of something more interesting. Hunger regulates search. Fear regulates avoidance. Tiredness regulates rest. A human's affective state, however noisy it may be for a specific workday, has a measurable biological function, selected over hundreds of millions of years because it helps the organism survive and reproduce.

It's functional noise. It's noise with meaning. A tired person who keeps programming makes more mistakes, yes, but they're also receiving a signal from their body to stop. If they stop, rest, come back, they perform better. The loop has a logic.

The Entropy Without Function

The machine doesn't have that logic. The silicon's fluctuation serves it no purpose. It regulates nothing. It doesn't ask the chip to rest. It doesn't help it optimize its energy use toward an evolutionary goal, because the chip has no evolutionary goal. It's noise without function. It's entropy that engineering patches over in the form of ECC, throttling, redundancy, retries, checksums.

When the machine fails well, the patch covers the entropy. When it fails badly, the entropy gets through the patch and out comes the weird bug, the service outage, the incoherent answer that appears once and doesn't appear again when you try to reproduce it. That kind of failure isn't an exception. It's the system's normal state becoming visible for a moment.

Who Gets the Noise Pinned on Them

The industry sells consistency. But compared to what? Compared to the human operator who gets distracted, who gets angry, who slips up at the end of the shift. The comparison has a basis, because the human does indeed fluctuate very visibly and the machine fluctuates very invisibly. But it hides something. The machine is consistent only relative to new hardware, to a model that isn't updated, to a training corpus uncontaminated by attacks or by drift (the progressive change of the data distribution relative to the training one), to an infrastructure whose errors fall within the budget of the SLA (Service Level Agreement).

Anything outside that box stops being consistent. And you leave the box soon enough. A couple of years in production is enough. The Google SRE Book itself, written by the people who keep that infrastructure running, takes for granted that errors happen all the time and builds its whole discipline on error budgets, not on the illusion of zero. It's the industry itself recognizing, in its internal manuals, that consistency is accounting, not physics.
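The accounting is simple enough to write down. In the sketch below the 99.9% target, the request volume and the function names are assumptions picked for illustration, not figures taken from the SRE Book.

```python
def error_budget(target: float, total_requests: int) -> int:
    """Failed requests an availability target tolerates before it is breached."""
    return round((1.0 - target) * total_requests)


def downtime_minutes(target: float, days: int = 30) -> float:
    """Minutes of full outage the same target tolerates over a period."""
    return (1.0 - target) * days * 24 * 60


# A hypothetical service promising 99.9% availability.
print(error_budget(0.999, 1_000_000))
print(round(downtime_minutes(0.999), 1))
```

At that target, a thousand failures per million requests, or roughly three quarters of an hour of outage a month, count as the system working as agreed.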

There's an additional asymmetry worth naming. The human's fluctuation is attributed to the human. The machine's fluctuation is attributed to the user, to the network, to the prompt (the input instruction to the model), to the use case, to the deployment. When a person makes a mistake, the responsibility is theirs. When a computational system makes a mistake, the responsibility is spread across so many actors that in practice it falls on no one.

This isn't a moralizing critique. It's an observation about how risk is distributed in opaque systems. To recognize that the machine has states, that its fluctuation is real and measurable, forces us to redistribute that responsibility differently. If the occasional error isn't an exception but a manifestation of the state, then the manufacturer, the operator and the integrator have to take some of that error as their own. Not externalize it every time onto the user who didn't know how to phrase the query.

The Figures Have Been Published for Decades

To call consistent a system that corrects thousands of bit-flips a year, that lowers its frequency when it's hot, that samples with controlled randomness, that ages in its copper tracks and loses charge in its cells, is a rhetorical decision. It only holds if nobody looks at the figures.

The figures have been published for decades. In SIGMETRICS, in SC, in IEEE Micro, in the big operators' internal manuals. They say the machine is a cloud of states, not a fixed point.

Comparing Two Clouds, Not a Cloud With a Point

Whether the cloud is small compared to the human cloud is debatable, and it depends on the axis you measure on. In stability of workday length, the machine wins. In stability of reasoning under thermal stress, not necessarily. In stability of behavior over five years, the silicon loses much sooner than the word consistency suggests.

Comparing ourselves to the machine isn't comparing a fluctuating system with a fixed one. It's comparing two fluctuating systems that fluctuate for different reasons, on different scales, with different consequences, and whose fluctuation has radically different meanings.

The human fluctuates because it's alive. The machine fluctuates because the silicon doesn't perfectly withstand what we ask of it. That we call one of the two stability and the other weakness says more about us than about the systems.

Definitions

Bit-flip. The spontaneous change of a bit's value in memory, usually with no visible external cause, attributed to cosmic rays, thermal noise or silicon degradation. In modern memories, frequent enough that serious systems incorporate automatic correction mechanisms.

ECC (Error-Correcting Code). An error-correcting code implemented in the memory hardware. It adds redundant bits to each memory word to detect and correct errors on the fly (in the common configuration, correcting single-bit errors and detecting double-bit ones), without the running software noticing they happened.

Thermal throttling. A mechanism by which a processor automatically reduces its clock frequency when it exceeds a certain temperature, to avoid damage. It implies a variable loss of performance that depends on ambient conditions.

Electromigration. A physical phenomenon by which the electrical current displaces atoms from the chip's metal conductors over time, deteriorating the tracks and eventually causing permanent failures.

Bit rot. The slow degradation of information stored on a digital medium, whether magnetic or flash, due to physical causes like charge loss, progressive demagnetization or accumulated errors.

Temperature, top-p, top-k. Parameters that control sampling in language models. Temperature adjusts how much freedom the model has when choosing the next word. Top-p limits the candidates to the subset whose cumulative probability reaches a certain threshold. Top-k limits the candidates to the k most probable.

Drift (of model or data). The progressive change of the real data distribution relative to the distribution a model was trained on, which degrades its performance even though the model itself doesn't change.

SLA (Service Level Agreement). A formal agreement between provider and client that sets thresholds of availability, latency or acceptable errors. Errors within the SLA's budget are counted as normal, not as failure.

References

Schroeder, B., Pinheiro, E. and Weber, W. (2009). DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS. The field study of Google's fleet from which the observations on the frequency of correctable memory errors come. https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

Sridharan, V. and Liberty, D. (2012). A Study of DRAM Failures in the Field. SC12. A refinement of the earlier work, the source of the datum on the non-random concentration of failures in specific modules and cells.

Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. ICLR. The reference for the analysis of controlled sampling and the effects of temperature, top-p and top-k on language-model output. https://arxiv.org/abs/1904.09751

Constantinescu, C. (2003). Trends and Challenges in VLSI Circuit Reliability. IEEE Micro. An overview of electromigration, NBTI and silicon-aging mechanisms referred to in the section on long-term degradation.

Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach (Morgan Kaufmann, 6th ed., 2017). The classic reference textbook for the chapters on reliability and transient errors mentioned in the article. https://www.elsevier.com/books/computer-architecture/hennessy/978-0-12-811905-1

Google SRE Book. Public documentation on failure management at scale, the general context for the claim that errors within the SLA are treated as normal. https://sre.google/sre-book/table-of-contents/
