Three ways an LLM in the loop breaks reproducibility

Every computational scientist has a story like this. Mine was over a decade ago: I'd built a dataset from two different versions of a genome annotation that disagreed about where things were. The mismatch put me on the wrong path, and it took weeks to work out why the results made no sense.

The field's worst version of that scar is the Duke affair. Gene-expression signatures that claimed to predict which chemotherapy a cancer patient would respond to turned out to be irreproducible from the start; when Keith Baggerly and Kevin Coombes went looking, the tell was almost banal: an off-by-one error, the reported gene sitting one row off from the gene the analysis intended. The sign, in their words, that "somebody is using software they don't understand." Before anyone untangled it, those signatures had been used to enroll over 100 patients across three clinical trials.

That fear has never left me. It's been the recurring nightmare of my career. I work with experimental collaborators who act on my results: they design and run slow, expensive experiments based on the numbers I hand them. There is no room for second-guessing: the result has to be trustworthy enough to commit real lab resources to without re-deriving it first.

Reproducibility was hard, but possible

The field has changed beyond recognition since. We now generate high-dimensional data for almost any assay we can name: far more measured features than samples. That imbalance makes it dangerously easy to be sloppy with the analysis and derive incorrect results. The counterweight is that statistical rigor and the awareness of reproducibility grew alongside it: the methods to handle high-dimensional data got more rigorous, and GitHub, Markdown notebooks, and Docker made the analysis reproducible. It was never easy, but it was possible.

Then we invited the LLM in

An LLM upsets that balance: it is perfectly capable of being the sloppy one.

This really happened to me: I asked Claude 4.6 to build a differential-expression pipeline, and its first instinct was a clean-looking one that runs a t-test gene by gene, ignoring the count distribution and the variance shrinkage that DESeq2, edgeR, and limma-voom exist to handle. It reads well. It runs. It's wrong.

So I named the method ("use limma"), and it built another good-looking pipeline, this time with no covariates in the design. Each version reads well and runs; each is wrong in a way you catch only by reading it, and only if you know what right looks like.
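To see why the missing covariates matter, here is a toy sketch. It is not limma and not the pipeline from that session: the data are hypothetical values for a single gene whose expression depends only on batch, with treatment spread unevenly across batches. The names samples, naive_effect and batch_adjusted_effect are invented for illustration, and the within-batch comparison is only a simple stand-in for putting batch in the design matrix.

```python
from statistics import mean

# Hypothetical toy data: (treated, batch, expression) for one gene.
# Expression depends only on batch (+2.0 in batch 1); the true treatment effect is 0.
samples = [
    (1, 1, 7.0), (1, 1, 7.0), (1, 1, 7.0), (1, 0, 5.0),  # treated: mostly batch 1
    (0, 1, 7.0), (0, 0, 5.0), (0, 0, 5.0), (0, 0, 5.0),  # control: mostly batch 0
]


def naive_effect(rows):
    """Treated-minus-control difference in means, ignoring batch."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def batch_adjusted_effect(rows):
    """Average of the within-batch treated-minus-control differences."""
    batches = sorted({b for _, b, _ in rows})
    return mean(naive_effect([r for r in rows if r[1] == b]) for b in batches)


print(naive_effect(samples))           # 1.0: a pure batch effect, read as treatment
print(batch_adjusted_effect(samples))  # 0.0: the effect vanishes within batches
```

With these numbers the unadjusted comparison reports a difference of 1.0 where the true treatment effect is 0. Both versions run without complaint; only the second one answers the question that was asked.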

The model isn't ignorant of limma or of covariate adjustment: a per-gene t-test, or a model with no covariates, is just the simplest plausible thing, what a decade of intro tutorials reach for. It reproduces the centre of mass of its training data, not the field's current best practice. Left unsupervised it defaults to the popular method; supervised loosely, it still cuts the nearest corner. And the sloppiness wasn't only the model's: I'd assumed it would figure out the right method on its own, because it is so uncannily smart that it becomes easy to stop checking.

So here is the strange thing we are all doing. The model is so good (so fast, so capable, so often right) that we are now willingly inviting a non-deterministic LLM into the middle of the pipeline. Its single most common failure mode is the exact off-by-one from the Duke story, and the why behind every line it writes evaporates by default. The containers and lockfiles still guarantee the run; they say nothing about how the code that fills them got written, and that act is now non-deterministic. It calls for a new layer of practice: one that captures what the LLM gives us and keeps reproducibility standing, just as we adopted reproducible workflows to keep up with high-dimensional data.

Three ways LLM-assisted analysis breaks reproducibility

The standard playbook for reproducible bioinformatics rests on three things: a pinned environment, seeded stochastic steps, and a code path you can trust (version-controlled, reviewed, with the why of each choice captured in a comment, a commit, or a PR thread). Pin the first two and they hold. The third is where an LLM in the loop opens three new cracks.

  1. The decision trail disappears into the chat. In a normal analysis, the choice of Leiden resolution = 0.5 is justified in a code comment ("0.3 collapsed two known cell types; 0.7 over-fragmented") or a commit message. With an LLM in the loop the rationale lives in a paired-programming conversation that gets compacted, scrolled past, or never written down. A future reader sees sc.tl.leiden(adata, resolution=0.5) with no comment and no way to recover why 0.5.
  2. The code trail is opaque. Claude writes code from its context window (your question, the dataset shape, prior cells, library docs) and from its training. The code alone doesn't show what the model saw, what alternatives it weighed, or which model version wrote it. Different model versions can produce different code for the same prompt, and the chat history that drove it is ephemeral by default.
  3. Non-determinism is layered. A hand-built pipeline has one non-determinism axis (the seeded RNG behind clustering, dimensionality reduction, weight init), and seeding it makes the outputs byte-reproducible. An LLM adds a second axis, the generation itself, with no seed: on the current frontier Claude models temperature / top_p / top_k have been removed entirely (they return an error), and temperature=0 never guaranteed identical output anyway, because the numerics depend on the server's batch size (one open-model study logged 80 distinct completions from 1,000 temperature=0 runs of a single prompt). The outputs of a frozen, seeded pipeline stay byte-reproducible; the path to the code does not. Re-run the session tomorrow and you get, at best, the same results from different code, with nothing to pin.
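The first axis in point 3 is the one that can be pinned, and it is easy to check. A minimal sketch, assuming a toy stand-in for the stochastic step: stochastic_step and fingerprint are hypothetical names, and a real pipeline would hash its actual output files instead of a short list of floats.

```python
import hashlib
import random


def stochastic_step(seed, n=5):
    """Toy stand-in for a seeded stochastic step (clustering, embedding, weight init)."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    return [rng.random() for _ in range(n)]


def fingerprint(values):
    """SHA-256 over the exact textual representation of the output."""
    payload = ",".join(repr(v) for v in values).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


run_a = fingerprint(stochastic_step(seed=0))
run_b = fingerprint(stochastic_step(seed=0))
run_c = fingerprint(stochastic_step(seed=1))

print(run_a == run_b)  # True: same seed, same bytes
print(run_a == run_c)  # False: a different seed gives a different output
```

There is no equivalent knob for the second axis. The same fingerprint idea can tell you that the generated code changed between sessions, but not make it come out the same.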

Paired-programming vs. agentic

There are two ways to have an LLM in the loop, and the difference decides how hard these three cracks bite. In paired-programming mode the model writes the code and you run it: the non-determinism lives in the authoring, and you're present for every decision, so you could at least write the rationale down. In agentic mode (Claude Code running the pipeline itself, editing across files, executing cells, inspecting outputs and iterating on its own), the model is in the execution path too, and it makes dozens of micro-decisions you never see. Even the ones you could see arrive so fast, and in such volume, that reviewing each one is hopeless.

Agentic mode doesn't add a fourth challenge; it sharpens all three. The decision trail vanishes faster, because you weren't in the loop for most of the decisions. The code trail stops being one cell you can read and becomes a multi-file tool-call trace. And the non-determinism reaches the results, not just the source: when the agent picks a threshold, annotates a cluster, or interprets a figure, its output feeds the finding directly, and that output has no seed either. The more autonomous the agent, the more the discipline has to catch what you can't.

What discipline can't fix

A few things the discipline can't fix:

  • Prompt wording is a hidden variable. Exactly how you asked Claude to write a function may have shaped its structure or its correctness. There's no good way to capture this short of a full session log, and full logs are too noisy to make sense of after the fact.
  • Model-version archaeology. Within a year or two, "Claude Opus 4.8" may already be a name no one remembers in detail. That model's behaviour isn't preserved anywhere except in the artifacts it produced: your repo. Treat the repo as the canonical record; treat the model name as metadata, not a recoverable dependency.
  • Conversational interpretation. Sometimes the meaning of a result is settled in the chat ("this cluster looks like inflamed macrophages, marker X high and Y low") and never reaches the notebook. Writing it down is the only fix, and it's the discipline that goes first under time pressure.
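For the parts that can be written down (the rationale behind a parameter, the reading of a cluster, the model name as metadata), one possible shape is a small record kept in the repo next to the code. This is a sketch under my own assumptions, not an established format: DecisionRecord, fingerprint_code, to_json_line and the field names are all hypothetical, and the example values are the ones quoted earlier in this post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class DecisionRecord:
    """One analysis decision, written down where the repo can keep it."""
    step: str                 # the call the decision applies to
    choice: str               # what was chosen
    rationale: str            # why, in the analyst's own words
    model: str                # metadata only, not a recoverable dependency
    code_sha256: str          # fingerprint of the code as committed
    interpretation: str = ""  # optional reading of the result


def fingerprint_code(source):
    """SHA-256 of the source text, to tie the record to one version of the code."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def to_json_line(record):
    """Serialize a record as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


source = "sc.tl.leiden(adata, resolution=0.5)"
record = DecisionRecord(
    step=source,
    choice="resolution=0.5",
    rationale="0.3 collapsed two known cell types; 0.7 over-fragmented",
    model="Claude Opus 4.8",
    code_sha256=fingerprint_code(source),
    interpretation="this cluster looks like inflamed macrophages, marker X high and Y low",
)

print(to_json_line(record))
```

Appending each line to a version-controlled file keeps the why in the same history as the code. It does nothing for prompt wording, which stays a hidden variable.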

What to do about it

None of this is an argument against the tools. The speed is real, and so is the leverage. It's an argument for matching your discipline to the blast radius of the work: how far the damage spreads if it's wrong. A scratch exploration that misleads me for an afternoon doesn't deserve the same care as a result a collaborator will build a year of expensive experiments on.

I have been thinking about this problem for some time, and have some half-baked thoughts organized by tiers of blast radius. Writing them down is how I can be more disciplined about reproducibility, so I'm putting them in the next post.