AI detection is dead: we spent years proving it to ourselves

The Curious Codex

2026-06-09 Published, 2026-06-09 Updated
2647 Words, 14 Minute Read

Richard (Senior Partner)

Richard has been with the firm since 1992 and was one of the founding partners.

Six months ago we quietly switched off our AI-content detection service. The message we sent customers was, more or less: we can no longer provide a reliable determination, so we're withdrawing the service. This is the longer version of that announcement, with the evidence behind it rather than just the conclusion.

The short version: in 2021, telling clearly AI-generated text from clearly human text was a tractable statistical problem, and surface features alone could get you surprisingly far. In 2026, general-purpose text-only detection is no longer reliable enough to support confident real-world verdicts. Worse, it is no longer even a clean question in the first place. Here's why.

1. Why It Used to Work

Early generative transformers (GPT-2, the early GPT-3/3.5 generation, the first wave of open instruct models) had genuine, measurable statistical fingerprints that had nothing to do with how good the writing was:

- Narrow perplexity bands. Smaller models with conservative decoding produced text that was unusually predictable token to token, easy to separate from the long tail of human idiosyncrasy.

- Repetitive structure. Mechanical bigram repetition, suspiciously uniform sentence lengths, paragraphs that all weighed the same.

- House-style leakage. Phrases that leaked straight out of the alignment data ("As an AI language model, I...", "Certainly! Here's...") that simply didn't occur in the human corpus at the same density, because they were artefacts of the training process, not natural language choices.
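Two of those fingerprints are simple enough to sketch in a few lines. The snippet below is a minimal illustration, not the production scorer: the function names and the two-entry phrase list are hypothetical, chosen only to mirror the examples quoted above. Perplexity is left out because it needs a reference language model to compute.

```python
import re

# Hypothetical phrase list for illustration; the two entries are the examples quoted above.
HOUSE_STYLE_PHRASES = ("as an ai language model", "certainly! here's")


def words(text: str) -> list[str]:
    """Lowercase word tokens; apostrophes are kept so contractions stay whole."""
    return re.findall(r"[a-z0-9']+", text.lower())


def repeated_bigram_ratio(text: str) -> float:
    """Share of bigram occurrences that repeat an earlier bigram (0.0 = no repetition)."""
    tokens = words(text)
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)


def house_style_density(text: str) -> float:
    """House-style phrase hits per 100 words."""
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in HOUSE_STYLE_PHRASES)
    total = len(words(text))
    return 100.0 * hits / total if total else 0.0
```

On text from the early models both numbers tended to run high. The argument of this piece is that neither separates the two populations any longer.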

None of that was about quality. It was about the fact that the generating process left fingerprints the human population didn't share. That was a real, exploitable gap, and it's why early detectors could plausibly claim around 90% separation on clearly-AI versus clearly-human text.

That gap is the thing that's gone. Better post-training, broader and more varied fine-tuning data, smarter sampling, and, perhaps most tellingly, millions of users explicitly prompting models to "sound more natural" have closed it from the AI side. As the next sections show, human writing has been drifting the other way to meet it as well.

2. What "Hybrid" Actually Means, and Why We Built It That Way

It's worth being precise about how a system like this goes together, because "AI detector" conjures an image of someone bolting a label onto the output of a single trained classifier and calling it done. That isn't what we built, and we set out why here so that what follows lands as the failure of a serious attempt rather than a strawman.

Our scorer worked in distinct layers:

Normalisation. Before anything gets scored, the text passes through a cleaning stage: markup stripped, entities decoded, whitespace collapsed, hidden characters removed, so that an HTML-formatted document and a plain-text version of the same content end up looking identical to the rest of the pipeline. Score inconsistent representations of the same content and you get inconsistent scores for free.
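As a rough sketch of that cleaning stage, using only the Python standard library. The function name `normalise` and the exact order of steps are assumptions for illustration; a production pipeline would use a real HTML parser rather than a regular expression.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def normalise(raw: str) -> str:
    """Reduce HTML-ish or plain input to one canonical plain-text form."""
    # Strip markup first (crude: assumes no literal '<' in the prose itself).
    text = TAG_RE.sub(" ", raw)
    # Decode entities such as &amp; and &nbsp;.
    text = html.unescape(text)
    # Fold compatibility characters, e.g. non-breaking space -> ordinary space.
    text = unicodedata.normalize("NFKC", text)
    # Drop hidden format characters (zero-width space, BOM, soft hyphen: category Cf).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Collapse every run of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()
```

The property that matters is that two representations of the same content come out identical, so `normalise("<p>Hello&nbsp;&amp; welcome</p>")` and `normalise("Hello &  welcome")` should return the same string.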

Deterministic feature extraction. This is the rule-based half: purely mechanical measurements taken from the cleaned text. Sentence length and how much it varies, lexical diversity, punctuation density and consistency, the ratio of repeated phrases, whether the text falls into list structures or suspiciously uniform paragraphs, plus matching against curated phrase lists built up over years (boilerplate openers, low-value "helpfulness" filler, hedging language, casual shorthand, and so on). No model involved at any point. Just arithmetic over text.
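A few of those measurements can be sketched as plain arithmetic over the cleaned text. This is an illustrative subset only: the feature names, the sentence-splitting rule and the punctuation set are assumptions, and the curated phrase lists are not reproduced here.

```python
import re
import statistics


def extract_features(text: str) -> dict[str, float]:
    """Purely mechanical measurements over already-normalised text."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    lengths = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    mean_len = statistics.fmean(lengths) if lengths else 0.0
    # Coefficient of variation: low values mean suspiciously uniform sentences.
    length_cv = statistics.pstdev(lengths) / mean_len if mean_len else 0.0
    punctuation = sum(1 for ch in text if ch in ".,;:!?")
    return {
        "mean_sentence_length": mean_len,
        "sentence_length_cv": length_cv,
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "punctuation_per_100_chars": 100.0 * punctuation / len(text) if text else 0.0,
    }
```

Nothing here needs a model, which is the point of this layer: the same input always gives the same numbers.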

Model-based rubric judgement. The second half hands the cleaned text to a language model, but not with a blunt "is this AI generated, yes or no". It is asked to score a fixed set of dimensions, things like genericity, template-likeness, structural uniformity, emotional authenticity and conversational spontaneity, and to back each judgement with short quoted evidence plus at least one counter-signal. Treating the model as a structured evaluator working to a rubric, rather than as an oracle handing down a verdict, was meant to make its output something we could sanity-check and calibrate, not something we simply had to trust.
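One way to make that kind of output checkable is to validate it before it is allowed anywhere near a score. The sketch below is an assumption about shape, not the actual rubric: the dimension names come from the paragraph above, while the `Judgement` type, the 0-10 scale and the rule that evidence must be a verbatim substring are illustrative choices.

```python
from dataclasses import dataclass

# Dimension names are the ones listed above; the 0-10 scale is an assumption.
DIMENSIONS = (
    "genericity",
    "template_likeness",
    "structural_uniformity",
    "emotional_authenticity",
    "conversational_spontaneity",
)


@dataclass(frozen=True)
class Judgement:
    dimension: str
    score: float         # assumed scale: 0 (absent) to 10 (pronounced)
    evidence: str        # short verbatim quote from the scored text
    counter_signal: str  # observation that cuts against the score


def validate_rubric(text: str, judgements: list[Judgement]) -> list[str]:
    """Return a list of problems; an empty list means the response is usable."""
    problems = []
    seen = {j.dimension for j in judgements}
    for missing in sorted(set(DIMENSIONS) - seen):
        problems.append(f"missing dimension: {missing}")
    for j in judgements:
        if j.dimension not in DIMENSIONS:
            problems.append(f"unknown dimension: {j.dimension}")
        if not 0.0 <= j.score <= 10.0:
            problems.append(f"{j.dimension}: score out of range")
        if not j.evidence or j.evidence not in text:
            problems.append(f"{j.dimension}: evidence is not a quote from the text")
        if not j.counter_signal:
            problems.append(f"{j.dimension}: no counter-signal given")
    return problems
```

Rejecting any judgement whose quoted evidence does not appear in the text is a cheap guard against a model inventing its own support.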

Weighting and calibration. The two halves don't get an equal vote. Years of testing showed us the model's own rubric judgement carried a systematic bias of its own (it tended to read genuinely AI-authored text as more human than the deterministic features did), so the final blend leaned heavily towards the deterministic side, with the model's contribution scaled down and treated as a corrective nudge rather than the deciding voice. Sitting on top of that blend is an uncertainty adjustment that deliberately pulls the final score back towards the middle whenever the text is short, the two halves disagree, or the evidence is generally thin, rather than letting the system force out a confident-looking number from weak signal.
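The shape of that blend can be shown with a toy function. Every number in it is an assumption made for illustration: the 0-100 scale with 50 as the midpoint, the 0.25 model weight, the 300-word threshold and the way agreement is measured are not the production values.

```python
def blend_score(
    deterministic: float,
    model: float,
    word_count: int,
    model_weight: float = 0.25,        # illustrative: the model is a nudge, not the deciding voice
    full_confidence_words: int = 300,  # illustrative: shorter texts are shrunk harder
) -> float:
    """Blend two 0-100 scores, then pull the result towards 50 when evidence is weak."""
    blended = (1.0 - model_weight) * deterministic + model_weight * model
    # Short texts carry less evidence.
    length_confidence = min(1.0, word_count / full_confidence_words)
    # Disagreement between the two halves also lowers confidence.
    agreement = 1.0 - abs(deterministic - model) / 100.0
    confidence = length_confidence * agreement
    return 50.0 + confidence * (blended - 50.0)
```

With these illustrative settings, two halves that agree on 80 for a 300-word text return 80, the same agreement on a 30-word text returns 53, and a 90 versus 10 disagreement on a long text returns 54. Weak or conflicting evidence produces a number near the middle rather than a confident-looking one.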

That's what "hybrid" means here: not a model with a coat of paint, but a deliberately layered system in which mechanical measurement, structured model judgement and explicit calibration each do a distinct job, and the final number is something our own code computes and constrains rather than something the model simply announces. We're labouring this point because it matters for what comes next: this wasn't a naive setup that anyone could have predicted would fail on day one. It was a carefully engineered attempt to compensate for exactly the kinds of bias and drift we already knew to expect, built by people who'd been doing this for years. And it still couldn't hold the line.

3. The Bit Almost Nobody Talks About: The Internet Is Feeding Back on Itself

Here's the part of this that doesn't get nearly enough attention. It isn't just that humans now use AI to draft things. It's that a large and growing share of the content people read, to learn, to research, to form opinions, to work out how to phrase something, is itself LLM-generated, often with little or no clear disclosure.

This isn't speculation on our part. Book publishers have said publicly that they are comfortable publishing LLM-generated work. Governments are explicitly pushing AI use into schools and classrooms. Educational material is increasingly being generated this way because it is faster and cheaper than commissioning it from people. Even heavily edited public resources are no longer insulated from LLM-generated additions, precisely because there is no reliable gate to stop them at.

Someone researching, say, African geography for a school project, or for a piece of their own writing, is now quite likely to land on pages substantially written or rewritten by a model. They read it, absorb the phrasing, the structure and the rhythm, then go off and write their own piece in their own voice, with no idea that the texture of what they've just produced has been shaped by machine-written prose. Nobody prompted anybody. No "AI assistance" took place in any sense a person would recognise or admit to. But the influence is there, baked into habit and instinct, the same way anyone's writing is shaped by whatever they spend the most time reading.

That isn't a hypothetical future risk. It has already happened, it's compounding, and there is no way to undo it. It means the "human" side of this comparison is no longer a clean population either: it's a population that has been quietly marinating in machine-generated prose for years, much of it published in places people still instinctively trust. You cannot calibrate a detector against a stable human baseline when the baseline itself is being slowly rewritten, in plain sight, by the very thing you're trying to detect, and the absence of working detection is exactly what's letting that rewriting carry on unchecked.

4. We Tested It Properly, and It Couldn't Separate the Obvious Cases

The numbers below are a small, illustrative slice of a much larger process. Our internal tuning rig pushes roughly 400 documents through the scorer in a single pass, which takes about an hour end to end, and we have run hundreds of those batches over multiple years across both llmdetect and, later, llmscore. Those batches combine model-generated material from a range of systems with roughly 200 hand-crafted human documents drawn from our own business writing: helpdesk replies, proposals, technical notes and internal communications.

We do track aggregate scoring behaviour over time, and one of the clearest signals of decline was that the clean binary separation we were seeing in 2021 had collapsed by the end of 2025 into a muddier 20-70% band. We do not store or trawl through customer documents to build these sets; our systems are deliberately designed not to retain client content, which means that if someone wants a disputed result rechecked they have to supply the source material again.

We reran the process whenever a new model generation appeared, specifically to check whether the detector still had enough signal left to separate the two populations. Often we could see the slippage starting in the aggregate scores shortly after a model vendor released a new generation, at which point we would generate 200-plus fresh samples against that model, run them through the scorer, test, retune, and repeat. Many hours of work went into that cycle.

What follows is a handful of samples pulled out of that wider exercise: small enough to show here in full, and chosen because they make the point cleanly.

We built a small adversarial set: a mix of text samples written deliberately to be maximally AI-stereotyped (every classic "tell" turned up to eleven: generic openers, "feel free to reach out", "going forward", numbered action items, goodwill gestures, perfectly symmetrical paragraphs) alongside genuinely human-written samples (typos, slang, irritation, an off-the-cuff technical document about Linux upgrade cycles). We ran it live against our production scorer, the same configuration real traffic gets, and took the numbers as they came back.

Label	Expected	Score	Verdict
Product sunset announcement (textbook corporate tone)	AI	22.9	somewhat human-leaning
LLM deliberately attempting a casual tone	AI	34.8	somewhat human-leaning
Project status update (uniform structure, "going forward")	AI	66.9	somewhat AI-leaning
Partnership proposal (every cliche present, by design)	AI	64.8	somewhat AI-leaning
Messy quick correction (typos, slang, no punctuation)	Human	9.3	strongly human-leaning
Casual internal document ("tmrw", "wont", "ping me")	Human	14.0	strongly human-leaning
Frustrated customer complaint (angry, terse)	Human	10.1	strongly human-leaning
Careful technical document (Debian LTS / Ceph, written off-the-cuff)	Human	31.6	somewhat human-leaning

Three things jump out:

- Of the samples we wrote specifically to read as AI-generated, one was scored as more human than a real human's careful technical document (22.9 versus 31.6), and a second landed barely three points above it (34.8), in the same "somewhat human-leaning" band. We deliberately stuffed one with corporate sunset-announcement boilerplate and had a model try on a "casual" voice for the other, and the scorer put both in the same band as someone genuinely explaining Ceph OSD recovery.

- Across the wider AI-labelled set this was drawn from, the average score landed at 46, bang in the "indeterminate" band, and not one sample reached "strongly AI-leaning". We prompted these to be as obvious as we could possibly make them, and the system's honest answer was, on average, "I genuinely cannot tell."

- The rule-based half of the scorer (the half explicitly built to fire on phrases like "feel free to reach out" and "going forward") returned scores of essentially zero on the very examples saturated with exactly those phrases. Not because the phrases weren't there. Because, after years of tuning, we'd had to strip those phrases out of the rule set one by one: they had started appearing constantly in real human business writing too, and kept producing false positives on genuine people. The rules built to catch AI had to be defanged until they could no longer catch anything, AI or otherwise.
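The arithmetic behind those observations is easy to check against the eight rows in the table. The snippet below only restates the published scores; the shortened labels and variable names are ours for illustration.

```python
from statistics import fmean

# Scores copied from the table above: (label, expected origin, score).
SAMPLES = [
    ("Product sunset announcement", "AI", 22.9),
    ("LLM attempting a casual tone", "AI", 34.8),
    ("Project status update", "AI", 66.9),
    ("Partnership proposal", "AI", 64.8),
    ("Messy quick correction", "Human", 9.3),
    ("Casual internal document", "Human", 14.0),
    ("Frustrated customer complaint", "Human", 10.1),
    ("Careful technical document", "Human", 31.6),
]

ai = [score for _, origin, score in SAMPLES if origin == "AI"]
human = [score for _, origin, score in SAMPLES if origin == "Human"]

mean_ai = round(fmean(ai), 2)        # 47.35
mean_human = round(fmean(human), 2)  # 16.25
# AI samples that scored below the most AI-like human sample.
ai_below_top_human = [s for s in ai if s < max(human)]  # [22.9]

print(mean_ai, mean_human, ai_below_top_human)
```

The four AI rows shown average 47.35, in line with the 46 reported for the wider set, and the only thing pulling the human average down to 16.25 is the three hurried, informal samples.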

The moment a human writes carefully, formally or technically, they land in exactly the same fog as the AI samples. A detector that can only reliably identify "this person was typing in a hurry" isn't an AI detector. It's a hurry detector.

5. The Journey

This wasn't a snap decision, and it wasn't taken lightly either.

Long before llmscore there was llmdetect, which we first put together back in 2021, right when this kind of thing genuinely had a fighting chance. And it worked, more or less, in the way the early public detectors worked: the gap between AI text and human text was wide enough back then that fairly blunt statistical tricks could straddle it.

Then it started to slip. Slowly, at first. Every time a new model generation landed, we'd notice the accuracy had quietly dropped a notch, go back in, retune the heuristics, reweight the blend, add a rubric dimension here, strip a phrase list there, retest against fresh samples, and claw a bit of it back. Then the next generation would land and we'd be doing it all over again, a little worse off than before. Hundreds of hours of it, spread over years. Testing. Analysing. Tweaking. Adjusting. Reformulating. Testing again. Watching the ground slowly disappear from under us no matter how hard we dug in to hold it.

Somewhere around the end of 2025 we simply ran out of road. Not because we stopped caring, and not because some cleverer team could have cracked what we couldn't. The signatures had decayed to the point where the thing we were trying to measure had stopped existing in any form stable enough to determine from the text alone.

So we went and checked we weren't alone in this, because we're not arrogant enough to assume nobody else had noticed what we had. We wrote our own test samples, text whose true origin we knew for certain because we'd written it ourselves, and ran it through several of the other commercial AI-detection tools still being sold out there, including systems marketed with high headline accuracy numbers. It took almost no effort at all to get them to call human writing AI, and AI writing human, on demand. Whatever it was we'd been missing, everyone else was missing it too.

And honestly, that's the part that decided it for us more than any number on a spreadsheet. Because once you know something like that, properly know it, not as a statistic but as a thing you've watched happen over and over with your own test cases, you can't keep taking someone's money for a service you no longer believe does what it claims to do.

We know precisely how flawed this is, because we built it, broke it, rebuilt it and broke it again, for the better part of five years. Charging a subscription for a verdict you know is closer to a coin flip, when that verdict might cost someone their job, or land a kid in front of a disciplinary panel for something they didn't do, stops being a business decision at some point. It just becomes wrong. So we stopped doing it, and we told our customers exactly why.

About GEN

GEN has been working with machine learning since 1999, long before "AI" was a marketing word. We run our own compute clusters, train and fine-tune our own models, and write our own post-processing and guardrail layers rather than wrapping someone else's API and hoping for the best. We provide commercial AI solutions to enterprise customers globally, with confidentiality and security as a core part of what makes our systems valuable. Client prompts and embeddings are never stored, indexed, shared, captured, or used for training. We believe large language models are most useful when they are applied carefully and appropriately to real-world problems, not stuffed into a workflow as an afterthought.

--- This content is not legal or financial advice & Solely the opinions of the author ---