Most digital history is not lost in a dramatic fire or a famous bankruptcy. It disappears quietly, a shelf at a time, a filing cabinet at a time, a box of manuals at a time, until nobody remembers who had the last complete set of reference documents for a machine that once mattered enormously.
For years, GEN has been the steward of an extraordinary archive of historic computer documentation: material from roughly 1,800 vendors, covering the period from 1950 to 2005, and comprising more than 700,000 individual assets across PDFs, scanned images, photographs, magazines, brochures, and binaries. In storage terms alone it is vast: around 8TB. In historical terms, it is even larger.
Much of it was scanned by us in the early 1990s from our own physical libraries. More arrived later as other archives closed, downsized, or simply vanished. Over time we became not just a collector, but a custodian of material that in many cases no longer appears to exist anywhere else in organised form.
The problem was never recognising the archive's importance. The problem was scale. Anyone can manually catalogue a few hundred files, or even a few thousand, if they are patient enough. Cataloguing hundreds of thousands of assets from nearly two thousand vendors is something else entirely. That is not a side project. That is a life's work.
We did make attempts over the years. Some vendors were curated by hand, and large names such as IBM and Acorn received serious attention. But the archive kept growing. Donations arrived faster than they could be processed, and much of the incoming material was placed into unsorted holding areas simply so it would not be lost. The result was preservation without access: safe, but not truly usable.
At the start of April 2026, that changed. In a meeting with partners and senior managers, we made a simple decision: either we curate this archive properly, or we pass it to someone else who can. It is too historically significant to spend another decade in limbo.
GEN has been working with AI for a very long time. We were experimenting with transformer-era language technology years before it became fashionable, and we now have substantial in-house compute capacity. Most of the time that infrastructure is waiting for work. This archive finally gave it something worthy to do.
This is exactly the sort of job automation should tackle: not replacing judgement, but making an impossible human task tractable. Large language models are not magicians, and they are certainly not historians. What they are good at is processing large volumes of messy material, extracting structure, and producing usable descriptions at a speed no manual team could realistically match.
So we broke the project into four practical stages: organisation, text extraction, asset understanding, and searchable curation.
The first step was not glamorous. Before you can search or describe an archive, you need to know what is actually in it. We wrote classification tools to iterate across all vendor folders and infer identity from folder names, filenames, and subfolder contents. That allowed us to quickly distinguish material belonging to IBM, CDC, RML, Acorn, and hundreds of others, even where earlier donations had been filed inconsistently.
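In spirit, the per-folder heuristic is simple. Here is a minimal sketch of the idea, assuming a hand-maintained alias table and a flat holding directory; the names, paths, and scoring rule are illustrative rather than our production logic:

```python
from pathlib import Path

# Hypothetical alias table: canonical vendor -> name fragments seen in donations.
VENDOR_ALIASES = {
    "IBM": ["ibm", "international business machines"],
    "Acorn": ["acorn", "bbc micro", "archimedes"],
    "CDC": ["cdc", "control data"],
    "RML": ["rml", "research machines", "380z"],
}

def guess_vendor(folder: Path) -> str | None:
    """Infer vendor identity from the folder name plus the names of
    its files and immediate subfolders."""
    evidence = [folder.name] + [p.name for p in folder.iterdir()]
    text = " ".join(evidence).lower()
    scores = {
        vendor: sum(1 for alias in aliases if alias in text)
        for vendor, aliases in VENDOR_ALIASES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None -> leave in the holding area

for folder in Path("/archive/unsorted").iterdir():  # illustrative path
    if folder.is_dir():
        print(folder.name, "->", guess_vendor(folder) or "UNKNOWN")
```

The real tools weigh their evidence rather more carefully, but the shape of the job is exactly this: gather names, match them against what we know, and only move material when the signal is unambiguous.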
From there came the slow, essential work of digital housekeeping: moving files into the right places, creating cleaner folder structures, and collapsing subfolders that existed only because donated material had been dumped into the nearest plausible location years ago. It was less like programming and more like archaeology with automation.
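Some of that housekeeping can itself be automated safely. A sketch of one such pass: it hoists "wrapper" folders that contain nothing but a single subdirectory, and deliberately refuses to overwrite anything that already exists:

```python
import shutil
from pathlib import Path

def collapse_single_child(folder: Path) -> None:
    """If `folder` contains exactly one subdirectory and nothing else,
    hoist that subdirectory's contents up one level, e.g.
    donation_2003/manuals/* -> donation_2003/*."""
    children = list(folder.iterdir())
    if len(children) == 1 and children[0].is_dir():
        wrapper = children[0]
        for item in list(wrapper.iterdir()):
            target = folder / item.name
            if not target.exists():          # never clobber existing files
                shutil.move(str(item), str(target))
        if not any(wrapper.iterdir()):       # only remove if fully emptied
            wrapper.rmdir()
```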
The archive contains an enormous number of PDFs, but many historic PDFs are not really documents in the modern sense. They are photographs of paper. That means there is no text layer to search, no metadata worth trusting, and no easy way to understand what the file contains unless a human opens it page by page.
For roughly 500,000 scanned PDFs, the answer was OCR. We chose Tesseract, not because it is fast, but because it remains exceptionally strong on awkward source material: skewed pages, old print, sideways scans, handwritten notes, and the general untidiness of documents that have survived for decades by luck rather than design.
We built a distributed OCR stack using Tesseract containers spread across multiple compute nodes, then wrote the orchestration code to feed jobs through the cluster until completion. It is not glamorous work, but it turns image-only documentation into text we can finally index, analyse, and use.
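The production stack runs Tesseract in containers across the cluster, but the per-document step is easy to sketch in Python. This version uses pytesseract and pdf2image as stand-ins for the containerised workers, with the orchestration layer omitted:

```python
from pathlib import Path

import pytesseract                       # Python wrapper around the Tesseract binary
from pdf2image import convert_from_path  # renders PDF pages to images (needs poppler)

def ocr_pdf(pdf_path: Path, out_dir: Path, dpi: int = 300) -> Path:
    """Render each page of an image-only PDF and run Tesseract over it,
    writing a plain-text transcript with form-feed page breaks."""
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    text = []
    for page in pages:
        # --psm 1: automatic page segmentation with orientation and script
        # detection, which copes better with sideways and skewed scans.
        text.append(pytesseract.image_to_string(page, config="--psm 1"))
    out = out_dir / (pdf_path.stem + ".txt")
    out.write_text("\n\f\n".join(text), encoding="utf-8")
    return out
```

Wrap that function in a job queue, point a few dozen workers at it, and you have the essence of the pipeline: slow per document, but embarrassingly parallel across half a million of them.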
Historic archives are not just manuals. They contain board photographs, adverts, screenshots, technical diagrams, magazine covers, installation media, and countless oddities that made perfect sense in 1987 but arrive in 2026 as a mystery file with a bad name.
For image classification we used a combination of vision models, including CLIP-style classifiers for broad categorisation and Qwen where deeper interpretation was useful. This does not magically create perfect art-historical commentary. Old computing equipment is underrepresented in most training data, and many images in this archive may be unique. Still, being able to say with confidence that an image is a circuit board, a front-panel photograph, a product shot, or a technical diagram is an enormous step forward.
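A zero-shot pass with an off-the-shelf CLIP checkpoint illustrates the broad-categorisation step. The model and label set below are illustrative choices, not our production configuration:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Broad archive categories, phrased as captions for zero-shot matching.
LABELS = [
    "a photograph of a circuit board",
    "a photograph of a computer front panel",
    "a product marketing photograph",
    "a technical diagram or schematic",
    "a magazine cover",
    "a screenshot of software",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify(image_path: str) -> str:
    """Return the label whose text embedding best matches the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=LABELS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return LABELS[int(probs.argmax())]
```

Images that land in ambiguous categories, or where the confidence is low, are the ones worth escalating to a heavier vision model like Qwen for a fuller description.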
Binaries are the most difficult category, and we have deliberately placed them later in the workflow. In many cases, context already tells us enough. A disk image found in an Acorn collection is likely to be Acorn-related. Spending large amounts of effort perfectly fingerprinting every obscure binary from 1985 before the documentation itself is curated would be the wrong priority.
Once OCR has produced transcripts and the image pipeline has produced descriptions, the archive becomes something new: not just a pile of files, but a body of text that can be reasoned about.
This is where large language models become genuinely useful. We feed each transcript into an LLM and ask it for a concise listing summary: what the document is, what kind of reader it was meant for, and why it matters. For very large PDFs we do not waste tokens on all 700 pages; taking the opening portion is usually enough to identify the document accurately.
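Sketched against an OpenAI-compatible endpoint, the summarisation step looks like this; the base URL, model name, and truncation length are placeholders for whatever the in-house cluster actually serves:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint works here; host and key are placeholders.
client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")

SYSTEM_PROMPT = (
    "You are cataloguing a historic computer documentation archive. "
    "From the transcript excerpt below, write a concise listing summary: "
    "what the document is, who its intended reader was, and why it matters."
)

def summarise(transcript: str, max_chars: int = 20_000) -> str:
    # The opening portion is usually enough to identify a document;
    # there is no need to spend tokens on all 700 pages.
    excerpt = transcript[:max_chars]
    resp = client.chat.completions.create(
        model="in-house-model",  # placeholder
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": excerpt},
        ],
    )
    return resp.choices[0].message.content
```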
A second stage then groups documents into meaningful sections such as hardware manuals, software manuals, brochures, magazines, and reference material. It also creates an overview of each vendor archive and, where enough material exists, assembles a proper introduction covering company history, product lines, and chronology. In other words, the system does not merely describe files. It begins to describe collections.
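The grouping stage can be sketched in the same style, constraining the model to a fixed label set so the output stays machine-usable rather than free-form:

```python
from openai import OpenAI

client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")  # as above

SECTIONS = ["hardware manuals", "software manuals", "brochures",
            "magazines", "reference material", "other"]

def classify_section(summary: str) -> str:
    """Assign a document summary to exactly one section label,
    guarding against the model replying outside the fixed list."""
    resp = client.chat.completions.create(
        model="in-house-model",  # placeholder
        messages=[{
            "role": "user",
            "content": (
                "Assign this document summary to exactly one of these sections: "
                + ", ".join(SECTIONS)
                + ". Reply with the section name only.\n\n"
                + summary
            ),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in SECTIONS else "other"
```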
We tested the process on two substantial vendors first: IBM and Acorn. The outcome was extremely encouraging. Thousands of files that would previously have required manual opening now had searchable transcripts, readable summaries, sensible grouping, and useful vendor-level context. What had been a warehouse became, at least in those test cases, a library.
After that, we let the system loose. Within days it had consumed hundreds of millions of tokens and was moving through the archive at a speed no traditional indexing project could approach.
Curation is only the first half of the job. Once we have transcripts and descriptions, the obvious next challenge is retrieval. Traditional full-text search would work, and we may still use it in some places, but we want something more ambitious. The plan is to vectorise the archive into our Qdrant cluster and build a human interface on top of it.
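A minimal indexing sketch using the qdrant-client and sentence-transformers libraries gives the flavour; the host, collection name, and embedding model are illustrative choices, not a description of the production cluster:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="qdrant.internal", port=6333)     # placeholder host
embedder = SentenceTransformer("all-MiniLM-L6-v2")           # 384-dim embeddings

# Drops and recreates the collection; acceptable for a full rebuild pass.
client.recreate_collection(
    collection_name="archive",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def index_chunk(text: str, payload: dict) -> None:
    """Embed one transcript chunk and store it with its provenance."""
    vector = embedder.encode(text).tolist()
    client.upsert(
        collection_name="archive",
        points=[PointStruct(id=str(uuid.uuid4()), vector=vector, payload=payload)],
    )

index_chunk(
    "DIP switch settings for the serial interface card ...",
    {"vendor": "Apple", "file": "a2_serial_manual.pdf", "page": 12},
)
```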
That means the archive should eventually become explorable in natural language. Instead of hunting through filenames and manually opening PDFs, a user might ask a direct question such as "What are the DIP switch settings on an Apple II expansion card?" and receive a relevant answer extracted from the right manual. Whether that works beautifully or awkwardly will depend on OCR quality, embeddings, ranking, and interface design, but it is absolutely worth attempting.
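On the retrieval side, the same setup answers a question by embedding it and pulling the nearest transcript chunks, which can then be handed to an LLM as grounding context for a direct answer:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(host="qdrant.internal", port=6333)   # as in the indexing sketch
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ask(question: str, top_k: int = 5):
    """Embed the question and retrieve the most relevant chunks,
    each carrying its vendor/file/page provenance in the payload."""
    vector = embedder.encode(question).tolist()
    return client.search(collection_name="archive", query_vector=vector, limit=top_k)

for hit in ask("What are the DIP switch settings on an Apple II expansion card?"):
    print(f"{hit.score:.3f}", hit.payload.get("file"), "p.", hit.payload.get("page"))
```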
Historic computer documentation is not nostalgia in paper form. It is the operating memory of an industry. It explains how systems were built, configured, repaired, sold, and understood. It preserves the practical knowledge behind machines that shaped business, science, education, and everyday life. Once that reference material disappears, entire chapters of computing history become harder to reconstruct.
What we are doing now is not simply digitisation. The scanning happened long ago. This is the harder stage: rescue through understanding. It is the process of taking one of the world's largest archives of historic computer reference documentation and making it usable before the chance is gone.
There is still a great deal left to do. But for the first time in years, this archive is not just being stored. It is being read, organised, and brought back to life.