What is Big Data?

A good while ago, somewhere around 2000, GEN began providing 'database' as a service. We were one of the first to offer such a service, and the need had arisen because data workloads were ever expanding and yet hardware was still very expensive. Our first customer for that platform was an insurance company who wanted to be able to cross-reference their policyholders with other datasets, such as census data, credit reference data and postcode data. This involved billions of rows of data that needed to be joined together, sometimes in a fuzzy way, and achieving this in a timely manner required serious processing power.

As the years rolled by, the demands grew, and 'big data' became a thing. By 'big data' we mean a computing platform that is specifically designed for processing very large volumes of data, very fast. The work generally falls into two flavours.

Complex

We may have a source table of a hundred million rows that we need to join with census data of fifty million rows, postcode data of two million rows, local council data of four million rows, and then credit reference data, cameo data, fraud prevention data, insurance data, flood risk data, crime statistics, and the list goes on. This kind of data enrichment is used by companies to provide context to their decision making and business processes. We also regularly 'wash' data for customers before they can use it, by removing records that appear on exclude lists like the TPS or SPA data. This is performed automatically through a process often called a 'data pipeline', which is basically upload -> process -> download. This kind of batch and data pipeline processing, with complex queries, probably accounts for about one third of the big-data market.
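
As a rough illustration of the wash and enrich steps described above, here is a minimal Python sketch. It is not our production pipeline: the sample rows, the field names (phone, postcode, flood_risk) and the function names wash, enrich and pipeline are all hypothetical, and a real job would run inside the database engine rather than in Python lists.

```python
# Hypothetical sample data; every field name here is illustrative only.
customers = [
    {"id": 1, "name": "A. Smith", "postcode": "AB1 2CD", "phone": "01000000001"},
    {"id": 2, "name": "B. Jones", "postcode": "EF3 4GH", "phone": "01000000002"},
    {"id": 3, "name": "C. Brown", "postcode": "ZZ9 9ZZ", "phone": "01000000003"},
]

# Stand-in for a reference dataset keyed by postcode.
postcode_data = {
    "AB1 2CD": {"flood_risk": "low"},
    "EF3 4GH": {"flood_risk": "high"},
}

# Stand-in for an exclude list: records matching these are removed.
exclude_phones = {"01000000002"}


def wash(rows, excluded):
    """Drop any row whose phone number appears on the exclude list."""
    return [row for row in rows if row["phone"] not in excluded]


def enrich(rows, lookup):
    """Left join: keep every row, adding reference fields where a match exists."""
    enriched = []
    for row in rows:
        extra = lookup.get(row["postcode"], {"flood_risk": None})
        enriched.append({**row, **extra})
    return enriched


def pipeline(rows):
    """The 'process' step of upload -> process -> download."""
    return enrich(wash(rows, exclude_phones), postcode_data)


result = pipeline(customers)
for row in result:
    print(row["id"], row["postcode"], row["flood_risk"])
```

Washing before enriching means the join only touches rows that will survive, which matters far more at a hundred million rows than it does in this toy example.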

Simple

We have ecommerce companies that need lightning-fast processing of transactional data, such as pulling a list of products for a search term, a basket, or a list of past orders. These requests are simple and often use a key/value or document store approach, like MongoDB, to provide the performance. We had one customer during the pandemic who were throwing over thirty thousand requests a second at our data platform to handle travel booking and certification data. This kind of fast but simple processing is the other side of big data, and is probably two thirds of the big-data market.

History

When GEN first launched a 'database' service, it was called GENDATA and consisted of a Sybase SQL and a DB2 database engine hosted on two HP DL360 G3 servers. Storage was provided by 7.2k drives and was slow, and we only had 48GB of RAM which, whilst massive at the time, didn't help. In those days, though, the workloads were much smaller and the demands lighter.

As time went on, demands grew and so did impatience. Where a data processing task would once take a night to run, customers now wanted it faster, so we had to scale up the deployment, and as both Sybase and DB2 became obsolete, we moved to Oracle and doubled the hardware. This new deployment was capable of handling much larger datasets, but it still couldn't run everything in RAM and some workloads still took a while to run. Around this time, MySQL was growing in popularity, but Oracle's purchase of it made it unattractive. We'd already had a good few years of 'Oracle' and we knew how the company worked and how problematic its licensing requirements were. With the MariaDB fork, we took the decision to leave Oracle and go with MariaDB.

Our current data lake consists of 8 HPE-580-G10 servers with 512GB of RAM and a fibre channel storage fabric, working as a cluster to provide both MariaDB and MongoDB in a load-balanced environment. Demand for this data cluster has increased significantly because it is now available through our virtualisation and containerisation platforms for larger customer workloads.

Data Lake vs Data Warehouse

People often confuse a Data Lake with a Data Warehouse, but they are fundamentally different.

Data Lake

A Data Lake is a collection of data that is stored in either a structured or unstructured format, at any scale. Data lakes are generally used to store data for tasks such as data mining, realtime analytics, machine learning and AI. Data lakes do not need a rigid schema and can store anything, even vector data and images. We often describe data lakes as schema-on-read, which means we define the data structure when we read the data, in the format we require, which need not be the format in which it is stored.
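
To make schema-on-read a little more concrete, here is a minimal Python sketch. The raw records, their field names and the read_orders function are all invented for illustration; the point is only that the records are stored exactly as they arrived, and the 'order' structure is applied at the moment we read them.

```python
import json

# Raw records stored as-is. They are hypothetical and deliberately
# inconsistent: different fields, different types, even unrelated data.
raw_lake = [
    '{"customer": "A. Smith", "total": "19.99", "items": 2}',
    '{"customer": "B. Jones", "total": 5, "note": "gift"}',
    '{"sensor": "t1", "reading": 21.5}',
]


def read_orders(lines):
    """Apply an 'order' schema at read time, skipping records that don't fit."""
    for line in lines:
        record = json.loads(line)
        if "customer" in record and "total" in record:
            yield {
                "customer": str(record["customer"]),
                "total": float(record["total"]),      # "19.99" and 5 both become floats
                "items": int(record.get("items", 1)),  # assumed default when missing
            }


orders = list(read_orders(raw_lake))
for order in orders:
    print(order)
```

A different reader could pull sensor readings out of the same raw_lake with a different schema, without anything being rewritten in storage.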

Data Warehouse

A Data Warehouse, on the other hand, is a collection of data stored in a structured format (typically queried with SQL) that is used for large-scale business intelligence, reporting and analysis. We often refer to a data warehouse as schema-on-write, which means the structure is defined before the data is written, so the data is stored and retrieved in that same format.
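
By way of contrast, schema-on-write can be sketched with Python's built-in sqlite3 module. SQLite is used here purely because it ships with Python, not because it is a data warehouse engine, and the sales table and its columns are hypothetical. The structure is declared first, and a row that does not fit is rejected when it is written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is fixed before any data is written.
conn.execute(
    "CREATE TABLE sales ("
    " department TEXT NOT NULL,"
    " amount REAL NOT NULL CHECK (amount >= 0)"
    ")"
)

conn.execute("INSERT INTO sales VALUES (?, ?)", ("toys", 20.0))
conn.execute("INSERT INTO sales VALUES (?, ?)", ("toys", 5.0))

# A row that breaks the schema (no department) fails at write time.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?)", (None, 5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Reporting queries can then rely on every stored row having the same shape.
totals = conn.execute(
    "SELECT department, SUM(amount) FROM sales GROUP BY department"
).fetchall()
print(rejected, totals)
```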

Vectors and AI

One notable change over the last few years is the rise of Vector databases. Vector databases store data in a special format that allows relationships to be inferred quickly. As a very basic example, if we vectorise a sentence, we get a vector of numbers that represents the words in the sentence. Then we can use these vectors to find the closest match to a new sentence. The structure of vector databases is way outside the scope of this article, but their use is becoming mainstream. We already use vector data from our data lake to power our machine learning and AI solutions, and one common example is AI powered assistants. We have a solution where we use vector data to store a company's knowledge base, then we pass this into an AI model together with a prompt (query), and the model extrapolates an answer from the knowledge base. (We have an interesting demo of this available on our website where we match an AI powered response against a more conventional elastic search solution.)
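
The 'closest match' idea can be sketched in a few lines of Python. This is a toy: the vectorise function below just counts words, whereas real systems use learned embeddings and a proper vector index, and the knowledge_base sentences and function names are made up for the example. What it does show is the core operation, which is turning text into vectors and ranking them by cosine similarity.

```python
import math
from collections import Counter


def vectorise(sentence):
    """Toy stand-in for an embedding model: a sparse vector of word counts."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (1.0 means same direction)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical knowledge base entries.
knowledge_base = [
    "how to reset your password",
    "opening hours for the support desk",
    "how to update your billing details",
]

# Vectorise once up front, as a vector database would at write time.
index = [(sentence, vectorise(sentence)) for sentence in knowledge_base]


def closest(query):
    """Return the stored sentence whose vector is most similar to the query."""
    query_vector = vectorise(query)
    return max(index, key=lambda item: cosine(query_vector, item[1]))[0]


match = closest("reset my password")
print(match)
```

In an assistant-style solution, the matched entries are what would be passed to the AI model alongside the prompt.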

Data Analytics

A decade ago, data analytics was based on SQL queries that ran against a data warehouse to provide things like sales figures by department, product, area, customer and so on. We were very good at it and could even produce some probability-based projections on future sales, but it was all math based, and was really volume dependent. If we had 5000 rows of data, any probabilistic projections were weak at best, but with 500 million rows, we could be very confident in our projections. Then large language models became a thing, and suddenly we were able to do a lot more with data. By training a model on a company's past sales data, we were able to make more confident predictions about the future. This is because in math-based probabilistic modelling we can only consider a very limited set of parameters, but with LLMs we're now getting relational predictions based on all the data. It has to be stated at this point that it's not perfect by a long way, and sometimes the output we get is just plain nonsense, but we're getting better all the time, and the insights we're seeing are helping to drive business decisions.
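
One simple way to see why math-based projections are so volume dependent is the textbook margin of error for an estimated proportion, which shrinks with the square root of the number of rows. The Python sketch below assumes simple random sampling and a 95% confidence level (z = 1.96); it is a general statistical illustration, not the specific modelling we ran for customers, and the function name is our own.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n rows."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst case is p = 0.5, where the variance p * (1 - p) is largest.
small = margin_of_error(0.5, 5_000)
large = margin_of_error(0.5, 500_000_000)

print(f"5,000 rows:       +/- {small:.4%}")
print(f"500,000,000 rows: +/- {large:.4%}")
print(f"improvement:      {small / large:.0f}x")
```

Bear in mind that 5000 rows rarely stay as 5000 rows: once they are split by department, product and area, each slice may hold only a handful of records, and the margin for each slice widens accordingly.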

As an example of one such AI driven solution, a cosmetic surgery company with many physical locations, doctors and clients wanted to analyse past clients' procedures and predict how many were likely to come back for future procedures, and what those procedures would be. The obvious aim of this query was to shape business development and capacity planning. Their combined client data was around 7 million records, with 440 different procedures, across 71 sites internationally. We vectorised all this data, and then trained an AI model to predict the probability of a client coming back for any procedure, and even what that procedure might be. Now, 7 million records sounds like a lot of data, but it's actually not enough to provide any firm insights. It is enough to get a general idea, though, and after refactoring the data a few times, we were able to get some meaningful data out. This high level predictive data is not 100%, or even 80% accurate, and there will undoubtedly be anomalies, but that's the current level of the technology. What we can do is then attempt to validate the predictive data with math-based probability modelling to provide a confidence factor, and that's what we did.

GEN

GEN have been in the 'big data' arena since GENDATA launched in 2000, and today our data processing clusters provide a high performance solution for companies of all sizes. If you have a need for large scale data processing or data vectorisation, then please contact us for a quote. Unlike most of our competitors, we don't just give you some credentials and leave you with it; we work with you every step of the way to ensure your requirements are met and you get the results you need.