Cold and Hot Spares in Cloud Computing

The Curious Codex

2024-10-10 Published, 2024-10-28 Updated

The Author
GEN UK Blog

Matt (Virtualisation)

Matt has been with the firm since 2015.

 

Today I was working on an incident from a new customer that came in as a P1: a cloud server that was dead. Their cloud provider offered to 'reboot' it for them, but they could do that themselves, and it didn't work. The provider gave no further help. This is how the industry segments: cheaper cloud providers are 100% DIY, "your VM, you deal with it", whereas managed cloud providers offer a wealth of technical support on the VMs they host. Pick the one that best suits your abilities and clearly establish the level of support you require.

Anyway, in this case we had to get it back up ASAP. I don't want to drone on about how we restored the service; what's important here is that we could have done it far faster with a cold spare.

Images, Snapshots, Replication?

There are two main storage concepts in cloud architecture: images and snapshots. An image is a complete copy of the virtual machine, including its configuration and local storage, whereas a snapshot is a record of the changes between two points in time.

Snapshots are a quick, easy and fairly reliable way to 'roll back' changes, so we use them before updating the OS or applying risky patches. If anything fails to work as expected, we can roll back to before the change within minutes.
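The snapshot-before-change routine can be sketched roughly as below. This is only an illustration: FakeVM and change_with_snapshot are hypothetical names, and the in-memory class stands in for whatever snapshot API your virtualisation platform actually provides.

```python
from typing import Callable, Dict, List


class FakeVM:
    """In-memory stand-in for a VM (hypothetical, not a real platform API)."""

    def __init__(self, state: Dict[str, str]) -> None:
        self.state = dict(state)
        self._snapshots: Dict[str, Dict[str, str]] = {}

    def take_snapshot(self, name: str) -> None:
        # Record the current state so it can be restored later.
        self._snapshots[name] = dict(self.state)

    def rollback(self, name: str) -> None:
        self.state = dict(self._snapshots[name])

    def delete_snapshot(self, name: str) -> None:
        self._snapshots.pop(name, None)

    def snapshot_names(self) -> List[str]:
        return list(self._snapshots)


def change_with_snapshot(
    vm: FakeVM,
    change: Callable[[FakeVM], None],
    health_check: Callable[[FakeVM], bool],
    name: str = "pre-change",
) -> bool:
    """Snapshot, apply the change, roll back if the health check fails."""
    vm.take_snapshot(name)
    try:
        change(vm)
        ok = health_check(vm)
    except Exception:
        ok = False
    if not ok:
        vm.rollback(name)
    # Remove the snapshot either way so none are left hanging around.
    vm.delete_snapshot(name)
    return ok


def bad_update(v: FakeVM) -> None:
    v.state["kernel"] = "broken"


vm = FakeVM({"kernel": "6.1"})
ok = change_with_snapshot(vm, bad_update, lambda v: v.state["kernel"] != "broken")
print(ok, vm.state)
```

The snapshot is deleted whether the change succeeds or not, which keeps it short-lived and attached to the VM only for the duration of the change.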

Snapshots, however, are not perfect. Going forward again once you've gone backwards (if your virtualisation platform allows it) is fraught with issues, you can't take them on non-quiescent servers without risk, and having snapshots attached to a virtual machine will, again on some platforms, make live migration and HA a problem.

Images, on the other hand, are complete copies including everything: OS, data, setup, the lot. Switching between images, whilst not as simple as with snapshots, can be done relatively easily, but slowly.

Images also have a super-power: they can be converted into other VMs. In fact, we use images when provisioning new clients. We have images for Nextcloud, TrueNAS, Docker etc., and we just spin up a new VM from the image with all the software installed and partially configured; once booted, scripts take over to do the remainder.

Replication is a concept where data, be that an image or a storage volume, is continuously replicated to another location. This is usually block-level replication, and it covers either the entire VM or individual storage volumes.

So, back to Cold Spares

Many clients logically separate the 'system' and the 'data', as they should, and back up the data separately from the system. In a lot of smaller configurations the VM is backed up daily, but the data is offloaded hourly, or twice daily, or whatever the client needs. This is important because backing up the 'data', especially with databases, requires careful consideration - you can't just copy the database files, you need to use the tools provided to perform a safe, restorable backup.
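As a small illustration of "use the tools provided", here is a sketch using SQLite purely as an example database (the article doesn't name one, so that choice is an assumption). Python's sqlite3 module exposes SQLite's online backup API through Connection.backup, which produces a consistent copy even while the source database is in use. The function name backup_sqlite and the paths are hypothetical.

```python
import sqlite3


def backup_sqlite(source_path: str, dest_path: str) -> None:
    """Copy a SQLite database using the online backup API, not a file copy."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(dest_path)
    try:
        # Copies the whole database in a consistent state.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Other database engines ship their own dump or backup tooling, and the same principle applies: use that rather than copying the files out from under a running server.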

Having data and OS isolated like this allows for VFR (very fast recovery) of server failures due to configuration corruption, or data corruption, using cold spares.

Let's assume an administrator ran a kernel driver update that failed, and upon rebooting the server fails to come up properly. In this scenario a snapshot rollback is the solution, assuming someone was good enough to take one. Simple and effective.

What about when something goes wrong, not badly enough to bring the server down, but enough to cause software failures over time? Maybe during the day the software simply quits working, or works intermittently.

Well, a snapshot might be an option, but realistically taking snapshots on a live system is a risk, and you don't want hundreds hanging around as that in itself causes issues, so it's images.

We would take the last image backup from the backup solution, restore it, and then deploy the data. It's a process we've done a thousand times and it works without fail; the only real issue is the time it takes.

For a small VM, maybe 1TB of local storage, an image restore might take 15 minutes, maybe a little less. Restoring the data could take anything from a few minutes to a few hours, depending on how it's backed up and to where. A large image, on the other hand, for example 10TB or more, can take considerably longer, perhaps hours.
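To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a constant sustained throughput and decimal units (1TB = 1,000,000MB), neither of which holds exactly in practice, and the function names are illustrative. The 1TB in 15 minutes figure above implies roughly 1.1GB/s; at that same rate a 10TB image would take about two and a half hours.

```python
def implied_throughput_mb_s(size_tb: float, minutes: float) -> float:
    """Sustained throughput (MB/s) implied by restoring size_tb in the given minutes."""
    return size_tb * 1_000_000 / (minutes * 60)


def restore_minutes(size_tb: float, throughput_mb_s: float) -> float:
    """Estimated restore time in minutes, assuming constant throughput."""
    return size_tb * 1_000_000 / throughput_mb_s / 60


rate = implied_throughput_mb_s(1, 15)   # from the 1TB / 15 minute figure
large = restore_minutes(10, rate)       # same rate applied to a 10TB image
print(f"{rate:.0f} MB/s, {large:.0f} minutes for 10TB")
```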

This is where a cold spare comes into its own. A cold spare is essentially an image of the virtual machine, with all the software set up and configured, minus the data. This is often generated when the VM is first set up, or in its first week of operation, and is periodically rebuilt as things change, such as after software version upgrades.
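One way to decide when a cold spare needs rebuilding is to compare a record of what was installed when the spare was imaged against what the live VM runs now. The sketch below is only an illustration: the manifest format, component names and version numbers are all made up for the example.

```python
from typing import Dict, List


def changed_components(spare: Dict[str, str], live: Dict[str, str]) -> List[str]:
    """Return component names whose version differs between spare and live VM."""
    names = sorted(set(spare) | set(live))
    return [n for n in names if spare.get(n) != live.get(n)]


# Hypothetical manifests: component name -> version string.
spare_manifest = {"os": "12.4", "nextcloud": "28.0.1"}
live_manifest = {"os": "12.4", "nextcloud": "29.0.0"}

drift = changed_components(spare_manifest, live_manifest)
if drift:
    print("Rebuild the cold spare; changed:", ", ".join(drift))
```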

Having a cold spare available means there's no image to restore; we simply bring the spare online, and that takes moments. Then we're straight into the data restore, which takes the same amount of time as it would after an image restore.
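The saving is easy to see as arithmetic. In the sketch below the 15 minute image restore is the small-VM figure from earlier, while the 45 minute data restore and the 2 minute spare boot are purely illustrative assumptions.

```python
def recovery_minutes(system_ready_min: float, data_restore_min: float) -> float:
    """Total recovery time: get a bootable system, then restore the data."""
    return system_ready_min + data_restore_min


data_restore = 45.0    # assumed; identical in both paths
image_restore = 15.0   # small-VM image restore figure from the article
spare_boot = 2.0       # assumed time to bring a cold spare online

from_backup = recovery_minutes(image_restore, data_restore)
from_spare = recovery_minutes(spare_boot, data_restore)
print(f"image restore path: {from_backup:.0f} min, cold spare path: {from_spare:.0f} min")
```

The data restore is the same in both paths, so the cold spare saves exactly the image restore time less the time it takes to boot the spare. For a multi-terabyte image that difference is measured in hours.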

So Hot Spares are?

A hot spare is a more complex form of a cold spare. Remember that we periodically re-image the cold spare as things change; a hot spare, by contrast, is re-imaged regularly on a schedule, for example every hour, providing a backup that can be brought into service within minutes should something terrible happen to the live VM. Hot spares are not suited to all scenarios, and they require complex configuration with costs involved, but for companies who don't want to go for a clustered setup and don't want to rely on cold spares, it's a middle ground.

Block Level Storage, Fabrics and SAN

Yes, I know, in larger configurations having storage on a storage area network is commonplace, and because information exists in several locations at once on redundant hardware, it's very hard to lose data to a hardware failure. Shared storage, replication, and shadow images are complex configurations which again protect the VM from data loss. However, none of that will protect you from corruption, misconfiguration, malice or stupidity.

Backup

I can't finish this article without emphasising the importance of backups: images, that is, not snapshots. Snapshots are great when the server is quiescent and you take one before making system changes or deploying updates, but they aren't a guarantee.

You need a reliable backup solution no matter what; even with cold spares, still have a backup. Your cloud provider will have a range of backup options, so select the most appropriate and TEST the backups regularly - bring up a cold spare and do a restore ;)
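A test restore is only useful if you check what came back. One simple approach, sketched below, is to record checksums of the data when it is backed up and compare them after restoring onto the spare. The function names and the manifest layout are assumptions made for this example, and it suits file-based data; a database would be verified with its own tooling.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: Dict[str, str], restored_root: Path) -> List[str]:
    """Return relative paths that are missing or differ after a restore."""
    problems = []
    for rel, expected in manifest.items():
        candidate = restored_root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel)
    return problems
```

Build the manifest at backup time, store it alongside the backup, and run verify_restore against the restored directory; an empty list means every recorded file came back intact.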



--- This content is not legal or financial advice & Solely the opinions of the author ---


Copyright © 2024 GEN Partnership. All Rights Reserved.