On disaster recovery plans

After all the recent datacenter fires, such as OVH SBG and WebNX, I've been thinking a lot about my disaster recovery plans these days...

Background

Before I get into my thoughts on disaster recovery, I first want to look at the most recent fire at WebNX. It's come out that a generator caught fire, which resulted in a multi-day outage at the facility. It's not uncommon for utilities to break; however, the most concerning thing was how they handled the fire.

Most datacenters in the event of a fire will not use water for suppression – especially around servers. It makes significantly more sense to use non-water-based suppression (eg, a clean agent or inert gas system that displaces the oxygen on the server floor, since fire needs oxygen). I don't understand why anyone thought water suppression was a good idea on a datacenter floor unless it's an absolute emergency (eg, if the entire datacenter floor is on fire, it's probably a wash anyways).

On to DR

My disaster recovery plans today suck. I decided to review them in depth, and because a lot has changed over the past months, they absolutely suck.

Here's my current disaster recovery plan:

  1. Data is hosted on a colocated server remotely in another province (3+ hour drive away)
  2. Data is backed up once to a different VM on the same server (30+ VMs -> 1 VM)
  3. That VM is backed up to BackBlaze B2 encrypted using Restic
  4. In the event of a disaster, pray the current machine is fine and replace the components necessary (eg, hard drive failures) and restore from the backups
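
Step 3 can be sketched roughly as below. This is a minimal sketch, not my actual script: the bucket name, repository path, source directory and retention numbers are all made-up placeholders. It only relies on the standard Restic B2 repository syntax (b2:bucket:path) and the environment variables Restic reads for B2 credentials and the repository password.

```python
import os
import subprocess


def restic_backup_cmd(bucket, repo_path, source_dir):
    """Build the restic command that backs up source_dir to a B2 repository."""
    repo = f"b2:{bucket}:{repo_path}"
    return ["restic", "-r", repo, "backup", source_dir]


def restic_forget_cmd(bucket, repo_path, keep_daily=7, keep_weekly=4):
    """Build the restic command that prunes old snapshots.

    The retention numbers are assumptions for illustration only.
    """
    repo = f"b2:{bucket}:{repo_path}"
    return [
        "restic", "-r", repo, "forget",
        "--keep-daily", str(keep_daily),
        "--keep-weekly", str(keep_weekly),
        "--prune",
    ]


def run_restic(cmd):
    """Run a restic command after checking the credentials are present."""
    missing = [n for n in ("B2_ACCOUNT_ID", "B2_ACCOUNT_KEY") if n not in os.environ]
    if not ("RESTIC_PASSWORD" in os.environ or "RESTIC_PASSWORD_FILE" in os.environ):
        missing.append("RESTIC_PASSWORD or RESTIC_PASSWORD_FILE")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    subprocess.run(cmd, check=True)  # raises CalledProcessError if restic fails


if __name__ == "__main__":
    # Hypothetical names; print the commands instead of running them.
    for cmd in (
        restic_backup_cmd("my-dr-bucket", "backup-vm", "/srv/backups"),
        restic_forget_cmd("my-dr-bucket", "backup-vm"),
    ):
        print(" ".join(cmd))
```

Building the commands separately from running them makes it easy to dry-run the script (as the main block does) before pointing it at a real bucket.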

That's my entire disaster recovery plan. It definitely needs work – but how? A very reasonable question, and after sinking many hours into it... I think I've figured out the best way for me. I emphasize "best way for me" because your mileage may vary.

The New Plan

After reviewing my entire infrastructure and what data I care about (ie, do not want to lose), I've devised a new strategy:

  1. Data is hosted on a colocated server remotely in another province (3+ hour drive away)
  2. Data is backed up once to a different VM on the same server (30+ VMs -> 1 VM)
  3. That VM is backed up to BackBlaze B2 encrypted using Restic
  4. Another colocated server runs an identical setup in HA, with Ceph as the file system backend to keep everything in sync. This other server will live in a different facility run by my colocation provider.
  5. Data is backed up once again to a different VM on the second server (30+ VMs -> 1 VM)
  6. Data is backed up to Wasabi encrypted with Restic
  7. In the event of one of these machines going down, the other one in the HA pair will take the load (IP addresses are announced over BGP, so the surviving machine's BGP announcement will kick in and serve the traffic)
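
To sanity check the difference between the two plans, here's a small sketch that models where each copy of the data lives and what happens when a machine, facility or provider is lost. The host and facility names (server-1, facility-A and so on) are hypothetical labels I made up for the model, and it assumes each cloud provider counts as its own "site".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    host: str                     # machine or provider holding the copy
    site: str                     # facility the host lives in
    serves_traffic: bool = False  # True for a live server, False for a backup


OLD_PLAN = [
    Copy("live data", "server-1", "facility-A", serves_traffic=True),
    Copy("backup VM", "server-1", "facility-A"),
    Copy("B2 snapshot", "backblaze", "backblaze"),
]

NEW_PLAN = OLD_PLAN + [
    Copy("live data (Ceph replica)", "server-2", "facility-B", serves_traffic=True),
    Copy("backup VM 2", "server-2", "facility-B"),
    Copy("Wasabi snapshot", "wasabi", "wasabi"),
]


def survivors(plan, failed):
    """Return the copies whose host and site are both outside the failed set."""
    return [c for c in plan if c.host not in failed and c.site not in failed]


def assess(plan, failed):
    """Classify the outcome of losing everything named in `failed`."""
    alive = survivors(plan, failed)
    if any(c.serves_traffic for c in alive):
        return "online"
    if alive:
        return "restore from backup"
    return "data lost"


if __name__ == "__main__":
    for failed in ({"server-1"}, {"facility-A"}, {"facility-A", "backblaze"}):
        print(sorted(failed), "| old:", assess(OLD_PLAN, failed),
              "| new:", assess(NEW_PLAN, failed))
```

The point of the model is simply that under the old plan any loss of the one server means a restore, while under the new plan it takes losing both facilities before I'm restoring from a cloud snapshot.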

Doing this means I'll need another server. Instead of doing what I normally do, which involves combing eBay for hours to find a suitable machine, I decided that it's time for a small upgrade.

I'm a huge HPE fan. I've had enough Dell issues to ward me off for life. HPE makes good machines in my opinion, and they last quite some time.

My current colocated server consists of 64GB of DDR3 ECC memory, 4x3TB storage in RAID1 and dual E5-2650v2 processors (8 cores/16 threads per processor, giving 16 cores and 32 threads total), all packed in an HP Generation 8 LFF rack server. It's a beautiful machine, and it hums along without issue.

After searching for hours on eBay for an identical (or very similar) machine for consistency, I found a few suitable ones that were simply overpriced. I paid under $300 for my Generation 8 over a year ago (an upgrade from my Generation 7), so I'm on the hunt for a newer generation.

I'm looking at buying this new server from HPE directly, and for about $1200 USD (excluding storage, which I already have) I can get a nicely sized machine. I'm looking at doing the memory upgrades myself (DDR4 prices are through the roof – additionally, I don't need the ECC memory) and opting for large form factor (LFF - 4x3TB HDD) and a 10 core processor (Xeon Silver 4210R). Nothing is set in stone though, as I'd love to build an AMD-based server to run it on (eg, Ryzen 7 or even AMD EPYC).

Now that I've got the plan in place, it's time to finalize which server to deploy... I'll make another blog post on the selected box and specifications when it's finalized (stay tuned). I'm hoping to have this box assembled and deployed within 2 months.