After yesterday's outage at UserLand, we scrambled to check how our backups are doing. Honestly, we found a few glitches, which we fixed right away.
This is what we currently do:
- The standard IdeaTools deployment architecture is made of two identical machines, one running Windows+Frontier (this is what we call the IdeaTools Server), the other running Linux+Apache. Our main servers at the web farm are a pair of IBM xSeries 330 machines: 2 PIII 1 GHz, 512 MB RAM, 2 x 40 GB disks in hardware RAID, 1U.
- Every night, a Frontier script on the IdeaTools server makes a backup of all root files and copies over from the Linux server all the content that Apache serves statically
- A script on the Linux server downloads an additional copy of the backup to the Linux server's disk
- At this point we have complete backups of both servers on both servers
- Yet another script compacts all root files and sends the compressed file to another server which is at our offices
- A last script saves a copy of one of two root files every night (this helps a lot to prevent database corruption and keep everything speedy)
- Both servers have a RAID system that should keep everything up and running even if one of the SCSI disks explodes
- We have quite an impressive network intrusion detection system called Demarc that keeps track of everything happening on our networks (since installing it we have realized just how many failed attacks happen every day)
- Most important of all, we have a team of incredibly good people that keep everything running and constantly updated!
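
For readers who want to try something similar, here is a minimal sketch of the "compress the root files, then check the copy" part of the routine above. It is written in Python rather than as a Frontier script, and it is not what we actually run: the function names (archive_roots, copy_is_intact), the archive naming scheme and the SHA-256 check are all assumptions made for illustration. Sending the archive to another server is left out.

```python
import hashlib
import tarfile
from datetime import date
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_roots(root_files, dest_dir, day=None):
    """Pack the given files into dest_dir/roots-YYYY-MM-DD.tar.gz.

    The file name pattern is hypothetical; pick whatever suits your setup.
    """
    day = day or date.today()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"roots-{day.isoformat()}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for root_file in root_files:
            root_file = Path(root_file)
            # Store only the file name, not the full path on this machine.
            tar.add(root_file, arcname=root_file.name)
    return archive


def copy_is_intact(original, copy):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(original) == sha256_of(copy)
```

A nightly job could call archive_roots on the list of root files, copy the resulting archive to the second machine, and use copy_is_intact to confirm that the copy matches before trusting it.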
Overall, we should be able to recover from almost any crash scenario... but of course I've worked in this business long enough, and have seen enough "asteroid-smashing-server-farms" movies, not to feel totally safe.