What are people talking about when they say “disaster” recovery? There is so much aggressive marketing on this topic that all you can think about are the most overly dramatic examples: hurricanes, tornadoes, earthquakes, fires, or floods. The oversimplification of the issue to the point of fear mongering is a disservice.
While natural disasters are considerable threats that bring certain tragedy with them, this kind of marketing has largely failed to grasp the definition of “disaster” as it is used in information technology.
There are plenty of more common, more mundane disaster scenarios that happen all the time. IT disasters are not rare and should not be regarded as unlikely, far-over-the-horizon threats that may never happen.
In information technology, a disaster is any event that interrupts the normal function of your business. It does not necessarily mean the entire business has stopped functioning, but it does mean that an asset essential to production has dropped out of production.
If you’re a very small business with no server and only a few desktops and laptops, those computers hold critical data, and one of their hard drives fails, it’s a disaster.
If all your company’s data is on a server and that server fails so that no one can access their data, it’s a disaster. You can still take phone calls and write down notes and maybe even conduct some business on pen and paper, but it’s a disaster.
Pressures begin to mount as time passes while you cannot get to any of your data. You can’t view the billing or work history of a customer or send them an invoice, for example. Maybe you can’t get any email or collect payments.
The scope of a disaster could be limited to one application, one database, one folder, or it could have a larger scope such as a network or power equipment failure that takes a number of machines offline. It could be a corrupt config file, failed software update, or user error that takes down an important service or application.
The cause of a disaster could be random and outside of your control, such as a prolonged loss of power to your building, and the remediation could be out of your control as well.
It could be a failed hard drive or storage system, or it could be ransomware that spreads from an employee’s computer, corrupting whatever files it can access.
Every company is set up differently and exposes itself to different types of risks. In terms of affected equipment, disasters can be total, or they can be limited to a single component of a system, but their impact on productivity is unmistakable.
Disasters involving storage equipment are particularly costly if there is no backup. If you do have a backup, you also need to make sure data can be restored quickly to alternative storage to minimize downtime.
A good disaster recovery plan will account for all of these scenarios and more. It will rehearse the most likely of them so that you know what the recovery procedures involve and how long those procedures are supposed to take, and it will document the procedures thoroughly so there are no surprises during the real thing.
Putting A Plan Together
It is a common and terrible mistake to lay all of this at the feet of your IT department. While most aspects of disaster recovery procedures could be executed by IT personnel, disaster recovery planning is a corporate issue and not a technology issue. Proper formulation of recovery procedures involves every department and almost every rank in the company.
A dead hard drive may at first appear to be an IT issue, but if that failure prevents the sales department from fulfilling orders, it becomes a corporate training issue: the sales department needs to know how to continue working while the system is being recovered.
This is one reason DR planning is perceived as such an undesirable chore: it requires us to disengage from our routines. It forces us to take a long, hard look at our own clutter, our own processes and workflows, our good and bad habits, and the software and data we use individually, in our department, and in the company as a whole. All of that must then be documented well and reviewed up the entire chain of command in order to see which processes can survive a disaster and which require modification.
Once you get into it, this process always yields valuable insight into the state of things: how your business works, how the pieces fit together, and most importantly, how to rank each part of the business by order of importance to production.
Quick recovery of data and services means understanding what those services are, how they are configured, how you can standardize, document, or automate that configuration, where the data needs to be located, how to prioritize the restore of these services, and how to actually back up each service and quickly restore it to new equipment.
Start with some basic questions:
- Which applications or services are absolutely critical and must be brought online first?
- Which services would be the first to be noticed by your customers if they were not available?
- How much downtime can they tolerate? How long can non-critical services be down?
- How can people fulfill their job functions while the technology portion of the company is being recovered?
Don’t just think about it. Write it down, share it with your colleagues and staff, and put it into practice. Then begin to rehearse these procedures regularly and document each rehearsal.
Every business has different systems, different use cases, and different tolerances for different disasters, but these are the key questions that must be answered. Those answers must be documented and reviewed at least on an annual basis.
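One way to write those answers down is as a small, reviewable inventory. The sketch below is only an illustration: the `Service` fields, the example entries, and the one-year review window are all assumptions made for the example, not a prescribed format. It orders services for recovery by how little downtime they can tolerate and flags entries that have not been reviewed in the past year.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Service:
    name: str
    customer_facing: bool      # would customers be the first to notice an outage?
    max_downtime_hours: float  # how long the business can tolerate it being down
    manual_fallback: str       # how staff keep working while it is recovered
    last_reviewed: date        # when this entry was last reviewed


def recovery_order(services):
    """Shortest tolerable downtime first; customer-facing services win ties."""
    return sorted(
        services,
        key=lambda s: (s.max_downtime_hours, not s.customer_facing),
    )


def overdue_for_review(services, today, max_age_days=365):
    """Names of entries whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [s.name for s in services if s.last_reviewed < cutoff]


# Hypothetical entries; replace with your own systems and tolerances.
inventory = [
    Service("file server", False, 8.0, "work from local copies", date(2023, 1, 10)),
    Service("billing database", False, 4.0, "paper invoices", date(2024, 3, 1)),
    Service("order intake", True, 4.0, "take orders by phone", date(2024, 3, 1)),
    Service("email", True, 2.0, "call customers directly", date(2024, 3, 1)),
]

for service in recovery_order(inventory):
    print(service.name, "-", service.manual_fallback)

print("Needs review:", overdue_for_review(inventory, today=date(2024, 6, 1)))
```

Keeping the manual fallback next to each service puts the IT answer and the departmental answer in the same document, so both are reviewed together.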
Test your restores to a blank slate on completely new hardware. Have your employees test their workflows against the restored systems, and make sure you are able to restore both a complete system on a macro level as well as individual applications and data on a granular level. You simply have no idea how your recovery will go if you do not test.
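Part of a restore test can be automated. The following is a minimal sketch, assuming the restored data is a directory tree that can be compared against a reference copy that has not changed since the backup was taken. The function names and the report layout are illustrative. It uses only the Python standard library to hash every file and report what is missing, unexpected, or changed.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(reference_root, restored_root):
    """Report files that are missing, unexpected, or changed after a restore."""
    ref = tree_digests(reference_root)
    got = tree_digests(restored_root)
    return {
        "missing": sorted(set(ref) - set(got)),
        "unexpected": sorted(set(got) - set(ref)),
        "changed": sorted(k for k in ref.keys() & got.keys() if ref[k] != got[k]),
    }
```

A clean report only shows that the files came back intact. It does not replace having employees run their actual workflows against the restored system.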
A complete restore to new hardware is necessary after a complete disaster, but in other cases it is usually preferable to restore only the particular item, or small number of items, that you need rather than the entire system. You need the capability to operate at the granularity that matches the scope of the disaster you’re facing, and that means taking backup data at both coarse and fine grains.
Bottom line: plan your recovery, plan your restore, and let that define how you run your backup.