Monday, 11 January 2010

Disaster Recovery – What can happen and what would you do “if”?

An interesting article by Chris Burdick, shared here for System Administrators and for anyone interested in getting into the IT field.

With my roots in Systems Administration, Disaster Recovery or “DR” is a subject near and dear to my heart. However, I find that most people discount the importance of having a solid plan of what to do if the unexpected occurs.

When people have a valuable asset, they never hesitate to insure it “just in case”. Yet with a website, which has inherent value and revenue potential, the idea of insuring the site “just in case” never comes to mind. It takes only a single mistake to make you wish that you had a DR plan. Many people will say, “We have a DR plan! What do you think those backups are for?” Unfortunately, having backups is not a DR plan. It is simply one of the countermeasures that may come in handy when recovering from a disaster.

Let’s take a look at several scenarios when planning for DR:

Data Center

When choosing a hosting provider, always make sure that you have had the opportunity to review their data center locations. Ask about security, the age of the facility, power redundancy, the bandwidth providers and the address of the facility. It also never hurts to ask if you could take a tour of the facility. You may never actually hop on a plane and visit, but if they deny the request entirely, you may want to keep shopping. This could be a sign of issues that they are trying to hide. Usually a claim of SOX / PCI Compliance and umbrella insurance documentation is enough to get a tour. It is not uncommon to require pictures of the location that is physically housing your server.

Mother Nature

So, Hurricane Jane just passed over your data center, leaving a rather sizeable hole in the ceiling.

Never underestimate the power of a storm. While this is more of a concern for your service provider, the end result is the same. If the hole is right above your server rack and water has penetrated your rack and hardware, your server will most likely be out of commission. If no one has a contingency plan for how the site will get up and running again, you may be out of business until services can be transferred to another data center or another server nearby. Hint: If your data center is located at the foot of an active volcano, in the most active segment of “tornado alley” or below sea level, you may want to put a little extra effort into your DR plan (and consider moving).

Theft

John Doe, systems administrator extraordinaire, is doing routine maintenance in your rack. While he is taking his lunch break, he leaves the door to your rack open.

What would you do if someone stole a hard drive containing your entire customer database? While this does not happen every day…nothing in a Disaster Recovery plan does. Competition can be fierce…are YOU located in the same data center as your competitor? Physical theft of components not only means you need to replace the component (hard drive, SAN, backup device), but you also need to replace the data and find out who may have stolen it in the first place.

Another reason to consider theft when planning is the location of most data centers. Some of the best Tier 1 provider data centers are located in the “bad parts of town”, so to speak. Hosting requires 3 things: Space, Bandwidth, and Power. The cheaper the space, the higher the provider’s profit margin. Facilities that strike a good balance between the local crime rate and their own level of security are always worthwhile choices to house your data.

Also remember, if you are going to manage your own rack…do you want to get a call at 3 am and have to head to a high-crime area? You are now responsible not only for your equipment’s safety, but for your own personal safety as well.

Power Outages

After a thunderstorm, the main transformer leading into your data center is ‘fried’.

How redundant are the power sources in your data center? The best choice is a data center that provides redundant generators, so that during a power outage your site remains online while repairs are made.

Bandwidth Outages

During some routine upgrades, the main router for bandwidth control into your data center loses its configuration.

This is one that many people don’t think of. Does your data center provide you with redundant bandwidth sources? If their router is down, your site could end up down for hours while they restore the configuration.

Systems Administrators

John Doe, the systems administrator, runs updates on your server and performs a restart without authorization.

Did John discuss the update with your application team? Could one of these updates break the site? Perhaps your site requires manual configuration after a restart, for example to set up a shared folder or another service that you do not want started automatically. Regardless of the situation, John most likely took down your site. A proper DR plan accounts for situations like this, which may not be apparent at first glance.

Application Support Team

Jane, one of your application developers, pushes an update to your site before heading out for the day. Having had a rough day, Jane turns off her phone and goes to bed early.

If Jane did not ensure that the update worked properly, the site could be down without anyone knowing it yet, least of all Jane, who turned off her phone. This problem may persist until she makes it to work the next day.

This can usually be taken care of with an escalation system of some sort. In its simplest form, it is a list of people to call in a hierarchical order. A more complex version involves a system in which a user logs a ticket and the system escalates the request automatically. Having proper procedures in place for application deployment can certainly help to alleviate this situation, but accidents do happen.
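
As a rough illustration of the simplest form described above, the sketch below walks an ordered call list until someone acknowledges. The `Contact` type, the `escalate` function, the names and the phone numbers are all hypothetical, and `try_contact` stands in for whatever phone or paging system you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Contact:
    """One entry on the call list (hypothetical structure)."""
    name: str
    phone: str


def escalate(
    contacts: Sequence[Contact],
    try_contact: Callable[[Contact], bool],
) -> Optional[Contact]:
    """Try each contact in order; return the first one who acknowledges.

    Returns None when nobody on the list could be reached.
    """
    for contact in contacts:
        if try_contact(contact):
            return contact
    return None


# Hypothetical call list, first responder at the top.
ON_CALL = [
    Contact("Jane (developer)", "555-0101"),
    Contact("John (systems administrator)", "555-0102"),
    Contact("Operations manager", "555-0103"),
]

# Simulate the scenario above: Jane's phone is off, everyone else answers.
answered = escalate(ON_CALL, lambda c: not c.name.startswith("Jane"))
print(answered.name if answered else "nobody reachable")
```

In a ticket-based system, the same loop would typically be driven by a timer that moves to the next contact when an acknowledgement does not arrive in time.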

Novice User

Jake, a new user of your content management system, is asked to perform a cleanup task on some of the old articles.

This could end poorly for Jake and his boss. If Jake accidentally deletes articles or pages he was not supposed to, how will this content be retrieved?

Having appropriate backups of the database will certainly help in a situation like this. With a proper DR plan, Jake can simply call the appropriate person and the backups may even be restored before his boss finds out! This is good news for Jake and for the site as well.
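
Backups only help Jake if they are recent enough to be worth restoring. A minimal sketch of a freshness check follows; the function name `backup_is_stale` and the 24-hour default window are assumptions made for illustration, not figures from the article.

```python
from datetime import datetime, timedelta


def backup_is_stale(
    last_backup: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed window, tune to your schedule
) -> bool:
    """Return True when the newest backup is older than the allowed window."""
    return now - last_backup > max_age


# Example: the last database backup ran two days before the accidental delete.
last_run = datetime(2010, 1, 9, 2, 0)
incident = datetime(2010, 1, 11, 9, 0)
if backup_is_stale(last_run, incident):
    print("Warning: newest backup is older than the allowed window")
```

A check like this could run on a schedule and feed the escalation list, so that a missed backup is noticed before it is needed.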

Equipment Age

In an attempt to save money, John’s Widgets puts its website up on a server with 5-year-old hard drives. After a successful launch of the brand-new website, the main hard drive fails, bringing online sales to a halt.

For anyone who doesn’t know, a hard drive’s expected life is described by what is called “Mean Time To Failure” or MTTF. It isn’t a matter of “IF”, it is a matter of “WHEN” the drive will fail. Failures normally cluster either within the first 3-4 months or at around 6-7 years. The 3-4 month range is due to design flaws and the 6-7 year range is the life expectancy of the drive. Hard drives are devices which wear out and should be part of a cyclic hardware refresh program. Every year, a company that purchases and manages its own hardware should set aside a portion of the budget to phase out old hard drives, upgrade RAM and even upgrade servers over time. By replacing this equipment in a timely manner, a company can prevent server failures due to failing equipment.
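
One way to support a refresh program like this is to keep purchase dates on record and flag drives once they pass a chosen age. The sketch below assumes a simple mapping of serial number to purchase date; the serial numbers, the function name and the 5-year default threshold are hypothetical and should be tuned to your own hardware and budget cycle.

```python
from datetime import date
from typing import List, Mapping


def drives_due_for_refresh(
    purchase_dates: Mapping[str, date],
    today: date,
    max_age_years: float = 5.0,  # assumed threshold, not a vendor figure
) -> List[str]:
    """Return the serial numbers of drives at or past the refresh age, sorted."""
    due = []
    for serial, purchased in purchase_dates.items():
        age_years = (today - purchased).days / 365.25
        if age_years >= max_age_years:
            due.append(serial)
    return sorted(due)


# Hypothetical inventory: serial number -> purchase date.
inventory = {
    "DRV-A": date(2004, 6, 1),
    "DRV-B": date(2008, 3, 15),
    "DRV-C": date(2004, 12, 1),
}

print(drives_due_for_refresh(inventory, today=date(2010, 1, 11)))
```

Running a report like this once a year gives the budget discussion a concrete list of drives to replace rather than a guess.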

Facilities

After a business in an adjacent office leaves a candle burning overnight, a fire spreads throughout the office complex, burning all documents and equipment and leaving John’s Widgets in shambles.

While a DR plan should cover your servers, it should also cover disasters which may occur at your office. If your office building burned, flooded or was closed due to an environmental contaminant…what happens? Does everyone have the rest of the month off while the office is repaired? A good DR plan has clauses covering “continuing operations”, so that all employees know what happens if a disaster strikes the office building.

If something were to happen to an office building, the DR plan should lead a CEO / CIO to everything needed to restore operations as soon as possible. Situations such as unexpected equipment damage are why a “complete DR plan” will also include copies of all insurance policies, documentation of assets (all serial numbers with dates of purchase) and a plan to obtain replacement assets as soon as possible. It may also include a rendezvous point for any emergency meeting which may arise from a facility disaster.

Is that it?

Depending on your business domain, you may find that you have more issues to worry about than these. The best way to put together a DR plan is to hold a brainstorming session with a diverse group from your company. At least one person from each department should chime in on what issues they can foresee. Once you’ve compiled issues like the examples above, appropriate countermeasures have to be planned. This is where the actual plan may take the form of a dialog or flow-chart that handles each issue in the appropriate manner.

Once a DR plan is made, it should be distributed to all employees or hosted in a location where everyone has access. An offsite copy should also be kept in case the digital version is destroyed.

Conclusions

Now you may be thinking, “Isn’t this a little excessive? This sounds like a large investment of time and money!” It is, in fact, true: a DR plan is a huge investment in protecting your company from the unforeseen. However, if your website goes down and all you have is ‘backups’, how long will it take you to get that site back up and running? Do you know what schedule the backups run on? Do you have hardware to replace the dead server or stolen equipment? Is your sole sysadmin on vacation in Orlando for the next two weeks? If your site goes down for even just a few hours, people lose faith in your brand, particularly if you are in the IT industry in any shape or form. If you can’t keep your own equipment running, many people may doubt your ability to keep their equipment running properly as well. So the answer to the question of excessiveness is another question: how much is your brand worth to you?