China Project Management Resource Network (中国项目管理资源网)

Revelations for IT Projects: Lessons from the Titanic (Part 12)

2005/12/14 17:50:56 | Source: original

By Mark Kozak-Holland. Translated by Yang Lei (杨磊)

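To recap Titanic's situation: after the ship got under way again (see Part 10), the flooding became catastrophic. At around 12:45 a.m., 65 minutes after the hull grounded on the ice shelf, the captain ordered the officers to uncover the lifeboats and muster all passengers and crew on deck. Confused by unclear communication (see Part 11), the crew acted hesitantly, unwilling to believe that anything had gone wrong. After all, there were still few visible signs of the disaster.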

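Today, disaster recovery means switching online operations to an alternate service environment. It takes many forms, however, from the simple recovery of a single application's data and files within days to the relatively complex recovery of an entire business operation within minutes or hours. A disaster can take three forms: total (absolute and immediate), rapid and imminent, or slow and innocuous. Once a disaster is recognized, contingency plans are invoked and the disaster is declared.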

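On Titanic, the disaster was of the slow and innocuous kind. Although a full recovery was no longer feasible, the captain and officers could still carry out a partial recovery. Without a formal evacuation or disaster recovery plan, however, the most they could do was give orders to keep panic and chaos from spreading once the signs of disaster became obvious. At design time (see Part 3), the envisioned disaster recovery scenario was to use the lifeboats to transfer passengers to another ship, which would bring them back to port. The lifeboats would ferry passengers back and forth, so far fewer of them were required. This scenario rested on the premise that Titanic could not sink and would at least stay afloat on its own while waiting for help.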

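Today, when we develop a disaster recovery plan, we must consider every type of failure in the whole IT solution that could lead to a disaster. For example:
●Physical faults or tangible defects in the technology
●Design errors, including failures in system or application software design and bugs in code
●Operational failures caused by operations staff through accidents, inexperience, insufficient training, failure to follow procedures or even deliberate malice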

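Environmental failures (such as those in power systems, cooling systems and networks) can be just as destructive to the operation center as natural disasters and terrorist acts.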

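Over the preceding 400 years, most of the environmental factors involved in crossing the Atlantic had been observed, charted and documented. These covered everything from year-round natural conditions (such as changing ocean currents) and weather patterns (such as storms and hurricanes) to natural hazards (such as fogbanks, ice fields, iceberg areas, dangerous shorelines and rocky outcrops). Yet a belief had taken hold during Titanic's project (see Part 4) that this enormous, unsinkable iron ship could handle anything nature produced.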

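When designing a disaster recovery plan, the scale of the disaster also needs to be considered. For example, if a relatively minor storm, fire or flood strikes, your customers will expect some form of contingency service relatively quickly. Today you need contingency measures for all of these, and for the most catastrophic disasters as well.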

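The costs associated with disaster recovery vary with the recovery window (time), the elements of the disaster and the degree of recovery required. As part of the plan, these costs should be carefully determined for each specific IT solution.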

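For Titanic, maritime convention called for a disaster recovery plan covering all of the above situations: one that brought everyone to the lifeboat deck, loaded them into lifeboats with seats to spare, lowered the boats safely and sent them off in the hands of experienced crews. The lifeboat drill in Queenstown should have tested this latter part of the plan (see Part 5).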

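Many serious problems in a production environment start out so innocuously that, at first, your organization may not even notice them or their implications. For example, a less critical part of the IT solution goes down unnoticed. Because of the interdependencies between components and applications, however, a knock-on effect sets in and quickly affects other parts of the solution, leading to a catastrophic failure in a very short time.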

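On Titanic, the lifeboats were launched markedly late, which indicates a hesitation to launch them until the last possible moment. The officers' slow reaction probably arose because the ship was believed to be unsinkable, the gravity of the situation was not apparent and everything still appeared normal. In addition, only 83 of the 900 crew were actual mariners (see Part 5), and only they were familiar with the complex drill of lowering a 30-foot lifeboat (65 people) 60 feet to the water. There were 16 of these lifeboats in total, plus four smaller collapsible lifeboats (45 people) known as Engelhardts.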

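Conclusions

Today, many IT projects ignore disaster recovery completely, on the grounds that it is outside the project's scope and is covered by a separate yearly planning process. Yet it is the IT project that establishes the business justification and the design of the IT solution, and that develops an in-depth understanding of the recovery required. Serious thought about the consequences of a disaster affecting the IT solution needs to happen early in the project, so that the overall disaster recovery plan can be adjusted. The next installment will continue to look at disaster recovery.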

Original text:

To recap Titanic's situation: following the restart of the ship (Part 10), the flooding became catastrophic. Around 12:45 a.m., 65 minutes after the initial grounding on the ice shelf, the captain gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. The crew, confused by unclear communication (Part 11), operated in a state of disbelief, refusing to believe that anything was wrong. After all, there were still few signs of the disaster.

In today’s world, disaster recovery is the concept of switching the online operation to an alternate service-delivery environment. However, it takes many shapes and forms, from the relatively simple recovery of data and files from a single application in a timeframe measured in days, to the relatively complex recovery of a complete business operation in a timeframe measured in minutes or hours. A disaster can take three forms, namely: total (absolute and immediate), rapid and imminent, slow and innocuous. When a disaster is recognized, contingency plans are invoked and a disaster is declared.

On board Titanic, the disaster was slow and innocuous. Although a full recovery was not feasible anymore, the captain and officers could enact a partial recovery. But without a formalized evacuation or disaster recovery plan, the best they could do was to bring some order to prevent widespread panic and chaos once the disaster signs became more obvious. The envisioned scenario for disaster recovery, at the time of the design (Part 3), was to transfer passengers through lifeboats to another ship and then deliver them to port. The lifeboats would ferry passengers back and forth to the rescue ship, requiring a much smaller total lifeboat capacity. This scenario was based on the perception that Titanic could not possibly sink, but would float in an incapacitated state waiting for help.

In today's world, in defining a disaster recovery plan, thought needs to be given to all the types of failures that could possibly happen to an IT solution and lead to a disaster. For example:
· Physical faults or failures in the technology
· Design errors, which include system or application software design failures and bugs
· Operations errors caused by operations services staff because of accidents, inexperience, lack of due diligence or training, not following procedures or even malice

Environmental failures, such as those in power supplies, cooling systems and network connections, can be equally devastating, as can natural disasters and terrorist activities against the operation center itself.

Over the preceding 400 years, most environmental factors related to crossing the Atlantic had been observed, charted and documented. This included everything from year-round natural conditions like changing ocean currents, and weather patterns like storms and hurricanes, to natural hazards like fogbanks, ice fields and iceberg areas, dangerous shorelines and rocky outcrops. However, a belief had evolved during Titanic's project (Part 4) that anything that nature could hand out could be handled by this enormous iron ship that was practically unsinkable.

In defining a disaster recovery plan, the scale of disaster is important to consider as well. For example, if a relatively minor storm, fire or flood knocks out your online operation, your customers are going to expect some contingency of service relatively quickly. In today’s world, you need contingency for all of these, even the most catastrophic disasters.

The associated costs of disaster recovery vary, based on the window of recovery (time), the elements of the disaster and the degree of recovery required. As part of a plan, these costs need to be carefully determined specifically for the IT solution created.

For Titanic, under maritime convention there should have been a disaster recovery plan defined for all the above situations that brought everyone on board to the lifeboat deck, loaded them into the lifeboats with places to spare, lowered the lifeboats safely, and put them adrift with experienced crews to handle them. The lifeboat drill in Queenstown should have tested the latter part of the plan (Part 5).

Many serious problems with a production environment can start so innocuously that, in the first hour, your organization might not even be aware of them or their implications. For example, a less-critical part of the IT solution might be "down," so it goes unnoticed. However, because of interdependencies between components and applications, there tends to be a "knock-on" effect, and very quickly other parts of the IT solution can become affected. This leads to a catastrophic failure in a very short time.
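
The knock-on effect described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the component names, the `DEPENDS_ON` map and the `affected_by` helper are hypothetical and do not come from the article. The sketch walks a dependency map breadth-first to list every component that could be affected when one low-priority component goes down.

```python
from collections import deque

# Hypothetical dependency map (illustrative only): each component
# lists the components it depends on.
DEPENDS_ON = {
    "report-archive": [],
    "message-queue": ["report-archive"],
    "order-service": ["message-queue"],
    "web-frontend": ["order-service"],
    "billing": ["order-service"],
    "static-site": [],
}


def affected_by(failed, depends_on):
    """Return every component that depends, directly or transitively, on `failed`."""
    # Invert the map so we can look up "who depends on X".
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("report-archive", DEPENDS_ON)))
```

In this sample map, a failure in the seemingly minor `report-archive` reaches `message-queue`, `order-service`, `web-frontend` and `billing`, while `static-site` is untouched. A real IT solution would need an accurate, maintained dependency inventory for such a check to mean anything.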

On board Titanic there was a major delay in getting the lifeboats down, indicating a hesitation to launch the boats until as late as possible. It is likely the officers reacted slowly for several reasons: the ship was believed to be unsinkable, the gravity of the situation was not apparent and everything appeared so normal at the time. Also, only 83 of the crew of 900 were actual mariners (Part 5) and therefore familiar with the somewhat complex drill of lowering a 30-foot (65-person) lifeboat 60 feet to the water. There were 16 of these lifeboats in total, plus four smaller collapsible lifeboats (45-person) or "Engelhardts."
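
As a rough check on the figures quoted above, the short sketch below totals the lifeboat seats and works out how thinly the trained mariners were spread. It uses only the numbers in the paragraph (16 boats of 65, four collapsibles of 45, 83 mariners in a crew of 900). The constant names and the `lifeboat_summary` function are hypothetical, and the calculation assumes every boat is filled to its rated capacity.

```python
# Figures taken from the paragraph above; names are illustrative only.
STANDARD_BOATS, STANDARD_SEATS = 16, 65
COLLAPSIBLE_BOATS, COLLAPSIBLE_SEATS = 4, 45
CREW_TOTAL, MARINERS = 900, 83


def lifeboat_summary():
    """Summarize rated lifeboat capacity and mariner coverage."""
    total_boats = STANDARD_BOATS + COLLAPSIBLE_BOATS
    total_seats = (STANDARD_BOATS * STANDARD_SEATS
                   + COLLAPSIBLE_BOATS * COLLAPSIBLE_SEATS)
    return {
        "total_boats": total_boats,
        "total_seats": total_seats,
        # Average number of trained mariners available per boat.
        "mariners_per_boat": MARINERS / total_boats,
        # Fraction of the crew who were actual mariners.
        "mariner_share_of_crew": MARINERS / CREW_TOTAL,
    }


if __name__ == "__main__":
    for key, value in lifeboat_summary().items():
        print(f"{key}: {value}")
```

On these figures the 20 boats offer 1,220 rated seats, with on average just over four trained mariners available per boat and under a tenth of the crew qualified to run the lowering drill.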

Conclusions

Today, many IT projects completely ignore disaster recovery as something beyond their scope and covered by a yearly IT planning process. Yet it is the IT project that determines the business justification and design around the IT solution, and develops an in-depth understanding of the kind of recovery that is required. Serious thought needs to be given to the consequences of a disaster impacting the IT solution, and this needs to be done early enough in the project so that adjustments to the overall disaster recovery plan can be made. The next installment will continue to look at disaster recovery.
