中国项目管理资源网

IT Project Revelations: Lessons from the Titanic (Part 9)

2005/12/14 17:37:00 | Source: original

By Mark Kozak-Holland | Translated by 杨磊

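To recap briefly: Titanic’s officers tried desperately to avoid a collision (see Part 8). However, the S-turn, although the right decision, did not slow the ship enough. Hundreds of passengers later said that the ship came to a halt almost innocuously, with a quiver and a few seconds of rumbling and grinding, as if the hull were rolling over a mass of stone marbles.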

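There was no "crash stop", no fatality, not even a minor injury. Nor was there a violent sideways jolt or a series of impacts along the ship’s length, which is what typically happens when a ship turning hard away sideswipes an ice spur. The breakfast cutlery laid out in the dining saloons barely trembled, and not a drop was spilled from the drinks in the first-class smoking rooms and lounges. All the evidence indicates that the ship came to rest on an underwater ice shelf at the base of the iceberg. First Officer Murdoch had averted a head-on collision that could have demolished the first four compartments and killed or maimed hundreds of passengers.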

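Likewise, when an IT solution becomes unstable in production, a series of steps is taken according to a process that was prepared, planned and tested within the project itself (see Part 4). This process should be built around a Mean Time To Recovery (MTTR) clock, with the principal aim of bringing the IT solution back on-line as quickly as possible to meet Service Level Agreements (SLAs). The solution is then patched up in the background with a temporary or permanent fix.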

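However, before the solution goes back on-line, its integrity must first be established so that the problem does not recur. With an eye on the clock, the operations group steps through the process and the four "problem" quadrants: detection, determination, resolution and recovery. Once the MTTR clock starts ticking, marking the beginning of a loss of service (an outage, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users lose service and for how long.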

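This is far more accurate than the commonly used percentage of service availability, such as 99.999 percent. On Titanic, problem detection amounted to the 37 seconds of warning given by the lookouts. Such short notice is not typical of an IT solution, which usually puts out errors and warnings well before any significant failure occurs. This gives operators, automated or human, time to prevent the problem from occurring in the first place (see Part 8).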

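Next, Titanic’s captain, director and officers gathered on the bridge to decide on a course of action. As part of problem determination, two search parties were sent to the forward and midship sections to assess the extent of the damage. The first party returned within 10 minutes with a positive report: no major damage and no flooding. In the mind of director Bruce Ismay, problem detection and determination were now complete. Resolving the problem by sending a distress call, however, was a real difficulty for him: it would expose Titanic to a storm of rumor, damage White Star’s market position, and destroy the brilliant marketing that had drawn the world’s wealthy elite aboard the safest liner ever built.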

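In fact, a better resolution at this point would have been to take the ship to Halifax, Canada, away from New York, the center of the world’s press. There he could shape a better news story and marginalize the accident as a minor incident. He could disembark the passengers onto trains, patch up the hull and sail the ship back to Belfast for major repairs. He could even proclaim that Titanic, equipped with the latest emergency systems and a lifeboat in herself, had saved herself from the brink of a great disaster, and so promote the safety of the White Star Line still further.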

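In today’s IT solutions, problem determination assesses the impact on users. The determination itself must be supported by evidence. Reinvestigating feedback mechanisms and logs is vital for establishing whether the problem has been escalating and what is causing it.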

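In a large, complex IT solution it is common to see a domino effect, in which a small faulty element such as a subsystem affects its neighbors and triggers a cascade of further problems. Failing to work out the precise sequence of these failure events leads to misdiagnosis, a wrong fix, and a recurrence of the problem. Determination is formally complete only when the assumptions about the root cause have been tested and proven.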

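With an IT solution, it is important to be sure of the evidence at hand and to ask the following questions. Did the solution know it was going to fail? If so, did any (automated) preventative actions take effect? Did they alert human or automated operators? Were any feedback mechanisms themselves faulty, or did they return unreliable data? Was the problem diagnosed correctly?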


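Titanic’s situation was critical but not yet catastrophic. Ismay was preoccupied with saving face, and his anxiety over White Star’s reputation created an atmosphere in which mistakes were easily made. Titanic sat on the underwater ice shelf, apparently completely stable; had the ship been handled with a safety-first attitude, the damage might have proved minor. Instead, Ismay rushed into a decision before the second search party, which included the architect and the carpenter, had returned with its assessment.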

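The lesson for today’s IT projects is that in problem resolution it is important to examine each available course of action and weigh its risks against all the collected evidence. Only then should the final quadrant, recovery, begin, in which the operations group brings the IT solution back on-line and resumes service in line with the SLAs.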

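On Titanic, not all the available courses of action were adequately considered as part of problem resolution. Ismay made the wrong decision to sail on, and telegraphed the engine room "dead slow ahead" to carry out the recovery. Engineers later testified that the ship moved forward at 3 knots, accompanied by a grinding noise.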

Conclusions

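Today, many IT projects seriously compromise the operation stage because the project plan falls short in one respect: no problem-handling process was planned around an MTTR clock. Such a process is critical for helping the operations group restore service quickly and maintain service levels. It should also build in checks and balances, through a series of reviews, to minimize the likelihood of mistakes under pressure. And it should set out roles and responsibilities to ensure that the right personnel make the right decisions.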

The next installment will look at how Titanic’s officers reacted to the disastrous situation.

Original text:

To recap the famous ship’s situation: Titanic’s officers tried desperately to avoid a collision (see Part 8). However, the S-turn, a good decision, failed to decelerate the ship enough. Titanic came to a halt almost innocuously, in what hundreds of passengers later described as a quiver, rumble or grinding noise that lasted a few seconds, as if the ship were rolling over a thousand marbles.

There was no "crash stop," no fatalities or even minor injuries. There was no violent jolt sideways or repeated strikes along the ship’s length, which are common with a side swipe against an ice spur when a ship is turning very hard away from it. The breakfast cutlery that was laid out in the dining salons barely trembled, and drinks remained unspilled in the first class smoking rooms and lounges. All the evidence indicates that the ship came to rest on an underwater ice shelf at the base of the iceberg. Murdoch had prevented a head-on crash that could have demolished the first four compartments, and killed and maimed hundreds of passengers.

Likewise, when an IT solution falters in production, steps are taken according to a process prepared, planned and tested in the project itself (see Part 4). The process should be based around a Mean Time To Recovery (MTTR) clock, where the principal objective is to get the IT solution back on-line as quickly as possible to meet Service Level Agreements (SLAs). The solution is then patched up in the background and a temporary or permanent fix applied.

However, before going on-line, the integrity of the solution needs to be established first so the problem does not recur. With an eye on the clock, the operations group steps through the process and the four "problem" quadrants of detection, determination, resolution and recovery. When the MTTR clock starts ticking, signifying the beginning of loss of service (an outage, see Part 2), metrics should be captured as User Outage Minutes (UOMs), which measure how many users experience service loss and for how long.
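As a rough illustration of the MTTR clock and the UOM metric described above, the following Python sketch times an incident through the four quadrants. The `Incident` class, its field names and the sample figures are hypothetical and exist only for this example; the article does not prescribe a data model, and UOM is computed here under the simplifying assumption that the number of affected users stays constant for the whole outage.

```python
from dataclasses import dataclass

# The four "problem" quadrants, in the order the operations group steps through them.
QUADRANTS = ("detection", "determination", "resolution", "recovery")


@dataclass
class Incident:
    outage_start: float   # minute at which service was lost (MTTR clock starts)
    quadrant_end: dict    # quadrant name -> minute at which that quadrant ended
    users_affected: int   # assumed constant for the whole outage

    def quadrant_minutes(self):
        """Minutes spent in each quadrant, in order."""
        spent, previous = {}, self.outage_start
        for name in QUADRANTS:
            spent[name] = self.quadrant_end[name] - previous
            previous = self.quadrant_end[name]
        return spent

    def recovery_minutes(self):
        """Total minutes from loss of service to restored service."""
        return self.quadrant_end["recovery"] - self.outage_start

    def user_outage_minutes(self):
        """UOM: how many users lost service, multiplied by how long."""
        return self.users_affected * self.recovery_minutes()


def mean_time_to_recovery(incidents):
    """Average recovery time over a set of incidents."""
    return sum(i.recovery_minutes() for i in incidents) / len(incidents)


# Made-up sample data: two incidents of different size and length.
incidents = [
    Incident(0, {"detection": 2, "determination": 12, "resolution": 20, "recovery": 30}, 200),
    Incident(0, {"detection": 1, "determination": 4, "resolution": 7, "recovery": 10}, 50),
]

for incident in incidents:
    print(incident.quadrant_minutes(), incident.user_outage_minutes())
print("MTTR (minutes):", mean_time_to_recovery(incidents))
```

Breaking the clock down by quadrant shows where the time goes; in the first sample incident, determination takes longer than any other single step.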

This is far more accurate than measuring with the more commonly used percentage of service availability, e.g., 99.999 percent. Problem detection on Titanic was 37 seconds of warning given by the lookouts. This is not typical with an IT solution, which is likely to put out errors and warnings well before any significant failure occurs. This provides operators, automated or human, time to prevent the problem from occurring in the first place (see Part 8).
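To see why a percentage can hide what UOMs reveal, consider this small Python sketch. The helper functions, the 30-day reporting period and the user counts are assumptions made for illustration, not figures from the article.

```python
MINUTES_IN_30_DAYS = 30 * 24 * 60  # assumed reporting period


def availability_percent(downtime_minutes, period_minutes=MINUTES_IN_30_DAYS):
    """Classic availability figure: share of the period the service was up."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def user_outage_minutes(downtime_minutes, users_affected):
    """UOM: downtime weighted by the number of users who lost service."""
    return downtime_minutes * users_affected


# Two hypothetical outages of identical length but very different reach.
night_outage = {"downtime_minutes": 45, "users_affected": 20}
peak_outage = {"downtime_minutes": 45, "users_affected": 5000}

for label, outage in (("night", night_outage), ("peak", peak_outage)):
    print(
        label,
        round(availability_percent(outage["downtime_minutes"]), 3),
        user_outage_minutes(**outage),
    )
```

Both outages produce the same availability percentage, because the percentage only looks at elapsed time. The UOM figures differ by a factor of 250, which is the difference the business actually feels.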

Titanic’s captain, director and officers gathered on the bridge to determine a course of action. As part of problem determination, two search parties were dispatched into the bowels of the ship, forward and midship, to establish the extent of the damage. The first party returned within 10 minutes with a positive report of no major damage or flooding. In director Bruce Ismay’s mind, problem detection and determination were now complete. Resolution with a distress call was a problem for him, as it would compromise White Star’s position by shattering the hype around Titanic and destroy the brilliant marketing (see Part 2 and Part 5) that had lured the world’s wealthy elite onto the safest liner ever built.

A better resolution would be to get the ship back to Halifax, away from New York and the center of the world’s press. He could then better contain the news story, and marginalize it as a minor incident. He would be able to disembark passengers onto trains, patch the ship up and sail her back to Belfast for repairs. In fact, he could boldly claim that Titanic, a lifeboat in itself with all the latest in emerging technologies, was able to save herself from a potential disaster, and so further push the safety claims of the White Star Line.

With an IT solution today, determination of the problem assesses the impact of the problem on users. Determination has to be consistent with the available evidence. Reinvestigation of feedback mechanisms and logs is vital to determine if the problem has been building up and what is causing it.

In a complex IT solution, it is common to see the domino effect, where a small faulty element like a subsystem knocks out elements around it and triggers a cascade of problems. Not working out this precise sequence of events could lead to a misdiagnosis where a wrong fix is applied and the problem recurs. Determination is completed when the root cause assumptions of the problem are tested and proven to be correct.
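One way to work out the sequence of events in such a cascade is to combine failure timestamps from the logs with a map of which element depends on which. The Python sketch below is only an illustration under stated assumptions: the component names, the timestamps and the `root_cause_candidates` helper are hypothetical, and it checks direct dependencies only. It yields candidates, which still have to be tested and proven before determination is complete.

```python
def root_cause_candidates(failures, depends_on):
    """Return failed components that no failed upstream dependency can explain.

    failures:   component -> time of first failure (seconds since some reference)
    depends_on: component -> set of components it directly depends on
    """
    candidates = []
    for component, failed_at in failures.items():
        upstream = depends_on.get(component, set())
        # A component is explained if something it depends on failed first.
        explained = any(
            dep in failures and failures[dep] <= failed_at for dep in upstream
        )
        if not explained:
            candidates.append(component)
    # Earliest unexplained failure first.
    return sorted(candidates, key=failures.get)


# Hypothetical cascade: the database fails, then everything built on it.
failures = {"web": 30, "app": 20, "db": 10, "reports": 15}
depends_on = {"web": {"app"}, "app": {"db"}, "reports": {"db"}}

print(root_cause_candidates(failures, depends_on))
```

Fixing the web tier in this example would be the misdiagnosis the article warns about: it is the most visible failure, but the last in the sequence.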

With an IT solution it is important to be sure of the evidence at hand and to ask the following questions. Was the IT solution aware it was going to fail? If so, were any (automated) preventative actions attempted? Did it alert human or automated operators? Were any of the feedback mechanisms faulty, providing unreliable data? Is the diagnosis of the problem correct?

Titanic’s situation was critical but not catastrophic. Ismay was hell-bent on saving face, and his anxiety over White Star’s reputation created an atmosphere where mistakes were easily made. Titanic appeared to be completely stable, sitting snugly on the underwater ice shelf. Maybe, with due care, they could dislodge the ship with a minimum of damage. Ismay rushed into making a decision. The second search party, with the architect and carpenter, had not even returned with an assessment.

The lesson from this for IT projects today is that, in resolving the problem, it is important to consider the alternative courses of action available and the risk associated with each, based on all the collected evidence. Only then should the last quadrant of recovery commence. This is where the operations group puts the IT solution back on-line and resumes services, according to SLAs.
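The ordering described above (collect all the evidence, compare alternatives by risk, and only then recover) can be enforced by the process itself. The following Python sketch is a minimal illustration; the `IncidentProcess` class, the report sources, the numeric risk scores and the sign-off rule are all assumptions introduced for this example, not part of the article.

```python
class IncidentProcess:
    """Blocks resolution until evidence is complete, and recovery until resolved."""

    def __init__(self, required_reports):
        self.required_reports = set(required_reports)
        self.reports = {}
        self.chosen_action = None

    def add_report(self, source, finding):
        self.reports[source] = finding

    def resolve(self, options, reviewer):
        """options maps each course of action to an (assumed) risk score."""
        missing = self.required_reports - set(self.reports)
        if missing:
            raise RuntimeError(f"awaiting reports from: {sorted(missing)}")
        if len(options) < 2:
            raise RuntimeError("at least two courses of action must be compared")
        if not reviewer:
            raise RuntimeError("a reviewer must sign off on the decision")
        # Pick the lowest-risk option; real decisions weigh more than one number.
        self.chosen_action = min(options, key=options.get)
        return self.chosen_action

    def recover(self):
        if self.chosen_action is None:
            raise RuntimeError("recovery cannot start before resolution")
        return f"recovering via: {self.chosen_action}"


process = IncidentProcess(["forward party", "midship party"])
process.add_report("forward party", "no major damage or flooding")

options = {"sail forward": 0.8, "stay put and call for help": 0.2}
try:
    process.resolve(options, reviewer="captain")
except RuntimeError as error:
    print("blocked:", error)  # the second report has not come back yet

process.add_report("midship party", "assessment delivered")
print(process.resolve(options, reviewer="captain"))
print(process.recover())
```

With only the first search party back, the decision is refused, which is exactly the check that was missing on the bridge.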

On Titanic, not all courses of action were adequately explored as part of the problem resolution. Ismay made the fateful decision to sail forward and telegraphed the engine room "dead slow ahead" to recover the situation. Engineers later testified that the ship moved forward at 3 knots with a grinding noise.

Conclusions
Today, many IT projects severely compromise the operation stage by not planning adequately in the project for a process to deal with problems around an MTTR clock. Such a process is critical for enabling the operations group to quickly restore service and maintain service levels. It should also carry the checks and balances (through reviews) to minimize the likelihood of mistakes made in a pressure situation. And it should outline responsibilities and roles to ensure the right personnel make the right decisions.

The next installment will look at how the officers reacted to the disastrous situation.
