By Mark Kozak-Holland; translated by Yang Lei
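To recap the situation briefly: because problems were not surfaced through the key feedback mechanisms [see Part 7], possibly out of fear of reprisal, Titanic's early warning system had effectively failed. On top of this came general over-confidence in the ship's safety [see Part 4], apathy toward the fate of the French liner Niagara [see Part 6], and inaccurate information about the extent of the giant ice field [see Part 7]; together these produced a state of gross indifference. Finally, Ismay's pressure and the new SLO (service level objective) [see Part 5] pushed Titanic to her highest speed, beyond her operational limits.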
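Titanic was thus heading for a collision, and by now it was almost unavoidable. The ship ran at full speed through still, icy water littered with small bergs and pieces of ice. The lookouts, lacking binoculars and with a biting wind beating against their eyes, were trying to pick out the horizon through the haze that often forms in such conditions. As they struggled to make out the large dark mass looming, mirage-like, in front of them, their report to the bridge was delayed.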
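The lesson for today's IT projects is that operations staff can only keep a newly operational solution under control once they are thoroughly familiar with it. Their stance should be to prevent failures from happening in the first place and to ensure the solution meets its service levels. They also need good visibility into the solution and its surrounding environment. They should be able to quickly analyze and assess the data collected from the feedback mechanisms set up during the project's planning and testing stages [see Part 4]. When those mechanisms become noisy, they must diagnose the situation and determine the deviations from set norms, the potential impacts, and their overall extent. They also need to make the right decision on whether to escalate a problem, and at what priority.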
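Because the sea was calm and no breakers showed at the base of the growler, it was almost impossible to spot it early from a distance through the haze. By the time Titanic's lookouts determined that the dark mass was in fact a "growler", or so-called "black iceberg" (an iceberg that has flipped over and is dark in color), the situation had become critical. Once sure of their sighting, they gave the bridge the famous report: "Iceberg dead ahead!" Officer Murdoch, the chief duty officer, calmly took the report and with his binoculars judged the iceberg to be about 900 yards away. From all the evidence available today, Murdoch took the following actions in response: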
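·First, he cut power to the engines. This was reasonable: putting the engines straight into reverse would only churn up the water under the ship and hamper the steering, making the ship hard to control.
·Next, since there was not enough distance to stop the ship and no way around the iceberg, he tried a port-around, or S-turn (first hard-a-port, then immediately hard-a-starboard), to slow the ship sharply. Within the mere 40 seconds of reaction time available, this maneuver might bring the ship parallel to the iceberg rather than into a head-on collision.
·Third, as a precaution, he threw the electric switch to close the watertight doors in the bulkheads.
In hindsight, these were probably the best emergency measures available at the time.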
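The lesson for today's IT projects is that any anomaly spotted in a critical situation should be escalated smoothly, level by level, between operations staff (the lookouts) and the levels of technical support staff (the bridge officers). This safe, step-by-step escalation must be established in the project's testing stage through operability testing and operational testing. Only once operations staff are familiar with both the solution and its environment can they set up more streamlined escalation procedures.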
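At this point, the Titanic project itself clearly had deficiencies. For example, too little time was set aside for testing, and during the sea trials the officers never tried taking the ship through an S-turn. Nor did they simulate handling the ship under rough, dire, or emergency conditions as part of accident prevention.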
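The lesson for today's IT projects is that operations and technical support staff need dedicated time to think through the critical scenarios affecting the solution's operability, to work out strategies for failure prevention, and to settle on planned and proven courses of action. All of this needs to be completed and verified before the project is implemented. It also includes considering when automated operators must be overridden; otherwise, in a critical situation their actions could make the problem worse. In short, the ultimate goal is to prevent an outage, or the loss of the whole service, in the first place.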
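As Titanic swung back to starboard, Murdoch could no longer clear the iceberg, and he and his colleagues on the bridge could only brace themselves for a collision.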
Conclusions
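Many of today's IT projects are badly compromised because they pay too little attention to the operations stage. Setting up operations becomes an afterthought, and operations staff join the project team only at implementation instead of taking a prominent role from the planning and testing stages. Yet, from the business's point of view, operations bears direct and fundamental responsibility for upholding service levels. If an adequate operation (people, processes, tools) is not set up around a solution in the first place, the inevitable result is operational problems and potential failures that keep appearing for days, weeks, or even months, and at worst a complete service outage.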
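Titanic's levels of support had no time to familiarize themselves with their ship. They failed to clarify the scope of the anomalies and failed to pool their intelligence. Murdoch's final orders and maneuver were well executed, but had the maneuver been tested beforehand, it might have saved the ship. The friction between front-line operations staff and technical support over the missing binoculars did not help either, and the lookouts' hesitation wasted the most precious final seconds.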
The next installment will look at how a manageable situation turned into a disaster.
Original text:
In recapping the situation, Titanic’s early warning system had failed because of the failure to report problems with key feedback mechanisms (see Part 7), possibly because of the fear of reprisal. This, coupled with general over-confidence in the safety of the ship (see Part 4), apathy to the fate of the French Liner Niagara (see Part 6), and inaccurate information on the extent of the giant ice field (see Part 7) led to a state of gross indifference. Finally, Ismay’s pressure and new SLO (see Part 5) pushed Titanic to her highest speed and past her operational limits.
Titanic was heading for a collision. In fact, it was almost inevitable. The ship, at its maximum speed, raced through icy still waters littered with small bergs and pieces of ice. The lookouts, without binoculars and with a freezing wind hitting their eyes, were trying to outline the horizon through the haze common in these conditions. As they struggled to make out the shape of a dark mass looming in front of them, they delayed reporting this to the bridge.
The lesson for today’s IT projects is that in monitoring a newly operational solution, operations staff need to be very familiar with it. They need to be in a position to proactively prevent failures from happening in the first place and ensure the solution meets its service levels. They need good visibility into the solution and its surrounding environment. They need to be able to quickly assess and analyze data in front of them, collected from feedback mechanisms set up during the planned testing stage of the project (see Part 4). As the mechanisms become noisy they need to diagnose situations and determine deviations from set norms, any potential impacts and overall extent. They need to clarify whether there is something actually wrong or just problematic. They need to make the right decision as to whether to escalate, and at what priority.
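As a rough illustration of the assessment described above, the sketch below compares a reading against a set norm and decides whether to escalate and at what priority. It is only a minimal sketch: the `Reading` fields, the metric name, the tolerance values, and the rule that a deviation of more than twice the tolerance is critical are all assumptions made for the example, not something the article specifies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    NONE = 0      # within the set norm: keep watching
    LOW = 1       # "just problematic": log it and monitor
    HIGH = 2      # something is actually wrong: escalate to support
    CRITICAL = 3  # escalate immediately


@dataclass
class Reading:
    metric: str            # hypothetical metric name
    value: float           # observed value
    norm: float            # expected value under the set norm
    tolerance: float       # allowed absolute deviation (assumed)
    impacts_service: bool  # would a breach affect the service level?


def assess(reading: Reading) -> Priority:
    """Decide whether to escalate, and at what priority."""
    deviation = abs(reading.value - reading.norm)
    if deviation <= reading.tolerance:
        return Priority.NONE
    if not reading.impacts_service:
        return Priority.LOW
    # Assumed rule: more than twice the tolerance counts as critical.
    if deviation <= 2 * reading.tolerance:
        return Priority.HIGH
    return Priority.CRITICAL


if __name__ == "__main__":
    sample = Reading("response_ms", 235.0, 200.0, 20.0, True)
    print(sample.metric, assess(sample).name)
```

In practice the thresholds would come from the service levels agreed during planning and be checked during operability testing, so that staff are not deciding them for the first time in the middle of an incident.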
Titanic’s lookouts determined the dark mass was in fact a "growler," or "black iceberg"--an iceberg that has flipped over and is dark in color. With a calm sea and no breakers against the base of the growler it was practically invisible in the haze. This had now turned into a critical situation. Once sure of their sighting they notified the bridge with the infamous "Iceberg dead ahead!" Officer Murdoch, chief duty officer, calmly took the call and with his binoculars confirmed the sighting about 900 yards ahead. From all the evidence available today, Murdoch took the following actions:
· First, he cut power to the engines. This made sense as putting the engines into reverse would just churn up the water and limit the steering and handling capability of the ship.
· Second, there was not enough distance to stop the ship and he could not get around the iceberg. So he attempted a port-around, or S-turn, first steering hard-a-port and then hard-a-starboard, in an effort to sharply decelerate the ship. With only 40 seconds of reaction time, this would bring him parallel to the iceberg rather than into a head-on collision.
· Third, he threw the electric switch to close bulkhead watertight doors as a precaution.
In hindsight these were probably the best possible courses of action.
The lesson for today’s IT projects is that in a critical situation, any anomalies spotted need to be acted on with a smooth escalation between operations (lookouts) and the levels of technical support staff (bridge officers). This trouble-free escalation needs to be established in the project testing stage (see Part 4), attained through operability and operational testing. As operations become familiar with the solution and environment they set up more effective procedures.
At this point it is evident that there were serious deficiencies in Titanic’s project itself. For example, the time set aside for testing was too short, and during sea trials the officers did not go through any S-turn maneuvers. Nor did they simulate handling the ship under rough or dire conditions, or in an emergency situation, as part of accident prevention.
The lesson for today’s IT projects is that operation and technical support staff need time to map out critical scenarios for the operability of the solution, work out strategies for failure prevention and determine preset and proven courses of action. These need to be carefully carried out and tested prior to implementation. This includes considering automated operators which need to be overridden, otherwise they could cause more problems in a critical situation. After all, the ultimate goal is preventing an outage from occurring, or loss of service, in the first place.
As Titanic swung back to starboard, Murdoch just failed to clear the iceberg and he and the bridge staff braced themselves for a collision.
Conclusions
Today, many IT projects severely compromise the operations stage by not paying enough attention to it. Setting up operations is an afterthought and staff is not brought into the project until implementation rather than taking a prominent role in the planning and testing stages. After all, operations are ultimately responsible for upholding the service levels of the solution to the business. The inability to set up an adequate operation (people, processes, tools) around a solution in the first place will inevitably lead to operational problems that manifest themselves days, weeks or even months after going live and a potential failure or a worst case outage.
Titanic’s levels of support had little time to familiarize themselves with the ship. They had failed to clarify the scope of anomalies and put together the intelligence. Murdoch’s maneuver was well executed, but perhaps with some testing he could have pulled it off. The friction between operations and technical support over the missing binoculars did not help in the situation, and the lookouts’ hesitation cost vital seconds.
The next installment will look at how a manageable situation was turned into a disastrous one.