By Mark Kozak-Holland; translated by Yang Lei
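To recap Titanic's situation: after the collision (see Part 8), the ship lurched off the ice shelf, got under way again, and headed for Halifax. Everything seemed fine, but after 20 minutes at 8 knots it was plain how inaccurate the initial assessment had been. Continuing the voyage had taken its toll, and the ship had taken on more water. Sections that the collision had not touched began to leak under the water pressure. The rising water was turning into a catastrophe.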
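Today, the top priority is to bring service back online quickly with a temporary fix while a permanent fix is worked out. The essential point, however, is to monitor the service environment closely and confirm that the fix is holding.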
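The second search party, which included the architect Thomas Andrews and the carpenter John Hutchinson, reported major flooding in five compartments and recognized that Titanic had not been designed for this. The grinding along the bottom had badly torn the outer skin and damaged the double hull. The different rates of flooding in the six primary compartments also showed that the top hull was damaged. That things could become this bad was beyond the designers' expectations.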
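In today's IT projects, it is vital that the project team plan in advance for the kind of contingency in which no fix works and the situation goes beyond the MTTR procedures (see Part 9). For end users and customers, the service is down and hard to restore. For this situation, disaster recovery procedures should be set up, prepared, planned, and tested within the project itself (see Part 4), and institutionalized with dedicated staff (operations team/technical support).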
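The architect realized that Titanic's situation had gone beyond ordinary problem recovery and had become a disaster. He said the ship had two and a half to three hours before it sank, and he correctly determined that nothing could be done. Too many compartments were breached, and the flooding outpaced the pumps. The watertight bulkheads between compartments had not been carried up to the height of watertight horizontal traverses, so as the ship's nose went down, water spilled from one compartment into the next, like water filling an ice cube tray. The ballroom in effect became a broad channel distributing water across the ship.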
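At this point we can see how the compromises on non-functional requirements made during the construction phase of the project (see Part 3) led to massive consequences in this disaster.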
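Only the captain and some of the officers knew the extent of the damage, and they could now only watch the ship go down. No "abandon ship" order or other formal disaster declaration was issued. Only about 65 minutes after the collision did the captain order the officers to uncover the lifeboats and get the passengers and crew up on deck. Titanic had no formal disaster recovery plan.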
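Today, the next step would be to invoke the disaster recovery plan and communicate it to everyone. Every disaster recovery plan should come with a well-thought-out communication plan that communicates clearly with its different audiences.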
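Titanic's captain understood the seriousness of the problem soon after the collision, but he did not communicate it through his crew to the passengers. This deepened the confusion on board, especially among the crew. For example, the engine room sent engineers up to the deck, but the bridge sent them back down. Possible explanations for such poor communication on board include: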
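●The ship's communication systems were limited, and there was no public-address system. Important information could reach passengers only by word of mouth, with crew members knocking on cabin doors. With hundreds of cabins, this took far too long.
●The crew themselves were unclear about the real situation, so what passengers learned varied widely. The veteran captain had great confidence in the ship's safety systems and may have found the architect's verdict hard to believe, because at first everything still seemed normal. The captain behaved almost as if nothing were wrong.
●The captain knew full well that there were too few lifeboats, with room for only about half of the 2,223 people on board. So it was perhaps better not to cause a panic and, when the time was right, to let the lifeboats load passengers calmly and in order. The ship's hierarchical structure and the separation of classes meant that first-class passengers had priority access to the lifeboats.
●The captain feared that panic would spread. He and his officers knew the story of the French liner La Bourgogne, which had sunk 14 years earlier. Then, too, only half the passengers had lifeboat places, and widespread panic broke out. Captain Smith knew he could save the most lives by loading those lucky enough to reach the boats. So he did not inform all the passengers, especially those in third class.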
Today, a communication plan may be as important as a disaster recovery plan, for the following reasons:
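●Internal communication with employees greatly helps to control the impact of a disaster. The speed of communication matters too: for example, inform customer-facing employees first so that they can pass the message on to customers.
●External communication with customers is also important. The communication plan needs to reach the different customer segments through different channels, depending on the scope of the problem or disaster.
●Depending on how serious the service outage is, communicating with the press may be necessary. This requires identifying the key messages, how they will be communicated, and through which channels. Many companies have been caught off guard when roving reporters trapped unaware employees with questions.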
Conclusions
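Today, many IT projects severely compromise operations because they have no response prepared for the worst case. MTTR procedures alone are not enough. Besides a disaster recovery plan, a well-thought-out communication plan must also be in place. The next installment will look at invoking disaster recovery.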
Original text:
To recap Titanic’s situation: following the collision (Part 8), the ship was restarted and limped off the ice shelf with the objective of sailing back to Halifax. Everything appeared to be in good shape, but after 20 minutes of sailing at 8 knots it was apparent that the initial determination was grossly inaccurate. The forward motion had taken its toll and the ship had taken on more water. Parts of the ship that were initially unaffected had started to spring leaks under the strain of the water, and the increase in flooding was becoming catastrophic.
In today’s world, the top priority is getting service back online by applying a temporary fix while a permanent fix is created. However, in such a situation it is essential that the service delivery environment is closely monitored to determine whether the fix is holding.
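A minimal sketch of what "monitoring whether the fix is holding" could look like in practice. It is an illustration only, not a method from this series: the function name `fix_is_holding`, the error-rate samples, the 5% threshold, and the three-sample window are all assumptions chosen for the example.

```python
from collections import deque


def fix_is_holding(error_rates, threshold=0.05, window=3):
    """Decide whether a temporary fix is holding.

    error_rates: error-rate samples (0.0 to 1.0) taken after the fix went in.
    threshold and window are hypothetical values: the fix is judged to have
    failed once `window` consecutive samples exceed `threshold`.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` results
    for rate in error_rates:
        recent.append(rate > threshold)
        if len(recent) == window and all(recent):
            return False  # sustained degradation: the fix is not holding
    return True


if __name__ == "__main__":
    # One bad sample is tolerated; a sustained run of bad samples is not.
    print(fix_is_holding([0.01, 0.06, 0.01, 0.02]))
    print(fix_is_holding([0.01, 0.06, 0.07, 0.09]))
```

Requiring several consecutive bad samples avoids declaring failure on a single spike, at the cost of reacting a little later.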
The second search party, with the architect Thomas Andrews and the carpenter John Hutchinson, reported major flooding in five compartments and recognized that Titanic was not designed for this. The grinding along the bottom had badly ruptured the outer skin and damaged the double hull. The different rates of flooding in the six primary compartments indicated the top hull or tank top was damaged. It was beyond the expectations of the designer that something in nature could inflict so much damage.
In today’s IT projects, it is vital that the project team plan for the eventuality in which the fix does not resolve the problem and the outage runs beyond the Mean Time To Recovery (MTTR) for the IT solution (see Part 9). The service is unavailable to end-users and customers, and is no longer readily recoverable. For this situation, disaster recovery procedures need to be set up, prepared, planned and tested in the project itself (Part 4) and "institutionalized" with the staff (operations groups/technical support).
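The escalation rule described here can be sketched as a small decision function. This is a hedged illustration: the name `next_action`, the three action labels, and the idea of passing the MTTR target as a `timedelta` are assumptions made for the example, not part of any procedure defined in this series.

```python
from datetime import datetime, timedelta


def next_action(outage_start, now, mttr_target, fix_holding):
    """Pick the next step during an outage.

    mttr_target is the agreed Mean Time To Recovery for the solution
    (a hypothetical value supplied by the caller).
    """
    if fix_holding:
        return "monitor"  # temporary fix works; keep watching it
    if now - outage_start > mttr_target:
        return "invoke_disaster_recovery"  # beyond normal problem recovery
    return "continue_repair"  # still inside the MTTR window


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 23, 40)
    target = timedelta(hours=1)
    print(next_action(start, start + timedelta(minutes=20), target, False))
    print(next_action(start, start + timedelta(minutes=65), target, False))
```

Writing the rule down in advance means the decision to declare a disaster does not depend on someone's judgment in the middle of the incident.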
The architect realized the situation onboard Titanic had gone beyond normal problem recovery and had become a disaster. He stated that the ship had 2.5 to 3 hours before completely sinking, and accurately determined that the problem could not be fixed. Too many compartments were ruptured and were rapidly flooding beyond the capacity of all the pumps. The bulkhead walls separating the compartments had not been carried up to watertight horizontal traverses. Therefore, as the ship’s nose went down, water spilled from one compartment to another, rather like an ice cube tray filling with water. The ballroom acted as a massive channel for distributing water horizontally across the ship.
At this point in the story we see how the compromises to the non-functional requirements during the construction phase (see Part 3) of the project had a massive consequence in the disaster.
Only the captain and a few officers knew the extent of the damage and were now resigned to the ship sinking. No "abandon ship" command or formal declaration of a disaster was given. Around 65 minutes after the collision the captain just gave orders to the officers to uncover the lifeboats and get the passengers and crew ready on deck. No formalized disaster recovery plan was in place on board Titanic.
In today’s world, the next step would be to invoke a disaster recovery plan and communicate it to all onboard. Every disaster recovery plan needs to be accompanied by a well-thought-out communication plan that communicates clearly with different audiences.
Titanic’s captain knew the seriousness of the situation relatively soon after the collision, but did not communicate this through the ranks of crew and passengers on board. This increased the confusion, particularly among the crew. For example, the engine room sent some engineers to the boat deck, but the bridge sent them back down to the engine room. There are a number of possible explanations for the poor communication aboard Titanic:
·The ship had very limited communication, with no public-address systems. Important information was communicated to passengers by word of mouth, the crew knocking on each cabin door and common room. Considering there were hundreds of cabins, this could take hours.
·The crew didn’t have accurate information on the situation, so varying degrees of information were passed to passengers. The experienced captain believed in the safety systems of the ship and might have found the architect’s verdict very hard to accept because everything appeared so normal in the first hour. The captain acted almost as if the situation was "business as usual."
·The captain realized that the carrying capacity of the lifeboats was inadequate, with only enough room for about half of the estimated 2,223 people on board. Perhaps it seemed better to keep things calm and allow the lifeboats to be filled in an orderly manner when the timing was right. The ship’s hierarchical structure and segregation of classes meant that first-class passengers had the best access to the boats.
·The captain feared widespread panic. He and the other officers were aware of the French liner La Bourgogne, which sank 14 years earlier. With room in the lifeboats for only half the people onboard, widespread panic had broken out. Captain Smith knew he could save the maximum number of lives by loading only those who were lucky enough to reach the boats. So, he may have avoided informing all the passengers, specifically in third class.
In today’s world a communication plan is probably as important as a disaster recovery plan, for several reasons:
·Communicating internally with your employees can greatly help control the impact of a disaster. Also, the speed of communication is essential. For example, get information to customer-facing employees first, so they can inform customers.
·Communicating externally with your customers is essential and the plan needs to cater to customer segments using different channels, depending on the scope of the problem or disaster. A customer-retention strategy might need to be offered.
·Communicating with the press may be necessary depending on how serious the loss of service is. This requires the identification of key messages, how these are communicated, and through what channels. Many companies have been caught off guard when roving reporters trap unaware employees with questions.
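The three points above (employees first, customers by segment and channel, press only when the outage is serious) can be captured as data rather than left to improvisation. The following is a sketch under stated assumptions: the `Notice` class, the `PLAN` entries, the channel names, and the 1 to 3 severity scale are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notice:
    audience: str
    channel: str
    min_severity: int  # hypothetical scale: 1 = minor, 3 = disaster


# Listed in sending order: customer-facing employees come first so that
# they can inform customers. All entries are illustrative assumptions.
PLAN = [
    Notice("customer-facing employees", "internal chat", 1),
    Notice("all employees", "email", 2),
    Notice("customers", "status page", 1),
    Notice("press", "prepared statement", 3),
]


def notices_for(severity, plan=PLAN):
    """Return the notices to send, in order, for an outage of this severity."""
    return [n for n in plan if severity >= n.min_severity]


if __name__ == "__main__":
    for notice in notices_for(3):
        print(f"{notice.audience}: {notice.channel}")
```

Keeping the plan as an ordered list makes the sending order explicit and easy to review before any incident occurs.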
Conclusions
Today, many IT projects severely compromise an operation by not preparing for worst case scenarios. In today’s world, MTTR procedures are not enough. Aside from a disaster recovery plan, a well-thought-out communication plan needs to be in place. The next installment will look at invoking disaster recovery.