
TRIBUTE TO FOUNDERS: ROGER SARGENT. PROCESS SYSTEMS ENGINEERING

TeCSMART: A Hierarchical Framework for Modeling and Analyzing Systemic Risk in Sociotechnical Systems

Venkat Venkatasubramanian and Zhizun Zhang
Dept. of Chemical Engineering, Complex Resilient Intelligent Systems Laboratory, Columbia University, New York, NY 10027

DOI 10.1002/aic.15302
Published online in Wiley Online Library (wileyonlinelibrary.com)

Recent systemic failures in different domains continue to remind us of the fragility of complex sociotechnical systems. Although these failures occurred in different domains, there are common failure mechanisms that often underlie such events. Hence, it is important to study these disasters from a unifying systems engineering perspective so that one can understand the commonalities as well as the differences to prevent or mitigate future events. A new conceptual framework is proposed that systematically identifies the failure mechanisms in a sociotechnical system across different domains. Our analysis includes multiple levels of a system, both social and technical, and identifies the potential failure modes of equipment, humans, policies, and institutions. With the aid of three major recent disasters, we demonstrate how this framework can help us compare systemic failures in different domains and identify the common failure mechanisms at all levels of the system. © 2016 American Institute of Chemical Engineers AIChE J, 00: 000–000, 2016

Keywords: artificial intelligence, design, fault diagnosis, safety, process control

Systemic Failures: Introduction

Recent systemic failures in different domains such as the Global Financial Crisis (2007–2009), BP Deepwater Horizon Oil Spill (2010), and Indian Power Outage (2012) continue to remind us of the fragility of complex sociotechnical systems.
Systemic failures occur when an entire system collapses, where the system is typically a large entity whose failure negatively impacts a large number of people and their environment, causing enormous financial losses. Examples of such systems are refineries, inter-state power grids, country-wide financial networks, large institutions, and so forth. Union Carbide's Bhopal Gas Tragedy in 1984, in which an estimated 5000 died and about 100,000 were seriously injured by the accidental release of methyl isocyanate, was a systemic failure. Another example is the Piper Alpha Disaster in 1988, where an offshore oil platform operated by Occidental Petroleum in the North Sea, U.K., exploded, killing 167 and resulting in about $2 billion in losses. The Challenger (1986) and Columbia (2003) Space Shuttle Disasters, the Schering Plough Inhaler Recall (1999), the Northeast Power Blackout (2003), the spread of SARS (2003), the BP Texas City Refinery Explosion (2005), and the Johnson & Johnson Multidrug Recall (2010) are all examples of systemic failures in different domains. Examples of financial systemic failures include the Enron (2001) and WorldCom (2002) collapses, and the Madoff Ponzi Scheme (2008). The collapse of the News of the World newspaper organization (2011) is an example of systemic failure in the media domain.

In each case, official postmortem inquiries were conducted and reports of the accidents were produced. Chemical engineers might study the BP Texas City Refinery Explosion Report,1 and people from the financial world may browse The Financial Crisis Inquiry Report,2 but rarely does one compare failures across the different domains to study their commonalities and differences. But when one undertakes such a comparative study, one is struck by the commonality across different domains. There is an alarming sameness about such disasters, which can teach us important fundamental lessons.
Although the failures listed above occurred in different domains, in different facilities, triggered by different events, there are, however, common failure mechanisms that often underlie such events. Systematically identifying and understanding these mechanisms is essential to avoid such disasters in the future.

Modern technological advances are creating an increasing number of complex sociotechnical systems. By sociotechnical we mean that these systems comprise social elements (i.e., humans) as well as technical elements (such as pumps, valves, reactors, etc.). The human elements are not only an integral part of the system, they are also often the cause of major failures. The task of designing such systems, and their control mechanisms, to ensure safe operations over their life cycles is extremely challenging. Complex sociotechnical systems have a very large number of interconnected components with nonlinear interactions that can lead to "emergent" behavior—that is, the behavior of the whole is more than the sum of its parts—that can be difficult to anticipate and control.3 Moreover, these systems are not isolated—they interact with humans and the physical environment; in particular, human decision making and the associated errors are part of the feedback processes in these systems. The cumulative effect of the nonlinearity, interconnectedness, and interactions with humans and the environment makes these systems-of-systems potentially fragile and susceptible to systemic failures. We propose a conceptual framework that can assist in systematically identifying the failure mechanisms in a complex sociotechnical system.

Correspondence concerning this article should be addressed to V. Venkatasubramanian at venkat@columbia.edu.
Much like hazard and operability (HAZOP) analysis, which helps us identify potential hazards in equipment and process flowsheets systematically by examining the failure modes of different components methodically, our framework examines the entire sociotechnical system, including the corporate, regulatory, and societal layers, and identifies the potential failure modes of equipment, humans, policies, and institutions. We also demonstrate how this new framework helps us compare systemic failures in different domains, in a detailed manner, and reveal the common failure mechanisms at all levels of the system. We compare the BP Texas City Refinery Explosion, the Global Financial Crisis, and the Northeast Blackout. Such a comparative analysis has not been conducted before, as most people generally think that these are completely different events, occurring in entirely different domains, and, therefore, are unlikely to have any common features of any value. We show that there are indeed common valuable lessons.

This article is organized as follows. The next section discusses the common patterns of failures at multiple levels. The section after that introduces our hierarchical modeling framework, the TeleoCentric System Model for Analyzing Risks and Threats (TeCSMART). We then present a failure analysis and comparison, analyzing three prominent case studies—the Global Financial Crisis, the BP Texas City Refinery Explosion, and the Northeast Power Blackout—using the TeCSMART framework, and discuss their similarities and differences, shedding new light on systemic failures. Such a model-based comparative study has not been made before. The last section discusses future directions.

Systemic Failures: Common Patterns of Failures at Multiple Levels

Postmortem investigations of many disasters have shown that systemic failures rarely occur due to a single failure of a component or personnel.
Even though the senior management of a company typically tried to pin the blame on some unanticipated equipment failure, operator error, or a rogue trader, that is rarely the case for major disasters. For instance, Union Carbide initially claimed that the Bhopal Gas Tragedy was caused by a disgruntled employee who had sabotaged the equipment.4 Enron management initially blamed Andrew Fastow, Enron's CFO, as the sole culprit.5 But, again and again, investigations have shown that there are always several layers of failures, ranging from low-level personnel to senior management to regulatory agencies, that have led to major disasters. Such investigations have shown that safety procedures had been deteriorating at the failed facilities for months, if not years, prior to the accident. For example, in the case of Piper Alpha, the Permit-to-Work system had been dysfunctional for months.6 In Bhopal, regular maintenance of safety backup systems had not been conducted for months.4 Massey Energy ran up about 600 safety violations in its Upper Big Branch mine during 2009–2010.7 OSHA statistics show that BP ran up 760 "egregious, willful" safety violations during 2008–2010 in Ohio and Texas. Compare this with the corresponding numbers for the other oil companies: Sunoco (8), Conoco-Phillips (8), Citgo (2), and Exxon (1).8 This is clear evidence of a breakdown of the corporate safety culture for months or years.

One sees a similar pattern in financial disasters as well. For example, in Enron, its senior management, led by Ken Lay and Jeff Skilling, created an extreme performance-oriented, risky culture that seems to have tolerated unethical behavior, which resulted in many violations, market manipulations, and so on.5 In the subprime crisis, the perverted incentive mechanisms in mortgage lending and its subsequent securitization and trading caused individuals and corporations to make highly leveraged bets that resulted in unsustainable risk extremes.
Thus, it was not a question of if a disaster would occur but when.

Another common pattern is that people had not identified all the serious potential hazards. They had often failed to conduct a thorough process hazards analysis that would have exposed the serious hazards which resulted in the disasters later. Such incomplete hazards analysis was highlighted in the Cullen enquiry into Piper Alpha.53 Failure to perform such a hazards analysis was partially responsible for the meltdown of Bear Stearns, Lehman Brothers, Merrill Lynch, and others in the subprime market fiasco.9 However, the few who had performed such a hazards analysis did see the crash coming and profited billions of dollars, as described in Michael Lewis' book, now a movie, The Big Short.10 Yet another common cause is the inadequate training of plant personnel to handle serious emergencies. All in all, the responsibility for a systemic failure typically goes all the way to the top levels of company management, who had paid only lip service to safety, tolerated noncompliant behavior, and even encouraged excessive risk taking and unethical behavior, all of which resulted in a poor corporate culture of safety,1,11–13 which in turn paved the way for the disasters.

We also find that serious failings by regulatory, ratings, and auditing agencies, tolerated, and sometimes even encouraged, by a laissez-faire political environment, played a significant role. First and foremost, it does not matter whether the systems are chemical, petrochemical, or financial—self-policing does not work. This seems so obvious that people should not have to die, or lose all their money, to make us realize this. Sensible regulations are essential, but, more importantly, they must be audited and enforced by suitably trained personnel who have no conflicts of interest.
Consider the betrayal of public trust by Arthur Andersen, the supposedly independent auditor of Enron, whose aiding and abetting of Enron's cooked books was instrumental in its systemic failure.5 The subprime market failures showed us that the rating agencies, which were supposed to make an independent assessment of the subprime mortgage-backed securities, were so dependent on their Wall Street clients for their business that they merrily went stamping AAA ratings on junk instruments. Of the AAA-rated securities issued in 2006, an astonishing 93% were later downgraded to junk status.14 It is the same lesson we were taught by the BP Deepwater Horizon Oil Spill—how the Minerals Management Service was inherently conflicted between its goals of awarding leases and enforcing safety regulations.15 But this lesson should have been learnt a long time ago, after the Piper Alpha Disaster. Based on the Cullen Report's findings in 1988, the British government moved the responsibility for safety oversight from the Department of Energy to the Health and Safety Executive (HSE), the independent watchdog agency for work-related health, safety, and illness. A separate division was created within the HSE to monitor the safety of the offshore oil and gas industry.6

Indeed, the importance of addressing non-technical common causes, such as those described above, as an integral part of systems safety engineering was pointed out as far back as 1968 by Jerome Lederer, the former director of the NASA Manned Flight Safety Program for Apollo, who wrote:

System safety covers the entire spectrum of risk management. It goes beyond the hardware and associated procedures to system safety engineering.
It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other non-technical but vital influences on the attainment of an acceptable level of risk control. These nontechnical aspects of system safety cannot be ignored.

To understand systemic failures and learn from them, one needs to go beyond analyzing them as independent one-off accidents, and examine them in the broader perspective of the potential fragility of all complex systems. One needs to study the disasters from a unifying sociotechnical systems engineering perspective, so that one can thoroughly understand the commonalities as well as the differences, and gain insights about the system-wide breakdown mechanisms in order to better design, control, and manage such systems in the future.

It is quite clear that to properly model and analyze systemic risk, one not only needs to model failures at the lowest level of a sociotechnical system (such as the failures of equipment) but also, more importantly, model the human and institutional failures that occur at the higher levels of the system. The human elements are not only an integral part of the system, they are also often the cause of major failures. Hence, it is important to account for them, as explicitly as possible, in any risk modeling framework. This has not always been the case in the engineering modeling literature. For instance, most modeling studies in the process control literature do not account for errors committed by humans in their methodologies.
HAZOP analysis, as another example, considers only equipment and operation failures in its guide-word based approach. We need a systematic methodology that can identify potential failure mechanisms, due to equipment, process, human, and institutional failures, at different levels of a sociotechnical system. This is what we try to accomplish in this article. This article is largely a conceptual contribution, describing a new modeling framework that articulates how the different levels of a complex sociotechnical system may be formally approached using control-theoretic ideas. Building on our prior work,16,17 we present such an integrative multiscale modeling framework, which addresses the role of the human element explicitly, and discuss its implications in the context of several prominent systemic failures in different domains.

In recent years, there has been interesting progress in understanding and modeling systemic risk in complex sociotechnical systems. Economists and physicists have used network theory to do this for financial systems.18,19 Control theorists have proposed approaches by adapting traditional control theory for understanding such systems.20,21 Others have proposed agent-based modeling22 or domain-independent system safety principles.23 Our prior work in this area has stressed the need for modeling cause-and-effect knowledge explicitly as well as the need for a multiscale modeling framework.16,17,24–28 Philosophically, our framework is similar to what has been proposed by Rasmussen and Svedung29 and by Leveson.30–33 In particular, it shares the main theme discussed by Leveson and Stephanopoulos,30 but we differ in the conceptual details of the underlying modeling framework. In addition, we demonstrate the utility of our framework across different domains using a comparative analysis of three well-known systemic failures, which has not been done before.
TeCSMART Framework

Complexity, in general, is hard to define and quantify precisely as it comes in different flavors and can mean different things in different contexts. For instance, there is algorithmic or computational complexity as defined by computer scientists, which measures how much computational effort or time a particular problem might require for its solution—for example, polynomial vs. exponential time, as a function of some key scaling parameter of the given problem. Then there is the physics perspective, dynamical system complexity, which originated from the field of nonlinear dynamics and chaos. This deals with the general inability to predict the future behavior of a nonlinear dynamical system. In other fields such as biology (life and social sciences, in general), complexity is used to describe, in qualitative terms, the incredible diversity, organizational sophistication, and characteristics of individual agents (e.g., a cell or an animal), systems (e.g., an ecosystem, human society), processes/phenomena (e.g., intercellular and intracellular interactions), and so forth.

While it may be hard to state exactly what complexity, or what a complex system, is, there is consensus, however, as to what features are typically associated with a complex system. Complex systems typically consist of many diverse, autonomous, and adaptive components that interact with one another, and their environment, in nonlinear, dynamical ways to produce a very large set of potential future states or outcomes. Interactions between such parts at a given scale typically give rise to "emergent" properties at larger scales in space and/or time, sometimes through self-organization, without any global knowledge or central control, that are hard to predict from the properties of the parts.
They tend to have many feedback loops (both positive and negative), among their components as well as with their environment, which can cause adaptation and induce goal-directed (i.e., teleological) behavior, either intentionally or implicitly, thereby potentially altering the course of their future behavior. Hence, their characteristics are typically not reducible to an elementary level of description. Thus, the essential features of a complex sociotechnical system may be summarized as: (1) goal-driven behavior, (2) many agents or components/sub-components, (3) organization in a multi-layered hierarchy or network, (4) nonlinear dynamical interactions among its agents (or components) and with the environment, (5) feedback loops, (6) decentralized control (i.e., local decision making), and (7) emergent behavior.

Most human-engineered complex systems, such as chemical plants, corporations, transportation networks, power grids, governments, societies, and so forth, are organized as a hierarchical network of human and nonhuman (e.g., machine) elements. Generally speaking, they comprise autonomous and non-autonomous elements, which usually translate to human and nonhuman entities. In this article, we are not considering nonhuman entities that are autonomous, such as robots, as they have not reached human-like autonomous capabilities yet, even though this is going to be an important development in the coming decades.

We call our modeling framework TeCSMART (TeleoCentric System Model for Analyzing Risks and Threats). Telos means goal or purpose in Greek. The central theme of our approach is the emphasis on recognizing and modeling the goals of different agents, at different levels of abstraction, in a complex sociotechnical system. Both individual players and groups in a complex system are goal-oriented, driven to act by their goals and incentives.
Therefore, it is important to recognize and model this goal-driven behavior. Individuals (or groups) usually have different goals, and these goals may even conflict with one another. The dynamics of how goals across the system interact, transform, and disperse in the hierarchy affects both individual and systemic performance. We use a simple feedback control module as a model for representing this goal-driven behavior, as we discuss below.

We propose an integrative framework that tries to capture the essential features of a complex teleological system with the purpose of modeling, analyzing, and managing systemic risk by accounting for the effects of both autonomous (i.e., human) and nonhuman (i.e., "machine" or "mechanical") entities in a unified and systematic manner. We model a complex teleological system as a sociotechnical entity that is embedded in a society, affected by the society's goals and political environment.

Figure 1. TeCSMART framework.

This leads to a multi-scale modeling framework, having seven layers organized as a hierarchy, as shown in Figure 1, that naturally arise and represent different perspectives of the entire system. Each layer above is a zoomed-out, aggregate view of the immediate layer below. For example, the block representing a process unit in the network of the Plant View contains the individual feedback loop of the Equipment View. The bottom layer of the stack is the basic building block of a system (e.g., equipment and processes). The top layer of the stack is the macroscopic view of a society. Each layer has its own set of goals, which drive the decision-making and actions taken by the agents in that level. The decisions are taken based on the inputs the layer receives from the layers immediately above and below it.
Similarly, the actions are communicated to these adjacent layers as outputs. These decisions/actions are indicated, in Figure 1, by the arrows that capture these information flows, up and down the hierarchy. These information flows are the feedback loops between the layers (i.e., interlayer feedback loops). There are also feedback loops within a given layer, as depicted in Figure 1, which are intralayer loops. Associated with each layer is a set of agents (autonomous and nonautonomous), organized in a particular configuration that is appropriate for the goals of that layer (e.g., the layout of equipment in a chemical plant, called a flowsheet). Such a multilayered representation lends itself naturally to accounting for emergent phenomena that arise from one scale to another.

We propose a uniform and unified input-output modeling framework that is conceptually the same across all levels. This elementary input-output model structure, which serves as a building block in our framework, is shown in Figure 2. Specifying such a uniform modeling structure across all levels has the advantage of integrating and unifying the analysis of the outcomes at different levels in a consistent manner. Such a template structure allows us to systematically identify the various failure modes of the different elements at different levels of the hierarchy, as we discuss below. There are five key elements in this control-theoretic information modeling building block: (1) sensor, (2) actuator, (3) controller, (4) "process" unit that transforms inputs to outputs, and (5) connection (e.g., wires and pipes). These, combined with input and output, complete the picture. The functions of these elements, as well as their failure modes, at different levels of the hierarchy are illustrated in the discussion below, using examples from chemical engineering. It is relatively easy to generalize this discussion to other engineering domains.
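The five-element building block, together with the elementary failure-mode sets discussed later in this section (sensors, actuators, and controllers failing high, low, or zero; connections additionally failing reverse), can be sketched as a small data model. This is an illustrative Python sketch; the class and function names are ours, not part of the TeCSMART specification.

```python
from dataclasses import dataclass

# Elementary failure modes: sensors, actuators, and controllers can fail
# high, low, or zero; connections can additionally fail "reverse".
BASIC_MODES = ("high", "low", "zero")

@dataclass
class Element:
    name: str      # e.g., "thermocouple"
    role: str      # sensor | actuator | controller | process | connection
    modes: tuple = BASIC_MODES

def building_block(process_name):
    """The five key elements of one input-output building block."""
    return [
        Element("sensor", "sensor"),
        Element("actuator", "actuator"),
        Element("controller", "controller"),
        Element(process_name, "process"),
        Element("connection", "connection", BASIC_MODES + ("reverse",)),
    ]

def enumerate_failures(block):
    """HAZOP-like enumeration of single-element failure modes."""
    return [(e.name, m) for e in block for m in e.modes]

heater_block = building_block("stirred tank heater")
print(len(enumerate_failures(heater_block)))  # 3*4 + 4 = 16 single failures
```

Because the same template recurs at every layer, the same enumeration applies whether the "sensor" is a thermocouple or an auditing agency; only the interpretation of "high", "low", and "zero" changes.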
The domain of finance requires special treatment, and we make that connection wherever needed. As an organized group, these entities collect, decide, act on, report, and receive a variety of performance information and metrics. At any level, the layer below acts as the sensors, actuators, and processes in the interlayer feedback loop, while the layer above behaves like a controller that evaluates the lower level's performance and sets new goals. In a chemical plant, for example, agents in the Equipment View Layer collect, decide, and act on individual process and equipment performance data and metrics (such as temperature, pressure, flow rate, batch times, etc.), which are vital for safe, efficient, and profitable operation, report them to the Plant View Layer, and receive, in turn, local control specifications (such as temperature and pressure set points) from the Plant View Layer. The Plant View Layer agents make these decisions by considering information from all the processes and equipment under their purview as well as by considering manufacturing targets (such as what to make, how much to make, when to make, etc.). These targets, in turn, are decided by the agents in the Management View, get translated into the associated set points and constraints by the agents in the Plant View, and are communicated down to the Equipment View as inputs. The target metrics are decided by the agents in the Management View by responding to competitive market conditions as dictated by the Market View. In a similar manner, relevant information regarding market or company stability, performance, fair competition, etc., is monitored and acted on by the agents in the Regulatory View, by enacting and enforcing appropriate regulations approved by the agents in the Government View (such as the Congress in the U.S.).
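This interlayer pattern, in which each layer acts as a controller for the layer below and reports performance upward, can be sketched in miniature. The layer names below follow the framework, but the proportional decision rule and all numbers are illustrative assumptions of ours, not from the paper.

```python
# Minimal sketch of interlayer feedback: targets flow down the hierarchy,
# measurements flow up, and each layer acts as a proportional "controller"
# for the layer below. Gains and targets are illustrative only.
class Layer:
    def __init__(self, name, gain=0.5):
        self.name = name
        self.gain = gain      # how aggressively deviations are corrected
        self.target = None

    def receive_target(self, target):
        self.target = target

    def decide(self, measurement):
        """Turn a reported measurement into a target for the layer below."""
        error = self.target - measurement
        return measurement + self.gain * error

# A slice of the seven-layer hierarchy, top to bottom.
management = Layer("Management View")
plant = Layer("Plant View")
equipment = Layer("Equipment View")

management.receive_target(100.0)                  # e.g., a production target
plant.receive_target(management.decide(80.0))     # 80 reported up -> target 90
equipment.receive_target(plant.decide(80.0))      # 80 reported up -> target 85
print(equipment.target)                           # 85.0
```

The point of the sketch is structural: the same `decide` interface serves every layer, which is what lets the framework treat a board of directors and a temperature controller with one template.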
In an ideal democracy, a government is elected by the citizens of that society, the Societal View, who have the final word in determining what kind of government and laws they would like to live by.

Similar activities occur within layers through intralayer feedback loops. In the Equipment View Layer, for example, the stirred tank heater depicted in Figure 3 has sensors to measure temperature and tank level. Controllers evaluate these metrics and send new control signals to valves. In the Management View Layer, a firm's accounting team collects the performance data and shares it with the Board of Directors. The Board sets the company's goals based on the data. Each division follows these goals and carries out its daily operations. Periodically, new performance data is collected and the goals updated. At each layer, if autonomous or nonautonomous agents do not comply with the goal, disturbances arise at that layer. Controllers take the disturbance into account and set goals accordingly. Such intralayer feedback loops exist in all seven layers. Details of each layer are presented in the following discussion.

Figure 2. Schematic of a feedback control system (adapted from Ref. 34, Chemical Process Control, Fig. 13.1b, p. 241).

Figure 3. Stirred tank heater example (adapted from Ref. 34, Chemical Process Control, p. 89).

Perspective I: Equipment View Layer

In the Equipment View Layer, the focus is on individual equipment, such as reactors and distillation columns in the context of a chemical plant, and their operating conditions. A chemical plant is a collection of such process units suitably organized (called a flowsheet) to meet the plant-wide goal of manufacturing a desired chemical product at targeted levels of quality, quantity, cost, time of delivery, etc., safely and optimally. This collection is seen in Perspective II, the Plant View Layer.
The time scale for the Equipment View Layer is typically in seconds and minutes, as process dynamics happens in real-time. In the Equipment View Layer, the autonomous agents involved are typically engineers and operators, and the nonautonomous agents are equipment, including control systems. While regulatory control systems can exhibit a certain degree of autonomy, it is negligible compared to the range of autonomy exhibited by humans. Hence, we classify regulatory controllers as nonautonomous.

Consider, for example, the stirred tank heater process (Figure 3), where the goal is to control the level h and temperature T of the fluid in the tank that is subject to fluctuations in the inlet flow rate Fi and temperature Ti. The desired level of the fluid is referred to as the set point level hset, and the desired temperature as Tset. These are accomplished by the two feedback controllers (loops 1 and 2), which receive the current h and T in real-time from the sensors (level gauge and thermocouple), and suitably manipulate the outlet flow rate F and the steam flow rate Fsteam by opening or closing the respective control valves (actuators). The seven elements of the information modeling block for this system are: (1) input: Fi, Ti, hset, Tset, Fsteam, (2) output: h and T, (3) sensors: level gauge and thermocouple, (4) actuators: outlet flow and steam valves, (5) controller, (6) "core" process unit: tank and heater, and (7) connection: pipes and wires. The constraints are lower and upper limits on the level and the temperature of the fluid in the tank. The goal at the Equipment View level is centered on the performance of individual equipment such as heaters, reactors, distillation columns, and so forth—that is, each piece of equipment has its goal of operating at its set point(s). At this level of granularity, typically, for engineering applications, one can develop detailed dynamical models of the equipment and processes.
These tend to be a set of differential and algebraic equations (DAEs) that are solved to simulate process/equipment behavior. Since the purpose of this article is not to discuss these models at length, we refer the interested reader to several standard sources in the literature.34–37 As an example, we list below the dynamical model equations for the stirred tank heater:

$$A \frac{dh}{dt} = F_i - F$$

$$A h \frac{dT}{dt} = F_i (T_i - T) + \frac{Q}{\rho C_p}$$

Another kind of model used at this level, called the signed directed graph model (or signed digraph [SDG] model), is based on graph-theoretic ideas to represent cause and effect relationships in a process or equipment.24–26 The SDG model for the heater example is shown in Figure 4. The nodes represent input and output variables. The arcs represent either positive (solid lines) or negative (dotted lines) relations between nodes. The figure is read as follows: a change in the inlet temperature Ti positively affects the temperature T in the stirred tank; for example, if Ti increases, T will increase. T negatively affects the temperature difference ΔT, which is the set point temperature Tset minus the stirred tank temperature T. As T increases, ΔT decreases. This means that less steam Fsteam is needed in the stirred tank, because T gets close to the set point temperature Tset. This positive relation between ΔT and Fsteam is depicted by a solid arc between the two nodes. Fsteam, in turn, positively affects the temperature T in the stirred tank. This causal behavior among T, ΔT, and Fsteam corresponds to loop 2 in Figure 3. These qualitative models are easier to develop and analyze, in comparison with the DAE models, particularly for modeling and analyzing failure modes and hazards.17,28 However, as they are qualitative in nature, they are limited to certain kinds of queries and can lead to ambiguities.
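The two heater equations can be integrated numerically with a short forward-Euler sketch. The paper does not prescribe control laws, so simple proportional feedback stands in for loops 1 and 2 here, and all parameter values are illustrative assumptions.

```python
# Forward-Euler integration of the stirred tank heater model:
#   A dh/dt   = Fi - F
#   A h dT/dt = Fi (Ti - T) + Q / (rho * Cp)
# Proportional feedback laws stand in for loops 1 and 2; all numbers are
# illustrative, not from the paper.
A, rho, Cp = 1.0, 1000.0, 4.18        # tank area, density, heat capacity
Fi, Ti = 0.01, 290.0                  # inlet flow and inlet temperature
h, T = 0.5, 300.0                     # initial level and temperature
h_set, T_set = 0.5, 310.0             # set points
dt = 0.1                              # time step [s]

for _ in range(2000):
    F = Fi + 0.05 * (h - h_set)       # loop 1: outlet valve corrects level
    Q = 5000.0 * (T_set - T)          # loop 2: steam duty drives T to T_set
    dhdt = (Fi - F) / A
    dTdt = (Fi * (Ti - T) + Q / (rho * Cp)) / (A * h)
    h += dt * dhdt
    T += dt * dTdt

print(round(h, 3), round(T, 1))       # level holds at h_set; T approaches T_set
```

Note that with proportional-only control T settles slightly below Tset, the familiar steady-state offset of P-only loops; the qualitative SDG analysis that follows captures the same loop without needing any of these numbers.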
Nevertheless, such cause-effect based qualitative models are very useful when modeling a social system, where DAE models are usually hard to develop, such as the bank-dealer system discussed by Bookstaber et al.38 In this case, the nodes are variables related to a bank-dealer's investment and lending activities. In Figure 5, the left-hand side depicts the connections and activities within the bank-dealer, while the right-hand side shows the SDG model. A bank-dealer consists of three major desks: the finance desk determines where money should go; the prime broker determines how much money to lend based on the collateral collected; and the trading desk determines whether to sell to the market or buy from the market based on the money received from the finance desk and the leverage ratio it holds.

Figure 4. SDG for the tank heater example.

The SDG model is read as follows: the finance desk collateral C_FD positively affects the funding capacity V_FD. V_FD in turn positively affects the loan capacity of the prime broker, V_PB, and the leverage set point of the trading desk, k_TD^SP. In the prime broker, both the collateral amount C_PB and the margin rate v_PB positively affect the loan capacity V_PB. In the trading desk, the leverage set point k_TD^SP and the current leverage k_TD determine the leverage difference Δk_TD, which positively affects the inventory quantity of the trading desk, Q_TD. As Bookstaber et al.38 demonstrate, using the SDG model one can quickly examine the causal relations of a social system like the bank-dealer system, and study unstable conditions and risks such as the fire sale and funding run scenarios. One can always incorporate other modeling methods with the TeCSMART framework. Usually, in order to develop a quantitative model (DAE model) or a qualitative model (SDG model), one needs to determine the initial conditions of a system.
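As a sketch of how such a signed digraph can be queried programmatically, the fragment below encodes the heater SDG arcs described for loop 2 and qualitatively propagates a single disturbance; the graph encoding and the simple sign-propagation rule are our illustrative assumptions, not code from the article:

```python
# Signed digraph (SDG) for the tank heater, per the text:
# Ti -> T (+), T -> dT (-) where dT = Tset - T,
# dT -> Fsteam (+), Fsteam -> T (+).
# Each arc maps (cause, effect) to a sign of +1 or -1.
ARCS = {
    ("Ti", "T"): +1,
    ("T", "dT"): -1,
    ("dT", "Fsteam"): +1,
    ("Fsteam", "T"): +1,
}

def propagate(start, sign=+1):
    """Qualitatively propagate a disturbance through the SDG.
    Returns the first-reached sign for each variable
    (breadth-first pass; nodes are not revisited)."""
    effects = {start: sign}
    frontier = [start]
    while frontier:
        nxt = []
        for cause in frontier:
            for (c, e), s in ARCS.items():
                if c == cause and e not in effects:
                    effects[e] = effects[cause] * s
                    nxt.append(e)
        frontier = nxt
    return effects

# An increase in Ti raises T, lowers dT = Tset - T,
# and hence calls for less steam Fsteam.
effects = propagate("Ti", +1)
```

The same encoding works unchanged for the bank-dealer SDG by substituting the C_FD, V_FD, V_PB, k_TD arcs; richer SDG algorithms in the cited literature handle loops and ambiguity resolution.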
System initial conditions at this level are values associated with equipment, such as sensor readings or controller parameters. Examining failure modes using the TeCSMART framework provides a systematic way of identifying system initial conditions. Given different system initial conditions, modelers can develop suitable models to describe the system and conduct in-depth risk analysis. Therefore, no matter what modeling methods or risk assessment tools one uses, a HAZOP-like systematic analysis using the TeCSMART framework is feasible for analyzing risks in a sociotechnical system. It enables systematic hazard identification for the risk assessment of a sociotechnical system. The basic functional building block in Figure 2 allows us to model systematically the potential failures, at different levels, of both human and non-human elements. In the Equipment View Layer, let us consider a sensor, for example. Using a commonly used model of its failure modes, we can state that a sensor can fail high, low, or zero (i.e., no response; the sensor is dead). The same holds for an actuator (a valve can fail high, low, or zero) and a controller. A process might have more failure modes depending on its complexity, but usually not in the hundreds, more like a dozen or so. The connections can fail, too, again high, low, zero, or reverse (in the case of flow rate in pipes, for example). One can modify these to make the set of failure modes more sophisticated, if needed, but even this elementary set goes a long way, as we discuss below. We will show below how these failure modes can be generalized to accommodate typical human failures as well at different levels of the hierarchy.

Perspective II: Plant View Layer

The Plant View Layer is a collection of all the equipment and processes organized in a particular configuration (or flowsheet) to manufacture a desired product safely and optimally.
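The elementary failure-mode template described above for the Equipment View Layer (fail high, low, or zero, with connections also able to fail reverse) can be enumerated mechanically to seed a HAZOP-like analysis. The sketch below is illustrative; the element instances and the enumeration routine are our assumptions:

```python
# Elementary failure modes per element type, following the template
# in the text (sensors, actuators, and controllers fail high/low/zero;
# connections may additionally fail reverse).
FAILURE_MODES = {
    "sensor": ["high", "low", "zero"],
    "actuator": ["high", "low", "zero"],
    "controller": ["high", "low", "zero"],
    "connection": ["high", "low", "zero", "reverse"],
}

# Illustrative element list for the stirred tank heater example.
ELEMENTS = [
    ("level gauge", "sensor"),
    ("thermocouple", "sensor"),
    ("outlet flow valve", "actuator"),
    ("steam valve", "actuator"),
    ("level controller", "controller"),
    ("temperature controller", "controller"),
    ("inlet pipe", "connection"),
]

def enumerate_failures(elements):
    """Yield (element, failure mode) pairs, HAZOP-style."""
    for name, etype in elements:
        for mode in FAILURE_MODES[etype]:
            yield (name, mode)

# Six three-mode elements plus one four-mode connection -> 22 scenarios.
scenarios = list(enumerate_failures(ELEMENTS))
```

Replacing the physical elements with their informational counterparts (projections, communication channels, and so forth) yields the analogous enumeration for the higher layers.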
The autonomous agents involved in this layer are managers and supervisors, and the nonautonomous agents are equipment clusters. These clusters are usually grouped as critical process steps or unit operations,39 such as reaction, distillation, etc., which are needed in the manufacture of the desired product. Similarly, in the financial system example, the left figure in Figure 5 is the simplified “flowsheet” of a bank-dealer system. The Plant View agents collect and report metrics regarding aggregate production performance and safety to the Management View and receive, in turn, plant-wide target specifications from the Management View, as noted above. Although this level also operates in real time, Plant View decisions typically have a larger time scale (hours or even days). The goal at this level is to ensure meeting production performance targets (typically, product quantity and quality, cost, and time of delivery) safely and optimally at the overall plant level. These plant-wide targets translate into equipment-specific targets, implemented as set points and constraints, that are communicated to the Equipment View level. Models at this level tend to be DAE models from Perspective I integrated together, reflecting the overall flowsheet organization of the plant. The flowsheet is then simulated to obtain plant-wide process and equipment behavior. One can also formulate such connected models using the SDG models from the lower level to explicitly capture the cause-and-effect relationships, which are then used for applications such as process hazards analysis.17,28,40–44 The input-output information model at this aggregate level is shown in Figure 1. From this level onward, going up to the higher levels, the emphasis shifts from decisions/actions made by individual equipment to those made by personnel, and from real-time sensor data to aggregate information concerning the overall plant performance.
It moves from a data-centric to an information-centric perspective. This is required to reflect the goal of this layer: to make the desired products at the targeted level of quality, quantity, cost, and time of delivery, safely and optimally. That is the charge of the Plant Manager, given to her by the senior management at the next layer above. The seven elements here, therefore, reflect the aggregate nature of information needed and used at this level: (1) input: aggregate, plant-level information on target as well as actual performance metrics, (2) output: schedule, set points, resource allocation, and so forth, (3) sensors: product quality and quantity, resource utilization data, etc., (4) actuator: plant personnel, (5) controller: Plant Manager, (6) “core” process unit: the entire plant, and (7) connection: various communication channels among plant personnel such as the Manager, Supervisors, Engineers, and Operators.

Figure 5. SDG for the bank/dealer example (adapted from Ref. 38, Process Systems Engineering as a Modeling Paradigm for Analyzing Systemic Risk in Financial Networks).

The failure modes associated with the elements at this level are conceptually similar to their counterparts at the lower Equipment Layer. For instance, sensors in this layer are not physical entities like thermocouples, but informational entities that aggregate and transform relevant data into actionable information, such as the projection made about the plant's product output for the current month. This transformation is carried out by a human, such as a process engineer.
The engineer can also “fail” high, low, or zero in the sense that the estimation reported to the Plant Manager can be erroneous along these lines: the projection may be too optimistic (i.e., failing high), too conservative (i.e., failing low), or no projection is made at all (i.e., failing zero). Likewise, communication can also fail along these lines; perhaps the projection was made, but the Manager was not informed. Similarly, in a bank-dealer system, this layer represents the aggregation of investment and funding activities across different asset classes. The three major desks are divided into groups (actuators) to handle portfolios consisting of different assets. Sensors (i.e., analysts monitoring the metrics) in the Equipment View Layer of a bank-dealer system report leverage ratios or collateral collected, while sensors in this layer are risk models of portfolios, which aggregate and transform individual risk factors into a comprehensive picture of the portfolio's risk. We thus see that this template helps us identify systematically where and how things can fail at different levels of the hierarchy. It is important to note that we are not claiming that our framework would capture everything that goes wrong in a complex system. We are only suggesting that such a systematic approach could capture many of the typical failures seen in practice, and we demonstrate this with the aid of three case studies.

Perspective III: Management View Layer

The next level up is the Management View, where the agents involved are the critical decision makers such as the CEO, Senior Vice Presidents, and the Board of Directors. Their goal is to maximize profitability and create value for the shareholders by making sure the company's business performance metrics (including safety) meet the expectations of the Market (which is the next level up).
Influenced by the nature of business and accounting cycles, this layer operates on a time scale of a quarter (i.e., a 3-month period) to a year. As seen in the control-theoretic information model of this level in Figure 6, this group of decision makers (the Management team) sets the overall policies that “control” (i.e., manage) the behavior and outcomes of the corporation, including its autonomous and nonautonomous assets. Autonomous agents at this layer include the managers and supervisors of each division, while the nonautonomous agents are corporate assets. The Market at the next level up sets and demands certain performance targets be met by the company for its survival and growth. These metrics are usually financial at this level, such as ROI, ROE, market share, sales growth, and so forth. These are the set points and constraints given to the Management team. The Management team, in turn, translates these targets into actionable quantitative information, such as production performance metrics and the strategic deployment of resources at different plants (the corporation might have several plants distributed all over the world), as well as more qualitative ones that define the company culture, including the safety culture. They also set the incentive policy to encourage better performance from the employees. These are communicated to the Plant View Layer as its set points and constraints. The Management team decides on these targets by taking into account all relevant information concerned with the survival, profitability, and growth of the company in a competitive and regulatory environment. Thus, the information flow is not only from the company's internal sources but also from the environment, which comprises the two levels immediately above.
Differing from the control policies at the lower levels, which mainly focus on controlling equipment (i.e., nonautonomous agents), the policies from this layer onward, at the higher levels, focus more on achieving the desired behavior and outcomes from autonomous agents (i.e., humans). As a result, while the lower-level control policies can be based on precise models of process/equipment (as captured by DAE models), the higher-level policies will necessarily have to deal with imperfect models of human behavior, which cannot be reduced to a set of equations. Consider, for instance, the difficulties involved in “modeling” the culture of a corporation. At best, we might be able to identify certain key features or characteristics that define a corporation's culture. From this level onward, we have to rely more on graph-theoretic, game-theoretic, and agent-based modeling frameworks. Thus, from this level onward, modeling becomes trickier, and the notion of “control” of agents transitions to the “management” of agents. Moreover, the importance of a TeCSMART failure-modes-based examination becomes more obvious. Such a systematic risk analysis of human decision-making would help improve safety-related management activities, among other things. The Management team acts as a “controller” to monitor the various performance metrics (e.g., sales, expenses, revenue, profits, ROI, ROE, etc.), compare them with the set points, and take appropriate actions by manipulating the relevant variables (e.g., cost cutting, acquisition, etc.) in order to meet the set point targets. The Management level deals with the big picture and general strategy for the corporation as a whole. These get translated into more detailed prescriptions and recommendations as they are communicated from this layer to the lower layers. The failure of the elements in Figure 6 can be modeled along the lines of the Equipment View and Plant View Layers.
For example, the Performance Monitoring task (i.e., the “sensor”) may fail because of errors in the measurements or estimations (e.g., fail high, low, or zero), or the results may be communicated erroneously (or not communicated at all). One can methodically identify similar failure modes for the other elements, including the connections (which are the communication channels).

Figure 6. Control-theoretic model of management layer.

Perspective IV: Market View Layer

Similar to the Plant View, the Market View is a collection of companies that compete, in the appropriate product/service categories, for economic survival, profitability, and growth in a free market environment. The agents at this level are mainly the customers and corporations. Market is a well-studied concept in economics. It usually refers to the exchange activities that many parties engage in. In this article, we will not discuss the economic aspect of the Market, but interpret the Market as a collection of companies and their activities. Market activities such as cooperation and competition can be explained using the input-output model structure and intra-layer feedback loops. From this layer and above, activities mainly involve autonomous agents such as humans and human organizations. The information generated at this level (e.g., stability of individual companies and the market, fairness practices, etc.) is communicated to the Regulatory View, which in turn sends back regulatory requirements and enforcement actions. While the market dynamics is in real time, as with the Plant View, the relevant time scale is of the order of months.

Perspective V: Regulatory View Layer

As noted, regulatory agencies oversee the market and control market behavior through the enforcement of regulatory policies (Figure 7). The primary goal at this level is to ensure the security, stability, and wellbeing of the society where these companies operate.
This means, of course, the security and wellbeing of the citizens and their environment. It also means ensuring that the free market, where these companies compete, is stable, efficient, and fair. The autonomous agents are regulatory agencies, such as the Occupational Safety and Health Administration (OSHA), Environmental Protection Agency (EPA), Securities and Exchange Commission (SEC), Federal Reserve (Fed), Federal Energy Regulatory Commission (FERC), Minerals Management Service (MMS), Food and Drug Administration (FDA), and so on, and the appropriate executives from the companies. These agencies receive from the agents in the Government View, namely lawmakers and their staff, regulations which they enforce on the market participants. They also monitor the market and companies, collect information, and report the effects of regulations to the agents in the Government View for potential improvements. This feedback control loop acts on a time scale of years. One typical example of this view is the activity of the SEC, which regulates the securities industry. As shown in Figure 8, the SEC receives laws and regulatory directives from the agents in the Government View, such as the President, the Congress, and the Federal Reserve Board. Through its five Divisions and 23 Offices, the SEC enforces federal securities laws, issues new rules, and oversees securities-related activities. For instance, the SEC regularly monitors the market for unusual trading patterns that might reveal illegal acts such as insider trading, and takes corrective actions, playing its role as a “controller” here, to ensure fairness in the securities markets.
While the SEC should be praised for its post-financial-crisis actions in successfully going after various Wall Street entities for their misconduct, various failures of the SEC before and during the crisis contributed to it, as Judge Rakoff argues persuasively.45 Many of these failures are failures of the elements in Figure 7 that can be modeled using our template of failure modes. In a similar manner, many of the failures at the Minerals Management Service46 that contributed to the BP Oil Spill disaster can be modeled using our approach. While we do not get into all the details, as that would make our article too long, we do provide a summary of these failures in a series of tables that compare regulatory failures in three different domains later in the article.

Perspective VI: Government View Layer

The Government View, like the Plant and Market Views, is a collection of various agencies organized to govern a society of autonomous and nonautonomous agents (e.g., physical assets). The objectives here are security, stability, and the overall wellbeing of the agents and their environment against a variety of risks and threats. Depending on the societal preference for capitalism, communism, socialism, monarchy, or dictatorship, the institutions and their structure can be widely different. The objective of our article is not to discuss these in any detail (there are vast resources on this subject in sociology and political science) but only to show how our control-theoretic framework accommodates the structures and functions at this level in a uniform and consistent manner, which is helpful for a system-theoretic analysis of system-wide risks and threats. In the context of the U.S., this structure is the three branches of government (executive, legislative, and judicial) with the associated agencies they supervise. The agents are the members of these branches.
The time scale is typically four years, the presidential election cycle, but institutional memory in congress and judiciary can prolong this to decades. That is, it can take that long to make significant changes in governance. Perspective VII: Societal View Layer Finally, we arrive at the top most level in this modeling hierarchy. The primary agents (autonomous) are the citizens and elected officials in a democracy such as the U.S. It is, of course, very different for other political structures, as noted. Again, while the presidential election cycle imposes a certain Figure 7. Control-theoretic model of regulatory layer. Figure 8. Control-theoretic model of Securities and Exchange Commission. AIChE Journal 2016 Vol. 00, No. 00 Published on behalf of the AIChE DOI 10.1002/aic 9 natural characteristic time, institutional memories can prolong this to decades. The societal “set points” are the preferences of its citizenry, which can vary over time, typically, of the order of decades or generations. In an ideal democracy, the citizens get to decide what kind of society or country they all would like to live in. The overall goals of the citizens in the U.S., as Table 1. Failure Taxonomy Part I Class Definition Examples2,12,56 1. Monitoring Failures Failure to monitor the key parameters effectively or having significant errors in the monitored data 1.1 Fail to monitor Failure to monitor key performance indicators (“failing zero”) In BP Texas City Refinery Explosion, numerous measures for tracking various types of operational, environmental and safety performance, but no clear focus on the leading indicators for the potential catastrophic or major incidents. In Northeast Blackout, MISO did not discover that Harding-Chamberline had tripped until after the blackout, when MISO reviewed the breaker operation log that evening. In Subprime Crisis, Moody’s did not sufficiently account for the deterioration in underwriting standards or a dramatic decline in home prices. 
And Moody’s did not even develop a model specifically to take into account the layered risks of subprime securities until late 2006, after it had already rated nearly 19,000 subprime securities. 1.2 Failure to monitor effectively Failure to detect/report problems in a timely manner In Northeast Blackout, the Cleveland-Akron areas voltage problems were well-known and reflected in the stringent voltage criteria used by control area operators until 1998. BP Texas City did not effectively assess changes involving people, policies, or the organization that could impact process safety. 1.3 Significant errors in monitoring Monitored data are significantly inaccurate. It is either overreporting (“failing high”) or under-reporting (“failing low”) the actual trend In BP Texas City Refinery Explosion, a lack of supervisory oversight and technically trained personnel during the startup, an especially hazardous period, was an omission contrary to BP safety guidelines. An extra board operator was not assigned to assist, despite a staffing assessment that recommended an additional board operator for all ISOM startups. In Northeast Blackout, from 15:05 EDT to 15:41 EDT, during which MISO did not recognize the consequences of the Hanna-Juniper loss, and FE operators knew neither of the lines loss nor its consequences. PJM and AEP recognized the overload on Star-South Canton, but had not expected it because their earlier contingency analysis did not examine enough lines within the FE system to foresee this result of the Hanna-Juniper contingency on top of the Harding-Chamberlin outage. 2. Decision Making Failures Failure to provide the correct decisions in a timely manner 2.1 Model failures Decisions are not supported by the local system (i.e., “plantmodel mismatch”) In Subprime Crisis, financial institutions and credit rating agencies embraced mathematical models as reliable predictors of risks, replacing judgment in too many instances. 
In Northeast Blackout, one of MISOs primary system condition evaluation tools, its state estimator, was unable to assess system conditions for most of the period between 12:15 and 15:34 EDT, due to a combination of human error and the effect of the loss of DPLs StuartAtlanta line on other MISO lines as reflected in the state estimators calculations. 2.2 Inadequate or incorrect local decisions Decisions made are unfavorable to the local system under supervision In BP Texas City Refinery Explosion, the process unit was started despite previously reported malfunctions of the tower level indicator, level sight glass, and a pressure control valve. In Subprime Crisis, financial institutions’ inadequate decisions of using excessive leverage and complex financial instruments. In Northeast Blackout, FE uses minimum acceptable normal voltages which are lower than and incompatible with those used by its interconnected neighbors. 2.3 Inadequate or incorrect global decisions Decisions made are unfavorable for the global system, but could be locally right In Subprime Crisis, the banks had gained their own securitization skills and did not need the investment banks to structure and distribute. So the investment banks moved into mortgage origination to guarantee a supply of loans they could securitize and sell to the growing legions of investors. But they are lack of global views of the entire market. In Northeast Blackout, many generators had pre-designed protection points that shut the unit down early in the cascade, so there were fewer units on-line to prevent island formation or to maintain balance between load and supply within each island after it formed. In particular, it appears that some generators tripped to protect the units from conditions that did not justify their protection, and many others were set to trip in ways that were not coordinated with the regions underfrequency load-shedding, rendering that UFLS scheme less effective. 
10 DOI 10.1002/aic Published on behalf of the AIChE 2016 Vol. 00, No. 00 AIChE Journal expressed in the Declaration of Independence document, are Life, Liberty and the Pursuit of Happiness.47 Given these goals, in every election, the citizens get to vote on a number of issues related to economy, environment, education, health, security, privacy, race relations, etc. This is the top most layer of the model. In its feedback loop, there are citizens, elected government officials and regulators involved. In the Government View Layer, the three branches of the U. S. government act as the “controller” of a collection of regulatory agencies and the country. In the Societal View Layer, citizens oversee and influence the society through elections. It usually takes decades for a society to adapt and evolve in any significant fashion. The societal set point is related to the history and culture of a nation. In all systemic failures, such as the ones mentioned above, we all play a role, through the Societal View Layer, and are accountable for some of the blame, as it was our collective decision to elect (in the case of U.S.) a particular party, and its political and regulatory views, to govern us. This accountability is a direct consequence of our responsibility. Consider, for example, the responsibility of a CEO of a large petrochemical company with many plant sites and tens of thousands of employees. The CEO may not know everything about what goes on in all her plant sites, on a daily basis, but when a disaster strikes she and her c-suite executives are held accountable. Time and again, in all the official inquiries of major disasters, whether it was Bhopal, Piper Alpha, BP Oil Spill, Global Financial Crisis, Northeast Power Blackout, and so on, the management was held responsible and accountable for Table 2. 
Failure Taxonomy Part II Class Definition Examples2,12,56 2.4 Resource Failures Failure to acquire, allocate and manage the required resources properly to complete the tasks safely and achieve the goal(s) 2.4.1 Lack of resources Failure to acquire the necessary resources, such as funds, man power, time, etc. In BP Texas City Refinery Explosion, BP has not always ensured that it identified and provided the resources required for strong process safety performance at its U.S. refineries, including both financial and human resources. In Subprime Crisis, in an interview with the FCIC, Greenspan went further, arguing that with or without a mandate, the Fed lacked sufficient resources to examine the nonbank subsidiaries. Worse, the former chairman said, inadequate regulation sends a misleading message to the firms and the market. But if resources were the issue, the Fed chairman could have argued for more. It was always mindful, however, that it could be subject to a government audit of its finances. In Northeast Blackout, there is no UVLS system in place within Cleveland and Akron; had such a scheme been implemented before August, 2003, shedding 1,500 MW of load in that area before the loss of the Sammis-Star line might have prevented the cascade and blackout. 2.4.2 Inadequate allocation of resources Resources are deployed incorrectly. E.g., over-staffing (“failing high”) in some areas while under-staffing (“failing low”) elsewhere In BP Texas City Refinery Explosion, the incident at Texas City and its connection to serious process safety deficiencies at the refinery emphasize the need for OSHA to refocus resources on preventing catastrophic accidents through greater PSM enforcement. In Northeast Blackout, on August 14, the lack of adequate dynamic reactive reserves, coupled with not knowing the critical voltages and maximum import capability to serve native load, left the Cleveland-Akron area in a very vulnerable state. 
2.4.3 Training failures Failures related to the lack of organized activity(ies) aimed at helping employees attain a required level of knowledge and skill needed in their current job. This includes emergency response training In BP Texas City Refinery Explosion, BP has not adequately ensured that its U.S. refinery personnel and contractors have sufficient process safety knowledge and competence. In Subprime Crisis, in theory, borrowers are the first defense against abusive lending. But many borrowers do not understand the most basic aspects of their mortgage. Borrowers with less access to credit are particularly ill equipped to challenge the more experienced person across the desk. In Northeast Blackout, the FE operators did not recognize the information they were receiving as clear indications of an emerging system emergency. 2.5 Conflict of interest Incorrect decisions reached due to a conflict of interest arising from competing goals that can affect proper judgment and execution of tasks. E.g., safety vs financial gain, ethical failures such as corruption In BP Texas City Refinery Explosion, cost-cutting, failure to invest and production pressures from BP Group executive managers impaired process safety performance at Texas City. In Subprime Crisis, many Moody’s former employees said that after the public listing, the company [Moody’s] culture changed; it went from [a culture] resembling a university academic department to one which values revenues at all costs, according to Eric Kolchinsky, a former managing director. In Northeast Blackout, these protections should be set tight enough to protect the unit from the grid, but also wide enough to assure that the unit remains connected to the grid as long as possible. This coordination is a risk management issue that must balance the needs of the grid and customers relative to the needs of the individual assets. AIChE Journal 2016 Vol. 00, No. 
00 Published on behalf of the AIChE DOI 10.1002/aic 11 their companies failures. In fact, in a historic first, establishing an encouraging precedent, recently in April 2016, former Massey Energy CEO was sentenced to twelve months in prison as a result of the mining company’s disaster.48,49 Thus, the people in charge have to be held accountable for part of the blame. In a democratic society, the people in charge are, ultimately, us, the citizens who elected the government. Therefore, we are responsible, in some part, for the failures resulting from its policies. We are thus responsible for Bhopal, BP Oil Spill, Subprime Crisis, and so on. This is why it is Table 3. Failure Taxonomy Part III Class Definition Examples2,12,56 3. Action Failures Actions carried out incorrectly or inadequately 3.1 Flawed actions including supervision Failure to perform the right actions, or performing no action, or performing the wrong actions. Failure to follow standard operating procedures In BP Texas City Refinery Explosion, numerous heat exchanger tube thickness measurements were not taken. Some pressure vessels, storage tanks, piping, relief valves, rotating equipment, and instruments were overdue for inspection in six operating units evaluated. In Subprime Crisis, struggling to remain dominant, Fannie and Freddie loosened their underwriting standards, purchasing and guaranteeing riskier loans, and increasing their securities purchases. Yet their regulator, the Office of Federal Housing Enterprise Oversight (OFHEO), focused more on accounting and other operational issues than on Fannies and Freddies increasing investments in risky mortgages and securities. In Northeast Blackout, numerous control areas in the Eastern Interconnection, including FE, were not correctly tagging dynamic schedules, resulting in large mismatches between actual, scheduled, and tagged interchange on August 14. 
3.2 Late response. Failure to take the right actions at the right time. In the BP Texas City Refinery Explosion, neither Amoco nor BP replaced the blowdown drums and atmospheric stacks, even though a series of incidents warned that this equipment was unsafe; in the years prior to the incident, eight serious releases of flammable material from the ISOM blowdown stack had occurred, and most ISOM startups experienced high liquid levels in the splitter tower, yet neither Amoco nor BP investigated these events. In the Subprime Crisis, declining underwriting standards and new mortgage products had been on regulators' radar screens in the years before the crisis, but disagreements among the agencies and their traditional preference for minimal interference delayed action. In the Northeast Blackout, the alarm processing application had failed on occasions prior to August 14, leading to loss of the alarming of system conditions and events for FE's operators; however, FE said that the mode and behavior of this particular failure event were first-time occurrences, ones which, at the time, FE's IT personnel neither recognized nor knew how to correct.

4. Communication Failures. Failures associated with the system of pathways (informal or formal) through which messages flow to different levels and different people in the organization.

4.1 Communication failure with external entities. Failures of communication between an individual and/or a group/organization and an external individual and/or organization. In the BP Texas City Refinery Explosion, BP and Amoco did not cooperate well to investigate previous incidents and replace the blowdown drum. In the Subprime Crisis, the leverage was often hidden: lenders rarely discussed the leverage and the associated high risk with their investors, and investors relied on the credit rating agencies, often blindly. In the Northeast Blackout, the Stuart-Atlanta 345-kV line, operated by DPL and monitored by the PJM reliability coordinator, tripped at 14:02 EDT.
However, since the line was not in MISO's footprint, MISO operators did not monitor the status of this line and did not know it had gone out of service. This led to a data mismatch that prevented MISO's state estimator (a key monitoring tool) from producing usable results later in the day, at a time when system conditions in FE's control area were deteriorating.

4.2 Peer-to-peer communication failure. Failures of communication between an individual and another individual within a group and/or organization. In the BP Texas City Refinery Explosion, the night lead operator left early, and very limited information about his control actions was given to the day board operator. In the Northeast Blackout, FE computer support staff did not effectively communicate the loss of alarm functionality to the FE system operators after the alarm processor failed at 14:14, nor did they have a formal procedure for doing so.

4.3 Inter-level communication failure. Failures of communication between an individual and another individual at a higher or lower level of authority within the same group and/or organization. In the BP Texas City Refinery Explosion, supervisors and operators poorly communicated critical information regarding the startup during the shift turnover. In the Northeast Blackout, ECAR and MISO did not precisely define critical facilities, such that the 345-kV lines in FE that caused a major cascading failure would have had to be identified as critical facilities for MISO; MISO's procedure in effect on August 14 was to request FE to identify the critical facilities on its system to MISO.

vitally important for the citizens to stay informed, engaged, and active in the political process. This is particularly important to remember as we begin to address the mother of all systemic failures, the Climate Change Crisis, which has been in the works for decades.
TeCSMART: Comparative Analysis of Three Major Disasters

Failure analysis and comparison

In this section, we discuss the results of applying the TeCSMART framework to three prominent systemic failures, namely, the BP Texas City Refinery Explosion (2005), the Global Financial Crisis (2008-09), and the Northeast Power Blackout (2003). We in fact studied the following thirteen systemic failures: (1) the Bhopal Disaster (1984), (2) the Space Shuttle Challenger Disaster (1986), (3) the Piper Alpha Disaster (1988), (4) the SARS Outbreak (2002-03), (5) the Space Shuttle Columbia Disaster (2003), (6) the Northeast Power Blackout (2003), (7) the BP Texas City Refinery Explosion (2005), (8) the Global Financial Crisis (2008-09), (9) the BP Deepwater Horizon Oil Spill (2010), (10) the Upper Big Branch Mine Disaster (2010), (11) the Chilean Mining Accident (2010), (12) the Fukushima Daiichi Nuclear Disaster (2011), and (13) the India Blackouts (2012), by carefully reviewing the official postmortem reports of these disasters as well as other relevant sources. However, for the sake of brevity, we present the comparative analysis of only these three disasters; the other cases show similar failure patterns. We analyzed and classified over 700 failures mentioned in these reports.1,2,50-60 We categorize these failures into five primary classes and 19 subclasses that are consistent with the typical failure modes discussed in the previous section. The five classes are as follows: (1) Monitoring Failures; (2) Decision Making Failures; (3) Action Failures; (4) Communication Failures; and (5) Structural Failures. Each class has subclasses that define more detailed failures; subclass details are listed in Tables 1-4. The five-class failure taxonomy reveals what can potentially go wrong in a complex sociotechnical system. It summarizes the failure modes modeled using the TeCSMART framework. Different failure modes give rise to systemic failures in different domains.

Table 4.
Failure Taxonomy Part IV. Class; Definition; Examples.2,12,56

5. Structural Failures. Deficient structures and/or models.

5.1 Design failures. Defects or deficiencies in the design of the system/component/model, or simply the wrong design of the system/component/model. In the BP Texas City Refinery Explosion, occupied trailers were sited too close to a process unit handling highly hazardous materials; all fatalities occurred in or around the trailers. In the Subprime Crisis, where were Citigroup's regulators while the company piled up tens of billions of dollars of risk in the CDO business? Citigroup had a complex corporate structure and, as a result, faced an array of supervisors. The Federal Reserve supervised the holding company but, as the Gramm-Leach-Bliley legislation directed, relied on others to monitor the most important subsidiaries: the Office of the Comptroller of the Currency (OCC) supervised the largest bank subsidiary, Citibank, and the SEC supervised the securities firm, Citigroup Global Markets. Moreover, Citigroup did not really align its various businesses with the legal entities; an individual working on the CDO desk on an intricate transaction could interact with various components of the firm in complicated ways. In the Northeast Blackout, although MISO received SCADA input of the line's status change, this was presented to MISO operators as breaker status changes rather than a line failure. Because their EMS system topology processor had not yet been linked to recognize line failures, it did not connect the breaker information to the loss of a transmission line. Thus, MISO's operators did not recognize the Harding-Chamberlin trip as a significant contingency event and could not advise FE regarding the event or its consequences. Further, without its state estimator and the associated contingency analyses, MISO was unable to identify the potential overloads that would occur due to various line or equipment outages.
5.2 Maintenance failures. Failure to adequately repair and maintain equipment at all times. In the BP Texas City Refinery Explosion, deficiencies in BP's mechanical integrity program resulted in the run-to-failure of process equipment at Texas City. In the Northeast Blackout, FE had no periodic diagnostics to evaluate and report the state of the alarm processor, and nothing about the eventual failure of two EMS servers would have directly alerted the support staff that the alarms had failed in an infinite-loop lockup.

5.3 Operating procedure failures. Failure to develop and execute standard operating procedures for all tasks. In the BP Texas City Refinery Explosion, outdated and ineffective procedures did not address recurring operational problems during startup, leading operators to believe that procedures could be altered or did not have to be followed during the startup process. In the Subprime Crisis, in addition to rising fraud and egregious lending practices, lending standards deteriorated in the final years of the bubble. In the Northeast Blackout, the PJM and MISO reliability coordinators lacked an effective procedure on when and how to coordinate an operating limit violation observed by one of them in the other's area; the lack of such a procedure caused ineffective communications between PJM and MISO regarding PJM's awareness of a possible overload on the Sammis-Star line as early as 15:48.

However, there are common failure modes shared by many, if not all, of the systemic failures. Such common failure pathways help us identify, proactively, how things can potentially go wrong in a complex system. By studying these common failure mechanisms, the designers and operators of new systems can become more vigilant. Thus, the common patterns identified by our comparative analysis are helpful not only diagnostically but also prognostically.
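The five-class taxonomy lends itself to a simple lookup structure. As a minimal illustrative sketch (our own Python rendering, not code from the paper; the labels follow Tables 1-4, but only the subclasses quoted in this section are listed, and the placement of 2.4.3 under class 2 is inferred from its numbering):

```python
# Minimal sketch of the five-class failure taxonomy (Tables 1-4).
# Only the subclasses quoted in this section are listed; the full
# taxonomy in the paper has 19 subclasses across the five classes.
FAILURE_TAXONOMY = {
    "1. Monitoring Failures": [],
    "2. Decision Making Failures": [
        "2.4.3 Training failures",
        "2.5 Conflict of interest",
    ],
    "3. Action Failures": [
        "3.1 Flawed actions including supervision",
        "3.2 Late response",
    ],
    "4. Communication Failures": [
        "4.1 Communication failure with external entities",
        "4.2 Peer-to-peer communication failure",
        "4.3 Inter-level communication failure",
    ],
    "5. Structural Failures": [
        "5.1 Design failures",
        "5.2 Maintenance failures",
        "5.3 Operating procedure failures",
    ],
}

def subclass_to_class(subclass_id: str) -> str:
    """Return the primary class for a subclass id such as '3.2'."""
    for cls, subclasses in FAILURE_TAXONOMY.items():
        if any(s.startswith(subclass_id) for s in subclasses):
            return cls
    raise KeyError(subclass_id)
```

With such a structure, each failure extracted from a postmortem report can be tagged with a subclass id and rolled up to its primary class for the cross-domain comparison described below.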
The comparative analysis of the three case studies is performed in the following three steps. (1) Carefully review the official postmortem reports and classify the failures into the classes/subclasses listed in Tables 1-4. For example, the level control valve was accidentally turned off by an operator in the BP Texas City Refinery; this failure is classified as a flawed action (3.1 in Table 3). Overgrown trees are a known problem for all power grid operators, but FirstEnergy (FE) failed to trim them, which led to line trips; the inadequate tree trimming is classified as a late response failure (3.2 in Table 3). (2) Once the failures are properly classified, organize them in the TeCSMART framework according to the relevant agents and the failure mechanisms. The relevant agents indicate the level of the failure in the TeCSMART framework, and the failure mechanisms indicate which control component the failure is associated with. One layer can have multiple failures, and one failure can appear multiple times at different levels. Thus, the level control valve failure is a flawed action of an actuator at the Process View, and the inadequate tree trimming is a late response of an actuator at the Plant View. (3) Compare failures across domains to identify common patterns.

Case Studies

In this section, we briefly introduce the three prominent systemic failures, the Northeast Blackout (2003), the BP Texas City Refinery Explosion (2005), and the Subprime Crisis (2008), and compare their failures by applying the TeCSMART framework. The comparison shows the similarities and differences of the three systemic failures. Moreover, the common patterns indicate important failure modes, which can help improve system design, control, and risk management.

Overview

The Northeast Blackout, which happened on August 14, 2003, was the largest blackout in the history of the North American power grid.
With generating units tripping and transmission lines disconnecting from around noon onward, the cascading sequence was essentially complete by about 4:13 p.m. A shutdown cascade triggered the blackout: a supply/demand mismatch and poor vegetation management caused power surges in the transmission lines, FE's operators missed the warning signs and communicated poorly with other line operators, and, finally, the power surges spread and the blackout emerged.56

The BP Texas City refinery was the third-largest refinery in the United States, employing approximately 1800 BP workers. On March 23, 2005, the refinery initiated the startup of the ISOM raffinate splitter section. During the startup, the level control valve was accidentally turned off by an operator and the tower was filled with flammable liquid for over 3 h. The pressure relief valves were activated by the high pressure in the tower and discharged liquid to the blowdown drum. The blowdown drum overfilled and the stack vented flammable liquid to the atmosphere, forming a vapor cloud. When the flammable vapor cloud reached an idling diesel pickup truck, an explosion occurred. The explosion and fires killed 15 people, injured 180 others, and resulted in financial losses exceeding $1.5 billion.12

Figure 9. Cross-domain comparison table.

In the summer of 2007, leading banks in the U.S. started to fail as a result of falling real estate prices.
Bear Stearns, the fifth-largest investment bank, whose stock had traded at $172 a share as late as January 2007, was sold to JP Morgan Chase at a fire-sale price of $2 a share on March 16, 2008; Lehman Brothers, the fourth largest, went bankrupt; Fannie Mae and Freddie Mac were taken over by the government; and American International Group (AIG), the insurance giant, was bailed out by taxpayers.61 Over half a million families lost their homes to foreclosure. Nearly $11 trillion in household wealth vanished. Between January 2007 and March 2009, the stock market lost half its value.62 The final cost to the U.S. economy as a result of the biggest financial crisis since the Great Depression was about $22 trillion! To get a sense of its magnitude, compare it with the U.S. GDP in 2014, which was $17.4 trillion.

TeCSMART Comparison

A cross-domain comparison, shown in Figure 9, was conducted by analyzing and comparing the failures of these three prominent systemic failures. Figure 9 is a table whose rows are TeCSMART views and failure classes and whose columns are the three systemic failures. Table 5 lists the agents of the three systemic failures. As discussed before, we classify the failure evidence found in the postmortem investigation reports into the different failure classes, related to specific control components at the appropriate levels. We then mark each failure class as a colored cell in the table, with a color code: blue represents the BP Texas City Refinery Explosion, yellow the Subprime Crisis, and brown the Northeast Blackout. If all three colors appear in the same row, that particular failure occurred in all three cases. Therefore, by comparing the colored cells, we are able to study the failure mechanisms and their similarities and differences. Figure 10 highlights the failure classes identified in the comparison table (Figure 9). Failures were found at every level in all three cases. Operational failures are more common at low levels; controller failures dominate at high levels.
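The cross-domain comparison table of Figure 9 can be thought of as a boolean matrix: one row per (view, failure subclass) pair, one column per case, with a cell marked when the postmortem report documents that failure. A minimal sketch (our own illustration, not the paper's code; the three example entries are taken from the discussion in the text, not the full ~700-failure dataset):

```python
# Sketch of the Figure 9 comparison table as a (view, subclass) -> cases map.
from collections import defaultdict

CASES = ("BP Texas City", "Subprime Crisis", "Northeast Blackout")

# (view, failure subclass) -> set of cases in which it was documented
comparison = defaultdict(set)

def record(view: str, subclass: str, case: str) -> None:
    """Mark one documented failure in the comparison table."""
    assert case in CASES
    comparison[(view, subclass)].add(case)

# Illustrative entries from the text:
record("Process View", "3.1 Flawed actions", "BP Texas City")      # level valve shut
record("Plant View", "3.2 Late response", "Northeast Blackout")    # tree trimming
record("Regulatory View", "3.1 Flawed actions", "Subprime Crisis") # delayed rules

def common_failures():
    """Failure modes marked in all three cases (a fully colored row)."""
    return [key for key, cases in comparison.items() if len(cases) == len(CASES)]
```

A fully populated matrix of this kind directly yields the "same row, three colors" patterns that the text uses to identify failure modes shared by all three disasters.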
Among the many important observations and insights from the comparison, we highlight a few and discuss them in depth. The comparison shows that lack of appropriate training was a widespread problem. In Figure 9, training failures appear in the bottom three views of all three cases. The evidence shows that operators, and even managers, had not received appropriate and sufficient training prior to the accidents. The operator training program was inadequate at the BP Texas City Refinery: the training department staff had been reduced from 28 to 8, and there were no simulators for operators to practice handling abnormal events.12 The training failure at BP is confirmed by the logic tree created by the Chemical Safety and Hazard Investigation Board (CSB), highlighted in Figure 11a. Similar things happened in the Northeast Blackout. FE operators were poorly trained to recognize emergency information: they received signals indicating line trips, but made poor decisions by relying solely on the Energy Management System (EMS). Unfortunately, the EMS failed at that time. FE engineers' poor judgment and lack of training played a significant role in the failure; their lack of training was also highlighted by ThinkReliability in their causal map, depicted in Figure 12. Such a pattern was also seen in the financial system failure.2,64

Decision-makers are "controllers" in the TeCSMART framework. In all three cases, almost every layer showed decision making failures. For example, the decision to start up the ISOM unit despite previously reported malfunctions of the raffinate tower level indicator, pressure control valve, and level sight glass was a serious failure, which directly triggered the overall disaster.12 Moreover, BP's cost-cutting decisions that led to the layoff of experienced workers from Amoco contributed to the accident as well.1 These failures are highlighted by the CSB in Figures 11b, c.
In the Subprime Crisis, fund managers' decision to invest in subprime securities without fully understanding the embedded risks was a leading cause of the financial system collapse.2 FE's decision to use minimum acceptable normal voltages (highlighted in Figure 12) that were lower than, and incompatible with, those of its neighbors directly caused power surges and transmission line sag.56 At the management level, as demonstrated by both our comparison study and the CSB analysis (Figures 11a, c), a critical failure was BP's not providing enough resources for strong process safety performance in its U.S. refineries.12 At the same level, the CEOs of financial institutions decided to maintain large quantities of subprime-related assets at very high leverage; the high leverage magnified the scale of the crisis dramatically. Moreover, a locally favorable decision can sometimes bring undesired consequences to the system as a whole. In the North American power grid, protection settings that protect individual operators will not necessarily work for the whole system: when individual operators dropped out of the grid, the burden fell entirely on the rest of the system, which finally had no option but to fail systemically.56

Table 5. Agents of Each View (View: BP Texas City Refinery Explosion; Subprime Crisis; Northeast Blackout)

Societal View: U.S. citizens; citizens worldwide; U.S. and Canada citizens.
Government View: employees of different branches of government; employees of U.S. and foreign governments; employees of U.S. and Canada governments.
Regulatory View: employees of OSHA; employees of the FED, SEC, FDIC, OCC, OTC; employees of NERC and FERC of the U.S., and employees of the NEB of Canada.
Market View: companies in the oil & gas refining industry; institutions in the financial industry; the MAAC-ECAR-NPCC power grid.
Management View: BP senior management; senior management of financial institutions & credit rating agencies; senior management of FE, AEP, MISO, PJM.
Plant View: BP Texas City refinery management; dealers, investors, and managers of financial products; Eastlake 5 generation, Harding-Chamberlin line.
Equipment View: engineers and operators, equipment; borrowers, lenders, brokers, subprime loans; engineers and operators, equipment.

Monitoring problems often play a major role in sociotechnical disasters. Monitoring failures were observed at the management level in all three cases. As discussed in the last section and in Table 1, a sensor or a monitoring task can fail low, fail high, fail zero, or fail to detect in time. BP was not aware of the hazards at the Texas City Refinery because it failed to incorporate the lessons of previous incidents; even worse, some incident investigations were missing1 ("failing zero"). BP's monitoring failure is specifically noted by the CSB in Figure 11d. Similarly, prior to the Subprime Crisis, Moody's did not account for the deterioration in underwriting standards and was not aware of the plummeting home prices; Moody's did not develop a model specifically to examine the layered risks of subprime securities, even after it had rated nearly 19,000 such securities2 ("failing zero"). Deregulation and self-policing by financial institutions had stripped away key safeguards2 ("failing low"). Moreover, in the Northeast Blackout, the Midcontinent Independent System Operator (MISO) failed to recognize the consequences of the Hanna-Juniper line loss; other operators recognized the overload but had not expected it, because the earlier contingency analysis did not examine enough lines to foresee the Hanna-Juniper contingency. The failure to recognize the line loss in a timely manner worsened the situation.

Figure 10. Failure modes in the comparison table.
When the operators finally figured out the situation, it was too late to respond56 ("failing to detect in time"). MISO's monitoring failure was not only highlighted by ThinkReliability (in Figure 12) as a lack of warning, but also raised concerns in the U.S.-Canada Power System Outage Task Force. The Task Force report56 recommends that FERC not approve the operation of a new Regional Transmission Operator (RTO) or Independent System Operator (ISO) until the applicant has met the minimum functional requirements for reliability coordinators. This recommendation directly addresses MISO's failure, as a reliability coordinator, to recognize line losses in its region.

Beyond decision making and monitoring failures, the flawed actions of regulators and their limited oversight frequently contribute to sociotechnical system collapses. The reports1,12 mention that OSHA did not conduct a comprehensive inspection of any of the 29 process units at the Texas City Refinery. Despite knowing about the high leverage and the vast sums of subprime loans, the FED did not begin routinely examining subprime subsidiaries until a pilot program in July 2007, and did not even issue new rules until July 2008, a year after the subprime market had shut down.2 The North American Electric Reliability Corporation (NERC), the power grid self-regulator, knowing FE's potential risk, did not enforce any changes or regulate FE's activities.56 All these flawed actions contributed to the disasters. Regulators also experience conflicts of interest, especially financial regulators, who face pressure from powerful financial institutions.

These observations are just a few examples of what we studied in the TeCSMART comparison. Compared with the logic tree and the causal map, the TeCSMART comparison is able to capture high-level failures, such as regulatory failures, that are covered by neither.
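The four monitoring failure modes discussed above (failing low, failing high, failing zero, and failing to detect in time) can be made concrete with a small classifier. The following is an illustrative sketch only, under our own assumptions: the tolerance, delay threshold, and function names are hypothetical, not from the paper, and a real sensor diagnosis would work on measurement streams rather than single readings:

```python
# Sketch of the four monitoring/sensor failure modes from Table 1,
# applied to a single measurement. Thresholds are illustrative assumptions.
from enum import Enum
from typing import Optional

class SensorFailure(Enum):
    FAIL_LOW = "reads persistently below the true value"
    FAIL_HIGH = "reads persistently above the true value"
    FAIL_ZERO = "produces no reading at all"
    DETECT_LATE = "detects the true condition only after a delay"

def diagnose(true_value: float,
             reading: Optional[float],
             detection_delay_s: float,
             max_delay_s: float = 60.0,
             tolerance: float = 0.05) -> Optional[SensorFailure]:
    """Classify one measurement against the four failure modes."""
    if reading is None:                               # no signal at all
        return SensorFailure.FAIL_ZERO
    if detection_delay_s > max_delay_s:               # signal arrived too late
        return SensorFailure.DETECT_LATE
    if reading < true_value * (1 - tolerance):        # biased low
        return SensorFailure.FAIL_LOW
    if reading > true_value * (1 + tolerance):        # biased high
        return SensorFailure.FAIL_HIGH
    return None                                       # within tolerance, on time
```

In the text's terms, Moody's missing risk model corresponds to FAIL_ZERO, stripped-away regulatory safeguards to FAIL_LOW, and MISO's delayed recognition of the line loss to DETECT_LATE.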
More importantly, the TeCSMART comparison can systematically identify potential risks in a sociotechnical system by identifying the possible failure modes associated with different components at different levels.

Figure 11. The logic tree of the BP Texas City Refinery Explosion (adapted from Ref. 12, Investigation Report: Refinery Explosion and Fire).

Summary and Conclusions

As the recent systemic failures in different domains remind academicians and practitioners alike, one can never take system safety for granted. All of us, as individuals, corporate management, regulatory agencies, and communities, need to learn the lessons from every accident, particularly from the systemic ones. It is imperative to study all these disasters from a common systems engineering perspective, so that one can thoroughly understand the commonalities as well as the differences to prevent or mitigate future ones. This is the approach we have adopted in this article. Analyzing systemic risk in a complex sociotechnical system thus requires modeling the system at multiple levels and from multiple perspectives, using a systematic and unified framework. It is not enough to focus only on equipment failures; it is important to systematically examine the potential failures associated with humans and institutions at all levels in a society. We have proposed such an approach, the TeCSMART framework, which models sociotechnical systems in seven layers using control-theoretic concepts. Using this framework, a HAZOP-like hazards identification can be conducted for every layer of a sociotechnical system.
The failure modes identified using the TeCSMART framework, at all levels, serve as a common platform for comparing systemic failures from different domains, to elicit and understand common failure mechanisms that can help with improved design and risk management in the future. They also serve as input information for developing other types of models (e.g., DAE, SDG, game-theoretic, agent-based) for more detailed studies. We carried out such a comparative analysis of 13 major systemic events from different domains, analyzing over 700 failures discussed in official postmortem reports. Even though we highlight the results from only three of them, for the sake of brevity, the common failure patterns we identify in this article were found in the other events as well. These 700+ failures can be systematically classified into the five categories (and their subcategories) that can occur at all levels of the system. Using a unifying control-theoretic framework, we show how these correspond to common failure modes associated with the elements of a control system, namely, the sensor, controller, actuator, process unit, and communication channels. Even though every systemic failure happens in some unique manner, and is not an exact replica of a past event, we show that the underlying failure mechanism can be traced back to similar patterns associated with other events.

No modern engineered system of ever-increasing complexity can be totally risk-free. However, minimizing the inherent risks in our products and processes is an important societal challenge, both intellectually and practically, for innovative science and engineering. Safety is not the responsibility of just the environment, health, and safety department; it is everyone's responsibility in the facility. There is a need for systems, procedures, and corporate and regulatory cultures that ensure this.
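The stated correspondence between the five failure classes and the control-loop elements can be written down as a direct mapping. A minimal sketch (our own illustration; the first three pairings follow the listed order in the text, while pairing Structural Failures with the process unit and Communication Failures with the channels is inferred from the class definitions):

```python
# Sketch of the correspondence between the five failure classes and
# the elements of a control loop, as summarized in the conclusions.
# The last two pairings are inferred from the class definitions.
CLASS_TO_CONTROL_ELEMENT = {
    "Monitoring Failures": "sensor",
    "Decision Making Failures": "controller",
    "Action Failures": "actuator",
    "Structural Failures": "process unit",
    "Communication Failures": "communication channel",
}

def control_element(failure_class: str) -> str:
    """Map a failure class to the control-loop element it corresponds to."""
    return CLASS_TO_CONTROL_ELEMENT[failure_class]
```

Because the same control-loop abstraction is applied at every one of the seven TeCSMART layers, this mapping is what allows a training failure in a refinery and one in a grid control room to be recognized as the same failure mode.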
In the long run, considerable technological help would come from progress in taming complexity, which would result in more effective prognostic and diagnostic systems for monitoring, analyzing, and controlling systemic risks. But getting there will require innovative thinking, bolder vision, and overcoming certain misconceptions about process safety as an intellectually dull activity.

Figure 12. The cause map of the Northeast Blackout (adapted from Ref. 63, The cause map of the Northeast Blackout of 2003).

Acknowledgment

This work is supported in part by the Center for the Management of Systemic Risk at Columbia University.

Literature Cited

1. Baker J, Leveson N, Bowman F, Priest S. The Report of the BP U.S. Refineries Independent Safety Review Panel. Report; Independent Safety Review Panel. 2007.
2. Financial Crisis Inquiry Commission, United States. The Financial Crisis Inquiry Report: Final Report of the National Commission on the Causes of the Financial and Economic Crisis in the United States. PublicAffairs; 2011.
3. Ottino JM. Engineering complex systems. Nature. 2004;427(6973):399.
4. Jasanoff S. Learning from Disaster: Risk Management after Bhopal. Philadelphia: University of Pennsylvania Press, 1994. ISBN 081221532X.
5. Plotz D. Play the Enron Blame Game! Slate.com. 2002. Access Date: February 23, 2016. [Available from: http://www.slate.com/articles/news_and_politics/politics/2002/02/play_the_enron_blame_game.html.]
6. CCPS. Building Process Safety Culture: Tools to Enhance Process Safety Performance. Report; Center for Chemical Process Safety of the American Institute of Chemical Engineers, New York. 2005.
7. MSNBC. Mine Owner Ran Up Serious Violations. MSNBC; 2010. Access Date: February 23, 2016. [Updated April 6, 2010. Available from: http://www.nbcnews.com/id/36202623/.]
8. Thomas P, Jones LA, Cloherty J, Ryan J. BP's dismal safety record. ABC News. 2010. Access Date: February 23, 2016. [Updated May 27, 2010. Available from: http://abcnews.go.com/WN/bps-dismal-safety-record/story?id=10763042.]
9. Johnson LD, Neave EH. The subprime mortgage market: familiar lessons in a new context. Manag Res News. 2007;31(1):12-26.
10. Lewis M. The Big Short: Inside the Doomsday Machine. New York: W. W. Norton, 2011. ISBN 9780393078190.
11. Olive C, O'Connor TM, Mannan MS. Relationship of safety culture and process safety. J Hazard Mater. 2006;130(1):133-140.
12. CSB. Investigation Report: Refinery Explosion and Fire. Report; U.S. Chemical Safety and Hazard Investigation Board. 2005.
13. Hopkins A. Failure to Learn: The BP Texas City Refinery Disaster. CCH Australia Ltd, 2008. ISBN 1921322446.
14. Krugman P. Berating the raters. The New York Times. 2010. Access Date: February 23, 2016. [Updated April 25, 2010. Available from: http://www.nytimes.com/2010/04/26/opinion/26krugman.html?_r=0.]
15. Urbina I. Inspector general's inquiry faults regulators. The New York Times; 2010. Access Date: February 23, 2016. [Updated May 24, 2010. Available from: http://www.nytimes.com/2010/05/25/us/25mms.html.]
16. Venkatasubramanian V. Systemic failures: challenges and opportunities in risk management in complex systems. AIChE J. 2011;57(1):2-9.
17. Venkatasubramanian V, Zhao JS, Viswanathan S. Intelligent systems for HAZOP analysis of complex process plants. Comput Chem Eng. 2000;24(9-10):2291-2302.
18. Catanzaro M, Buchanan M. Network opportunity. Nat Phys. 2013;9(3):121-123.
19. Caldarelli G, Chessa A, Gabrielli A, Pammolli F, Puliga M. Reconstructing a credit network. Nat Phys. 2013;9(3):125-126.
20. Galbiati M, Delpini D, Battiston S. The power to control. Nat Phys. 2013;9(3):126-128.
21. Ashby WR. Requisite variety and its implications for the control of complex systems. In: Facets of Systems Science. Springer US; 1991:405-417.
22.
Natarajan S, Srinivasan R. Implementation of multi agents based system for process supervision in large-scale chemical plants. Comput Chem Eng. 2014;60:182-196.
23. Saleh JH, Marais KB, Favar FM. System safety principles: a multidisciplinary engineering perspective. J Loss Prev Process Ind. 2014;29:283-294.
24. Maurya MR, Rengaswamy R, Venkatasubramanian V. A systematic framework for the development and analysis of signed digraphs for chemical processes. 1. Algorithms and analysis. Ind Eng Chem Res. 2003;42(20):4789-4810.
25. Maurya MR, Rengaswamy R, Venkatasubramanian V. A systematic framework for the development and analysis of signed digraphs for chemical processes. 2. Control loops and flowsheet analysis. Ind Eng Chem Res. 2003;42(20):4811-4827.
26. Maurya MR, Rengaswamy R, Venkatasubramanian V. Application of signed digraphs-based analysis for fault diagnosis of chemical process flowsheets. Eng Appl Artif Intell. 2004;17(5):501-518.
27. Srinivasan R, Venkatasubramanian V. Multi-perspective models for process hazards analysis of large scale chemical processes. Comput Chem Eng. 1998;22(98):S961-S964.
28. Venkatasubramanian V, Vaidhyanathan R. A knowledge-based framework for automating HAZOP analysis. AIChE J. 1994;40(3):496-505.
29. Rasmussen J, Svedung I. Proactive Risk Management in a Dynamic Society. Swedish Rescue Services Agency, Karlstad, Sweden. 2000. ISBN 9789172530843.
30. Leveson NG, Stephanopoulos G. A system-theoretic, control-inspired view and approach to process safety. AIChE J. 2014;60(1):2-14.
31. Leveson NG. Engineering a Safer World: Systems Thinking Applied to Safety, 1st ed. Cambridge, MA: The MIT Press, 2011. ISBN 9780262016629.
32. Leveson NG. A systems-theoretic approach to safety in software-intensive systems. IEEE Trans Dependable Secure Comput. 2004;1(1):66-86.
33. Leveson N. A new accident model for engineering safer systems. Safety Sci. 2004;42(4):237-270.
34. Stephanopoulos G.
Chemical Process Control: An Introduction to Theory and Practice. Prentice-Hall, Englewood Cliffs, New Jersey 07632. 1984. 35. Seborg D, Edgar TF, Mellichamp D. Process Dynamics & Control. United States of America: Wiley, 2006. ISBN 8126508345. 36. Ogunnaike BA, Ray WH. Process Dynamics, Modeling, and Control, vol.1. New York: Oxford University Press, 1994. 37. Bequette BW, Bequette WB. Process Dynamics: Modeling, Analysis, and Simulation. Upper Saddle River, NJ: Prentice Hall PTR, 1998. ISBN 0132068893. 38. Bookstaber R, Glasserman P, Iyengar G, Luo Y, Venkatasubramanian V, Zhang Z. Process systems engineering as a modeling paradigm for analyzing systemic risk in financial networks. Off Financ Res Work Pap Ser. 2015;15(1). 39. Seider WD, Seader JD, Lewin DR. Product & Process Design Principles: Synthesis, Analysis and Evaluation. United States of America: Wiley,; 2009. ISBN 8126520329. 40. Srinivasan R, Venkatasubramanian V. Petri net-digraph models for automating hazop analysis of batch process plants. Comput Chem Eng. 1996;20(96):S719–S725. 41. Srinivasan R, Venkatasubramanian V. Automating hazop analysis of batch chemical plants: Part i. the knowledge representation framework. Comput Chem Eng. 1998;22(9):1345–1355. 42. Srinivasan R, Venkatasubramanian V. Automating hazop analysis of batch chemical plants: Part ii. algorithms and application. Comput Chem Eng. 1998c;22(9):1357–1370. 43. Vaidhyanathan R, Venkatasubramanian V. Digraph-based models for automated hazop analysis. Reliab Eng Syst Saf. 1995;50(1):33–49. 44. Vaidhyanathan R, Venkatasubramanian V. A semi-quantitative reasoning methodology for filtering and ranking hazop results in hazopexpert. Reliab Eng Syst Saf. 1996;53(2):185–203. 45. Rakoff JS. The financial crisis: why have no high-level executives been prosecuted?: The New York Review of Books; 2014. Access Date: February 23, 2016. [updated January 9, 2014. 
Available from: http://www.nybooks.com/articles/2014/01/09/financial-crisis-why-noexecutive-prosecutions/.] 46. Eilperin J, Higham S. How the minerals management services partnership with industry led to failure. 2010. Available at: http://www. washingtonpost.com/wp-dyn/content/article/2010/08/24/ AR2010082406754.html. 47. Jefferson T. United states declaration of independence: archives.gov; 1776. Access Date: February 23, 2016. [Available from: http://www. archives.gov/exhibits/charters/declaration_transcript.html.] 48. Blinder A. Donald blankenship sentenced to a year in prison in mine safety case: New York Times; 2016. Access Date: April 23, 2016. [updated April 6, 2016. Available from: http://www.nytimes.com/ 2016/04/07/us/donald-blankenship-sentenced-to-a-year-in-prison-inmine-safety-case.html?_r50.] 49. Steinzor R. Why Not Jail?: Industrial Catastrophes, Corporate Malfeasance, and Government Inaction. New York: Cambridge University Press, 2014. ISBN 1316194884. 50. Presidential Commission. Deepwater, the gulf oil disaster and the future of offshore drilling. Report; National Commission on the BP Deepwater Horizon Oil Spill and Offshore Drilling, Washington. 2011. 51. Browning JB. Union carbide: Disaster at Bhopal. In: Managing under Siege. Detroit, MI: Union Carbide Corporation, 1993:1–15. 52. Investigation of the challenger accident. Report; Committee on Science and Technology House of Representative, Washington. 1986. 53. Cullen WD. The public inquiry into the piper alpha disaster. Report 0046-0702, London. 1993. 54. WHO. Sars: how a global epidemic was stopped. Report. 2006. Geneva. Available at: http://www.tandfonline.com/doi/abs/10.1080/ 17441690903061389. 55. CAIB. Columbia accident investigation board report. Report; Columbia Accident Investigation Board: Washington. 2003. Available at: http://www.slac.stanford.edu/spires/find/books?irn5317624. 56. TaskForce. Final report on the August 14, 2003 blackout in the United States and Canada. 
Report; US-Canada Power System Outage Task Force. 2004. 57. McAteer JD, Beall K, Beck J, McGinley P. Upper big branch: the April 5, 2010, explosion: a failure of basic coal mine safety practices: Report to the governor. Report; Governors Independent Investigation Panel, West Virginia. 2011. 58. Bonnefoy P. Poor safety standards led to chilean mine disaster: GlobalPost; 2010. Access Date: February 23, 2016. [updated August 29, 2010. Available from: http://www.globalpost.com/dispatch/chile/ 100828/mine-safety. 59. Kurokawa K, Ishibashi K, Oshima K, Sakiyama H, Sakurai M, Tanaka K, Tanaka M, Nomura S, Hachisuka R, Yokoyama Y. The official report of the fukushima nuclear accident independent AIChE Journal 2016 Vol. 00, No. 00 Published on behalf of the AIChE DOI 10.1002/aic 19 investigation commission. Report; The Fukushima Nuclear Accident Independent Investigation Commission, Japan. 2012. 60. CERC. Report on the grid disturbance on 30th July 2012 and grid disturbance on 31st July 2012. Report, India; 2012. 61. Blackburn R. The subprime crisis. New left review 50: 63. 2008 Mar 1. 62. Jickling M. Containing financial crisis. Report; Congressional Research Service. 2011. 63. Think Reliability. The cause map of northeast blackout 0f 2003. Houston. 2008. URL: http://www.thinkreliability.com/InstructorBlogs/Blog%20-%20NE%20Blackout.pdf. 64. Schumer CE, Maloney CB. The subprime lending crisis: the economic impact on wealth, property values and tax revenues, and how we got here. 2007. Available at: www.jec.senate.gov/Documents/ Reports/10.25.07OctoberSubprimeReport.pdf. Manuscript received Feb. 26, 2016, and revision received Apr. 30, 2016. 20 DOI 10.1002/aic Published on behalf of the AIChE 2016 Vol. 00, No. 00 AIChE Journal


