Since we published A manifesto for Reality-based Safety Science, I’ve been thinking and talking about how (and how much) it applies to safety-critical engineering work (rather than just operational safety). The following is a description of how we might apply it to a hazard analysis technique. I’d like to thank Drew Rae, Richard Hawkins and Ibrahim Habli for comments and questions on earlier drafts of this.
Doing what I propose below isn’t easy, and certainly isn’t cheap. But if we are to keep advancing as a field I think it’s something that we need to do.
A challenge
A response to the manifesto:
“I can’t think how to get empirical evidence for the effectiveness of using <new hazard analysis technique XYZ> since its value is supposed to be in influencing the design of the system. You would probably need to independently develop the same system twice, once using XYZ and once not using XYZ and then find some way of measuring which is safer (which could take a very long time if you were basing this on e.g. number of accidents). This is where your focus on observing work breaks down I think. You can learn nothing of value from watching people undertake XYZ.”
Our response
General approach
We certainly can’t just Randomised-Controlled-Trial the whole thing (“20 teams developed the A380 with Airbus practice-as-usual, 20 using XYZ-based methods…”). But then, if XYZ has been successful there are probably hundreds of bad XYZ papers — why should they all have to be answered by one super paper?
The starting point is to recognise that “Is XYZ effective?” is a bad compromise between a research question and a practical goal. The practical goal here is to provide useful information for a practitioner who is trying to decide whether and how to apply XYZ. This decision is surrounded by context: a particular practitioner, in a particular organisation, with particular other activities occurring, with a particular system, and so on. The list of relevant “particulars” isn’t endless, but it’s long enough that even if we could do a randomised controlled trial with forty A380 developments, the outcome of that study would still not be compelling evidence that this particular practitioner should apply XYZ.
The trick is to break down the question so that it can be studied in parts. None of these individual parts gives an “aha, this is it!” answer to the question of effectiveness, but each part fills in some of the picture. If we do it well, this picture explains more than just XYZ, so a change to the method won’t invalidate the whole thing and force us back to square one.
Some of the parts:
- How do working engineers actually do hazard analysis (HA) in typical practice? (Probably in several ways.) What does that look like?
- What does XYZ claim you should do differently to improve design?
- Is that claim consistent with the observations about how engineers actually work?
- If so, what happens when engineers try to apply XYZ? E.g. how does their behaviour change? Is that consistent with the claims about how XYZ works? … and so on in that vein.
I.e. we try to progressively:
- Spell out very precisely what we mean by “Using XYZ improves design” — what does it change, and how does that lead to its ultimate beneficial effects? [1]
- Test aspects of that claim, including auxiliary claims about how design works and what improvement looks like, through a series of studies [2]
- If XYZ does in fact operate as intended, create an empirical model of how XYZ transforms design
- Either expand the scope of contexts in which the model is a good representation, or find the relevant factors that determine when it is likely to be true and when it isn’t
These smaller questions leave room for a range of research methods, each selected to match the method to the question. Questions about how engineers work will probably require ethnographic methods or contextual inquiry — embedding researchers in the engineering environment, or co-opting people already in the environment as participant researchers. Questions about the effect of changes can initially be addressed by case studies, but will ultimately require experiments or intervention studies. [3]
Demonstrating causality: in the absence of randomisation, we can fall back on the Bradford-Hill criteria. These criteria are used in epidemiology, and show that a case that X causes Y (in our case, that XYZ improves safe design) can be built up from different types of evidence; a sketch of how such evidence might be collated follows the list. The criteria include:
* Mechanism: Is there a plausible pathway from cause to effect?
* Consistency: Can you see the same thing happening independent of particular people or places?
* Temporality: Does the effect come after the cause?
* Strength (often dose-response): If you increase the cause, does the effect increase?
* Specificity: Is there a specific occurrence of the outcome where the only special thing happening is the cause?
* Change in risk factor: If you remove the cause, does the effect become less likely?
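To make this concrete, here is a minimal sketch in Python of how evidence against each criterion might be collated as studies of XYZ accumulate. The criterion names come from the list above; the example observations and the structure used to record them are entirely hypothetical and for illustration only.

```python
# Hypothetical sketch: recording observations against Bradford-Hill criteria
# for studies of XYZ. The observations below are invented placeholders.

from dataclasses import dataclass

@dataclass
class CriterionEvidence:
    criterion: str        # which Bradford-Hill criterion this speaks to
    observation: str      # what was actually seen in the study
    supports_claim: bool  # does it point towards "XYZ improves safe design"?

evidence = [
    CriterionEvidence("Mechanism",
                      "Hazards raised in XYZ workshops were traced to design changes",
                      True),
    CriterionEvidence("Temporality",
                      "Design changes followed, rather than preceded, XYZ sessions",
                      True),
    CriterionEvidence("Consistency",
                      "Same pattern seen on two unrelated projects",
                      True),
]

# Summarise which criteria have any supporting evidence so far.
supported = {e.criterion for e in evidence if e.supports_claim}
print(f"Criteria with supporting evidence: {sorted(supported)}")
```

The point is not the bookkeeping itself, but that the causal case is assembled from heterogeneous observations rather than from a single decisive trial.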
Getting started on XYZ
XYZ as described by its creator and proponents may not make it easy for us to get started. Most likely, those proponents make some claims about who should apply XYZ, what XYZ achieves, or the mechanism by which it works, but those claims are not specific and detailed enough for us to assess directly.
So, we’d need to start by working out what we thought XYZ might reasonably achieve, and how it might reasonably do it, and spelling all that out as concrete claims. You could imagine this as a GSN (Goal Structuring Notation) argument, where research is needed to provide the context and the evidence. The top-level claim would be something like “When an organisation of a particular type adopts XYZ in a particular way, they will get benefits X, Y, Z by way of mechanisms A, B, C”.
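As a rough illustration (not a real argument for any particular technique), here is what such a claim decomposition might look like, written as a small Python structure rather than a GSN diagram; all the sub-claims and evidence items are placeholders.

```python
# Hypothetical sketch of a GSN-style claim decomposition: a top-level goal
# supported by sub-goals, with context and evidence items to be supplied by
# research. Everything here is a placeholder, not a real argument.

top_claim = {
    "goal": "When an organisation of type T adopts XYZ in manner M, "
            "it gains benefits X, Y, Z via mechanisms A, B, C",
    "context": [
        "Definition of 'organisation of type T'",
        "Description of how XYZ is actually applied (manner M)",
    ],
    "sub_goals": [
        {"goal": "Mechanism A operates when engineers apply XYZ",
         "evidence": ["Observational study of XYZ sessions"]},
        {"goal": "Mechanism A leads to benefit X",
         "evidence": ["Comparison of designs before and after XYZ adoption"]},
    ],
}

def print_argument(node: dict, indent: int = 0) -> None:
    """Print the goal structure as an indented outline."""
    print(" " * indent + node["goal"])
    for sub in node.get("sub_goals", []):
        print_argument(sub, indent + 2)

print_argument(top_claim)
```

In a real study, the research activities would be chosen precisely to supply the context and evidence nodes of a structure like this.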
Once we have our set of claims about benefit and mechanism, we need to work out a set of studies that could assess as many of those claims as possible.
Example: the Knight-Leveson experiment on N-version programming. It took the specific claim “N-version programming reduces the risk of software faults causing failures because the different versions will be implemented in different ways and thus (when run in duplex/triplex/etc.) will not simultaneously make the same error”, and tested it by having many programmers implement their own versions of a given specified program. It turned out that the resulting programs were similar in structure, and that the faults in them were very similar. Granted, the experiment had significant weaknesses (e.g. use of student programmers rather than experienced professionals, and a lack of explicit techniques to encourage diverse solutions), but it raised a valid question mark over advocacy of N-version techniques.
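For readers less familiar with N-version programming, here is a minimal sketch of the idea whose underlying assumption was being tested: several independently developed implementations of the same specification run together, with the majority answer taken. The three toy “versions” below are invented for illustration, one deliberately faulty.

```python
# Minimal sketch of N-version voting. The three "versions" are toy
# placeholders standing in for independently developed implementations
# of the same specification (here: absolute value of an integer).

from collections import Counter

def version_a(x: int) -> int:
    return abs(x)

def version_b(x: int) -> int:
    return x if x >= 0 else -x

def version_c(x: int) -> int:
    # Deliberately faulty version for illustration: wrong for negative inputs.
    return x

def n_version_vote(x: int) -> int:
    """Return the majority answer across the versions, if one exists."""
    answers = [version_a(x), version_b(x), version_c(x)]
    value, count = Counter(answers).most_common(1)[0]
    if count > len(answers) // 2:
        return value
    raise RuntimeError("No majority: versions disagree")

print(n_version_vote(-5))  # majority masks version_c's fault here (prints 5)
```

The voting only masks faults if the versions fail independently; the experiment’s finding that faults were highly similar across versions is exactly what undercuts that assumption.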
The discussion above gives rise to the general research question “When and how do safety practitioners influence design?” and the specific research question “How does adopting XYZ change when and how safety practitioners influence design?”
These aren’t questions that require experiments. In any organisation, there are lots of safety practitioner activities, and lots of design changes. We can use something like the Bradford-Hill criteria to investigate the pre-existing relationships between safety practices and design changes. In an organisation that is introducing XYZ, we can use the change over time and between projects to draw lots of comparisons.
Ultimately, we’ll never get to “Here’s the proof that XYZ reduces the likelihood of an accident”. But if we can show that as an organisation adopts XYZ, more hazards are identified earlier, resulting in changes to the design that will plausibly eliminate those hazards or significantly reduce their likelihood, we have one leg of an argument for the value of XYZ. We can combine that with theoretical arguments for why XYZ should work, or arguments based on “it would have prevented these past accidents”, to make the strongest case we (legitimately) can for the value of XYZ.
With respect to the specific claim “You can learn nothing of value from watching someone perform XYZ”: that’s not true. For example, you can learn what they actually do, what their thought processes are, what thoughts they have, and when they have them (e.g. in terms of process stages, or of what exactly they were doing). You can do this via expensive ethnographic methods (i.e. participant observation), cheaper (but still intrusive) talk-aloud studies, or more subtle methods like contextual inquiry (where you watch someone at work and ask questions as they occur to you, e.g. because you don’t understand why they’re doing something).
Observation of knowledge work is harder, in general, than observation of live-system control, which is harder in turn than observation of physical work. But it can be done, and there are methods for doing it. There is a large body of work studying how decisions are made, and what tools or information people use to make those decisions.
Choosing a good project to study
In terms of the environment to study this in, the ideal would be to find an organisation intending to introduce XYZ, and to watch these things as XYZ is introduced. That gives you the comparative / counterfactual element of how things are different when XYZ is involved. It’s very hard in the steady state to know what is due to XYZ, and what would be exactly the same if XYZ wasn’t there. [4]
A lot of this depends on the project timescales. A380s and space shuttles are not good systems to study risk assessment techniques on. The ideal would be a car manufacturer or a railway organisation — somewhere where at any moment there are a range of ongoing projects, all with a high degree of complexity and safety criticality. Hospitals are also pretty good, because they tend to separate out their risk activities from their everyday work. If a researcher is present at the right meetings and workshops, they can be fairly confident that there isn’t other formal risk assessment activity going on that they don’t know about. In an engineering organisation you never know who might be sitting at their desk working on an analysis.
Implications for the broader context of the field
In parallel with doing all this, we as a field can make more room for this work, and incentivise it better, by reducing the number of prescriptive claims we make. E.g. if someone proposes a new hazard analysis technique, unless it’s for a very specific niche that’s known to be very problematic for HA, we can tell them that this is not, in itself, useful research. Given the shortage of good work in the vein described above, proposing new generic HA techniques is not something deserving of time, funding, or publication.
We should get into the mindset that a proposed technique is not itself new and substantiated knowledge. The “research contribution” comes either from examining the status quo (i.e. demonstrating using new real-world data that there is a problem that the technique addresses) or from carefully examining the properties of the new technique. This should be familiar to most academics, but also challenging for some. For Computer Science, my nominal parent discipline, it’s challenging because the properties aren’t related to termination or algorithmic complexity, but come from the real-world application of human-centric techniques.
Proposing new techniques isn’t just a waste of time; it drags down real research activity. Likewise for proposing “improvements” to existing techniques based on a poor theory base (i.e. one not derived empirically from studies of how practitioners work) and then crudely evaluating them (with a toy example only, or a very weak study of practitioners). It is hard to do rigorous evaluation when there is a constant landslide of new, unsubstantiated “how to” material.
As I noted at the top, doing this will not be easy or cheap. But if we want to continue to improve the safety we have achieved, and indeed to sustain it in the face of increasing system complexity, I think we need to do it.
Footnotes
1. If we can’t spell out how we think a method works, in terms of a process with potentially-observable intermediate effects, we do not understand the method well enough to advocate for it. (By contrast, drug companies often don’t understand the mechanism for their drugs. But their domain makes it possible to study outcomes well (through RCTs etc.) and thus they don’t need that understanding. Since we can’t study outcomes well, we can’t legitimately advocate for something until we have a convincing theory of its mechanism.)
2. One approach to this, though not the only approach, is to create a host of falsifiable claims about the method, and start making observations to try to falsify them. The surviving claims will start to build an empirical picture of the method.
3. Most of our experiments are likely to be field experiments, targeted at understanding particular parts of the method. These sacrifice things like randomisation and blinding in return for conducting the experiment in a real-world setting. The key point here: when you weaken an experimental design, you’re not necessarily sacrificing integrity, just allowing room for alternative explanations (i.e. you are making any result more equivocal). E.g. if you don’t randomise, maybe any difference between the groups is because of who was in each group. If you don’t blind, maybe people got the results they expected to get. If you don’t conduct it in a real organisation, maybe your results are only true for students in classrooms. Highly equivocal results are still valid, as long as that equivocality is made clear.
4. Although if we have specific claims about mechanism, e.g. “discovery of new causal pathways occurs predominantly when systematically investigating possible variants of control actions from the control model”, that’s something an observational study could look at: is that where hazards seem to be discovered, or is it rare for hazards to be discovered that way?