BMJ. 1998 Jun 6;316(7146):1734–1736. doi:10.1136/bmj.316.7146.1734

Half of all doctors are below average

Jan Poloniecki 1

A heart operation can put a very ill patient on the road to a long and healthy life, or it can kill the patient. Major surgery is just one of many instances when treatment can result in a failure more serious than the consequences of doing nothing. The balance of risk requires a responsible attitude from all the many parties to an operation: the patient, the general practitioner, the specialist physician, the surgeon, theatre nurses, and the anaesthetist; supervisors, such as the chief medical officer and chief executive of the hospital; and the funders of the operation.

This article considers the advantages of having an authoritative estimate of the current failure rate for an operation and reflects on the problems that have arisen where there was a lack of interest in doing this.

Summary points

  • Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average

  • With imperfect allowance for differences in case mix, differences in performance figures for surgeons or hospitals do not necessarily reflect differences in risk to an individual patient

  • All prospective parties to a major operation should have access to a numerical estimate of the risk of the patient not surviving

What are my chances, Doc, as a percentage, please?

A numerical estimate of the failure rate is a number, not a statement like “The operation is nearly always successful.” It is also a single number, not a range like “5-20%.” The estimate should relate to the doctor who will perform the operation, and it should be a current estimate, especially if there have been recent failures. It will be different from the national average for last year, and from the rates for other surgeons at the same hospital. The source of the estimate must be known, so that the same answer is given by the surgeon and the nurse.

The question "What is the failure rate of this operation?" is simple, but the answer is not as simple as dividing the number of failures by the number of operations. An estimate is required even for the first operation, and if the first operation is a failure it does not mean that the chance of the next operation failing is 100%. The estimate should also incorporate the fact that some patients are at higher risk than others.
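As a rough illustration, a simple Bayesian calculation gives a usable single number even before the first operation, and a sensible one after an early failure. The sketch below is not the method proposed in this article; the Beta(1, 19) prior (mean 5%) is an invented assumption standing in for a training series or national figures.

```python
# Sketch: beta-binomial estimate of a failure rate (illustrative only).
# A Beta(a, b) prior encodes what is believed before any operations --
# for example, from a training series or the national average -- and keeps
# the estimate away from 0% and 100% after very few cases.

def failure_rate_estimate(failures, operations, prior_a=1.0, prior_b=19.0):
    """Posterior mean failure rate under a Beta(prior_a, prior_b) prior.

    The default prior has mean 5%, a purely illustrative assumption.
    """
    return (failures + prior_a) / (operations + prior_a + prior_b)

print(failure_rate_estimate(0, 0))  # 0.05: an estimate exists before the first case
print(failure_rate_estimate(1, 1))  # about 0.095: one failure in one case is not 100%
```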

Why do I ask?

It is important to know that an estimate is available.

  • If nobody knows the chances of failure for this patient, then the balance of benefit and risk has not been adequately considered for this particular case;

  • If an estimate is never calculated for any patient, the failure rate is not being monitored. So there may have been a run of failures recently, and no opportunity for early correction of an adverse trend;

  • The parties to the operation want to know what they are letting themselves in for;

  • If the risk is very high the patient, or the parents of a child, might prefer not to go ahead with an operation;

  • It might be interesting to compare the quoted failure rate with what is quoted elsewhere.

Maybe I should get another quote?

The government believes that “Patients have a right to expect that ... they get a first class service. And in a first class service there is no room for second best.”1 Even when the “second class” has been excluded, a list of performance figures will still have a top, a middle, and a bottom. For surgery, referral is typically to a specific surgeon, not to the collection of surgeons at a hospital who can perform the operation. A hospital may have a satisfactorily low failure rate for a certain type of operation, but this does not mean that all the surgeons who perform the operation at that hospital have acceptable mortality figures.

What is the national average?

Unavoidably, about half of all practitioners will have performance figures that are below the national average, even if all practitioners are equally good and have the same failure rate in the long term. But not all practitioners are equally good. The one with the best performance figures will be at the top of the list, probably not by chance, and will be there, or thereabouts, if the list is renewed periodically. Unfortunately, not all patients can be seen by the best doctor.
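A small simulation makes the point concrete. The numbers below (100 surgeons, 50 operations each, a common true failure rate of 10%) are invented for illustration.

```python
# Sketch: even if every surgeon has the same true failure rate, the observed
# figures spread out by chance, so about half are "below average" and the
# worst figure sits well above the mean.
import random

random.seed(1)
TRUE_RATE, N_OPS, N_SURGEONS = 0.10, 50, 100

# Simulate identical surgeons: the same true failure rate for everyone.
rates = [sum(random.random() < TRUE_RATE for _ in range(N_OPS)) / N_OPS
         for _ in range(N_SURGEONS)]

average = sum(rates) / len(rates)
worse = sum(r > average for r in rates)  # a higher failure rate means worse results
print(f"average observed failure rate: {average:.3f}")
print(f"surgeons with worse than average figures: {worse} of {N_SURGEONS}")
print(f"worst figure: {max(rates):.2f} (far above the average, by chance alone)")
```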

A practitioner can have performance figures that are said to be “significantly” below average, meaning “statistically significant.” But even a very small difference in performance, one that is of no practical significance, will emerge as statistically significant if performance data are gathered over a sufficiently long period. In practice, “significantly below average” means consistently poorer than average performance by an amount that is likely to be of concern to some patients.

To see whether a practitioner is significantly below average, a test of statistical significance must be performed: it asks whether the short term results are consistent with the national average, differing only by chance, or whether there is evidence that the long term failure rate would continue to be below average.
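Such a one-off test can be done with an exact binomial calculation, as in the sketch below. The figures (34 failures in 100 operations, tested against the 25% rate used in the table) are illustrative only.

```python
# Sketch: a single one-sided exact binomial test of whether a series of
# results is consistent with an assumed national average failure rate.
from math import comb

def tail_p(failures, n, rate):
    """One-sided p value: P(X >= failures) when X ~ Binomial(n, rate)."""
    return sum(comb(n, k) * rate**k * (1 - rate)**(n - k)
               for k in range(failures, n + 1))

# e.g. 34 failures in 100 operations against a 25% national average
p = tail_p(34, 100, 0.25)
print(f"p = {p:.3f}")  # a small p value suggests results inconsistent with a 25% rate
```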

It is simple to perform and interpret a test of statistical significance once. But if a practitioner or group of practitioners is repeatedly subjected to simplistic significance testing, too many false alarms will occur. A surgeon may perform several different types of operation, and thus have separate series of results, each of which can be tested not just once but, for example, after every failure.

The table illustrates the sharp increase in the frequency of false alarms as the number of simple tests increases. For purposes of illustration, a national average failure rate of 25% is assumed, corresponding to a very high risk surgical intervention. The table shows the probability of a false alarm when the operator(s) have a long term failure rate equal to the national average of 25% and also when they have a substantially better than average failure rate of 20%. The calculations are for series of 100 operations each, with four series per operator, two operators per hospital, and a total of four hospitals. The probability of one or more operators failing the test is the same as the probability of the operator with the worst results failing the test.

It is almost inevitable (P=0.995) that one or more of the operators will fail the test if they all have a long term failure rate equal to the national average. Even when all eight operators have a substantially better than average failure rate of 20%, there is still a 75% chance that one or more of them will be found to be “significantly” below average.
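A simulation in the spirit of the table's repeated-testing rows can be sketched as follows. It uses a one-sided 5% exact binomial test, corresponding to one tail of a 90% two-tailed interval; the procedure behind the published table may differ in detail, so the outputs should only be expected to land in the region of the table's figures.

```python
# Sketch: false alarm rates when each series of 100 operations is retested
# after every failure, for 1, 4, 8, and 32 series (operator, hospital, and
# four-hospital groupings as in the table). Illustrative only.
import random
from math import comb

THRESHOLD, SERIES_LEN, N_SIM = 0.25, 100, 1000

def tail_p(failures, n):
    """P(X >= failures) for X ~ Binomial(n, THRESHOLD)."""
    return sum(comb(n, k) * THRESHOLD**k * (1 - THRESHOLD)**(n - k)
               for k in range(failures, n + 1))

# For each series length n, the smallest failure count that is "significantly"
# above 25% at the one-sided 5% level (n + 1 means no count qualifies yet).
critical = [next((f for f in range(n + 1) if tail_p(f, n) < 0.05), n + 1)
            for n in range(1, SERIES_LEN + 1)]

def series_alarms(true_rate):
    """Simulate one series of 100 operations, testing after every failure."""
    failures = 0
    for n in range(1, SERIES_LEN + 1):
        if random.random() < true_rate:
            failures += 1
            if failures >= critical[n - 1]:
                return True
    return False

random.seed(2)
for true_rate in (0.25, 0.20):
    for n_series, label in [(1, "single series"), (4, "one operator"),
                            (8, "one hospital"), (32, "four hospitals")]:
        alarms = sum(any(series_alarms(true_rate) for _ in range(n_series))
                     for _ in range(N_SIM))
        print(f"{label:14s} true rate {true_rate:.0%}: "
              f"false alarm probability ~ {alarms / N_SIM:.2f}")
```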

Some large and statistically significant differences have been reported from single tests. For example, the death rate after surgery for pancreatic cancer is said to be five times greater in non-specialist hospitals than for specialist surgeons.2,3 Ninety day mortality after hip operations was four times greater in seven East Anglian hospitals than in another nearby hospital.4 A failure rate for a hospital is an average of the failure rates of the individual practitioners, and comparisons between groups will disguise still larger differences that exist between practitioners.

Is it you, Doc, or your patients, who are below average?

Not all patients are the same: some conditions predispose to failure of a proposed treatment more readily than others. These differences cannot be expected to average out, because the process of referral to consultants differs both within and between hospitals. Peer comparisons made without taking into account gross and manifest differences in the preoperative risk of individual patients are a highly unreliable guide to the quality of service.

If there are evident differences in risk, then quantification of a patient’s preoperative risk is feasible, and a stratification system like the Parsonnet scoring system5 can be used to adjust for these differences. But however detailed the risk stratification, differences between practitioners may reflect yet other differences between patients—namely, those not adequately allowed for in the scoring system—rather than differences in professional skills. For this reason, a patient may not be typical of the patients treated by someone else. The failure rate applicable to patients who transfer because of the prospect of a lower failure rate may be higher or lower than that for non-transferring patients.
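In practice, risk adjustment often takes the form of comparing observed deaths with the number expected from each patient's preoperative risk. The sketch below illustrates the arithmetic; the risks and outcomes are invented, and the mapping from a Parsonnet score to a predicted risk is not reproduced here.

```python
# Sketch: observed versus expected deaths, with each patient's predicted
# preoperative risk taken from a stratification system such as Parsonnet
# scoring. The data below are made up for illustration.

patients = [  # (predicted preoperative risk, died: 1 = yes, 0 = no)
    (0.05, 0), (0.10, 0), (0.30, 1), (0.02, 0), (0.20, 0),
    (0.40, 1), (0.08, 0), (0.15, 1), (0.03, 0), (0.25, 0),
]

observed = sum(died for _, died in patients)
expected = sum(risk for risk, _ in patients)
print(f"observed deaths: {observed}, expected: {expected:.2f}, "
      f"O/E ratio: {observed / expected:.2f}")
# An O/E ratio near 1 suggests results in line with the case mix; a ratio
# well above 1 may reflect performance -- or risk factors the score misses.
```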

Have you thought of retraining?

Deaths after major surgery occur more often in an intensive care or general ward than they do in the operating theatre itself. If there is a correctable cause for a high death rate, the problem may lie with surgery or with intensive care or anaesthesia, or elsewhere. If a monitoring scheme is in place for intensive care as well as for surgery, it may be possible to say which of these two requires attention. At present there is a presumption that it is the surgeon who should retrain. Unless there is an indication that a specific skill is deficient, and not just that the results are below average, it may be difficult to know which skills need retraining. Perhaps experience with retraining of various practitioners, not just surgeons, and monitoring of the changes in subsequent performance will show the benefits of generalised retraining. Meanwhile, it may be wise to keep an open mind about whether the right person has been identified for retraining and whether it will improve results.

Where assessment of poor performance is triggered by a complaint to the General Medical Council, it will be difficult to make allowance for the fact that there will always be someone with the worst performance figures. To avoid mandatory retraining of practitioners who are already as good as, or better than, average, a second prospective period of observation is required, except perhaps where a specific skill has been identified as deficient or where statistical expertise is available to adjust for the selection bias.

Do you monitor your performance?

The collection and analysis of performance data is a difficult subject which requires expert statistical consideration beyond the application of a few simple tests of significance. A report on adult cardiac surgery commissioned by the Bristol Healthcare Trust concluded that the performance of one of the surgeons was “significantly” poorer than that of the other surgeons.6 The conclusion seems to have been based on a test of statistical significance without adjustment for the number of comparisons implied by the number of surgeons and series that were analysed.7 The non-random selection of Bristol for investigation was also not considered. This subject is now at such an early stage that even professionally qualified statisticians may have difficulty in interpreting performance data. If unreliable inferences are made during this period of learning, there is potential for causing harm and distress to practitioners and patients.

An in-house monitoring system will give tighter control than sending reports to a central registry for aggregation. It is necessary to have a formal statistical quality control scheme so that an adverse trend can be detected early and investigated properly. A numerically informal surveillance system may cry “wolf” so unauthoritatively that follow up investigations become ineffectual or non-existent. The CRAM (cumulative risk-adjusted mortality or morbidity) chart8 is a formal control procedure, and it yields an up to date estimate of prospective risk for individual patients, provided that there have been at least 16 failures. A formal mathematical method has not yet been proposed for the earliest cases of a series, but a prospective estimate can and should be established, based on a training series as second operator, or from the other data sources that are the basis for believing that the proposed intervention represents a balance of risk that is favourable to the patient. In all cases a locally agreed prospective estimate can be written in the patient’s notes. How to use the number may be a matter of judgment, but it should be available if requested.
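The core of such a chart is the running total of observed minus expected deaths, as the minimal sketch below shows. The published CRAM method (reference 8) adds formal control limits and an updated prospective risk estimate, which are not reproduced here; the case data are invented.

```python
# Sketch of a CRAM-style chart: the cumulative difference between observed
# and risk-adjusted expected deaths, case by case. A rising curve signals
# more deaths than the case mix predicts. General idea only, not the
# published method.

def cram_curve(cases):
    """cases: sequence of (predicted_risk, died) pairs, in operation order.
    Returns the running total of observed minus expected deaths."""
    curve, total = [], 0.0
    for risk, died in cases:
        total += died - risk
        curve.append(total)
    return curve

cases = [(0.10, 0), (0.10, 1), (0.20, 1), (0.05, 0), (0.10, 1), (0.15, 0)]
for i, value in enumerate(cram_curve(cases), start=1):
    print(f"after case {i}: O-E = {value:+.2f}")
```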

The difficulties experienced in Bristol in relation to a series of neonatal arterial switch operations and a series of atrioventricular septal defect repairs arose from not knowing when the risk of death from surgery was unacceptable; lack of guidance on when patients might be referred elsewhere with an expectation of lower mortality; and lack of statistical authority in the estimates of risk given to the parents.9 All of these difficulties would have been avoided if agreed estimates of prospective risk had been available.

It will be of little value if the concerns raised about the large number of deaths in children operated on in Bristol are resolved merely by striking off the three doctors, two of whom have already retired, and the third of whom long ago stopped the type of operation in question. The expressions of concern by the parents will be of lasting value if they help to establish that the correct question, which the service should be equipped to answer, is, “What is the current failure rate?”

Table.

False alarm rates using a 90% two tailed confidence interval: probabilities of concluding that the failure rate in the worst of possibly many series exceeds the threshold value of 25% when the true rate equals 25% or 20%. Results of computer simulations of series of 100 operations each

                                                      Probability of false alarm
                                                      National         Better than
                                                      average (25%)    average (20%)
Single series tested once                                 0.05             0.002
Single series tested after every failure                  0.2              0.05
Four series per operator tested after every failure       0.59             0.20
Two operators per hospital (8 series)                     0.74             0.29
Four hospitals (32 series)                                0.995            0.75


Footnotes

Funding: None.

Conflict of interest: None.

References

1. Department of Health. NHS performance will be measured against what really matters. London: DoH; 1998. (Press release 98/024, 21 January.)
2. Neoptolemos JP, Russell RC, Bramhall S, Theis B. Low mortality following resection for pancreatic and periampullary tumours in 1026 patients: UK survey of specialist pancreatic units. UK Pancreatic Cancer Group. Br J Surg 1997;84:1370–1376.
3. Dobson R. Cancer survival rates far greater for specialist surgeons. Independent on Sunday 1997 November 16:12.
4. Todd CJ, Freeman CJ, Camilleri-Ferrante C, Palmer CR, Hyder A, Laxton CE, et al. Differences in mortality after fracture of hip: the east Anglian audit. BMJ 1995;310:904–908. doi:10.1136/bmj.310.6984.904
5. Parsonnet V, Dean D, Bernstein AD. A method of uniform stratification of risk for evaluating the results of surgery in acquired adult heart disease. Circulation 1989;79(suppl I):3–12.
6. Dyer C. Wisheart begins to give evidence at GMC. BMJ 1998;316:646.
7. Bristol Healthcare Trust. Independent review of adult cardiac surgery—United Bristol Healthcare Trust. Bristol: The Trust; 1997:1–17.
8. Poloniecki J, Valencia O, Littlejohns P. Cumulative risk adjusted mortality chart for detecting changes in death rate: observational study of heart surgery. BMJ 1998;316:1697–1700. doi:10.1136/bmj.316.7146.1697
9. Dyer C. GMC accused of prejudicing doctors’ defence. BMJ 1997;315:1177. doi:10.1136/bmj.315.7117.1177
