eLife. 2021 Mar 2;10:e64618. doi: 10.7554/eLife.64618

Genomic epidemiology of COVID-19 in care homes in the east of England

William L Hamilton 1,2,†, Gerry Tonkin-Hill 3, Emily R Smith 4, Dinesh Aggarwal 2,5, Charlotte J Houldcroft 6, Ben Warne 1,2, Luke W Meredith 6, Myra Hosmillo 6, Aminu S Jahun 6, Martin D Curran 7, Surendra Parmar 7, Laura G Caller 6,8, Sarah L Caddy 6, Fahad A Khokhar 2, Anna Yakovleva 6, Grant Hall 6, Theresa Feltwell 6, Malte L Pinckert 6, Iliana Georgana 6, Yasmin Chaudhry 6, Colin S Brown 5, Sonia Gonçalves 3, Roberto Amato 3, Ewan M Harrison 3, Nicholas M Brown 1,7, Mathew A Beale 3, Michael Spencer Chapman 3,9, David K Jackson 3, Ian Johnston 3, Alex Alderton 3, John Sillitoe 3, Cordelia Langford 3, Gordon Dougan 2, Sharon J Peacock 2, Dominic P Kwiatkowski 3, Ian G Goodfellow 6, M Estee Torok 1,2; COVID-19 Genomics Consortium UK
Editors: Amy Wesolowski 10, Miles P Davenport 11
PMCID: PMC7997667  PMID: 33650490

Abstract

COVID-19 poses a major challenge to care homes, as SARS-CoV-2 is readily transmitted and causes disproportionately severe disease in older people. Here, 1167 residents from 337 care homes were identified from a dataset of 6600 COVID-19 cases from the East of England. Older age and being a care home resident were associated with increased mortality. SARS-CoV-2 genomes were available for 700 residents from 292 care homes. By integrating genomic and temporal data, 409 viral clusters within the 292 homes were identified, indicating two different patterns – outbreaks among care home residents and independent introductions with limited onward transmission. Approximately 70% of residents in the genomic analysis were admitted to hospital during the study, providing extensive opportunities for transmission between care homes and hospitals. Limiting viral transmission within care homes should be a key target for infection control to reduce COVID-19 mortality in this population.

Research organism: Virus

Introduction

Care homes are at high risk of experiencing outbreaks of SARS-CoV-2. COVID-19 is associated with higher mortality in older people and those with comorbidities including cardiovascular and respiratory disease (Williamson et al., 2020), making the care home population especially vulnerable. As of the week ending 30th June 2020, the UK Office for National Statistics (ONS) estimated that 30.2% of all deaths due to COVID-19 (13,417 deaths) in England occurred in care homes, and 63.9% (28,390 deaths) occurred in hospital (Office for National Statistics, 2020a). Most of the COVID-19 deaths in hospital were in persons aged 65 years and over (86.1%). Deaths due to confirmed COVID-19 from this period may be underestimates due to limitations on diagnostic testing; the ONS estimates that from 28 December 2019 to 12 June 2020, there were 29,393 excess deaths in care homes compared to the expected number based on previous years, of which only two thirds are explained by recorded COVID-19 (Office for National Statistics, 2020b). To date, SARS-CoV-2 transmission in care homes has not been systematically studied with linkage of epidemiological and genomic data on a large scale.

Care homes are defined by the Care Quality Commission (CQC), the independent regulator of adult health and social care in England, as ‘places where personal care and accommodation are provided together’ (Care Quality Commission, 2020a). In 2011, 291,000 people aged 65 or older were living in care homes in England and Wales, representing 3.2% of the total population at this age; 82.5% of the care home population was aged 65 years or older (Office for National Statistics, 2014). Care homes are known to be high-risk settings for infectious diseases, owing to a combination of the underlying vulnerability of residents who are often frail and elderly with multiple comorbidities, the shared living environment with multiple communal spaces, and the high number of interpersonal contacts between residents, staff, and visitors in an enclosed space (Curran, 2017; Lansbury et al., 2017; Strausbaugh et al., 2003). Understanding the transmission dynamics of SARS-CoV-2 within care homes is therefore an urgent public health priority.

Rapid SARS-CoV-2 sequencing combined with detailed epidemiological analysis has been used to trace viral transmission networks in hospital and community-based healthcare settings (Meredith et al., 2020). That earlier study was based in Cambridge University Hospitals (CUH), a secondary care provider and tertiary referral centre in the East of England, UK. It focused on identifying hospital-acquired and healthcare-associated infections by integrating genomic and epidemiological data with hospital Infection Prevention and Control (IPC) systems. While clusters involving care home residents and healthcare workers were observed, the study was not intended to analyse care home transmission specifically and focused on samples tested at CUH to provide information for IPC on potentially hospital-acquired infections. Previous epidemiological studies of COVID-19 specifically in care homes have been limited in population size, temporal scale and/or the amount of genomic data included (Arons et al., 2020; Burton et al., 2020; Graham et al., 2020; Kemenesi et al., 2020; Quicke et al., 2020). Here, genomic epidemiology is used to investigate viral transmission dynamics in care home residents across the East of England (EoE), the fourth largest of the nine official regions in England (Office for National Statistics, 2011). Several key questions of public health concern are addressed: What is the burden of care-home-associated COVID-19 tested in the region? What are the outcomes for care home residents admitted to hospital with COVID-19? Does SARS-CoV-2 spread between care home residents from the same care home via a single introduction and subsequent transmission, or through multiple independent acquisitions of the virus among residents? Finally, is there evidence of viral transmission between care homes and hospitals?

Results

COVID-19 case numbers from care home and non-care home residents included in the study

A total of 7406 SARS-CoV-2 positive samples from 6600 individuals were identified in the study period (26th February to 10th May 2020) (Figure 1), and care home residency status was determined in 6413 (Figure 1—figure supplement 1); the remaining 187 cases had missing address data, so care home status could not be determined. The samples were tested at the Public Health England (PHE) Clinical Microbiology and Public Health Laboratory (CMPHL) in Cambridge, which receives samples from across the East of England (EoE). Positive cases came from 37 submitting organisations including regional hospital laboratories and community-based testing services (Supplementary Materials). The proportion of samples coming from different sources changed over the study period (Figure 1—figure supplement 2). This likely reflects a combination of regional hospitals establishing their own testing facilities, increasing availability of community testing in the UK, and the implementation of national policies that increased the scope of care home testing (Figure 1—figure supplement 3). Overall, the study population included almost half of the COVID-19 cases diagnosed in the EoE at this time (Public Health England, 2020a), with the remainder being tested at other laboratory sites.

Figure 1. Study flow diagram. Out of 6600 patients testing positive in the Cambridge Microbiology Public Health Laboratory (CMPHL) during the study period, 1167 were identified as being care home residents from 337 care homes.

(The methodology for assigning care home status is described in main text and Figure 1—figure supplement 1). Out of 1297 samples from 1167 care home residents, 286 samples were assigned for nanopore sequencing on site and 833 samples for sequencing at the Wellcome Sanger Institute (WSI). Of these, 258 and 533 sequences, respectively, were available and downloaded from the MRC-CLIMB server at the time of running the analysis. Of these available genomes, 224 and 522, respectively, passed sequencing quality control thresholds (described in Materials and methods). This yielded the final analysis set of 700 high-coverage genomes from care home residents (representing 292 care homes): 197 genomes sequenced on site by nanopore and 503 sequenced at WSI by Illumina. * 193 care homes were registered with the CQC as being residential homes without nursing care, referred to as ‘residential homes’ in main text, and 144 had nursing care available, referred to as ‘nursing homes’. ** Samples were selected for nanopore sequencing on site if they were from inpatients or healthcare workers at Cambridge University Hospitals NHS Foundation Trust (CUH), where we prioritised rapid turnaround time to investigate hospital-acquired infections, plus a randomised selection of other East of England samples to provide broader genomic context to the CUH cases. The remaining samples not selected for nanopore sequencing on site, where available, were sent to WSI for sequencing.

Figure 1—figure supplement 1. Flow diagram for identifying care homes from Cambridge-COGUK metadata. Steps for identifying care home residents (further details in Materials and methods).

First, the address field in the patient electronic healthcare records was searched for matching terms indicating a care home (e.g. ‘care home’, ‘nursing home’, etc). Second, the patient address field was searched for matching terms from a list of care home names registered to the Care Quality Commission (CQC). The resulting list was manually inspected and every care home included in the study was linked to a registered CQC care home. CQC coding of whether the care home had nursing care available was used (referred to as ‘nursing homes’ if nursing care was available and ‘residential homes’ if not). If the address information was incomplete (no postcode and/or no address line) then the case was excluded as impossible to determine whether or not the patient was from a care home, unless the person was known to be a healthcare worker (HCW), in which case it was assumed they were not a care home resident. This process yielded the final result of 1167 care home residents from 337 care homes; 5246 individuals that were not care home residents, and 187 individuals that were indeterminable.
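
The two search steps and the exclusion rule described above can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the function names, the return labels and the normalisation are hypothetical, and the list of generic terms contains only the two examples given in the text. In the study, the resulting matches were additionally inspected by hand and linked to a registered CQC care home, which this sketch does not attempt.

```python
import re

# Generic search terms: only the two examples named in the text (assumption:
# the real list was longer).
GENERIC_TERMS = ("care home", "nursing home")


def normalise(text):
    """Lower-case and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def flag_care_home(address, postcode, registered_names, is_hcw=False):
    """Return 'care_home', 'not_care_home' or 'undetermined' for one record.

    registered_names stands in for the list of CQC-registered care home names.
    """
    if not address or not postcode:
        # Incomplete address: undetermined, unless a known healthcare worker,
        # who is assumed not to be a care home resident.
        return "not_care_home" if is_hcw else "undetermined"

    padded = f" {normalise(address)} "

    # Step 1: generic terms indicating a care home.
    if any(f" {term} " in padded for term in GENERIC_TERMS):
        return "care_home"

    # Step 2: names of registered care homes.
    if any(f" {normalise(name)} " in padded for name in registered_names):
        return "care_home"

    return "not_care_home"
```

Matching on whole, space-padded phrases is a design choice made here to avoid partial-word hits; automated matches of this kind would still produce false positives and negatives, which is presumably why the manual inspection step was needed.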
Figure 1—figure supplement 2. Breakdown of main organisations submitting samples to Cambridge PHE Laboratory over study period per week.

Only showing sites that submitted samples from >50 people with positive test results over study period, otherwise counted as ‘Other’. To maintain patient anonymity, per time interval only showing sites that submitted samples from >5 people with positive test results (otherwise counted as ‘Other’). Data prior to 16 March is amalgamated due to low sample numbers. Note that over the course of the study, some sites changed testing provider from CMPHL as further testing sites became available around the region. This explains some of the variation in the relative proportion of cases submitted from each site. The numbers reported here do not necessarily reflect total case numbers for each hospital or submitting organisation, as tests may have been performed elsewhere or metadata not collected in this study; the numbers are included purely to indicate where the samples included in this study originated from.
Figure 1—figure supplement 3. UK care home testing policy timeline.

(1) 31st January – first recorded case of COVID-19 in the UK. (2) 26th February – first case of COVID-19 in the East of England; start date of this study. (3) 12th March – individuals in the community advised to self-isolate for 7 days, without testing. Testing only offered to care homes in the context of a suspected outbreak. (4) 23rd March – UK lockdown officially begins. (5) 15th April – action plan announced to test all symptomatic residents in care homes, plus testing of all residents prior to admission to care home from hospital. (6) 29th April – testing guidance amended to reflect that asymptomatic as well as symptomatic residents and staff in care homes may need to be tested as part of an outbreak. (7) Policy for COVID-19 testing prior to discharge to care homes instigated 16th April: https://www.gov.uk/government/publications/coronavirus-covid-19-adult-social-care-action-plan/covid-19-our-action-plan-for-adult-social-care. (8) 10th May – end date of this study. (9) 11th May – national whole care home testing portal (offering a single test to all staff and residents) goes live for care homes with residents aged 65 years and over or dementia patients. (10) 8th June – national whole care home testing portal extends eligibility to care homes with residents aged under 65 years. (11) 3rd July – announcement that regular asymptomatic testing for care home staff and residents will be rolled out through the national whole care home testing portal in July for homes with residents aged over 65 years or dementia patients. References: Public Health England, 2020b; The Health Foundation, 2020.

Of the study population, 1167/6413 (18.2%) were identified as care home residents from 337 care homes. 193/337 (57.3%) care homes were residential homes and 144/337 (42.7%) were nursing homes, with the majority located in five counties across EoE: Essex, Hertfordshire, Bedfordshire, Suffolk and Cambridgeshire (Figure 2). This represents around half of the care homes in the East of England which had reported suspected or confirmed COVID-19 outbreaks to PHE as of 11th May 2020 (UK government, 2020a). As expected, care home residents were older than non-care home residents (median age 86 years versus 65 years, respectively [p<10⁻⁵, Wilcoxon rank sum test]) (Table 1). There was a median of two cases per care home (range 1–22), with a highly skewed distribution: the 10 care homes (top 3%) with the largest number of cases contained 164/1167 (14.1%) of all care home cases (Figure 2—figure supplement 1).

Figure 2. Care home locations by county, showing nursing and residential homes.

Only showing the five counties with the largest number of cases (all >25) to preserve patient anonymity. Definitions of ‘nursing home’ and ‘residential home’ are based on Care Quality Commission (CQC) information on whether nursing care is or is not present. If no nursing care is available the home is classified as a residential home. If the care home offers nursing care (including if it can offer both nursing and residential care) then the home is classified as a nursing home.

Figure 2—figure supplement 1. Distribution of cases per care home.

The number of positive cases per care home was highly skewed, such that a relatively small number of care homes contributed a large proportion of cases (right-hand side of the plot). Plot produced with R package ggplot2 using geom_histogram with binwidth = 1.

Table 1. Epidemiological characteristics of care home and non-care home residents with COVID-19 included in the study.

The total sample set for this study comprised 6600 individuals. Of these, care home residency status could be established for 6413 (97.2%). 1167/6413 (18.2%) individuals were identified as being care home residents, of which 700/1167 (60.0%) had genomic data available that passed quality control filtering and were used for identifying care home clusters using the transcluster algorithm (described in Methods and main text). The subset of individuals (464/6600, 7.03%) that were tested at Cambridge University Hospitals (CUH) had richer metadata available and were used for analysing intensive care unit (ICU) admissions and 30 day mortality after first positive test, shown here. Not showing precise values where the number of cases is equal to or less than five individuals, to preserve patient anonymity. Ct = Cycle threshold; CUH = Cambridge University Hospitals; ICU = Intensive Care Unit; IQR = interquartile range.

Variable | Care home residents (all) | Non-care home residents (all) | Care home residents with genomes
Number (%) | 1167/6413 (18.2%) | 5246/6413 (81.8%) | 700/1167 (60%)
Female (%) | 624/1167 (53.5%) | 2338/5246 (44.6%) | 363/700 (51.9%)
Male (%) | 543/1167 (46.5%) | 2908/5246 (55.4%) | 337/700 (48.1%)
Age in years (median, IQR, range) | 86 (IQR: 79–90, range: 30–100) | 65 (IQR: 48–80, range: 0–100) | 86 (IQR: 78–90, range: 42–99)
Diagnostic Ct value | 26 (IQR: 22–29) | 25 (IQR: 21–29) | 24 (IQR: 20–27)
Tested at CUH (%) | 72/464 (15.5%) | 392/464 (84.5%) | 54/72 (75%)
CUH patient admitted to ICU (%) | <5/72 (<7%) | 84/392 (21.4%) | <5/54 (<9%)
CUH patient 30 day mortality (%) | 34/72 (47.2%) | 78/392 (19.9%) | 23/54 (42.6%)
Number of care homes | 337 | - | 292
Cases per care home (median, IQR, range) | 2 (IQR: 1–5, range: 1–22) | - | 2 (IQR: 1–3, range: 1–18)
Care homes with ≥5 cases | 85/337 (25.2%) | - | 32/292 (11%)

The epidemic curve for all cases tested at the Cambridge CMPHL peaked at the end of March and in early April (Figure 3). Care home residents comprised a greater proportion of cases in late April and May than in March (Figure 3A, Table 2). This may reflect the changing profile of samples submitted to the CMPHL, as more regional hospitals had their own testing capacity and a greater number of samples were submitted from community testing organisations in later weeks. However, a similar trend was observed for patients tested at Cambridge University Hospitals, with the proportion of community-onset care home-associated cases increasing from <5% in March to a peak of 14/49 (28.6%) in mid-April (Figure 3B, Table 3). This may suggest that transmission involving care home residents took longer to decline following national lockdown (implemented on 23rd March 2020 in the UK) than transmission in the non-care home general community.

Figure 3. Epidemic curves for EoE and CUH showing care home residents.

Number of positive cases per week over the study period for different infection sources, for all samples tested from EoE at the Cambridge PHE laboratory (A), or those tested at CUH acute medical services (B). The peaks of the epidemic for samples tested at the Cambridge PHE laboratory and CUH acute medical services were in the weeks commencing 30th March and 6th April, respectively. UK lockdown started 23rd March 2020. In both settings, a prolonged right-hand ‘tail’ was observed as case numbers gradually fell. The relative proportion of cases admitted from care homes increased over this period for both sample sets, while the contribution of general community cases fell more quickly. However, interpreting these trends is confounded by the changing profile of COVID-19 testing nationally and regionally. If the patient address was missing, and they were not a HCW, then the care home status was undetermined. CAI = Community Acquired Infection; EoE = East of England; HAI = Hospital Acquired Infection; HCW = Healthcare Worker; ‘Other’ mainly comprises inpatient transfers from other hospitals to CUH for which metadata was lacking to determine the infection category. CAI was considered ‘healthcare-associated’ if there had been healthcare contact within 14 days of first positive swab. The three categories of HAI were defined based on the difference in days between admission and first positive swab, reflecting increasing likelihood of hospital acquisition: indeterminate = 3–6 days; suspected = 7–14 days; definite = >14 days (as used in Meredith et al., 2020).
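
The infection categories defined in this caption amount to a small decision rule, sketched below under stated assumptions. The function name and labels are hypothetical. The caption does not say how swabs taken fewer than 3 days after admission were handled; the sketch assumes they count as community onset, which only approximates the '<48 hr' criterion used for Table 3.

```python
def infection_category(days_admission_to_positive, healthcare_contact_14d):
    """Classify a case from the days between admission and first positive swab.

    healthcare_contact_14d: True if there was healthcare contact within
    14 days of the first positive swab.
    """
    if days_admission_to_positive > 14:
        return "HAI definite"
    if days_admission_to_positive >= 7:
        return "HAI suspected"
    if days_admission_to_positive >= 3:
        return "HAI indeterminate"
    # Assumption: fewer than 3 days from admission is treated as community onset.
    if healthcare_contact_14d:
        return "CAI, healthcare-associated"
    return "CAI"
```

For example, `infection_category(10, False)` falls in the 7–14 day band and is labelled as a suspected HAI.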

Figure 3—figure supplement 1. Care home residents per week showing genome sequencing site.

Plot shows total care home residents testing positive per week over the study period, showing number of care home residents with genomes included in the study broken down by sequencing location (on site in the Department of Pathology, Division of Virology or at the Wellcome Sanger Institute).

Table 2. Case numbers from care homes and non-care home residents per week for full dataset tested at Cambridge CMPHL.

Data plotted in Figure 3A of main text, showing case numbers for care homes, non-care homes, and undetermined, for all EoE samples tested at CMPHL. The proportion of COVID-19 cases from care home residents increased in April and May; however, this may reflect the changing profile of samples submitted to the Cambridge CMPHL rather than underlying epidemiological trends.

Week commencing | Care home resident | Not determined | Not care home resident | Weekly total | Care home resident (%)
24-Feb | 0 | 0 | <5 | <5 | 0.0%
02-Mar | 0 | 0 | 31 | 31 | 0.0%
09-Mar | 10 | 6 | 149 | 165 | 6.1%
16-Mar | 25 | 6 | 364 | 395 | 6.3%
23-Mar | 60 | 26 | 852 | 938 | 6.4%
30-Mar | 126 | 35 | 1235 | 1396 | 9.0%
06-Apr | 162 | 43 | 1064 | 1269 | 12.8%
13-Apr | 154 | 31 | 540 | 725 | 21.2%
20-Apr | 247 | 16 | 415 | 678 | 36.4%
27-Apr | 198 | 16 | 393 | 607 | 32.6%
04-May | 185 | 8 | 199 | 392 | 47.2%
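
The weekly totals and care home percentages in Table 2 can be reproduced from the three count columns. The short sketch below does this for the rows with exact counts; the 24-Feb row is left out because its values are suppressed (<5). The variable and function names are illustrative.

```python
# (week commencing, care home resident, not determined, not care home resident)
TABLE_2 = [
    ("02-Mar", 0, 0, 31),
    ("09-Mar", 10, 6, 149),
    ("16-Mar", 25, 6, 364),
    ("23-Mar", 60, 26, 852),
    ("30-Mar", 126, 35, 1235),
    ("06-Apr", 162, 43, 1064),
    ("13-Apr", 154, 31, 540),
    ("20-Apr", 247, 16, 415),
    ("27-Apr", 198, 16, 393),
    ("04-May", 185, 8, 199),
]


def weekly_share(care_home, undetermined, not_care_home):
    """Return (weekly total, care home residents as % of that total)."""
    total = care_home + undetermined + not_care_home
    return total, round(100 * care_home / total, 1)


shares = {week: weekly_share(ch, nd, nch) for week, ch, nd, nch in TABLE_2}
care_home_total = sum(ch for _, ch, _, _ in TABLE_2)

for week, (total, pct) in shares.items():
    print(f"{week}: {total} cases, {pct}% care home residents")
```

Note that the denominator is the weekly total including undetermined cases, which is the convention the published percentages follow.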

Table 3. Proportion of community acquired, care home-associated COVID-19 infections tested at Cambridge University Hospitals.

The proportion of community onset, care home-associated COVID-19 infections tested at Cambridge University Hospitals (CUH) peaked in mid to late April. Total cases shows the total number of new COVID-19 cases diagnosed at CUH that week. ‘Community acquired’ was defined as first positive test <48 hr from admission and no healthcare contact within the previous 14 days. Not showing precise values if number of patients is less than or equal to five to preserve patient anonymity.

Week | Total weekly COVID-19 cases | Community acquired, care home-associated (%)
09-Mar | 12 | 0 (0%)
16-Mar | 24 | 0 (0%)
23-Mar | 75 | <5 (<7%)
30-Mar | 96 | <5 (<5.2%)
06-Apr | 99 | 14 (14.1%)
13-Apr | 49 | 14 (28.6%)
20-Apr | 41 | 10 (24.4%)
27-Apr | 41 | 9 (22.0%)
04-May | 27 | 6 (22.2%)

Mortality of COVID-19 infections for care home and non-care home residents tested in hospital

Of the 6600 individuals with positive COVID-19 tests, 464 (7%) were patients tested at Cambridge University Hospitals. Richer metadata were available for this subset of patients via the hospital electronic records system. Seventy-two of 464 (15.5%) COVID-19 patients diagnosed at CUH were identified as care home residents (Table 1, Figure 3B), of whom <7% were admitted to the intensive care unit (ICU) and 34/72 (47.2%) died within 30 days of their first positive test (precise values not shown where the number of individuals is equal to or below five, to protect patient anonymity). In comparison, amongst non-care home residents, 84/392 (21.4%) were admitted to the ICU and 78/392 (19.9%) died within 30 days of diagnosis. In a logistic regression analysis, older age, care home residency, ICU admission, and lower diagnostic cycle threshold (Ct) values were associated with increased odds of mortality at 30 days from diagnosis (Figure 4, Table 4). The odds of mortality within 30 days of diagnosis did not differ between residents at nursing homes versus residential homes in a separate logistic regression analysis.
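
As a rough cross-check on the 30-day mortality counts quoted above, a crude (unadjusted) odds ratio for care home residency can be computed from the 2×2 table, with a Wald 95% confidence interval. This is not the multivariable logistic regression reported in Table 4, which adjusts for age, sex, ICU admission and Ct value, so the crude estimate (about 3.6) is expected to differ from the adjusted one. The function name is illustrative.

```python
from math import exp, log, sqrt


def crude_odds_ratio(died_exposed, total_exposed, died_unexposed, total_unexposed):
    """Unadjusted odds ratio with a Wald 95% confidence interval."""
    a = died_exposed                       # care home residents who died
    b = total_exposed - died_exposed       # care home residents who survived
    c = died_unexposed                     # non-care home residents who died
    d = total_unexposed - died_unexposed   # non-care home residents who survived
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - 1.96 * se_log_or)
    upper = exp(log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Counts from the text: 34/72 care home residents and 78/392 non-care home
# residents died within 30 days of diagnosis.
or_crude, ci_low, ci_high = crude_odds_ratio(34, 72, 78, 392)
print(f"Crude OR = {or_crude:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The gap between the crude and adjusted estimates is consistent with part of the excess mortality in care home residents being explained by their older age.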

Figure 4. Odds ratios for mortality at 30 days.

Logistic regression analysis showing odds of death at 30 days (with 95% confidence intervals) for five available metadata variables: patient sex, age (here categorised as >80 years), whether they were a care home resident, the diagnostic Ct value (here categorised as <20), and whether they were admitted to the intensive care unit. Overall there were 116 deaths within 30 days of diagnosis (out of 464 CUH patients). ICU = intensive care unit. Ct = Cycle threshold for diagnostic PCR.

Figure 4—figure supplement 1. Pairwise comparisons of mortality at 30 days, age and whether the person was a care home resident.

Each plot compares two of these three variables to visualise cross-associations, and the data are divided in each case into individuals that died (yellow) or survived (blue). The plot was produced using GGally::ggpairs().

Table 4. Odds ratios for mortality at 30 days.

Logistic regression analysis of odds of mortality at 30 days. Age ≥80 years, being a care home resident, being admitted to ICU and Ct <20 were significantly associated with increased odds of death at 30 days post-diagnosis (p<0.05). OR = Odds Ratio. CI = Confidence Interval. ICU = intensive care unit. Ct = Cycle threshold for diagnostic PCR.

Variable | OR | 95% CI low | 95% CI high | P value
Age >= 80 | 6.6 | 3.7 | 12.0 | 2.46E-10
Sex | 1.5 | 0.9 | 2.6 | 1.30E-01
Care resident status | 3.0 | 1.6 | 5.7 | 9.22E-04
ICU admission | 3.9 | 2.1 | 7.5 | 3.02E-05
Ct value < 20 | 2.9 | 1.6 | 5.3 | 5.04E-04

Identifying viral clusters within care homes using genomic and epidemiological data

Genome sequence data were available for 700/1167 (60.0%) care home residents from 292 care homes (Figure 3—figure supplement 1). There was a median of eight single-nucleotide polymorphisms (SNPs) separating care home genomes, compared to nine for randomly selected non-care home samples (p=0.95, Wilcoxon rank sum test) (Figure 5—figure supplement 2), similar to the EoE region described previously (Meredith et al., 2020). The proportion of viral lineage B.1.1 increased over the study period in both care home residents and non-care home residents (Figure 5, Table 5), consistent with European trends (Alm et al., 2020). With ongoing viral evolution, descendant lineages of B.1 and B.1.1 also rose in frequency and were commonly found in England during the relevant time period. This suggests that the SARS-CoV-2 lineages circulating in care homes were similar to those found across the EoE outside of care homes. Consistent with this, care home and non-care home samples were intermixed across the phylogenetic tree (Figure 6A), suggesting that the virus could pass between care homes and non-care home settings. No new viral lineages from outside the UK were observed, which may reflect the success of travel restrictions in limiting introductions of new lineages into the general population.

Figure 5. Viral lineage compositions in care home and non-care home samples.

Plots showing the ratios of SARS-CoV-2 viral lineages for 700 care home resident genomes (A) and a randomly selected subset of 700 non-care home residents (B). The proportion of lineage B.1.1 increased over the study period in both care home and non-care home residents. Lineages defined using pangolin. Data also presented in Table 5.

Figure 5—figure supplement 1. Viral lineage compositions in care home and non-care home samples by count.

Plots showing the counts of SARS-CoV-2 viral lineages for 700 care home resident genomes (A) and a randomly selected subset of 700 non-care home residents (B). Lineages defined using pangolin. Data also presented in Table 5.
Figure 5—figure supplement 2. Distribution of pairwise SNP differences between care home samples.

Pairwise SNP differences between the 700 care home residents (244,650 comparisons). There was a median of eight single nucleotide polymorphisms (SNPs) separating care home genomes (interquartile range, IQR 6–12, range 0–29), compared to 9 (IQR 5–13, range 0–28) for randomly selected non-care home samples (p=0.95, Wilcoxon rank sum test).
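
The figure of 244,650 comparisons is simply the number of unordered pairs among 700 genomes, 700 × 699 / 2. A minimal sketch of a pairwise SNP count on aligned sequences is given below. It assumes equal-length aligned genomes and ignores positions where either genome has an ambiguous base; both are simplifying assumptions, and the function names and toy sequences are illustrative rather than taken from the study.

```python
from itertools import combinations
from math import comb
from statistics import median

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both genomes have A/C/G/T and differ."""
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x in UNAMBIGUOUS and y in UNAMBIGUOUS and x != y
    )


def pairwise_snp_distances(genomes):
    """SNP distance for every unordered pair of genomes."""
    return [snp_distance(a, b) for a, b in combinations(genomes, 2)]


# Number of unordered pairs among 700 genomes.
n_pairs = comb(700, 2)

# Toy aligned sequences (not real data), including one ambiguous base.
toy = ["ACGTACGT", "ACGTACGA", "ACNTACGA"]
toy_distances = pairwise_snp_distances(toy)
toy_median = median(toy_distances)
```

Skipping ambiguous positions avoids counting low-coverage sites as differences, at the cost of slightly underestimating distances for incomplete genomes.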

Table 5. Proportion of care home and non-care home samples that were lineage B.1.1.

The proportion of lineage B.1.1 (defined using the Pangolin tool) increased from earlier to later sampling weeks, for both care home and non-care home samples. Data based on the 700 care home residents with genomic data available and 700 randomly selected non-care home samples. ‘Early’ was defined as the period from the start of the study (26th February 2020) to 29th March 2020. ‘Late’ was defined as 20th April 2020 to the end of the study (10th May 2020).

Care home status | Early | Late | Change (percentage points)
Care home resident | 6/47 (12.8%) | 155/286 (54.2%) | +41.4
Not care home resident | 39/173 (22.5%) | 50/96 (52.1%) | +29.5

Figure 6. Care home clustering on viral phylogenetic tree and within-care home pairwise SNP differences.

(A) Phylogenetic tree of 1400 East of England SARS-CoV-2 genomes rooted on a sample from Wuhan, China, collected December 2019, including 700 care home residents and 700 randomly selected non-care home residents. The colour bar (right) indicates whether samples were from care home residents (blue) or non-care home residents (grey). Samples from the 10 care homes with the largest number of genomes are highlighted by coloured circles on branch tips. A magnified subtree of the branch containing all 18 samples from care home CARE0314 is shown to the left. These genomes were all either identical or differed by one SNP from the most common genome in this cluster. Two non-care home genomes are also present in this group. Across the dataset, viruses from care home residents and people not living in care homes are phylogenetically intermixed, consistent with viral transmission between these two settings. (B) Distributions of pairwise SNP differences for the 10 care homes with the largest number of genomes (same samples as highlighted in the branch tips of panel A). Numbers above each box indicate the number of genomes present from that care home. Among the ten care homes with the largest number of genomes, some clustered closely on the phylogenetic tree with low pairwise SNP differences (e.g. CARE0063, CARE0264, CARE0314); in contrast, some care homes were distributed across the tree with higher pairwise SNP differences (e.g. CARE0061, CARE0151, CARE0173, CARE0263). Clusters within each care home were defined using integrated genomic and temporal data using the transcluster algorithm and are shown in Figure 7.

Figure 6—figure supplement 1. Phylogenetic tree of all available genomes highlighting care home and non-care home samples.

Of the 6600 individuals in the study, 1167 were identified as care home residents and 5246 were not care home residents (187 were undetermined). 700/1167 (60.0%) care home residents had genomes available that passed quality control (QC) filtering at time of analysis. Of the 5246 non-care home residents, 3745 (71.4%) had genomes available that passed the same QC filtering at time of analysis, accessed from the COG-UK public database (https://www.cogconsortium.uk/data/). This tree comprises all 700 care home and 3745 non-care home genomes from the study (total 4445 samples), rooted on a 2019 genome from Wuhan, China. As with Figure 6, the colour bar (right) indicates whether samples were from care home residents (blue) or non-care home residents (grey). Samples from the ten care homes with the largest number of genomes are highlighted by coloured circles on branch tips. This supports the findings shown in Figure 6 using the randomly selected sub-sample of non-care home samples: (1) that care home genomes were phylogenetically intermixed with non-care home genomes (consistent with transmission between care homes and outside of care homes) and (2) that, using the 10 care homes with the largest number of samples as examples, some care homes were monophyletic (such as CARE0314) while others were polyphyletic (such as CARE0061). Even for polyphyletic care homes (implying multiple independent introductions of the virus among residents), the majority of samples were usually attributable to a single dominant cluster (described further in main text).

The 10 care homes with the largest number of genomes (top ~3%) contained 102/700 (14.6%) of all samples with genomic data available. For several of these 10 care homes, all cases clustered closely together on a phylogenetic tree with zero or one pairwise SNP differences, consistent with a single ‘outbreak’ spreading within the care home (where an outbreak is defined as two or more cases linked in time or place; McAuslane and Morgan, 2014) (Figure 6 and Figure 6—figure supplement 1). By contrast, several care homes were ‘polyphyletic’, with cases distributed across the phylogenetic tree and higher pairwise SNP difference counts between samples, consistent with multiple independent introductions of the virus among residents.

The probability of two cases having linked transmission in an epidemiologically meaningful timeframe (for example direct transmission or within one or two intermediate hosts – likely the maximum practical limit for investigating the source of infection for a positive case) is a function of several factors. These include the pairwise genetic differences between viruses and their phylogenetic relatedness, the time difference between cases, and the opportunities for infection between people (for example, the frequency, duration and extent of close contact). For this continuous probability distribution, a pragmatic cut-off was used of >15% likelihood that samples were connected by <2 intermediate hosts, using a previously published algorithm called transcluster (Stimson et al., 2019), adjusted for SARS-CoV-2 (Materials and methods). Each care home was considered as a separate microcosm of transmission and the number of viral clusters per care home was estimated, with separate clusters implying distinct acquisition events among residents.
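The thresholding step described above can be sketched in a few lines. This is a minimal illustration and not the transcluster implementation: the pairwise probabilities below are made-up inputs (in the study they are produced by transcluster), the function name `clusters_per_home` is hypothetical, and the grouping rule (any link above the threshold joins two cases into one cluster) is an assumption made for illustration.

```python
from collections import defaultdict


def clusters_per_home(home_of, pair_prob, threshold=0.15):
    """Group cases into clusters separately within each care home.

    home_of:   dict mapping case id -> care home code
    pair_prob: dict mapping (case_a, case_b) -> probability that the two
               cases are linked within the allowed number of intermediate hosts
    Cases from the same home are joined when their probability exceeds
    `threshold`; clusters are the connected components of those links.
    """
    parent = {case: case for case in home_of}

    def find(case):
        # Follow parent pointers to the cluster representative.
        while parent[case] != case:
            parent[case] = parent[parent[case]]
            case = parent[case]
        return case

    for (a, b), prob in pair_prob.items():
        # Links between different care homes are ignored by design.
        if home_of[a] == home_of[b] and prob > threshold:
            parent[find(a)] = find(b)

    members = defaultdict(list)
    for case in home_of:
        members[(home_of[case], find(case))].append(case)

    result = defaultdict(list)
    for (home, _root), cases in members.items():
        result[home].append(sorted(cases))
    return {home: sorted(groups) for home, groups in result.items()}


# Hypothetical toy data: five residents in two care homes.
home_of = {"r1": "HOME_A", "r2": "HOME_A", "r3": "HOME_A",
           "r4": "HOME_B", "r5": "HOME_B"}
pair_prob = {("r1", "r2"): 0.62, ("r2", "r3"): 0.04, ("r1", "r3"): 0.02,
             ("r4", "r5"): 0.31, ("r3", "r4"): 0.90}

clusters = clusters_per_home(home_of, pair_prob)
print(clusters)
```

In this toy input the strong `r3`/`r4` link is discarded because the two residents live in different care homes, mirroring the decision to treat each care home as a separate microcosm of transmission.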

This clustering method identified 409 transmission clusters from 292 care homes (median one cluster per care home, range 1–4). Within each cluster, 673/775 (86.8%) of pairwise links had zero or one pairwise SNP differences (maximum 4), and 756/775 (97.5%) were sampled <14 days apart (maximum 22 days) (Figure 7—figure supplement 4–5). Clusters spanned a narrower range of sampling dates than the total cases within each care home, as expected (Figure 7—figure supplement 6). For the 170/292 (58%) care homes with two or more cases with genomic data (578 individuals), there was a median of 9 (IQR: 4–15) days from the first case to the last case within each care home, up to a maximum of 50 days. In contrast, a larger proportion of clusters than of care homes comprised only a single individual, and for the 133/409 (33%) clusters with two or more cases with genomic data (424 individuals), there was a median of 5 (IQR: 1–11) days from the first case to the last case within each cluster, up to a maximum of 22 days (p<10⁻⁵, Wilcoxon rank sum test comparing date differences for care homes vs clusters with two or more samples; comparison shown in Figure 7—figure supplement 6). The median and interquartile range for pairwise date differences between all samples within each cluster are shown in Figure 7—figure supplement 7, and the date ranges for all care homes and clusters are in Supplementary Materials.

Transmission networks for the ten care homes with the largest number of genomes are shown in Figure 7A, indicating linked transmission clusters among residents based on the model assumptions and probability threshold (full dataset shown in Figure 7—figure supplement 1). Consistent with the phylogeny shown in Figure 6A, some care homes contained a single transmission cluster involving multiple cases (e.g. CARE0314), while others comprised multiple independent clusters (e.g. CARE0061) (Table 6). While care homes frequently had more than one introduction of the virus among residents (i.e. >1 cluster), there was typically a single dominant cluster responsible for the majority of cases within each care home. Of the 170 care homes with two or more residents with genomic data (comprising 578/700 (82.6%) care home residents with genomic data), 111/170 (65.3%) had a dominant cluster responsible for >50% of all cases in the care home. This rises to 74/90 (82.2%) of care homes with three or more residents with genomic data.
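The dominant-cluster calculation can be reproduced from the per-cluster sample counts reported in Table 6. The sketch below is illustrative only; the function name `dominant_cluster_share` is hypothetical, and the cluster sizes are transcribed from the ‘Cluster date range’ column of Table 6 for the ten largest care homes.

```python
def dominant_cluster_share(cluster_sizes):
    """Fraction of a care home's cases that fall in its largest cluster."""
    return max(cluster_sizes) / sum(cluster_sizes)


# Cluster sizes (n per cluster) transcribed from Table 6.
table6_cluster_sizes = {
    "CARE0032": [1, 6],
    "CARE0061": [1, 7, 1, 1],
    "CARE0063": [11, 1],
    "CARE0097": [1, 6],
    "CARE0151": [1, 4, 1, 1],
    "CARE0173": [1, 3, 3],
    "CARE0263": [9, 2, 1],
    "CARE0264": [9],
    "CARE0277": [12, 1],
    "CARE0314": [18],
}

shares = {home: dominant_cluster_share(sizes)
          for home, sizes in table6_cluster_sizes.items()}

# Care homes where one cluster accounts for more than half of all cases.
majority_homes = sorted(home for home, share in shares.items() if share > 0.5)

total_samples = sum(sum(sizes) for sizes in table6_cluster_sizes.values())
print(total_samples, len(majority_homes))
```

Among these ten care homes, only CARE0173 (3/7 cases in its largest cluster) lacks a cluster responsible for more than 50% of cases.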

Figure 7. Visualisations of SARS-CoV-2 clusters among care home residents.

Transmission networks were produced using a derivative of the transcluster algorithm, which incorporates pairwise date and genetic differences to estimate the probability of cases being connected within a defined number of intermediate hosts. Clusters were defined using a probability threshold of >15% for cases being linked by <2 intermediate hosts (further details in Materials and methods). (A) Transmission clusters for the ten care homes with the largest number of care home residents with available genomes. Consistent with Figure 6, several of the 10 care homes with the largest number of genomes comprised single transmission clusters (e.g. CARE0314), while others contained two or more clusters consistent with multiple independent transmission sources among the residents. These data alone do not indicate where the residents acquired their infections, and hospital-acquired infections for some of the clusters are a possibility alongside multiple introductions into the same care homes. (B) Visualisation of transmission links between residents of two nearby care homes and a group of healthcare workers (HCW). Two care homes, CARE0063 (blue) and CARE0273 (orange), each had strong transmission links identified with the transcluster algorithm to a group of HCW (green). The HCW comprised paramedics and care home carers – one working at CARE0063 and the other working at an unknown care home. We do not have confirmatory epidemiological data available, but this raises the possibility of the cases sharing a linked transmission network.


Figure 7—figure supplement 1. Transmission network diagrams for all care homes with two or more cases with genomic data.


Transmission networks were produced using a derivative of the transcluster algorithm, which incorporates pairwise date and genetic differences to estimate the probability of cases being connected within a defined number of intermediate hosts. Clusters were defined using a probability threshold of >15% for cases being linked by <2 intermediate hosts (further details in Materials and methods). This figure displays data from all care homes with two or more samples with genomic data.
Figure 7—figure supplement 2. Histogram of pairwise transmission probabilities between care home samples.


Histogram of the pairwise probabilities for cases being connected by <2 intermediate hosts for all 700 care home residents as inferred by the transcluster algorithm, with vertical red line at 0.15 showing the cutoff used to identify care home clusters in our analysis. Note the data gaps along the x-axis reflect the inherent discontinuity of the input datasets, measured in days and SNP differences between cases.
Figure 7—figure supplement 3. Transmission probability threshold vs number of care home clusters.


The transcluster algorithm computes the likelihood of two samples being linked within a given number of intermediate hosts, based on the date and genetic differences between samples (assuming a given serial interval and mutation rate, further details in Materials and methods). Changing the probability threshold used to define clusters changes the number of clusters defined, with a higher threshold yielding more clusters (and higher likelihood of transmission within each cluster). The dataset analysed contained 700 genomes from residents in 292 care homes, and we treated each care home separately as a microcosm of potential infection networks. Therefore, the highest theoretical number of clusters is 700, if every genome were its own cluster; and the lowest possible number of clusters is 292, if every person within each care home was part of the same cluster. The cut-off used (>15% probability of transmission with <2 intermediate hosts) is indicated by the red vertical line. This is arbitrary, and was selected (1) because the distribution of pairwise SNP and date differences within resulting clusters appeared reasonable (Figure 7—figure supplements 4 and 5) and (2) because of a ‘jump’ in the number of clusters occurring at that point.
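The relationship between threshold and cluster count can be checked on a toy example. The sketch below is not the transcluster code: the pairwise probabilities are invented, the function name `count_clusters` is hypothetical, and clusters are assumed to be the connected components of within-home links whose probability exceeds the threshold.

```python
def count_clusters(home_of, pair_prob, threshold):
    """Total number of clusters when only same-home links with
    probability above `threshold` are kept."""
    neighbours = {case: set() for case in home_of}
    for (a, b), prob in pair_prob.items():
        if home_of[a] == home_of[b] and prob > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, n_clusters = set(), 0
    for start in home_of:
        if start in seen:
            continue
        n_clusters += 1          # a new, previously unvisited component
        stack = [start]
        while stack:
            case = stack.pop()
            if case not in seen:
                seen.add(case)
                stack.extend(neighbours[case] - seen)
    return n_clusters


# Hypothetical toy data: five genomes from two care homes.
home_of = {"r1": "HOME_A", "r2": "HOME_A", "r3": "HOME_A",
           "r4": "HOME_B", "r5": "HOME_B"}
pair_prob = {("r1", "r2"): 0.62, ("r2", "r3"): 0.04, ("r1", "r3"): 0.02,
             ("r4", "r5"): 0.31, ("r3", "r4"): 0.90}

thresholds = [0.0, 0.15, 0.5, 0.99]
counts = [count_clusters(home_of, pair_prob, t) for t in thresholds]

n_homes = len(set(home_of.values()))   # lower bound on the cluster count
n_genomes = len(home_of)               # upper bound on the cluster count
print(list(zip(thresholds, counts)))
```

The counts never decrease as the threshold rises and stay between the number of care homes and the number of genomes, which is the toy analogue of the 292 to 700 range described above.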
Figure 7—figure supplement 4. Pairwise SNP difference distribution between samples within clusters.


Within each cluster, 673/775 (86.8%) of pairwise links that had a >15% probability of transmission with <2 intermediate hosts had zero or one pairwise SNP differences (maximum 4).
Figure 7—figure supplement 5. Pairwise date difference distribution between samples within clusters, aggregated across dataset.


Within each cluster, 756/775 (97.5%) of pairwise links that had a >15% probability of transmission with <2 intermediate hosts were sampled <14 days apart (maximum 22 days).
Figure 7—figure supplement 6. Distributions of date ranges (from first to last sampling dates) for care homes vs clusters.


Date ranges were calculated by subtracting the date of the first sample from the last sample for each care home (left) or cluster (right). Care homes and clusters were only included in this analysis if there were two or more samples with available genomic data in that care home or cluster. Of 292, 170 (58%) care homes had two or more cases with genomic data (578 individuals), compared with 133/409 (33%) clusters (424 individuals). Using these datasets, there was a median of 9 days (IQR: 4–15, range: 0–50) from the first case to the last case within each care home, compared with 5 days (IQR: 1–11, range: 0–22) from the first case to the last case within each cluster (p=9.2e-06, Wilcoxon rank sum test). As expected, the transcluster algorithm produces clusters with a narrower date range between samples than for the care homes as a whole. Collection date was used for sample dates; if collection date was missing then receive date in the laboratory was used instead.
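The date-range calculation, including the fallback from collection date to laboratory receive date, can be sketched as follows. The record layout and the field names `collection_date` and `receive_date` are hypothetical, as are the example records; only the rule itself is taken from the caption above.

```python
from datetime import date


def sample_date(record):
    """Use the collection date if known, otherwise the laboratory receive date."""
    return record["collection_date"] or record["receive_date"]


def date_range_days(records):
    """Days from the first to the last sample in a care home or cluster."""
    dates = [sample_date(record) for record in records]
    return (max(dates) - min(dates)).days


# Hypothetical records for one care home; the second has no collection date.
records = [
    {"collection_date": date(2020, 4, 1), "receive_date": date(2020, 4, 2)},
    {"collection_date": None, "receive_date": date(2020, 4, 6)},
    {"collection_date": date(2020, 4, 10), "receive_date": date(2020, 4, 11)},
]

# Only groups with two or more samples have a meaningful date range.
if len(records) >= 2:
    range_days = date_range_days(records)
    print(range_days)
```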
Figure 7—figure supplement 7. Pairwise date difference distribution between samples within each cluster.


Boxplots indicate the median and interquartile ranges for the number of days separating samples found to be within the same transmission cluster by the transcluster algorithm. The boxplots are overlaid with points representing the underlying transmission links. Larger points are used to represent cases where many transmission links within a cluster are separated by the same number of days.

Table 6. Outbreak characteristics for 10 care homes with the largest number of SARS-CoV-2 genomes.

Epidemiological characteristics of the 10 care homes with the largest number of genomes are shown. Collectively these comprised 102 cases (102/700 (14.6%) of the total number of care home cases with genomic data available). ‘Cluster count’ refers to the number of SARS-CoV-2 clusters within each care home defined by transcluster (described in Materials and methods and main text). ‘Major cluster count’ shows the count for the dominant cluster (with the largest number of cases) and its percentage contribution to total case numbers for each care home. ‘Care home date range’ indicates the number of days from first sample to last sample date for residents from each care home. ‘Cluster date range’ indicates the number of days from first sample to last sample date for residents from each cluster within that care home, as defined by the transcluster algorithm, also showing the sample count (n) for each cluster. Sampling dates used collection date if known, or receive date in the diagnostic laboratory if collection date was unknown. The date range for each care home is typically larger than the date range for clusters within care homes, except for single-cluster care homes like CARE0314. This is consistent with the transcluster algorithm defining groups of cases occurring closer together in time. While the care homes frequently had more than one introduction of the virus among residents (i.e. >1 cluster), there was usually a single dominant cluster responsible for the majority of cases. Individual counts of males and females for each care home are not shown as this generally gave counts of less than five, risking patient anonymity. Overall, there were 59/102 (57.8%) females for these 10 care homes.

| Care home code | Sample count | Age (median, IQR, range) | Ct values (median, IQR, range) | Cluster count | Major cluster count | Care home date range (days) | Cluster date range (days, sample count) |
|---|---|---|---|---|---|---|---|
| CARE0032 | 7 | 87 (IQR: 81–91, range: 56–93) | 23 (IQR: 22–24, range: 14–26) | 2 | 6/7 (85.7%) | 39 | 0 days, n = 1; 10 days, n = 6 |
| CARE0061 | 10 | 88.5 (IQR: 87–92.2, range: 84–97) | 23 (IQR: 21.2–26.5, range: 12–33) | 4 | 7/10 (70%) | 38 | 0 days, n = 1; 22 days, n = 7; 0 days, n = 1; 0 days, n = 1 |
| CARE0063 | 12 | 74.5 (IQR: 67.8–81, range: 42–94) | 23 (IQR: 20.8–27, range: 14–30) | 2 | 11/12 (91.7%) | 21 | 18 days, n = 11; 0 days, n = 1 |
| CARE0097 | 7 | 90 (IQR: 82.5–92, range: 73–95) | 23 (IQR: 20.5–24, range: 17–27) | 2 | 6/7 (85.7%) | 28 | 0 days, n = 1; 14 days, n = 6 |
| CARE0151 | 7 | 81 (IQR: 77–89, range: 69–96) | 20 (IQR: 19–25.5, range: 17–30) | 4 | 4/7 (57.1%) | 20 | 0 days, n = 1; 0 days, n = 4; 0 days, n = 1; 0 days, n = 1 |
| CARE0173 | 7 | 81 (IQR: 77.5–94, range: 71–95) | 19 (IQR: 17.5–26, range: 15–27) | 3 | 3/7 (42.9%) | 21 | 0 days, n = 1; 3 days, n = 3; 0 days, n = 3 |
| CARE0263 | 12 | 85.5 (IQR: 81.8–90.5, range: 69–97) | 19.5 (IQR: 18.5–24.8, range: 14–29) | 3 | 9/12 (75%) | 3 | 3 days, n = 9; 0 days, n = 2; 0 days, n = 1 |
| CARE0264 | 9 | 91 (IQR: 82–95, range: 73–96) | 26 (IQR: 25–27, range: 18–29) | 1 | 9/9 (100%) | 14 | 14 days, n = 9 |
| CARE0277 | 13 | 84 (IQR: 82–89, range: 71–94) | 26 (IQR: 24–27, range: 23–29) | 2 | 12/13 (92.3%) | 13 | 13 days, n = 12; 0 days, n = 1 |
| CARE0314 | 18 | 87.5 (IQR: 81.2–90.8, range: 74–97) | 24 (IQR: 22.2–26, range: 14–29) | 1 | 18/18 (100%) | 5 | 5 days, n = 18 |

The contribution made by genomic data in defining care home clusters was quantified. Without genomic data (or access to more detailed epidemiology such as accommodation sub-structuring within care homes), clustering can only be based on temporal differences between cases. For example, if two groups of COVID-19 cases occur several months apart within a care home they could be inferred to have resulted from (at least) two separate introductions. However, this method cannot account for multiple introductions occurring around the same time, as may happen when community transmission is high. To quantify the impact made by adding genomic data, which can distinguish between genetically dissimilar viruses introduced at similar times, the transcluster algorithm was repeated using the same parameters as for the main analysis but assuming all genomes were identical. This yielded 316 clusters – 23% fewer than the 409 clusters yielded when incorporating genomics. This suggests that genomics makes a significant contribution to defining viral clusters; without genomic data, cluster sizes may be over-estimated and the number of separate viral introductions under-estimated. This is illustrated by care home CARE0263, in which all 12 residents tested positive within 3 days of each other, but these are divided into three separate clusters by the transcluster algorithm (one dominant cluster of nine cases, one cluster of two cases and a single separate case; Table 6). This is consistent with the phylogeny shown in Figure 6A, with samples split into three branches along the tree. Without genomic data, the three clusters in CARE0263 would have been impossible to distinguish.
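The effect of discarding genomic information can be illustrated with a toy care home modelled loosely on CARE0263. Everything in this sketch is an assumption made for illustration: the sampling days and variant sets are invented, and the linking rule (at most `max_days` apart and at most `max_snps` differences) is a simple stand-in for the transcluster probability, not the method used in the study.

```python
def cluster_sizes(cases, max_days=14, max_snps=1, ignore_genomes=False):
    """Cluster sizes (largest first) under a stand-in linking rule.

    cases: list of (sampling_day, frozenset_of_variant_positions)
    With ignore_genomes=True every pair is treated as genetically identical,
    so only the time difference can separate cases.
    """
    parent = list(range(len(cases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (day_i, geno_i) in enumerate(cases):
        for j in range(i + 1, len(cases)):
            day_j, geno_j = cases[j]
            # SNP distance = number of positions at which the genomes differ.
            snps = 0 if ignore_genomes else len(geno_i ^ geno_j)
            if abs(day_i - day_j) <= max_days and snps <= max_snps:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(cases)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Hypothetical genotypes: three genetically distinct groups sampled within 3 days.
group_a = frozenset()
group_b = frozenset({101, 202, 303})
group_c = frozenset({404, 505, 606, 707})
cases = ([(day % 4, group_a) for day in range(9)]
         + [(1, group_b), (2, group_b)]
         + [(3, group_c)])

with_genomes = cluster_sizes(cases)
without_genomes = cluster_sizes(cases, ignore_genomes=True)

# Relative reduction in cluster count reported in the text (409 vs 316).
reduction = (409 - 316) / 409
print(with_genomes, without_genomes, round(100 * reduction))
```

With genetic distances the twelve toy cases split into groups of nine, two and one; treating all genomes as identical merges them into a single cluster, which is the over-estimation of cluster size described above.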

Links between care homes and hospitals

Links between care homes and hospitals were investigated for the 700 care home residents with genomic data available. Of 700, 694 (99%) care home residents with genomic data had NHS numbers available, which were linked to national hospital admissions data (Materials and methods) (Table 7). Of 694, 470 (67.7%) care home residents had at least one hospital admission within the study period, and 398/694 (57.3%) were deemed to have been admitted to hospital with COVID-19 (i.e. their first positive sample was taken within 2 days prior to admission up to 7 days post-admission). Forty of 694 (5.8%) cases were categorised as suspected hospital-acquired COVID-19 infections, defined as first positive test being 7 days or more after their hospital admission date and prior to their discharge date (N = 13) or within 7 days following their hospital discharge (N = 27) (Table 7). Of 694, 230 (33.1%) individuals were discharged from hospital within 7 days of their first positive test, and thus could potentially have been infectious at the time of hospital discharge (Byrne et al., 2020).

Table 7. Hospitalisation data for the 700 care home residents with genomic data available.

700/1167 (60.0%) care home residents identified in the study had genomic data available and were used to define care home SARS-CoV-2 clusters. We investigated the proportions of these care home residents that were hospitalised and may have acquired their infections through interactions with hospitals. This was possible for 694/700 (99.1%) individuals who had NHS numbers documented that could be linked with national hospitalisation data. Being hospitalised due to COVID-19 was defined as the date of first positive sampling being within 2 days prior to admission up to 7 days post-admission. Suspected hospital-acquired COVID-19 infections were defined as first positive test being 7 days or more after hospital admission date and prior to discharge date (N = 13) or within 7 days following hospital discharge (N = 27). Of the latter group, 10 individuals were admitted to hospital and discharged on the same day prior to their positive test, nine were admitted for 1–7 days, and eight had been admitted for greater than 7 days.

| Category | Counts (%) |
|---|---|
| Care home residents with genomic data | 700 |
| Care home residents with genomic data that could be linked to hospitalisation data | 694/700 (99.1%) |
| Hospitalised during study period | 470/694 (67.7%) |
| Hospitalised due to COVID-19 | 398/694 (57.3%) |
| Suspected hospital-acquired COVID-19 | 40/694 (5.76%) |
| Discharged within 7 days of positive test | 230/694 (33.1%) |
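The hospitalisation categories can be expressed as a small decision rule. The sketch below is an interpretation rather than the study code: it handles a single admission per person, the function name `classify_admission` and the category labels are hypothetical, and two boundary choices are assumptions because the stated definitions overlap (a first positive test exactly 7 days after admission is treated as suspected hospital-acquired, and a test on the day of discharge is treated as taken during the stay).

```python
from datetime import date


def classify_admission(first_positive, admission, discharge):
    """Classify one admission relative to the first positive test date."""
    days_from_admission = (first_positive - admission).days
    days_from_discharge = (first_positive - discharge).days

    if first_positive <= discharge:
        # Positive before or during the stay.
        if -2 <= days_from_admission < 7:
            return "admitted_with_covid"
        if days_from_admission >= 7:
            return "suspected_hospital_acquired"
        return "other"  # positive more than 2 days before admission

    # Positive after discharge.
    if days_from_discharge <= 7:
        return "suspected_hospital_acquired"
    return "other"


# Hypothetical admission used for the examples below.
admission, discharge = date(2020, 4, 10), date(2020, 4, 20)

examples = {
    "day_before_admission": classify_admission(date(2020, 4, 9), admission, discharge),
    "eight_days_into_stay": classify_admission(date(2020, 4, 18), admission, discharge),
    "three_days_after_same_day_discharge": classify_admission(
        date(2020, 4, 13), date(2020, 4, 10), date(2020, 4, 10)),
    "long_after_discharge": classify_admission(date(2020, 5, 10), admission, discharge),
}
print(examples)
```

Treating a positive test shortly after a same-day admission and discharge as suspected hospital-acquired follows the caption of Table 7, which counts ten such individuals in the post-discharge group.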

Viral clusters linking care home residents and healthcare workers

Potential transmission networks involving care home residents and healthcare workers (HCW) were investigated for people tested at CUH (HCW data were not available outside of CUH). This analysis comprised 54 care home residents tested at CUH and 76 HCW with genomic data available. Clusters were defined using the same method as for the care home resident analysis (described above), but allowing HCW to belong to clusters from multiple care homes, so residents from several care homes could be linked to the same HCW. 38/54 (70.4%) care home residents had possible links with HCW using this relaxed threshold. However, on review of the medical records we could only identify strong epidemiological links for 14/54 (25.9%) residents from two care home clusters, CARE0063 and CARE0114. The CARE0063 cluster has been described previously (Meredith et al., 2020) and includes care home residents, a carer from that same care home and another from an unknown care home, paramedics and people living with the above. The CARE0114 cluster comprises several care home residents and acute medical staff working at CUH who cared for at least one of the residents. The transcluster method does not assign probabilities for directionality of transmission and cannot determine precise person-to-person transmission chains. While all residents from a care home cluster may link to a given HCW, in reality the resident-HCW transmission event may have only involved one of the residents from that cluster, so the proportion of residents with links to HCW may be inflated. Nonetheless, these data show that two care home clusters involved HCW, one based mainly in the community and the other with hospital-based staff at CUH.

Residents from a third care home, CARE0273, also had strong transmission links to the paramedics and carers involved in the CARE0063 cluster. These two care homes are within 1 km of each other and the cases cluster together on the phylogenetic tree, raising the possibility of shared transmission between them. A plausible transmission network connecting the residents at these two care homes and the shared HCWs could be made with at most zero SNPs and 3 days between sampled cases (Figure 7B); these links are in the top 1.1% of all pairwise transmission probabilities inferred using the transcluster algorithm. However, without confirmatory epidemiological data this interpretation remains speculative.

Discussion

The genomic epidemiology of SARS-CoV-2 in care homes in the East of England was investigated. Care home residents comprised a large fraction of COVID-19 diagnoses in the ‘first wave’ of the pandemic in this region: up to a quarter of patients in the peak weeks of late March and early April tested at CUH were admitted from care homes. Older age and being from a care home were correlated with each other and were both associated with significantly increased odds of mortality within 30 days of diagnosis. Care home residents thus bore a high burden of COVID-19 infections and mortality.

A smaller proportion of care home residents were admitted to ICU compared with people who were not from care homes. What treatments a patient receives, including the invasive treatments provided in intensive care, are complex and individualised decisions based on risk-benefit assessments involving patients, their families and carers, and healthcare professionals (ICS, 2020; NICE, 2020). Of note, non-invasive respiratory support (such as continuous positive airway pressure, high-flow nasal oxygen therapy and non-invasive ventilation) is routinely provided outside ICU in many UK centres. Despite care home residents being at higher risk of severe COVID-19, and being under-represented in ICU, admission to ICU was still correlated with significantly increased mortality. This is likely because patients admitted to ICU have more severe disease, typically requiring more intensive treatments such as organ support.

Viral clusters were defined within each care home by integrating temporal and genetic differences between cases. This provides a ‘high resolution’ picture of viral transmission; without genomic data, separate introductions of the virus occurring around the same time are impossible to distinguish. Care homes frequently experienced ‘outbreaks’ of multiple cases within clusters (the largest of which had >10 residents), consistent with substantial person-to-person transmission taking place within care homes. Care homes also frequently had multiple distinct clusters (up to 4), consistent with independent acquisitions of COVID-19 among residents – however, a single dominant cluster usually comprised the majority of samples within each care home. The majority of care home residents in the genomic analysis did not acquire COVID-19 in hospital. In the context of a national lockdown, the most likely location they acquired their infection was the care home. The high frequency of care home outbreaks may reflect the underlying vulnerability of this population to COVID-19 and the challenges of infection control in care homes. In contrast, the UK as a whole had an average of 2.37 people per household (Office for National Statistics, 2019a), and in the East region only 2.2% of households were made up of two or more unrelated adults (6.2% in London) (Office for National Statistics, 2019b).

These findings emphasise the importance of limiting viral transmission within care homes in order to prevent outbreaks. Given there is increasing evidence for asymptomatic and presymptomatic transmission of SARS-CoV-2 (Arons et al., 2020; Goldberg et al., 2021; He et al., 2020), isolating residents or staff when they develop symptoms is not sufficient to prevent within-care home spread once the virus has entered the care home. Certain measures may be required on an ongoing basis within care homes when there is sustained community transmission, even when no outbreak is suspected (at least until the morbidity and mortality of the virus in older people has been reduced substantially through vaccination or treatments). These may include use of appropriate Personal Protective Equipment (PPE) for staff and visitors (including visiting healthcare professionals and friends and family), rigorous hand hygiene, social distancing, and making use of larger, well-ventilated rooms for social interactions or socialising outdoors, providing that this is practical and safe (Jones et al., 2020b). This is consistent with current national guidance for care homes in England (Public Health England, 2020c; UK government, 2020b). Face coverings for residents themselves when interacting socially in communal indoor areas could be considered, if acceptable to residents.

The majority of residents had hospital contact during the study period, indicating substantial opportunity for infections to pass between care homes and hospitals in either direction. A third of patients were discharged from hospital within 7 days of their first positive test, and thus were potentially infectious at discharge. We identified transmission clusters that would be consistent with COVID-19 spread between care home residents and HCW, based both in the community and in hospitals. A previous study found that working across different homes was associated with higher SARS-CoV-2 positivity among staff (Ladhani et al., 2020). Limiting the spread of COVID-19 between care home residents, HCW and hospitals is therefore another key target for infection control and prevention.

There are several limitations to this study. First, not all of the COVID-19 cases from the East of England have been included. Serology data suggest that 10.5% of all residents in care homes for people aged 65 and older in England had been infected with SARS-CoV-2 by early June, the majority of whom were asymptomatic (UK government, 2020c). The Cambridge CMPHL did not receive all the samples tested from the region; national data indicate around half of the COVID-19 cases reported from EoE during the study were included. Viral sequence data were not available for 40% of care home residents, as a result of missing samples, mismatches between sequences and metadata, genomes not passing quality control filtering using a stringent threshold (<10% missing calls), or sequences being unavailable at the time of data extraction. Viral cluster sizes may therefore be underestimated.

Second, the nature of diagnostic testing sites changed during the study period as regional hospitals developed their own in-house testing capacity and community testing laboratories were set up. ‘Pillar 2’ testing in the UK was outsourced to high-throughput laboratories during April 2020, and these laboratories performed an increasing proportion of community testing. It is possible that some care home residents from the same care home could have been tested through different routes, with symptomatic cases more likely to be tested in ‘Pillar 1’ via the CMPHL (and included in this dataset), and asymptomatic screening occurring more via the ‘Pillar 2’ laboratories. However, most care homes in EoE only began systematic screening after the end of our study following the introduction of the UK care home testing portal on 11th May 2020. Moreover, the transcluster algorithm allows for ‘missing links’ within a cluster (the threshold used assumed a 15% probability of infections being linked within <2 intermediate hosts), reducing the impact of missing care home cases on defined clusters. The changing profile of COVID-19 testing in the UK between March and May 2020 should therefore be factored into all interpretations of COVID-19 epidemiology from that period.

Third, defining who is a care home resident from large electronic healthcare records is challenging and, despite substantial efforts (described in Materials and methods), some care home residents may have been missed. Using pre-defined coding such as care home CQC registration numbers when patients are booked into hospital systems, rather than free-text data entry, would help considerably with care home surveillance. Multiple rounds of electronic searches and manual inspection were undertaken to identify as many care home residents as possible, and every care home resident included was cross-referenced against a CQC database of registered care homes in England. The care homes included for analysis should therefore be accurate.

Fourth, low viral sequence diversity limits the power of genomics to infer transmission clusters. Between-care home transmission was not investigated specifically because, unlike within-care home cases, opportunities for transfer of SARS-CoV-2 between care homes cannot be assumed or inferred from the data. This could be assessed in a dedicated prospective study gathering epidemiological data on between-care home contacts. Even within care homes, it is possible some genetically similar viruses are from unconnected introduction events. However, incorporating genomic data is more accurate for excluding linked transmission than if only temporal data are available. Genomics can thus be used to ‘rule out’ cases as being part of a linked cluster if the genetic difference is greater than would be expected given the viral mutation rate. This could be practically informative for care homes (along with other organisations at risk of COVID-19 outbreaks, such as factories; Middleton et al., 2020), with implications for infection control procedures. Directionality of person-to-person transmission cannot be inferred from the transcluster algorithm. Inferring the likelihood of transmission direction between pairs of individuals requires integration with multiple forms of epidemiological data, yielding a probabilistic estimate (Illingworth et al., 2020).

In conclusion, care homes represent a major burden of COVID-19 morbidity and mortality, with transmission events introducing SARS-CoV-2 into care homes and subsequent transmission within them. Genomic data can be used in outbreak investigations to define viral clusters; this is critically dependent on integration with epidemiological data. The cut-offs we used for defining care home clusters were pragmatic but plausible given current understanding of the biology and epidemiology of SARS-CoV-2. Such cut-offs can be helpful for producing understandable outputs for biological and public health interpretation (MacFadden et al., 2018; Stimson et al., 2019), and for focusing investigations with limited public health resources. Future work will need to prospectively integrate genomic and epidemiological data to rapidly identify viral clusters, thus enabling deployment of infection control and public health interventions in real time.

Materials and methods

Study overview

Data were collected on SARS-CoV-2-positive samples from the East of England, tested at the PHE CMPHL in Cambridge, between 26th February and 10th May 2020. The CMPHL is a PHE diagnostic laboratory that receives samples from across the East of England. The East of England is one of nine official regions in England. In the 2011 census, it had a population of 5,847,000, one of the fastest growing populations in England and Wales and the fourth largest population of the nine official regions (Office for National Statistics, 2011). The most populous cities include Luton, Norwich, Southend-on-Sea, and Peterborough (City Population, 2020). The 10th May was selected as a study end-date because it encompassed the bulk of the ‘first wave’ of the epidemic in the East of England. Furthermore, prior to the 11th May 2020, systematic screening of all residents within care homes was much less common and testing primarily occurred where there was a suspicion of an outbreak. The UK government launched a national care home testing portal on 11th May 2020 (UK government, 2020d), in which all care home staff and residents were eligible for testing with priority for homes caring for people aged 65 years or older. Ending the study on 10th May reduces the risk of bias which may be introduced by uneven systematic screening, for example when comparing the population genetics of care home and non-care home samples, if care homes undergo screening while non-care home settings do not. During the study period, the scope of testing in hospital, community, and care home settings changed several times, as eligibility criteria were modified (Figure 1—figure supplement 1). When interpreting trends in COVID-19 cases in the UK during this period it is essential to consider the changing capacity and policies surrounding testing.

Diagnostic testing, metadata collection, and genome sequencing

For details on diagnostic testing, patient metadata collection, and nanopore genome sequencing see Meredith et al., 2020. Briefly, CMPHL used an in-house generated and validated one-step RT-qPCR assay detecting a 222 bp region of the RdRp gene, along with an MS2 bacteriophage internal extraction control, using the Rotorgene PCR instrument. Samples that generated a Ct value <36 were considered positive. The study aimed to sequence all samples that tested SARS-CoV-2 PCR positive at the CMPHL during the study period. Sequencing of every positive diagnostic sample could not be performed, however, for the following reasons: (i) sample unavailability (e.g. diagnostic samples being lost or discarded before they could be collected by the sequencing team); (ii) labelling errors when assigning sequencing codes (which resulted in specimens being discarded); or (iii) metadata mismatches (if the sample did not match to a metadata record downloaded from the hospital electronic patient records system). Samples were either sequenced on site using Oxford Nanopore Technologies or transported to the Wellcome Sanger Institute for Illumina sequencing.

Samples from Cambridge University Hospitals NHS Foundation Trust (CUH) and a selection of East of England (EoE) samples were sequenced on site to provide rapid information on hospital-acquired infections (Meredith et al., 2020). Nanopore sequencing (Oxford Nanopore Technologies) took place in the Division of Virology, Department of Pathology, University of Cambridge, following the ARTICnetwork V3 protocol, and genomes were assembled using the ARTICnetwork assembly pipeline. The sequencing workflow involved a directional sample flow as used in a diagnostic laboratory, which includes separated pre- and post-PCR areas, with dedicated equipment for each stage of the process. All steps were performed in PCR cabinets, which were cleaned using DNA removal solutions, with a UV decontamination cycle run after each batch. All sequencing batches included at least one water negative control carried over from the reverse-transcription step. Mapped reads were assessed in real-time during sequencing with RAMPART (Hadfield, 2020) and all data from batches containing a contaminated negative control were discarded before sequence assembly. The remaining EoE samples, where available, were sent to the Wellcome Sanger Institute (WSI) for sequencing.

Sequencing at WSI used Illumina technology. cDNA was generated from SARS-CoV-2 viral nucleic acid extracts and subsequently amplified to produce 400nt amplicons tiling the viral genome using V3 nCov-2019 primers (ARTIC). This was followed by Illumina library generation using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs Inc, Cat. No. E7645L). Libraries were amplified with KAPA HiFi Ready Mix (Kapa Biosystems, Cat. No. 07958927001) and uniquely indexed with a 100 µM i5 and i7 primer mix (50 µM each) (Integrated DNA Technologies) to allow multiplexing of up to 384 SARS-CoV-2 viral extracts into one sequencing pool. The PCR products were pooled in equal volume and purified with an AMPure XP workflow (Beckman Coulter, Cat. No. A63880). The purified pool was quantified by qPCR (Illumina Library Quantitation Complete kit, Cat. No. KK4824) and sequenced on one lane of an Illumina NovaSeq SP flow cell (Illumina Inc, NovaSeq 6000 SP Reagent Kit v1.5 (500 cycles), Cat. No. 20028402), with XP workflow (Illumina Inc, NovaSeq XP two lane kit v1.5, Cat. No. 20043130). Genomes were generated for each library’s sequencing data using bwa mem (Li, 2013) for alignment with MN908947.3 (Wu et al., 2020) as reference, samtools (Li et al., 2009) for pileup and ivar (Grubaugh et al., 2019) for trimming and consensus generation, all orchestrated by the ncov2019-artic-nf pipeline (Bull, 2020, cf01166, b88235d and 48816ee).

The WSI sequencing workflow also uses negative controls, and the pass rate to date with respect to negative controls is 90%. Sequencing read counts are considered after a clipping and minimum alignment length filtering step (corresponding to the data used to create consensus sequences or variant calls). Such read counts for the samples analysed in this study were typically in the millions (median: 4,497,543). If such read counts for the corresponding negative controls are >100, the samples currently fail QC. This QC procedure was introduced for samples analysed on or after the 18th of April. Of the 1007 samples analysed in this study that were sequenced at WSI (503 care home residents and 504 non-care home residents), 749 were sequenced once this workflow was established, 242 were sequenced before this but had a negative control, and 16 did not have a negative control. If we apply the current criteria, 38 of these earlier samples would have failed (38/1400 = 2.7% of the analysed samples). Of these 38 samples, 26 are non-care home samples and 12 are from care homes. Of the 12 care home samples (12/700 = 1.7% of the total care home genomes analysed), one belongs to one of the ‘top 10’ care homes with the largest number of genomes, care home CARE0063, which comprises a single cluster of 12 genomes using the transcluster algorithm, described in the main text. Thus, the main result of our genomic cluster analysis (that multiple introductions are often observed in care homes, but typically a single dominant cluster causes most of the cases) would not be altered by the small number of early genomes included that would now be excluded by current criteria.

Sequences were available from both Illumina and Nanopore platforms for eight care home residents included in the study (in all cases the Illumina data were used for the study analysis). In 7/8 cases, the sequence pairs were identical. In one case, there were two SNP differences between the consensus fasta sequences: C1884T and C16351T; for both SNPs, the Illumina sequence matched the reference genome (C) and the nanopore sequence had the alt call (T). These are not included among a list of previously identified sites that are highly homoplasic or have no phylogenetic signal and/or low prevalence (De Maio and Walker, 2020). The sequence pairs are shown below:

| Illumina sample (COG-UK ID) | Illumina sample date | Nanopore sample (COG-UK ID) | Nanopore sample date | Pairwise SNP difference |
|---|---|---|---|---|
| CAMB-761D5 | 30/03/2020 | CAMB-7B088 | 11/04/2020 | 0 |
| CAMB-1AF1F0 | 30/04/2020 | CAMB-1AD8A2 | 30/04/2020 | 0 |
| CAMB-1AE7C2 | 30/04/2020 | CAMB-1AC269 | 30/04/2020 | 2 |
| CAMB-80590 | 09/04/2020 | CAMB-789BD | 06/04/2020 | 0 |
| CAMB-1AB23D | 20/04/2020 | CAMB-840B9 | 26/04/2020 | 0 |
| CAMB-83AAD | 15/04/2020 | CAMB-8416B | 25/04/2020 | 0 |
| CAMB-1ABE2A | 21/04/2020 | CAMB-8468A | 27/04/2020 | 0 |
| CAMB-1AB631 | 21/04/2020 | CAMB-1ABF18 | 27/04/2020 | 0 |

As with all the sample dates used, the above dates are based on sample collection date where available, with missing data substituted with the date of receipt in the laboratory. SNP differences were identified from a vcf file produced from the alignments using the package snp-sites v 2.5.1 (Page et al., 2016), command:

snp-sites -v alignment_file.aln

In Meredith et al., 2020, out of 14 sample pairs sequenced both by Illumina at WSI and by nanopore at the University of Cambridge, there were zero SNP differences at positions where both sequences had made a call. There are several reasons why pairwise comparisons between different sequences from the same individual may not be identical, even if both sequences are produced using the same technology. When the cycle threshold (Ct) of a sample is near the limit of detection sensitivity, and/or RNA is degraded (e.g. due to delays between sampling and sequencing at room temperature), it is likely that amplicons that are not as efficiently amplified by the multiplex PCR may have low read coverage, or could be more sensitive to amplification bias. In this case, the samples both had high Ct values: CAMB-1AE7C2 (sequenced by Illumina at WSI) had a Ct value of 30 and CAMB-1AC269 (nanopore sequenced in Cambridge) had a Ct value of 31. The median Ct value for the 700 care home residents with genomes analysed was 24 (interquartile range: 20–27) (data displayed in Table 1). If an individual is infected with more than one clone at significant frequency, it is also possible for stochastic variation in read counts for the two variants to yield different consensus calls at the variant locus. However, larger studies have systematically evaluated sequencing quality for SARS-CoV-2 between Oxford Nanopore Technology (ONT) and Illumina, and demonstrated highly accurate consensus-level sequence determination (Bull et al., 2020). Given this degree of consensus sequence accuracy, and because transcluster uses a transmission probability cut-off based on integrating pairwise SNP and temporal differences (rather than relying solely on a strict SNP cut-off), limited sequencing noise is unlikely to have a substantial impact on the clusters identified.

COG-UK IDs and GISAID accession numbers for genomes analysed in this study are included in Supplementary Materials, along with a complete author list for the COG-UK consortium.

Sample selection

As described in Meredith et al., 2020, patient metadata were downloaded daily from the electronic medical record system (Epic Systems, Verona, WI, USA) and metadata manipulations were performed in R (v 3.6.2) using the tidyverse packages (v 1.3.0) installed on CUH computers. Positive samples were collected and assigned either for nanopore sequencing on site (focusing on CUH samples and a randomised selection of EoE samples), or sent to WSI for Illumina sequencing. Metadata were uploaded weekly to the MRC CLIMB system as part of the COG-UK Consortium. Samples included healthcare workers (HCW) tested in the CUH HCW screening programme (Jones et al., 2020a; Rivett et al., 2020), all of which were nanopore sequenced on site.

Identifying care home residents

Care home residents were identified using a two-stage data mining approach followed by manual inspection and linking of putative care home addresses to care homes registered to the Care Quality Commission (CQC).

Step 1: search terms in patient address fields

Patient address lines 1 and 2 were searched for the following list of key phrases (not case sensitive) in their electronic healthcare records; if any phrases were present the patient was labelled as being from a care home:

  • ‘residential home’

  • ‘care home’

  • ‘nursing home’

  • ‘care centre’

  • ‘care hom’

  • ‘nursing hom’

  • ‘residential hom’

  • ‘carehome’

This identified 765 patients as being care home residents.
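The Step 1 search can be illustrated with a short sketch. It is written in Python for illustration only (the study's metadata handling was done in R), and the function name `looks_like_care_home` is hypothetical. Each address line is checked separately so that a phrase cannot be formed by accident across the boundary between lines 1 and 2; that detail is an assumption, not something stated in the text.

```python
# Key phrases listed in Step 1; matching is not case sensitive.
CARE_HOME_PHRASES = (
    "residential home", "care home", "nursing home", "care centre",
    "care hom", "nursing hom", "residential hom", "carehome",
)


def looks_like_care_home(*address_lines):
    """Return True if any address line contains any of the key phrases.

    Hypothetical helper: empty or missing lines (None or "") are skipped.
    """
    return any(
        phrase in line.lower()
        for line in address_lines
        if line
        for phrase in CARE_HOME_PHRASES
    )


if __name__ == "__main__":
    print(looks_like_care_home("Rose Court CARE HOME", "High Street"))
    print(looks_like_care_home("12 High Street", ""))
```

The truncated phrases (e.g. ‘care hom’) make the longer ones redundant for matching purposes; they are kept here only to mirror the list as published.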

Step 2: matching location names to CQC registered care facilities

Many care homes do not have the above list of phrases in their address names. To capture these facilities, we used the publicly available database of care homes registered to the CQC, the independent regulator of health and adult social care in England. All organisations providing accommodation for persons who require nursing or personal care must be registered with the CQC, including care homes with or without nursing care (Care Quality Commission, 2020b). Details of the CQC registration scope can be found in 'The scope of registration (Registration under the Health and Social Care Act 2008)', March 2015, available at this link as of 24th June 2020: (Care Quality Commission, 2015).

The file ‘CQC care directory – with filters (1 June 2020)’ was accessed on 23rd June 2020 from the CQC website: (Care Quality Commission, 2020c), and the following filters were applied:

  • Total facilities in CQC database: N = 49,516

  • ‘Carehome?’ column filtered to ‘Y’: N = 15,507*

  • Only care homes for which the ‘Location Postal Code’ column matched at least one postcode from the dataset of 6600 patients were included, yielding N = 444 care homes.**

  • Following manual review and harmonisation of postcodes with the sample metadata, a set of 469 CQC registered care homes was included.***

*Filtering using the ‘carehome?’ column was based on advice given after correspondence with the CQC.

** Requiring CQC registered care homes to match postcodes from the patient dataset minimised the number of ‘false positives’ – patients whose address name matched a CQC registered care home name by coincidence.

*** 25 CQC registered care homes were added following manual review of the identified putative care home residents, who had a different postcode documented in the electronic healthcare records for the same care home, yielding the final ‘CQC EoE care home search set’ of 469 care homes.

We then used the values from the ‘Location name’ column of the filtered CQC dataset (i.e. the care home facility names) as search phrases for address line one in the patient database. Any patients with exactly matching phrases were labelled as care home residents. This increased the number of care home residents identified by a further 382, to 1147; that is, around one third of care home residents were identified using CQC facility names and would have been missed by relying on generic care home-related search phrases alone.
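A minimal sketch of the Step 2 logic follows, again in Python rather than the R used in the study. The column names ‘Location name’, ‘Location Postal Code’ and ‘Carehome?’ come from the CQC file described above; the function names, the dictionary-per-row input format, and the lower-casing and whitespace normalisation are assumptions added for illustration.

```python
def normalise(text):
    """Lower-case and collapse whitespace (an illustrative assumption)."""
    return " ".join(text.lower().split())


def build_search_set(cqc_rows, patient_postcodes):
    """Names of CQC care homes whose postcode occurs in the patient dataset.

    cqc_rows: iterable of dicts keyed by the CQC column names.
    patient_postcodes: iterable of postcode strings from patient records.
    """
    seen_postcodes = {normalise(p) for p in patient_postcodes}
    return {
        normalise(row["Location name"])
        for row in cqc_rows
        if row["Carehome?"] == "Y"
        and normalise(row["Location Postal Code"]) in seen_postcodes
    }


def is_cqc_care_home(address_line_1, search_set):
    """Exact match of the (normalised) first address line to a facility name."""
    return normalise(address_line_1) in search_set
```

Because the match is exact, an address such as ‘Flat 2, Rose Court’ would not match a facility named ‘Rose Court’, which is the kind of case the manual inspection in Step 3 is described as catching.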

Step 3: manual inspection and data clean up

Address lines for the non-care home patients were manually inspected; this identified a further 89 care home residents. Most of these had not been detected in steps 1 and 2 due to spelling or formatting issues with the patient addresses (e.g. short-hand abbreviations used for care home names, or inclusion of extra details like flat number meaning the string did not match a CQC care home name exactly).

Next, address lines for the care home residents were manually inspected and 14 were deemed not to be care home residents. Most of these were due to unrelated locations sharing the same address name as a CQC registered care home. The manual filtering steps thus yielded a care home resident count of 1147 + 89 − 14 = 1222. Address line 1 for all 1222 care home residents was manually inspected and formatted to ensure residents from the same care home had matching terms in this column. This was necessary because address entries for the same care home were formatted inconsistently; without this step, residents from the same care home would be incorrectly assigned to different anonymised care home codes.

Step 4: linking care home addresses to CQC registered care homes

First line of patient address and postcodes were matched to care home names and postcodes from the CQC EoE care home search set (described above). Any discrepancies (care homes not matching the CQC data) were manually inspected and in the majority of cases the discrepancy could be reconciled (e.g. alternative name or postcode used for the same care home). In 55 cases, a ‘care home’ was reclassified to non-care home, either because the address was independent housing with a matching name to a care home by coincidence, or because a care facility was determined by CQC definitions to not be a care home – for example several mental health community hospitals, drug rehabilitation centres, and supported living environments were excluded. This yielded the final analysis set of 1222 − 55 = 1167 care home residents, from 337 care homes. All 337 care homes included were therefore linked to CQC data; in two cases, the care home had been previously registered but had since been ‘archived’, and the most recent CQC data for defining whether residential or nursing care was being provided was used.
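As a check on the running totals reported across Steps 1 to 4, the arithmetic can be written out explicitly. The variable names are illustrative; the numbers are those given in the text.

```python
# Counts reported in Steps 1 to 4.
phrase_matches = 765      # Step 1: key-phrase search of address lines
cqc_name_matches = 382    # Step 2: matches to CQC facility names
manual_additions = 89     # Step 3: found by manual inspection
manual_removals = 14      # Step 3: deemed not to be care home residents
reclassified = 55         # Step 4: reclassified as non-care home

after_step2 = phrase_matches + cqc_name_matches
after_step3 = after_step2 + manual_additions - manual_removals
final_residents = after_step3 - reclassified

# Share of the Step 2 total that was found only through CQC facility names.
share_from_cqc_names = cqc_name_matches / after_step2

if __name__ == "__main__":
    print(after_step2, after_step3, final_residents)
    print(round(share_from_cqc_names, 3))
```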

Care home location IDs assigned by the CQC were turned into anonymised codes (format: CARE followed by a four-digit numeric code). Care homes were classified as ‘residential homes’ or ‘nursing homes’ using the CQC data column ‘Service type - Care home service with nursing’ filtered to ‘Y’ for care homes with nursing, and column ‘Service type - Care home service without nursing’ = ‘Y’ for care homes without nursing (‘residential homes’). If both fields were ‘Y’ then the care home was coded as being a nursing home.

Linking care home data to CUH acute medical testing data

The dataset of 7407 PCR-positive samples with metadata was collected prospectively as part of the COG-UK study in Cambridge. Data on CUH acute care testing, including categorisations of whether infections were community- or hospital-acquired (definitions provided in Meredith et al., 2020) and data on patient outcomes (mortality at 30 days and ICU admissions), were collected separately as part of CUH and national monitoring. During the study period, 464 patients tested positive for COVID-19 at CUH.

When merging the metadata collected for COG-UK (including the above care home categorisations) with CUH acute testing data, 71 care home residents tested at CUH were identified. However, there were 23 samples that had tested positive in CUH that were not in the COG-UK dataset. Of these 23, 21 were tested on the SAMBA platform at CUH (Collier et al., 2020), which is not PCR-based; sequencing was not possible for these samples owing to rapid RNA degradation. For technical reasons, SAMBA results were not included in the data collected prospectively in the Cambridge COG-UK study. The remaining two discrepancies were not captured in the electronic patient record downloads, which likely reflects periods where the download processes and coding methods were being established. Of the 23 missing samples, 20 were community-onset community-associated, two were hospital-onset indeterminate healthcare-associated, and one was a healthcare worker. These are counted as such and depicted with the above categorisations in the CUH epidemic curve shown in Figure 3B. Of the 23 CUH samples missing from the Cambridge COG-UK dataset, one was determined to be a care home resident, bringing the total CUH care home residents analysed to 72.

Statistics

All statistical analyses were performed in R. The logistic regression model used to estimate odds of 30-day mortality was coded as follows: glm.fit <- glm(mortality_30_days ~ age + sex + care_status + ICU_admission + diagnostic_ct_value, data = data, family = binomial); summary(glm.fit).

Odds ratios and 95% confidence intervals were derived by exponentiating the model coefficients: exp(cbind(coef(glm.fit), confint(glm.fit))).
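To illustrate what exponentiating a coefficient means, the sketch below converts a log-odds coefficient and its standard error into an odds ratio with a Wald-type 95% interval. This is a simplification in Python, not the study's method: the R command above calls confint() on the fitted model, which may compute intervals differently (for example by profiling the likelihood), so the bounds would not necessarily agree. The function name and the numbers in the usage lines are hypothetical and are not results from the study.

```python
import math


def odds_ratio_with_wald_ci(coef, std_err, z=1.96):
    """Exponentiate a logistic-regression coefficient and its Wald bounds.

    coef:    estimated log-odds coefficient
    std_err: standard error of that coefficient
    z:       normal quantile (1.96 for an approximate 95% interval)
    Returns (odds_ratio, lower_bound, upper_bound).
    """
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return math.exp(coef), lower, upper


if __name__ == "__main__":
    # Hypothetical inputs: a coefficient of ln(2) corresponds to a doubling of the odds.
    print(odds_ratio_with_wald_ci(math.log(2.0), 0.1))
```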

To produce the plot of odds ratios shown in Figure 4, the age and diagnostic Ct value continuous variables were transformed into binary categoricals using cut-offs of age >80 years and Ct value <20.

Wilcoxon rank sum tests were performed in R using the command format: wilcox.test(x, y, alternative = "two.sided", conf.level = 0.95).

p-Values below 10⁻⁵ are not reported.

Selecting randomised sample of non-care home residents as comparison group

A randomised sample of non-care home residents was selected to use as a control group for comparison of viral lineage composition against the care home residents. Because this group was intended to be representative of non-care home community-acquired transmission, we applied the following inclusion criteria prior to randomisation:

  • Patient address available.

  • Not one of the identified care home residents.

  • Not a healthcare worker (information only available for people tested at CUH).

  • Not a CUH case of indeterminate, suspected or definite hospital acquired infection.

  • Not living in a long-term care facility other than a care home (e.g. mental health hospital, rehabilitation unit, etc).

  • Not living in a prison.

We attempted to have a roughly equivalent representation of nanopore and WSI sequenced samples as present in the care home database. Samples were selected using the R randomisation command sample_n() from available genomes in the CLIMB database passing QC filters. Having identified 698 samples, any cases with matching addresses that had been excluded were added to yield the final set of 700 non-care home genomes for comparison. Of the 700 non-care home samples included, we note that there were five instances of pairs of samples sharing the same address; in all five cases the pairwise SNP difference was zero or 1, and in 4/5 cases the people shared the same surname. This non-care home comparison set is not part of the care home viral cluster analysis performed using the transcluster algorithm.

Care home viral phylogenetics and cluster analysis

Consensus fasta sequences were downloaded from the MRC-CLIMB website (https://www.climb.ac.uk/) (Connor et al., 2016). Genomes were de-duplicated (one genome per person) and passed through quality control (QC) filtering using the same criteria as in Meredith et al., 2020: genome size >29 Kb, N count <2990 (i.e. >90% coverage). Where there were multiple sequences from the same patient, the sequence passing QC filters that was collected first was used for genomic analysis (closest to the onset of symptoms).
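The QC and de-duplication rules can be sketched as follows. This is an illustrative Python version, not the code used in the study; the function names and the (patient, date, sequence) record format are assumptions, as is the choice to measure genome size as the full consensus length including N characters.

```python
from datetime import date


def passes_qc(sequence, min_length=29_000, max_n=2990):
    """Apply the stated thresholds: genome size >29 kb and N count <2990."""
    seq = sequence.upper()
    return len(seq) > min_length and seq.count("N") < max_n


def first_passing_genome(records):
    """Keep, per patient, the earliest-collected sequence that passes QC.

    records: iterable of (patient_id, collection_date, sequence) tuples.
    Returns {patient_id: (collection_date, sequence)}.
    """
    best = {}
    for patient_id, collected, sequence in records:
        if not passes_qc(sequence):
            continue
        if patient_id not in best or collected < best[patient_id][0]:
            best[patient_id] = (collected, sequence)
    return best
```

Note that a patient whose earliest sample fails QC is represented by a later sample under this rule, which is the situation addressed by the sensitivity analysis on first positive test dates described under Clustering.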

The 700 de-duplicated viral genomes from care home residents passing QC were aligned using MAFFT (v 7.458) (Katoh and Standley, 2013) with default settings. Command: ‘/PATH/mafft’ --retree 2 --inputorder ‘multi_fasta_filename.fasta’ > ‘alignment_filename’.

A SNP difference matrix was produced from the alignment using snp-dists v 0.7.0 (Seemann, 2020) installed in a conda environment, run with the following command: snp-dists -c alignment_filename.aln > snp_diff_matrix_filename.csv.
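For readers unfamiliar with SNP difference matrices, the sketch below shows the underlying calculation on an alignment held in memory. It is a Python illustration rather than a reimplementation of snp-dists; it assumes that only positions where both sequences carry an unambiguous base (A, C, G or T) are compared, so gaps and N calls never count as differences. The function names are hypothetical.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count aligned positions where both bases are A/C/G/T and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in UNAMBIGUOUS and b in UNAMBIGUOUS
    )


def snp_matrix(alignment):
    """Symmetric pairwise distance table from {sample_name: aligned_sequence}."""
    matrix = {name: {name: 0} for name in alignment}
    for a, b in combinations(alignment, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix
```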

The SNP difference matrix was manipulated in R using the Matrix and tidyverse packages to generate the SNP difference histogram and boxplots.

Phylogenetic trees were generated using IQ-TREE (v 1.6.12 built 15th August 2019). An alignment was generated as above including a reference genome from Wuhan, China, collected December 2019 and used to root the tree (GISAID ID: EPI_ISL_402123). The IQ-TREE Model Finder Plus option was used (Kalyaanamoorthy et al., 2017) which searches from a database of available nucleotide substitution models and selects the best fit to the analysis, command line:

/PATH/iqtree -s alignment_filename -m MFP

The best-fit nucleotide substitution model according to BIC was GTR+F+R2. The tree shown in this manuscript was produced using the GTR+F+R2 model with the ultrafast bootstrap option (Hoang et al., 2018) run through 1000 iterations to estimate branch support values, using command:

/PATH/iqtree -s alignment_filename -m GTR+F+R2 -bb 1000

Newick trees were manipulated in FigTree (v 1.4.4) to root on the Wuhan sample and put in increasing node order. Trees were visualised initially using the microreact online tool (Argimón et al., 2016), and Figure 6A was produced in R using ggtree (v 2.0.4) (Yu et al., 2017).

For the phylogenetic tree of all samples in the study (Figure 6—figure supplement 1), consensus fasta files were downloaded from the COG-UK database (https://www.cogconsortium.uk/data/) accessed 01/12/2020. The same QC filtering described above was applied (genome size >29 Kb, N count <2990). Sequences passing QC were linked by their COG-UK IDs to individuals from this study. Of the 6600 people in the study, 1167 had been identified as care home residents and 700/1,167 (60.0%) had genomes available that passed QC at time of the main analysis, leaving 5246 non-care home residents (187 were undetermined). Of the 5246, 3745 (71.4%) non-care home residents had genomes available that passed QC (including the 700 randomly sub-sampled non-care home residents described above). A multiple sequence alignment was produced in MAFFT and phylogenetic tree produced using IQTREE, command line:

iqtree -s alignment_all.aln -m GTR+F -nt AUTO -ntmax 16 -mem 16G -bb 1000

The tree was manipulated in FigTree (v 1.4.4) and Figure 6—figure supplement 1 was produced in R using the ggtree package as with Figure 6.

Lineage assignment

Viral lineages were assigned using the Pangolin COVID-19 Lineage Assigner web utility (COG-UK, 2020). Analysis was performed with Pangolin (Rambaut et al., 2020a) version 1.1.14, lineages version 2020-05-19-2. Contextual information about lineages was taken from Rambaut et al., 2020b, accessed 24/07/2020.

Clustering

Clusters were produced using an implementation of the transcluster algorithm (Stimson et al., 2019; Tonkin-Hill, 2020). Instead of targeting the number of SNPs separating two genomes, the transcluster algorithm proposes a probabilistic alternative which estimates the number of intermediate transmission events separating two sampled genomes. The method takes into account both the genetic SNP distance and the time at which each sample was taken. The approach models both the SNP distance and the number of intermediate hosts as a Poisson process. Using a predefined evolutionary rate as well as an estimate of the generation time (the time between transmission events), the method infers the distribution of the number of intermediate hosts separating two samples.

Scheme 1. Diagram representing transmission dynamics between two samples.

Briefly, let N be the SNP distance separating two genomes and δ the time difference between when the samples were taken. We would like to estimate h, the time between the infection times of the two samples. The number of SNPs per unit time can be modelled as a Poisson process with evolutionary rate λ. Similarly, we assume the rate β at which the pathogen jumps to a new host is constant, resulting in another Poisson process for the number of intermediate hosts given h and δ. We are thus interested in the probability that there are k intermediate hosts given N and δ which, following the derivation in Stimson et al., 2019, can be written as:

$$P(k \mid N, \delta) = \int_{0}^{\infty} P(h \mid N, \delta)\, P(k \mid h)\, dh$$

This can be expressed as the sum:

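$$P(k \mid N, \delta) = \frac{\lambda^{N+1}\, \beta^{k}\, (N+k)!\; e^{-\delta\beta}}{N!\; k!\; \sum_{i=0}^{N} \frac{(\lambda\delta)^{i}}{i!}} \; \sum_{i=0}^{N+k} \frac{\delta^{\,N+k-i}}{(N+k-i)!\; (\lambda+\beta)^{\,i+1}}$$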

The implementation of transcluster assumed a viral mutation rate of 1e-3 substitutions/site/year (Fauver et al., 2020) and generation time of 5 days, approximated by previous estimates of the serial interval of SARS-CoV-2 (He et al., 2020; Zhang et al., 2020). Days between first positive sampling date for pairs of individuals was used as a proxy for generation time. As above, where collection date was missing, the date the sample was received in the Cambridge PHE laboratory was used. The resulting pairwise transmission probabilities were used to generate a pairwise distance matrix and clustering was performed using single linkage hierarchical clustering with the R hclust function. Links were only considered if they involved residents from the same care home; thus, the largest theoretical number of clusters in this analysis would be 700 (every individual is their own distinct cluster), and the smallest would be 292 (one cluster for each care home).
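The closed-form sum can be evaluated directly. The sketch below is an illustrative Python version and is not the implementation used in the study (which is available in the repository cited later in this section). It rests on several assumptions: the formula is read as written in the docstring, with λ and β expressed per day; the mutation rate of 1e-3 substitutions/site/year is converted to a per-genome rate using a genome length of 29,903 nt (the length of the MN908947.3 reference); and β is taken as the reciprocal of the 5-day generation time. Function and constant names are hypothetical.

```python
import math

GENOME_LENGTH = 29_903                      # assumed: length of MN908947.3
LAMBDA = 1e-3 * GENOME_LENGTH / 365.0       # substitutions per genome per day
BETA = 1.0 / 5.0                            # transmissions per day (5-day generation time)


def prob_k_intermediates(k, n_snps, delta_days, lam=LAMBDA, beta=BETA):
    """P(k | N, delta), read as:

        lam^(N+1) * beta^k * (N+k)! * exp(-delta*beta)
        ----------------------------------------------  *  sum_{i=0}^{N+k} T_i
        N! * k! * sum_{i=0}^{N} (lam*delta)^i / i!

    with T_i = delta^(N+k-i) / ((N+k-i)! * (lam+beta)^(i+1)).
    Intended for small N and k; very large factorials would overflow floats.
    """
    N, d = n_snps, float(delta_days)
    norm = sum((lam * d) ** i / math.factorial(i) for i in range(N + 1))
    prefactor = (
        lam ** (N + 1) * beta ** k * math.factorial(N + k) * math.exp(-d * beta)
        / (math.factorial(N) * math.factorial(k) * norm)
    )
    tail = sum(
        d ** (N + k - i) / (math.factorial(N + k - i) * (lam + beta) ** (i + 1))
        for i in range(N + k + 1)
    )
    return prefactor * tail


def prob_fewer_than(max_k, n_snps, delta_days, lam=LAMBDA, beta=BETA):
    """Probability that two samples are separated by fewer than max_k hosts."""
    return sum(
        prob_k_intermediates(k, n_snps, delta_days, lam, beta) for k in range(max_k)
    )


if __name__ == "__main__":
    # Example pair: 1 SNP apart, sampled 3 days apart.
    print(prob_fewer_than(2, n_snps=1, delta_days=3))
```

Under these assumptions the probabilities over k sum to one for any fixed N and δ, and the probability of fewer than two intermediate hosts falls as the SNP distance grows, which is the behaviour the clustering threshold relies on.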

The relationship between the probability of infections being linked by <2 intermediate hosts and the resulting number of care home clusters was explored. A higher threshold leads to more care home clusters, with greater likelihood of linked transmission within each cluster than when using a lower threshold. A pragmatic cut-off of <15% probability was selected, yielding 409 clusters. The majority of pairwise comparisons within clusters were zero or 1 SNP different and <14 days apart.
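Cutting a single-linkage dendrogram at a fixed threshold is equivalent to taking the connected components of the graph whose edges are the pairs that meet the threshold. The sketch below uses that equivalence with a small union-find structure, in Python rather than the R hclust function used in the study. It assumes that a pair is linked when its probability of fewer than two intermediate hosts is at least the threshold (0.15 here) and that both samples come from the same care home; the function name and input format are hypothetical.

```python
def cluster_within_homes(samples, link_prob, threshold=0.15):
    """Group samples into clusters by thresholded single linkage.

    samples:   {sample_id: care_home_code}
    link_prob: {(sample_a, sample_b): probability of a close transmission link}
    Returns a list of sets of sample IDs; unlinked samples form singletons.
    """
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in link_prob.items():
        # Links between different care homes are never considered.
        if samples[a] == samples[b] and p >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())
```

Raising the threshold removes edges and can only split clusters, which is consistent with the observation above that a higher threshold leads to more care home clusters.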

For 16/700 (2.3%) genomes, the sample that produced the analysed sequence was not the first positive test for that individual in the dataset. This could have occurred if the first positive test was not sequenced, or the sequencing failed or did not pass QC filters. This could theoretically lead to different clustering outcomes, if two cases were counted as further apart temporally than they really were from the date of first positive swab. To ensure this had not biased our findings, the transcluster analysis was re-run with identical thresholds using the date of first positive test for each individual (keeping the same genomes). There was no change in the number of clusters identified (n = 409).

To maintain study participant anonymity, care home residency status cannot be released publicly linked to their COG-UK genome codes. However, an anonymised version of the same dataset analysed in this study, with COG-UK sequence codes replaced by anonymised sample codes, can be accessed via GitHub at https://github.com/gtonkinhill/SC2-care-homes-anonymised. This includes all code and anonymised input data to reproduce the transmission analysis. Further discussion on data release is provided in Supplementary Materials.

Investigating hospital admissions for care home residents

Hospital Episode Statistics (HES) data from 26th February to 10th May 2020 were linked to cases from this study using matching NHS numbers. The data were accessed by the Public Health England Healthcare Associated Infections (HCAI) division via the PHE Data Lake. This was possible for 694/700 (99%) of the care home residents with genomes available (used in the cluster analysis); six cases could not be linked to admission data due to missing NHS numbers in the study metadata.

Hospital admission coding included transfer of care between medical units as separate admissions. These were condensed into single admissions if the time interval between the preceding discharge and the following admission was less than or equal to 1 day; that is an admission had to occur 2 days or more after the preceding discharge to be counted as a new admission.
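The condensing rule can be sketched as a pass over date-sorted admission spells. This is an illustrative Python version with a hypothetical function name and input format; it assumes each spell has both an admission and a discharge date.

```python
from datetime import date


def condense_admissions(spells, max_gap_days=1):
    """Merge spells separated by max_gap_days or fewer into one admission.

    spells: list of (admission_date, discharge_date) tuples.
    A new admission starts only if it begins 2 or more days after the
    preceding discharge (with the default max_gap_days=1).
    """
    merged = []
    for admitted, discharged in sorted(spells):
        if merged and (admitted - merged[-1][1]).days <= max_gap_days:
            # Transfer of care: extend the previous admission.
            prev_admitted, prev_discharged = merged[-1]
            merged[-1] = (prev_admitted, max(prev_discharged, discharged))
        else:
            merged.append((admitted, discharged))
    return merged
```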

Hospital admission data were parsed to yield the following outputs

  • COVID-19-related hospital admission: first positive test date was −2 to +7 days inclusive from a hospital admission date

  • Suspected hospital acquired: first positive test date was +7 days from a hospital admission to +7 days from a hospital discharge, inclusive. The people testing positive in the community within 7 days of discharge from hospital are categorised as, ‘community onset, suspected hospital acquired’; the people testing positive after 7 days from admission but before their discharge are categorised as, ‘hospital onset, suspected hospital acquired’.

  • For the six individuals with no NHS number, we assumed they were not discharged within 7 days of a positive test.
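One way the categories above could be applied to a resident's first positive test date is sketched below in Python. It is an interpretation, not the study's code: the windows are taken literally as written (so a test exactly 7 days after admission satisfies both the COVID-19-related and the suspected hospital-acquired definitions), a positive test on the day of discharge is treated as hospital onset, and the function name and label strings are chosen for illustration.

```python
from datetime import date


def classify_first_positive(first_positive, admissions):
    """Return the set of category labels that apply to a first positive test.

    first_positive: date of the first positive test.
    admissions:     list of (admission_date, discharge_date), already condensed.
    """
    labels = set()
    for admitted, discharged in admissions:
        since_admission = (first_positive - admitted).days
        since_discharge = (first_positive - discharged).days
        # COVID-19-related admission: -2 to +7 days inclusive from admission.
        if -2 <= since_admission <= 7:
            labels.add("COVID-19-related admission")
        # Suspected hospital acquired: admission +7 days to discharge +7 days.
        if since_admission >= 7 and since_discharge <= 7:
            if first_positive <= discharged:
                labels.add("hospital onset, suspected hospital acquired")
            else:
                labels.add("community onset, suspected hospital acquired")
    return labels
```

Residents with no linked admissions (including the six without an NHS number) receive an empty set under this sketch, matching the stated assumption that they were not discharged within 7 days of a positive test.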

For the care home residents with community-onset, suspected hospital-acquired infections, the number of days the patient had been admitted to hospital prior to their positive test was calculated.

CUH HCW-care home resident cluster analysis

The analysis of transmission between healthcare workers (HCW) and care home residents focused on CUH cases, where the richest metadata was available including HCW status.

Of 6600 PCR-positive patients, 91 had been identified as HCW. Of these, 74 were from the CUH HCW screening programme (which includes symptomatic, asymptomatic and household contact arms) (Jones et al., 2020a; Rivett et al., 2020) and 17 had presented acutely to CUH medical services, and been identified as HCW during their initial medical clerking and subsequent note reviews. Of the 91 HCW, 76 had genomes available for analysis (breakdown: 56 samples identified through the CUH HCW screening programme, 9 CUH HCW who presented to acute medical services at CUH, and 11 HCW from community settings (paramedics and care home workers) that had been flagged as HCW through admission clerkings). Of 464 CUH cases in the study period, 72 were care home residents (described above) and 54 of these had available genomes for analysis. The total combined analysis set of CUH HCW and care home residents was therefore 76+54 = 130.

The 130 genomes were aligned using MAFFT and underwent the same cluster analysis using the transcluster algorithm as described above. Transmission links between care homes were excluded as were links between HCWs. HCWs could belong to multiple clusters from different care homes to allow for the possibility of a HCW seeding multiple care home infections. Twenty-one clusters involving both care home residents and HCWs were identified. Of the 54 care home residents, 38 had links with HCWs within the 0.15 probability threshold. Medical notes for potential care home resident-HCW transmission pairs were reviewed by author WLH as described in Meredith et al., 2020, with cases being categorised as strongly linked epidemiologically (e.g. the HCW documented in the care home residents’ medical notes); possibly linked (e.g. both working in the hospital at the same time but not in the same wards); or no evidence of an epidemiological link.

Acknowledgements

We gratefully acknowledge the invaluable contributions of all members of the Wellcome Sanger Institute Covid-19 Surveillance Team (www.sanger.ac.uk/covid-team) who have supported this project. We would also like to thank Nick Donnelly for advice with statistical analyses, and the Public Health England Hospital Acquired Infection (HCAI) division, in particular Rebecca Guy and Mehdi Minaji, for assistance accessing hospital admission data for this study.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

William L Hamilton, Email: will.l.hamilton@gmail.com.

M Estee Torok, Email: et317@cam.ac.uk.

Amy Wesolowski, Johns Hopkins Bloomberg School of Public Health, United States.

Miles P Davenport, University of New South Wales, Australia.

Funding Information

This paper was supported by the following grants:

  • Medical Research Council COG-UK MC-PC-19027 to Sharon J Peacock.

  • National Institute for Health Research COG-UK MC-PC-19027 to Sharon J Peacock.

  • Wellcome Trust COG-UK MC-PC-19027 to Sharon J Peacock.

  • Wellcome Trust Senior Fellowship 207498/Z/17/Z to Ian G Goodfellow.

  • Academy of Medical Sciences Clinician Scientist Fellowship to M Estee Torok.

  • Health Foundation Clinician Scientist Fellowship to M Estee Torok.

  • National Institute for Health Research to William L Hamilton, Emily R Smith, Ben Warne, M Estee Torok.

  • Wellcome Trust Collaborative grant 204870/Z/16/Z to Charlotte J Houldcroft.

Additional information

Competing interests

No competing interests declared.

I have received grant support from the Academy of Medical Sciences, the Health Foundation, and the NIHR Biomedical Research Centre. I have also received book royalties from Oxford University Press and honoraria from the Wellcome Sanger Institute.

Author contributions

Conceptualization, Resources, Data curation, Formal analysis, Supervision, Visualization, Methodology, Writing - original draft, Project administration.

Conceptualization, Data curation, Software, Formal analysis, Methodology, Writing - review and editing.

Resources, Data curation, Formal analysis, Writing - review and editing.

Resources, Data curation, Formal analysis, Writing - review and editing.

Data curation, Formal analysis, Writing - review and editing.

Resources, Data curation, Formal analysis, Writing - review and editing.

Resources, Data curation, Formal analysis, Supervision, Investigation, Writing - review and editing.

Investigation, Writing - review and editing.

Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation.

Resources, Investigation, Writing - review and editing.

Resources, Investigation.

Resources, Investigation, Writing - review and editing.

Resources, Data curation, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation.

Resources, Investigation, Writing - review and editing.

Resources, Data curation, Formal analysis, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Project administration, Writing - review and editing.

Resources, Supervision, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Resources, Investigation, Writing - review and editing.

Supervision, Project administration.

Supervision, Funding acquisition, Project administration, Writing - review and editing.

Resources, Supervision, Project administration, Writing - review and editing.

Resources, Data curation, Supervision, Investigation, Project administration, Writing - review and editing.

Conceptualization, Resources, Data curation, Formal analysis, Supervision, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Software.

Ethics

Human subjects: This study was conducted as part of surveillance for COVID-19 infections under the auspices of Section 251 of the NHS Act 2006. It therefore did not require individual patient consent or ethical approval. The COG-UK study protocol was approved by the Public Health England Research Ethics Governance Group (reference: R&D NR0195).

Additional files

Supplementary file 1. Supplementary materials for ‘Genomic epidemiology of COVID-19 in care homes in the East of England’.
elife-64618-supp1.docx (232.4KB, docx)
Transparent reporting form

Data availability

The main analysis set comprised 700 genomes from care home residents. Additionally, a randomised selection of 700 genomes from non-care home residents was used for comparing lineage composition, and genomes from 76 healthcare workers tested at CUH were included for the analysis of care home resident-HCW transmission. Consensus fasta sequences for the 1,476 genomes are publicly accessible through the COG-UK website data section (https://www.cogconsortium.uk/data/). COG-UK also regularly deposits data into public databases such as GISAID (https://www.gisaid.org/). COG-UK sequence codes, GISAID accession IDs and virus names for the 1,476 analysed genomes are included in Supplementary file 1. Sequences generated through the COG-UK consortium have associated public metadata (available via the COG-UK website or GISAID), including patient age, sex, collection date (if available), and location to the level of UK county. COG-UK samples are sequenced under statutory powers granted to the UK Public Health Agencies. Matched patient data is securely released to the COG-UK consortium under a data sharing framework which strictly controls the handling of patient data. The status of individuals living in a care home and groups of such care home patients are both on the consortium restricted data list. This means that this data cannot be publicly released linked to their sequencing identifiers (e.g. COG-UK sequence codes). This is because of the risk of deductive disclosure, potentially compromising study participant anonymity. However, code to fully reproduce the transcluster transmission analysis using anonymised metadata is available via GitHub at: https://github.com/gtonkinhill/SC2-care-homes-anonymised (v0.1.0). The genomes are the same as those used in the study, but sample names in the genetic distance matrix and corresponding metadata have been changed from COG-UK sequence codes to anonymised sample codes.
The metadata (sampling dates) has been altered from the original patient data but in a way that preserves the date-differences between samples within care homes, thus yielding an identical transcluster analysis. If a researcher requires access to restricted metadata (including care home residency status) linked to the COG-UK sequence codes, then this will require a formal data sharing agreement with the COG-UK Consortium. Access to patient outcome information for patients treated at Cambridge University Hospitals NHS Foundation Trust (CUH) requires a data sharing agreement with CUH. Data will only be shared for public health and research purposes, not for commercial enterprise, and only to individuals working at reputable research and public health institutions for which data security can be assured. Should this be required, researchers should contact the study corresponding authors in the first instance.
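One way to alter sampling dates while preserving date differences within each care home is to add a single constant offset to every sample from the same home. The Python sketch below illustrates that idea only. The text does not describe how the dates were actually altered, so the per-home random offset, the function name and the `max_shift_days` parameter are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

# (anonymised sample code, anonymised care home code, sampling date)
Record = Tuple[str, str, date]


def shift_dates_within_homes(
    records: List[Record],
    max_shift_days: int = 30,
    seed: Optional[int] = None,
) -> List[Record]:
    """Shift every date by an offset that is constant within a care home.

    Any two samples from the same home keep the same gap in days, so an
    analysis that depends only on within-home date differences is unchanged.
    """
    rng = random.Random(seed)
    offsets: Dict[str, timedelta] = {}
    shifted: List[Record] = []
    for sample_code, home, sampled_on in records:
        if home not in offsets:
            # Hypothetical choice: one random offset per home.
            days = rng.randint(-max_shift_days, max_shift_days)
            offsets[home] = timedelta(days=days)
        shifted.append((sample_code, home, sampled_on + offsets[home]))
    return shifted
```

Differences between samples from different homes are generally not preserved by this scheme, which is acceptable here only because links between care homes were excluded from the cluster analysis.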

References

  1. Alm E, Broberg EK, Connor T, Hodcroft EB, Komissarov AB, Maurer-Stroh S. Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region, January to June. Euro Surveillance : Bulletin Européen Sur Les Maladies Transmissibles = European Communicable Disease Bulletin. 2020;2020:1–8. doi: 10.2807/1560-7917.ES.2020.25.32.2001410.
  2. Argimón S, Abudahab K, Goater RJE, Fedosejev A, Bhai J, Glasner C, Feil EJ, Holden MTG, Yeats CA, Grundmann H, Spratt BG, Aanensen DM. Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microbial Genomics. 2016;2:e000093. doi: 10.1099/mgen.0.000093.
  3. Arons MM, Hatfield KM, Reddy SC, Kimball A, James A, Jacobs JR, Taylor J, Spicer K, Bardossy AC, Oakley LP, Tanwar S, Dyal JW, Harney J, Chisty Z, Bell JM, Methner M, Paul P, Carlson CM, McLaughlin HP, Thornburg N, Tong S, Tamin A, Tao Y, Uehara A, Harcourt J, Clark S, Brostrom-Smith C, Page LC, Kay M, Lewis J, Montgomery P, Stone ND, Clark TA, Honein MA, Duchin JS, Jernigan JA, Public Health–Seattle and King County and CDC COVID-19 Investigation Team. Presymptomatic SARS-CoV-2 infections and transmission in a skilled nursing facility. New England Journal of Medicine. 2020;382:2081–2090. doi: 10.1056/NEJMoa2008457.
  4. Bull M. A Nextflow pipeline for running the ARTIC network’s fieldbioinformatics tools with a focus on ncov2019. AGPL-3.0. Github. 2020. https://github.com/connor-lab/ncov2019-artic-nf
  5. Bull RA, Adikari T, Hammond JM, Stevanovski I, Ferguson JM, Beukers AG, Naing Z, Yeang M, Verich A, Gamaarachichi H, Kim KW, Luciani F, Stelzer-Braid S, Eden J-S, Rawlinson WD, Van Hal SJ. Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis. bioRxiv. 2020. doi: 10.1101/2020.08.04.236893.
  6. Burton JK, Bayne G, Evans C, Garbe F, Gorman D, Honhold N, McCormick D, Othieno R, Stevenson J, Swietlik S, Templeton K, Tranter M, Willocks L, Guthrie B. Evolution and impact of COVID-19 outbreaks in care homes: population analysis in 189 care homes in one geographic region. medRxiv. 2020. doi: 10.1101/2020.07.09.20149583.
  7. Byrne AW, McEvoy D, Collins AB, Hunt K, Casey M, Barber A, Butler F, Griffin J, Lane EA, McAloon C, O'Brien K, Wall P, Walsh KA, More SJ. Inferred duration of infectious period of SARS-CoV-2: rapid scoping review and analysis of available evidence for asymptomatic and symptomatic COVID-19 cases. BMJ Open. 2020;10:e039856. doi: 10.1136/bmjopen-2020-039856.
  8. Care Quality Commission. The scope of registration. 2015. [July 30, 2020]. https://www.cqc.org.uk/files/scope-registration-march-2015
  9. Care Quality Commission. Service types | Care Quality Commission. 2020a. [July 30, 2020]. https://www.cqc.org.uk/guidance-providers/regulations-enforcement/service-types#care-homes-without-nursing
  10. Care Quality Commission. What is registration? | Care Quality Commission. 2020b. [July 30, 2020]. https://www.cqc.org.uk/guidance-providers/registration/what-registration
  11. Care Quality Commission. Using CQC data | Care Quality Commission. 2020c. [July 30, 2020]. https://www.cqc.org.uk/about-us/transparency/using-cqc-data
  12. City Population. East of England (United Kingdom): Counties and unitary districts & settlements - Population statistics, charts and map. 2020. [December 4, 2020]. https://www.citypopulation.de/en/uk/eastofengland/
  13. COG-UK. Pangolin COVID-19 lineage assigner. 2020. [February 24, 2021]. https://pangolin.cog-uk.io/
  14. Collier D, Assennato S, Sithole N, Sharrocks K, Ritchie A, Ravji P, Routledge M, Sparkes D, Skittrall J, Warne B, Smielewska A, Ramsey I, Goel N, Curran M, Enoch D, Tassell R, Lineham M, Vaghela D, Leong C, Gupta R. Rapid point of care nucleic acid testing for SARS-CoV-2 in hospitalised patients: a clinical trial and implementation study. medRxiv. 2020. doi: 10.1101/2020.05.31.20114520.
  15. Connor TR, Loman NJ, Thompson S, Smith A, Southgate J, Poplawski R, Bull MJ, Richardson E, Ismail M, Thompson SE-, Kitchen C, Guest M, Bakke M, Sheppard SK, Pallen MJ. CLIMB (the cloud infrastructure for microbial bioinformatics): an online resource for the medical microbiology community. Microbial Genomics. 2016;2:e000086. doi: 10.1099/mgen.0.000086.
  16. Curran ET. Infection outbreaks in care homes: prevention and management. Nursing Times. 2017. [July 7, 2020]. https://www.nursingtimes.net/clinical-archive/infection-control/infection-outbreaks-in-care-homes-prevention-and-management-14-08-2017/
  17. De Maio N, Walker C. Issues with SARS-CoV-2 sequencing data - SARS-CoV-2 coronavirus / nCoV-2019 genomic epidemiology. Virological. 2020. [December 4, 2020]. https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473
  18. Fauver JR, Petrone ME, Hodcroft EB, Shioda K, Ehrlich HY, Watts AG, Vogels CBF, Brito AF, Alpert T, Muyombwe A, Razeq J, Downing R, Cheemarla NR, Wyllie AL, Kalinich CC, Ott IM, Quick J, Loman NJ, Neugebauer KM, Greninger AL, Jerome KR, Roychoudhury P, Xie H, Shrestha L, Huang ML, Pitzer VE, Iwasaki A, Omer SB, Khan K, Bogoch II, Martinello RA, Foxman EF, Landry ML, Neher RA, Ko AI, Grubaugh ND. Coast-to-Coast spread of SARS-CoV-2 during the early epidemic in the United States. Cell. 2020;181:990–996. doi: 10.1016/j.cell.2020.04.021.
  19. Goldberg SA, Lennerz J, Klompas M, Mark E, Pierce VM, Thompson RW, Pu CT, Ritterhouse LL, Dighe A, Rosenberg ES, Grabowski DC. Presymptomatic transmission of severe acute respiratory syndrome coronavirus 2 among residents and staff at a skilled nursing facility: results of Real-time polymerase chain reaction and serologic testing. Clinical Infectious Diseases. 2021;72:686–689. doi: 10.1093/cid/ciaa991.
  20. Graham NSN, Junghans C, Downes R, Sendall C, Lai H, McKirdy A, Elliott P, Howard R, Wingfield D, Priestman M, Ciechonska M, Cameron L, Storch M, Crone MA, Freemont PS, Randell P, McLaren R, Lang N, Ladhani S, Sanderson F, Sharp DJ. SARS-CoV-2 infection, clinical features and outcome of COVID-19 in United Kingdom nursing homes. Journal of Infection. 2020;81:411–419. doi: 10.1016/j.jinf.2020.05.073.
  21. Grubaugh ND, Gangavarapu K, Quick J, Matteson NL, De Jesus JG, Main BJ, Tan AL, Paul LM, Brackney DE, Grewal S, Gurfield N, Van Rompay KKA, Isern S, Michael SF, Coffey LL, Loman NJ, Andersen KG. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biology. 2019;20:1–19. doi: 10.1186/s13059-018-1618-7.
  22. Hadfield J. artic-network/rampart: Read Assignment, Mapping, and Phylogenetic Analysis in Real Time. 1.2.0. Github. 2020. https://github.com/artic-network/rampart
  23. He X, Lau EHY, Wu P, Deng X, Wang J, Hao X, Lau YC, Wong JY, Guan Y, Tan X, Mo X, Chen Y, Liao B, Chen W, Hu F, Zhang Q, Zhong M, Wu Y, Zhao L, Zhang F, Cowling BJ, Li F, Leung GM. Temporal dynamics in viral shedding and transmissibility of COVID-19. Nature Medicine. 2020;26:672–675. doi: 10.1038/s41591-020-0869-5.
  24. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: improving the ultrafast bootstrap approximation. Molecular Biology and Evolution. 2018;35:518–522. doi: 10.1093/molbev/msx281.
  25. ICS. Assessing whether COVID-19 patients will benefit from critical care, and an objective approach to capacity challenges. 2020. [October 10, 2020]. https://www.rcplondon.ac.uk/file/20726/download
  26. Illingworth CJR, Hamilton WL, Jackson C, Popay A, Meredith L, Houldcroft CJ, Hosmillo M, Jahun A, Routledge M, Warne B, Caller L, Caddy S, Yakovleva A, Hall G, Khokhar FA, Feltwell T, Pinckert ML, Georgana I, Chaudhry Y, Curran M, Parmar S, Sparkes D, Rivett L, Jones NK, Sridhar S, Forrest S, Dymond T, Grainger K, Workman C, Gkrania-Klotsas E, Brown NM, Weekes MP, Baker S, Peacock SJ, Gouliouris T, Goodfellow I, De Angelis D, Török ME. A2B-COVID: a method for evaluating potential SARS-CoV-2 transmission events. medRxiv. 2020. doi: 10.1101/2020.10.26.20219642.
  27. Jones NK, Rivett L, Sparkes D, Forrest S, Sridhar S, Young J, Pereira-Dias J, Cormie C, Gill H, Reynolds N, Wantoch M, Routledge M, Warne B, Levy J, Córdova Jiménez WD, Samad FNB, McNicholas C, Ferris M, Gray J, Gill M, Curran MD, Fuller S, Chaudhry A, Shaw A, Bradley JR, Hannon GJ, Goodfellow IG, Dougan G, Smith KG, Lehner PJ, Wright G, Matheson NJ, Baker S, Weekes MP, CITIID-NIHR COVID-19 BioResource Collaboration. Effective control of SARS-CoV-2 transmission between healthcare workers during a period of diminished community prevalence of COVID-19. eLife. 2020a;9:e59391. doi: 10.7554/eLife.59391.
  28. Jones NR, Qureshi ZU, Temple RJ, Larwood JPJ, Greenhalgh T, Bourouiba L. Two metres or one: what is the evidence for physical distancing in covid-19? BMJ. 2020b;370:m3223. doi: 10.1136/bmj.m3223.
  29. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods. 2017;14:587–589. doi: 10.1038/nmeth.4285.
  30. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010.
  31. Kemenesi G, Kornya L, Tóth GE, Kurucz K, Zeghbib S, Somogyi BA, Zöldi V, Urbán P, Herczeg R, Jakab F. Nursing homes and the elderly regarding the COVID-19 pandemic: situation report from Hungary. GeroScience. 2020;42:1093–1099. doi: 10.1007/s11357-020-00195-z.
  32. Ladhani SN, Chow JY, Janarthanan R, Fok J, Crawley-Boevey E, Vusirikala A, Fernandez E, Perez MS, Tang S, Dun-Campbell K, Wynne-Evans E, Bell A, Patel B, Amin-Chowdhury Z, Aiano F, Paranthaman K, Ma T, Saavedra-Campos M, Myers R, Ellis J, Lackenby A, Gopal R, Patel M, Chand M, Brown K, Hopkins S, Consortium C, Shetty N, Zambon M, Ramsay ME, London Care Home Investigation Team. Increased risk of SARS-CoV-2 infection in staff working across different care homes: enhanced CoVID-19 outbreak investigations in London care homes. Journal of Infection. 2020;81:621–624. doi: 10.1016/j.jinf.2020.07.027.
  33. Lansbury LE, Brown CS, Nguyen-Van-Tam JS. Influenza in long-term care facilities. Influenza and Other Respiratory Viruses. 2017;11:356–366. doi: 10.1111/irv.12464.
  34. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The sequence alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
  35. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013. https://arxiv.org/abs/1303.3997
  36. MacFadden DR, McGeer A, Athey T, Perusini S, Olsha R, Li A, Eshaghi A, Gubbay JB, Hanage WP. Use of genome sequencing to define institutional influenza outbreaks, Toronto, Ontario, Canada, 2014-15. Emerging Infectious Diseases. 2018;24:492–497. doi: 10.3201/eid2403.171499.
  37. McAuslane H, Morgan D. Communicable disease outbreak management Operational Guidance: public heal engl. 2014. [March 12, 2021]. https://www.gov.uk/government/publications/communicable-disease-outbreak-management-operational-guidance
  38. Meredith LW, Hamilton WL, Warne B, Houldcroft CJ, Hosmillo M, Jahun AS, Curran MD, Parmar S, Caller LG, Caddy SL, Khokhar FA, Yakovleva A, Hall G, Feltwell T, Forrest S, Sridhar S, Weekes MP, Baker S, Brown N, Moore E, Popay A, Roddick I, Reacher M, Gouliouris T, Peacock SJ, Dougan G, Török ME, Goodfellow I. Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care associated COVID-19: a prospective genomic surveillance study. The Lancet Infectious Diseases. 2020;20:1263–1271. doi: 10.1016/S1473-3099(20)30562-4.
  39. Middleton J, Reintjes R, Lopes H. Meat plants-a new front line in the covid-19 pandemic. BMJ. 2020;370:m2716. doi: 10.1136/bmj.m2716.
  40. NICE. NICE guideline NG159. COVID-19 rapid guideline: critical care in adults. 2. Admission to critical care. 2020. [October 11, 2020]. https://www.nice.org.uk/guidance/ng159
  41. Office for National Statistics. 2011 census - Population and household estimates for England and Wales, March 2011. National Census. 2011. [July 7, 2020]. https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/2011censuspopulationandhouseholdestimatesforenglandandwales
  42. Office for National Statistics. Changes in the older resident care home population between 2001 and 2011 - Office for National Statistics. 2014. [July 7, 2020]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/ageing/articles/changesintheolderresidentcarehomepopulationbetween2001and2011/2014-08-01
  43. Office for National Statistics. Families and households in the UK - Office for National Statistics. 2019a. [July 7, 2020]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/families/bulletins/familiesandhouseholds/2019
  44. Office for National Statistics. Families and households. Office for National Statistics. 2019b. [July 7, 2020]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/families/datasets/familiesandhouseholdsfamiliesandhouseholds
  45. Office for National Statistics. Deaths involving COVID-19, England and Wales. Office for National Statistics. 2020a. [July 7, 2020]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/bulletins/deathsinvolvingcovid19englandandwales/deathsoccurringinjune2020
  46. Office for National Statistics. Deaths involving COVID-19 in the care sector, England and Wales. Office for National Statistics. 2020b. [July 7, 2020]. https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/articles/deathsinvolvingcovid19inthecaresectorenglandandwales/deathsoccurringupto12june2020andregisteredupto20june2020provisional
  47. Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microbial Genomics. 2016;2:e000056. doi: 10.1099/mgen.0.000056.
  48. Public Health England. Weekly Coronavirus Disease 2019 (COVID-19) Surveillance Report Confirmed Cases in England. Year: 2020, Week: 20. 2020a. [March 12, 2021]. https://www.nottinghamshire.gov.uk/media/2891988/weeklycovid-19surveillancereportinnottinghamshire-19july2020.pdf
  49. Public Health England. COVID-19 personal protective equipment (PPE) – resource for care workers working in care homes during sustained COVID-19 transmission in England. 2020b. [March 12, 2021]. https://www.gov.uk/government/publications/covid-19-how-to-work-safely-in-care-homes
  50. Public Health England. Coronavirus testing. 2020c. [July 30, 2020]. https://www.gov.uk/government/news/coronavirus-testing
  51. Quicke K, Gallichote E, Sexton N, Young M, Janich A, Gahm G, Carlton EJ, Ehrhart N, Ebel GD. Longitudinal surveillance for SARS-CoV-2 RNA among asymptomatic staff in five Colorado skilled nursing facilities: epidemiologic, virologic and sequence analysis. medRxiv. 2020. doi: 10.1101/2020.06.08.20125989.
  52. Rambaut A, Holmes EC, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology. 2020a;5:1403–1407. doi: 10.1038/s41564-020-0770-5.
  53. Rambaut A, Holmes E, O'Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. SARS-CoV-2 lineages. 2020b. [July 30, 2020]. https://cov-lineages.org/descriptions.html
  54. Rivett L, Sridhar S, Sparkes D, Routledge M, Jones NK, Forrest S, Young J, Pereira-Dias J, Hamilton WL, Ferris M, Torok ME, Meredith L, Curran MD, Fuller S, Chaudhry A, Shaw A, Samworth RJ, Bradley JR, Dougan G, Smith KG, Lehner PJ, Matheson NJ, Wright G, Goodfellow IG, Baker S, Weekes MP, CITIID-NIHR COVID-19 BioResource Collaboration. Screening of healthcare workers for SARS-CoV-2 highlights the role of asymptomatic carriage in COVID-19 transmission. eLife. 2020;9:e58728. doi: 10.7554/eLife.58728.
  55. Seemann T. tseemann/snp-dists: Pairwise SNP distance matrix from a FASTA sequence alignment. 3.0. Github. 2020. https://github.com/tseemann/snp-dists
  56. Stimson J, Gardy J, Mathema B, Crudu V, Cohen T, Colijn C. Beyond the SNP threshold: identifying outbreak clusters using inferred transmissions. Molecular Biology and Evolution. 2019;36:587–603. doi: 10.1093/molbev/msy242.
  57. Strausbaugh LJ, Sukumar SR, Joseph CL. Infectious disease outbreaks in nursing homes: an unappreciated hazard for frail elderly persons. Clinical Infectious Diseases. 2003;36:870–876. doi: 10.1086/368197.
  58. The Health Foundation. COVID-19 policy tracker. 2020. [July 30, 2020]. https://www.health.org.uk/news-and-comment/charts-and-infographics/covid-19-policy-tracker
  59. Tonkin-Hill G. fasttranscluster. v0.1.0. Github. 2020. https://github.com/gtonkinhill/fasttranscluster
  60. UK government. COVID-19: number of outbreaks in care homes - management information. GOV.UK. 2020a. [July 30, 2020]. https://www.gov.uk/government/statistical-data-sets/covid-19-number-of-outbreaks-in-care-homes-management-information
  61. UK government. Update on policies for visiting arrangements in care homes. 2020b. [October 26, 2020]. https://www.gov.uk/government/publications/visiting-care-homes-during-coronavirus/update-on-policies-for-visiting-arrangements-in-care-homes#section-4
  62. UK government. Vivaldi 1: COVID-19 care homes study report. GOV.UK. 2020c. [October 11, 2020]. https://www.gov.uk/government/publications/vivaldi-1-coronavirus-covid-19-care-homes-study-report/vivaldi-1-covid-19-care-homes-study-report
  63. UK government. Get coronavirus tests for a care home. GOV.UK. 2020d. [July 30, 2020]. https://www.gov.uk/apply-coronavirus-test-care-home
  64. Williamson EJ, Walker AJ, Bhaskaran K, Bacon S, Bates C, Morton CE, Curtis HJ, Mehrkar A, Evans D, Inglesby P, Cockburn J, McDonald HI, MacKenna B, Tomlinson L, Douglas IJ, Rentsch CT, Mathur R, Wong AYS, Grieve R, Harrison D, Forbes H, Schultze A, Croker R, Parry J, Hester F, Harper S, Perera R, Evans SJW, Smeeth L, Goldacre B. OpenSAFELY: factors associated with COVID-19 death in 17 million patients. Nature. 2020;584:430–436. doi: 10.1038/s41586-020-2521-4.
  65. Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–269. doi: 10.1038/s41586-020-2008-3.
  66. Yu G, Smith DK, Zhu H, Guan Y, Lam TT‐Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. 2017;8:28–36. doi: 10.1111/2041-210X.12628.
  67. Zhang J, Litvinova M, Wang W, Wang Y, Deng X, Chen X, Li M, Zheng W, Yi L, Chen X, Wu Q, Liang Y, Wang X, Yang J, Sun K, Longini IM, Halloran ME, Wu P, Cowling BJ, Merler S, Viboud C, Vespignani A, Ajelli M, Yu H. Evolving epidemiology and transmission dynamics of coronavirus disease 2019 outside Hubei Province, China: a descriptive and modelling study. The Lancet Infectious Diseases. 2020;20:793–802. doi: 10.1016/S1473-3099(20)30230-9.

Decision letter

Editor: Amy Wesolowski

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Given the importance of understanding the role of care homes and the burden of SARS-CoV-2 in older individuals, understanding how transmission is occurring within and between these facilities is of key epidemiological importance. Further, integrating genomic with epidemiological information can provide novel insight into transmission patterns that would otherwise be nearly impossible to disentangle. This manuscript is able to leverage a wealth of information to add important insight into transmission clusters that can illuminate important factors dictating the epidemiological dynamics.

Decision letter after peer review:

Thank you for submitting your article "Genomic epidemiology of COVID-19 in care homes in the East of England" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Miles Davenport as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Additional background needed: transcluster method, Meredith paper, Figure 6.

Sequencing discrepancy:

This manuscript combines sequencing and epidemiological analysis to understand SARS-CoV-2 transmission in care home settings. The authors aim to answer questions related to the burden of care home-associated cases, outcomes for care home residents, and the transmission dynamics of the virus within care homes. These questions are laid out very clearly in this well-written manuscript, and the results are thoughtful and informative. The authors also do a nice job of outlining the limitations of their method in the Discussion.

1) While Figure 5 compares the viral lineages observed in care and non-care home individuals, I did not see non-care home sequences on the phylogenetic tree in Figure 6. Why is this? Incorporating additional sequences from individuals not in care homes from the same region (from both this study and other previously published studies, if available) could reinforce that there were transmission clusters within care homes (i.e., if all care home sequences were more closely related to each other than any other sequences). It appears that there are sequences from both, and showing the relationship between care homes and non-care homes would help make sense of their results and ensure that the clustering conclusions are not artifacts of the sampling method. Why were only a few of the 10 care homes further discussed in the paper? Also, a minor point, but panel A is not further discussed in the manuscript.

2) Did the sequencing include the use of negative controls to detect possible contamination? Contamination is a common issue when using the ARTIC primers due to the many cycles of PCR amplification, and I have a hard time accepting sequencing data produced using this method without explicit description of the contamination controls or clean lab practices used. This information would be very helpful to see in the Materials and methods and would help promote careful validation of sequencing data.

3) Similarly, did the authors investigate the ONT/Illumina sequence pair that had two SNPs between the sequences? There is evidence that ONT sequencing in particular has biases at particular areas within the SARS-CoV-2 genome (e.g., https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473, https://www.medrxiv.org/content/10.1101/2020.08.13.20174136v2), and it would be interesting to know if this discrepancy occurs at one of these regions, which would allow the authors to explain and possibly even correct the error. If the error cannot be easily explained, the authors should comment on why they don't think their analysis, which is based heavily on single-SNP differences between sequences, could be affected by whatever might be causing this discrepancy.

4) Overall the manuscript is fairly dense, with a substantial amount of results and information provided in the text which would be better summarized as tables. For example, the authors state that 7,406 samples from 6,600 individuals were identified, but only 18% are from care homes? If so, is it this 18% that was primarily analyzed in the manuscript? Or is this 18% only residents (and if so, are the care givers omitted)? Further, there are many values without percentages presented and it is difficult to follow which samples were considered in each section. For example, in the next section they report that 7% are from a hospital, but are these a subset of the first section? In general, it is difficult to follow how the samples/information in each section relate (or do not relate) to subsequent sections and exactly which samples (including the numbers and percentages) were analyzed.

5) The cluster results should provide more background and clarification. For example, it appears that only 60% of the data was analyzed in this section; however, it is not clear how this was decided. The authors should provide more information about the transcluster method since it is a newer method. This can help put the results in context, because currently it is not clear if the analysis considers all of the clusters together or only within-cluster transmission. In addition, the authors state that 8 SNPs separate each care home; however, they do not provide any indication of whether this suggests very close transmission amongst care homes. Additional lineage information (including statistics) would be helpful to put the data into context. Further, for outbreaks identified, information should be provided on the time these samples were collected.

Further, the sentence "We investigated the role of genomics in defining care home clusters by repeating the transcluster algorithm using the same parameters as for the main analysis but assuming all genomes were identical" is not clear enough. What do the authors mean by "the role of genomics"? Finally, a figure/table with SNPs, days separating and number of individuals per cluster would be helpful since there is currently limited investigation provided. There is limited depth provided when exploring these clusters – for example, the authors discuss a paramedic in cluster CARE0063 but they do not provide information on relatedness (SNP distance), how this sample related to the other samples in time, or any other information. In addition, if directionality cannot be inferred, this should be addressed within the limitations in the Discussion.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Genomic epidemiology of COVID-19 in care homes in the East of England" for consideration by eLife. Your article has been evaluated by the previous reviewers, and the evaluation has been overseen by a Reviewing Editor and Miles Davenport as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Summary:

Based on the revision, we believe there are still a few areas that require revision and additional detail prior to acceptance. We recognize the importance of the ethical and IRB process, however it is not clear why some information about the date samples were collected could not be provided. Oftentimes IRBs will allow the time between samples or more aggregated dates to be provided. Given that the method used is highly reliant on the timing of the samples, this information is needed to evaluate the work.

Essential Revisions:

1) Timing of samples: Information is needed to put the timing of samples into context, given that the method is highly reliant on this information. For example, could distinct samples taken a substantial period of time apart be part of the same transmission chain if not all cases were sampled? Dates by epi week or month/year are often allowed by IRBs. Or perhaps the authors could add some summary statistics to Table 6, e.g. span of time between samples in the care home and mean/spread of days between samples? This would be aggregate and could not be connected to a specific sample. Without additional information, the work is not transparent to the level that it can be properly evaluated. Additionally, we may have missed it but we don't think the authors specified the generation time they are using when running transcluster? Details on this and the uncertainty around this estimate would help, especially when compared to even relative/aggregate timing information (and a few more details on sampling). For example, we would be quite convinced that two samples taken 1-2 months apart could be ruled out to be separate introductions if there were no confirmed cases between them and the generation time was a few days with a small confidence interval.

2) Sampling bias: The revision does not adequately address the issues of sampling bias. For example, what was the depth of sampling in each care home? Were all positive cases sequenced? Only a proportion? What sampling bias (and subsequent impact on the results) may arise from the sampling strategies used? For example, the text states that there is genomic information "for 700 / 1,167 (60.0%) care home residents from 292 care homes", but only 10 of these care homes were analyzed (102/700 total samples). It is not clear why all of the samples were not included; this should be justified, or these additional samples should be included in the analysis. This was raised in the original reviews but was not adequately addressed in the revision.

3) Provide additional detail on the phylogeny: Additional information is needed in the text to ensure that the conclusions based on SNP differences are supported by the phylogeny. For example, seeing the tree in Figure 6A in more detail (e.g. a zoom-in panel) would allow the reader to see whether non-care home sequences fall within the colored clusters or not. This would also help address the issues of sampling bias (care and non-care home individuals) and the subsequent implications. For example, the authors say "Samples from the ten care homes with the largest number of genomes are highlighted by coloured circles on branch tips.": if I look at the first colored circle in cyan corresponding to CARE0151, based on Table 6 I would expect to see 7 samples, but I only count 5. It is overall difficult to decipher the tree. The addition of more information in the figure legend would also be beneficial and save the reader from searching for the information in the text. Nonetheless, it is surprising that the transcluster method, which defines clusters based on genetic information but also date of collection, identifies clusters with sequences that are scattered all over the tree, as in the case of clusters CARE0151, CARE0277, CARE0061 or CARE0032. The fact that the clusters are heterogeneous is reflected in the pairwise SNP difference plot, and very clearly for CARE0277, which seems to have two sub-populations, but this does not appear to be the case for CARE0151. These might need more in-depth explanation despite the observation reported in “By contrast, several care homes were “polyphyletic”, with cases distributed across the phylogenetic tree and higher pairwise SNP difference counts between samples, consistent with multiple independent introductions of the virus among residents.”, for example.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Genomic epidemiology of COVID-19 in care homes in the East of England" for further consideration by eLife. Your revised article has been evaluated by Miles Davenport (Senior Editor) and a Reviewing Editor.

Unfortunately the Editors and Reviewers felt that the manuscript is not acceptable for eLife as it has not reached the level of clarity needed to allow for the key conclusions to be evaluated and for someone else to replicate these results. In particular, the key piece of information that is missing is the amount of time between samples and clusters within the same care home that are needed to understand how the genetic data is being used to determine the clusters. While Table 6 is helpful, it still does not provide enough detail about the timing of infections – in particular the time between clusters within a care home and the mean time/distribution of time between samples within and between the care homes. These data are needed to interpret the results. Given limitations to protect the privacy of participants, information by care home – including clusters identified, dates associated with these clusters, and sampling timeframes – is acceptable as an alternative to including the care home information per sample.

Further, since the authors make a point of distinguishing between monophyletic and polyphyletic clusters, a full tree should be included in the supplementary information (while the added zoom is a nice addition, it is insufficient since it covers just one cluster).

Finally, the data sharing agreement does not meet the standards required for eLife. That is, simply stating that others may "discuss the process of signing a data sharing agreement" appears particularly subjective. The authors need to have a full data package available to anyone who requests it, or any limitations on providing the full data need to be specified. For example, if access is limited to academic (non-commercial) study or to people in a particular geographical jurisdiction, or if a confidentiality agreement is required, a draft of this agreement should be provided, together with a clear statement of what the reasons for declining a request would be. These raw data need to be accessible to allow attempts at replication as an important part of the scientific process.

eLife. 2021 Mar 2;10:e64618. doi: 10.7554/eLife.64618.sa2

Author response


Additional background needed: transcluster method, Meredith paper, Figure 6

We have provided further background on the transcluster method and the Meredith et al. paper, provided further detail on viral clustering in care homes, produced a new version of Figure 6 allowing for additional conclusions, and added tables to present more of the data succinctly, as described further below.

Sequencing discrepancy:

This manuscript combines sequencing and epidemiological analysis to understand SARS-CoV-2 transmission in care home settings. The authors aim to answer questions related to the burden of care home-associated cases, outcomes for care home residents, and the transmission dynamics of the virus within care homes. These questions are laid out very clearly in this well-written manuscript, and the results are thoughtful and informative. The authors also do a nice job of outlining the limitations of their method in the Discussion.

1) While Figure 5 compares the viral lineages observed in care and non-care home individuals, I did not see non-care home sequences on the phylogenetic tree in Figure 6. Why is this? Incorporating additional sequences from individuals not in care homes from the same region (from both this study and other previously published studies, if available) could reinforce that there were transmission clusters within care homes (i.e., if all care home sequences were more closely related to each other than any other sequences). It appears that there are sequences from both, and showing the relationship between care homes and non-care homes would help make sense of their results and ensure that the clustering conclusions are not artifacts of the sampling method. Why were only a few of the 10 care homes further discussed in the paper? Also, as a minor point, panel A is not further discussed in the manuscript.

We have produced a new version of Figure 6, which includes the 700 care home resident genomes plus the 700 randomly selected non-care home genomes (as described elsewhere in the paper). This new figure demonstrates that care home infections are intermixed across the phylogenetic tree with non-care home infections. This is consistent with the virus passing between care home and non-care home settings, rather than all the care home sequences being more closely related to each other than to other sequences. We have added this point to the figure legend and to the results:

“Consistent with this, care home and non-care home samples were intermixed across the phylogenetic tree (Figure 6A), suggesting viral transmission could pass between care homes and non-care home settings.”

The above sentence also means Figure 6 panel A is now discussed further in the manuscript than it was previously.

We have added a new table (Table 6) providing more detailed epidemiology on the COVID-19 outbreaks for the 10 care homes with the largest number of cases. These data indicate that, “while care homes frequently had more than one introduction of the virus among residents (i.e. >1 cluster), there was typically a single dominant cluster responsible for the majority of cases within each care home,” now in the Results section. The same trend is seen across all care homes with >3 samples. This is an interesting observation, and we thank the reviewers for prompting us.

2) Did the sequencing include the use of negative controls to detect possible contamination? Contamination is a common issue when using the ARTIC primers due to the many cycles of PCR amplification, and I have a hard time accepting sequencing data produced using this method without explicit description of the contamination controls or clean lab practices used. This information would be very helpful to see in the Materials and methods and would help promote careful validation of sequencing data.

The reviewer’s comment that, “contamination is a common issue when using the ARTIC primers due to the many cycles of PCR amplification…” is not entirely correct: any process that requires PCR is potentially sensitive to contamination, and this is not a problem unique to “ARTIC primers”.

The workflow for nanopore sequencing performed in the Goodfellow lab at the University of Cambridge has been developed following years of field testing and validation. It relies on a directional sample flow as used in a diagnostic laboratory which includes separated pre- and post-PCR areas, with dedicated equipment for each stage of the process. All steps are performed in PCR cabinets which are cleaned using DNA removal solutions and a UV decontamination cycle run after each batch. All sequencing batches included at least one water negative control carried over from the reverse-transcription step. Mapped reads were assessed in real-time during sequencing with RAMPART (https://github.com/artic-network/rampart) and all data from batches containing a contaminated negative control were discarded before sequence assembly. This information is now included in Materials and methods.

The WSI sequencing workflow also uses negative controls and the pass rate to date related to negative controls is 90%. Sequencing read counts are considered after a clipping and minimum alignment length filtering step (corresponding to data which is used to create consensus sequence or variant calls). Such read counts for the samples analysed in this study were typically in the millions (median: 4,497,543). If such read counts for the corresponding negative controls are >100 then the samples are currently failed. This QC procedure was introduced for samples analysed on or after the 18th of April. Of the 1,007 samples analysed in this study sequenced at WSI (503 care home residents and 504 non-care home residents), 749 were sequenced once this workflow was established, 242 were sequenced before this but had a negative control and 16 did not have a negative control. If we apply the current criteria then 38 of these earlier samples would have failed (38/1400 = 2.7% of the analysed samples). 26 of these 38 samples are non-care home samples and 12 are from care homes. Of the 12 care home samples (12/700 = 1.7% total care home genomes analysed), 1 belongs to one of the "top 10" care homes with the largest number of genomes, care home CARE0063, which comprises a single cluster of 12 genomes using the transcluster algorithm, described in main text. Thus, the main result of our genomic cluster analysis (that multiple introductions are often observed in care homes, but typically a single dominant cluster causes most of the cases) would not be altered by the small number of early genomes included that would now be excluded by current criteria. This information has been added to the Materials and methods.
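The negative-control rule and the proportions quoted above can be checked with a short sketch. This is an illustration only: the function names are hypothetical and are not part of the WSI pipeline, and the only values taken from the text are the >100 read threshold and the sample counts.

```python
# Illustrative sketch; function names are hypothetical, not the WSI pipeline's.

# Threshold from the text: samples fail if the corresponding negative
# control has more than 100 filtered reads.
NEGATIVE_CONTROL_READ_LIMIT = 100


def fails_negative_control_qc(control_reads: int,
                              limit: int = NEGATIVE_CONTROL_READ_LIMIT) -> bool:
    """Return True if the negative control read count exceeds the limit."""
    return control_reads > limit


def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts quoted in the response above.
wsi_samples = 749 + 242 + 16          # sequenced after QC change, before with control, no control
print(wsi_samples)                     # 1007
print(percent(38, 1400))               # 2.7 (share of all analysed genomes)
print(percent(12, 700))                # 1.7 (share of care home genomes)
```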

3) Similarly, did the authors investigate the ONT/Illumina sequence pair that had two SNPs between the sequences? There is evidence that ONT sequencing in particular has biases at particular areas within the SARS-CoV-2 genome (e.g., https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473, https://www.medrxiv.org/content/10.1101/2020.08.13.20174136v2), and it would be interesting to know if this discrepancy occurs at one of these regions, which would allow the authors to explain and possibly even correct the error. If the error cannot be easily explained, the authors should comment on why they don't think their analysis, which is based heavily on single-SNP differences between sequences, could be affected by whatever might be causing this discrepancy.

We have looked into this pair of samples further. The two SNPs are identified as C1884T and C16351T. Neither of these is included in the list of problematic sites in the virological article mentioned by the reviewer (sites that are described as highly homoplasic and have no phylogenetic signal and/or low prevalence), and we state this in the manuscript with this reference. In our previous study (Meredith et al., 2020), out of 14 sample pairs sequenced both by Illumina at WSI and nanopore in the University of Cambridge, there were zero SNP differences at positions where both sequences made a call.

We have a new paragraph in the Materials and methods outlining reasons why pairs of sequences from the same patient may have SNP differences, and link to a formal comparison of Oxford Nanopore Technology (ONT) vs Illumina sequencing which demonstrated highly accurate SARS-CoV-2 consensus sequences (https://www.biorxiv.org/content/10.1101/2020.08.04.236893v1.full). We conclude, “Given this degree of consensus sequence accuracy, and because transcluster uses a transmission probability cut-off based on integrating pairwise SNP and temporal differences (rather than relying solely on a strict SNP cut-off), limited sequencing noise is unlikely to have a substantial impact on the clusters identified.”

4) Overall the manuscript is fairly dense, with a substantial amount of results and information provided in the text which would be better summarized as tables. For example, the authors state that 7,406 samples from 6,600 individuals were identified, but only 18% are from care homes? If so, is it this 18% that was primarily analyzed in the manuscript? Or is this 18% only residents (and if so, are the care givers omitted)? Further, there are many values without percentages presented and it is difficult to follow which samples were considered in each section. For example, in the next section they report that 7% are from a hospital, but are these a subset of the first section? In general, it is difficult to follow how the samples/information in each section relate (or do not relate) to subsequent sections and exactly which samples (including the numbers and percentages) were analyzed.

We have added a more detailed legend to Table 1, which hopefully clarifies the samples analysed:

“The total sample set for this study comprised 6,600 individuals. Of these, care home residency status could be established for 6,413 (97.2%). 1,167/6,413 (18.2%) individuals were identified as being care home residents, of which 700/1,167 (60.0%) had genomic data available that passed quality control filtering and were used for identifying care home clusters using the transcluster algorithm (described in Materials and methods and main text). The subset of individuals (464/6,600, 7.03%) that were tested at Cambridge University Hospitals (CUH) had richer metadata available and were used for analysing intensive care unit (ICU) admissions and 30-day mortality after first positive test, shown here…”

When referring to “care home residents” from the study without qualification, we are referring to the 1,167 care home residents identified in the study. Of these, 700 individuals had genomic data available and these are generally referred to as, “care home residents with genomic data”, unless it is obvious from the context. We have ensured each paragraph of the Results section includes a description of what sample set is being used for that analysis, e.g. the subset of samples tested at Cambridge University Hospitals is clearly sign-posted:

“464 / 6,600 (7%) individuals with positive COVID-19 tests were patients tested at Cambridge University Hospitals. We had access to richer metadata for this subset of patients via the hospital electronic records system.”

The 464 individuals tested at Cambridge University Hospitals (CUH) are a subset of the total 6,600 individuals in the study.

Other examples:

“Genome sequence data were available for 700 / 1,167 (60.0%) care home residents from 292 care homes (Figure 2—figure supplement 2). […] Links between care homes and hospitals were investigated for the 700 care home residents with genomic data available.[…]Potential transmission networks involving care home residents and healthcare workers (HCW) were investigated for people tested at CUH (HCW data were not available outside of CUH). This analysis comprised 54 care home residents tested at CUH and 76 HCW with genomic data available”.

Regarding the density of numerical data presented in prose, we have cut out details that can be found in tables in several places. We have added a new table (Table 7) that summarises most of the numerical data presented in the “Links between care homes and hospitals” section, allowing us to cut back some of the counts listed in main text there.

5) The cluster results should provide more background and clarification. For example, it appears that only 60% of the data was analyzed in this section; however, it is not clear how this was decided. The authors should provide more information about the transcluster method since it is a newer method. This can help put the results in context, because currently it is not clear if the analysis considers all of the clusters together or only within-cluster transmission. In addition, the authors state that 8 SNPs separate each care home; however, they do not provide any indication of whether this suggests very close transmission amongst care homes. Additional lineage information (including statistics) would be helpful to put the data into context.

We have added substantially to the background of transcluster in Materials and methods, beginning with paragraph:

“Clusters were produced using an implementation of the transcluster algorithm (Stimson et al., 2019; Tonkin-Hill, 2020). Instead of targeting the number of SNPs separating two genomes, the transcluster algorithm proposes a probabilistic alternative which estimates the number of intermediate transmission events separating two sampled genomes. The method takes into account both genetic SNP distance as well as the time at which each sample was taken. The approach models both the SNP distance and the number of intermediate hosts as a Poisson process. Using a predefined evolutionary rate as well as an estimate of the generation time (the time between transmission events), the method infers the distribution of the number of intermediate hosts separating two samples.”

We go on to describe the mathematics of transcluster in further detail.
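The idea of inferring a distribution over intermediate transmission events from two Poisson processes can be illustrated with a deliberately simplified sketch. It is not the transcluster likelihood: it ignores the sampling dates that transcluster uses, and it assumes a flat prior on the unknown time separating the two genomes. Under those assumptions, if SNPs accumulate at rate λ and transmissions occur at rate β, the number of transmissions k given N observed SNPs follows a negative binomial distribution with N + 1 "successes" and success probability λ/(λ + β). The rate values below are placeholders chosen for illustration, not the parameters used in the study.

```python
from math import comb


def prob_k_transmissions(k: int, n_snps: int,
                         snp_rate: float, transmission_rate: float) -> float:
    """P(k transmissions | n_snps SNPs) under the simplified model.

    Negative binomial obtained by mixing Poisson(transmission_rate * h)
    over h ~ Gamma(n_snps + 1, snp_rate), i.e. a flat prior on the
    separating time h updated by the observed SNP count.
    """
    p = snp_rate / (snp_rate + transmission_rate)
    return comb(n_snps + k, k) * p ** (n_snps + 1) * (1.0 - p) ** k


def prob_at_most(max_k: int, n_snps: int,
                 snp_rate: float, transmission_rate: float) -> float:
    """P(at most max_k transmissions separate the two genomes)."""
    return sum(prob_k_transmissions(k, n_snps, snp_rate, transmission_rate)
               for k in range(max_k + 1))


# Placeholder rates per day (assumptions for illustration only).
SNP_RATE = 0.08            # roughly two to three SNPs per month
TRANSMISSION_RATE = 0.2    # one transmission per 5-day generation time

for n in (0, 2, 8):
    print(n, prob_at_most(5, n, SNP_RATE, TRANSMISSION_RATE))
```

In this toy model, pairs with more SNP differences are less likely to be separated by only a few transmissions, which is the intuition behind using a probability cut-off rather than a fixed SNP threshold.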

Re: the selection of 700 genomes; this was all care home residents with genomic data available for the analysis. We have clarified in the legend to Table 1 and at the start of each Results section exactly which samples were used for each analysis. In this case:

“Genome sequence data were available for 700 / 1,167 (60.0%) care home residents from 292 care homes (Figure 2—figure supplement 2).”

SARS-CoV-2 has low overall genetic diversity, so a median of 8 SNPs separating pairwise comparisons in a geographically contained area is quite typical and does not suggest care home genomes overall were especially low in diversity when compared against each other. We put this into the context of the East of England region as a whole:

“There was a median of 8 single nucleotide polymorphisms (SNPs) separating care home genomes, compared to 9 for randomly selected non-care home samples (P=0.95, Wilcoxon rank sum test) (Figure 6—figure supplement 1), similar to the EoE region described previously (Meredith et al., 2020).”
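For readers unfamiliar with the statistic, a median pairwise SNP distance can be computed along the following lines. This is a minimal sketch, assuming aligned consensus sequences of equal length and counting differences only at positions where both sequences have an unambiguous base call; the function names and toy sequences are hypothetical, and the Wilcoxon rank sum comparison is not reproduced here.

```python
from itertools import combinations
from statistics import median

# Only unambiguous base calls are compared; N and other codes are skipped.
CALLED_BASES = set("ACGT")


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions where both aligned sequences made a call."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in CALLED_BASES and b in CALLED_BASES and a != b
    )


def median_pairwise_snps(sequences: list[str]) -> float:
    """Median SNP distance over all unordered pairs of sequences."""
    return median(snp_distance(a, b) for a, b in combinations(sequences, 2))


# Toy alignment (not real data): pairwise distances are 1, 2 and 1.
toy_alignment = ["ACGTAC", "ACGTAT", "ANGTTT"]
print(median_pairwise_snps(toy_alignment))  # 1
```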

We provide as much additional information on lineages as can be inferred from the available data:

“With ongoing viral evolution, descendent lineages of B.1 and B.1.1 also rose in frequency and were commonly found in England during the relevant time period. This suggests that the SARS-CoV-2 lineages circulating in care homes were similar to those found across the EoE outside of care homes… No new viral lineages from outside the UK were observed, which may reflect the success of travel restrictions in limiting introductions of new lineages into the general population.”

The main points we wish to make here are that the lineages inside and outside of care homes were similar, and similar to the East of England (and indeed Europe) as a whole. There were not specific “care home lineages” circulating separately from the non-care home wider community. We have added a new table (Table 5), which explicitly compares the frequency of lineage B.1.1 in early vs late time periods, for care home vs non-care home samples. We have not elaborated on other lineages (which would be complex to appreciate in a table and statistically challenging to compare) as we do not make strong claims on changes in particular lineage frequencies. We provide regional context in the references e.g. see Figure 2 of Elm et al. demonstrating the rise in frequency of lineage B.1.1 across Europe over the same time period (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7427299/).

Further, for outbreaks identified, information should be provided on the time these samples were collected.

We do not disclose the dates that care home outbreaks occurred/ samples were collected because of the risk of deductive disclosure: i.e. the risk that the combination of different anonymised data (e.g. number of cases for the care home, dates of sampling, age profile of residents etc) could be used to de-anonymise individuals. This concern is highlighted in the supplementary table of COG-UK IDs for the analysed genomes. If the reviewer is referring to the time of day the samples were collected, we do not have access to those data.

Further, the sentence "We investigated the role of genomics in defining care home clusters by repeating the transcluster algorithm using the same parameters as for the main analysis but assuming all genomes were identical" is not clear enough. What do the authors mean by "the role of genomics"?

We have explained this further in the same paragraph:

“The contribution made by genomic data in defining care home clusters was quantified. Without genomic data (or access to more detailed epidemiology such as accommodation sub-structuring within care homes), clustering can only be based on temporal differences between cases. For example, if two groups of COVID-19 cases occur several months apart within a care home they could be inferred to have resulted from (at least) two separate introductions. However, this method cannot account for multiple introductions occurring around the same time, as may happen when community transmission is high. To quantify the impact made by adding genomic data, which can distinguish between genetically dissimilar viruses introduced at similar times, the transcluster algorithm was repeated using the same parameters as for the main analysis but assuming all genomes were identical.”
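The comparison described in this paragraph can be illustrated with a toy example. The sketch below swaps in a much cruder technique than transcluster: it links two cases when they were sampled within a fixed number of days and differ by at most a fixed number of SNPs, then counts connected groups. Transcluster instead applies a cut-off to an inferred transmission probability. The function name, thresholds and data are all hypothetical.

```python
from itertools import combinations


def count_clusters(cases, max_days, max_snps, ignore_genomes=False):
    """Count single-linkage clusters among (sample_day, sequence) cases.

    With ignore_genomes=True every pair is treated as genetically
    identical, so only the temporal threshold separates clusters.
    """
    parent = list(range(len(cases)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cases)), 2):
        (day_i, seq_i), (day_j, seq_j) = cases[i], cases[j]
        snps = 0 if ignore_genomes else sum(a != b for a, b in zip(seq_i, seq_j))
        if abs(day_i - day_j) <= max_days and snps <= max_snps:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(cases))})


# Toy care home (not real data): six cases within four days,
# carrying three genetically distinct viruses.
toy_cases = [
    (0, "AAAAAAAA"), (1, "AAAAAAAA"), (2, "AAAAAAAT"),
    (1, "CCCCAAAA"), (2, "CCCCAAAA"),
    (3, "AAAAGGGG"),
]

print(count_clusters(toy_cases, max_days=7, max_snps=1))                       # 3
print(count_clusters(toy_cases, max_days=7, max_snps=1, ignore_genomes=True))  # 1
```

With the genomes ignored, the six cases collapse into one temporal cluster; with them, three separate introductions are distinguishable even though all cases occurred within a few days.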

Finally, a figure/table with SNPs, days separating and number of individuals per cluster would be helpful since there is currently limited investigation provided. There is limited depth provided when exploring these clusters – for example, the authors discuss a paramedic in cluster CARE0063 but they do not provide information on relatedness (SNP distance), how this sample related to the other samples in time, or any other information. In addition, if directionality cannot be inferred, this should be addressed within the limitations in the Discussion.

We have added a new table (Table 6) that shows a breakdown of epidemiological information for each of the 10 care homes with the largest number of genome samples. The pairwise SNP difference distributions for residents within these care homes is displayed in Figure 6B.

Our method of identifying clusters does not use a SNP difference cut-off, which is why we do not report SNP distance; instead, we use the transcluster algorithm, which integrates both SNP difference and date between sampling (used as a proxy for serial interval). We report that links for the HCW and care home residents shown in Figure 7B are in the top 1.1% of all pairwise transmission probabilities inferred using the transcluster algorithm, and the figure itself indicates the individual pairwise transmission probabilities based on their colour, as shown with the figure key.

The overall distributions of pairwise SNP and date differences within the clusters defined by transcluster are shown in supplements to Figure 7: Supplement 1 shows the distribution of pairwise transmission probabilities imputed by the transcluster algorithm (and the cut-off used to define clusters in our analysis); Supplement 2 shows how the number of clusters changes as the cut-off is changed (showing the cut-off used in our analysis); Supplement 3 shows a pairwise SNP difference histogram for samples within clusters; and Supplement 4 shows a pairwise date difference histogram for samples within clusters. We have added mentions for supplements 3 and 4 to the main text to highlight these.

We have provided more background on the transcluster algorithm in Materials and methods, which hopefully makes this methodology clearer.

Re: directionality of transmission – we have added this as a limitation in the Discussion, as suggested by the reviewer:

“Directionality of person-to-person transmission cannot be inferred from the transcluster algorithm. Inferring the likelihood of transmission direction between pairs of individuals requires integration with multiple forms of epidemiological data, yielding a probabilistic estimate (Illingworth et al., 2020).”

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential Revisions:

1) Timing of samples: Information is needed to put the timing of samples into context – given that the method is highly reliant on this information. For example, could distinct samples taken a substantial period of time apart be part of the same transmission chain if not all cases were sampled? For example, dates by epi week or month/year are often allowed by IRBs. Or perhaps the authors could add some summary statistics to Table 6, e.g. span of time between samples in the care home and mean/spread of days between samples? This would be aggregate and could not be connected to a specific sample. Without additional information, the work is not transparent to the level that it can be properly evaluated. Additionally, we may have missed it but we don't think the authors specified the generation time they are using when running transcluster? Details on this and the uncertainty around this estimate would help, especially when compared to even relative/aggregate timing information (and a few more details on sampling).

We have added 2 columns detailing date information to Table 6. These columns show the date range in days (i.e. days from first sample to last sample date) for each care home and for each cluster within each care home. The date range for each care home is typically larger than the date range for clusters within care homes, except for single-cluster care homes like CARE0314 where the date range is already small for the care home as a whole. This is consistent with transcluster identifying groups of cases occurring closer together in time. It is also interesting to note cases like CARE0263, in which all 12 residents tested positive within 3 days of each other but fell into three separate clusters (one dominant cluster of 9 cases, one cluster of 2 cases and a single separate case), consistent with the three clusters that can be seen in the phylogeny shown in Figure 6A. That is, transcluster groups cases that are linked both temporally and genetically, as expected. Without the genomic data, the three clusters in CARE0263 would have been impossible to distinguish. We have included this as an illustrative case in the Results.
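The date-range columns described above can be illustrated with a minimal sketch. The care home code, cluster labels and sampling dates below are invented for illustration, and the helper `date_ranges` is hypothetical; this is not the study's own code.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (care_home_code, cluster_id, sampling_date).
# All values are invented and do not come from the study dataset.
records = [
    ("CARE_A", 1, date(2020, 4, 1)),
    ("CARE_A", 1, date(2020, 4, 3)),
    ("CARE_A", 2, date(2020, 4, 2)),
    ("CARE_A", 3, date(2020, 4, 20)),
]


def date_ranges(rows, key):
    """Days from first to last sample for each group defined by key(row)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[key(row)].append(row[2])
    return {group: (max(dates) - min(dates)).days for group, dates in grouped.items()}


# Range across all samples from a care home.
home_ranges = date_ranges(records, key=lambda r: r[0])
# Range across samples within each cluster of a care home.
cluster_ranges = date_ranges(records, key=lambda r: (r[0], r[1]))

print(home_ranges)
print(cluster_ranges)
```

In this toy example the care home as a whole spans more days than any of its clusters, which is the pattern the response describes for multi-cluster care homes.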

The reviewer asks for aggregate data on time difference between samples within each cluster. A histogram of pairwise date differences between samples within each cluster is shown in Figure 7—figure supplement 5. We also describe these data in Results:

“Within each cluster, 673 / 775 (86.8%) of pairwise links had zero or 1 pairwise SNP differences (maximum 4), and 756 / 775 (97.5%) were sampled <14 days apart (maximum 22 days) (Figure 7—figure supplements 4-5).”

We have added Figure 7—figure supplement 6, which shows the median and interquartile range for pairwise date differences between all samples within each cluster, arranged from lowest to highest median date difference.
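As a rough illustration of the summary in that supplement, the sketch below computes pairwise date differences within one cluster and their median and quartiles. The sampling dates are invented, the function names are hypothetical, and the "inclusive" quartile method is an assumption rather than the method used in the study.

```python
from datetime import date
from itertools import combinations
from statistics import median, quantiles


def pairwise_day_differences(dates):
    """Absolute difference in days for every pair of sampling dates."""
    return [abs((a - b).days) for a, b in combinations(dates, 2)]


def summarise(diffs):
    """Return (median, lower quartile, upper quartile), or None if there are no pairs."""
    if not diffs:
        return None
    if len(diffs) == 1:
        return (diffs[0], diffs[0], diffs[0])
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    return (median(diffs), q1, q3)


# Invented sampling dates for a single hypothetical cluster of four cases.
cluster_dates = [date(2020, 4, 1), date(2020, 4, 2), date(2020, 4, 4), date(2020, 4, 8)]
diffs = pairwise_day_differences(cluster_dates)
summary = summarise(diffs)
print(sorted(diffs), summary)
```

A cluster of n samples contributes n(n-1)/2 pairwise links, so a single-sample cluster contributes none and is returned as `None` here.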

In addition, we have added an analysis of differences between sampling dates from first to last case for care homes versus clusters across the dataset:

“Clusters had a tighter distribution of sampling dates than for the total cases within each care home, as expected. For the 170 care homes with 2 or more cases with genomic data, there was a median of 9 (IQR: 4 – 15) days from the first case to the last case within each care home, compared with a median of zero (IQR 0-5) days from the first case to the last case of each cluster (P < 10⁻⁵, Wilcoxon rank sum test).”
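The rank-based comparison quoted above can be sketched as follows. The function `mann_whitney_u` is a hypothetical helper that computes only the U statistic (equivalent to the Wilcoxon rank sum statistic up to a constant), using mid-ranks for ties; the date ranges are invented, and a p-value would normally be obtained from a statistics package rather than from this sketch.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y, using mid-ranks for tied values."""
    # Pool both samples, remembering which sample each value came from (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Invented first-to-last date ranges (days); not values from the study.
care_home_ranges = [9, 4, 15, 12]
cluster_ranges = [0, 0, 5, 2]

u_homes = mann_whitney_u(care_home_ranges, cluster_ranges)
u_clusters = mann_whitney_u(cluster_ranges, care_home_ranges)
print(u_homes, u_clusters)
```

The two U values always sum to the product of the sample sizes; a value of U close to that product for the care homes indicates that care home date ranges tend to exceed cluster date ranges.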

Lastly, we have added sampling date to the list of COG-UK and GISAID IDs for samples analysed in the study to the supplementary materials.

Re: Generation time – we approximated the generation time by the serial interval, as is common in transmission studies (https://www.medrxiv.org/content/10.1101/2020.09.18.20197210v1), but recognise this was not clear in the Materials and methods. Alternative strategies for estimating the generation time have also led to similar estimates (https://www.eurosurveillance.org/content/10.2807/1560-7917.ES.2020.25.17.2000257). We have updated the corresponding sentences to read:

“The implementation of transcluster assumed a viral mutation rate of 1e-3 substitutions/site/year (Fauver et al., 2020) and generation time of five days, approximated by previous estimates of the serial interval of SARS-CoV-2 (He et al., 2020; Zhang et al., 2020). Days between first positive sampling date for pairs of individuals was used as a proxy for generation time.”

For example, we would be quite convinced that two samples taken 1-2 months apart could be ruled out to be separate introductions if there were no confirmed cases between them and the generation time was a few days with a small confidence interval.

We presume that the reviewer meant to say, “we would be quite convinced that two samples taken 1-2 months apart could be inferred to be separate introductions if there were no confirmed cases between them.”? If so, we agree that cases that occurred this far apart in time with no cases in between could be assumed to be separate introductions. The pairwise date difference between samples within each cluster is shown in Figure 7—figure supplement 5; most cases within each cluster are <14 days apart, with a maximum dispersion of up to 22 days. The distributions of differences in pairwise sampling dates between samples within each cluster are also shown in Figure 7—figure supplement 6. The transcluster clustering method is consistent with the reviewer’s intuitions.

2) Sampling bias: The revision does not adequately address the issues of sampling bias. For example, what was the depth of sampling in each care home? Were all positive cases sequenced? Only a proportion? What sampling bias (and subsequent impact on the results) may arise from the sampling strategies used? For example, the text states that there is genomic information "for 700 / 1,167 (60.0%) care home residents from 292 care homes" but only 10 of these care homes were analyzed (102/700 total samples). It is not clear why all of the samples were not included; this should be justified or these additional samples should be included in the analysis. This was included in the original reviews, however was not adequately addressed in the revision.

Every sample with an available genome at the time the analysis was run that passed sequencing quality control measures was included in the analysis. There were no exclusion criteria for genomic analysis other than the sequence quality control criteria described in Materials and methods. We attempted to sequence every sample tested in the Cambridge PHE diagnostic laboratory. The fact that genomic data were not available for 40% of care home residents is a limitation described in the Discussion, with reasons listed:

“Viral sequence data were not available for 40% of care home residents, as a result of missing samples, mismatches between sequences and metadata, genomes not passing quality control filtering using a stringent threshold (<10% missing calls), or sequences being unavailable at the time of data extraction.”

We have added to the Materials and methods stating the aim was to sequence all positive samples from the diagnostic laboratory, and list reasons why genomic sampling was incomplete:

“The study aimed to sequence all samples which tested SARS-CoV-2 PCR positive at the CMPHL during the study period. Sequencing of every positive diagnostic sample could not be performed, however, for the following reasons: (i) sample unavailability (e.g. diagnostic samples being lost or discarded before they could be collected by the sequencing team); (ii) labelling errors when assigning sequencing codes (which resulted in specimens being discarded); or (iii) metadata mismatches (if the sample did not match to a metadata record downloaded from the hospital electronic patient records system).”

There is no reason to believe that these factors were biased in any particular way. Moreover, 60% SARS-CoV-2 genomic coverage for all care home residents tested in our diagnostic laboratory is one of the highest rates of sequencing coverage for care home residents anywhere in the world.

We have re-written the caption for Figure 1 (the study flow diagram) to explicitly state how many samples were sequenced by nanopore and Illumina technologies:

“Out of 1,297 samples from 1,167 care home residents, 286 samples were assigned for nanopore sequencing on site and 833 samples for sequencing at the Wellcome Sanger Institute (WSI). Of these, 258 and 533 sequences were available and downloaded from the MRC-CLIMB server at the time of running the analysis, respectively. Of these available genomes, 224 and 522 passed sequencing quality control thresholds (described in Materials and methods), respectively. This yielded the final analysis set of 700 high-coverage genomes from care home residents (representing 292 care homes): 197 genomes sequenced on site by nanopore and 503 sequences at WSI by Illumina.”

All 700 genomes passing QC from care home residents were analysed in the transcluster analysis used to define care home clusters – none were excluded. We chose to visualise the “top 10” care homes with the largest number of genomes available in Figure 7 because we thought that displaying transmission networks for hundreds of care homes in a single, large figure would be difficult for a reader to appreciate. However, we have now added Figure 7—figure supplement 1, which shows the same network diagrams for every care home in the study with 2 or more genomes available. Care homes with only 1 genome available in the dataset are not displayed as the transmission network would consist of a single point.

One important finding of the cluster analysis is that care homes may have had multiple introductions among residents, but frequently a single cluster was responsible for the majority of cases (consistent with a substantial role of within-care home transmission driving the majority of care home cases). This is described for both the “top 10” care homes and aggregated for the entire dataset, quoting the relevant section of the previous manuscript version below:

“Of the 90 / 292 (30.8%) care homes with three or more residents with genomic data (comprising 418 / 700 (59.7%) care home residents with genomic data), 74 / 90 (82.2%) had a dominant cluster responsible for >50% of all cases in the care home.”

This analysis was limited to care homes with 3 or more cases with genomic data because 3 is the minimum number in which 2 clusters could be present and one be “dominant” (i.e. represent 2/3 cases). This still includes the majority (59.7%) of care home residents with genomic data. If the analysis is repeated for all care homes with 2 or more cases we still find the majority of care homes have one cluster comprising >50% of samples: Of the 170 / 292 (58.2%) care homes with 2 or more residents with genomic data available (comprising 578 / 700 (82.6%) care home residents with genomic data), 111 / 170 (65.3%) had a dominant cluster responsible for >50% of all cases in the care home. We have now added both numbers to the main text for transparency.
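The "dominant cluster" criterion can be made concrete with a minimal sketch. The care home names and cluster labels are invented, and `has_dominant_cluster` is a hypothetical helper that applies the >50% rule stated above.

```python
from collections import Counter


def has_dominant_cluster(cluster_labels):
    """True if the largest cluster holds more than 50% of a care home's cases."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels) > 0.5


# Invented cluster assignments per care home (one label per sequenced resident).
homes = {
    "HOME_W": ["a", "b", "c"],  # three singletons: no dominant cluster
    "HOME_X": ["a", "a", "b"],  # 2/3 of cases in one cluster: dominant
    "HOME_Y": ["a", "b"],       # 1/2 is not more than 50%: no dominant cluster
    "HOME_Z": ["a", "a"],       # single cluster: dominant
}

# Restrict to care homes with 2 or more cases, as in the repeated analysis.
eligible = {name: labels for name, labels in homes.items() if len(labels) >= 2}
n_dominant = sum(has_dominant_cluster(labels) for labels in eligible.values())
print(f"{n_dominant} / {len(eligible)} care homes have a dominant cluster")
```

With exactly 2 cases a care home can only have a dominant cluster if both cases share one cluster, which is why 3 is the smallest size at which two clusters can coexist with one of them dominant.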

Regarding “sampling bias” from the point of sample collection (as opposed to which samples were sequenced/ analysed after testing positive), we agree that opportunistic sampling for genomic epidemiology carries a risk of sampling bias, and studies such as this must pay careful attention to this and consider how/whether it could affect their main conclusions. We describe this in the Discussion:

“…the nature of diagnostic testing sites changed during the study period as regional hospitals developed their own in-house testing capacity and community testing laboratories were set up. “Pillar 2” testing in the UK was outsourced to high-throughput laboratories during April 2020 and performed an increasing proportion of community testing. It is possible that some care home residents from the same care home could have been tested through different routes, with symptomatic cases more likely to be tested in “Pillar 1” via the CMPHL (and included in this dataset), and asymptomatic screening occurring more via the Pillar 2 laboratories.”

However, we feel this issue does not detract from our fundamental conclusions on clustering within care homes:

“…most care homes in EoE only began systematic screening after the end of our study following the introduction of the UK care home testing portal on 11th May 2020. Moreover, the transcluster algorithm allows for “missing links” within a cluster (the threshold used assumed a >15% probability of infections being linked within <2 intermediate hosts), reducing the impact of missing care home cases on defined clusters.”

And finally, we warn that:

“The changing profile of COVID-19 testing in the UK between March and May 2020 should therefore be factored into all interpretations of COVID-19 epidemiology from that period.”

3) Provide additional detail on the phylogeny: Additional information is needed in the text to ensure that the conclusions based on SNP differences are supported by the phylogeny. For example, seeing the tree in Figure 6A in more detail (e.g. a zoom-in panel) would allow the reader to see if non-care home sequences fall within the colored clusters or not. This would also help address the issues of sampling bias (care and non-care home individuals) and the subsequent implications. For example, the authors say "Samples from the ten care homes with the largest number of genomes are highlighted by coloured circles on branch tips.": if I look at the first colored circle in cyan corresponding to CARE0151, based on Table 6 I would expect to see 7 samples, but I only count 5. It is overall difficult to decipher the tree. The addition of more information in the figure legend would also be beneficial and save the reader from searching for the information in the text. Nonetheless, it is surprising that the transcluster method, which defines clusters based on genetic information but also date of collection, identifies clusters with sequences that are scattered all over the tree, as in the case of clusters CARE0151, CARE0277, CARE0061 or CARE0032. The fact that the clusters are heterogeneous is reflected in the pairwise SNP difference plot, and very clearly for CARE0277 that seems to have two sub-populations, but does not appear to be the case for CARE0151. These might need more in depth explanation despite the observation reported in “By contrast, several care homes were “polyphyletic”, with cases distributed across the phylogenetic tree and higher pairwise SNP difference counts between samples, consistent with multiple independent introductions of the virus among residents.”, for example.

We have added a magnified subtree for one branch of the phylogeny in Figure 6A, focusing on the monophyletic branch for care home CARE0314. This demonstrates these genomes are identical or 1-SNP different (consistent with the box plot in panel 6B). Two non-care home genomes are also present in this clade. We have stated that care home and non-care home genomes are intermixed in the tree, consistent with viral transmission occurring between care home and non-care home settings, which is expected.

Re: The number of coloured tips on the tree, some samples close together on the tree have coloured tips “on top of each other” so the number of circles visualised may be fewer than the total plotted. We have checked and this applies to CARE0151.

Re: “heterogeneous clusters”. To be clear, Figure 6B shows pairwise SNP differences for the 10 care homes with the largest number of samples, not pairwise SNP differences within clusters defined by transcluster. Some care homes, such as CARE0314, are “monophyletic”, with low pairwise SNP difference count across all samples from that care home, and CARE0314 is identified as having a single cluster by transcluster. Other care homes, such as CARE0151, are “polyphyletic” and have higher pairwise SNP differences between all samples from that care home, and transcluster accordingly divides this care home into multiple clusters as shown in Figure 7A. That is, Figure 7 shows the clusters defined by transcluster, whereas Figure 6 shows a phylogenetic tree and pairwise SNP differences within care homes, not clusters. We have made this clearer in the legend to Figure 6:

“Clusters within each care home were defined using integrated genomic and temporal data using the transcluster algorithm and are shown in Figure 7.”

The pairwise SNP difference between samples within each cluster is shown in Figure 7—figure supplement 4, and is also summarised in main text:

“Within each cluster, 673 / 775 (86.8%) of pairwise links had zero or 1 pairwise SNP differences (maximum 4), and 756 / 775 (97.5%) were sampled <14 days apart (maximum 22 days) (Figure 7—figure supplements 4-5).”

The phylogenetic tree and pairwise SNP difference plot shown in Figure 6 illustrate the general principle of the distinction between polyphyletic and monophyletic care home cases and demonstrate that care home genomes are phylogenetically intermixed with non-care home genomes. The paper goes on to define care home clusters formally using transcluster, integrating the genomic data with temporal data. All conclusions drawn on clustering within care homes are based on the transcluster analysis, which is described at length in the two paragraphs detailed in Materials and methods and visualised in Figure 7.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Unfortunately the Editors and Reviewers felt that the manuscript is not acceptable for eLife as it has not reached the level of clarity needed to allow for the key conclusions to be evaluated and for someone else to replicate these results.

We recognise the importance of reproducibility in the scientific method. However, this must be balanced against the ethical requirement for study participant confidentiality in line with the approved ethics for the study. COG-UK has ethical approval to publicly release limited sample metadata including the person’s age, sex and county of residence and the date of sample collection. Releasing additional metadata linked to COG-UK IDs, such as whether the person was a care home resident and their relationship to other individuals via anonymised care home codes, would risk deductive disclosure. For example, it may be possible to work out who individuals are based on knowing 3 people from the same care home in Cambridgeshire who tested positive on particular dates, and knowing their age and sex. This breach of confidentiality would be unethical and violate the ethical approvals in place to do the study.

We refer to the General Medical Council (GMC, the regulatory body for medical doctors practicing in the UK) guidance on confidentiality (https://www.gmc-uk.org/ethical-guidance/ethical-guidance-for-doctors/confidentiality), section “Using and disclosing patient information for secondary purposes”, pp 37-39. The section on anonymised information states (our emphasis):

“The Information Commissioner’s Office anonymisation code of practice (ICO code) considers data to be anonymised if it does not itself identify any individual, and if it is unlikely to allow any individual to be identified through its combination with other data. Simply removing the patient’s name, age, address or other personal identifiers is unlikely to be enough to anonymise information to this standard.”

This is based on the UK Information Commissioner's Office (ICO) guidance, available here: https://ico.org.uk/media/1061/anonymisation-code.pdf

We have discussed with senior members of the COG-UK consortium (several of whom are co-authors on this manuscript) who have confirmed that care home residency status including anonymised care home codes are on the COG-UK restricted data list and cannot be released publicly.

To address the issue of reproducibility we have generated a version of the dataset with anonymised sample names – i.e. the genetic distance matrix and linked anonymised metadata required to run the transcluster analysis but without samples being linked to their COG-UK sequence codes. This anonymised dataset includes the same anonymised care home codes as used in the paper and all code so the results are fully reproducible. The data and code are publicly available via GitHub at https://github.com/gtonkinhill/SC2-care-homes-anonymised.

In particular, the key piece of information that is missing is the amount of time between samples and clusters within the same care home that are needed to understand how the genetic data is being used to determine the clusters. While Table 6 is helpful, it still does not provide enough detail about the timing of infections – in particular the time between clusters within a care home and the mean time/distribution of time between samples within and between the care homes. These data are needed to interpret the results. Given limitations to protect the privacy of participants, information by care home – including clusters identified, dates associated with these clusters, and sampling timeframes – is acceptable as an alternative to including the care home information per sample.

We have added two large supplementary tables that provide all of the requested data, expanded the description of date differences in the Results, and added another supplementary figure comparing the date range distribution for care homes and clusters.

The two new tables in Supplementary Materials are, “Sampling date ranges for care home residents with genomic data: by care home”, and, “Sampling date ranges for care home residents with genomic data: by cluster defined by the transcluster algorithm”. These provide the sampling dates for the first and last samples for every care home and cluster in the full dataset, respectively. This should provide all of the information the reviewers ask for above, including number of clusters from each care home, number of samples within each care home and cluster, and the sample date range for every care home and cluster in the study.

Summary data on within-cluster sampling date distributions are shown in Figure 7—figure supplement 7, and the time ranges from first to last samples within clusters and within care homes are described in the Results. We have added Figure 7—figure supplement 6, showing boxplot distributions of date ranges (from first to last sample dates) for care homes vs clusters.

Further, since the authors make a point to distinguish between monophyletic and polyphyletic clusters, a full tree should be included in the supplementary information (while the added zoom is a nice addition, it is lacking since it is just about one cluster).

We have added Figure 6—figure supplement 1, which is a phylogenetic tree for all samples in the study with available genomic data that passes sequencing quality control (total N = 4,445 samples). The figure is produced in high resolution, so that the interested reader can zoom in and view every branch of the tree.

Finally, the data sharing agreement does not meet the standards required for eLife. That is, simply stating that others may "discuss the process of signing a data sharing agreement" appears particularly subjective. The authors need to have a full data package available to anyone who requests it – or any limitations on providing the full data need to be specified. Eg: if it is limited to academic (non-commercial) study / people in a particular geographical jurisdiction / a confidentiality agreement is required – a draft of this agreement should be provided and a clear statement on what reasons for declining would be. These raw data need to be accessible to allow attempts at replication as an important part of the scientific process.

As above, the anonymised dataset and code now available on GitHub allows full reproducibility of the clustering results generated using the transcluster algorithm, without requiring any data sharing agreement.

If a researcher requires access to restricted metadata (including care home residency status) linked to the COG-UK sequence codes, then this will require a formal data sharing agreement with the COG-UK Consortium. Access to patient outcome information for patients treated at Cambridge University Hospitals NHS Foundation Trust (CUH) requires a data sharing agreement with CUH. Data will only be shared for public health and research purposes, not for commercial enterprise, and only with individuals working at reputable research and public health institutions for which data security can be assured. Should this be required, researchers should contact the study corresponding authors in the first instance.

[Editors' note: we include below the reviews that the authors received from another journal, along with the authors’ responses.]

Reviewer 1 (stats)

I will focus on methods and reporting. This is an interesting and valuable piece of research but I have some reservations.

The process to identify care home residents is not perfect but it is comprehensive. Some cases may be missed but I would not expect them to be many.

Major

1) It wasn't clear to me whether all testing occurring for each care home during the time period of interest would end up in the authors' database. In other words, is it possible for some of the testing taking place to be in the database (for a particular care home – except those not appearing in the data at all), but not all? If not all tests are available, this would be a severe limitation. The first implication that this may be the case appears in the Results section, as percentage of all cases compared to those in the whole of the East. But that doesn't discuss care homes specifically (which is of course hard to do without doing the research for those other cases, but in the context of consistent submissions by care homes, it is an important point).

We capture around half of all COVID-19 cases that occurred in the East of England over the study period, the remaining positive cases being sampled at other sites. We have expanded our description of where the samples included in the study derive from in the Results:

“The samples were tested at the Public Health England (PHE) Clinical Microbiology and Public Health Laboratory (CMPHL) in Cambridge, which receives samples from across the East of England. Positive cases came from 37 submitting organisations including regional hospital laboratories and community-based testing services (Supplementary file 1). The proportion of samples coming from different sources changed over the study period (Figure 1—figure supplement 2). This likely reflects a combination of regional hospitals establishing their own testing facilities, increasing availability of community testing in the UK, and the implementation of national policies that increased the scope of care home testing (Figure 1—figure supplement 3). Overall, the study population included almost half of the COVID-19 cases diagnosed in the EoE at this time (Public Health England, 2020a), with the remainder being tested at other laboratory sites.”

Figure 1—figure supplement 2 and Supplementary file 1 provide breakdowns of which sites submitted samples and were included in the study.

Specifically addressing the reviewer’s question of whether one care home could have submitted some samples to the Cambridge CMPHL (getting included in this dataset) and other samples to different testing sites (not included in this dataset), the answer is yes this is possible. As community testing expanded, more “screening” samples were sent to “Lighthouse Laboratories” (in the UK “Pillar 2”), while symptomatic cases were more likely to have been tested via Pillar 1 (the CMPHL). However, we do not believe this limitation is as “severe” as the reviewer suggests, as the transcluster algorithm we have used for defining clusters does allow for “missing links” connecting a transmission chain within a threshold number of intermediate hosts, based on the expected serial interval and mutation rate of the virus. We now address this issue explicitly as a limitation in our Discussion:

“We acknowledge several limitations to this study. First, we have not captured all of the COVID-19 cases from the East of England. Serology data indicate that 10.5% of all residents in care homes for people aged 65 and older in England had been infected with SARS-CoV-2 by early June, the majority of whom were asymptomatic (UK government, 2020c). […] The changing profile of COVID-19 testing in the UK between March and May 2020 should therefore be factored into all interpretations of COVID-19 epidemiology from that period.”

2) There is no clear justification for the selected time period. Is it because of convenience, because the data are available? Why not use the national portal data as well? Was it because it was the first outbreak within that time period? Anyway, some justification is needed.

The time period used spanned from the first positive case received in our laboratory (26th February) to the 10th May. This end-date was chosen because: (1) it captured the majority of the “first wave” of the epidemic in the East of England (as shown in the epidemic curves included); (2) owing to delays in sequencing and in genomic data becoming accessible after sample collection, the availability of genomes at the time data were pulled for analysis declined after that date; and (3) the national care home testing portal opened on 11th May 2020, which could have introduced bias in population analyses as care homes may have undergone systematic screening. As described above, testing arrangements became more complex over April and May: once systematic care home screening became more common and the Lighthouse Laboratories were running, it became more likely that different samples from the same care home would be tested at different sites, making the picture from the Cambridge CMPHL less complete. We have made this more explicit in the Materials and methods:

“The 10th May was selected as a study end-date because it encompassed the bulk of the “first wave” of the epidemic in the East of England. Furthermore, prior to the 11th May 2020, systematic screening of all residents within care homes was much less common and testing primarily occurred where there was a suspicion of an outbreak; our strategy reduced risk of bias which would have been introduced had we included systematic screening.”

3) The Mann-Whitney U test (a more commonly used name than the Wilcoxon rank-sum test) is appropriately used to compare the age of care home residents vs other cases. However, I struggle to see what the test tells us when comparing cases. How can that comparison be informative without the populations in each nursing home and each residential home? A more appropriate approach would be to model cases as count data, using a Poisson or negative binomial regression model with the care home as the unit of analysis.

In a Poisson regression model, increasing case numbers per care home was weakly associated with nursing homes relative to residential homes (odds ratio (OR) 1.21 (95% confidence interval (CI) 1.08 – 1.35), P=0.00132). However, we agree with the reviewer’s point that this needs to be interpreted in the context of the populations in each home. The CQC dataset includes number of beds registered with each care home. This could be a rough proxy for number of residents in each care home (though bed occupancy may not be 100%, and there may be some turnover of new patients into the beds over time). When using positive cases per CQC registered beds for each care home, there was a slightly higher positivity rate in residential homes than in nursing homes: median 0.063 (IQR 0.033 – 0.11) cases per bed vs median 0.048 (IQR 0.026 – 0.066), respectively (P=0.008322, Wilcoxon rank sum test). We are not sure of the significance or applicability of this observation, and it is tangential to the main narrative of the paper. We have therefore removed the comment on different case numbers between residential and nursing homes from the results.
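The per-bed normalisation described above can be sketched briefly. The case counts and bed numbers are invented, the field names are hypothetical, and registered beds are used only as the rough proxy for resident numbers that the response describes.

```python
from statistics import median

# Invented care home records; "beds" stands in for the number of registered beds.
homes = [
    {"type": "residential", "cases": 3, "beds": 30},
    {"type": "residential", "cases": 2, "beds": 40},
    {"type": "nursing", "cases": 4, "beds": 80},
    {"type": "nursing", "cases": 3, "beds": 50},
]


def median_cases_per_bed(rows, home_type):
    """Median of positive cases per registered bed across homes of one type."""
    rates = [
        row["cases"] / row["beds"]
        for row in rows
        if row["type"] == home_type and row["beds"] > 0
    ]
    return median(rates)


residential_rate = median_cases_per_bed(homes, "residential")
nursing_rate = median_cases_per_bed(homes, "nursing")
print(f"residential: {residential_rate:.3f}, nursing: {nursing_rate:.3f}")
```

Dividing by beds changes the unit of comparison from raw counts per home to an approximate positivity rate, which is why the direction of the difference between home types can differ from that seen with raw case counts.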

4) Another issue is what the genome sequencing brings to the epidemiology of COVID-19. The only thing I can see is whether the outbreak is monophyletic or polyphyletic. I am not sure a clear argument about the importance of this, from the public health point of view, has been made; perhaps the authors need to make this clearer.

This is an important point; we have expanded on the role of genomics in the analysis and in addressing public health questions in several places.

We repeated the viral clustering algorithm assuming all genomes were identical; this effectively eliminates the contribution of genomics to the clustering so only temporal differences between samples are used to define clusters. This yielded 316 clusters, significantly fewer than the 409 yielded when genomics was included. This shows that without genomics, distinct transmission events occurring around the same time cannot be distinguished, so viruses are grouped incorrectly into the same clusters when they are actually more likely to be distinct introductions. This analysis has been added to the Results section describing the cluster work:

“We investigated the role of genomics in defining care home clusters by repeating the transcluster algorithm using the same parameters as for the main analysis but assuming all genomes were identical to each other. This yielded 316 clusters from the 700 residents across 292 care homes – 23% fewer than the 409 clusters yielded when incorporating genomics. This suggests that genomics makes a significant contribution to defining viral clusters; without genomic data, separate transmission events occurring around the same time in a care home cannot be readily distinguished (so cluster sizes are over-estimated and the number of separate viral introductions is under-estimated).”
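The effect described above can be illustrated with a deliberately simplified sketch. This is not the transcluster algorithm, which uses a probabilistic transmission model; it is a plain threshold-based single-linkage clustering, and the thresholds, function name and toy genomes are all assumptions made for illustration. Treating every genome as identical merges introductions that occur close together in time.

```python
from itertools import combinations

def count_clusters(cases, max_days, max_snps, use_genomics=True):
    """Count single-linkage clusters among cases from one care home.

    cases: list of (sample_day, genome) tuples, genomes as equal-length strings.
    Two cases are linked if sampled within max_days of each other and, when
    use_genomics is True, their genomes differ at no more than max_snps sites.
    """
    parent = list(range(len(cases)))

    def find(i):
        # Follow parent pointers to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cases)), 2):
        (day_i, seq_i), (day_j, seq_j) = cases[i], cases[j]
        # Without genomics, every pair is treated as genetically identical.
        snps = sum(a != b for a, b in zip(seq_i, seq_j)) if use_genomics else 0
        if abs(day_i - day_j) <= max_days and snps <= max_snps:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(cases))})

# Hypothetical toy care home: two distinct lineages sampled in the same week.
toy_home = [(0, "AAAAAA"), (2, "AAAAAT"), (3, "CCGGAA"), (5, "CCGGAT")]
with_genomics = count_clusters(toy_home, max_days=7, max_snps=1)
without_genomics = count_clusters(toy_home, max_days=7, max_snps=1,
                                  use_genomics=False)

# Relative reduction reported in the text: 316 clusters instead of 409.
reduction = (409 - 316) / 409
print(with_genomics, without_genomics, round(reduction * 100))
```

In the toy example the two lineages are separated when genomes are compared and merged into one cluster when they are not, which mirrors the direction of the reported change from 409 to 316 clusters.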

We emphasise this point in the Discussion, e.g.:

“We defined viral clusters within each care home by integrating temporal and genetic differences between cases. This provides a “high resolution” picture of viral transmission; without genomic data, separate introductions of the virus occurring around the same time would be impossible to distinguish.”

And,

“incorporating genomic data is more accurate for excluding linked transmission than if only temporal data are available. Genomics can thus be used to “rule out” cases as being part of a linked cluster if the genetic difference is greater than would be expected given the viral mutation rate. This could be practically informative for care homes (along with other organisations at risk of COVID-19 outbreaks like factories (Middleton et al., 2020)) with implications for infection control procedures.”
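The "rule out" reasoning can be sketched numerically under strong simplifying assumptions. The sketch below assumes that SNPs accumulate as a Poisson process, a substitution rate of about 1e-3 substitutions per site per year, and a genome length of 29,903 bases; these values and the function names are assumptions for illustration, not parameters taken from the study, and the calculation is far simpler than the transcluster model.

```python
from math import exp, factorial

def expected_snps(days_apart, subs_per_site_per_year=1e-3, genome_length=29_903):
    """Expected number of SNPs accumulated over days_apart (assumed rate)."""
    return subs_per_site_per_year * genome_length * days_apart / 365.0

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    return 1.0 - sum(exp(-mean) * mean**i / factorial(i) for i in range(k))

# Two hypothetical samples taken 7 days apart that differ at 5 sites.
mean_snps = expected_snps(7)
p_at_least_5 = poisson_tail(5, mean_snps)
print(f"expected SNPs: {mean_snps:.2f}, P(>=5 SNPs): {p_at_least_5:.5f}")
```

Under these assumptions a large SNP difference between samples taken a few days apart is improbable for directly linked cases, which is the intuition behind using genomics to exclude a case from a cluster.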

An example application of integrating genomics into cluster definitions is to assess whether a care home has a single “outbreak” or unconnected transmission events. PHE currently defines an outbreak as 2 or more cases within the same care home within 14 days. Once an outbreak is declared there are significant infection control implications, such as closing to all non-essential visitors (currently for 28 days). This period resets each time a subsequent case occurs within that time period – so care homes can remain closed to visitors for extended periods, which is obviously difficult for the residents and their families. Genomics could be used to rule out certain cases as being part of a linked transmission cluster occurring within that time window, with implications for whether the care home continues to operate under its “outbreak” protocol for visitor restrictions.
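The closure rule described above can be sketched as a small calculation. The 14-day outbreak window and the resetting 28-day closure are taken from the text; the choice to start the closure on the date of the second case, the function name and the example dates are assumptions made for illustration.

```python
from datetime import date, timedelta

OUTBREAK_WINDOW = timedelta(days=14)  # 2 or more cases within 14 days
CLOSURE_PERIOD = timedelta(days=28)   # closure length, reset by each new case

def closure_periods(case_dates):
    """Return (start, end) visitor-closure periods implied by case dates."""
    dates = sorted(case_dates)
    periods = []
    for previous, this in zip(dates, dates[1:]):
        if periods and this <= periods[-1][1]:
            # New case during an active closure: the 28-day clock restarts.
            periods[-1] = (periods[-1][0], this + CLOSURE_PERIOD)
        elif this - previous <= OUTBREAK_WINDOW:
            # Second case within 14 days of the previous one: outbreak declared.
            periods.append((this, this + CLOSURE_PERIOD))
    return periods

# Hypothetical care home with three cases.
all_cases = [date(2020, 4, 1), date(2020, 4, 10), date(2020, 4, 30)]
print(closure_periods(all_cases))       # closure extended by the third case

# Same home if genomics showed the third case was not part of the cluster.
print(closure_periods(all_cases[:2]))   # closure ends 28 days after 10 April
```

In this hypothetical example, excluding the third case as genetically unlinked would shorten the closure by 20 days.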

We have also emphasised the key public health messages of the study more strongly in the Discussion – the need for strong infection prevention and control within care homes to limit transmission:

“These findings emphasise the importance of limiting viral transmission within care homes in order to prevent outbreaks. Given that SARS-CoV-2 is thought to be infectious before the onset of symptoms (He et al., 2020), isolating residents or staff when they develop symptoms is not sufficient to prevent within-care home spread once the virus has entered the care home. Certain measures may be required on an ongoing basis within care homes when there is sustained community transmission, even when no outbreak is suspected (at least until the morbidity and mortality of the virus in older people has been reduced substantially through vaccination or treatments). These may include use of appropriate Personal Protective Equipment (PPE) for staff and visitors (including visiting healthcare professionals and friends and family), rigorous hand hygiene, social distancing, and making use of larger, well-ventilated rooms for social interactions or socialising outdoors, providing that this is practical and safe (Jones et al., 2020). This is consistent with current national guidance for care homes in England (Public Health England, 2020b; UK government, 2020b). Face coverings for residents themselves when interacting socially in communal indoor areas could be considered, if acceptable to residents.”

Minor

1) Quality of the Abstract is substandard. Even after going over it a few times, I struggle to understand the design, the samples and some of the reported findings. How were samples collected? How many samples per patient? Wouldn't it make sense to report positive patients rather than positive samples? Why is the CUH subsample of 71 patients highlighted?

We have re-written the Abstract to emphasise the key points from the study, particularly the central public health point on the importance of limiting within-care home viral transmission.

Reviewer 2

This is a well written and informative manuscript which addresses transmission of SARS-CoV-2 among a particularly vulnerable population. Overall, the authors are to be commended for their diligence, not just in their sequencing efforts, but in tracing and checking the, no doubt, highly complex world of CQC records. The methodology, both for sequencing and data analysis, is comprehensively described and the conclusions reached are well supported by the data. It highlights the potential for within nursing home and between nursing home transmission, and shows clearly the potential involvement of healthcare workers. The findings should have implications for infection control, and surveillance in nursing homes.

We thank the reviewer for their comments.

A couple of areas would welcome clarification:

1) The genomes were sequenced using the Nanopore and Illumina platforms. Were a subset of samples cross-checked on the other platform to ensure the same results? I might have missed this in the appendix, but it would help ensure no disparity between different sequencing platforms.

Thank you for this comment; this is an important point. In our previous study (Meredith and Hamilton, Lancet Infectious Diseases, DOI: https://doi.org/10.1016/S1473-3099(20)30562-4), 14 genomes were sequenced on both platforms, and we found zero instances in which a different nucleotide was called between a pair of sequences. We have now checked our dataset from this study and identified 8 instances in which the same care home resident was sequenced on both Illumina and Nanopore technology. In 7 cases the pairs of sequences have zero SNP differences; in one case there are two SNP differences, both C vs T calls in different parts of the genome. We have added this analysis to the Materials and methods section, including the COG-UK IDs of the 8 pairs of genomes.
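A comparison of this kind can be sketched as a count of differing unambiguous base calls between two aligned consensus sequences. The function name, the rule of ignoring positions where either sequence has an ambiguous or missing call, and the toy sequences are assumptions for illustration; the study's own comparison may have differed in detail.

```python
def snp_differences(seq_a, seq_b):
    """List (position, base_a, base_b) where two aligned sequences differ.

    Positions are 1-based. Sites where either sequence has a call other than
    A, C, G or T (for example N or a gap) are ignored rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a in unambiguous and b in unambiguous and a != b
    ]

# Hypothetical pair: one masked site (N) and one genuine G vs T difference.
illumina = "ACGTNACGT"
nanopore = "ACGTAACTT"
print(snp_differences(illumina, nanopore))
```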

2) The study, due to testing, bases transmission modelling on symptomatic testing. Perhaps a limitation of the study is that asymptomatic transmission could not be determined. If this was included, there may have been more within and between nursing home transmission accounting for some of the cases observed.

We agree with the reviewer as prior to 11th May (when the UK national care home testing portal was introduced) care homes were unlikely to have been screening residents for COVID-19. We highlight in the Discussion the broader limitation that we have an incomplete picture of COVID-19 in the region, due to limitations on testing, asymptomatic/ pauci-symptomatic cases, some individuals getting tested at other sites not included in our study, and not all tests producing a genome that passed quality control filtering. We have also added a reference to the UK Vivaldi study, which estimated care home SARS-CoV-2 prevalence and found a high proportion of asymptomatic cases:

“We acknowledge several limitations to this study. First, we have not captured all of the COVID-19 cases from the East of England. Serology data indicate that 10.5% of all residents in care homes for people aged 65 and older in England had been infected with SARS-CoV-2 by early June, the majority of whom were asymptomatic (UK government, 2020c). The Cambridge CMPHL did not receive all of the samples from the region, though based on national data we estimate we have captured around half of the COVID-19 cases reported from EoE during the study period. We did not have viral sequence data available for 40% of care home residents, as a result of missing samples, mismatches between sequences and metadata, genomes not passing quality control filtering using a stringent threshold (<10% missing calls), or sequences being unavailable at the time of data extraction. We may therefore have underestimated viral cluster sizes.”

However, one of the strengths of modelling clusters using transcluster is that we can, to some extent, account for “missing links” in a connected transmission cluster (whether because those links were asymptomatic and not tested, or the samples did not produce usable genomes, etc). transcluster allows for a probability of transmission within a set number of intermediate hosts – in our case we chose a relatively relaxed probability (>15%) of transmission occurring within 2 or fewer intermediate hosts. We describe this in the next paragraph after the above quotation in the Discussion:

“the transcluster algorithm allows for “missing links” within a cluster (our thresholds assumed a >15% probability of infections being linked within <2 intermediate hosts), reducing the impact of missing care home cases on defined clusters.”

Reviewer 3

In this study, genomic epidemiology was used to investigate viral transmission dynamics among care homes residents and health care workers. It aimed at answering questions about the burden of care home associated COVID-19; patterns of SARS-CoV-2 spread between care home residents (single and multiple independent transmission networks); and the role of health care workers in the viral spread.

Major:

Figure 3A. In many points along the manuscript, some genomes are said to "cluster together on the phylogenetic tree", but the current plot of the phylogeny (Figure 3A) does not allow readers to inspect such clustering pattern. Please provide inset panels highlighting the referred clusters, including the statistical support of each clade, especially those suggested to represent single introductions. Please also consider adding a scale bar (in subs/site, or mutational units). It would be helpful if a full phylogeny could be provided as supplement, with labels and all the annotations suggested in this review.

We have re-produced this figure and now highlight the ten care homes with the largest sample numbers as coloured tips to the branch ends, rather than as the adjacent colour bar. Hopefully this makes the clustering easier to appreciate.

Appendix. As shown in the table, among the 6600 cases reported in the period of study, around 19% were linked to care home residents, while most of the remaining cases were associated with community acquired infections. If the ratio of “care home” to “non-care home” cases was nearly 20:80, why was the proportion of genomes sequenced from each category 50:50 (700 genomes each)? How could the over-representation of “care home resident” genomes (and consequent under-representation of non-care home ones) have impacted the phylogenetic and transmission network analysis?

We only sampled a matching set of non-care home genomes to get a sense of whether the predominant viral lineages seen in the care home residents were similar to those in non-care home residents (and in the UK/Europe as a whole), and this is only for a single figure supplementary to the main narrative. We do not use this sample of non-care home genomes anywhere else in the analysis, including the cluster analysis (transcluster) or the phylogenetic tree, so it does not impact on these analyses. We now make that point explicitly clear in the Materials and methods. To formally compare care home and non-care home genomes, we agree multiple factors would need to be controlled for between the populations including the proportion of cases from different time periods, age, county locations, etc. However, as we do not rely on the non-care home samples for any major conclusions in this study, we do not feel this is necessary here.

Appendix. Please supplement the last table in the appendix with information about: GISAID or Genbank virus ID (“strain name”) and accession numbers (if submitted to other databases), date of collection, geographic location, and other relevant metadata associated with the genomes used in this study. This information is essential for the reproducibility of the results.

The consensus fasta sequences used in our analyses are publicly available and can be downloaded via COG-UK or GISAID websites. We have added virus name and GISAID accession numbers along with the COG-UK IDs for all analysed genomes to the Supplementary Materials. When linking GISAID IDs to their corresponding COG-UK IDs for this publication we identified two samples where the sequence we analysed was different to the publicly available sequence on COG-UK: CAMB-1ADAE8 and CAMB-1AEB6C, and there were no GISAID IDs for them. After some investigation we found this was due to low coverage in an initial run, which was uploaded to COG-UK and did not pass GISAID QC filtering, but the samples were subsequently re-sequenced with higher coverage, and this was used in our analysis. We are in the process of getting the re-sequenced (high coverage) versions uploaded to COG-UK, and then they will be assigned GISAID accession numbers. This should happen in time for publication.

The public databases like GISAID include metadata such as patient age, sex, country of sampling etc. However, for reasons of patient confidentiality and information governance, not all of the metadata used in our analysis can be released publicly. This is because of the risk that, when used in combination, certain metadata could de-anonymise individuals. We have added the following explanation to the supplementary table of analysed genomes:

“Sequences have associated public metadata (also available via the COG-UK website or GISAID), including patient age, sex, collection date (if available), and location to the level of UK county. However, not all of the metadata used in this study can be released publicly. COG-UK samples are sequenced under statutory powers granted to the UK Public Health Agencies. Matched patient data is securely released to the COG-UK consortium under a data sharing framework which strictly controls the handling of patient data. The status of individuals living in a care home and groups of such care home patients are both on the consortium restricted data list. This means that this data cannot be publicly released linked to sequencing identifiers, sampling date and UK counties. This is because of the risk of deductive disclosure. If a research scientist would like to repeat our analysis using these data fields, they should write to the corresponding authors to discuss the process of signing a data sharing agreement that will allow them to access the data securely.”

Minor:

Figure 3. Please provide the number of genomes found in each bin on Panel B. To ease interpretation, consider colouring the branches/tips of the phylogeny according to the legend on panel A. If the same is done on panel 3B, that would be helpful.

We have followed the reviewer’s advice and coloured the tips of the tree by the care home, rather than using a colour bar. We have also added the number of genomes for each care home to the box plot (panel B), as suggested.

Could you please provide additional information as to why fewer than 7% of care home patients were admitted to ICUs, while 42.3% of them died? Did these deaths occur in the care homes, prior to hospitalization? This figure is especially striking when compared to those from non-care homes (21.4% and 17.3%, respectively). Were all deaths of non-care home residents associated with patients admitted to ICUs?

We have added the following addressing this point, with references, to the Discussion:

“A smaller proportion of care home residents were admitted to ICU compared with people who were not from care homes. What treatments a patient receives, including the invasive treatments provided in intensive care, are complex and individualised decisions based on risk-benefit assessments involving patients, their families and carers, and healthcare professionals (ICS, 2020; NICE, 2020). Of note, non-invasive respiratory support (such as continuous positive airway pressure, high-flow nasal oxygen therapy and non-invasive ventilation) are routinely provided outside ICU in many UK centres.”

Please also note that the mortality figures have slightly changed from original submission, (1) because we include a care home resident tested at CUH who was not included in the Cambridge COG-UK dataset (detailed in Materials and methods), bringing the total CUH care home residents from 71 to 72, and (2) because we now use mortality at 30 days post positive test, rather than mortality within the initial admission (thus including people who may have been discharged from hospital in a palliative setting who died at home or elsewhere such as in a nursing home). This is more consistent with the way mortality data is reported nationally.

It is not clear if all non-care home samples came from residents of EoE. Could you please clarify?

All samples and genomes in the study, including non-care home residents, came from the catchment area of tests performed at the Cambridge PHE Clinical Microbiology Laboratory, all of which are from the East of England. We have made this explicitly clear in the Materials and methods.

The manuscript made use of two distinct analyses to reveal potential clusters of cases: one used phylogenetics, and another used transmission networks. In both analyses, distinct elements are included: HCW, care home residents, and non-care home residents. It would be beneficial if nodes representing those elements could be highlighted in both graphs, so that they can be easily identified.

Both the phylogenetic and transcluster analyses identified clusters of residents within each care home. Neither analysis involved healthcare workers or non-care home residents, so these cannot be annotated in the phylogenetic tree or cluster analysis (every individual is a care home resident). However, we did a separate analysis focusing specifically on potential transmission links between healthcare workers and care home residents. The example discussed in the main text, involving care homes CARE0063 and CARE0273, is shown in Figure 7B and the healthcare workers are marked with a different colour to the care home residents.

"1 pairwise SNP difference"

That sentence is “zero or 1”, so either “differences” (if referring to “zero SNP differences”) or “difference” (if referring to “1 SNP difference”) would be correct. We think “zero or 1 SNP differences” reads more fluently but this may be subjective.

Appendix, Page 35: The figure in this page shows CARE0273, which is not mentioned in the main manuscript. Would that be CARE173?

The figure is correct in labelling this CARE0273. That care home is mentioned in the section of the Results on the HCW-care home link analysis:

“We also observed cases from a third care home, CARE0273, with possible transmission links to the same paramedics and carers involved in the CARE0063 cluster. These two care homes are within 1 kilometre of each other and the cases cluster together on the phylogenetic tree, raising the possibility of shared transmission between them.”

CARE0273 is not among the “top 10” care homes with the largest number of genomes and so not included in the figures showing transmission links between residents within the ten largest care homes.

Associated Data


    Supplementary Materials

    Supplementary file 1. Supplementary materials for ‘Genomic epidemiology of COVID-19 in care homes in the East of England’.
    elife-64618-supp1.docx (232.4KB, docx)
    Transparent reporting form

    Data Availability Statement

    The main analysis set comprised 700 genomes from care home residents. Additionally, a randomised selection of 700 genomes from non-care home residents was used for comparing lineage composition, and genomes from 76 healthcare workers tested at CUH were included for the analysis of care home resident-HCW transmission. Consensus fasta sequences for the 1,476 genomes are publicly accessible through the COG-UK website data section (https://www.cogconsortium.uk/data/). COG-UK also regularly deposits data into public databases such as GISAID (https://www.gisaid.org/). COG-UK sequence codes, GISAID accession IDs and virus names for the 1,476 analysed genomes are included in Supplementary file 1. Sequences generated through the COG-UK consortium have associated public metadata (available via the COG-UK website or GISAID), including patient age, sex, collection date (if available), and location to the level of UK county. COG-UK samples are sequenced under statutory powers granted to the UK Public Health Agencies. Matched patient data is securely released to the COG-UK consortium under a data sharing framework which strictly controls the handling of patient data. The status of individuals living in a care home and groups of such care home patients are both on the consortium restricted data list. This means that this data cannot be publicly released linked to their sequencing identifiers (e.g. COG-UK sequence codes). This is because of the risk of deductive disclosure, potentially compromising study participant anonymity. However, code to fully reproduce the transcluster transmission analysis using anonymised metadata is available via GitHub at: https://github.com/gtonkinhill/SC2-care-homes-anonymised (v0.1.0). The genomes are the same as those used in the study, but sample names in the genetic distance matrix and corresponding metadata have been changed from COG-UK sequence codes to anonymised sample codes. The metadata (sampling dates) has been altered from the original patient data but in a way that preserves the date-differences between samples within care homes, thus yielding an identical transcluster analysis. If a researcher requires access to restricted metadata (including care home residency status) linked to the COG-UK sequence codes, then this will require a formal data sharing agreement with the COG-UK Consortium. Access to patient outcome information for patients treated at Cambridge University Hospitals NHS Foundation Trust (CUH) requires a data sharing agreement with CUH. Data will only be shared for public health and research purposes, not for commercial enterprise, and only to individuals working at reputable research and public health institutions for which data security can be assured. Should this be required, researchers should contact the study corresponding authors in the first instance.
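The statement does not say how the sampling dates were altered. One way to change dates while preserving date-differences within each care home is to shift every sample from a given home by the same random offset; the sketch below illustrates that idea only, and its function name, offset range and example records are assumptions rather than a description of the released dataset.

```python
import random
from datetime import date, timedelta

def anonymise_dates(samples, max_shift_days=30, seed=None):
    """Shift sample dates by one random offset per care home.

    samples: list of (care_home_code, sample_date) tuples.
    Every sample from the same care home receives the same offset, so
    date-differences between samples within a home are unchanged.
    """
    rng = random.Random(seed)
    offsets = {}
    shifted = []
    for home, sample_date in samples:
        if home not in offsets:
            days = rng.randint(-max_shift_days, max_shift_days)
            offsets[home] = timedelta(days=days)
        shifted.append((home, sample_date + offsets[home]))
    return shifted

# Hypothetical records, not study data.
records = [
    ("HOME_A", date(2020, 4, 1)),
    ("HOME_A", date(2020, 4, 5)),
    ("HOME_B", date(2020, 4, 2)),
    ("HOME_B", date(2020, 4, 20)),
]
shifted_records = anonymise_dates(records, seed=1)
print(shifted_records)
```

Because a within-home clustering depends only on date-differences between samples from the same home, a per-home shift of this kind would leave such an analysis unchanged.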


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
