Version 4 of the CRU TS monthly high-resolution gridded multivariate climate dataset

Ian Harris; Timothy J Osborn; Phil Jones; David Lister

doi:10.1038/s41597-020-0453-3

2020 Apr 3;7:109. doi: 10.1038/s41597-020-0453-3

PMCID: PMC7125108 PMID: 32246091

Abstract

CRU TS (Climatic Research Unit gridded Time Series) is a widely used climate dataset on a 0.5° latitude by 0.5° longitude grid over all land domains of the world except Antarctica. It is derived by the interpolation of monthly climate anomalies from extensive networks of weather station observations. Here we describe the construction of a major new version, CRU TS v4. It is updated to span 1901–2018 by the inclusion of additional station observations, and it will be updated annually. The interpolation process has been changed to use angular-distance weighting (ADW), and the production of secondary variables has been revised to better suit this approach. This implementation of ADW provides improved traceability between each gridded value and the input observations, and allows more informative diagnostics that dataset users can utilise to assess how dataset quality might vary geographically.

Subject terms: Climate and Earth system modelling, Atmospheric dynamics

Measurement(s)	temperature • volume of hydrological precipitation • vapour pressure • wet days • cloud cover
Technology Type(s)	digital curation
Factor Type(s)	date of observation • location of observation
Sample Characteristic - Environment	climate system
Sample Characteristic - Location	Asia • Africa • Europe • Australia • North America • South America

Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.11980500

Background & Summary

The CRU TS (Climatic Research Unit gridded Time Series) dataset provides a high-resolution, monthly grid of land-based (excluding Antarctica) observations going back to 1901 and consists of ten observed and derived variables (Table 1 introduces their acronyms and other relevant information). There are no missing values in the defined domain. Individual station series are anomalised using their 1961–1990 observations, then gridded to a 0.5° regular grid using angular-distance weighting (ADW); the resulting anomaly grids are converted to actuals (actual values, i.e. not anomalies) for publication using the CRU CL v1.0 climatologies¹.

Table 1.

CRU TS variables, showing codes, units, correlation decay distances (CDDs) and precursors.

Variables	Code	Units	CDD (km)	Precursors
Variables - primary
Mean 2 m temperature	TMP	degrees Celsius	1200	None
Diurnal 2 m temperature range	DTR	degrees Celsius	750	TMN, TMX databases
Precipitation rate	PRE	mm/month	450	None
Variables - secondary
Vapour pressure	VAP	hPa	1000	TMP, DTR
Wet days (Notes 1, 2)	WET	days	450	PRE
Cloud cover	CLD	percentage	600	DTR
Variables - derived
Frost days (Note 3)	FRS	days per month	750	TMN
Minimum 2 m temperature (Note 5)	TMN	degrees Celsius	1200	TMP, DTR
Maximum 2 m temperature (Note 5)	TMX	degrees Celsius	1200	TMP, DTR
Potential evapo-transpiration (Note 4)	PET	mm/day	n/a	TMP, TMX, TMN, VAP, CLD

Note 1: A wet day is one receiving ≥0.1 mm precipitation.

Note 2: Used in diverse areas, including evaluation of satellite observations³² and evaluation of potential evapotranspiration equations³⁶.

Note 3: Also used in many areas, including dendroclimatology³⁷ and health³⁸.

Note 4: Used to calculate scPDSI for monitoring drought³⁹, and in areas including regional agronomic production⁴⁰ and river basin vegetation⁴¹.

Note 5: minimum and maximum temperatures are the monthly means of the individual daily minimum and maximum temperatures; they are not the overall minimum or maximum temperature recorded in each month.

CRU TS was first published in 2000², using ADW (angular-distance weighting) to interpolate anomalies of monthly observations onto a 0.5° grid over land surfaces (excluding Antarctica) for seven variables (Table 2). The selection of ADW as the interpolation method was made after extensive evaluation of alternatives [2, section 2b]. Updates in 2004³, 2005⁴ and yearly from 2006 to present⁵ increased the variable count to ten (Table 2), and switched to triangulation (utilising IDL functions including TRIGRID and TRIANGULATE) to effect the interpolation and perform much of the synthetic variable work (remaining code was in Fortran). Synthesised observations were interpolated onto a coarser grid (2.5°, regular) and used to ‘plug gaps’ in the observed coverage. An extensive account of these processes may be found in references² and⁵, particularly with respect to filling in gaps in coverage.

Table 2.

CRU TS major versions, showing included variables (‘X’).

Version	TMP	DTR	PRE	VAP	WET	CLD	TMN	TMX	FRS	PET
1.0	X	X	X	X	X	X			X
2.0	X	X	X	X		X
2.1	X	X	X	X	X	X	X	X	X
3.0	X	X	X	X	X	X	X	X	X
3.1	X	X	X	X	X	X	X	X	X	X
4.0	X	X	X	X	X	X	X	X	X	X

Since the first release in 2000, CRU TS has been used widely by many classes of user, in diverse research areas and applications. These include those with localised weather- and climate-dependent models (for example, river catchment⁶, agronomic⁷), those calibrating paleoclimate reconstructions⁸,⁹, those analysing climate variability¹⁰, and those needing bias correction for global¹¹ and regional climate models¹² and reanalyses¹³. Away from the sphere of climate research, users include the civil engineering¹⁴, financial¹⁵ and insurance¹⁶ sectors.

This version seeks to implement a more streamlined process, with ADW improving interpolation efficiency and accuracy, and delivering a full suite of metadata to facilitate nuanced interpretation of the gridded values and full traceability where necessary for quality control. This has been enabled by the move to a fully bespoke process, implemented in Fortran and described in the ‘Methods’ and ‘Data Records’ sections of this paper. The choice to return to ADW was driven by the need for improved traceability; it is justified by this, and supported by the comparison of interpolation methods reported in².

Monthly land station observations for seven variables (Mean, Minimum and Maximum Temperatures, Precipitation, Vapour Pressure, Wet Days and Cloud Cover) are updated regularly from several principal monthly sources: CLIMAT messages, exchanged internationally between WMO (World Meteorological Organization) countries, obtained as quality-controlled files via the UK Met Office; MCDW (Monthly Climatic Data for the World) summaries, obtained from the US National Oceanic and Atmospheric Administration (NOAA) via its National Climatic Data Center (NCDC); and updates of minimum and maximum temperatures for Australia, obtained from the Bureau of Meteorology (BoM). In addition, ad-hoc collections of stations are incorporated (after quality control checks including location, correspondence to existing holdings, and outlier checking). These observations serve to provide six ‘databases’ of monthly values (Diurnal Temperature Range being calculated from Minimum and Maximum Temperatures). Station coverage at selected dates is shown for precipitation in Fig. 1 and for temperature, DTR and vapour pressure in the three figures of Supplementary File 1, and is discussed further in the ‘Meteorological station database updating’ subsection of ‘Methods’. Figure 2 shows the overall process by which these observations, along with various static repositories, are used to derive each version of the CRU TS dataset. Further variables are derived from these, including Potential Evapotranspiration (PET), which is required by many users in the agricultural and hydrological sectors.

Fig. 1 — Station coverage for PRE (total precipitation). Decades included are 1910–1919 (a,b), 1940–49 (c,d), 1970–79 (e,f) and 2000–09 (g,h), showing station locations (left column) and resulting cover (right column). Additional cover from the background climatology is not shown. The CDD for PRE is 450 km. Stations appear if they contribute at least 75% of observations in the decade; grid cell cover is shown where a gridcell has interpolated data for at least 75% of time steps in the decade. For this reason, discontinuities may be observed between each decadal pair.

Fig. 2 — CRU TS production process. Colours show construction routes for each variable (see Table 1 for details of the variables).

Because of the overriding objective to present complete coverage of land surfaces (excluding Antarctica) from 1901 onwards, CRU TS is not necessarily an appropriate tool for assessing or monitoring global and regional climate change trends. Nevertheless, provided care is taken to identify and avoid trend artefacts caused by changing data coverage or data inhomogeneities, CRU TS can be used for global and regional trend analysis. The first issue is that unlike, for example, CRUTEM, regions uninformed by observations are not left missing but instead are replaced by the published climatology¹. This has the advantage of being a known entity, rather than an estimate, but has the unavoidable side effect of decreasing variance. Additionally, the numbers and locations of stations contributing to any grid cell will change over time. Both effects can potentially give rise to trend artefacts. This is a particular problem with high-resolution grids, if individual grid cells or small groups of grid cells are analysed without checking to see if they contain any observation stations at all, or whether they are interpolated from distant stations during one part of the record and from close stations during another period. However, the metadata provided with the CRU TS version 4 dataset enables users to understand the level of support behind each grid cell and time step, permitting informed detection of trends or masking of areas so that analysis of trends can focus on well-observed regions. Temperature, in particular, has been shown to be resilient to the problems described above: this is in part due to its long correlation decay distance (CDD) of 1200 km (Supplementary File 1). Precipitation, with its much shorter CDD of 450 km, has reduced and more time-dependent coverage (Fig. 1), and so is subject to these problems unless the data are masked prior to analysis. The second issue is that no extra homogenization is performed on the observations, so artefacts could be present where the originators have not already homogenized their data. Comparisons with other observation-based datasets at a global scale (GPCC¹⁷, UDEL¹⁸, CRUTEM¹⁹, and regional or third-party exercises²⁰⁻²²) demonstrate the robustness of the dataset at large spatial scales. Assessment of the grids is discussed further in the ‘Technical Validation’ section of this paper.

Methods

Meteorological station database updating

The process to update the databases with observations, and to derive the DTR database, is unchanged and is described in⁵. Holdings of observations vary by variable, with spatial and temporal concerns affecting cover. In Fig. 1, and in Supplementary File 1 (three figures), the left column shows station locations in different decades: a station is included only if it has valid observations for at least 75% of the decade (i.e. 90 or more monthly observations). The right column shows the resultant gridded cover, taking into account the correlation decay distances (CDDs) of the variables: again, interpolated data for a minimum of 75% of the decade (90 or more values) are needed for a grid cell to be shaded. The CDDs for the CRU TS variables were established in². Figure 1 shows PRE station cover for 1910–19, 1940–49, 1970–79 and 2000–09. There are far more PRE stations than for any other variable, but its CDD is the lowest (450 km), and so regions with sparse support have patchy coverage. The PRE database has been evaluated against other precipitation station collections in²³. In Supplementary File 1, TMP station cover (p.2) shows that in the early 20th century, even the high CDD of TMP cannot deliver full land cover: central-west Africa is the most obvious region that will default to the climatology. DTR (p.3) has far patchier cover than TMP, owing to its shorter CDD (750 km) and lower station numbers. The final figure in Supplementary File 1, VAP station cover (p.4), demonstrates the difference between the cover provided by VAP observations, and that introduced with the addition of synthetic VAP: for this reason, only two decades (1940–49 and 1970–79) are shown. The comparisons between b) and d), and between f) and h), show how essential synthetic variables are to achieving much greater land cover. Note that the synthetic VAP, as it is derived from TMP and DTR, inherits the lower of their CDDs (750 km). Cover is therefore reduced from that of TMP.
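The 75% decadal threshold used for the station and grid-cell maps can be illustrated with a minimal sketch. The function name and the use of None for a missing month are assumptions made for this illustration, not part of the CRU TS software.

```python
def meets_decade_threshold(monthly_values, min_fraction=0.75):
    """Return True if enough of a decade's monthly values are present.

    monthly_values: sequence of 120 entries for one decade, with None
    marking a missing month (the None convention is an assumption of
    this sketch).
    """
    if len(monthly_values) != 120:
        raise ValueError("expected 120 monthly values for one decade")
    present = sum(1 for v in monthly_values if v is not None)
    # 75% of 120 months is 90, the figure quoted in the text.
    return present >= min_fraction * len(monthly_values)
```

The same test applies to a grid cell if the sequence holds interpolated values instead of station observations.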

Anomalies

The first stage of the process is to convert each station series into anomalies. The mean used to construct the anomalies is based on the period 1961–1990; for each of the 12 calendar months, at least 75% of the observations in this period (23 or more of the 30 values) must be present for that month to be processed. Outlying values exceeding a threshold (±3 standard deviations, SD, for TMP; +4 SD for PRE) are omitted. This outlier-threshold check for TMP is more stringent than that for CRUTEM4.6, which uses a ±5 SD outlier check²⁴. In total the ±3 SD check removes 8.6% of the TMP values. For regions where anomalies were exceptional (>3 SD) this can potentially remove correct values. Of the 8.6%, 8.4% were negative extremes and 0.2% were positive extremes. Although the outlier checks are strict, this does not adversely affect the later, broad comparisons discussed in the ‘Technical validation’ section. While the process to construct anomalies is algorithmically unchanged from the previous version⁵, additional elements now construct a lookup table which, for each anomalised station, lists all land cells for the destination 0.5° grid that are within the correlation decay distance (CDD) for the variable in question. This improves the computational performance of the later interpolation process.
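The anomaly step for one station and one calendar month might be sketched as follows. This is an illustration under stated assumptions, not the Fortran implementation: the function name and data layout are hypothetical, the standard deviation is assumed to be taken over the 1961–1990 base period, and the symmetric threshold shown corresponds to the TMP case only (PRE uses a one-sided +4 SD check on percentage anomalies).

```python
import math
import statistics


def monthly_anomalies(series, base_start=1961, base_end=1990,
                      min_fraction=0.75, sd_limit=3.0):
    """Anomalise one calendar month of one station series.

    series: dict mapping year -> value, with None for a missing value.
    Returns a dict mapping year -> anomaly; empty if the base period
    has too few observations.
    """
    n_years = base_end - base_start + 1
    base = [series.get(y) for y in range(base_start, base_end + 1)]
    base = [v for v in base if v is not None]
    # 75% of 30 years is 22.5, so 23 values are needed.
    if len(base) < math.ceil(min_fraction * n_years):
        return {}
    normal = statistics.mean(base)
    sd = statistics.stdev(base)  # ASSUMPTION: SD of the base period
    anomalies = {}
    for year, value in series.items():
        if value is None:
            continue
        anomaly = value - normal
        # Omit values beyond the outlier threshold (symmetric, as for TMP).
        if sd > 0 and abs(anomaly) > sd_limit * sd:
            continue
        anomalies[year] = anomaly
    return anomalies
```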

Production of primary variables: TMP, DTR, PRE

Primary variables have no synthetic component. Station observations are anomalised using each station’s 1961–1990 normals (monthly averages). PRE is converted to percentage anomalies, so the lowest possible value would be −100, meaning no rain; and a percentage anomaly of 0 indicates equivalence with the 1961–1990 mean. Monthly anomaly fields are then interpolated onto the 0.5° × 0.5° target land grid using ADW. Land grid cells where no observation reaches are set to 0 (representing the climatology in anomaly space). Finally, the CRU CL published climatologies are used to convert the gridded anomalies to actuals.
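The difference between absolute anomalies (as for TMP and DTR) and percentage anomalies (as for PRE) can be shown with a short sketch. The function names are hypothetical, and the handling of a zero normal is an assumption, since the text does not say how that case is treated.

```python
def to_percent_anomaly(value, normal):
    """Percentage anomaly: -100 means no rain, 0 equals the normal."""
    if normal == 0:
        # ASSUMPTION: the source does not describe the zero-normal case.
        raise ValueError("percentage anomaly undefined for a zero normal")
    return 100.0 * (value - normal) / normal


def percent_anomaly_to_actual(anomaly_pct, climatology):
    """Convert a gridded percentage anomaly back to an actual value."""
    return climatology * (1.0 + anomaly_pct / 100.0)


def absolute_anomaly_to_actual(anomaly, climatology):
    """Convert a gridded absolute anomaly back to an actual value."""
    return climatology + anomaly
```

In both cases an anomaly of 0 returns the climatology, which is why unsupported cells set to 0 in anomaly space take the climatological value.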

Secondary variables

Secondary variables differ from primary variables in that they have fewer direct observations available. We therefore supplement these by estimating synthetic values from the primary variables. The synthetic estimates are obtained using empirical relationships with the primary variables that are unchanged from those described in⁵. What has changed in CRU TS4 is that the synthetic estimates are now calculated from the primary variable station observations rather than from the primary variable gridded values. Two advantages of this change are that (1) it is more transparent which stations have contributed to the gridded values (those with observations of the secondary variable and those with observations of the primary variable(s) needed to obtain the synthetic estimates); and (2) the interpolation of the synthetic estimates can now use the CDD of the secondary variable in deciding the distance weighting. Previously, some synthetic estimates were derived from gridded primary variables that had themselves been interpolated using the CDD of the primary variable (hence less transparent, and information from further afield than the secondary variable’s CDD would have been used). One result of this is that the coverage (the regions where the variable is not simply filled in by its climatological values) of the secondary variables is less complete than previously. However, this reduction in coverage arises from removing potentially low quality estimates that were previously made from too-distant observations.

Synthetic VAP production

Synthetic VAP observations are generated from TMP and DTR station anomalies (or from TMP station anomalies and gridded DTR anomalies where the station data do not include TMN and TMX), as well as the published CRU climatologies for TMP and VAP¹. While the process broadly follows that described in⁵, synthetic anomalies are now produced at a station level, rather than as gridded data, because this better suits the interpolation process as explained above. The VAP process is shown in Fig. 3, and the impact of the inclusion of synthetic VAP on the final gridded coverage is illustrated in Supplementary File 1 (p.4).

Fig. 3 — VAP (vapour pressure) production process. Colours show subprocesses. TDW is dewpoint temperature; SVP is saturation vapour pressure.

Synthetic WET production

The WET variable represents counts of wet days defined as having ≥0.1 mm of precipitation (section 2.4.1 of⁵). Figure 4 shows the process by which synthetic WET values are incorporated into production of the WET product. The empirical algorithm that synthesizes WET uses PRE observations, together with normals (the CRU CL 1961–1990 climatologies¹) for PRE and WET. Therefore, the PRE anomalies at a station level are converted to absolute values using the PRE normal from the enclosing gridcells, and then used in the synthesis. The absolute synthetic WET values produced go to create a synthetic WET database; this is then anomalised in the same way as the observed WET database, and both sets of anomalies are passed to the interpolation algorithm. WET is used by some users, and because rain day counts are part of the monthly messages we access, they are straightforward to add to the databases.

Fig. 4 — WET (wet-days) production process. DiM is the number of days in the month.

Synthetic CLD production

The process to generate synthetic cloud cover observations from DTR observations is as described in⁵, save that the synthetic station-based values are not gridded separately, but are fed into the main gridding process alongside the CLD observation anomalies.

Interpolation

General approach

The interpolation process implements angular-distance weighting (ADW) and is shown in Fig. 5. The station influence lookup tables produced as part of the anomaly process (described in the ‘Anomalies’ subsection of ‘Methods’) are used to allocate station anomalies to an array of gridcells that, for each monthly time step and cell, stores the nearest eight or fewer anomalies lying within the relevant CDD. Once the observed anomalies have been allocated, and if a secondary variable is being processed, synthetic anomalies are then allocated in the same way. However, they are excluded if within 25 km of either an observed anomaly, another synthetic anomaly, or the centre of the target cell; and if they lie within a 45° subtended angle of an observed anomaly. Additionally, they cannot replace an observed anomaly: the maximum of eight anomalies applies throughout. Once all allocations have been made, distance and (angular) separation weights are calculated (section 2b of²), and used to obtain an interpolated anomaly value for each gridcell. Any land cells without allocated anomalies are set to zero, representing the climatology in anomaly space. Elevation is not specifically included in the interpolation; it is introduced via the climatologies when the gridded anomalies are converted to absolute values (‘Production of absolutes’ in ‘Data records’). Results of a cross-validation exercise to quantify the accuracy of the ADW interpolation scheme are reported in the ‘Technical validation’ section.
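A minimal sketch of the allocation and weighting for one grid cell is given below. It is not the authors' Fortran code. It assumes the ADW weight form as we understand it from section 2b of², namely a distance weight w_k = (e^(−d_k/CDD))^m and a combined weight W_k = w_k [1 + Σ w_l (1 − cos θ_kl) / Σ w_l] summed over the other stations l. It also assumes flat-plane coordinates in km for brevity, and it omits the exclusion rules for synthetic anomalies. All names are hypothetical.

```python
import math


def adw_interpolate(cell_xy, stations, cdd_km, m=4, max_stations=8):
    """Interpolate an anomaly to one cell with angular-distance weighting.

    cell_xy: (x_km, y_km) of the cell centre on a flat plane (assumption).
    stations: list of (x_km, y_km, anomaly) tuples.
    Returns 0.0 (climatology in anomaly space) if no station is in range.
    """
    cx, cy = cell_xy
    candidates = []
    for x, y, anomaly in stations:
        d = math.hypot(x - cx, y - cy)
        if d <= cdd_km:
            bearing = math.atan2(y - cy, x - cx)
            candidates.append((d, bearing, anomaly))
    if not candidates:
        return 0.0
    # Keep the nearest eight (or fewer) anomalies within the CDD.
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:max_stations]
    dist_w = [math.exp(-d / cdd_km) ** m for d, _, _ in nearest]
    weights = []
    for k, (_, bearing_k, _) in enumerate(nearest):
        num = 0.0
        den = 0.0
        for l, (_, bearing_l, _) in enumerate(nearest):
            if l == k:
                continue
            # Stations well separated in angle from k raise k's weight.
            num += dist_w[l] * (1.0 - math.cos(bearing_k - bearing_l))
            den += dist_w[l]
        angular = num / den if den > 0 else 0.0
        weights.append(dist_w[k] * (1.0 + angular))
    total = sum(weights)
    return sum(w * a for w, (_, _, a) in zip(weights, nearest)) / total
```

Because the weights are normalised, a lone station in this sketch passes its anomaly to the cell undamped, whatever its distance.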

Improvements to weighting for v4.02 and later versions

The approach to distance-weighting adopted for version 4 was taken from², a decay function of the form $(e^{-d/\mathrm{CDD}})^{m}$, where d is the distance of the station, CDD is the correlation decay distance of the variable, and m = 4 (a value arrived at after extensive sensitivity testing reported in²). However, the function was only used as part of the ADW process, when weighting more than one station to achieve an interpolated value. This resulted in unrealistic artefacts in the interpolated field. To address this, there was a need for the interpolated anomaly - or, for a single station, its anomaly - to be damped using distance-weighting as well, and a cross-validation exercise was conducted. This involved the reconstruction of every observed anomaly from every station in the process, provided at least one other station was available to interpolate from. These reconstructions were made for two basic decay functions, the original:

$$\text{damping factor} = \left(e^{-d/\mathrm{CDD}}\right)^{m} \qquad (1)$$

and a sine-based function with a slower decay at closer distances:

$$\text{damping factor} = 1 - \sin{\left(\mathrm{rad}(d/\mathrm{CDD})\right)}^{n} \qquad (2)$$

In both cases, the power m or n ranged in integer steps from 1 to 8. The interpolation process applied the selected function at all stages: the decay of a lone station anomaly with distance, the relative distance weighting in the ADW calculation, and the decay of the ADW-derived anomaly with distance. Errors were calculated as mean absolute error (MAE) for various regions, including latitude bands and 5° × 5° gridcells, as well as global values. In all cases, either Eq. (1) with m = 8 or Eq. (2) with n = 1 gave the smallest errors as a global picture: PRE was served equally by both, while TMP was served better by Eq. (2). However, this sine curve does not allow a gradual decay at close distances, resulting in unrealistic artefacts as before. Further calculations showed that increasing the power in the sine function introduced little extra error, and a value of n = 4 was selected as a compromise between the need for accuracy in the gridcells and the need to reduce or eliminate unrealistic artefacts in the field to provide a continuous surface.
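The behaviour of the two decay functions can be compared with a short sketch. One point is an assumption and should be treated with caution: the text does not define rad(d/CDD), so the sketch reads it as scaling the distance so that 0 to CDD maps onto 0 to π/2, which makes the sine-based factor fall from 1 at the station to 0 at the CDD. The function names are hypothetical.

```python
import math


def exp_damping(d_km, cdd_km, m=4):
    """Exponential decay of Eq. (1): (e^(-d/CDD))^m."""
    return math.exp(-d_km / cdd_km) ** m


def sine_damping(d_km, cdd_km, n=4):
    """Sine-based decay of Eq. (2): 1 - sin(...)^n.

    ASSUMPTION: the argument is scaled so that d = CDD corresponds to
    pi/2; the source does not spell out the rad() conversion.
    """
    x = min(max(d_km / cdd_km, 0.0), 1.0)
    return 1.0 - math.sin(0.5 * math.pi * x) ** n
```

Under this reading, at 10% of the CDD the n = 1 curve has already dropped to about 0.84 while the n = 4 curve is still above 0.99, consistent with the statement that a higher power gives a more gradual decay close to the station.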

Derived variables (TMN, TMX, FRS and PET)

TMN and TMX are derived arithmetically from the gridded absolute values of TMP and DTR, as described in⁵. FRS is derived entirely synthetically, using an empirically determined function of the gridded absolute TMN variable. Potential Evapotranspiration (PET) is calculated using the Penman-Monteith formula²⁵ explained in²⁶ (p1071–1072). For this we use the CRU TS gridded values of mean temperature, vapour pressure, cloud cover and static (temporally invariant except for the annual cycle) 1961–90 average wind field values (further described in⁵).
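The arithmetic derivation of TMN and TMX can be illustrated as below. The sketch assumes that TMP is the midpoint of TMN and TMX, so that each lies half the diurnal range from the mean; this is our reading of the relationship described in⁵, and the function name is hypothetical.

```python
def derive_tmn_tmx(tmp_c, dtr_c):
    """Return (TMN, TMX) in degrees Celsius from gridded TMP and DTR.

    ASSUMPTION: TMP = (TMN + TMX) / 2 and DTR = TMX - TMN.
    """
    half_range = 0.5 * dtr_c
    return tmp_c - half_range, tmp_c + half_range
```

For example, a cell with TMP of 10.0 °C and DTR of 8.0 °C gives TMN of 6.0 °C and TMX of 14.0 °C.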

Consistency between variables

One of the benefits of a multivariate dataset is the opportunity to present, at a point in space and time, a set of variable values that are (to an extent) internally consistent. This explains much of the design of the variable production process: TMN and TMX are consistent with TMP and DTR because they are derived from them (DTR having been previously derived from TMN and TMX observations); VAP is consistent with the temperature variables inasmuch as synthetic VAP is derived from them; similarly, the synthetic parts of WET and CLD are consistent with, respectively, PRE and DTR; and FRS and PET are entirely consistent with other variables, being wholly derived from them. Figure 6 shows the consistency relationships.

Fig. 6 — Consistency between variables. The arrowheads indicate the direction of data ensuring consistency; note that TMN and TMX are derived from DTR (and TMP), but DTR is derived from them earlier (as observations), so these lines are bidirectional. Dashed lines indicate partial consistency, where the synthetic element of the recipient variable will be consistent with the donor variable, but the observed element cannot be said to be so.

Homogeneity

As described in⁵, and in the ‘Production of primary variables’ subsection of ‘Methods’, CRU TS is not specifically homogeneous. Some National Meteorological Agencies (NMAs) homogenize their station observations, either before release or at a later stage (requiring a re-release). Therefore, many CRU TS observations have been homogenized (and also quality controlled) within each country. However, performing additional homogenization on the CRU TS databases would be complicated and not completely possible because of elements of the process, such as partly synthetic variables and the use of published climatologies. Sparse data coverage in some regions, or for some variables, is a particular limitation for applying neighbour-based homogeneity tests, as noted by⁴, where a degree of homogenization was implemented. The multivariate nature of CRU TS means that inhomogeneities identified in, for example, mean temperature data, are likely to influence other variables as well.

Comparisons with other datasets can be used to identify any large inhomogeneities that might be present in CRU TS v4. For example, partial homogeneity assessment and correction was undertaken for an earlier version (v2.1) of CRU TS⁴ and at large spatial scales and for most country averages there is close agreement between CRU TS versions with and without this additional homogenization. Other, single-variable datasets perform various homogeneity assessments on their observations, though even here there are difficulties because of reporting delays¹⁷. The CRUTEM4.6 temperature dataset incorporates homogeneity as a result of previous work and work by originating bodies [24, section 2.2]. CRU TS v4 TMP is compared with CRUTEM4.6 in the ‘Technical validation’ section and Fig. 7. These various inter-dataset comparisons do not indicate that there are any large inhomogeneities present in the CRU TS v4 dataset, unless they are also present in the comparison datasets despite these other data being subject to further homogeneity checks.

Fig. 7 — Comparisons of hemispheric and global annual temperature means. From CRU TS v4.03, UDEL v5.01, CRUTEM v4.6.0.0 (variance-adjusted), and JMA JRA-55 reanalysis. All values are anomalies with respect to 1961–1990. Separate difference plots also shown for UDEL and CRUTEM.

Data Records

External data records

The CRU TS v4.03 dataset²⁷ comprises ten variables of high-resolution global land surface gridded absolute values. The data are available in two formats: NetCDF, and space-separated ASCII text. This ensures maximum availability for the diverse users of the dataset. The files are available in decadal blocks, as well as full-length, for the same reason.

The gridded data, excepting PET, are made available alongside metadata indicating the level of station support enjoyed by each datum; this varies between 0 (no cover, climatology inserted, see ‘Interpolation’ above) and 8 (the maximum station count for interpolation). For primary (TMP, DTR, PRE) and secondary (VAP, WET, CLD) variables, the counts produced by their interpolation are used. For derived (TMN, TMX) variables, and for FRS, the DTR counts are used. Because PET is calculated from multiple variables using a Penman-Monteith formula²⁵, no meaningful station count can be produced. The station count metadata are included in the NetCDF files as a second variable (‘stn’), and are published separately as ASCII text files.
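One way a user might apply the station counts is to mask values that have no station support before analysing trends. The sketch below works on plain Python lists that are assumed to have already been read from the data and ‘stn’ variables; the function name, the list layout and the use of None as the masked value are assumptions of this illustration.

```python
def mask_unsupported(values, station_counts, min_stations=1, fill=None):
    """Replace values whose station count is below min_stations.

    values, station_counts: equal-length sequences for the same cells
    and time steps; a count of 0 means the climatology was inserted.
    """
    if len(values) != len(station_counts):
        raise ValueError("values and station_counts must align")
    return [v if n >= min_stations else fill
            for v, n in zip(values, station_counts)]
```

Raising min_stations restricts an analysis to better-observed cells, at the cost of reduced coverage.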

An interface is also provided by a file in Keyhole Markup Language (KML) and an accompanying suite of images and datafiles. This is a standard of the Open Geospatial Consortium (https://www.opengeospatial.org/standards/kml) and allows the data set to be accessed in Earth browsers such as Google Earth (https://earth.google.com/). This Google Earth interface is currently available for the TMP and PRE data, allowing access to individual grid-cell series as well as station observations in an intuitive, hierarchical structure.

CRU TS is available from the Centre for Environmental Data Analysis (CEDA: http://data.ceda.ac.uk//badc/cru/data/cru_ts/), and from the CRU website: https://crudata.uea.ac.uk/cru/data/hrg/ (which also hosts the Google Earth interface structures).

Internal data records

The CRU TS process is realised through a collection of Fortran-77 programs that are called from a master program. This arrangement has provided compartmentalisation and flexibility as the process has evolved. This section will address the data files that allow communication between the programs, organised by the program that produces the data files. All files are ASCII text, with space-separated fields, unless otherwise stated.

Anomaly production

The anomaly program produces monthly data files, listing the station anomalies for that month. Station metadata is included. Additionally, two files needed for the interpolation process are produced: a list of stations giving in grid terms the North, South, East and West bounds of their influence (based on the CDD of that variable); and a list of stations giving the co-ordinates and distances of all gridcells within that influence. Files produced by the anomaly process are used by the interpolation process. Additionally, anomalies for primary variables are used by the processes synthesizing VAP, WET and CLD.
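The second of these files, listing the gridcells within each station's influence, could be built along the following lines. This is a sketch only: the haversine great-circle distance, the spherical Earth radius and all names are assumptions, and the actual Fortran program may compute distances differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # ASSUMPTION: mean spherical Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def cells_within_cdd(stn_lat, stn_lon, cdd_km, land_cells):
    """List (lat, lon, distance_km) for land cell centres within the CDD.

    land_cells: iterable of (lat, lon) cell centres in degrees.
    The result is sorted by distance, nearest first.
    """
    found = []
    for lat, lon in land_cells:
        d = great_circle_km(stn_lat, stn_lon, lat, lon)
        if d <= cdd_km:
            found.append((lat, lon, d))
    found.sort(key=lambda c: c[2])
    return found
```

Precomputing this list once per station avoids repeating the distance search for every monthly time step during interpolation.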