Meteorological data rescue: Citizen science lessons learned from Southern Weather Discovery

Andrew M Lorrey; Petra R Pearce; Rob Allan; Clive Wilkinson; John-Mark Woolley; Emily Judd; Stuart Mackay; Sudhir Rawhat; Laura Slivinski; Sally Wilkinson; Ed Hawkins; Patrick Quesnel; Gilbert P Compo

doi:10.1016/j.patter.2022.100495

2022 May 27;3(6):100495.


PMCID: PMC9214331 PMID: 35755873

Summary

Daily weather reconstructions (called “reanalyses”) can help improve our understanding of meteorology and long-term climate changes. Adding undigitized historical weather observations to the datasets that underpin reanalyses is desirable; however, the time available to capture those data from a range of archives is usually limited. Southern Weather Discovery is a citizen science data rescue project that recovered tabulated handwritten meteorological observations from ship log books and land-based stations spanning New Zealand, the Southern Ocean, and Antarctica. We describe the Zooniverse-hosted Southern Weather Discovery campaign, highlight promotion tactics, and identify the replicate keying levels needed to obtain 100% complete transcribed datasets with minimal type 1 and type 2 transcription errors. Rescued weather observations can augment the character libraries used for optical character recognition (OCR). Closer links between citizen science data rescue and OCR-based scientific data capture will accelerate weather reconstruction improvements, which can be harnessed to mitigate impacts on communities and infrastructure from weather extremes.

Keywords: data rescue, meteorology, climate, reanalysis, citizen science, Zooniverse, optical character recognition

Highlights

• Data rescue on Zooniverse allowed rapid transcription of historic weather observations
• Eight replicate data entries can be used to obtain consensus with minimal errors
• Transcribed weather observations can dramatically expand OCR character libraries

The bigger picture

Citizen science has the potential to capture historical handwritten scientific tabulated data that are not held in digital databases. However, how to undertake a citizen science campaign for that purpose is not well described, a gap we address here. Our citizen science data rescue approach constrained data keying targets, developed participant instructions using clear examples, established replication levels to maximize completeness and confidence of data transcription, and demonstrated common data rescue pitfalls. We highlight how an effective communications strategy helps to maintain project momentum. Collaborating with industry to enhance optical character recognition (OCR) capability has the benefit of accelerating data rescue progress that can rapidly augment scientific data repositories. The resulting improvements to comprehensive historical weather datasets with global coverage can support models and predictive capabilities that help mitigate impacts on society from extreme weather.

Southern Weather Discovery is a citizen science project on Zooniverse that captured handwritten historical weather observations. This descriptor article outlines how we ran that citizen science project, which can be adapted to a wide range of disciplines. We highlight replicated data keying requirements to minimize transcription errors, some common pitfalls to avoid, and the importance of a good communications strategy. Our partnership with industry on optical character recognition shows potential to harness computer vision to accelerate historical scientific data capture.

Introduction

The importance of meteorological data rescue

Historical climate research has significantly bolstered global reconstructions of daily weather, also known as reanalyses.¹,²,³,⁴ Reanalyses are valuable tools for visualizing and contextualizing local weather patterns and extreme weather events,³,⁵,⁶ as well as investigating climate variability and teleconnections.⁷,⁸,⁹ These long weather reconstructions also improve understanding of broad climate change trends in regions where observation depths are robust and can be cautiously interpreted.¹⁰

Broadly, reanalyses rely on incorporating historical observations into estimates of past conditions using modern weather models. The performance of centennial-length reanalyses, like the 20th Century Reanalysis (20CR),³,⁵ can be highly dependent on the density and accuracy of such observations throughout time. Uncertainties in historical daily weather patterns from these reanalyses can arise from a diminished spatiotemporal coverage of near-surface terrestrial and marine observations that were assimilated into the reconstruction. However, international surface observation databases (e.g., International Comprehensive Ocean-Atmosphere Data Set [ICOADS]; International Surface Pressure Databank [ISPD])¹¹,¹²,¹³,¹⁴,¹⁵,¹⁶ that underpin reanalyses are continually expanding as new data are recovered and digitized by ongoing meteorological data rescue efforts.¹⁷ Data rescue therefore creates new pathways to improve reanalyses, like 20CR, but those opportunities are heavily dependent on (1) the ability to locate missing meteorological records for areas where spatial coverage is weak, and (2) the capability and capacity to rapidly capture, transcribe, and efficiently quality control numerical observations contained in archives.

The first data rescue dependency is being addressed in parallel by individual research efforts and coordinated international initiatives. Both approaches have uncovered new historical meteorological observation sources that have led to improved visibility, management, and curation of those data.¹⁸ Examples of international coordination efforts include the international Atmospheric Circulation Reconstructions over Earth (ACRE) initiative,¹⁹,²⁰ the International Data Rescue (I-DARE) portal hosted by the World Meteorological Organization (WMO; https://www.idare-portal.org/), and the Copernicus C3S Data Rescue Service (https://data-rescue.copernicus-climate.eu/). The WMO I-DARE and Copernicus portals (https://datarescue.climate.copernicus.eu/) are currently being integrated into the same framework. Collectively, these efforts have improved the quality of recovered data, helped with resource sharing when capturing digital surrogates of original data sources, and reduced replication when obtaining archived meteorological data resources.¹⁷ The second data rescue dependency has been addressed either by individual researchers or research groups who manually transcribe historical data into digital format, or by using computer-aided recovery of text and numeric data (e.g., optical character recognition [OCR]).²¹,²²

The efficacy of the latter approach to date has, in a handful of trials, shown some promise but with significant limitations.²¹,²² However, progress to speed up transcription has been made using citizen science, which relies on individuals who are willing to voluntarily transcribe historical analogue meteorological data. Many projects have used this approach in recent years across a range of document types (see Ashcroft et al., 2016 for some Australasian examples). Pioneering efforts for handwritten tabulated observations are exemplified by OldWeather (www.oldweather.org), Meteororum ad Extremum Terrae (http://Zooniverse.org/projects/acre-ar/meteororum-ad-extremum-terrae), and Weather Rescue (www.weatherrescue.org).

In this descriptor article, we summarize our experiences from Southern Weather Discovery (SWD) (www.southernweatherdiscovery.org), a citizen science initiative hosted on the Zooniverse web platform, to show how southern hemisphere meteorological time series have been generated and quality controlled from historical ship logbooks. This case study builds on prior work that has documented preparation of historical documents and transcription tactics,²³,²⁴,²⁵,²⁶,²⁷ but also adds detail by explaining key elements of publicity and a media strategy plan that engendered public support for meteorological data rescue. We provide a stepwise account of our methods, highlighting some successes and pitfalls, that other researchers may benefit from to improve citizen science data rescue efforts for the geosciences. We provide details on the use of a data transcription interface on Zooniverse, preparation of historical documents for transcription, requirements for retrieving data from Zooniverse, and tactics to form a comprehensive observation dataset with minimal transcription errors. We also discuss serendipitous outcomes from SWD citizen science, where replicate keying of meteorological observations can be harnessed to improve artificial intelligence (AI) transcription of tabulated scientific data.
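To make the idea of replicate keying concrete, the sketch below shows one simple way that several independent transcriptions of the same table cell could be reduced to a single value. It uses plain majority voting, which is an assumption made for illustration and not necessarily the consensus rule applied in SWD. The function name `consensus`, the `min_share` threshold, and the sample pressure readings are all hypothetical.

```python
from collections import Counter


def consensus(entries, min_share=0.5):
    """Return the most common non-blank entry, or None if no value wins.

    A value wins only if it accounts for more than `min_share` of the
    non-blank entries. Majority voting is an illustrative assumption.
    """
    # Drop blank transcriptions and surrounding whitespace.
    votes = [e.strip() for e in entries if e and e.strip()]
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None


# Hypothetical example: eight volunteers key the same barometer reading.
replicates = ["29.85", "29.85", "29.35", "29.85", "", "29.85", "29.85", "29.86"]
agreed = consensus(replicates)
```

In this toy case five of the seven non-blank entries agree, so the cell is accepted; a cell with no clear winner returns `None` and could be flagged for manual review.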

Launching a data rescue mission from the antipodes

A citizen science data rescue effort was launched as a component of the New Zealand Deep South National Science Challenge (DeepSouthChallenge.co.nz [DSC]) in 2015. The DSC’s main aim is to understand the role of the Antarctic and Southern Ocean in determining New Zealand’s future climate conditions and environmental outcomes from climate changes. The focus of DSC data rescue work was to recover undigitized weather observations and use them to help assess and evaluate the New Zealand Earth System Model (NZESM).²⁸ Many late 19th and early 20th century historical weather and climate events caused damage and disruption to New Zealand’s civil infrastructure and economy (e.g., significant snowfall, floods, droughts).²⁹ Ensemble uncertainty in the 20CR analysis during the late 19th/early 20th century is large around New Zealand and the South Pacific (Figure 1), providing little insight into the atmospheric conditions leading up to these important past episodes. Thus, evaluating the quality of the NZESM using a reanalysis during those times is hampered. However, a few examples of full transcriptions and analyses of early handwritten scientific observations show potential to address this shortcoming.³⁰ Improving the efficacy of long-range reanalyses like 20CR with newly rescued data could enable further testing and more detailed validation of the NZESM, with direct applications toward improving our understanding of weather events, climate variability, and long-term changes.

Figure 1.

Twentieth Century Reanalysis ensemble spread in late 1800s and early 1900s

The Twentieth Century Reanalysis version 2c (20CRv2c) mean ensemble spread for the 1,000 hPa geopotential height (top) from 1891 to 1910 is contrasted with a spread anomaly plot (bottom) where the zonal (latitude average) mean for the same interval has been subtracted. This was the most recent version of 20CR that existed when Southern Weather Discovery (SWD) began. These plots show the effects of areas where there are relatively concentrated (e.g., New Zealand, Australia) and diminished (e.g., Amundsen Sea) observations. Outside of the south west Pacific tropics, and especially around Antarctica, there is higher uncertainty in the 20CR daily weather reconstruction. Centers of action for modes of variability that affect New Zealand (including PSA, SAM, ZW3) are indicated. Plots are courtesy of National Oceanic and Atmospheric Administration Physical Science Laboratory (NOAA PSL). The spatial extent of the ACRE Antarctica regional chapter domain in the bottom panel was the focus of SWD data rescue.
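The spread anomaly described in the caption (ensemble spread minus its zonal mean) can be sketched as follows. This is a minimal illustration on a made-up grid, not the code used to produce the NOAA PSL plots; the function name `zonal_anomaly` and the toy values are hypothetical.

```python
from statistics import fmean


def zonal_anomaly(spread):
    """Subtract each latitude row's zonal mean from every cell in that row.

    `spread` is a list of rows, one per latitude, each holding ensemble
    spread values across longitudes.
    """
    anomaly = []
    for row in spread:
        row_mean = fmean(row)  # mean along the latitude circle
        anomaly.append([value - row_mean for value in row])
    return anomaly


# Hypothetical grid: 2 latitudes x 4 longitudes of ensemble spread.
toy_spread = [
    [10.0, 12.0, 14.0, 16.0],
    [30.0, 30.0, 20.0, 40.0],
]
toy_anomaly = zonal_anomaly(toy_spread)
```

Positive anomalies mark longitudes where the spread is larger than is typical for that latitude, which is how data-sparse regions stand out from well-observed ones in such a plot.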

To achieve data rescue aims that could contribute to the DSC, ACRE Antarctica (a chapter of ACRE) was created to focus on data rescue of historical meteorological observations within the high-latitude region bounded by Australia, South America, and Antarctica.²⁰ Within that geographic domain, there are key atmospheric and oceanic centers of action that are linked to modes of climate variability including the Pacific South American Mode (PSA),³¹ Zonal Wave 3 (ZW3),³² and the Southern Annular Mode (SAM)³³,³⁴ that directly and indirectly (via teleconnections) impinge on New Zealand’s weather and regional climate conditions (Figure 1). In contrast to the broad continental expanse of the northern hemisphere and the tropics with numerous land-based weather observation stations, the ACRE Antarctica region is dominated by the South Pacific Ocean and Southern Ocean. Thus, historical observations are sparse and significant gaps in observation coverage over the southern hemisphere oceans produce large weather reconstruction uncertainties (Figures 1 and 2). This geographic predicament means any data rescue efforts need to consider limitations of coastal and maritime scientific data resources from lighthouses and seasonal coastal stations, and should harness the benefits of observations from harbor-based and ocean-going vessels. For the latter type of resource, ship log books have previously shown great potential to bolster maritime instrumental observations for the 19th and 20th centuries.³⁵,³⁶ Essential climate variables (ECVs), including atmospheric pressure, air temperature, sea surface temperature, and sea ice extent, were targeted for recovery in ACRE Antarctica’s initial funding support from the DSC. Additional financial support from the Copernicus Climate Change Service (C3S) further allowed us to refine citizen science meteorological data rescue methods, which helped to streamline the capture of historical maritime weather observations.

Figure 2.

Change in selected 20CR uncertainty metrics through time

(Top) The inter-region mean ensemble spread (uncertainty in daily reconstruction of weather) for the tropics and the ACRE Antarctica domain in the 20CR version 3 (20CRv3) for January to March. It shows progressive improvement for both regions through time from the mid-19th to mid-20th century. (Bottom) The mean ensemble spread ratio (ACRE Antarctica mean ensemble spread divided by the tropics mean ensemble spread) is a dimensionless index indicating that, despite overall 20CR improvement, uncertainty for past daily weather is still lower in the tropics (possibly a result of greater density or greater consistency of observations in that region) than in the high southern latitudes. It also shows distinct seasonal differences: mean ensemble spread uncertainty for the southern high latitudes is lowest in summer and highest in autumn and winter.
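The ratio in the bottom panel is a simple quotient, sketched below under stated assumptions. The spread values are invented for illustration and are not read from the figure; `spread_ratio` and `toy_spreads` are hypothetical names.

```python
def spread_ratio(acre_antarctica_spread, tropics_spread):
    """ACRE Antarctica mean ensemble spread divided by the tropics value."""
    if tropics_spread <= 0:
        raise ValueError("tropics spread must be positive")
    return acre_antarctica_spread / tropics_spread


# Hypothetical (ACRE Antarctica, tropics) mean spreads by year, same units.
toy_spreads = {1850: (60.0, 30.0), 1900: (45.0, 25.0), 1950: (20.0, 16.0)}
toy_ratios = {
    year: spread_ratio(south, tropics)
    for year, (south, tropics) in toy_spreads.items()
}
```

Because both terms share the same units, the ratio is dimensionless. A value above 1 means the high southern latitudes are more uncertain than the tropics, even when both spreads are shrinking through time.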

Data

Navigating happy hunting grounds for historical maritime weather data

Significant efforts have been made to photograph logbooks from ships that visited the southern hemisphere during the early to mid-20th century. Those resources are widely dispersed across a broad range of archives (see Teleti et al.³⁷ and Chappell et al.³⁸). For example, in the archives of New Zealand’s National Institute of Water and Atmospheric Research (NIWA), there are copies of published historical scientific expeditions to the Antarctic region written in English, French, Spanish, Portuguese, Russian, Norwegian, Finnish, and Swedish. Many of the original historical ship logbooks supporting those publications are held in European archives and include British merchant and immigration ships that visited New Zealand, Australia, and the South Pacific.³⁸,³⁹ In addition, significant numbers of ship logbooks exist in Scandinavia⁴⁰,⁴¹ related to whale hunting in the southern hemisphere.

An initial assessment spanning the period 1900–1960 indicates a minimum of seven million unique ship logbook weather observations for the high southern latitudes.⁴⁰,⁴¹ In addition to regional data richness, data consistency is an important element to consider, given the sheer scale of processing and formatting citizen science data (see explanations below). Many logbooks have a standard printed table format that shipboard observers completed while at sea, and some shipboard expedition reports also include land-based observations from stationary and overland traverses.⁴² Our primary focus was then narrowed to ship logbook data rescue from 1900 to 1950, a time frame that encompasses several severe weather events that affected New Zealand. Some 150,000 logbook images from more than 300 individual voyages were carefully photographed⁴³ and passed to NIWA in support of SWD to bolster mid-19th to mid-20th century sample depth and reduce ensemble uncertainty in future 20CR iterations (Figure 2). Table 1 shows details for the log books that were digitized. The experimental procedures section outlines the process of data collection for this study. It describes how we established the SWD project identity and set up a data transcription platform on Zooniverse. This section also highlights progressive changes in data-rescue tactics and the approaches we deployed to recruit personnel who transcribed data from ship logbooks and land-based meteorological registers.

Table 1.

Ship logbook meteorological observations recovered in SWD phase I hosted on Zooniverse

Unique ship	Unique logbook file name	Ship name	Log images	Images clipped	Number of clips	Clip not loaded	Year of data	Barometer uncorrected	Attached thermometer	Barometer corrected	Air temperature	Sea temperature	Total observations
1	MF911_44460_Athel Chief	Athel Chief	10	5	180	6	1946	226	200	229	0	0	655
2	MF911_19702_Bullyses	Bullyses	5	3	40	5	1930	77	78	77	59	59	350
2	MF911_19703_Bullyses	Bullyses	4	2	30	12	1930	47	47	47	47	47	235
3	MF911_32842_Cambridge	Cambridge	6	3	54	0	1935	53	53	53	53	52	264
4	MF911_27139_Canonosa	Canonosa	4	2	36	3	1933	36	35	31	36	36	174
4	MF911_28085_Canonosa	Canonosa	10	5	90	9	1933	172	173	157	173	173	848
4	MF911_29126_Canonosa	Canonosa	8	4	72	6	1934	82	82	82	81	80	407
4	MF911_29949_Canonosa	Canonosa	8	4	72	9	1934	72	71	71	72	72	358
5	MF911_24993_Coptic	Coptic	6	3	54	3	1932	66	65	66	65	65	327
5	MF911_27188_Coptic	Coptic	6	3	54	9	1933	59	60	59	60	60	298
5	MF911_32788_Coptic	Coptic	6	3	54	0	1935	61	63	63	60	61	308
5	MF911_34004_Coptic	Coptic	6	3	108	15	1936	58	58	58	58	58	290
5	MF911_37438_Coptic	Coptic	6	3	108	9	1937	64	61	63	0	0	188
6	MF911_27819_Cumberland	Cumberland	8	4	72	15	1933	65	64	64	66	66	325
6	MF911_37610_Cumberland	Cumberland	8	4	144	30	1937	71	72	72	2	2	219
6	MF911_39392_Cumberland	Cumberland	8	4	144	24	1938	68	68	68	65	66	335
7	MF911_17513_Deucalion	Deucalion	1	1	15	0	1929	21	0	37	37	37	132
7	MF911_17626_Deucalion	Deucalion	1	1	15	0	1929	20	0	27	27	27	101
8	MF911_17770_Devon	Devon	2	2	30	12	1929	42	42	38	42	41	205
8	MF911_22989_Devon	Devon	4	4	72	0	1931	173	176	169	176	175	869
9	ML911_2798_Discovery II	Discovery II	29	29	342	0	1950	450	451	450	0	0	1,351
10	MF911_19398_Dorington Courier	Dorington Courier	2	2	30	0	1930	73	73	73	72	73	364
11	MF911_35639_Dunedin Star	Dunedin Star	3	3	108	24	1936	53	55	53	51	47	259
12	MF911_39566_Durham	Durham	2	2	48	0	1938	107	107	107	107	107	535
12	MF911_41980_Durham	Durham	3	3	54	9	1939	117	117	116	116	117	583
12	ML911_900_Durham	Durham	11	11	231	27	1948	258	258	258	0	0	774
13	MF911_44536_Empire Victory	Empire Victory	11	11	198	0	1946	351	352	352	0	0	1,055
14	MF911_40649_Essex	Essex	3	3	108	18	1938	54	48	54	55	55	266
15	MF911_33390_Fordsdale	Fordsdale	3	3	54	0	1935	84	85	85	79	79	412
15	MF911_37896_Fordsdale	Fordsdale	8	4	105	39	1937	65	66	66	64	63	324
16	MF911_17310_Gloxinia	Gloxinia	4	2	30	0	1929	63	0	61	64	64	252
16	MF911_17474_Gloxinia	Gloxinia	4	2	30	0	1929	76	0	78	75	75	304
16	MF911_19208_Gloxinia	Gloxinia	2	1	15	2	1929	38	0	38	38	33	147
16	MF911_19287_Gloxinia	Gloxinia	8	3	45	12	1930	85	0	85	85	78	333
17	MF911_33108_Hertford	Hertford	8	4	72	9	1935	75	75	74	75	75	374
17	MF911_37308_Hertford	Hertford	4	2	72	15	1937	39	37	38	0	0	114
18	MF911_20476_Hororata	Hororata	4	2	30	12	1930	48	48	48	48	48	240
19	MF911_25026_Huntingdon	Huntingdon	6	3	54	0	1932	67	68	64	68	68	335
19	MF911_25770_Huntingdon	Huntingdon	8	4	72	15	1932	72	70	73	73	73	361
19	MF911_26744_Huntingdon	Huntingdon	5	2	36	0	1932	40	43	43	43	43	212
19	MF911_27605_Huntingdon	Huntingdon	6	3	54	6	1933	56	55	58	59	60	288
19	MF911_30667_Huntingdon	Huntingdon	8	4	72	18	1934	68	68	67	68	68	339
19	MF911_35480_Huntingdon	Huntingdon	8	4	144	24	1936	72	72	72	69	69	354
19	MF911_37975_Huntingdon	Huntingdon	8	4	144	21	1937	78	79	80	78	78	393
20	MF911_34739_Hurunai	Hurunai	8	4	144	144	1936	148	0	146	147	146	587
21	MF911_34145_Ionic	Ionic	8	4	144	144	1936	115	117	115	112	112	571
22	ML_17955_Junie	Junie	26	20	160	160	1929	442	442	443	444	444	2,215
23	MF911_27575_Karamea	Karamea	6	3	54	54	1933	59	59	59	59	59	295
23	MF911_28249_Karamea	Karamea	8	4	72	72	1933	64	63	64	64	64	319
23	MF911_37224_Karamea	Karamea	6	3	54	54	1937	134	133	134	135	135	671
23	MF911_40779_Karamea	Karamea	7	3	108	108	1938	60	60	59	60	60	299
23	ML_18569_Karamea	Karamea	31	20	160	160	1932	441	439	441	434	435	2,190
23	ML_18678_Karamea	Karamea	29	21	168	168	1932	443	444	443	422	426	2,178
24	MF911_9415_Kiaora	Kia Ora	6	3	45	45	1925	90	0	0	90	89	269
25	MF911_44413_Lafonia	Lafonia	8	3	108	108	1946	41	41	41	0	0	123
26	MF911_39550_Loriga	Loriga	6	3	54	54	1938	121	121	120	121	118	601
27	MF911_39459_Losada	Losada	2	1	18	18	1938	30	30	30	30	30	150
28	MF911_32584_Mahana	Mahana	6	3	108	108	1935	60	53	53	59	58	283
29	MF911_26079_Mahia	Mahia	8	4	72	72	1932	74	75	74	75	75	373
29	MF911_27728_Mahia	Mahia	6	3	54	54	1933	52	52	52	52	52	260
29	MF911_37618_Mahia	Mahia	8	4	144	144	1937	71	71	71	0	0	213
29	MF911_43608_Mahia	Mahia	10	3	108	108	1946	55	55	52	50	49	261
30	ML_17885_Maimoa	Maimoa	29	21	168	13	1929	451	450	442	451	451	2,245
30	ML_18480_Maimoa	Maimoa	35	24	192	0	1931	529	523	527	407	356	2,342
30	ML_18660_Maimoa	Maimoa	32	23	184	10	1932	496	496	494	478	475	2,439
30	MF911_27208_Maimosa	Maimosa	8	4	72	6	1933	73	73	72	70	71	359
30	MF911_34685_Maimosa	Maimosa	8	4	144	16	1936	78	78	78	76	72	382
31	MF911_12243_Mamari	Mamari	4	2	30	0	1926/27	77	0	0	78	77	232
31	MF911_13623_Mamari	Mamari	6	3	45	0	1927	92	0	93	93	93	371
32	MF911_25506_Matakana	Matakana	6	3	54	0	1932	68	68	66	68	68	338
32	MF911_27290_Matakana	Matakana	6	3	54	0	1933	71	71	71	71	71	355
32	MF911_28274_Matakana	Matakana	8	4	72	9	1933	81	81	80	81	81	404
32	MF911_33110_Matakana	Matakana	8	4	72	15	1935	71	71	71	72	71	356
32	MF911_9992_Matakana	Matakana	4	2	30	0	1925	76	0	0	77	77	230
32	ML_17869_Matakana	Matakana	30	23	184	20	1929	470	463	465	467	465	2,330
33	MF911_27403_Middlesex	Middlesex	10	5	90	33	1933	138	137	138	138	138	689
34	ML_18676_Norfolk	Norfolk	29	21	168	0	1932	442	446	442	436	429	2,195
34	ML911_958_Norfolk	Norfolk	36	10	210	21	1948	233	234	231	0	0	698
35	MF911_44234_Northumberland	Northumberland	13	7	252	21	1947	292	291	296	0	0	879
36	MF911_23534_Opawa	Opawa	6	3	54	3	1931	64	64	64	64	64	320
36	MF911_26440_Opawa	Opawa	8	4	72	0	1932	62	64	61	64	64	315
36	MF911_27357_Opawa	Opawa	8	4	72	24	1933	63	64	63	64	64	318
36	MF911_28505_Opawa	Opawa	8	4	72	15	1933	70	71	70	70	71	352
36	MF911_29526_Opawa	Opawa	6	3	54	6	1934	63	63	63	63	63	315
36	MF911_30783_Opawa	Opawa	8	4	72	21	1934	66	66	66	66	66	330
36	ML_18575_Opawa	Opawa	31	19	152	0	1932	428	424	428	428	426	2,134
37	MF911_27680_Orari	Orari	6	3	54	3	1933	56	56	50	54	54	270
37	MF911_30480_Orari	Orari	4	2	36	0	1934	33	34	34	34	34	169
37	ML911_527_Orari	Orari	22	12	252	30	1947	262	262	257	0	0	781
37	ML911_81_Orari	Orari	31	12	252	30	1947	240	240	238	0	0	718
38	ML_18115_Otaki	Otaki	39	29	232	36	1929	562	563	563	456	426	2,570
39	MF911_10588_Otira	Otira	4	2	30	0	1926	40	40	39	40	40	199
39	MF911_26306_Otira	Otira	6	3	54	0	1932	77	77	72	79	79	384
39	MF911_28021_Otira_DUP	Otira	8	4	72	9	1933	79	79	79	79	79	395
40	MF911_29047_Pakeha	Pakeha	10	5	90	9	1933-34	88	89	89	89	89	444
40	ML_17655_Pakeha	Pakeha	29	22	176	14	1928	467	467	467	459	462	2,322
40	ML_17804_Pakeha	Pakeha	37	29	232	20	1928	578	581	577	580	578	2,894
40	ML_18410_Pakeha	Pakeha	32	21	168	10	1931	453	455	455	336	333	2,032
41	MF911_25239_Piako	Piako	8	4	72	6	1932	78	78	78	77	77	388
41	MF911_27502_Piako	Piako	8	4	72	9	1933	74	75	75	74	75	373
42	MF911_25348_Port Adelaide	Port Adelaide	9	5	90	1	1932	167	163	168	153	153	804
42	MF911_25349_Port Adelaide	Port Adelaide	2	1	18	0	1932	25	29	29	0	0	83
42	MF911_33271_Port Adelaide	Port Adelaide	8	4	72	0	1935	69	70	71	69	71	350
42	MF911_35042_Port Adelaide	Port Adelaide	14	7	252	23	1936	140	141	140	134	134	689
42	ML_18174_Port Adelaide	Port Adelaide	35	19	152	0	1930	417	420	414	410	383	2,044
43	MF911_25998_Port Alma	Port Alma	8	4	72	0	1932	75	76	76	76	76	379
43	MF911_27851_Port Alma	Port Alma	8	4	72	15	1933	70	69	69	71	62	341
43	ML_18499_Port Alma	Port Alma	32	20	160	0	1931	422	421	425	376	382	2,026
43	ML_18587_Port Alma	Port Alma	37	21	168	0	1932	463	465	463	452	445	2,288
44	MF911_32068_Port Auckland	Port Auckland	6	3	108	0	1935	66	66	66	66	52	316
44	MF911_41786_Port Auckland	Port Auckland	12	6	108	0	1938	130	130	129	131	131	651
44	ML_17895_Port Auckland	Port Auckland	30	23	184	18	1929	473	474	475	475	475	2,372
44	ML_18144_Port Auckland	Port Auckland	30	21	168	0	1930	452	455	453	368	369	2,097
45	MF911_16080_Port Bowen	Port Bowen	4	2	30	0	1928	76	0	0	76	76	228
45	MF911_32187_Port Bowen	Port Bowen	8	4	136	34	1935	58	53	57	59	60	287
45	MF911_33308_Port Bowen	Port Bowen	6	3	54	0	1935	65	65	65	65	61	321
45	MF911_34293_Port Bowen	Port Bowen	8	4	144	38	1936	64	64	64	64	64	320
45	MF911_35486_Port Bowen	Port Bowen	4	2	72	0	1936	47	47	47	47	47	235
46	ML_17849_Port Campbell	Port Campbell	28	22	176	16	1929	461	460	457	458	459	2,295
47	MF911_26975_Port Caroline	Port Caroline	10	4	72	6	1933	82	83	83	83	83	414
47	ML_18260_Port Caroline	Port Caroline	32	21	168	0	1930	436	433	434	400	401	2,104
47	ML_18565_Port Caroline	Port Caroline	38	23	184	0	1932	479	478	482	468	467	2,374
48	MF911_31626_Port Chalmers	Port Chalmers	8	4	117	0	1935	71	59	73	73	72	348
48	MF911_37654_Port_Chalmers	Port Chalmers	6	3	108	11	1937	62	62	62	60	61	307
49	MF911_25515_Port Darwin	Port Darwin	8	4	72	9	1932	70	69	71	68	66	344
49	MF911_35675_Port Darwin	Port Darwin	6	3	108	3	1936	65	65	65	65	48	308
49	MF911_37099_Port Darwin	Port Darwin	8	4	144	0	1937	70	70	71	0	0	211
49	MF911_39424_Port Darwin	Port Darwin	8	4	144	30	1938	68	68	68	68	68	340
50	MF911_13131_Port Denison	Port Denison	4	2	30	0	1927	79	0	78	79	79	315
50	MF911_29212_Port Denison	Port Denison	8	4	72	9	1934	76	76	76	76	76	380
50	MF911_34150_Port Denison	Port Denison	8	4	144	27	1936	76	67	74	75	74	366
50	MF911_35293_Port Denison	Port Denison	8	4	144	18	1936	81	66	80	84	79	390
51	MF911_41942_Port Dundedin	Port Dunedin	14	7	126	12	1939	136	131	130	137	135	669
51	ML_18673_Port Dunedin	Port Dunedin	29	20	160	0	1933	416	415	415	365	360	1,971
52	MF911_36019_Port Fremantle	Port Fremantle	6	3	114	0	1936	114	114	114	0	0	342
52	ML_18558_Port Fremantle	Port Fremantle	35	19	152	0	1932	422	419	420	406	410	2,077
52	ML_18630_Port Fremantle	Port Fremantle	31	21	168	8	1932	470	470	468	459	441	2,308
52	ML_18680_Port Fremantle	Port Fremantle	30	21	168	0	1932	453	452	453	437	441	2,236
53	ML_18476_Port_Gisborne	Port Gisborne	33	18	144	0	1931	422	421	422	274	252	1,791
53	MF911_32744_Port Gisborne	Port Gisborne	6	3	108	0	1935	70	70	70	67	68	345
53	MF911_34915_Port Gisborne	Port Gisborne	6	3	108	6	1936	54	54	56	54	54	272
53	MF911_39080_Port Gisborne	Port Gisborne	6	3	108	9	1938	62	64	64	63	66	319
53	MF911_40057_Port Gisborne	Port Gisborne	6	3	108	3	1938	66	65	64	66	65	326
54	MF911_33928_Port Hobart	Port Hobart	6	3	54	0	1936	70	71	71	71	71	354
54	MF911_34904_Port Hobart	Port Hobart	6	3	108	0	1936	70	71	67	64	65	337
54	MF911_35996_Port Hobart	Port Hobart	6	3	108	0	1936	70	69	70	66	66	341
55	ML_18639_Port Hunter	Port Hunter	34	24	192	20	1932	498	503	504	345	365	2,215
56	MF911_40199_Port Jackson	Port Jackson	10	5	180	12	1938	94	98	98	95	96	481
57	ML_17977_Port Melbourne	Port Melbourne	32	24	192	22	1929	474	477	477	479	479	2,386
58	MF911_11208_Port Napier	Port Napier	6	2	30	0	1926	43	41	43	43	42	212
59	ML_17873_Port Nicholson	Port Nicholson	31	23	184	18	1929	479	480	479	478	480	2,396
59	ML_18399_Port Nicholson	Port Nicholson	30	21	168	12	1931	448	446	439	418	407	2,158
60	ML_18155_Port Sydney	Port Sydney	33	22	176	0	1930	443	446	443	417	416	2,165
61	MF911_41432_Port Townville	Port Townsville	6	3	108	18	1938	56	56	53	0	0	165
62	ML_17860_Port Victor	Port Victor	35	28	224	36	1929	541	540	540	539	537	2,697
63	MF911_23307_Port Wellington	Port Wellington	8	4	72	12	1931	151	1	152	152	153	609
63	MF911_27086_Port Wellington	Port Wellington	8	4	72	9	1933	150	150	150	152	152	754
63	MF911_28051_Port Wellington	Port Wellington	8	4	72	12	1933	84	84	86	86	85	425
63	MF911_37821_Port Wellington	Port Wellington	7	3	108	6	1937	51	51	51	45	45	243
64	MF911_33984_Port Wyndham	Port Wyndham	6	3	108	0	1936	55	55	55	2	2	169
64	MF911_36181_Port Wyndham	Port Wyndham	6	3	108	21	1936	47	47	47	46	46	233
64	MF911_39352_Port Wyndham	Port Wyndham	6	6	3	108	1938	44	44	44	37	32	201
64	MF911_40313_Port Wyndham	Port Wyndham	6	3	108	15	1938	44	44	43	41	41	213
65	MF911_41898_Reina del Pacifico	Reina del Pacifico	4	2	36	3	1939	5	5	66	66	66	208
66	ML_17827_Rimutaka	Rimutaka	42	34	272	26	1929	705	689	705	708	701	3,508
67	ML_17998_Runpenu	Ruapehu	36	25	200	26	1929	496	495	495	458	452	2,396
68	ML_18579_Somerset	Somerset	35	22	176	0	1932	458	459	460	435	439	2,251
68	ML_18646_Somerset	Somerset	38	21	168	6	1932	457	458	456	274	242	1,887
69	MF911_44525_Southern Harvester	Southern Harvester	4	2	72	27	1947	51	52	53	0	0	156
70	MF911_18465_Southern King	Southern King	2	1	15	0	1929	38	38	0	38	11	125
70	MF911_18548_Southern King	Southern King	2	1	15	9	1929	13	0	0	13	0	26
70	MF911_18819_Southern King	Southern King	2	1	15	9	1929	13	0	0	13	0	26
70	MF911_18978_Southern King	Southern King	2	1	15	9	1929	13	0	0	13	12	38
70	MF911_18979_Southern King	Southern King	2	1	15	9	1929	16	15	0	15	10	56
70	MF911_21761_Southern King	Southern King	6	3	54	16	1930	95	79	0	21	20	215
71	ML911_1542_Struan	Struan	24	14	294	55	1947	15	0	142	0	0	157
72	MF911_43636_Suffolk	Suffolk	8	3	108	12	1946	121	120	121	0	0	362
73	MF911_10646_Tairoa	Tairoa	4	2	30	0	1926	69	69	69	69	69	345
73	MF911_30690_Tairoa	Tairoa	8	4	72	6	1934	78	78	78	78	78	390
73	MF911_35417_Tairoa	Tairoa	8	4	144	33	1936	66	69	66	66	69	336
73	MF911_37729_Tairoa	Tairoa	8	4	144	24	1937	77	77	76	76	75	381
73	MF911_8723_Tairoa	Tairoa	6	3	45	9	1925	90	90	89	90	90	449
74	MF911_25999_Taranaki	Taranaki	6	3	54	0	1932	143	143	143	142	141	712
74	MF911_26898_Taranaki	Taranaki	6	3	54	0	1933	67	67	65	67	67	333
74	MF911_27674_Taranaki	Taranaki	6	3	54	0	1933	69	69	68	69	69	344
74	MF911_30161_Taranaki	Taranaki	6	3	54	0	1934	64	64	63	64	64	319
74	MF911_42691_Taranaki	Taranaki	12	6	216	3	1939	120	133	125	134	133	645
74	ML_18585_Taranaki	Taranaki	30	19	152	0	1932	379	383	385	331	319	1,797
75	MF911_22425_Tasmania	Tasmania	8	4	72	9	1931	78	78	78	78	78	390
75	MF911_26731_Tasmania	Tasmania	9	4	72	0	1932	107	107	107	107	107	535
75	MF911_27816_Tasmania	Tasmania	10	5	90	15	1933	122	122	122	123	123	612
76	ML911_2149_Thule	Thule	37	28	333	0	1950	430	434	436	0	0	1,300
77	MF911_37834_Tongariro	Tongariro	8	4	144	36	1937	68	68	68	69	69	342
77	MF911_42032_Tongario	Tongariro	12	6	108	3	1938	134	134	134	134	134	670
77	ML_18641_Tongariro	Tongariro	28	19	152	1	1932	428	429	429	332	353	1,971
78	MF911_44607_Trepassey	Trespassey	8	4	72	6	1946	76	75	9	0	0	160
79	MF911_35419_Tuscan_Star	Tuscan Star	6	3	108	18	1936	56	57	55	57	59	284
80	MF911_12609_Verbania	Verbania	4	2	30	0	1927	47	0	0	47	47	141
81	MF911_10751_Waimana	Waimana	6	3	45	0	1926	80	0	0	80	80	240
82	MF911_32663_Waipawa	Waipawa	4	2	72	0	1935	42	42	42	41	41	208
82	MF911_34853_Waipawa	Waipawa	10	5	180	12	1938	105	102	103	0	0	310
82	ML911_988_Waipawa	Waipawa	39	12	252	48	1948	232	232	232	0	0	696
83	MF911_35087_Waiwera	Waiwera	6	3	108	9	1936	64	64	64	63	63	318
83	MF911_36265_Waiwera	Waiwera	6	3	108	12	1936	61	61	61	61	61	305
83	MF911_39360_Waiwera	Waiwera	6	3	108	17	1938	58	59	57	56	59	289
84	MF911_32763_Westmoreland	Westmoreland	6	3	108	0	1935	135	135	135	135	135	675
85	MF911_24120_Zealandic	Zealandic	6	3	54	0	1931	142	142	142	142	142	710
85	MF911_25112_Zealandic	Zealandic	6	3	54	0	1932	139	139	139	139	139	695
85	MF911_25805_Zealandic	Zealandic	6	3	54	0	1932	129	129	129	129	129	645
85	MF911_30311_Zealandic	Zealandic	6	4	72	3	1934	81	81	81	85	85	413


A total of 150,690 observations from 85 unique ships that embarked on 210 voyages were successfully captured by replicate keying from citizen scientists. Log images is the total number of digital files that correspond to the unique logbook file; images clipped is the total number of images that had data within the ACRE Antarctica domain (see Figure 1) and were selected for processing; the number of clips is the total number of segments extracted from all pages selected for processing; and blanks not loaded is the number of clips that had no data. The total number of recovered observations in each category (barometric pressure, air temperature, and sea temperature) is shown for each unique voyage. The total number of unsuccessful transcriptions is not shown.

Results

A total of 210 logs from voyages of 85 unique ships were transcribed in phase I of SWD (see Table 1 for details including years of coverage). Over 2,500 log book images were collectively obtained for those voyages, and 1,521 of those images were then selected for transcription. From the log book images that were used, 18,490 clips containing multiple meteorological observations were loaded to Zooniverse (taking into account that 16.6% of a grand total of 22,180 clips were blank and did not need to be transcribed). A grand total of 150,690 meteorological observations were recovered through replicate keying (n_{uncorrected barometer} = 32,747; n_{attached thermometer} = 31,399; n_{corrected barometer} = 32,196; n_{air temperature} = 27,330; n_{sea temperature} = 27,018). The total time for SWD phase I data capture was 9 months (running from October 2018 to July 2019), with a majority of transcribed observations obtained within the first 2 months from the project launch.

Charting a new course for streamlined data transcription

Determining what transcription retirement limit to use for individual observations was still an open question when we launched SWD and after completing phase I. Replicated keying of logbook segments is designed to provide a majority consensus (and a measure of confidence through replication) that defines what numeric value exists in each table cell. The time needed to key an individual cell is proportional to the number of replicates, so the effort spent repeating keying to increase confidence should have a functional limit. Choosing too few replicate keying attempts places the onus back on the researcher more frequently to re-classify questionable values that are not resolved via consensus. In turn, that can also generate re-work in terms of reposting logbook clips online to obtain additional transcriptions. We chose to have entries transcribed by 10 different volunteers during the first phase of SWD. This limit was raised from an initial five entries after we discovered some problems with the general data format returned by Zooniverse (see issues outlined below).

In SWD phase II, a goal was to determine optimal transcription and image clip retirement limits. Tabulated historical weather observations for eight ECVs (attached thermometer, uncorrected barometer, corrected barometer, maximum temperatures, minimum temperatures, wind direction, wind force, wind run) for the austral winter of 1939 (June, July, August) were digitally scanned from original meteorological Form 301 paper copies held in NIWA’s archive. The scans came from 63 stations spread across New Zealand and cover the winter season when the 1939 Week it Snowed Everywhere (WISE) event occurred.

A brute-force approach was employed by setting SWD transcription retirement limits at 20 for WISE, which was twice the sample pool of SWD phase I transcription. Using these data, we were able to use hierarchical degradation that progressively lowered replicate transcription sample depth of keyed values in order to evaluate optimal data keying retirement limits. Retirement statistics (completed successful transcription) were assessed for individual entries (each individual observation recorded in a log book clip), segments (the log book clips), and entire logbook images (with multiple segments that contain multiple entries). We considered results from our entire pool of volunteers (n = 20), the control dataset, to evaluate the effects of transcription sample depth degradation. The 20-volunteer sample depth also allowed us to gather a large enough dataset to evaluate type 1 (consensus acceptance of an incorrect value; false-positive/acceptance) and type 2 (non-consensus and rejection of a value that was legitimate; false-negative/rejection) errors. We also used the WISE dataset to evaluate minimum number of repeat classifications needed to obtain a 100% complete dataset via majority consensus with minimal transcription errors. Some examples of log books that had the most common errors are provided in the supplemental information.

The percentage of entries, segments, and images that were “retired” (i.e., consensus reached, with citizen science transcription considered a success) decreased for all replicate keying tests conducted on each of the hierarchical transcription classes (5, 10, 15, and 20 volunteers) when the pass rate threshold was raised progressively from 60% to 90% (Figure 3). Results for all the hierarchical classes appeared most similar for entries, segments, and images for the 75% pass rate threshold and were the most different for the 90% pass rate test. The difference between success of the five-volunteer class within the 60% pass rate test and the 90% pass rate test resulted from the fact that all five answers need to align for the latter to be considered a success, while the former only requires three out of five to be right. The results for 10 versus 20 volunteers in the 90% pass rate test also appeared similar. Few appreciable differences were also observed in the 60% pass rate test for the 10, 15, and 20 volunteer classes.
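The relationship between replicate depth and pass rate threshold described above can be sketched in a few lines. This is a minimal illustration rather than the code used for SWD: the function names are hypothetical, and it assumes that degrading the sample depth means keeping the first keyed entries for a cell.

```python
from collections import Counter

def votes_needed(n_replicates: int, pass_rate_pct: int) -> int:
    """Smallest number of matching entries that meets the pass rate.

    Integer arithmetic is used for the ceiling to avoid floating-point surprises.
    """
    return (pass_rate_pct * n_replicates + 99) // 100

def consensus(values, pass_rate_pct):
    """Return the agreed value, or None when no value reaches the pass rate."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count >= votes_needed(len(values), pass_rate_pct) else None

def degrade(values, depth):
    """Assumed degradation step: keep only the first `depth` keyed entries."""
    return values[:depth]

# Hypothetical keyed entries for one cell: four volunteers agree, one differs.
keyed = ["29.45", "29.45", "29.43", "29.45", "29.45"]
print(votes_needed(5, 60), votes_needed(5, 90))   # 3 5
print(consensus(keyed, 60), consensus(keyed, 90)) # 29.45 None
```

With five replicates, a single dissenting entry is enough to fail the 90% test while the 60% test still passes, which is the behavior noted for the five-volunteer class.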

WISE consensus results

Frequency of successful consensus classification using different thresholds of agreement for individual entries, meteorological form segments (clips), and entire meteorological form images pooled from unique land-based stations that were transcribed in the WISE phase of work on SWD. See supplemental information for more details about the number of data points, segments, and images that comprise these statistics.

We evaluated the probability of type 1 and type 2 errors by comparing expert-guided transcriptions of original logbook entries with the consensus values obtained through WISE. This experiment used tests based on several draws of 20 entries at random for each of the eight WISE data entry tasks, and just over 46,200 values constituted the pool that could be analyzed to assess errors associated with data entry. Across the entire dataset, 56% of entries had a low risk of error, 7% had a medium risk, 1% had a high risk, and 36% were blank. Blank and failed consensus entries were automatically excluded from these random draws. Each draw was evaluated with respect to consensus keying based on either a threshold of agreement (termed O75, 75% consensus, 15 of 20 values; O60, 60% consensus, 12 of 20 values) or by selecting the first five or 10 keyed responses (O5, O10) out of the 20 selected values. An additional test, termed “output resampled” (ORS), added a step to the O60 consensus processing with a random draw for entries that failed to reach consensus as a way to reach a definitive result. Each of the failed entries from this test had a statistical mode calculated from 500 iterations that individually pulled a five-sample random draw from the pool of 20 entered values (Table 2). We further classed type 1 and type 2 errors in each of these tests across three categories of keying success with respect to whether there was an increased likelihood of either error occurring (with the a priori assumption this would be strongly linked to the quality of the uploaded image on SWD). These categorical tests spanned low-risk images (consensus pass rate = 100%; clear penmanship, no edits in the original cell), medium-risk images (consensus pass rate <80%), and high-risk images (consensus not reached; often associated with edited original tabulated entries or messy penmanship).
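The ORS step can be illustrated with a short sketch. The text does not specify exactly how the 500 iterations are combined, so the version below makes an explicit assumption: each iteration draws five values without replacement from the 20 keyed entries and records the mode of that draw, and the final value is the most frequent of those 500 per-draw modes. Function names and the example values are hypothetical.

```python
import random
from collections import Counter

def o60_consensus(values):
    """Return the value keyed in at least 60% of entries, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 100 >= 60 * len(values) else None

def resampled_mode(values, iterations=500, draw_size=5, seed=0):
    """Assumed ORS step: most frequent per-draw mode over repeated random draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    tally = Counter()
    for _ in range(iterations):
        draw = rng.sample(values, draw_size)
        tally[Counter(draw).most_common(1)[0][0]] += 1
    return tally.most_common(1)[0][0]

def ors(values, seed=0):
    """O60 consensus first; fall back to resampling when consensus fails."""
    agreed = o60_consensus(values)
    return agreed if agreed is not None else resampled_mode(values, seed=seed)

# Hypothetical cell: 11 of 20 entries agree (below the 12 needed for O60),
# and the other nine entries are all different from one another.
pool = ["29.45"] * 11 + [f"29.{i}" for i in range(50, 59)]
print(o60_consensus(pool))  # None
print(ors(pool))
```

Because the fallback always returns a value, every non-blank entry receives a definitive result, which is why this kind of scheme trades type 2 errors for a small number of type 1 errors.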

Table 2.

Type 1 and type 2 errors associated with the WISE hierarchical degradation and resampling tests (O75, O60, O5, O10, and ORS)

	Low risk			Medium risk			High risk			Blank cells			Whole set
	T1	T2	Correct	T1	T2	Correct	T1	T2	Correct	T1	T2	Correct	T1	T2	Correct
O75	0	0	100	0	13	87	0	89	11	0	0	100	0.00	1.83	98.17
O60	0	0	100	0	5	95	5	62	34	0	0	100	0.05	0.96	98.99
ORS	0	0	100	0	0	100	16	0	84	0	0	100	0.16	0.00	99.84
O5	0	0	100	3	0	97	26	30	44	0	0	100	0.51	0.30	99.19
O10	0	0	100	0	3	97	12	49	39	0	0	100	0.12	0.69	99.19


The whole-set percentages were calculated by weighting the result for each risk category (low, medium, high) by the proportion of the entire dataset that the category represents, which yields the whole-set percentage error and percentage correct. These results represent the aggregate for all entry types (pressure, temperature, and wind) across eight tasks. More details about this experiment can be found in the supplemental information.
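As a worked example of this weighting, the sketch below recomputes the whole-set values for the ORS row of Table 2 from the dataset composition reported in the text (56% low risk, 7% medium risk, 1% high risk, 36% blank). Including the blank cells in the weighting is an assumption made here because it reproduces the published ORS values. Other rows can differ from the table in the second decimal place, presumably because the published proportions are rounded.

```python
# Share of the whole dataset in each category (percent), as reported in the text.
SHARE = {"low": 56, "medium": 7, "high": 1, "blank": 36}

# ORS row of Table 2: (type 1, type 2, correct) percentages per category.
ORS_ROW = {
    "low": (0, 0, 100),
    "medium": (0, 0, 100),
    "high": (16, 0, 84),
    "blank": (0, 0, 100),
}

def whole_set(row, share=SHARE):
    """Weight each category's percentages by its share of the dataset."""
    total = sum(share.values())
    return tuple(sum(share[c] * row[c][i] for c in share) / total for i in range(3))

t1, t2, correct = whole_set(ORS_ROW)
print(f"{t1:.2f} {t2:.2f} {correct:.2f}")  # 0.16 0.00 99.84
```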

Blank cells were identified correctly in all tests. For the low-risk image category, type 1 and type 2 errors were absent, whereas medium-risk images showed a slight increase in type 1 errors and a larger increase in type 2 errors (Table 2). For high-risk images, all of the tests except O75 revealed type 1 errors, and there were no type 2 errors associated with the ORS analysis. The most common incorrect transcription issues had to do with (1) confusion between 4s and 7s and 4s and 6s; (2) omission of decimal points or other delimiters; and (3) extraneous notes, arrows, or values in cells where original data had been manually corrected (i.e., crossed out and re-written).

Training machines to guide the data rescue ship

The WISE dataset was also used to independently test Microsoft Read API (Figure 4) using digital photograph surrogates of the 1939 Form 301s that contained the original analogue data. One advantage with Microsoft Read API is the ability to transcribe an entire sheet using computer vision, which can save research preparation time related to clipping and uploading segments of a page onto Zooniverse for citizen science transcription. Six high-resolution scans of full original Form 301 data sheets from two stations were used for a preliminary test of Microsoft Read API, which draws on an OCR engine based on deep learning algorithms.⁴⁴,⁴⁵,⁴⁶ A Microsoft Excel template indicating the position of data on the page (row cell and column) was also provided to the Microsoft team for supplying values back to NIWA for validation.

Azure OCR pipeline

Generalized architecture of the automated Azure cloud computing pipeline hosted by Microsoft that was used for the WISE OCR and transcription experiment. Handwritten meteorological tables in Portable Document Format (PDF) were transferred to Microsoft and loaded to Azure Data Lake Storage (ADLSv2), where Function Apps code forwarded them for text extraction. The Read API Azure Cognitive Service was used to extract handwritten digits from each PDF, in conjunction with custom machine learning models deployed using the Azure Kubernetes service via the Azure Container Registry. The custom model removed noise from the digital surrogates and located cells with digits in them. The extracted components from each page were further processed, and the final outcomes from OCR analysis were stored in the Azure SQL database (Result Store), where they were accessed, analyzed, and visualized using Power BI. In addition, capabilities for inter-service communication were securely held in Key Vault.

The results from OCR using Microsoft Read API indicate variable efficacy between different observing sites and for different observation types (Table 3). Across five quantitative observation categories (attached thermometer, barometer uncorrected, barometer corrected, maximum temperature, minimum temperature), the Microsoft Read API validation grand strike rate was 69% ± 15% (n = 920). Results for transcribing ECVs were also site dependent (related to penmanship of the observer who filled in the data table).

Table 3.

Results from Microsoft Read API for the WISE

	Attached thermometer	Barometer	Barometer corrected	Minimum temperature	Maximum temperature
Grand strike rate, uncorrected (%)	65.1	77.1	69.0	64.2	70.7
Grand strike rate, potential correction (%)	81.5	80.1	76.0	78.7	80.8


Strike rate (percentage correct) across five meteorological variables transcribed by Microsoft Read API for Albert Park (A64871) and Christchurch (H32561) spanning June to August 1939. The potential corrected grand strike rate is corrected for any miss related to a decimal or a dash that was not captured in the automated transcription.

Extraneous formatting and errors related to decimals and dashes were not considered when validating the Microsoft Read API because they are minor (see Table 3; difference between uncorrected and potential corrected strike rate for machine learning transcription). The most common issues identified where the Microsoft Read API auto-transcription did not validate related to incorrect transcription of the first digit of a numeric string, and designation of a letter where a number actually occurred. The most common digits that were not transcribed correctly were 4s and 7s (often swapped). Both of those shortcomings are similar to issues that we experienced on SWD for citizen scientists keying in data for the WISE experiment. Additional simplified guidance for unsupervised machine learning algorithms could be applied in those cases (e.g., pressure values recorded in inches of mercury must begin with a 2 or 3) to improve strike rate results for the Microsoft Read API (Table 3).
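The kind of simple guidance mentioned above can be expressed as a plausibility check applied to each transcribed value. The sketch below is illustrative only and was not part of the Microsoft Read API test; the function name is hypothetical, and it assumes pressure values arrive as text containing digits and at most one decimal point.

```python
def starts_like_pressure(text: str) -> bool:
    """Check that a value in inches of mercury is numeric and begins with 2 or 3."""
    cleaned = text.strip()
    is_numeric = cleaned.replace(".", "", 1).isdigit()  # digits plus one optional point
    return is_numeric and cleaned[0] in ("2", "3")

# Hypothetical OCR outputs: a valid reading, a misread first digit, and a letter for a digit.
for value in ["29.87", "79.87", "Z9.87"]:
    print(value, starts_like_pressure(value))
```

Values that fail such a check could be flagged for manual review or routed to citizen scientists rather than being accepted automatically.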

Discussion

Consolidating lessons learned from the SWD data rescue journey

Improving our understanding of past weather events and the roles that modes of variability have played in guiding extreme conditions requires better reanalyses, and in particular the coverage for the southern high latitudes needs to be dramatically augmented (Figure 1). There is massive potential to improve global reanalyses using the troves of historical meteorological data that are stored in a wide range of archives.¹⁷,¹⁸,³⁵,⁴⁷,⁴⁸ These observations can be digitized by volunteers with assistance from scientists who can prioritize and arrange data rescue activities. Major advantages of using Web-based citizen science for data rescue efforts are that the human resource can be drawn from all regions on Earth, volunteer time is free, and progress is made more or less continuously. In addition, citizen science data rescue provides an opportunity to engage and educate the general public about the importance of long-term meteorological observations and climate change.⁴⁹ In SWD, we learned that when data rescue is conducted under the auspices of a global effort like ACRE,¹⁹,⁵⁰ and with support from agencies like the World Meteorological Organization⁵¹ and Copernicus Climate Change Service,⁵²,⁵³ it engenders increased regional responsibility for data stewardship and archives while raising the profile of the science. This typically has a positive feedback for conducting additional data rescue activities, particularly in remote and under-resourced regions.⁵⁴ In addition, improved transparency of nation- and archive-specific data holdings engenders wider data sharing that can exceed what ad hoc efforts undertaken by isolated researchers have achieved in the past. It is also clear that automated OCR approaches, like those we tested using Microsoft Read API, could be greatly improved by using the vast data captured through citizen science efforts like SWD.

The SWD core team that undertook the tasks required to capture handwritten observations using Zooniverse consisted of nine people. Our team members collectively found and captured digital twins of data sheets in multiple archives, prepared them for transcription on the Web platform, retrieved/parsed replicate keyed observations, and undertook statistical analyses of the results. These data rescue tasks do not require equivalent time investments or skill levels. We also obtained external support from national and international collaborators to achieve many of our aims (e.g., finding ship log books in archives, testing machine learning OCR transcriptions). We divided basic data rescue tasks between senior scientists, casual staff, and students in order to maximize the use of limited funding. Overall, the foundation for a data rescue project like ours could be run on 1.0 full-time equivalent (FTE) employment. However, it is likely that multiple years would be required if one person were to do all of the associated tasks, including the field work. This type of effort also requires a skill set broad enough to include development, adaptation, or augmentation of code that automates tasks through scientific programming. In addition, support from professional media experts would be required to attain the level of external project promotion we achieved.

The benefits of crowd-sourcing labor to key historical observations are partially offset by some unique challenges. A significant investment of time is required to train personnel in how to manually clip the logbook images or to set up different workflows for logbooks that are printed in different formats. This echoes findings from citizen science efforts to key United Kingdom Met Office daily weather reports, where it was noted that the effort required to clip segments of images and provide them using consistent formatting for end-user context places an additional time burden on the research team.⁵⁵ Automated clipping routines we tested reduced the time investment for this specific data rescue step, but success is highly dependent on the quality of photography and the types of scientific data tables being rescued. We are aware that ACRE Argentina is at present using clipping approaches that focus on single cells and providing them in Zooniverse without formatting, which is a potential time-saving measure (https://www.zooniverse.org/projects/acre-ar/meteororum-ad-extremum-terrae). For SWD phase I, all the logbooks that were not in a consistent format were omitted.

An issue related to consistency of data transcription from international audiences can also arise. We noted that dashes and decimals were commonly substituted with commas or used as a delimiter, which made post-transcription retrieval of data from native Zooniverse outputs difficult. There were also significant discrepancies related to the citizen science transcription of ship coordinates that led us to eventually input that category manually using an expert team. Significant time was also required to respond to questions from volunteers (particularly in the early stages following initial project launch).
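One way to reduce the effect of inconsistent delimiters is to normalize keyed values before consensus is calculated, so that entries differing only in their separator are counted as agreeing. The sketch below is a hypothetical illustration, not the SWD post-processing code. It assumes a value is keyed as two groups of digits separated by a single comma or space, and it leaves values with no separator unchanged because the intended position of the decimal point cannot be inferred without knowing the expected format.

```python
import re

# Two digit groups separated by exactly one comma or whitespace character.
_SEPARATED = re.compile(r"^(\d+)[,\s](\d+)$")

def normalize_decimal(raw: str) -> str:
    """Rewrite a comma or space between two digit groups as a decimal point."""
    return _SEPARATED.sub(r"\1.\2", raw.strip())

# Hypothetical keyed variants of the same reading.
for keyed in ["29.87", "29,87", "29 87", "2987"]:
    print(repr(keyed), "->", normalize_decimal(keyed))
```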

Scanning the horizon for fair winds and smooth data rescue sailing

Based on the outcome of the WISE experiment tests, we consider a compromise can be reached between time spent keying by citizen science volunteers and achieving completeness and accuracy of a transcribed dataset when eight replicate entries are employed. To achieve that, we recommend initially setting a minimum 60% pass rate threshold (five out of eight in agreement) and then using a resampling scheme for any values that did not reach consensus. Using that scheme, we would expect that type 2 errors would be absent from the transcribed data, and type 1 errors would be, on average, less than two in 1,000. In addition, the transcribed dataset will be 100% complete and close to 99.5% accurate.

It is also worth noting that these results are dependent upon the nature of the data being transcribed. The specific retirement limit and broader strategy employed for scientific data transcription may need to be adjusted based on the type of observations being rescued. Integer values with no decimal points are the most straightforward to key and require little repetition to ensure a correct consensus value. Conversely, alphanumeric values and values with many significant figures introduce more complexity or opportunity for variation among responses from the citizen scientists (e.g., representing a decimal with a period, a comma, a space, or ignoring the decimal altogether). For example, within our dataset, we observed significantly more errors in the temperature fields, which generally include decimal points, than the pressure (integer-only values) and wind run (alphabetic-only values) fields.

As such, ironing out idiosyncrasies so that data rescue efforts through Zooniverse can be universal and successful involves the following minimum requirements:

• Prepare scans of data tables in a way that enables efficient keying and that is easy to understand.
• Test and re-test workflows to ensure they are simple to follow (heeding participant feedback).
• Design tasks so that citizen scientists have the best chance of entering a correct result.
• Evaluate initial inputs and data retrievals with a small dataset before launching a full data rescue campaign.
• Optimize replicate keying levels to balance confidence of results with time invested from citizen scientists.
• Prepare enough material in advance to ensure momentum can be continually maintained.

Promotion of our project and engagement with media and the public was strongly connected to the rate of retirement for logbook segments and the overall success of completing the recovery of meteorological data via SWD. Our approach kept the following in mind:

• A strong communications strategy with a “hook” to get people involved.
• Willingness to engage with the media and the project participants.
• Promotion of the project on multiple social media platforms.
• Repeated contact with the citizen science community using emails and updates as tasks progressed.

Data rescue on Zooniverse has a proven successful track record for several projects that have focused on the recovery of historical weather observations. Our approach for SWD is something that can be easily replicated for other disciplines where tabulated scientific data need to be transcribed. We recently provided training to assist the launch of the Climate History Australia project (https://climatehistory.com.au) using the lessons we learned via SWD. It is important to note that inter-project knowledge sharing for meteorological data rescue has largely been by word of mouth and interpersonal relationships (having been helped by colleagues in the Weather Rescue and Old Weather projects that came prior to our project). There are relatively few references in the literature that describe exactly how data rescue that engages the general public is undertaken. Hence, we hope that this study provides a basic roadmap for novice practitioners that highlights insights about success and challenges for data rescue, and that the scientific community can build upon these lessons to accelerate the rapid acquisition of historical scientific data for wider societal benefits.

Experimental procedures

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Andrew Lorrey (a.lorrey@niwa.co.nz).

Materials availability

Digital twins of the original log books and meteorological forms used in this study are held by NIWA. They can be made available on reasonable request.

Establishing a citizen science identity for our data rescue crew

When our work began, leading exemplars for historical meteorological data rescue harnessing citizen science were Old Weather⁵⁶,⁵⁷,⁵⁸ and Weather Rescue.⁵⁹,⁶⁰ The latter project was built on the free-to-use Zooniverse Web platform (www.zooniverse.org) and demonstrated a capability to recover millions of observations keyed in replicate. Based on the global success of Weather Rescue, both in terms of public engagement and the great speed and volume of historical weather data transcribed, our research team decided to employ a similar design. We registered our project on Zooniverse, and simultaneously created a project identity. Our project description included a name and icon connected to the southern hemisphere region where we wanted to generate “discovery” science about weather and climate with historical meteorological observations and reanalyses. The heavy focus on rescuing maritime data in our project led us to use a ship as a project icon, including a sail with an Antarctica logo that was embellished with a thermometer and sun in the background. The name SWD arose out of testing word combinations we thought were reflective of the project work and regional focus. It is also a subtle play on words with respect to the well-known RRS Discovery Antarctic expeditions (from which we have obtained data). A website domain name was purchased in order to make a shortcut (via redirection) to SWD (www.southernweatherdiscovery.org) to make it easier for the general public to find us on Zooniverse (instead of directing them to find the project at the Zooniverse URL https://www.zooniverse.org/projects/drewdeepsouth/southern-weather-discovery).

Guiding citizen scientists through an ocean of data

The Zooniverse Web platform is designed to accommodate novice citizen science practitioners who have no prior knowledge of website design or Web development. The build-a-project instructions (https://help.zooniverse.org/getting-started/) guide the completion of a project setup leading to two basic website components: a front end, which the general public can see and work with, and a back end that contains the design and content elements required to organize workflows and create data entry fields. There are several Web page hierarchical elements that can be viewed on the SWD front end, which include primary navigation tabs labeled About, Classify, Talk, and Collect. We discuss the first three of these tabs below.

Under the About tab, there are subsidiary tabs for Research, The Team (biographic information), Results, and Frequently Asked Questions (FAQ). We felt it was important to complete details for the Research and Team tabs under the About heading in order to establish our project identity upon launching SWD. We used the Team tab to outline biographic information; this element of the website humanizes the project by providing a face behind the science, as well as key points of contact. Additional considerations for providing personal details need to be weighed by each research team; we included the ability for citizen scientists to contact us in order to build a better connection between our role as researchers and the members of the public we were trying to engage in data transcription. The Research tab provided an opportunity to outline more in-depth reasons for doing citizen science data transcription. Many of the citizen scientists using the Zooniverse platform are genuinely excited about the research, and providing these additional details helps them to engage with the project.

The Talk tab included conversations between our research team and citizen scientists. It was used to engage with participants who initiated questions or discussions, with most of the queries related to general data entry issues that were not pre-emptively thought of for the tutorial (mostly uncommon problems). In rare cases, submitted questions were related to reiteration of instructions when an occasional user did not understand our tutorial. More detailed questions about experimental design, including reasons for retirement limits for each image, and how to deal with missing data were popular topics. We also used the Talk tab to occasionally provide new instructions for data keying, and in one case we specifically asked citizen scientists to change their transcription on the fly (to not use commas as a numeric separator due to a data formatting issue with Zooniverse). The Classify tab will be discussed below in more detail when we outline how workflows for data transcription were made.

Advanced preparations for a long data rescue voyage

Historical ship logbook observations and land-based meteorological registers were handwritten on standardized printed table forms (Figure 5), making them ideal for Zooniverse platform transcription. We undertook two main transcription tranches, each with a slightly different approach for uploading digital copies of meteorological registers for transcription. Ahead of volunteers keying data online, the architecture of a basic workplan needs to be considered to determine how the division of labor should proceed. This helps to maximize efficiency, minimize transcription errors, and reduce preparation time. Consideration about the data types that are keyed and preparation related to both SWD transcription tranches are provided below.

A log book page used in SWD

This example shows a standard weather observation register that was transcribed by citizen scientists in SWD. Clipping masks were placed over the digital version of the register, with alphanumeric labels placed on ship position (X1–X6), barometric pressure (A1–A6), and temperature (B1–B6). The original file name contains a unique sample identifier, the name of the ship (in this case the MS Port Gisborne), and an image number, to which the clipping mask alphanumeric code was added before uploading to Zooniverse (e.g., MF911_39080_Port Gisborne_IMG_6247_B5.jpg for the clip of the register corresponding to box B5). This scheme facilitated ease of data retrieval and reparsing the data into a continuous time series for replicate quality assurance and further analysis. Image supplied by C. Wilkinson, RECLAIM.

Most of the logbooks used in the first phase of SWD (Table 1) were sourced from UK merchant and immigration ships that visited New Zealand and Australia via the South Pacific, and were acquired from international archives through an extension of the Recovery of Logbooks and International Marine data (RECLAIM) project.³⁹,⁴⁰ Many of those logs used a standard printed register that arranged multiple observations in columns containing a unique variable (e.g., 9 a.m. temperature, pressure) and discrete entries in rows corresponding to a common date, time, and location (Figure 5). In the second phase of SWD (see section “shore leave for the WISE”), New Zealand land-based observations from meteorological registers containing a broader range of observations were drawn on, with nine discrete variables to key. For both phases of SWD, individual meteorological register pages were subdivided into small parts to provide a segment for an individual volunteer to transcribe rather than providing the whole page. This choice was based on discussions with colleagues and feedback from volunteers, and helped to (1) ensure data keying contributions could be completed in short bursts rather than taking up lengthy intervals of time, (2) minimize the risk of volunteers abandoning a data entry form before submitting a full transcription, (3) reduce mistakes that are associated with transcribing the wrong column or row, and (4) decrease the probability of widespread error propagated across an entire log, if, for example, a specific volunteer had a difficult time deciphering the handwriting on a specific page. Our project’s contractual requirements for the DSC also meant we prioritized certain observations on a logbook page, and therefore only a subset of meteorological logbook segments for each page and logbook were targeted for transcription.

A task fit for a clipper

To accommodate the structure of Zooniverse workflows that lead citizen science volunteers through keying (covered in detail below), we created subsets of each logbook page that were cropped and uploaded to SWD in a standard format. Adobe Illustrator software was used to crop segments of each logbook page using the artboard function. Logbook segments usually covered 2 days of a voyage, with four rows for observations per day (Figure 6). Clipping each segment out of the entire page was initially a semi-manual process, because pages were inconsistently positioned each time a digital surrogate was captured in the archive. This meant artboards used for clipping had to be iteratively adjusted to ensure the observations contained in the 2-day logbook segments were not truncated. Eventually, our team corrected the page alignment in pre-processing so the entire process of clipping could be automated. Once artboards were adjusted and aligned to the standardized logbook table dimensions, we adapted existing JavaScript to automate image labeling and cropping to produce logbook segments. A labeling convention was devised to identify where each segment clip was located on the original logbook page, with column A assigned to barometric pressure, column B assigned to temperature, and column X assigned to ship position (Figure 5). The image name of each clip contained information about the archive folder, the ship name, the original image name, and the position of the clip on the logbook page, attached as a file name suffix (see Figure 5).
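The naming convention lends itself to simple machine parsing. The following is a minimal Python sketch, not the project's own code: the regular expression, the `ClipInfo` type, and the `parse_clip_name` function are hypothetical, and the example file name is patterned on the one given in the Figure 5 caption. It assumes the original image name always has the form `IMG_<number>` and that ship names contain no underscores.

```python
import re
from typing import NamedTuple

# Column letters follow the convention described in the text:
# A = barometric pressure, B = temperature, X = ship position.
COLUMN_VARIABLES = {
    "A": "barometric pressure",
    "B": "temperature",
    "X": "ship position",
}

# Assumed layout: <sample id>_<ship name>_<original image name>_<clip code>.jpg
CLIP_NAME = re.compile(
    r"^(?P<sample>.+?)_(?P<ship>[^_]+)_(?P<image>IMG_\d+)"
    r"_(?P<column>[ABX])(?P<position>\d+)\.jpg$"
)


class ClipInfo(NamedTuple):
    sample: str     # archive/sample identifier
    ship: str       # ship name
    image: str      # original image name
    variable: str   # variable implied by the column letter
    position: int   # position of the clip on the page


def parse_clip_name(file_name: str) -> ClipInfo:
    """Split a clip file name into its labeled components."""
    match = CLIP_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized clip name: {file_name!r}")
    return ClipInfo(
        sample=match["sample"],
        ship=match["ship"],
        image=match["image"],
        variable=COLUMN_VARIABLES[match["column"]],
        position=int(match["position"]),
    )


info = parse_clip_name("MF911_39080_Port Gisborne_IMG_6247_B5.jpg")
print(info.ship, info.image, info.variable, info.position)
# Port Gisborne IMG_6247 temperature 5
```

Raising an error on unrecognized names, rather than guessing, keeps mislabeled clips from silently entering a time series.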

Example of ship log segment in SWD

(Left) Formatted clipped segment taken from the MS Port Alma in 1932 that shows 2 days of handwritten regimented observations in tabulated format for uncorrected atmospheric pressure, attached thermometer, and corrected pressure (reduced to sea level). (Center) The task description for keying these observations (step 1) serves as a check that the correct image clip was uploaded. (Right) The example in the data entry field instructions indicates to the citizen scientist which column to key and how to separate the values.

Prior to uploading each clipped image to Zooniverse, we ran a Python script in a Jupyter notebook (a Web-based interactive computing platform) to add the name of the ship, the year of the voyage, the hours of observation, and column headings to the image (see Figure 6). Using a Jupyter notebook for this step enabled research team members to generate logbook segment clips regardless of their scientific computing experience. In SWD phase II, we also began automating logbook segment clipping in MATLAB to help streamline this stage of the data rescue process. The clip labeling system also makes it easier to reassemble the data after transcription. Links to code for these steps are provided in the supplemental information.
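To illustrate how clip labels can support reassembly, the sketch below orders transcribed clips into one continuous series per column. It is a hypothetical example rather than the code linked in the supplemental information: the `reassemble` function, the file names, and the keyed values are all invented for illustration. It assumes that image numbers increase through a voyage and that clip positions increase down the page.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed file name suffix: ..._IMG_<image number>_<column letter><position>.jpg
SUFFIX = re.compile(r"_IMG_(?P<image>\d+)_(?P<column>[ABX])(?P<position>\d+)\.jpg$")


def reassemble(transcriptions: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Concatenate keyed values per column, ordered by image then clip position."""
    clips_by_column = defaultdict(list)
    for name, values in transcriptions.items():
        match = SUFFIX.search(name)
        if match is None:
            raise ValueError(f"unrecognized clip name: {name!r}")
        key = (int(match["image"]), int(match["position"]))
        clips_by_column[match["column"]].append((key, values))

    series = {}
    for column, clips in clips_by_column.items():
        clips.sort(key=lambda clip: clip[0])  # page order, then position on page
        series[column] = [value for _, values in clips for value in values]
    return series


# Hypothetical keyed pressure values, deliberately out of order.
keyed = {
    "MF911_39080_Port Gisborne_IMG_6248_A1.jpg": ["29.80", "29.78"],
    "MF911_39080_Port Gisborne_IMG_6247_A2.jpg": ["30.01", "29.95"],
    "MF911_39080_Port Gisborne_IMG_6247_A1.jpg": ["30.10", "30.05"],
}
print(reassemble(keyed)["A"])
# ['30.10', '30.05', '30.01', '29.95', '29.80', '29.78']
```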

Changing tack with specialized workflows

Three specialized workflows were created on Zooniverse for volunteers to take part in transcribing data for the SWD first tranche: ship position, temperature, and barometric pressure. The workflows were designed to be as simple as possible and used the logbook clips discussed above rather than displaying a whole logbook page. In the first tranche, we used an open entry field and asked volunteers to key a small column of data, with values separated by a range of delimiters (e.g., space, comma). In the SWD second phase, we upgraded the data entry forms to provide an individual entry box for each observation and adjusted the subdivisions of the logbook pages to ensure only one column of data was keyed in a step. This was assisted by Zooniverse via the Combo Task feature, which was experimental during early 2020, having been trialed through the Weather Rescue project. Although further customization of a Zooniverse-hosted citizen science website is possible, our team relied only on minimal special features such as this one, which were facilitated by the Zooniverse staff.
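An open entry field with mixed delimiters needs to be split before replicate entries can be compared. The sketch below shows one way this could be done in Python; it is an assumption for illustration, not the project's parsing code. The function names and the set of accepted delimiters (commas, semicolons, and whitespace) are hypothetical.

```python
import re
from typing import List, Optional


def split_entry(raw: str) -> List[str]:
    """Split a free-text entry on commas, semicolons, or whitespace."""
    tokens = re.split(r"[,;\s]+", raw.strip())
    return [token for token in tokens if token]


def to_numbers(tokens: List[str]) -> List[Optional[float]]:
    """Convert tokens to floats; None marks a token that is not a valid number."""
    numbers: List[Optional[float]] = []
    for token in tokens:
        try:
            numbers.append(float(token))
        except ValueError:
            numbers.append(None)
    return numbers


tokens = split_entry("29.98, 30.02 30.10,30.05")
print(tokens)
# ['29.98', '30.02', '30.10', '30.05']
print(to_numbers(["30.05", "3O.10"]))  # second token contains a letter O
# [30.05, None]
```

Keeping the raw tokens as strings until a consensus step preserves details such as trailing zeros. Providing one entry box per observation, as in the second phase, removes the need for this splitting step altogether.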

The workflow questions were designed to lead the volunteers through the image with handwritten meteorological data: first, we asked if the image related to what the workflow task indicated (to potentially eliminate images that had been loaded into the wrong workflow). This was followed by a number of sequential questions that asked the volunteer to transcribe columns of numbers (Figure 6). A workflow task also asked the volunteers to transcribe the latitude and longitude so the historical weather observations could be ascribed to a location, date, and time. A separate workflow for temperature observations asked volunteers to transcribe air, sea, dry bulb, and wet bulb temperatures. Finally, a barometric pressure workflow asked volunteers to transcribe uncorrected pressure, the attached thermometer (required for correcting raw pressure measurements), and the corrected pressure at sea level.

Alongside each workflow, Zooniverse requires a tutorial that leads volunteers through the workflow step by step and addresses any idiosyncratic tasks for that workflow, as well as a field guide. As such, each separate workflow has a unique tutorial. The field guide addresses more general questions about the workflows and the project, and it can be found on the side of any page of the Zooniverse project (see more details on southernweatherdiscovery.org).

Conscripting data rescue participants

Our maiden voyage into the deep south

Our team used a multiphase communication plan to introduce and promote SWD, including video and print media campaigns to garner participation and maintain interest in our meteorological data rescue project. An initial step was to create an introductory video for the SWD website that would encourage people to participate in digital keying of historical weather observations. The video content was crafted with the assumption that the audience had never heard of the project or previously participated in a citizen science effort. This portion of our strategy, in addition to parallel strategies for social media, print media, and radio, was designed by the SWD team and the NIWA Communications team across several months of work to ensure the data rescue scientific content was robust and that delivery to multiple media outlets would be ready in time for launching our SWD project on Zooniverse.

In pre-production for the SWD introductory video, we noted an obvious limitation related to visual content being restricted to historical ship logbooks. However, we were fortunate to find historical footage shot by Herbert Ponting of Robert Falcon Scott’s British Antarctic Expedition to the South Pole in 1910.⁶¹ The black and white video footage of Scott’s expedition showcases the conditions under which the scientific observations were made during the “heroic age of exploration.” Weaving several segments from this historical video into our messaging was central to the strategy of initiating and maintaining engagement with SWD data rescue. To reflect the nautical elements of SWD data rescue, key segments were filmed at the Auckland Maritime Museum. We also framed the central issue around the difficulty of transcribing handwriting, and demonstrated how the audience could be a part of the solution. The general progression of the video also highlighted the importance of recovering historical meteorological observations to provide insights about our current and future climate (refer to the statement “Using their legacy to help ours” at the 2-minutes-and-13-seconds mark in our first SWD video; https://vimeo.com/297007476). The mixture of contemporary and historical video footage engenders ties to the golden age of exploration, with the idea to “breadcrumb” prospective citizen scientists toward participating.

For the SWD launch, observations that were taken at temporary encampments and during overland sledging missions during Scott’s expedition were added onto the SWD website. These workflows were used to entice members of the public to take part in the transcribing effort, and for the media to create a story around. Some of these data came from printed tables and had already been transcribed by other researchers (without our prior knowledge). However, it was considered a minimal time sacrifice to copy and clip those images to garner significant public interest in the project.

SWD was launched on 30 October 2018. The introductory SWD video was promoted by NIWA and reached 127,200 people on Facebook, resulting in 288 comments, likes, or shares. It was viewed 55,000+ times on Facebook, YouTube, and Vimeo. The project was also promoted through the NIWA Communications team to New Zealand media, with a story featured on primetime television news (TV One), which has a nightly audience of ∼600,000 national viewers (>10% of New Zealand’s population) (https://www.tvnz.co.nz/one-news/new-zealand/robert-scotts-weather-logs-give-kiwi-scientists-new-insight-climate-change).

A story about SWD also featured on the front page of The New Zealand Herald website, New Zealand’s second largest online news outlet with a monthly reach of 1.3 million subscribers (https://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=12151407). SWD also had significant coverage in provincial and regional newspapers, with an additional estimated 100,000+ audience reach. A parallel social media campaign was also launched through NIWA’s social channels (Facebook and Twitter) and by members of the research team, with subsidiary re-promotion of materials to reach the Weather Rescue participants (who were largely based overseas and at the time were waiting for more data to key). NIWA’s Twitter promotion about the project reached an audience of 7,955 people with 165 comments, likes, or retweets. Posts on Twitter about the project were shared by climate scientists, international and New Zealand science organizations, and hundreds of members of the public (and even by Chelsea Clinton to her ∼2.4 million followers). The project was also promoted in an e-mail newsletter to all Zooniverse volunteers.

The metrics and progress components supplied by Zooniverse allowed us to track the progress of data transcription, monitor feedback from participants, and identify opportunities to use social media to re-energize participation and draw in more people. The uptake of data keying by volunteers was swift: over 50,000 observations, including all of the ice sledging data from Scott’s expedition, were transcribed in replicate in the first 2 days after the launch of the SWD project. In total, 167,914 unique meteorological observations were successfully captured in replicate through phase I of the SWD project.

Shore leave for the WISE

The second phase of SWD focused on a project called The Week it Snowed Everywhere (WISE), a phrase that we coined to describe a significant snowfall event that affected most of New Zealand during the austral mid-winter of 1939 (Figure 7). A primary goal for this phase of SWD was to evaluate transcription retirement limits (the number of replicate keyings collected before an image is retired) and how those limits relate to the accuracy of citizen science transcription. We also wanted to highlight the serendipitous benefit of SWD citizen science data rescue that comes from high levels of keying replication, including the ability to augment training libraries that underpin computer vision transcription of handwritten tabulated numbers. Automated transcription using computer vision techniques commonly relies on a standardized digital library called the Modified National Institute of Standards and Technology database (MNIST),⁶² which is used to train AI approaches.⁶³,⁶⁴ However, the MNIST dataset offers a relatively limited range of example forms for handwritten digits compared with those available in contemporary resources and in old texts.
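The link between replicate keying and accuracy can be illustrated with a simple majority vote across replicate transcriptions of the same value. The Python sketch below is an assumption for illustration only: this section does not describe the consensus method used in SWD, and the `consensus` function, its `min_agreement` threshold, and the example values are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def consensus(
    replicates: List[str], min_agreement: float = 0.5
) -> Tuple[Optional[str], float]:
    """Return (value, agreement) from a majority vote over replicate keyings.

    The value is None when no entry is keyed by more than `min_agreement`
    of the volunteers, which flags the observation for manual review.
    """
    if not replicates:
        return None, 0.0
    counts = Counter(entry.strip() for entry in replicates)
    value, votes = counts.most_common(1)[0]
    agreement = votes / len(replicates)
    return (value if agreement > min_agreement else None), agreement


# Five hypothetical replicate keyings of one pressure reading.
keyings = ["30.05", "30.05", "30.08", "30.05", "30.05"]
print(consensus(keyings))
# ('30.05', 0.8)

# With a retirement limit of 2, a single disagreement cannot be resolved.
print(consensus(["12", "17"]))
# (None, 0.5)
```

Evaluating `consensus(keyings[:k])` for increasing `k` is one way to explore how many replicates are needed before the accepted value stops changing.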

Snow during the Week it Snowed Everywhere

Snowfall evidence for the Week it Snowed Everywhere (WISE) during late July 1939 at (left) Pukekohe, Auckland (credit: Huia Mitchell via Auckland Libraries Heritage collections, Footprints 03956), and (right) in the streets of Dunedin, Otago (credit: Evening Star, reproduced by the Otago Daily Times).

A promotional video for WISE homed in on the technological connection between citizen science-driven data rescue and AI-based handwriting transcription (https://vimeo.com/374313908). A primary goal for this promotion was to communicate to the citizen scientists how their assistance could accelerate technology improvements and our scientific goals. In this case, having humans contribute to deep datasets that can train AI for handwriting transcription would result in more rapid realization of the benefits of weather reconstructions on a global scale.

We began the WISE video by establishing the value of the ship log observations for understanding past weather events (see quote at 20 seconds in the video, which states “We cannot go back. This is our time machine”). Then, we highlighted the problem that OCR has for transcribing tabulated handwritten digits. We coupled both concepts with the idea that combining scientific knowledge of meteorological data with citizen science and partnering with a global leader in software provision (Microsoft) could help to rapidly overcome a significant problem.

Although OCR technology has been used for decades, few video exemplars demonstrate exactly how it works. To get around this shortfall for communicating to the target audience, our team made a suite of visual animations depicting, in simplified form, what OCR software does (including a mock visualization of the MNIST training dataset). These connections helped to bring two project elements together: historical handwritten logbooks and OCR technology. The understated message is that the 1939 snowfall event provides data that can lead to improved OCR technology, which in turn can help to surmount present limits on rapid acquisition of historical scientific observations. The example from the snowfall event of 1939 also connects the importance of studying past extreme weather events with understanding global change from a relatively isolated location in the antipodes.

The WISE video launched at the November 2019 Microsoft Envision Forum NZ held in Auckland. In parallel, there was a promotional media campaign driven by NIWA Communications, with uptake of the story by all major print media, television, and radio outlets in New Zealand (Figure 8). The connection between Microsoft New Zealand and their parent organization also meant that the shared Twitter reach of posts linking to the promotional video exceeded 400,000 impressions in under 1 month. We also received significant help by “piggy-backing” on Rainfall Rescue, a UK-based data rescue project running on Zooniverse, which supplied a volunteer corps to SWD directly after their project was completed. This influx of citizen scientists saw our project classifications increase by about 500%, and that level was maintained through completion, dramatically reducing the time for data capture (Figure 8).

Zooniverse classification daily progress

Classification statistics for SWD, highlighting phase II WISE activity. Each classification is an instance where a citizen scientist has keyed a data transcription for a small section of a logbook uploaded to Zooniverse. A noticeable boost in keying and project participation by citizen scientists coincided with media and social media advertising, emails to participants, and new material being uploaded to the site. The largest keying increase was associated with the completion of the UK Rainfall Rescue project, which bolstered international participation in our project.

Acknowledgments

This work was partly funded by the New Zealand Government's Deep South National Science Challenge Assessing and Validating the New Zealand Earth System Model Using Modern and Historical Observations and Strategic Science Investment Fund support from NIWA for Climate Present and Past contract CAOA2101. The European Union Copernicus Climate Change Data Rescue Service (2017/C3S_311a_Lot1_Met Office) awarded to the United Kingdom Met Office also supported this work, and the World Meteorological Organization kindly provided co-funding for this study to be published. Support for the 20CR project version 3 dataset is provided by the US Department of Energy, Office of Science Biological and Environmental Research (BER); by the National Oceanic and Atmospheric Administration Climate Program Office; and by the NOAA Physical Sciences Laboratory. Support for the 20CR version 2c dataset is provided by the US Department of Energy, Office of Science BER and by the National Oceanic and Atmospheric Administration Climate Program Office. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231 using NERSC award BER-ERCAP0020982. Contributions by G.P.C. were partially supported by the NOAA Physical Sciences Laboratory, NOAA Climate Program Office, and the NOAA Cooperative Agreement with CIRES, NA17OAR4320101. We thank Lohit Batra, Bob Glancy, and Lucas Joppa for facilitating support to NIWA for SWD via a Microsoft AI for Earth grant in aid and for arranging a showcase of our efforts at the Microsoft Envision Forum in Auckland. We thank Peer Hechler from the WMO and Philip Brohan from the UKMO for comments that improved this manuscript. We thank all researchers from NIWA in New Zealand that have contributed to rescuing historical weather observations. 
We would also like to thank the team at Zooniverse and our SWD volunteers, without whom our project would not have been possible.

Author contributions

Supervising the work, A.M.L.; responsibility for all data, figures, and text, A.M.L., P.R.P., R.J.A., J.-M.W., E.J., L.S., S.M., S.R., P.Q., E.H., S.W., G.C.; ensuring that authorship is granted appropriately to contributors, A.M.L.; ensuring that all authors approve the content and submission of the paper, as well as edits made through the revision and production processes, A.M.L.; ensuring adherence to all editorial and submission policies, A.M.L.; identifying and declaring competing interests on behalf of all authors, A.M.L.; identifying and disclosing related work by any co-authors under consideration elsewhere, A.M.L.; archiving unprocessed data and ensuring that figures accurately present the original data (see Data and code availability section), A.M.L., E.J., J.-M.W., C.W., and R.J.A.; arbitrating decisions and disputes and ensuring communication with the journal (before and after publication), sharing any relevant information or updates with co-authors, and being accountable for fulfilling requests for reagents and resources, A.M.L.; supply of data and analysis, A.M.L., E.J., J.-M.W., S.R., and L.S.; computational resources, S.R., P.Q., G.C., and L.S.

Declaration of interests

The authors declare no competing interests.

Published: May 27, 2022

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.patter.2022.100495.

Supplemental information

Document S1. Figures S1–S5 and Tables S1 and S2 (mmc1.pdf, 1.4 MB)

Document S2. Article plus supplemental information (mmc2.pdf, 5.5 MB)

Data and code availability

The data that were generated in this study are held by the lead contact and are currently being analyzed. They will be made available in the future through ACRE or on reasonable request. Code that was developed and/or augmented for this study is available through the links found in the supplemental information.

References

1. Kalnay E., Kanamitsu M., Kistler R., Collins W., Deaven D., Gandin L., Iredell M., Saha S., White G., Woollen J., et al. The NCEP/NCAR 40-year reanalysis project. Bull. Am. Meteorol. Soc. 1996;77:437–471.
2. Kistler R., Collins W., Saha S., White G., Woollen J., Kalnay E., Chelliah M., Ebisuzaki W., Kanamitsu M., Kousky V., et al. The NCEP–NCAR 50–year reanalysis: monthly means CD–ROM and documentation. Bull. Am. Meteorol. Soc. 2001;82:247–268.
3. Slivinski L.C., Compo G.P., Whitaker J.S., Sardeshmukh P.D., Giese B.S., McColl C., Allan R., Yin X., Vose R., Titchner H., et al. Towards a more reliable historical reanalysis: improvements for version 3 of the Twentieth Century Reanalysis system. Q. J. R. Meteorol. Soc. 2019;145:2876–2908. doi: 10.1002/qj.3598.
4. Uppala S.M., Kållberg P.W., Simmons A.J., Andrae U., Bechtold V.D.C., Fiorino M., Gibson J.K., Haseler J., Hernandez A., Kelly G.A., et al. The ERA-40 re-analysis. Q. J. R. Meteorol. Soc. 2005;131:2961–3012. doi: 10.1256/qj.04.176.
5. Compo G.P., Whitaker J.S., Sardeshmukh P.D., Matsui N., Allan R.J., Yin X., Gleason B.E., Vose R.S., Rutledge G., Bessemoulin P., et al. The Twentieth century reanalysis project. Q. J. R. Meteorol. Soc. 2011;137:1–28. doi: 10.1002/qj.776.
6. Slivinski L.C., Compo G.P., Sardeshmukh P.D., Whitaker J.S., McColl C., Allan R.J., Brohan P., Yin X., Smith C.A., Spencer L.J., et al. An evaluation of the performance of the Twentieth century reanalysis version 3. J. Clim. 2021;34:1417–1438. doi: 10.1175/JCLI-D-20-0505.1.
7. Gallant A.J.E., Phipps S.J., Karoly D.J., Mullan A.B., Lorrey A.M. Nonstationary Australasian teleconnections and implications for paleoclimate reconstructions. J. Clim. 2013;26:8827–8849. doi: 10.1175/JCLI-D-12-00338.1.
8. Jiang N., Griffiths G., Lorrey A. Influence of large-scale climate modes on daily synoptic weather types over New Zealand. Int. J. Climatol. 2013;33:499–519. doi: 10.1002/joc.3443.
9. Liu Z., Alexander M. Atmospheric bridge, oceanic tunnel, and global climatic teleconnections. Rev. Geophys. 2007;45:RG2005. doi: 10.1029/2005RG000172.
10. Thorne P.W., Vose R.S. Reanalyses suitable for characterizing long-term trends. Bull. Am. Meteorol. Soc. 2010;91:353–362. doi: 10.1175/2009BAMS2858.1.
11. Freeman E., Kent E.C., Brohan P., Cram T., Gates L., Huang B., Liu C., Smith S.R., Worley S.J., Zhang H.-M. The international comprehensive Ocean-Atmosphere data set – meeting users needs and future priorities. Front. Mar. Sci. 2019;6 doi: 10.3389/fmars.2019.00435.
12. Freeman E., Woodruff S.D., Worley S.J., Lubker S.J., Kent E.C., Angel W.E., Berry D.I., Brohan P., Eastman R., Gates L., et al. ICOADS Release 3.0: a major update to the historical marine climate record. Int. J. Climatol. 2017;37:2211–2232. doi: 10.1002/joc.4775.
13. Cram T.A., Compo G.P., Yin X., Allan R.J., McColl C., Vose R.S., Whitaker J.S., Matsui N., Ashcroft L., Auchmann R., et al. The international surface pressure Databank version 2. Geosci. Data J. 2015;2:31–46. doi: 10.1002/gdj3.25.
14. Compo G.P., Slivinski L.C., Whitaker J.S., Sardeshmukh P.D., McColl C., Brohan P., Allan R., Yin X., Vose R., Spencer L.J., et al. The international surface pressure databank version 4. Res. Data Arch. Natl. Cent. Atmos. Res. Comput. Inf. Syst. Lab. 2019 doi: 10.5065/9EYR-TY90.
15. Woodruff S.D., Worley S.J., Lubker S.J., Ji Z., Eric Freeman J., Berry D.I., Brohan P., Kent E.C., Reynolds R.W., Smith S.R., et al. ICOADS Release 2.5: extensions and enhancements to the surface marine meteorological archive. Int. J. Climatol. 2011;31:951–967. doi: 10.1002/joc.2103.
16. Worley S.J., Woodruff S.D., Reynolds R.W., Lubker S.J., Lott N. ICOADS release 2.1 data and products. Int. J. Climatol. 2005;25:823–842. doi: 10.1002/joc.1166.
17. Brönnimann S., Brugnara Y., Allan R.J., Brunet M., Compo G.P., Crouthamel R.I., Jones P.D., Jourdain S., Luterbacher J., Siegmund P., et al. A roadmap to climate data rescue services. Geosci. Data J. 2018;5:28–39. doi: 10.1002/gdj3.56.
18. Thorne P.W., Allan R.J., Ashcroft L., Brohan P., Dunn R.J.H., Menne M.J., Pearce P.R., Picas J., Willett K.M., Benoy M., et al. Toward an integrated set of surface meteorological observations for climate science and applications. Bull. Am. Meteorol. Soc. 2017;98:2689–2702. doi: 10.1175/BAMS-D-16-0165.1.
19. Allan R., Brohan P., Compo G.P., Stone R., Luterbacher J., Brönnimann S. The international atmospheric circulation reconstructions over the Earth (ACRE) initiative. Bull. Am. Meteorol. Soc. 2011;92:1421–1425. doi: 10.1175/2011BAMS3218.1.
20. Allan R., Wood K., Freeman E., Wilkinson C., Andersson A., Lorrey A., Brohan P., Stendel M., Kennedy J. Learning from the past to understand the future: historical records of change in the ocean. WMO Bull. 2021;70:36–42.
21. Brohan P. Testing Google Vision for weather data rescue. 2019. https://brohan.org/Google-Vision/
22. Brohan P. Testing AWS Textract for weather data rescue. 2020. https://brohan.org/AWS-Textract/
23. Kaspar F., Tinz B., Mächel H., Gates L. Data rescue of national and international meteorological observations at Deutscher Wetterdienst. Adv. Sci. Res. 2015;12:57–61. doi: 10.5194/asr-12-57-2015.
24. Ashcroft L., Gergis J., Karoly D.J. A historical climate dataset for southeastern Australia, 1788–1859. Geosci. Data J. 2014;1:158–178. doi: 10.1002/gdj3.19.
25. Bridgman H., Ashcroft L., Thornton K., Di Gravio G., Oates W. Meteorological observations for Eversleigh Station, near Armidale, New South Wales, Australia: 1877–1922. Geosci. Data J. 2019;6:174–188. doi: 10.1002/gdj3.80.
26. Slonosky V., Sieber R., Burr G., Podolsky L., Smith R., Bartlett M., Park E., Cullen J., Fabry F. From books to bytes: a new data rescue tool. Geosci. Data J. 2019;6:58–73. doi: 10.1002/gdj3.62.
27. Ashcroft L., Coll J.R., Gilabert A., Domonkos P., Brunet M., Aguilar E., Castella M., Sigro J., Harris I., Unden P., et al. A rescued dataset of sub-daily meteorological observations for Europe and the southern Mediterranean region, 1877–2012. Earth Syst. Sci. Data. 2018;10:1613–1635. doi: 10.5194/essd-10-1613-2018.
28. Williams M., Varma B., Hayek O., Dean M. Development of the New Zealand Earth system model. Weather Clim. 2016;36:25.
29. Brenstrum E. Craig Potton Publishing; 1998. The New Zealand Weather Book.
30. Lorrey A.M., Chappell P.R. The ‘dirty weather’ diaries of Reverend Richard Davis: insights about early colonial-era meteorology and climate variability for northern New Zealand, 1839-1851. Clim. Past. 2016;12 doi: 10.5194/cp-12-553-2016.
31. Mo K.C., Paegle J.N. The Pacific-South American modes and their downstream effects. Int. J. Climatol. 2001;21:1211–1229. doi: 10.1002/joc.685.
32. Raphael M.N. A zonal wave 3 index for the Southern Hemisphere. Geophys. Res. Lett. 2004;31 doi: 10.1029/2004GL020365.
33. Ummenhofer C.C., England M.H. Interannual extremes in New Zealand precipitation linked to modes of Southern Hemisphere climate variability. J. Clim. 2007;20:5418–5440. doi: 10.1175/2007JCLI1430.1.
34. Kidston J., Renwick J.A., McGregor J. Hemispheric-scale seasonality of the southern annular mode and impacts on the climate of New Zealand. J. Clim. 2009;22:4759–4770. doi: 10.1175/2009JCLI2640.1.
35. Brönnimann S., Allan R., Atkinson C., Buizza R., Bulygina O., Dahlgren P., Dee D., Dunn R., Gomes P., John V.O., et al. Observations for reanalyses. Bull. Am. Meteorol. Soc. 2018;99:1851–1866. doi: 10.1175/BAMS-D-17-0229.1.
36. Wheeler D., García Herrera R., Koek F., Wilkinson C., Können G., del Rosario Prieto M., Jones P., Casale R. European Commission; 2007. CLIWOC, Climatological Database for the World’s Oceans: 1750 to 1850; Results of a Research Project EVK1-CT-2000-00090.
37. Teleti P.R., Rees W.G., Dowdeswell J.A., Wilkinson C. A historical Southern Ocean climate dataset from whaling ships’ logbooks. Geosci. Data J. 2019;6:30–40. doi: 10.1002/gdj3.65.
38. Chappell P.R., Lorrey A.M. Identifying New Zealand, Southeast Australia, and Southwest Pacific historical weather data sources using Ian Nicholson’s log of logs. Geosci. Data J. 2014;1:49–60. doi: 10.1002/gdj3.1.
39. Wilkinson C., Woodruff S.D., Brohan P., Claesson S., Freeman E., Koek F., Lubker S.J., Marzin C., Wheeler D. Recovery of logbooks and international marine data: the RECLAIM project. Int. J. Climatol. 2011;31:968–979. doi: 10.1002/joc.2102.
40. Wilkinson C., Vásquez M. Historic sea-ice and meteorological data sources for Southern Ocean and Antarctic; 2016. Report on the imaging of historic ice. Vestfold archive, Sandefjord, Norway: meteorological and oceanographic data in Antarctic waters.
41. Wilkinson C., Vásquez M. Historic sea-ice and meteorological data sources for Southern Ocean and Antarctic; 2017. Report on the Imaging of Sources of Historic Ice, Meteorological and Oceanographic Data in the Southern Ocean – Åland Maritime Museum, Mariehamn, Finland.
42. Kidson E. Government Printer; 1947. Daily Weather Charts Extending from Australia and New Zealand to the Antarctic Continent.
43. Wilkinson C., Freeman E. Copernicus Climate Change Data Rescue Service; 2021. Best Practice Guidelines for Keying Data from Historic Marine Documents.
44. Ding H., Chen K., Yuan Y., Cai M., Sun L., Liang S., Huo Q. 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) IEEE; 2017. A compact CNN-DBLSTM based character model for offline handwriting recognition with tucker decomposition; pp. 507–512.
45. Zhong Z., Sun L., Huo Q. An anchor-free region proposal network for Faster R-CNN-based text detection approaches. Int. J. Doc. Anal. Recognit. 2019;22:315–327. doi: 10.1007/s10032-019-00335-y.
46. Ma C., Zhong Z., Sun L., Huo Q. 2019 International Conference on Document Analysis and Recognition (ICDAR) IEEE; 2019. A relation network based approach to curved text detection; pp. 707–713.
47. Brönnimann S., Allan R., Ashcroft L., Baer S., Barriendos M., Brázdil R., Brugnara Y., Brunet M., Brunetti M., Chimani B., et al. Unlocking pre-1850 instrumental meteorological records: a global inventory. Bull. Am. Meteorol. Soc. 2019;100:ES389–ES413. doi: 10.1175/BAMS-D-19-0040.1.
48. Allan R., Compo G., Carton J. Recovery of global surface weather observations for historical reanalyses and international users. Eos, Trans. Am. Geophys. Union. 2011;92:154. doi: 10.1029/2011EO180008.
49. Ashcroft L., Allan R., Bridgman H., Gergis J., Pudmenzky C., Thornton K. Current climate data rescue activities in Australia. Adv. Atmos. Sci. 2016;33:1323–1324. doi: 10.1007/s00376-016-6189-5.
50. Allan R., Endfield G., Damodaran V., Adamson G., Hannaford M., Carroll F., Macdonald N., Groom N., Jones J., Williamson F., et al. Toward integrated historical climate research: the example of Atmospheric Circulation Reconstructions over the Earth. Wires Clim. Chang. 2016;7:164–174. doi: 10.1002/wcc.379.
51. Brunet M., Jones P. Data rescue initiatives: bringing historical climate data into the 21st century. Clim. Res. 2011;47:29–40. doi: 10.3354/cr00960.
52. Wilkinson C., Brönnimann S., Jourdain S., Roucaute E., Crouthamel R., Brohan P., Valente A., Brugnara Y., Brunet M., Compo G.P., et al. ECMWF; 2019. Best Practice Guidelines for Climate Data Rescue v1, of the Copernicus Climate Change Service Data Rescue Service.
53. Brunet M., Brugnara Y., Noone S., Stephens A., Valente M.A., Ventura C., Jones P., Gilabert A., Brönnimann S., Luterbacher J., et al. Best practice guidelines for climate data and metadata formatting, quality control and submission of the Copernicus climate change service data rescue service. 2020.
54. Page C.M., Nicholls N., Plummer N., Trewin B., Manton M., Alexander L., Chambers L.E., Choi Y., Collins D.A., Gosai A., et al. Data rescue in the Southeast Asia and South Pacific region: challenges and opportunities. Bull. Am. Meteorol. Soc. 2004;85:1483–1490. doi: 10.1175/BAMS-85-10-1483.
55. Craig P.M., Hawkins E. Digitizing observations from the Met Office daily weather reports for 1900–1910 using citizen scientist volunteers. Geosci. Data J. 2020;7:116–134. doi: 10.1002/gdj3.93.
56. Eveleigh A., Jennett C., Blandford A., Brohan P., Cox A.L. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM; 2014. Designing for dabblers and deterring drop-outs in citizen science; pp. 2985–2994.
57. Brohan P. American Geophysical Union, Fall Meeting 2014; 2014. Citizen Science for Data Rescue: Recovering Historical Climate Records with a Network of 20,000 Volunteers.
58. Brohan P. AGU Fall Meeting Abstract. American Geophysical Union; 2012. oldWeather.org: citizen science for climate reconstruction. ED53A–0922.
59. Hawkins E., Burt S., Brohan P., Lockwood M., Richardson H., Roy M., Thomas S. Hourly weather observations from the Scottish Highlands (1883–1904) rescued by volunteer citizen scientists. Geosci. Data J. 2019;6:160–173. doi: 10.1002/gdj3.79.
60. Burt S., Hawkins E. Near-zero humidities on Ben Nevis, Scotland, revealed by pioneering 19th-century observers and modern volunteers. Int. J. Climatol. 2019;39:4451–4466. doi: 10.1002/joc.6084.
61. Read J. Scott’s Last Journey. 1964. United Kingdom: British Broadcasting Corporation; https://archive.org/details/scottslastjourney/scottslastjourneyreel1.mov.
62.LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998;86:2278–2324. [Google Scholar]
63.Ahlawat S., Choudhary A. Hybrid CNN-SVM classifier for handwritten digit recognition. Proced. Comput. Sci. 2020;167:2554–2560. doi: 10.1016/j.procs.2020.03.309. [DOI] [Google Scholar]
64.Kussul E., Baidyk T. Improved method of handwritten digit recognition tested on MNIST database. Image Vis. Comput. 2004;22:971–981. doi: 10.1016/j.imavis.2004.03.008. [DOI] [Google Scholar]

Associated Data

Supplementary Materials

Document S1. Figures S1–S5 and Tables S1 and S2: mmc1.pdf (1.4 MB, PDF)

Document S2. Article plus supplemental information: mmc2.pdf (5.5 MB, PDF)

