2019 Nov 5;11(1):154–168. doi: 10.1039/c9sc04944d

Table 2. Datasets used in this study and their respective sizes, given as the raw dataset size without filtering.

Dataset	Dataset size	Extracted and validated	Without duplicates	Templates extracted
USPTO 1976–2016 ^a (ref. 28)	3 748 191	3 079 351	1 201 602	302 282
Grants ^a	1 808 938	1 471 088	895 436	239 895
Applications ^a	1 939 254	1 608 263	923 765	223 871
Pistachio Nov 2017 ^b	6 836 027	4 897 300	1 627 792	367 488
Combined patents	10 587 618	7 976 651	1 711 330	358 307
Reaxys ^b	6 540 786 ^d	5 071 074	4 571 364	361 603
Reaxys + patents	17 128 404	13 047 725	6 141 875	665 288
AZ ELN subset ^b,c	398 779 ^d	254 468	207 868	30 805
All combined	17 523 783	13 302 193	6 342 331	675 530

^a Publicly available.

^b Proprietary.

^c Only successful reactions have been considered.

^d Values reported are those after an initial internal data curation step.

Dataset size refers to the number of reactions available as reaction SMILES before curation or subsequent filtering, unless otherwise specified.

Extracted and validated refers to the number of reactions that remain after curation, automatic template extraction, and validation of the extracted template. Validation applies the template to the product of the reaction from which it was extracted, to determine whether the corresponding reactants can be regenerated.

Without duplicates: duplicates were identified as identical reaction SMILES, considering variations in the ordering of different entities and in atom-mapping. Duplicate templates were identified in the same manner.

The number of products refers to products of single-step reactions, where duplicates have been removed.
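The duplicate criterion described in the legend (identical reaction SMILES regardless of entity ordering and atom-mapping) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `strip_atom_maps`, `reaction_key` and `deduplicate` are hypothetical, and the normalisation is purely textual. It assumes each molecule is already written as a canonical SMILES string, so that only atom-map numbers and component order differ between duplicates; a real pipeline would canonicalise each molecule with a cheminformatics toolkit first.

```python
import re

# Matches the ":<digits>" atom-map suffix inside a bracket atom, e.g. "[CH3:1]".
_ATOM_MAP = re.compile(r":\d+(?=\])")


def strip_atom_maps(smiles: str) -> str:
    """Remove atom-map numbers from a SMILES string (textual, not chemical)."""
    return _ATOM_MAP.sub("", smiles)


def reaction_key(rxn_smiles: str) -> str:
    """Build a key that ignores atom-mapping and the order of molecules.

    Assumption: input has the form reactants>agents>products and every
    molecule is already a canonical SMILES string.
    """
    parts = rxn_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected reactants>agents>products, got: {rxn_smiles!r}")
    normalised = []
    for side in parts:
        # Split each side into molecules, drop map numbers, sort for order-independence.
        molecules = [strip_atom_maps(m) for m in side.split(".") if m]
        normalised.append(".".join(sorted(molecules)))
    return ">".join(normalised)


def deduplicate(reactions):
    """Keep the first occurrence of each reaction, preserving input order."""
    seen = set()
    unique = []
    for rxn in reactions:
        key = reaction_key(rxn)
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique


if __name__ == "__main__":
    a = "[CH3:1][OH:2].[ClH:3]>>[CH3:1][Cl:3].[OH2:2]"
    b = "[ClH:7].[CH3:4][OH:5]>>[OH2:5].[CH3:4][Cl:7]"  # same reaction, reordered and remapped
    print(reaction_key(a) == reaction_key(b))
    print(len(deduplicate([a, b])))
```

Sorting the molecules on each side makes the key independent of entity order, and removing the map numbers makes it independent of atom-mapping. The same key function could be applied to template strings, since the legend states that duplicate templates were identified in the same manner.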