2019 Nov 5;11(1):154–168. doi: 10.1039/c9sc04944d

Table 2. Datasets used in this study and their respective sizes, given as the raw dataset size without filtering.

| Dataset | Dataset size | Extracted and validated | Without duplicates | Templates extracted |
|---|---|---|---|---|
| USPTO 1976–2016^a (ref. 28) | 3 748 191 | 3 079 351 | 1 201 602 | 302 282 |
| Grants^a | 1 808 938 | 1 471 088 | 895 436 | 239 895 |
| Applications^a | 1 939 254 | 1 608 263 | 923 765 | 223 871 |
| Pistachio Nov 2017^b | 6 836 027 | 4 897 300 | 1 627 792 | 367 488 |
| Combined patents | 10 587 618 | 7 976 651 | 1 711 330 | 358 307 |
| Reaxys^b | 6 540 786^d | 5 071 074 | 4 571 364 | 361 603 |
| Reaxys + patents | 17 128 404 | 13 047 725 | 6 141 875 | 665 288 |
| AZ ELN subset^b,c | 398 779^d | 254 468 | 207 868 | 30 805 |
| All combined | 17 523 783 | 13 302 193 | 6 342 331 | 675 530 |

^a Publicly available.

^b Proprietary.

^c Only successful reactions have been considered.

^d Values reported are those after an initial internal data curation step.

Dataset size: refers to the number of reactions available as reaction SMILES before curation or subsequent filtering, unless otherwise specified.

Extracted and validated: refers to the number of reactions that remain after curation, automatic template extraction, and validation of the extracted template by application to the product of the reaction from which it was extracted, to determine if the corresponding reactants can be regenerated.

Duplicates were identified as identical reaction SMILES considering variations in the ordering of different entities and atom-mapping. Duplicate templates were identified in the same manner.

The number of products refers to products of single step reactions, where duplicates have been removed.
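The duplicate criterion described in the notes (identical reaction SMILES regardless of entity ordering and atom-mapping) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `reaction_key` and `drop_duplicates` are hypothetical, and the sketch assumes each molecule is already written as a canonical SMILES string, since it only strips atom-map numbers and sorts the entities on each side of the reaction. A real workflow would likely canonicalize every molecule with a cheminformatics toolkit first.

```python
import re

# Matches an atom-map suffix such as ":12" directly before the closing
# bracket of a bracket atom, e.g. "[CH3:1]" -> "[CH3]".
_ATOM_MAP = re.compile(r":\d+(?=\])")


def reaction_key(rxn_smiles: str) -> str:
    """Return a key that ignores entity ordering and atom-map numbers.

    Assumption (hypothetical helper): molecules are already canonical SMILES;
    this function does NOT canonicalize the molecules themselves.
    """
    parts = rxn_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected reactants>agents>products, got {rxn_smiles!r}")
    normalized = []
    for side in parts:
        unmapped = _ATOM_MAP.sub("", side)                       # drop atom maps
        molecules = sorted(m for m in unmapped.split(".") if m)  # fix the ordering
        normalized.append(".".join(molecules))
    return ">".join(normalized)


def drop_duplicates(reactions):
    """Keep the first occurrence of each reaction, compared by reaction_key."""
    seen = set()
    unique = []
    for rxn in reactions:
        key = reaction_key(rxn)
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique


# Same reaction written with a different entity order and atom-map numbering.
rxn_a = "[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3]"
rxn_b = "[OH-:1].[CH3:2][Br:3]>>[CH3:2][OH:1]"
# A genuinely different reaction (chloride instead of bromide).
rxn_c = "[CH3:1][Cl:2].[OH-:3]>>[CH3:1][OH:3]"

unique_reactions = drop_duplicates([rxn_a, rxn_b, rxn_c])
```

Here `rxn_a` and `rxn_b` collapse to the same key, so only `rxn_a` and `rxn_c` survive. The same key-based comparison could in principle be applied to template strings, which the notes say were deduplicated in the same manner.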