Table 2. Datasets used in this study and their sizes. The "Dataset size" column gives the raw dataset size without filtering.
| Dataset | Dataset size | Extracted and validated | Without duplicates | Templates extracted |
| --- | ---: | ---: | ---: | ---: |
| USPTO 1976–2016 a (ref. 28) | 3 748 191 | 3 079 351 | 1 201 602 | 302 282 |
| Grants a | 1 808 938 | 1 471 088 | 895 436 | 239 895 |
| Applications a | 1 939 254 | 1 608 263 | 923 765 | 223 871 |
| Pistachio Nov 2017 b | 6 836 027 | 4 897 300 | 1 627 792 | 367 488 |
| Combined patents | 10 587 618 | 7 976 651 | 1 711 330 | 358 307 |
| Reaxys b | 6 540 786 d | 5 071 074 | 4 571 364 | 361 603 |
| Reaxys + patents | 17 128 404 | 13 047 725 | 6 141 875 | 665 288 |
| AZ ELN subset b,c | 398 779 d | 254 468 | 207 868 | 30 805 |
| All combined | 17 523 783 | 13 302 193 | 6 342 331 | 675 530 |
a Publicly available.
b Proprietary.
c Only successful reactions have been considered.
d Values reported are those after an initial internal data curation step.

Dataset size refers to the number of reactions available as reaction SMILES before curation or subsequent filtering, unless otherwise specified.
Extracted and validated refers to the number of reactions that remain after curation, automatic template extraction, and validation of the extracted template. Validation applies the template to the product of the reaction from which it was extracted and checks whether the corresponding reactants are regenerated.
Duplicates were identified as identical reaction SMILES, disregarding differences in the ordering of entities and in atom-mapping. Duplicate templates were identified in the same manner.
The number of products refers to products of single-step reactions, where duplicates have been removed.
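The duplicate criterion described in the notes (identical reaction SMILES regardless of entity ordering and atom-mapping) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `reaction_key` and `deduplicate` are hypothetical, and the sketch assumes every molecule is already written in a consistent canonical form, so that removing atom-map numbers and sorting the components is enough to compare two reactions. A real pipeline would canonicalise each molecule with a cheminformatics toolkit first.

```python
import re

# Matches the atom-map suffix inside a bracket atom, e.g. ":12" in "[CH3:12]".
_ATOM_MAP = re.compile(r":\d+(?=\])")


def reaction_key(rxn_smiles: str) -> str:
    """Build an order- and atom-map-independent key (hypothetical helper).

    Assumes the input has the form reactants>agents>products and that each
    molecule is already in a consistent canonical form.
    """
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"not a reaction SMILES: {rxn_smiles!r}")
    normalised = []
    for part in parts:
        # Strip atom-map numbers, then sort so component order is irrelevant.
        mols = sorted(_ATOM_MAP.sub("", mol) for mol in part.split(".") if mol)
        normalised.append(".".join(mols))
    return ">".join(normalised)


def deduplicate(reactions):
    """Keep the first occurrence of each reaction, compared by reaction_key."""
    seen = set()
    unique = []
    for rxn in reactions:
        key = reaction_key(rxn)
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique


example = [
    "[CH3:1][OH:2].[Na+]>>[CH3:1][O-:2]",
    "[Na+].[CH3:2][OH:1]>>[CH3:2][O-:1]",  # same reaction, reordered and remapped
    "CC>>CCO",
]
unique_reactions = deduplicate(example)
```

Here the first two entries differ only in component order and atom-map numbers, so they collapse to a single key and `unique_reactions` keeps two of the three entries. The same key-based approach could be applied to template strings, under the same canonical-form assumption.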