Table 1. Number of Reactions Left in Each Data Set after Cleaningᵃ
| cleaning step | ORDerly-condition (labeling) | ORDerly-condition (rxn string) | ORDerly-forward | ORDerly-retro | non-USPTO-forward |
|---|---|---|---|---|---|
| full data set | 1,771,032 | 1,771,032 | 1,771,032 | 1,771,032 | 94,043 |
| too many reactants | 518,369 | 1,627,929 | 1,743,179 | 1,627,929 | 46,821 |
| too many products | 473,437 | 1,589,977 | 1,740,254 | 1,589,977 | 43,362 |
| too many solvents | 446,484 | 1,385,579 | 1,689,075 | NA | 39,114 |
| too many agents | 446,484 | 1,279,207 | 1,552,671 | NA | 32,243 |
| no reactants/products | 441,859 | 1,261,701 | 1,533,571 | 1,564,525 | 32,103 |
| dropping duplicates | 264,846 | 753,338 | 919,077 | 939,648 | 29,417 |
| frequency filtering | 258,273 | 691,142 | NA | NA | NA |
ᵃ A description of each data set can be found in the Methodology section. The number of reactions actually used for training differs from the data set sizes shown in the table because of train/test splits and augmentation. Non-USPTO-retro had a final data set size of 23,334 and was cleaned in the same way as ORDerly-retro.
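
To make the effect of each cleaning step easier to read off, the short sketch below computes how many reactions each step removes and what fraction of the full data set remains afterwards. It uses only the ORDerly-forward counts copied from Table 1; the names `forward_counts` and `removed_per_step` are illustrative and not part of the ORDerly code base.

```python
# Counts copied from the ORDerly-forward column of Table 1.
forward_counts = [
    ("full data set", 1_771_032),
    ("too many reactants", 1_743_179),
    ("too many products", 1_740_254),
    ("too many solvents", 1_689_075),
    ("too many agents", 1_552_671),
    ("no reactants/products", 1_533_571),
    ("dropping duplicates", 919_077),
]


def removed_per_step(counts):
    """Return (step, reactions removed, fraction of full data set retained)."""
    full = counts[0][1]
    rows = []
    # Pair each row with the next one to get the before/after counts of a step.
    for (_, before), (step, after) in zip(counts, counts[1:]):
        rows.append((step, before - after, after / full))
    return rows


steps = removed_per_step(forward_counts)
for step, removed, retained in steps:
    print(f"{step:<24} removed {removed:>9,}  retained {retained:6.1%}")
```

The same function can be applied to any other column of the table, provided the rows marked NA (steps that were not applied to that data set) are left out of the list.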