2024 Apr 22;64(9):3790–3798. doi: 10.1021/acs.jcim.4c00292

Table 1. Number of Reactions Left in Each Data Set after Cleaning^a

| cleaning step (rows) / data set name (columns) | ORDerly-condition (labeling) | ORDerly-condition (rxn string) | ORDerly-forward | ORDerly-retro | non-USPTO-forward |
|---|---|---|---|---|---|
| full data set | 1,771,032 | 1,771,032 | 1,771,032 | 1,771,032 | 94,043 |
| too many reactants | 518,369 | 1,627,929 | 1,743,179 | 1,627,929 | 46,821 |
| too many products | 473,437 | 1,589,977 | 1,740,254 | 1,589,977 | 43,362 |
| too many solvents | 446,484 | 1,385,579 | 1,689,075 | NA | 39,114 |
| too many agents | 446,484 | 1,279,207 | 1,552,671 | NA | 32,243 |
| no reactants/products | 441,859 | 1,261,701 | 1,533,571 | 1,564,525 | 32,103 |
| dropping duplicates | 264,846 | 753,338 | 919,077 | 939,648 | 29,417 |
| frequency filtering | 258,273 | 691,142 | NA | NA | NA |
^a A description of each data set can be found in the Methodology section. Note that the actual number of reactions used for training will differ from the data set sizes shown in the table due to train/test splits and augmentation. Non-USPTO-retro had a final data set size of 23,334 and was cleaned in the same way as ORDerly-retro.
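
Each row of Table 1 gives the number of reactions remaining after a cleaning step, so the reactions removed by a step are the difference from the row above. The following is a minimal sketch of that arithmetic, using the counts listed under ORDerly-forward in the table. The names `forward_counts` and `retention` are illustrative assumptions and are not taken from the ORDerly code base.

```python
# Counts copied from the ORDerly-forward column of Table 1.
# The "frequency filtering" row is NA for this data set and is omitted.
forward_counts = [
    ("full data set", 1_771_032),
    ("too many reactants", 1_743_179),
    ("too many products", 1_740_254),
    ("too many solvents", 1_689_075),
    ("too many agents", 1_552_671),
    ("no reactants/products", 1_533_571),
    ("dropping duplicates", 919_077),
]


def retention(counts):
    """Return (step, reactions removed by that step, fraction of the full set left)."""
    total = counts[0][1]
    previous = total
    rows = []
    for step, remaining in counts:
        rows.append((step, previous - remaining, remaining / total))
        previous = remaining
    return rows


funnel = retention(forward_counts)
for step, removed, fraction in funnel:
    print(f"{step:<24} removed {removed:>9,}  remaining {fraction:6.1%}")
```

For this column, the duplicate-dropping step accounts for the largest single reduction (614,494 reactions). The same function can be applied to any other column by swapping in its counts and skipping the NA rows.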