The tips of the tree correspond to the year of sampling. Branch and node colours reflect the sampling location for the tip branches and the inferred location for the internal branches (AF, Africa; CA, California; CB, Caribbean; GA, Georgia; NY, New York). Tip labels are provided for the newly obtained archival HIV-1 genomes. Diameters of internal node circles reflect posterior location probability values. Thick outer circles represent internal nodes with posterior probability support > 0.95. We also depict the posterior probability density for the time of the introduction event from the Caribbean into the US on the time scale of the tree. A fully annotated tree for this data set (‘full genome 38’, which includes only sequences sampled early in the US epidemic) is shown in ED Fig. 2b; the tree for ‘full genome 46’, which includes all available complete genomes basal to the “pandemic clade” (ref. 3) of subtype B, plus a similar number and date range of US pandemic clade sequences, is shown in ED Fig. 2a. Separate analyses of gag, pol, env, and the coding-complete genomes (also including sequences sampled later in the US epidemic) provide consistent results (ED Figs. 3 and 4).