Fig. 2. Pipeline for training neural networks on phylogenies.
Tree representations: a (i), simulated binary trees. Under each model from Fig. 1, we simulate many trees of variable size (50 to 200 tips for ‘small trees’ and 200 to 500 tips for ‘large trees’). For illustration, we show here a tree with 5 tips. We encode the simulations into one of two representations: either, a (ii–v), a complete and compact tree representation called ‘Compact Bijective Ladderized Vector’ (CBLV) or, a (vi), summary statistics (SS). CBLV is obtained through, a (ii), ladderization, i.e. sorting of internal nodes so that the branch supporting the most recent leaf is always on the left, and, a (iii), an inorder tree traversal, during which we append to a real-valued vector, for each visited internal node, its distance to the root and, for each visited tip, its distance to the previously visited internal node. We reshape this representation into, a (iv), an input matrix in which the information on internal nodes and leaves is separated into two rows. Finally, a (v), we pad this matrix with zeros so that the matrices for all simulations have the size of the largest simulation matrix. For illustration purposes, we here assume that the maximum tree size covered by simulations is 10, and the representation is thus padded with 0s accordingly. SS consists of, a (vi), a set of 98 statistics: 83 published in Saulnier et al.19, 14 on transmission chains and 1 on tree size. The information on sampling probability is added to both representations. b Neural networks are trained on these representations to estimate parameter values or to select the underlying model. For SS, we use, b (i), a deep feed-forward neural network (FFNN) of funnel shape (we show the number of neurons above each layer). For the CBLV representation we train, b (ii), convolutional neural networks (CNN). The CNN is added on top of the FFNN.
The CNN combines convolutional, maximum pooling and global average pooling layers, as described in detail in ‘Methods’ and Supplementary Information.
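The CBLV steps in panel a (ii–v) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the function names and the `max_tips` parameter are hypothetical, and the sampling probability is left out. The caption does not say how the first visited tip is handled (no internal node precedes it), so the sketch assumes its distance to the root is stored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal node: branch length to the parent, 0 or 2 children."""
    length: float = 0.0
    children: list = field(default_factory=list)


def ladderize(node, depth=0.0):
    """Step (ii): reorder children in place so that the subtree holding the
    most recent tip (largest root-to-tip distance) comes first (left).
    Returns that largest root-to-tip distance."""
    depth += node.length
    if not node.children:
        return depth
    heights = [ladderize(child, depth) for child in node.children]
    order = sorted(range(len(heights)), key=heights.__getitem__, reverse=True)
    node.children = [node.children[i] for i in order]
    return max(heights)


def encode_cblv(root, max_tips):
    """Steps (ii)-(v): ladderize, traverse inorder, split into two rows, zero-pad."""
    ladderize(root)
    internals, tips = [], []
    last_internal = 0.0  # assumption: the first tip is measured from the root

    def visit(node, depth):
        nonlocal last_internal
        depth += node.length
        if not node.children:
            # Tip: distance to the previously visited internal node.
            tips.append(depth - last_internal)
            return
        left, right = node.children
        visit(left, depth)
        # Internal node: distance to the root.
        internals.append(depth)
        last_internal = depth
        visit(right, depth)

    visit(root, 0.0)
    if len(tips) > max_tips:
        raise ValueError("tree has more tips than max_tips")

    def pad(row):
        return row + [0.0] * (max_tips - len(row))

    # Step (iv): one row for internal nodes, one for tips; step (v): zero padding.
    return [pad(internals), pad(tips)]


# A 5-tip example tree, deliberately built in non-ladderized order.
c = Node(1.0, [Node(0.5), Node(1.0)])
b = Node(1.0, [c, Node(3.0)])
a = Node(1.0, [Node(2.0), b])
example_root = Node(0.0, [Node(1.0), a])
matrix = encode_cblv(example_root, max_tips=10)
```

For this example tree, the first row of `matrix` holds the root distances of the four internal nodes (2, 3, 1, 0) and the second row holds the five tip values (5, 2, 0.5, 2, 1), each followed by zeros up to length 10. The recursion is kept for readability; very unbalanced trees with many tips may call for an iterative traversal.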