Improved detection of differentially represented DNA barcodes for high‐throughput clonal phenomics

A schematic presentation of a typical clone‐tracing experiment (see text for description).

To generate the benchmark barcode count datasets, we performed two independent high‐complexity DNA barcoding experiments on Mia‐PaCa‐II and OVCAR5 cell lines (see Materials and Methods for details). In each experiment, cells were collected after the selection and expansion step (Fig 1A) to produce two cell pools (Pool A and Pool B). Cells in each pool were counted and mixed in a 50/50 ratio to produce the “AB mix”. The AB mix was then sampled to various extents, in two replicates, to produce so‐called null samples with different numbers of cells (20 × 10³, 40 × 10³, 80 × 10³, 160 × 10³, 330 × 10³, 660 × 10³), but with the same expected representation of each barcode. Perturbed samples were generated by taking 20, 40, 80 or 160 thousand cells from the AB mix and adding the indicated percentage of cells from Pool A (e.g. for a sample with 160 × 10³ cells and a perturbation degree of 35%, we added 160 × 10³ × 0.35 = 56 × 10³ cells from Pool A). The number of replicates for each sample is indicated in the circles next to the tube icons.
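The sample composition described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' code; the function name `perturbed_sample` and its return keys are hypothetical and chosen only for illustration.

```python
def perturbed_sample(mix_cells: int, perturbation: float) -> dict:
    """Cell numbers for one perturbed sample.

    mix_cells    -- cells taken from the AB mix (e.g. 160_000)
    perturbation -- perturbation degree as a fraction (e.g. 0.35 for 35%)
    Names and return structure are illustrative assumptions.
    """
    # Pool A cells added on top of the AB mix aliquot
    added_pool_a = round(mix_cells * perturbation)
    return {
        "from_ab_mix": mix_cells,
        "added_pool_a": added_pool_a,
        "total": mix_cells + added_pool_a,
    }


# Worked example from the legend: 160 x 10^3 cells, 35% perturbation
example = perturbed_sample(160_000, 0.35)
print(example["added_pool_a"])  # 56000 cells added from Pool A
```

For the example in the legend this gives 56 × 10³ added Pool A cells, so the perturbed sample contains 216 × 10³ cells in total.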

Barcode representation fold changes (log₂) in the null samples of the indicated sizes (number of cells sampled from the AB mix), relative to the mean of the two Null‐660 replicates. Barcodes are ordered according to their size in the Null‐660 samples: Pool A barcodes are sorted in decreasing order, and Pool B barcodes in increasing order. Boxes represent interquartile ranges for each group of 53 observations. Whiskers indicate upper and lower quartiles. The central line corresponds to the median value.

Same data as in (C) but for the perturbed samples. Dotted lines indicate the expected barcode fold changes, calculated for the Pool A barcodes using the formula (cells from Pool A/total number of cells)/0.5, and analogously for the Pool B barcodes. Data representation is the same as in (C).
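As an illustration of the formula for the dotted lines, the sketch below computes the expected log₂ fold changes for both pools. It assumes that the AB mix is exactly 50/50, that "cells from Pool A" means the Pool A half of the AB mix aliquot plus the added Pool A cells, and that all barcodes within a pool shift by the same factor. The function name `expected_log2_fc` is hypothetical.

```python
import math


def expected_log2_fc(mix_cells: float, perturbation: float) -> tuple:
    """Expected log2 fold change for Pool A and Pool B barcodes.

    Assumes an exact 50/50 AB mix (an idealisation of the counting step).
    """
    added = mix_cells * perturbation      # extra Pool A cells
    total = mix_cells + added             # total cells in the perturbed sample
    pool_a = 0.5 * mix_cells + added      # Pool A cells: half the mix plus the addition
    pool_b = 0.5 * mix_cells              # Pool B cells: half the mix, unchanged
    # (cells from pool / total number of cells) / 0.5, as in the legend
    fc_a = (pool_a / total) / 0.5
    fc_b = (pool_b / total) / 0.5
    return math.log2(fc_a), math.log2(fc_b)


log2_a, log2_b = expected_log2_fc(160_000, 0.35)
print(f"Pool A: {log2_a:+.2f}, Pool B: {log2_b:+.2f}")
```

Under these assumptions the result depends only on the perturbation degree and not on the number of cells taken from the AB mix: Pool A barcodes are expected to increase and Pool B barcodes to decrease by a fixed factor for a given percentage.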

Figure 1. An overview of the experimental setup for the benchmark dataset generation.