Table 3. Data dependencies required to successfully run each component of the McClintock pipeline.
Data dependency | ngs_te_mapper | RelocaTE | TEMP | RetroSeq | PoPoolationTE | TE-locate |
---|---|---|---|---|---|---|
Reference genome (fasta) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Canonical TE sequences (fasta) | ✓ | ✓a | ✓ | ✓b | ✓ | |
Annotation of reference TEs (GFF) | ✓ | ✓ | ||||
Annotation of reference TEs (BED) | ✓ | ✓c | ||||
Annotation of reference TEs (custom format) | ✓ | |||||
Unaligned reads (single-end fastq) | ✓ | ✓ | ||||
Unaligned reads (paired-end fastq) | ✓ | |||||
Aligned reads (BAM) | ✓ | ✓ | ||||
Aligned reads (lexically sorted SAM) | ✓ | |||||
TE hierarchy (custom format) | ✓ | ✓ |
a. Must include an entry in the format “TSD=…” for each TE in the file, on the same line as the header, where “…” is the TSD sequence if known, or a string of periods equal in length to the TSD if the TSD sequence is unknown. If neither the length nor the sequence of the TSD is known, “TSD=UNK” can be supplied.
b. Must be formatted as one fasta file per TE family and a file of files listing their locations.
c. Must be one BED file for each entry in the reference TE annotation and a file of files listing their locations.
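
The header convention described in footnote a can be checked before running the pipeline. The following is a minimal sketch, not part of McClintock or RelocaTE: the function name `headers_missing_tsd`, the accepted nucleotide alphabet, and the example TSD values are all assumptions made for illustration.

```python
import re

# Assumed rule: a valid entry is "TSD=" followed by a nucleotide string (known TSD),
# a run of periods (length known, sequence unknown), or the literal UNK.
TSD_PATTERN = re.compile(r"\bTSD=([ACGTNacgtn]+|\.+|UNK)(?=\s|$)")


def headers_missing_tsd(fasta_text):
    """Return the fasta header lines that lack a valid TSD= entry."""
    return [
        line
        for line in fasta_text.splitlines()
        if line.startswith(">") and not TSD_PATTERN.search(line)
    ]


# Placeholder records: the TSD values here are illustrative, not curated data.
example = """>copia TSD=GTTCT
ACGT
>roo TSD=.....
ACGT
>jockey TSD=UNK
ACGT
>gypsy
ACGT
"""

print(headers_missing_tsd(example))  # ['>gypsy']
```

Any header reported by this check would need a “TSD=…” entry added before the file is passed to RelocaTE.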