Table 1.
Comparison of the P. euphratica unigene collection with other sequence collections from whole genomes or EST projects
Sequence collection | Matches | Unique |
All | 7,841 | |
Populus genome | 7,671 | 763 | |
Arabidopsis genome | 5,434 | 2 | |
Rice genome | 1,562 | 0 | |
Populus EST sequence | 5,780 | 5 | |
Rosid EST sequence | 4,597 | 1 | |
Asterid EST sequence | 3,490 | 4 | |
Caryophyllid EST sequence | 2,081 | 0 | |
Monocot sequence | 2,135 | 3 | |
GenBank sequence | 5,495 | 0 | |
Short sequences | 275 | 20 | |
Low protein coding potential | 728 | 28 | |
Remainder | 54 |
All P. euphratica unigenes were compared against reference sequence collections to investigate sequence overlap and to identify the number of sequences unique to this sequence collection. The reference sequence collections include the draft Populus genome, the Arabidopsis thaliana genome, the rice genomes and pooled openSputnik EST collections from species taxonomically assigned to the rosid, asterid, caryophyllid and monocot plant groups. Also included in the reference sets are sequences that match an annotated protein in the UniProt database, and P. euphratica sequences that are either short (fewer than 100 nucleotides) or have a low protein coding potential (less than 25% protein coding). For each reference sequence collection, the table shows the number of P. euphratica sequences that can be matched to that collection and the number of sequences that are unique to that collection. All BLAST analyses were performed using an arbitrary expectation value threshold of 1e-10. The remainder (54) is the number of sequences that have no match within any of the reference datasets and may thus represent P. euphratica-specific genes.
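The counting described in the legend can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `tally_matches`, the tuple layout of the hits, and the toy data are all hypothetical. It also assumes that "unique" means a unigene whose only match at or below the 1e-10 threshold falls in a single reference collection, which is one reading of the legend.

```python
from collections import Counter, defaultdict

E_VALUE_CUTOFF = 1e-10  # expectation value threshold stated in the legend


def tally_matches(unigene_ids, hits, cutoff=E_VALUE_CUTOFF):
    """Return (matches, unique, remainder) for a set of unigenes.

    hits is an iterable of (unigene_id, collection_name, e_value) tuples;
    this layout is an assumption made for the sketch.
    """
    # Collect, per unigene, the reference collections it matches.
    collections_by_unigene = defaultdict(set)
    for unigene_id, collection, e_value in hits:
        if e_value <= cutoff:
            collections_by_unigene[unigene_id].add(collection)

    matches = Counter()  # unigenes matching each collection
    unique = Counter()   # unigenes matching only that collection
    remainder = 0        # unigenes with no match anywhere
    for unigene_id in unigene_ids:
        matched = collections_by_unigene.get(unigene_id, set())
        if not matched:
            remainder += 1
            continue
        for collection in matched:
            matches[collection] += 1
        if len(matched) == 1:
            unique[next(iter(matched))] += 1
    return matches, unique, remainder


# Toy, made-up data: four unigenes and a handful of hits.
toy_unigenes = ["u1", "u2", "u3", "u4"]
toy_hits = [
    ("u1", "Populus genome", 1e-50),
    ("u1", "Arabidopsis genome", 1e-20),
    ("u2", "Populus genome", 1e-12),
    ("u3", "Rice genome", 1e-5),  # weaker than the cutoff, so ignored
]
matches, unique, remainder = tally_matches(toy_unigenes, toy_hits)
```

In this toy run, two unigenes match the Populus genome, one of them matches nothing else, and two unigenes fall into the remainder. The "short" and "low protein coding potential" rows are properties of the sequences rather than search hits; in a sketch like this they could be passed in as additional pseudo-collections.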