Tuesday, July 28 • 12:40 - 12:45
Genome assembly of heterozygous tropical trees - will the real (pan)genome stand up?

High-throughput sequencing has the potential to greatly enhance population genetics studies through its power to genotype individuals at many loci. Although methods exist for obtaining genotypes and polymorphism data without a reference genome, a reference sequence for at least the single-copy regions of the genome helps when comparing diversity parameters across species. Because individual genomes of tropical trees are highly heterozygous, genome assembly programmes struggle to deliver an acceptable reference sequence even for the non-repetitive part of the genome. Platanus, Platanus-allee, SPAdes and Meraculous were compared for their genome assembly capabilities. None of the programmes delivered a usable reference genome from the available data: approximately 12-25x genome coverage of tropical tree species, sequenced as paired-end Illumina reads of 101 or 150 nucleotides.
K-mer frequency analysis at multiple k-mer lengths shows the genomes to be highly heterozygous. The k-mer length at which the peak frequency of homozygous k-mers equals the peak frequency of heterozygous k-mers is proposed as a reliable measure for comparing the level of polymorphism across heterozygous species. This measure fails, however, when heterozygosity is lower and a k-mer longer than the read length would be needed to reach equal peak frequencies.
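As an illustration of the k-mer frequency analysis described above, the following is a minimal Python sketch, not the tooling used in this work. It counts canonical k-mers in a set of reads and summarises them as a spectrum (multiplicity versus number of distinct k-mers). The function names `canonical`, `count_kmers` and `kmer_spectrum` are hypothetical, and real data sets would normally call for a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter

# Translation table for reverse-complementing DNA (assumes upper-case ACGT).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(reads, k):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(count_kmers(reads, k).values())


if __name__ == "__main__":
    toy_reads = ["ACGTA", "ACGTA"]  # toy input, not real sequence data
    for multiplicity, n_kmers in sorted(kmer_spectrum(toy_reads, 3).items()):
        print(multiplicity, n_kmers)
```

In a heterozygous diploid, such a spectrum would be expected to show one peak near the full sequencing depth (k-mers shared by both haplotypes) and one near half that depth (haplotype-specific k-mers). Recomputing the spectrum at several values of `k` is how the relative heights of the two peaks could be compared.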
By “haplotype-specific k-mer walking”, contigs longer than 10,000 bp could be reconstructed for the two haplotypes at a number of loci in several species (Xylia xylocarpa, Gluta usitata, Dipterocarpus tuberculatus). The two haplotypes in a single individual generally differ by about 1 polymorphism per 100 bp: more in the intergenic regions, an intermediate number in the introns and fewer in the exons. About 70% of polymorphisms are simple SNPs, about 5% are insertions or deletions of 1 to several hundred nucleotides, and the rest are variations in repeat number, mostly of mononucleotide repeats. A sufficient number of read pairs contain two polymorphisms to phase most polymorphic sites. As the sequencing reads are derived from fragments with a target insert size of 500 bp, the phasing is broken when polymorphisms are separated by more than 500 bp, but the sequences still connect. The walking process needs to be further automated and parallelized to have any chance of building a more or less complete view of the single-copy region of the genome.
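The abstract does not spell out the walking procedure, so the following Python sketch is only one plausible reading of it, not the author's implementation. It assumes a greedy extension: starting from a seed k-mer, append the next base as long as exactly one extension is supported by the read data, and stop where the path ends or branches (as it would at a heterozygous site). The names `count_kmers_stranded` and `walk_right`, the `min_count` threshold and the use of strand-specific (non-canonical) k-mers are all assumptions made for brevity.

```python
from collections import Counter


def count_kmers_stranded(reads, k):
    """Count k-mers exactly as they appear in the reads (no reverse complement)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def walk_right(seed, counts, k, min_count=2, max_len=10000):
    """Greedily extend `seed` to the right, one base at a time.

    Returns the contig and the list of candidate next k-mers at the stop point:
    an empty list means a dead end, two or more entries mean a branch.
    """
    if k < 2 or len(seed) < k:
        raise ValueError("need k >= 2 and a seed at least k bases long")
    contig = seed
    seen = {seed[-k:]}
    candidates = []
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        candidates = [suffix + base for base in "ACGT"
                      if counts.get(suffix + base, 0) >= min_count]
        if len(candidates) != 1:
            break  # dead end or branch point
        next_kmer = candidates[0]
        if next_kmer in seen:
            break  # avoid looping forever inside a repeat
        seen.add(next_kmer)
        contig += next_kmer[-1]
    return contig, candidates


if __name__ == "__main__":
    # Two toy haplotypes differing by one SNP, each "sequenced" twice.
    hap_a, hap_b = "ACGTTGCAAT", "ACGTTGGAAT"
    counts = count_kmers_stranded([hap_a, hap_a, hap_b, hap_b], 4)
    print(walk_right("ACGT", counts, 4))  # stops at the SNP with two candidates
    print(walk_right("TTGC", counts, 4))  # seeded on a k-mer unique to hap_a
```

Seeding the walk on a k-mer that occurs in only one haplotype keeps the extension on that haplotype until the next polymorphism. Choosing the correct branch there would need phasing information such as the read pairs mentioned above, which this sketch does not use.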
Data from moderate- to low-coverage Illumina genome sequencing contain sufficient information for the assembly of long contigs representing the two haplotypes derived from the two genomes in heterozygous individuals.

Hugo Volkaert

Principal investigator, Center for Agricultural Biotechnology, Kamphaeng Saen Campus, Nakhon Pathom 73140, Thailand
I am a forest ecologist and population geneticist hoping to use DNA sequence data to study the evolution and adaptation of tropical trees to their environment. Currently trying to assemble tree genomes on a shoestring budget, using low-coverage shotgun sequencing of single libraries...

Tuesday July 28, 2020 12:40 - 12:45 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09