Loading…
Tuesday, July 28 • 18:45 - 18:50
Efficient dynamic associative dictionary for large k-mer sets

Log in to save this to your schedule, view media, leave feedback and see who's attending!



Motivation:
Since BLAST introduced the seed and extend paradigm, indexing fixed length words (k-mers) from a set of sequences is the bread and butter of most algorithms and methods relying on sequence similarity. Due to the ever-increasing amount of available reference genomes, there is a growing interest in global approaches able to take into account a very broad sequence range. Ambitious applications such as pangenomics or metagenomics require to index billions of distinct k-mers and would benefit from incorporating as many reference genomes as possible. Recently the problem of representing huge k-mer sets with a low memory usage and a high throughput caught the interest of the community. In the last few years, several efficient methods (Pufferfish[1], Bifrost[2], BLight[3], REINDEER[4], Kallisto[5], Jellyfish[6], SRC[7]) were proposed with various applications: k-mer counting, quantification, assembly, ...
Some implementations are specific to their main application, others are generic libraries that can fit various purposes. Jellyfish indexes k-mers using an efficient lock-free dynamic hash table scheme to enable fast k-mer counting. Such a scheme needs to store each k-mer in memory, which represents a memory cost of several bytes per k-mer (4 bytes for 31-mers). Probabilistic dictionaries[7] can use less than 2 bytes per k-mer at the cost of a low false positive rate. Recent improvements provided efficient deterministic k-mer set representations, exploiting nucleotide redundancy in k-mer sets to lower the memory cost[1], and k-mer partitioning to further reduce the storage cost and raise cache coherency [3]. However, the efficiency of some of those methods rely on their static aspect. Large construction or update costs make them unfit to some applications where insertions or deletion are required. For instance, rapid acquisition of new data for microbial pangenomes could benefit from dynamic structures. Large scale dynamic de Bruijn graphs [2,8] are another possible application that is gaining traction.

Results:
We present BRISK (Brisk Reduced Index for sequence of k-mers) a resource-efficient dynamic dictionary able to associate value to k-mers without false positives. It relies on three main ideas.
First, instead of storing k-mers independently, we store super-k-mers, a sequence of k-mers that share the same minimizer, to reduce the amount of nucleotides required to encode overlapping k-mers. We partition super-k-mers according to their minimizers, which allows us to work on smaller structures, and improves cache coherence. Second, we represent a partition as a sorted list of super-k-mers to ensure fast retrieval of k-mers. Lastly, we use less nucleotides by encoding only the suffix and the prefix of a super k-mer without its minimizer itself. In practice using this scheme we are able to encode on average[9] eight 31-mers into a single super-k-mer that can fit on a 64 bits integer.The larger the minimizer size, the faster the queries but also the larger the space overhead. That means that queries can be adapted for different space/time tradeoffs. Furthermore, index usage is highly cache coherent as querying several k-mers sharing the same minimizer only requires one random memory access. For instance, we are able to index 3.2 billion k-mers within 2 minutes and less than 10 GB of RAM.
[1]Almodaresi, F., Sarkar, H., Srivastava, A. and Patro, R., 2018. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34(13), pp.i169-i177.
[2]Holley, G. and Melsted, P., 2019. Bifrost–Highly parallel construction and indexing of colored and compacted de Bruijn graphs. BioRxiv, p.695338.
[3]Marchet, C., Kerbiriou, M. and Limasset, A., 2019, April. Indexing De Bruijn graphs with minimizers. In Recomb seq.
[4]Marchet, C., Iqbal, Z., Gautheret, D., Salson, M. and Chikhi, R., 2020. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. ISMB.
[5]Bray, N.L., Pimentel, H., Melsted, P. and Pachter, L., 2016. Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), pp.525-527.
[6]Marçais, G. and Kingsford, C., 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6), pp.764-770.
[7]Marchet, C., Lecompte, L., Limasset, A., Bittner, L. and Peterlongo, P., 2020. A resource-frugal probabilistic dictionary and applications in bioinformatics. Discrete Applied Mathematics, 274, pp.92-102.
[8]Crawford, V. G., Kuhnle, A., Boucher, C., Chikhi, R., & Gagie, T. (2018). Practical dynamic de Bruijn graphs. Bioinformatics, 34(24), 4189-4195.
[9]Baharav, T.Z., Kamath, G.M., David, N.T. and Shomorony, I., 2020, May. Spectral Jaccard Similarity: A new approach to estimating pairwise sequence alignments. In International Conference on Research in Computational Molecular Biology (pp. 223-225). Springer, Cham.



Speakers
avatar for Yoann Dufresne

Yoann Dufresne

Institut Pasteur


Brisk pdf

Tuesday July 28, 2020 18:45 - 18:50 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09