Bioinformatics: from Algorithms to Applications 2020: Full Schedule

13:45 MSK

Opening notes

Speakers

Anton Korobeynikov

Associate Professor, Center for Algorithmic Biotechnology, Saint Petersburg State University, 6 linia V.O., 11/21d, 1990034 St Petersburg, Russia

Alla Lapidus

Professor, Center for Algorithmic Biotechnology, Saint Petersburg State University, 6 linia V.O., 11/21d, 1990034 St Petersburg, Russia

BIATA Formalities pdf

Monday July 27, 2020 13:45 - 14:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Opening / closing

14:00 MSK

MGnify: Introduction to course organisation

Speakers

Alexandre Almeida

Postdoctoral Fellow (ESPOD), EMBL-EBI

I am an EBI-Sanger Postdoctoral Fellow focusing on the study of the human gut microbiome using genome-resolved metagenomics. My main research interest is understanding the role of the large uncultured diversity of the gut microbiome in human health and disease.

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Dr Rob Finn leads EMBL-EBI’s Microbiome Informatics team, which is responsible for the MGnify resource, which provides access to the metagenomics, metatranscriptomics and assembly analysis services. The functional and taxonomic profiles of these datasets, once made public, can be...

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

Monday July 27, 2020 14:00 - 14:15 MSK
Zoom Mgnify https://zoom.us/j/93441398259?pwd=ZVRiWWl5ZWFpNlVQZUhVcDB0aTBndz09

MGnify Workshop, Lecture

14:15 MSK

MGnify: services offered (Part 1)

Speakers

Alexandre Almeida

Postdoctoral Fellow (ESPOD), EMBL-EBI

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

Monday July 27, 2020 14:15 - 15:15 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

MGnify Workshop, Lecture

15:30 MSK

MGnify: assembly pipeline and website (Part 2)

Speakers

Alexandre Almeida

Postdoctoral Fellow (ESPOD), EMBL-EBI

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

Monday July 27, 2020 15:30 - 16:30 MSK
Zoom Mgnify https://zoom.us/j/93441398259?pwd=ZVRiWWl5ZWFpNlVQZUhVcDB0aTBndz09

MGnify Workshop, Lecture

17:00 MSK

Indexing large and numerous sequencing datasets

Genomic analyses often rely on sequence comparisons. The exponential growth of sequencing data repositories has prompted the development of ever-faster algorithms for sequence search: starting from the Smith-Waterman algorithm for pairwise alignment [1], then BLAST-like approaches for searching in sequence databases [2], and more recent breakthroughs in database indexing strategies (e.g. DIAMOND [3] or BIGSI [4]). But the recent data deluge means that even these latest tools cannot be used to screen across the full set of sequencing experiments available today.
In this talk, I focus on the problem of querying large unassembled raw sequencing datasets on the fly, for instance with the goal of searching for a sequence of interest in all publicly available metagenomes. I will give an overview of current methods dedicated to indexing large and numerous genomic datasets. These methods are mainly based on indexing k-mers, words of length k.
Finally, I will focus on a novel strategy for constructing a Bloom-filter-based data structure, HowDe-SBT [5], one of the state-of-the-art index data structures. I will present the algorithmic foundations and the current results.
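
To make the idea of k-mer indexing with Bloom filters concrete, here is a minimal sketch. It keeps one flat Bloom filter per dataset and reports the datasets in which a chosen fraction of the query k-mers is found. It is not the HowDe-SBT structure, which organizes filters in a tree. The class, function names, filter size and the threshold `theta` are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping word of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Bit array probed at several hashed positions per item (toy sizes)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes positions by prefixing the item with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_index(datasets, k):
    """datasets: dict mapping a dataset name to a list of read strings."""
    index = {}
    for name, reads in datasets.items():
        bf = BloomFilter()
        for read in reads:
            for km in kmers(read, k):
                bf.add(km)
        index[name] = bf
    return index


def query(index, seq, k, theta=0.8):
    """Names of datasets containing at least theta of the query k-mers."""
    words = list(kmers(seq, k))
    hits = []
    for name, bf in index.items():
        found = sum(km in bf for km in words)
        if words and found / len(words) >= theta:
            hits.append(name)
    return hits
```

Each dataset costs a fixed number of bits regardless of its size, at the price of occasional false positives. That trade-off is what makes this family of indexes attractive for very large collections.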

[1] Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology.
[2] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology.
[3] Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods.
[4] Bradley, P., Den Bakker, H. C., Rocha, E. P., McVean, G., & Iqbal, Z. (2019). Ultrafast search of all deposited bacterial and viral genomic data. Nature Biotechnology.
[5] Harris, R. S., & Medvedev, P. (2019). Improved representation of sequence Bloom trees. Bioinformatics.

Speakers

Pierre Peterlongo

Research scientist, Inria Rennes Bretagne Atlantique, GenScale team

Monday July 27, 2020 17:00 - 17:15 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Keynotes

17:00 MSK

Q & A: keynotes

Speakers

Pierre Peterlongo

Research scientist, Inria Rennes Bretagne Atlantique, GenScale team

Robert Fulton

Director of Technical Development, McDonnell Genome Institute

I'm the Director of Technology Development and have >25 years of Genomics experience. I'm happy to discuss a broad range of wet lab operations associated with genomics.

Tatiana Tatarinova

Fletcher Jones Endowed Chair in Computational Biology, University of LaVerne

Professor of Computational biology moonlighting as a rock musician https://soundcloud.com/tatiana-tatarinova-378061263/zdes-314-zdes

Inna Dubchak

Affiliate, Lawrence Berkeley National Laboratory

Terry Gaasterland

Professor of Computational Biology and Genomics; Director, Bioinformatics & Systems Biology Program, University of California, San Diego

Trained originally as a computer scientist, I transitioned into Computational Biology as an application area for logic-based data and query integrity-checking methods (used in my early, purely CS work for Cooperative Query Answering). I quickly became fascinated with the idea of...

Stephen Nayfach

Research Scientist, Joint Genome Institute

Monday July 27, 2020 17:00 - 19:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Keynotes, Keynotes

17:15 MSK

A scalable prokaryotic taxonomy in the age of big data

The great majority of microorganisms have yet to be cultured and characterised. This so-called “microbial dark matter” is now being revealed at an ever-increasing rate by sequence-based, culture-independent methods. In the past few years, thousands of near-complete genomes of uncultured microbes have been assembled from sequence data obtained directly from environmental and clinical sources, providing the opportunity to fully articulate microbial diversity for the first time. Current estimates suggest that cultured microorganisms only capture ~15% of total microbial diversity based on evolutionary divergence of marker genes. We propose a genome-based taxonomy founded on the existing classification of cultured organisms, but corrected for polyphyletic groups and calibrated to take into account relative evolutionary divergence. The result is a fully systematized classification of Bacteria in an evolutionary framework. Of ~100,000 publicly available bacterial genomes, over half required one or more changes to their existing taxonomy. These include extensive changes at both high ranks, such as amalgamation of the Candidate Phyla Radiation into one phylum, and low ranks, including subdivision of the genus Clostridium into more than 100 distinct genera.
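
The abstract does not say how relative evolutionary divergence is computed. As an illustration only, the sketch below assumes one formulation from the literature: the root gets 0, and every other node gets p + (d/u)(1 - p), where p is the parent's value, d is the branch length to the parent, and u is the mean distance from the parent to the leaves below the node. The `Node` class, the function names and the requirement of unique node names and positive branch lengths are assumptions of this sketch.

```python
class Node:
    """Hypothetical rooted-tree node; length is the branch to the parent."""

    def __init__(self, name, length=0.0, children=()):
        self.name = name
        self.length = length
        self.children = list(children)


def leaf_distances(node):
    """Distances from node to each leaf below it (0.0 for a leaf itself)."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in leaf_distances(c)]


def relative_divergence(root):
    """Map node name -> divergence value, assuming positive branch lengths."""
    red = {root.name: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        p = red[parent.name]
        for child in parent.children:
            # u: mean distance from the parent to the leaves under child.
            dists = [child.length + d for d in leaf_distances(child)]
            u = sum(dists) / len(dists)
            red[child.name] = p + (child.length / u) * (1.0 - p)
            stack.append(child)
    return red
```

Under this formulation leaves always end at 1, so internal nodes can be compared on a common 0 to 1 scale even when lineages evolve at different rates. That is the kind of calibration a rank-normalized taxonomy needs.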

Speakers

Tatiana Tatarinova

Fletcher Jones Endowed Chair in Computational Biology, University of LaVerne

Professor of Computational biology moonlighting as a rock musician https://soundcloud.com/tatiana-tatarinova-378061263/zdes-314-zdes

Local ancestry pptx

Monday July 27, 2020 18:30 - 18:45 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Keynotes

19:00 MSK

MGnify: Hands-on review (parts 1-2)

Speakers

Alexandre Almeida

Postdoctoral Fellow (ESPOD), EMBL-EBI

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

Monday July 27, 2020 19:00 - 20:00 MSK
Zoom Mgnify https://zoom.us/j/93441398259?pwd=ZVRiWWl5ZWFpNlVQZUhVcDB0aTBndz09

MGnify Workshop, Hands-on review

11:00 MSK

Installing and Searching BLAST Databases in a Data Science Framework

Data science embodies a pipeline of processes: acquisition, cleaning and organization of data, quality control and assurance, validation, and downstream visualization and analytics. Because of the overwhelming number of tools for each of these steps, the greatest challenge is often making those tools work in concert to facilitate a thorough and insightful analysis.
The BIRCH system (http://home.cc.umanitoba.ca/~psgendb/) is a framework consisting of hundreds of bioinformatics tools, unified through the BioLegato family of programmable graphical applications. Each BioLegato application represents a specific class of biological objects, packaging together the data and the methods for each class of objects. We describe BioLegato applications for BLAST searches, implementing data science principles. For example, in blncbi the user retrieves sequences from NCBI using a graphical Entrez query builder. Amino acid sequences matching the query pop up in blprotein, a BioLegato application that displays proteins and lets the user run protein-specific tasks. A protein can be selected for a BLAST search, and output will appear in blpfetch, a BioLegato spreadsheet object for protein hits. The blpfetch spreadsheet makes it easy to scan hundreds of hits, refining the list into one or more subsets for retrieval. Sequences are retrieved to a new blprotein object for downstream analysis. Because each object is a separate window with a small screen footprint, the user has more of a sense of working directly with the data than in typical web interfaces.
BioLegato gives the user flexibility at all steps in a pipeline. Because output of each step appears in a new BioLegato object, there are no dead ends. Output from one step can be used directly as input for subsequent steps because BioLegato takes care of things like file format conversion, which is a tedious and sometimes error-prone part of using tools at the command line. We call this process ad hoc pipelining. Ad hoc pipelining enables the user to learn from each step before going to the next. We also describe blastdbkit, a Python script run from BioLegato, for downloading and managing BLAST databases on the user's computer.
Together, these tools provide an integrated point-and-click pipeline for sequence database searches, within the context of the larger BIRCH system. New programs can be added to any BioLegato application by creating a file using BioLegato's PCD language, which specifies parameters to be set and a shell command to run the program. In this way, the core BIRCH functions can be integrated seamlessly with locally installed bioinformatics software.
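
For readers working at the command line, the step of refining a list of hits into a subset can be sketched in a few lines. This is not BioLegato or blastdbkit code. It only assumes BLAST's standard 12-column tabular output (`-outfmt 6`), and the function names and default thresholds are illustrative.

```python
import csv

# Column order of BLAST's default tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_hits(handle):
    """Parse tab-separated BLAST hits from an open text handle."""
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        row["pident"] = float(row["pident"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        hits.append(row)
    return hits


def select_hits(hits, max_evalue=1e-5, min_identity=30.0):
    """Keep hits passing both thresholds, best bit score first."""
    kept = [h for h in hits
            if h["evalue"] <= max_evalue and h["pident"] >= min_identity]
    return sorted(kept, key=lambda h: h["bitscore"], reverse=True)
```

The `sseqid` values of the selected hits are what a retrieval step would then fetch for downstream analysis.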

Posters

Brian Fristensky

Associate Professor, University of Manitoba

RESEARCH: Phylogenomics of plant-pathogen interactions; Development of bioinformatics software. TEACHING: Cytogenetics; Plant Biotechnology; Bioinformatics.

FristenskyPoster47v5 BIRCH Bioinformatics pdf

Tuesday July 28, 2020 11:00 - 11:05 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

11:00 MSK

MGnify: Hands-on review (parts 1-2)

Speakers

Alexandre Almeida

Postdoctoral Fellow (ESPOD), EMBL-EBI

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

Tuesday July 28, 2020 11:00 - 12:00 MSK
Zoom Mgnify https://zoom.us/j/93441398259?pwd=ZVRiWWl5ZWFpNlVQZUhVcDB0aTBndz09

MGnify Workshop, Hands-on review

11:00 MSK

Q & A: posters

Tuesday July 28, 2020 11:00 - 14:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

11:05 MSK

An approach to analyzing the secondary structure of sequences by combining formal grammars and neural networks was proposed recently. In this work, we investigate the applicability of this approach to RNA secondary structure prediction. We show that it is possible to use residual networks to correct secondary structure features extracted by context-free grammars.
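
One way to picture grammar-extracted features is as a matrix marking which substrings can be derived from a "stem" rule such as S -> a S u | g S c | ... | loop. The sketch below builds such a matrix with a small dynamic program. The pairing rules, the minimum loop length and the function names are assumptions for illustration. The grammar and encoding used in the poster are not given here, and the neural network stage is not shown.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_heights(seq, min_loop=3):
    """h[i][j] = number of nested pairs stacked inward from (i, j)."""
    n = len(seq)
    h = [[0] * n for _ in range(n)]
    # A pair (i, j) needs at least min_loop unpaired bases between its ends.
    for span in range(min_loop + 2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            if (seq[i], seq[j]) in PAIRS:
                h[i][j] = 1 + h[i + 1][j - 1]
    return h


def stem_matrix(seq, min_height=3, min_loop=3):
    """Binary matrix: 1 where seq[i..j] is a stem of height >= min_height."""
    h = stem_heights(seq, min_loop)
    return [[1 if value >= min_height else 0 for value in row] for row in h]
```

A matrix like this can be treated as an image, which is presumably why a convolutional or residual network is a natural choice for refining it.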

Posters

Polina Lunina

SPbSU

main Polina Lunina pdf

Tuesday July 28, 2020 11:25 - 11:30 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

11:30 MSK

iJump: a fast tool for tracking bacterial mobile elements rearrangements in course of adaptive laboratory evolution

Mobile element rearrangements in bacteria may lead to gene inactivation or deregulation, providing an important contribution to adaptation. While the challenge of mapping these rearrangements has been addressed for individual genomes, no efficient tools are available for tracking their dynamics in evolving populations, such as in adaptive laboratory evolution (ALE).
We use ALE in a custom-engineered continuous culture device (morbidostat) to study the dynamics and mechanisms of antibiotic resistance in major Gram-negative bacterial pathogens. Acquisition of mutations in evolving populations is monitored by deep sequencing of time-series samples.
To observe evolutionary paths driven by “jumping” of IS elements, we have developed the iJump software, which uses soft-clipped reads from the SAM/BAM alignment, extracted from the boundaries of known mobile elements, to find new junctions and estimate their frequencies. The performance of iJump was first tested on a simulated data set, where it showed 1-4% error in frequency estimation. Application of the iJump tool to our ALE studies with Escherichia coli, Acinetobacter baumannii and Pseudomonas aeruginosa confirmed its practical utility and revealed IS-driven bacterial adaptations to known antibiotics and novel drug candidates. The results were verified by Nanopore-based sequencing and MIC determination of selected individual clones. Software available at https://github.com/sleyn/ijump
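
As a rough illustration of what using soft-clipped reads involves, the sketch below reads the CIGAR string of a SAM record and reports the reference coordinate at which the aligned part of the read ends and the clipped part begins. It relies only on the SAM conventions for POS and CIGAR. It is not taken from the iJump source, and the function names are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def clipped_junctions(pos, cigar):
    """Return (side, reference_position, clip_length) for each soft clip.

    pos is the 1-based leftmost mapped position from the SAM POS field.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [o for o in ops if o[1] != "H"]  # hard clips carry no sequence
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    junctions = []
    if ops and ops[0][1] == "S":
        junctions.append(("left", pos, ops[0][0]))
    if len(ops) > 1 and ops[-1][1] == "S":
        junctions.append(("right", pos + ref_len - 1, ops[-1][0]))
    return junctions


def count_junctions(alignments):
    """Tally clip positions over an iterable of (pos, cigar) pairs."""
    counts = Counter()
    for pos, cigar in alignments:
        for side, ref_pos, _ in clipped_junctions(pos, cigar):
            counts[(side, ref_pos)] += 1
    return counts
```

Dividing such a tally by the local read depth gives a crude frequency for a candidate junction. The estimator iJump actually uses may differ.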

Posters

Semen Leyn

Postdoctoral Associate, Sanford Burnham Prebys Medical Discovery Institute

Samuel Muthemba

Lab Technician, ICRAF

BIATA2020 Poster Muthemba ICRAF SAMUEL MUTHEMBA pdf

Tuesday July 28, 2020 11:45 - 11:50 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

11:50 MSK

Genome assembly and search for genetic markers of adaptation of shore flies (Diptera, Ephydridae) to extreme habitats

Dipteran insects are characterized by huge taxonomic diversity (more than 160 thousand species as of 2013 (Zhang, 2013)), a diversity of ecological niches, and adaptation to various extreme environmental conditions. At the same time, the mechanisms of these adaptations are poorly studied, especially at the genomic level. The NCBI Genome database currently contains only 165 dipteran genomes (for comparison, more than 1000 genomes have been sequenced for 40 thousand vertebrate species). Different species of the large family Ephydridae (shore flies), which comprises about 2000 species, are adapted to the most adverse environmental conditions, from saline and alkaline impoundments and hot springs to oil puddles (Kadavy et al., 2020), but the genomes of only two species of this curious family, Ephydra gracilis and E. hians (syn. Cirrula hians), have been sequenced. The genome of a third fly from the family Ephydridae, E. riparia, was sequenced at the Department of Biological Evolution at Lomonosov Moscow State University.
In this study we assembled the E. riparia genome de novo, assessed the assembly quality and compared it with the assemblies of the two related species.
Raw reads were filtered with Trimmomatic and cleaned of contamination with Kraken2. The E. riparia genome was assembled with SPAdes, SOAPdenovo and Platanus; the best assembly was produced by SPAdes. Quality was assessed with QUAST and BUSCO. We performed structural genome annotation and have moved on to functional annotation. See more details in our GitHub repository: https://github.com/Terraslavonica/E_riparia.
The E. riparia genome, about 600 Mbp, was assembled into scaffolds with an average coverage of 11.2x, an N50 of 3.5 kbp and an L50 of 53K contigs. BUSCO detected 53% of the genes typical for Diptera in the assembly. Compared with the assemblies of its close relatives, E. gracilis (410 Mbp, cov. 9.4x, N50 2.1K, L50 46K, BUSCO 81%) and E. hians (399 Mbp, cov. 27.0x, N50 1.8K, L50 53K, BUSCO 37%), we conclude that the E. riparia assembly is good enough for further analyses.
Based on these three assemblies, we are carrying out functional annotation of the genomes and searching for genes that may contribute to adaptation to extreme habitats and stressful conditions. Candidates include genes encoding LEA proteins, heat shock proteins, aquaporins, metal-transport proteins, and proteins that are part of the p38 and JNK MAPK signalling cascades, among others (Craig et al., 2004; Xu et al., 2013; Davies et al., 2014; Benoit et al., 2014; Reidl et al., 2016; Huang et al., 2016; Muthusamy et al., 2017; Pawłowicz, Masajada, 2019; Das et al., 2020).
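
The N50 and L50 figures quoted for the three assemblies follow the usual definitions: sort contigs from longest to shortest and accumulate lengths until half of the total assembly size is reached. N50 is the length of the contig at which that happens, and L50 is the number of contigs needed. A minimal sketch, with a hypothetical function name:

```python
def n50_l50(lengths):
    """Return (N50, L50) for an iterable of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no contigs given")
```

A higher N50 together with a lower L50 indicates a more contiguous assembly. This is why the two numbers are usually reported alongside completeness measures such as BUSCO.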

Posters

Ekaterina Yakovleva

Lomonosov Moscow State University, Bioinformatics Institute

Evolutionary biologist and bioinformatician a bit

Yakovleva Ilin poster Ephydra Ekaterina Yakovleva pdf

Tuesday July 28, 2020 11:50 - 11:55 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

11:55 MSK

Predicting CAZy profiles in wood-decay fungal communities with molecular ecological networks

Understanding the functional organization of ecological communities is essential for predicting their succession in stable and changing environments and designing effective strategies to control it. Functioning of terrestrial ecosystems strongly depends on the rate of organic matter turnover and fungi are the main drivers of this complex process in forests. In this work, based on ecological network analysis and functional predictions from amplicon sequences, we characterized between-species interactions in wood-decay fungal communities and mapped functional attributes to their biological networks.
The analysis is based on fungal abundance profiles obtained with high-throughput sequencing of rRNA gene internal transcribed spacer (ITS2). Ecological networks were inferred with SPRING (semi-parametric rank-based correlation and partial correlation estimation). Copy numbers of gene families, encoding extracellular enzymes involved in decomposition of plant biopolymers (e.g., cellulose, hemicellulose, and lignin degrading CAZymes) were reconstructed with PICRUSt2 based on the JGI MycoCosm database of reference genomes.
We compare the predicted functional profiles of undisturbed and degraded communities of wood-decay fungi and estimate the consequences of species loss for biotic interactions. We classify functional elements by their vulnerability to chemical pollution and by the importance in wood decomposition.

The work was funded by Russian Foundation for Basic Research (grant 18-29-05042).

Posters

Vladimir Mikryukov

Senior researcher, Institute of plant and animal ecology UB RAS, Ekaterinburg

Institute of Plant and Animal Ecology, Ural Branch, Russian Academy of Sciences, Russia

Mikryukov BiATA2020 Vladimir Mikryukov pdf

Tuesday July 28, 2020 11:55 - 12:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:00 MSK

Metagenomic analysis of bacterial communities associated with sponges from the White Sea

Sponges (phylum Porifera) form symbiotic relationships with communities of microorganisms. Sponges and their symbionts produce various pharmacologically active substances that have, among other things, antibacterial properties. Metagenomic analysis of the microbiome not only reveals the taxonomic diversity and properties of the microbial community, but also opens up opportunities in the search for new secondary metabolites. We collected and analyzed metagenomes of four marine sponges, Homoeodictya (Isodictya) palmata, Halichondria panicea, Halichondria sitiens and Myxilla incrustans, and of the surrounding seawater, collected in Kandalaksha Bay (White Sea) in August 2016 and 2018. In 2018, metaviromes of the sponges and seawater were also studied. Sequencing was performed on an Illumina NextSeq, with approximately 100 million paired-end 150+150 bp reads per sample. Raw reads were analyzed and filtered with FastQC and Trimmomatic. Metagenomes were assembled using metaSPAdes and quality was assessed with MetaQUAST. Contigs longer than 5 kb were used for further analyses. ORFs were predicted with MetaGeneMark. Analysis of secondary metabolite biosynthesis gene clusters (BGCs) with antiSMASH revealed NRPS, bacteriocin, lanthipeptide, LAP, PKS and other BGC types in the metagenomes (649 clusters in total). CRISPR-Cas systems were detected and classified with CRISPRCasTyper. The most common systems in sponges belong to class 1, type I, subtypes I-C and I-F. Taxonomic annotation of contigs was performed with DIAMOND blastx against the NCBI nr database, and the results were submitted to MEGAN6. The metagenomes are mainly composed of the classes Gammaproteobacteria and Alphaproteobacteria. In addition, we observed an increase in the abundance of Gammaproteobacteria in all samples from 2018, especially the genera Alteromonas and Pseudoalteromonas. We suggest that this may be related to the anomalously high seawater temperature in summer 2018 (Ereskovsky et al. 2019).
Bacterial communities of sponges and seawater differ in their composition and species diversity, based on Bray–Curtis dissimilarity and PCoA. The microbiomes of H. palmata and M. incrustans from 2016 and 2018 are similar to each other, while the communities of H. sitiens and H. panicea differ. In the metaviromes, the most abundant families are Myoviridae, Podoviridae and Siphoviridae, along with the group Prokaryotic dsDNA virus sp. We obtained 10 contigs longer than 100 kb that were identified as large DNA viruses, and one contig with a length of 227 kb.
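
The Bray–Curtis dissimilarity used to compare the communities is the sum of absolute abundance differences divided by the sum of all abundances in the two samples. It is 0 for identical profiles and 1 for profiles with no taxa in common. A minimal sketch, assuming each sample is a dictionary of taxon counts. The function names are illustrative and PCoA itself is not shown.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two taxon -> abundance dicts."""
    taxa = set(a) | set(b)
    difference = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    if total == 0:
        raise ValueError("both samples are empty")
    return difference / total


def dissimilarity_matrix(samples):
    """Pairwise matrix for a dict of sample name -> abundance dict."""
    names = sorted(samples)
    matrix = [[bray_curtis(samples[x], samples[y]) for y in names]
              for x in names]
    return names, matrix
```

The resulting symmetric matrix is the input an ordination method such as PCoA would take.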

Posters

Anastasia Rusanova

Institute of Molecular Genetics of the Russian Academy of Sciences

BIATA 2020 poster Bio Informatics pdf

Tuesday July 28, 2020 12:00 - 12:05 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:05 MSK

Heavy metal contamination in mining sites inhibits the growth of green vegetation. Fortunately, there are photosynthetic autotrophs, the cyanobacteria, that can survive in the extreme conditions of mine tailings. Surface water samples were collected from three sampling points in each Tailings Storage Facility (TSF) of the Philex mines in Benguet Province, Philippines: the re-vegetated Philex TSF1 and the currently active Philex TSF3. Genomic DNA was extracted from all water samples and subjected to shotgun sequencing. A total of 72.87 Gbases of raw reads were successfully assembled using the St. Petersburg genome assembler (SPAdes). Default and custom-based approaches for both the CLARK v1.2.5 and Kraken2 metagenomic classifiers were used to assign taxonomy to contigs using k-mer matches. Prokka was used for rapid annotation, and its output coding sequences were submitted to the evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG) mapper for gene ontology analysis. The default CLARK classified a large number of sequences across all sampling points in both the re-vegetated and active mining sites. Taxonomic assignments revealed the top five cyanobacteria, namely the unicellular Synechococcus sp., Cyanobium sp. and Gloeobacter sp., the filamentous, non-heterocystous Leptolyngbya sp., and the filamentous, heterocystous Nostoc sp. The custom-based CLARK, in contrast, classified Leptolyngbya sp., which accounts for about 3% to 4% of the assembled contigs. Kraken2 results, on the other hand, revealed Nostocales as the most dominant order, ranging from 0.05% to 0.63% of the classified sequences. The cyanobacterial custom-based Kraken2 revealed a large number of sequences belonging to the filamentous Fischerella sp. and Trichodesmium sp. in Philex TSF1. The unicellular Microcystis and the filamentous Nostoc sp., Spirulina sp. and Pseudanabaena sp. dominated the active Philex TSF3 site.
CLARK was able to discriminate cyanobacteria down to the species level, while the default Kraken2 classifier was able to distinguish only down to the dominant order. Although the custom-based CLARK detected more cyanobacteria at the order level than Kraken2, it was able to identify only a single cyanobacterium at the genus level. Kraken2 gave varying identifications of cyanobacteria across sites, while CLARK consistently identified the same cyanobacterial species at all sites. Protein-coding sequences output by Prokka and evaluated using eggNOG revealed genes conferring stress response to Cu2+, Zn2+, Pb2+, Cd2+ and Ca2+ metal ions, as well as the smt metallothionein. These genes are reported to be responsible for efflux/transport functions and heavy metal resistance, which may be major attributes enabling cyanobacterial species to survive extreme metal conditions. Enhanced growth of Leptolyngbya sp. might also lead to the formation of viable biological crusts that initiate a re-vegetation process. This is the first report of filamentous cyanobacteria dominating the copper and gold mine tailings in Benguet Province that were successfully assembled and analyzed using a shotgun metagenomic approach.
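
The general idea of assigning taxonomy to contigs using k-mer matches can be shown with a toy classifier in which only k-mers unique to one reference taxon vote. This is a deliberately simplified sketch with hypothetical names. CLARK and Kraken2 each use their own, more elaborate indexing and assignment schemes, which are not reproduced here.

```python
from collections import Counter


def kmer_set(seq, k):
    """All distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference(genomes, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    table = {}
    for taxon, seq in genomes.items():
        for km in kmer_set(seq, k):
            table.setdefault(km, set()).add(taxon)
    return table


def classify(contig, table, k):
    """Return the taxon with the most unique-k-mer votes, or None."""
    votes = Counter()
    for km in kmer_set(contig, k):
        owners = table.get(km, ())
        if len(owners) == 1:  # ignore k-mers shared between taxa
            votes[next(iter(owners))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Because the outcome depends on which reference genomes are in the table, swapping a default database for a custom cyanobacterial one can change the assignments. This is consistent with the differences between default and custom runs reported above.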

Posters

Libertine Rose S. Sanchez

Institute of Biology Postdoctoral Research Fellow Metagenomics, Metabarcoding, University of the Philippines Diliman

Plant Genetics and Cyanobacterial Biotechnology Laboratory; CIP Researcher, National Institute of Molecular Biology and Biotechnology

BiATA Poster SANCHEZ LRS Libertine Rose Sanchez pdf

Tuesday July 28, 2020 12:30 - 12:35 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:35 MSK

Analysis of genetic variants associated with infection of Mycobacterium leprae in exomes of Mexican mestizo population

Leprosy, also known as Hansen's disease, is a chronic infectious disease caused by the bacillus Mycobacterium leprae that mainly affects the peripheral nerves and the skin. The disease has accompanied humankind for at least 4,000 years, and throughout that time it has been one of the most dangerous diseases worldwide.

M. leprae has low virulence: transmission requires close and prolonged contact with a patient, as well as a genetic predisposition to acquire the disease. Mira and collaborators identified a locus within the PARK / PACRG gene that is associated with the susceptibility of the human population to develop leprosy. In mice, the NRAMP1 gene, which controls both susceptibility and resistance to intracellular pathogens, has been identified on chromosome 1.

Several reports of single nucleotide polymorphisms (SNPs) associated with resistance and susceptibility to M. leprae pathogenesis have emerged worldwide. In this project, we therefore compiled a database of SNPs associated with M. leprae pathogenesis that have been identified in any population, and compared it with the SNPs identified in the "SIGMA T2D" project, a study of type 2 diabetes that sequenced the exomes of around 3700 individuals of Latin American, specifically Mexican, descent. This comparison selects the M. leprae-associated SNPs most likely to be present in exomes of the Mexican population, so that they can be studied in vitro without the need to sequence complete exomes.
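
The comparison described above amounts to intersecting a list of disease-associated SNP identifiers with the variants observed in a cohort, then ranking the shared ones by how common they are. A minimal sketch follows. The function name, the frequency cut-off and the placeholder identifiers in the usage note are hypothetical and do not come from the study.

```python
def shortlist_snps(associated_ids, cohort_frequencies, min_frequency=0.01):
    """Associated SNPs seen in the cohort, most frequent first.

    associated_ids: iterable of SNP identifiers from the literature.
    cohort_frequencies: dict mapping SNP identifier -> allele frequency.
    """
    shared = [
        (snp, cohort_frequencies[snp])
        for snp in set(associated_ids)
        if cohort_frequencies.get(snp, 0.0) >= min_frequency
    ]
    # Sort by descending frequency, then by identifier for a stable order.
    return sorted(shared, key=lambda item: (-item[1], item[0]))
```

For example, with associated identifiers `snp_A`, `snp_B`, `snp_C` and cohort frequencies of 0.20, 0.05 and 0.002 respectively, only `snp_A` and `snp_B` would pass the default cut-off.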

Currently, Mexico is among the 15 countries with the highest incidence of leprosy worldwide. Sinaloa is one of the main states affected by the disease, ranking first in the nation with 150 registered cases at the end of 2019. Identifying genetic markers that help in prevention is therefore of vital importance for the development of strategies for its eradication.

Posters

Miguel Elenes

PhD Student, Universidad Autónoma de Sinaloa

biATA MAElenes Kniives pdf

Tuesday July 28, 2020 12:35 - 12:40 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:40 MSK

Genome assembly of heterozygous tropical trees - will the real (pan)genome stand up?

High-throughput sequencing has the potential to greatly enhance population genetics studies through its power to genotype individuals at multiple loci. Though methods exist for obtaining genotypes and polymorphism data without a reference genome, having a reference sequence at least for the single-copy regions of the genome does help when one would like to compare diversity parameters across different species. As the individual genomes are highly heterozygous, genome assembly programmes struggle to deliver an acceptable reference sequence even for the non-repetitive part of the genome. Platanus, Platanus-allee, SPAdes and Meraculous were compared for their genome assembly capabilities. None of the programmes delivered a usable reference genome from the sequence data, approximately 12-25x genome coverage of tropical tree species sequenced with paired-end Illumina reads of 101 or 150 nucleotides.
Through k-mer frequency analysis at multiple k-mer lengths the genomes are shown to be highly heterozygous. The k-mer length at which the peak frequency of homozygous k-mers equals the peak frequency of heterozygous k-mers is proposed as a reliable measure to compare the level of polymorphism across heterozygous species. This measure fails though when the heterozygosity is lower and a k-mer longer than the read length would be needed to detect equal peak frequencies.
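
A k-mer frequency spectrum of the kind analysed here can be computed by counting every k-mer in the reads and then tallying how many distinct k-mers occur once, twice, and so on. In a heterozygous diploid sample, the spectrum typically shows one peak near half the sequencing depth (k-mers covering a polymorphism) and one near the full depth (k-mers shared by both haplotypes). The sketch below is a simplification: it ignores reverse complements, and the function names are hypothetical.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def local_peaks(spectrum, min_multiplicity=2):
    """Multiplicities that are local maxima, skipping the error-prone low end."""
    peaks = []
    for m in sorted(spectrum):
        if m < min_multiplicity:
            continue
        here = spectrum[m]
        if here > spectrum.get(m - 1, 0) and here >= spectrum.get(m + 1, 0):
            peaks.append(m)
    return peaks
```

Repeating the analysis for several values of k and comparing the heights of the two peaks gives the kind of cross-species measure proposed above.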
By “haplotype-specific k-mer walking” contigs longer than 10,000 bp for the two haplotypes could be reconstructed at a number of loci in several species (Xylia xylocarpa, Gluta usitata, Dipterocarpus tuberculatus). The two haplotypes in a single individual generally differ by about 1 polymorphism per 100 bp, more so in the intergenic regions, intermediate in the introns and less so in the exons. About 70% of polymorphisms are simple SNPs, about 5% are insertions/deletions of 1 to several hundred nucleotides, and the rest are variations in repeat number, mostly of mononucleotide repeats. A sufficient number of read pairs contain two polymorphisms to phase most polymorphic sites. As the sequencing reads are derived from fragments with a target insert size of 500 bp, the phasing is broken when polymorphisms are separated by more than 500 bp, but the sequences still connect. The walking process needs to be further automated and parallelized to have any chance of building a more or less complete view of the single-copy region of the genome.
The data from moderate to low coverage Illumina genome sequencing contain sufficient information for the assembly of long contigs representing the two haplotypes derived from the two genomes in heterozygous individuals.

Posters

Hugo Volkaert

Principal investigator, Center for Agricultural Biotechnology, Kamphaeng Saen Campus, Nakhon Pathom 73140, Thailand

I am a forest ecologist and population geneticist hoping to use DNA sequence data to study the evolution and adaptation of tropical trees to their environment. Currently trying to assemble tree genomes on a shoestring budget, using low coverage shot gun sequencing of single libraries...

BIATA2020 Poster Tongyoo Chaibang Volkaert Hugo Volkaert pdf

Tuesday July 28, 2020 12:40 - 12:45 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:45 MSK

Genome-wide characterization of postinfectious functional dyspepsia-associated antibiotic-resistant Escherichia coli isolates from Mexico

Functional dyspepsia (FD) is one of the most common functional gastrointestinal disorders and affects more than 20% of the global population. FD is defined by the presence of fullness, plenitude, or epigastric or burning sensation, with no evidence of organic, metabolic, or systemic diseases that explain those symptoms. The exact etiology of FD is not clearly understood. However, gastrointestinal infections are one of the risk factors associated with developing this condition, and different pathogens have been implicated, Escherichia coli among them. E. coli is a rod-shaped Gram-negative bacterium commonly found as a commensal in the human microbiota; however, its genome plasticity has driven its evolution into pathogenic strains and the acquisition of antibiotic-resistance properties. In this study, whole-genome sequencing (WGS) was used for the molecular characterization of two antibiotic-resistant E. coli isolates collected from postinfectious FD patients. Genomic DNA was extracted using a ZymoBIOMICS DNA Miniprep Kit and sequenced on an Illumina MiniSeq (2x150 PE). De novo assemblies from the SPAdes v.3.12 and A5 assemblers were concatenated to generate final draft assemblies using the Mix tool, which were then scaffolded using the Medusa server. The draft genome sequences were annotated using Prokka v.1.12 and analyzed regarding phylogroup, multilocus sequence typing (MLST), serotyping, plasmid replicon, acquired antimicrobial resistance and virulence-associated genes using MLST v.2.0, SerotypeFinder v.2.0, PlasmidFinder v.2.0, ResFinder v.3.2 tools and BLASTn search against the Virulence Factors Database (VFDB), respectively. According to the in silico typing, the EC-FD20-2 and EC-FD21-2 strains were classified as ST399-O13:H30 and ST69-O17/O77:H18, respectively. A total of 9 genes conferring resistance to aminoglycosides, quinolones, macrolides, phenicols, sulphonamides, tetracycline, and trimethoprim were identified.
Neither β-lactamase genes nor mutations in the quinolone resistance-determining region (QRDR) were detected. A class 1 integron linked to an IncFII-type plasmid was identified in the EC-FD21-2 genome. The WGS analysis revealed that both E. coli strains harbored virulence-associated genes; nonetheless, the EC-FD21-2 genome encoded different adherence and iron uptake systems compared to the EC-FD20-2 genome. Additionally, EC-FD21-2 housed the increased serum survival protein (iss), Endonuclease colicin E2 (celb), and Enteroaggregative immunoglobulin repeat protein (air) virulence factors, giving insight into its host colonization and adaptation advantage. To the best of our knowledge, this is the first WGS characterization of antibiotic-resistant E. coli isolates recovered from postinfectious FD patients in Mexico. The genomic data revealed the basis of antibiotic resistance and the pathogenic potential of these E. coli strains, allowing a correct characterization.

Posters

José Antonio Magaña-Lizárraga

Doctoral student, Unidad de Investigaciones en Salud Pública “Dra. Kaethe Willms”, Facultad de Ciencias Químico Biológicas, Universidad Au

Poster BiATA2020 José A. Magaña Lizárraga José Antonio Magaña Lizárraga pdf

Tuesday July 28, 2020 12:45 - 12:50 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:50 MSK

Strain specific traits in the protein production associated with vegetative cells-to-spores transition in Bacillus thuringiensis

Organic agriculture and the trend of reducing the use of chemical pesticides require the development of biological pest-control methods. These include using natural pathogens of insects, such as Bacillus thuringiensis. It is a Gram-positive bacterium that produces a great variety of toxins of proteinaceous and non-proteinaceous nature. Among them, highly specific crystal-forming Cry-toxins accumulate upon transition from the stage of vegetative cells to spores. The set of Cry-toxins produced by each strain of B. thuringiensis is remarkably diverse and determines the host specificity of the strain. The Cry-toxin genes are harbored on different plasmids, which also contain genes encoding proteins involved in sporulation. The strains of B. thuringiensis differ in the number of plasmids in their genomes, but the strain specificity of proteins produced in spores and vegetative cells remains poorly studied at the proteomic level. In this study we used HPLC-Orbitrap-MS proteomics to quantitatively compare the production of proteins at two stages, vegetative cells and spores, in three different B. thuringiensis serovars: var. thuringiensis, var. darmstadiensis and var. israelensis. We also compared B. thuringiensis var. israelensis with one strain of the same serovar that lacked the ability to produce Cry-toxins. As expected, Cry-toxins were identified at the spore stage in all strains except the one that could not produce them. We also identified a set of proteins differentially expressed at the spore stage, including spore coat proteins, flotillin-like proteins and exosporium proteins. These proteins participate in cell differentiation and in the attachment of the exosporium to the spore. Taken together, the data obtained in this study revealed the differences between the proteomes of B. thuringiensis strains at the stages of vegetative cells and spores and showed similar patterns of protein production across different serovars.
This work was supported by the Russian Foundation for Basic Research (Grant No 20-316-70020).

Posters

Kirill Antonets

All-Russia Research Institute for Agricultural Microbiology, Saint Petersburg State University

antonets Kirill Antonets pdf

Tuesday July 28, 2020 12:50 - 12:55 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

12:55 MSK

Metagenomic analysis of virus diversity in cave water habitats

Aquatic viruses have been extensively studied over the past decade, yet aspects of virus communities in cave waters remain poorly described. Our goal was to characterize viromes of cave water sampled in oligotrophic environments where Proteus anguinus, also known as the olm or European cave salamander, is present. Because they depend on water habitats that are in many cases sensitive, amphibian species are vulnerable to a variety of threats, including viral infections.
Water samples (5 litres) from 7 different locations in an underground cave system in Slovenia were first concentrated using CIM monolithic chromatography, a method that can efficiently concentrate viruses from high-volume water samples. Then, we used shotgun high-throughput sequencing followed by a direct similarity search of sequencing reads against a comprehensive database at the protein level and subsequent taxonomic classification and visualization. Reads classified as Caudovirales bacteriophages were the most abundant in all cave water samples. Nucleocytoplasmic large DNA viruses from the Asfarviridae, Iridoviridae, Mimiviridae and Phycodnaviridae families were also abundantly detected, together with virophages (Lavidaviridae) that require coinfection with a giant DNA virus. ssDNA phages from the Inoviridae and Microviridae families were detected, as well as sequences of eukaryotic circular Rep-encoding single-strand (CRESS) DNA viruses. Sequences of ssRNA plant-infecting viruses were abundantly present in some cave water samples, part of them possibly reflecting anthropogenic contamination. Targeted detection of ranaviruses (from the family Iridoviridae), the main viral threat to amphibian diversity, gave negative results using qPCR, and these viruses were also not detected in the metagenomic analysis of cave water samples.
Overall, our findings provide insight into cave water viromes, describing the common virus community of a karstic underground cave system and identifying specific differences in the pathogens and viral indicators detected at different sampling sites.

Posters

Katarina Bačnik

National Institute of Biology, IPS "Jožef Stefan" Slovenia

Bacnik BiATA 2020 katarina bacnik pdf

Tuesday July 28, 2020 12:55 - 13:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

13:00 MSK

Reducing redundancy of input data sets to improve inference of transcription factor binding sites

The majority of bacterial genome annotations lack information about transcription factor (TF) binding sites (operators) which control how genomic information is expressed. We are developing an application (SigmoID) to solve this problem in a highly automated fashion. SigmoID can both discover unknown operator motifs and annotate matching operators in genomic sequences. In brief, the motif discovery algorithm involves analysing 3D structures of TF-operator complexes, finding TFs with the same contacts between operators and DNA-binding domains and then looking for autoregulatory operator motifs in the promoter regions surrounding the genes encoding these TFs (more detail will be provided in the accompanying talk).
  The success of motif discovery strongly depends on the diversity of the promoter region dataset. Assembling appropriate datasets proved to be challenging due to the large size and rapid expansion of protein databases. This report describes our solution to this problem.
  The first step of our pipeline includes finding TFs homologous to the one being studied and selecting the homologues with an identical specificity determinant, or CR-tag (the amino acid residues specifically contacting operator bases). This stage proved to be highly unreliable when public phmmer or blastp servers were used. Local searches require a fast workstation and the maintenance of large databases, which is undesirable given the target audience (bench scientists). Also, many thousands of homologous proteins with a matching CR-tag are expected for many TFs, while no more than 30-50 are usually required. Therefore, we have replaced the problematic database search step with fast lookup tables. The tables match each CR-tag to the IDs of all proteins with this tag. They are generated once for each protein family by running hmmsearch and determining the CR-tag for each hit. The excessive redundancy problem was mostly solved by using the reference proteome databases provided by PIR. For each protein family, five lookup tables were built: from the full protein database and from reference proteomes at 75%, 55%, 35% and 15% co-membership thresholds.
  The optimal homolog number can often be achieved by simply taking IDs of the proteins from one of the five lookup tables. In cases when homologue number is still excessive, an additional clustering stage is performed after extraction of the corresponding promoter regions. We found MeShClust (doi:10.1093/nar/gky315) to be the optimal tool at this stage.
  The efficiency of different clustering approaches and database search options was tested by inferring operator motifs for E. coli TFs from several protein families. The double clustering approach proved to be the fastest and produced better motifs in some cases as it didn’t have to resort to random selection of suboptimal promoter regions when their number was excessive. We have also noticed many cases of SigmoID producing realistic motifs (matching experimental data and suitable for genome-wide search) when such a motif was not present in the RegulonDB database or was incorrect.
The SigmoID v2 software with CR-tag lookup tables for 13 TF families is available at github.com/nikolaichik/SigmoID.
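
The lookup-table idea described in the abstract above can be pictured with a minimal Python sketch. It is an illustration only and does not reproduce SigmoID's actual file format or code: the function names (`build_lookup`, `pick_homologues`), the protein IDs and the CR-tag strings are all hypothetical, and the rule "take the first table that yields no more than `max_count` proteins" is an assumed reading of "taking IDs of the proteins from one of the five lookup tables".

```python
from collections import defaultdict


def build_lookup(hits):
    """Map each CR-tag to the IDs of all proteins carrying it.

    `hits` is an iterable of (protein_id, cr_tag) pairs, e.g. derived once
    per protein family from a profile search (hypothetical input format).
    """
    table = defaultdict(list)
    for protein_id, cr_tag in hits:
        table[cr_tag].append(protein_id)
    return dict(table)


def pick_homologues(tables, cr_tag, max_count=50):
    """Choose a homologue set for `cr_tag`.

    `tables` is ordered from the most inclusive (full database) to the most
    reduced (lowest co-membership threshold). Returns (ids, needs_clustering):
    the first non-empty set with at most `max_count` IDs, or, if every set is
    too large, the smallest one flagged for an extra clustering stage.
    """
    candidates = [table.get(cr_tag, []) for table in tables]
    for ids in candidates:
        if 0 < len(ids) <= max_count:
            return ids, False
    non_empty = [ids for ids in candidates if ids]
    if not non_empty:
        return [], False
    return min(non_empty, key=len), True


# Toy data: IDs and tags are made up for illustration.
full = build_lookup([("P1", "RKQ"), ("P2", "RKQ"), ("P3", "RKQ"), ("P4", "HNS")])
reduced = build_lookup([("P1", "RKQ"), ("P3", "RKQ"), ("P4", "HNS")])
ids, needs_clustering = pick_homologues([full, reduced], "RKQ", max_count=2)
```

Because the tables are built once per family, a lookup of this kind needs neither a remote search server nor a local sequence database at query time, which is the motivation the abstract gives.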

Posters

Pavel Vychik

Belarusian State University

Vychik Nikolaichik poster BiATA2020 Cursion Recursion pdf

Tuesday July 28, 2020 13:00 - 13:05 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

13:05 MSK

Computational tools for the de novo assembly of bacterial genomes: a comparative study.

The emergence of next generation sequencing technologies (NGS) has radically transformed the techniques for identifying nitrogenous bases in DNA and has boosted the production of scientific research across the globe, especially in the field of sequencing and genetic analysis of bacterial organisms.
The post-sequencing step, the computational assembly of DNA fragments, is a complex process that depends heavily on the platform on which the organisms were sequenced, especially on the methodology each platform uses to obtain the biological data. This dependence, together with several other factors, created favorable conditions for the large-scale production of bacterial genome assemblers, each with different configurations, instructions and operating parameters. In addition, some of these programs are difficult to install and take time to understand because of their long manuals, and sometimes the user is unable to achieve the expected result, either because of the data set used as input or because the program's workflow is unfamiliar. Thus, in addition to dealing with complex biological factors, bioinformatics professionals also need technical knowledge in computing to choose the ideal tool for their project.
Furthermore, the correct choice of assembly tool is extremely important for the success of the research. However, the scarcity of current works that deal with the performance and usability of these programs makes the choice of an assembler difficult. Given the above, this work provides information on the performance, precision and usability of seven bacterial genome assemblers, which were compared with each other using six SRA samples from the second-generation platform Ion Personal Genome Machine (IonPGM).
In our study we tried to describe in detail how the input data influence the performance of the programs and the final quality of their assemblies. To that end, we used quality metrics that allowed us to assess the accuracy of the results produced and to analyze the performance and behavior of the software. Moreover, when evaluating the general level of usability and implementation, we found that programs run from the terminal are easier to use than those that rely on configuration files, as the latter require more time to understand the workflow.
In the end, this research will assist professionals in the bioinformatics field in choosing the most appropriate tool for their project and that meets their needs, in addition to contributing to the advancement of techniques related to the assembly of bacterial genomes. Thus, the study developed becomes an important source of information for researchers in the field of computational biology, collaborating for scientific production in the area.

Posters

Gustavo Silva

Instituto Federal de Educação, Ciência e Tecnologia da Bahia (IFBA) - Campus Seabra

Matheus Brito de Oliveira

Teacher, IFBA

Master in Applied Computing

Poster BIATA Gustavo Jorge pdf

Tuesday July 28, 2020 13:05 - 13:10 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

13:10 MSK

First insights in transcriptome-derived SNPs of two cryptic species Lasiopodomys gregalis and L. raddei (Cricetidae, Rodentia)

Transcriptome analysis revealed abnormalities in skeletal muscle regeneration affected by mutations in LMNA gene

Introduction. The LMNA gene encodes the proteins lamin A and C, which form the nuclear lamina. The lamina is located at the nuclear periphery, where it maintains nuclear structure, controls chromatin organization, and participates in gene expression and cell division. Mutations in LMNA cause diseases called laminopathies. These diseases lead to muscular dystrophies, cardiomyopathies, lipodystrophies, neuropathies and premature aging syndromes. Interestingly, many laminopathies are muscle specific and occur in adulthood. The molecular mechanisms of disease development and progression remain unknown despite a large number of publications on this topic. In this work we investigate the lamin mutations G232E, which results in Emery-Dreifuss muscular dystrophy 2, and R482L, which is associated with familial partial lipodystrophy type 2. The aim of our study was to investigate the effect of the G232E and R482L mutations in the LMNA gene on skeletal muscle regeneration and functioning in vitro using transcriptome sequencing.
Methods. The mouse myoblast cell line C2C12 (stem cells of skeletal muscle) was transfected with lentiviruses containing variants of human LMNA: WT (control), G232E and R482L. The effectiveness of infection was assessed by immunocytochemical staining with human lamin antibodies. Myoblasts were differentiated in the myogenic direction using a medium with a low serum content (HS 2%). RNA was collected on days d0, d2 and d4 of differentiation, with each state represented in triplicate. Libraries for RNA sequencing were prepared using the TruSeq kit and sequenced on an Illumina HiSeq 2500. Raw data were aligned to the mouse genome GRCm38 with the GENCODE vM22 annotation; the number of reads was calculated using the featureCounts program. Differential expression (DE) and pathway analyses were performed in R using the DESeq2 and fgsea packages. Statistically significant results were selected with FDR=1% and log2fc>1 for DEGs and FDR=5% for pathways.
Results. The structure of the nuclear lamina in undifferentiated myoblasts with the G232E and R482L mutations was disrupted: it was in a condensed state and formed aggregates. However, all three transgenic cell lines successfully differentiated and formed myotubes. We found a significant decrease in the fusion coefficient for mutant cells. Accordingly, the expression of the myoblast fusion regulators Myom and Mymx was higher in WT cells. In undifferentiated myoblasts carrying the mutations, we found differentially expressed genes and pathways responsible for activation of myogenesis and cell cycle arrest signatures. However, we did not observe spontaneous differentiation of myoblasts. We conclude that cells with LMNA mutations are more committed to the myogenic direction than WT cells. We found upregulation of myogenic and mitotic pathways in mutant cells on d2 and d4 with respect to the WT condition. This indicates that the balance between differentiation and proliferation was impaired in G232E and R482L cells. In G232E cells, despite increased OXPHOS parameters and upregulated mitochondrial respiration pathways, the increased respiration is most likely a result of incomplete substrate oxidation. In R482L cells both glycolysis and OXPHOS were suppressed.
Conclusion. We showed that the G232E and R482L mutations in the LMNA gene change nuclear morphology, myoblast commitment and myotube metabolism.
The work was carried out with Russian Science Foundation grant #16-15-10178-П.
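
As an illustration of the selection thresholds stated in the Methods above (FDR=1% and log2fc>1 for DEGs), here is a minimal Python sketch of that filtering step. It is not the authors' R code. The column names `padj` and `log2FoldChange` follow the DESeq2 results table, while the function name `select_degs`, the gene labels and the numbers are hypothetical. Applying the fold-change cutoff to the absolute value is an assumption, since the abstract does not say whether downregulated genes were treated symmetrically.

```python
import math


def select_degs(results, padj_cutoff=0.01, lfc_cutoff=1.0):
    """Return gene names passing the FDR and fold-change thresholds.

    `results` is a list of dicts with keys "gene", "log2FoldChange" and
    "padj" (adjusted p-value). Rows with a missing padj are skipped.
    ASSUMPTION: the cutoff is applied to |log2FoldChange|.
    """
    selected = []
    for row in results:
        padj = row["padj"]
        if padj is None or math.isnan(padj):
            continue  # e.g. genes excluded by independent filtering
        if padj < padj_cutoff and abs(row["log2FoldChange"]) > lfc_cutoff:
            selected.append(row["gene"])
    return selected


# Made-up rows for illustration only.
example = [
    {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -1.8, "padj": 0.004},
    {"gene": "geneC", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "geneD", "log2FoldChange": 3.0, "padj": 0.03},
]
degs = select_degs(example)
```

With these toy rows, geneC fails the fold-change cutoff and geneD fails the FDR cutoff, so only geneA and geneB are kept.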

Posters

Oksana Ivanova

Junior Researcher, Almazov National Medical Research Centre

Hi! My name is Oksana and I am from St. Petersburg ITMO University. Currently I have finished my master’s studies in bioinformatics and systems biology. Now I investigate the effect of different mutations on muscles and heart at Almazov Centre.

Poster OksanaIvanova Oksana Ivanova pdf

Tuesday July 28, 2020 13:25 - 13:30 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

13:30 MSK

Comparison of extraction methods and sequence platforms for complex water sample

Viruses are one of the most important threats to agriculture. The diversity and impact of viruses present in various sources of irrigation water, such as rivers or processed wastewater, remain understudied. High-throughput sequencing is the most well-rounded approach for exploring the virome of irrigation water. However, water as a sample for metagenomic analysis is incredibly complex. It can contain large amounts of very diverse genetic material that is not limited to viruses. This makes it challenging to detect individual viral species with certainty. Moreover, the amount of plant virus genomic material present in water samples is very low, and we must employ multiple concentration and purification steps to maximize the sequence yield and quality without introducing too much bias. In preparation for large-scale irrigation water sampling, we have been optimizing individual steps of our analysis workflow. Here, we describe a comparative analysis done on a single sample of Serbian river water used for irrigation. We compared two nucleic acid extraction protocols on the same original sample. The modified Trizol protocol for RNA extraction has the advantage of producing high quantities of good-quality RNA with relatively long nucleic acid fragments; however, it can be time consuming. On the other hand, Qiagen’s QIAamp MinElute Virus Spin Kit is a faster, more user-friendly alternative that consists of a well-standardized extraction method that targets both DNA and RNA. The extraction methods were assessed using the Illumina MiSeq platform. Samples were normalized based on read count and compared in terms of the richness of viral families, with special consideration for RNA viruses, as the majority of plant viruses have RNA as their genetic material. We identified several viral species that were detected in these samples and compared the genome coverage obtained with the two extraction methods.
The bioinformatics analysis was done using Qiagen’s CLC Genomics Workbench software for pre-processing of reads and the majority of individual species mappings, Diamond blastx for a similarity search of the obtained sequencing reads against the GenBank nonredundant database, and MEGAN6 for visualization and comparison of the Diamond results. The results indicated an advantage of the modified Trizol extraction protocol for the detection of plant viruses. With this in mind, we also assessed and compared the performance of the Oxford Nanopore Technologies MinION platform, using the Flongle flow cell paired with the Ligation Sequencing Kit for library preparation of Trizol-extracted RNA from the same sample, in order to evaluate the ability of a long-read sequencing platform to increase the reliability of detection.

Posters

Olivera Maksimović Carvalho Ferreira

Young researcher, National Institute of Biology Slovenia, IPS "Jožef Stefan" Slovenia

Hello everyone, I am a young researcher at the National Institute of Biology in Slovenia working under the INEXTVIR project. For the most part I am working with plant viruses and water and their mutual relationship. At the moment we are predominantly testing irrigation water (via HTS sequencing...

poster biata 2020 olivera maksimovic Olivera Maksimovic pdf

Tuesday July 28, 2020 13:30 - 13:35 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

13:35 MSK

The elements of CRISPR-Cas-like system in genomes of 3 ecotypes of Arabidopsis thaliana

It is well known that mitochondria in higher plant species have an extremely large genome compared to the small genomes of some bacterial species. The mitochondrial genome of higher plants is actively involved in horizontal gene transfer, in which it can act both as a gene donor and as a gene acceptor. Another important feature of the mitochondrial genome of higher plants is the presence of species-specific sets of linear and circular plasmids in these organelles in many of the plant species studied in this regard. These plasmids behave like typical mobile genetic elements in terms of their ability to carry out gene transfer. It was shown earlier that a mitochondrial plasmid of Vicia faba contains a canonical CRISPR (clustered regularly interspaced short palindromic repeats) locus (Mojica et al., 2000). Taking into account the evolutionary origin of mitochondria and the structure of the plant mitochondrial genome, we used in silico methods to search for genetic elements similar to those of bacterial and archaeal CRISPR-Cas systems in the nuclear genomes of 3 ecotypes (Col-0, Ler, C24) of the model plant Arabidopsis thaliana.
We found sites corresponding to the organization of CRISPR loci of the prokaryotic type in the mitochondrial and nuclear genomes of A. thaliana. Contextual analysis of the complete sequence of the mitochondrial genome of A. thaliana (ecotype C24, GenBank accession number JF729200) allowed us to discover a site whose structure completely corresponds to the organization of CRISPR loci of prokaryotic origin. This CRISPR locus is formed by 3 perfect direct repeats separated by 2 spacer sequences. Analysis of these sequences using a database of plant viruses showed that the detected spacers have homology with the DNA of two strains (isolate Cabb B-JI and altered virulence isolate D/H) of cauliflower mosaic virus, which is able to infect A. thaliana.
The search for the genetic elements of adaptive immunity of the prokaryotic type in the nuclear genome of A. thaliana made it possible to detect elements of the CRISPR-Cas system on all 5 chromosomes of this species in the form of relatively numerous CRISPR loci and some putative cas genes. The number of CRISPR loci ranged from 16 on chromosome 3 to 23 on chromosome 5.
We suggest that the main function of the CRISPR-Cas-like system elements found in A. thaliana plants may be protection not only from viral and plasmid DNA, but possibly from any DNA of foreign origin. At present there is no specific hypothesis about the origin of CRISPR-Cas-like elements in the plant genetic apparatus. We believe that such elements may have appeared and then remained partially conserved during eukaryogenesis, since ancestors of eukaryotes such as archaea and alphaproteobacteria possessed them.
The discovery of components of adaptive immunity in plants adds to the existing methods of genome editing a novel one, which uses the plant's native CRISPR-Cas-like system and would permit the creation of transgenic plants with a much wider spectrum of economically valuable properties for general consumption.
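
The locus organization described in the abstract above (perfect direct repeats separated by spacers) can be illustrated with a toy Python scan. This is a deliberately simple sketch and not the authors' search method: the function name `find_repeat_spacer_arrays`, the default lengths and the example sequence are all hypothetical, and real CRISPR finders also tolerate imperfect repeats and apply further filters.

```python
def find_repeat_spacer_arrays(seq, repeat_len=8, min_spacer=5,
                              max_spacer=20, min_repeats=3):
    """Find runs of a perfect direct repeat separated by spacers.

    Returns a list of (repeat, start_positions) tuples. A run is reported
    when the same `repeat_len`-mer recurs at least `min_repeats` times with
    each gap between copies between `min_spacer` and `max_spacer` bases.
    """
    arrays = []
    i = 0
    while i + repeat_len <= len(seq):
        repeat = seq[i:i + repeat_len]
        starts = [i]
        pos = i
        while True:
            # The next copy must start after a spacer of allowed length.
            lo = pos + repeat_len + min_spacer
            hi = pos + repeat_len + max_spacer
            nxt = seq.find(repeat, lo, hi + repeat_len)
            if nxt == -1:
                break
            starts.append(nxt)
            pos = nxt
        if len(starts) >= min_repeats:
            arrays.append((repeat, starts))
            i = starts[-1] + repeat_len  # continue after the reported run
        else:
            i += 1
    return arrays


# Synthetic example: 3 copies of a made-up repeat with 2 made-up spacers.
repeat = "ACGTACGA"
toy = "TTT" + repeat + "CATCATCA" + repeat + "GGTTGGTT" + repeat + "GGG"
found = find_repeat_spacer_arrays(toy)
spacers = [toy[s + len(repeat):e]
           for (_, starts) in found
           for s, e in zip(starts, starts[1:])]
```

The extracted `spacers` are the sequences one would then compare against a virus database, as the abstract describes for cauliflower mosaic virus.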

Posters

Ivan Petrushin

Assistant Professor, Irkutsk State University

Petrushin The elements of CRISPR Cas like system.. (A0 poster) Ivan Petrushin pdf

Tuesday July 28, 2020 13:35 - 13:40 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Posters

13:40 MSK

Genome Assembly of Microbes By Leveraging Evolutionary Relationships

Microbial genomics has seen rapid improvements in the past decade, primarily due to the development of novel algorithms capable of assembling the data generated by a variety of next-generation sequencing technologies into a high-quality genome. Depending on the sequencing technology, the type of libraries, and the complexity of the genome, this has most often resulted in draft genomes. The completion of these microbial genomes, however, has remained a challenge. Recent technologies capable of producing extremely long reads allow the complete genomes of microbes to be determined. However, the cost-effectiveness of short-read technologies has resulted in the deposition of 468,154 (as of Dec 2019) permanent-draft genomes (i.e., genomes unlikely ever to be completed) in the NCBI database, while the number of complete genomes is only 16,814. Of these 468,154 genomes, 262,766 were obtained from the surveillance project, whose output grew drastically from 13 genomes in 2017 to 5,883 in 2018 and 256,860 in 2019. Unfortunately, a large number of these organisms are unavailable in any culture collection for resequencing with long-read technologies in order to complete the genome. With some exceptions, the short-read data for these genomes available in the Sequence Read Archive (SRA) contain information corresponding to the entire genome. When a closely related genome is available, it can be used as a reference to map the short-read data and determine the genome, and this often performs better than a de novo assembly. We propose a workflow that uses information from multiple reference genomes to obtain an improved assembly (compared to single-reference mapping, single-reference-guided assembly, or de novo assembly) of microbial genomes using short-read data from the SRA.
It is envisaged that with the increase in the number of complete genomes of a given Genus of microbe in the NCBI, the information contained in the genomes of related microbes can be exploited to obtain an assembly with improved contiguity, and with no loss in strain-specific information, using the original short-read data from the SRA. A proof-of-concept using simulated short-read data sets of E. coli is presented to highlight the improvements in the final assembly guided by multiple reference genomes.

Posters

Urmi Shah

Research Intern, Bioinformatics Centre, CSIR-Institute of Microbial Technology, Chandigarh.

Postdoctoral Fellow (ESPOD), EMBL-EBI

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

virify presentation pdf

Tuesday July 28, 2020 14:00 - 15:15 MSK
Zoom Mgnify https://zoom.us/j/93441398259?pwd=ZVRiWWl5ZWFpNlVQZUhVcDB0aTBndz09

MGnify Workshop, Lecture

15:15 MSK

MGnify: MAG generation (Part 4)

Speakers

Alexandre Almeida

Postdoctoral Fellow (ESPOD), EMBL-EBI

Rob Finn

Team Leader, Sequence Families, EMBL-EBI

Lorna Richardson

Microbiome Resources Co-ordinator, EMBL-EBI

Ekaterina Sakharova

Bioinformatician, EMBL-EBI

Tuesday July 28, 2020 15:15 - 16:30 MSK
Zoom Mgnify https://zoom.us/j/93441398259?pwd=ZVRiWWl5ZWFpNlVQZUhVcDB0aTBndz09

MGnify Workshop, Lecture

17:00 MSK

The 3C criterion: Contiguity, Completeness and Correctness to assess de novo genome assemblies

De novo genome assembly is an open challenge in bioinformatic analyses. Although the genome of an organism is "unique", different assemblies can be obtained depending on the type of DNA sequencing technology, the algorithms and parameters, as well as the complexity of the genome.
In order to select the reconstructed sequence that is closest to the real genome, different approaches to evaluating and selecting assemblies have been implemented. First, metrics such as N50, L50, NG50 and others relate the number and size of the pieces obtained to the expected sequence; these measure contiguity. Other comparison strategies have focused on the ability to reconstruct essential genes and known elements of the genome, referring to completeness (how much of the genome is represented by the pieces of the assembly) as a requirement that has more biological meaning than just the number of fragments in the assembly. Additionally, the agreement between the sequenced and the expected bases has been a matter of discussion, due to differences in the performance of DNA sequencing technologies. This can be referred to as correctness: how accurately those pieces represent the sequenced genome.
Due to the above, we have recently conceptualized criterion 3C (contiguity, completeness and correctness) as a set of metrics that can be used to benchmark genome assemblies (Molina-Mora, et al., 2020). This allows assembly selection to consider different aspects at the same time. We assessed this concept with the assembly of a bacterial genome, using Pseudomonas aeruginosa AG1 as a study model. Regarding the reference genome (P. aeruginosa PAO1), it was initially estimated that P. aeruginosa AG1 had ~ 1 Mb additional DNA sequence in its genome, so a de novo assembly was required. To do this, we used ultra-deep sequencing by short- (Illumina) and long-reads (Nanopore) technologies. An exhaustive comparison of different algorithms and technology combinations was done, resulting in the selection of a candidate assembly using the criterion 3C.
Thus, in this talk we will delve into the comparison of different assemblies, highlighting: (i) the definitions and relevance of contiguity, completeness and correctness metrics, (ii) the results obtained by sequencing technologies in hybrid or non-hybrid approaches based on metrics, (iv) aspects of the use of guide genomes for scaffolding, assembly polishing and manual curation, and (v) the challenges that still persist in the field of genome assembly. For this, we used the described model and two new isolates of P. aeruginosa (strains C25 and C50) that we sequenced in the same way. For each genome, 10 approaches (hybrid or not) were implemented using different assemblers (Unicycler, SPAdes, IDBA, SKESA, Canu and Flye).
The benchmarking confirmed two well-known results: long-read technologies perform better at resolving repeated regions, and short-read technology yields higher fidelity. Although some assembly algorithms achieved a single contig as expected, a surprisingly large number of fragmented genes were identified in the cases with long-read data. Thus, assessment using the 3C criterion showed substantially improved performance for a hybrid assembly approach, which uses the best advantages of each sequencing technology.
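
For readers unfamiliar with the contiguity metrics named in the abstract above, the following minimal Python sketch computes N50 and L50 from a list of contig lengths, and NG50 when an expected genome size is supplied. The function name `nx_stats` and the example lengths are hypothetical; the definitions used are the standard ones (N50 is the length of the contig at which the cumulative sum of contigs, sorted from longest to shortest, first reaches half of the total assembly size; L50 is the number of contigs needed to get there; NG50 uses the expected genome size instead of the assembly size).

```python
def nx_stats(contig_lengths, fraction=0.5, reference_size=None):
    """Return (Nx, Lx) for an assembly.

    With `reference_size=None` the target is `fraction` of the assembly
    size (N50/L50 for fraction=0.5). With a reference size the target is
    `fraction` of that size instead (NG50/LG50). Returns (None, None) if
    the assembly is too short to reach the target.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    lengths = sorted(contig_lengths, reverse=True)
    base = reference_size if reference_size is not None else sum(lengths)
    target = fraction * base
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, count
    return None, None


# Made-up contig lengths (in kb) for illustration.
contigs = [80, 70, 50, 40, 30, 20]
n50, l50 = nx_stats(contigs)                          # assembly-based
ng50, lg50 = nx_stats(contigs, reference_size=400)    # reference-based
```

Contiguity alone says nothing about gene content or base accuracy, which is why the 3C criterion pairs it with completeness and correctness.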

Reference:
Molina-Mora, J.-A., Campos-Sánchez, R., Rodríguez, C., Shi, L., & García, F. (2020). High quality 3C de novo assembly and annotation of a multidrug resistant ST-111 Pseudomonas aeruginosa genome: Benchmark of hybrid and non-hybrid assemblers. Scientific Reports, 10(1), 1392. https://doi.org/10.1038/s41598-020-58319-6

Speakers

Jose Arturo Molina-Mora

Microbiologist bioinformatician, Universidad de Costa Rica

Nature, trips and bioinformatics!

Tuesday July 28, 2020 17:00 - 17:05 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:00 MSK

Q & A: talks

Tuesday July 28, 2020 17:00 - 19:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks, Talks

17:05 MSK

Impact of genetic variation on the association rate constant of von Willebrand factor and GPIbα platelet receptor

Von Willebrand factor (VWF) is a large multimeric protein involved in platelet adhesion and activation. The A1-domain of the von Willebrand factor subunit interacts with GPIb-V-IX, a platelet transmembrane receptor complex, via the receptor GPIbα. Information concerning VWF and GPIbα genetic variation is available in the open database ClinVar. In the present work we analysed the impact of genetic variation on the association rate constant of von Willebrand factor and the GPIbα platelet receptor. Based on PDB structures, the rate constants ka of VWF-A1 and GPIbα association were determined for a series of genetic variants (Alsallaq & Zhou, 2008). The work focused on the clinically significant genetic variants of both VWF-A1 and GPIbα (Landrum et al., 2018). It was found that certain mutations (Trp1313Cys, Arg1379Cys) in the von Willebrand factor A1-domain caused a several-fold decrease, while the mutation Gly249Val was followed by a significant increase in the association rate constant ka. Models of the interaction of the VWF A3-domain with collagen III, as well as of the VWF A1-domain with bitiscetin, were studied in a similar manner. The mutation Ser1783Ala in the VWF A3-domain caused a nearly two-fold increase in the ka value in the interaction model with collagen III.
The results obtained seem to be important for the interpretation of clinical data concerning Bernard-Soulier syndrome, von Willebrand disease (VWD) and pseudo-VWD.
The work has been supported by the Russian Science Foundation (Grant 19-11-00260).

Speakers

Maria Gefen

National Research Center for Hematology & Moscow Institute of Physics and Technology

Tuesday July 28, 2020 17:05 - 17:10 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:10 MSK

Network perspective on metabolic diversity among mononuclear phagocytes

The diversity of myeloid cells across different tissues is truly astonishing, both in function and in developmental trajectory. An additional dimension of this diversity is manifested by the metabolic characteristics of individual phagocytes, which can vary significantly with cell type and location. At present, direct metabolomic profiling of tissue-residing subpopulations is not feasible, as ex vivo sorting can be lengthy and cause significant metabolic perturbations. However, RNA levels are significantly more stable during sorting and can serve as a reasonably reliable proxy for the activities of metabolic pathways. In this work we focus on understanding metabolic variability across phagocytic subpopulations through integrated examination of several large-scale datasets that transcriptionally profiled subsets of myeloid cells.
Specifically, we have assembled a compendium of three datasets, including the first public release of a new dataset generated by the Mononuclear Phagocytes Open Source ImmGen project. This dataset totals 337 samples and provides a unique source of information about individual cell subpopulations. It extends the previous ImmGen effort, which included 202 samples of various mononuclear phagocytes and is also analysed in this study. Furthermore, we have leveraged recently released single-cell RNA-seq profiling of multiple murine organs and reanalysed those data, focusing only on the mononuclear phagocytic populations, which comprise 36,480 cells across 18 tissues.
Using these transcriptional data, we sought to identify major metabolic features characteristic of different populations of phagocytic cells and to define how these features vary across cell types and locations. This is a computational task that has not been addressed previously for datasets of such scale. A previously described computational approach called GAM (PMID: 27098040) uses metabolic networks as the backbone for analysis of transcriptional data and provides a verifiable and systematic description of metabolic differences. However, the datasets in question contain hundreds of individual profiles, while GAM is designed to analyse a comparison between two conditions. Therefore, in this work we have developed a novel computational approach, GAM-clustering, which performs an unbiased search for a collection of metabolic subnetworks that jointly define metabolic variability across large datasets. By doing so, GAM-clustering reveals metabolically similar subpopulations in a manner that does not require explicit annotation or pairwise comparison of individual samples. Our analysis revealed major metabolic features associated with different cell subpopulations and highlighted a number of metabolic modules that are specific to individual cell types, tissues of residence, or developmental stages. As an example, GAM-clustering analysis revealed that the cholesterol pathway might play an important role in migratory dendritic cells (DCs), which we validated using in vivo pharmacological inhibition of this pathway followed by tracking of DC migration. Consistent with the analysis, statins demonstrated an inhibitory effect on DC migratory ability, a finding that has not been reported previously.
Taken together, our work provides both (1) a unique data and analysis resource for studying the variability of phagocytes and (2) a validated computational approach that can analyse both single-cell RNA-seq data and multi-sample bulk RNA-seq datasets in an unbiased manner in terms of their underlying metabolic features.

Speakers

Anastasiia Gainullina

PhD student, ITMO University

Gene Expression Analysis, Biological Networks (Metabolic, etc), Teaching

Gainullina Biata 2020 pdf

Tuesday July 28, 2020 17:10 - 17:15 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:15 MSK

Shifts in the microbial community of soil in long-term burial conditions

Paleosols (buried soils) are generally formed when undisturbed soil is covered by mounds of different origin. Fresh organic matter is no longer delivered to the buried soil for a long time, and the moisture, temperature and air regimes change. This leads to various diagenetic processes and shifts in the structure of the soil microbiome. Microbial communities of paleosols are considered to be partially conserved and to serve as sources of information about soil conditions before burial; however, this issue is still unclear. On the one hand, a number of chemical and morphological properties and their profile stratification persist in buried soils; on the other hand, a decrease in the number of microorganisms and shifts in the trophic and taxonomic structure of the microbiome are observed. To assess changes in the prokaryotic community during burial, we performed a comparative analysis of the microbiome of a dark chestnut soil buried under a mound dated to 500 B.C. and the adjacent surface dark chestnut soil located in the same landscape conditions. To scale this difference, other soil types (chernozem, sod-podzol, and gray soil) were taken for comparison. The abundance of 16S rRNA gene copies was assessed with qPCR, and taxonomic structure was analyzed by high-throughput sequencing of amplicon libraries of the V4 fragment of the 16S rRNA gene with the DADA2 package and QIIME2 software. The significance of differences in the representation and abundance of phylotypes was assessed with the DESeq2 package. Metabolic pathways were reconstructed using PICRUSt2 software.
The buried soil retained its profile stratification, with the corresponding differentiation of microbial communities. Compared with the surface soil, it showed a decrease in total bacterial number (1.8 to 15.7 times, depending on the horizon) as well as significant differentiation between the A and B horizons.
Significant differences between the microbiomes of different horizons were revealed even at the level of phyla (especially Actinobacteria, Proteobacteria, Firmicutes, Thaumarchaeota (Archaea), Acidobacteria, Chloroflexi, Bacteroidetes, Planctomycetes). We found significant changes in the soil microbiome, and the scale of these changes was comparable with the differences between soils of different types. In the buried soil, a decrease in the genus Gaiella and the orders Rubrobacterales, Solirubrobacterales, Nitrososphaerales (Archaea) and Frankiales, and an increase in the class Acidimicrobiia and the phyla Firmicutes (Bacillales) and Chloroflexi were observed. In the upper horizons, the shares of Bacteroidetes and Verrucomicrobia increased. Thus, burial increases the proportion of microorganisms capable of surviving under adverse environmental conditions and of oligotrophic nutrition. The presence of microorganisms participating in the nitrogen cycle in the buried soil (Nitrolancea, Candidatus Alisiosphaera, Rhizobiales, Candidatus Nitrososphaera) may indicate that its environment remains stable after burial and maintains the cycle of the main biogenic elements. However, cluster analysis showed that the microbiomes of the A and B horizons of the buried soil shift toward the group of C horizons, which may indicate a greater degree of their “mineralization”. This is confirmed by the analysis of potential metabolic pathways, which shows the predominance of degradation processes in horizons A and B of the buried soil.

This work was supported by the Russian Science Foundation, № 18-16-00073.

Speakers

Kichko A.A.

All-Russian Research Institute for Agricultural Microbiology

Tuesday July 28, 2020 17:15 - 17:20 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:20 MSK

Specialized Metabolism Gene Clusters from Red Sea Brine Pool Microbial Metagenomes

Mining for specialized metabolism gene clusters (SMGCs) is one approach to finding new antibacterial and anticancer natural products, especially from under-explored environments. Microbial metagenomes from the Atlantis II Deep, Discovery Deep and Kebrit Deep Red Sea brine pools were shotgun sequenced, and 2,751 Red Sea brine SMGCs were detected. These SMGCs potentially encode natural products belonging to 28 classes, which, in addition to hybrid clusters, were functionally grouped into three main categories with diverse chemistries: (1) saccharides, fatty acids, aryl polyenes and acyl-homoserine lactones; (2) terpenes, ribosomal peptides, non-ribosomal peptides, polyketides and phosphonates; and (3) polyunsaturated fatty acids, ectoine, ladderane and others. We recently reported our findings; here we focus on the specific methodology of SMGC detection in metagenomic samples and on one selected group of natural products, the ribosomally synthesized and post-translationally modified peptides (RiPPs). Although RiPPs constitute only 0.78% of the total Red Sea brine SMGCs, they are technically feasible to test in the lab and can therefore be prioritized for downstream experimentation. Moreover, several earlier studies have reported RiPPs of similar classes that exhibited antibacterial and/or anticancer effects. The detected Red Sea brine RiPPs comprise bacteriocins (17 SMGCs), saccharide-bacteriocin hybrid clusters (3 SMGCs), microcins (3 SMGCs) and lanthipeptides (2 SMGCs). In addition to our earlier reported results, we present recommendations for optimal mining of microbial metagenomes for SMGCs and recommend the RiPPs as a prioritized group for experimental work to validate the implemented methodology and highlight its importance.

Speakers

Laila Ziko

Postdoctoral Researcher, Adjunct Assistant Professor, Biology Department, American University in Cairo

I'm a Postdoc interested in different topics that I really like to work on. Natural products from microbes, Metagenomics, antibacterial compounds, anticancer compounds, all are very interesting to me. Reach out for discussion & possible collaboration, look out for my talk on our recent... Read More →

Tuesday July 28, 2020 17:20 - 17:25 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:25 MSK

A platform for genomic characterization of Enterococcus spp.

The growing demand for genomic data analysis has made the development of scalable and robust bioinformatic workflows essential. Such workflows are valuable because they reduce researchers' effort by automating task execution and guarantee the reproducibility of data analysis. To date, few genomic analysis workflows have been designed for a specific bacterial genus. We therefore present JAMIRA, a reproducible and scalable pipeline for prokaryotic genomic data analysis designed for the genus Enterococcus. In the last decade, enterococci have emerged as one of the main bacterial genera of clinical relevance, as they are important carriers of virulence genes and possess intrinsic resistance to commonly used antimicrobial agents, including most cephalosporins, all semi-synthetic penicillins and clindamycin. The proposed workflow integrates a comprehensive set of genomic analysis tools for the prediction of phages, plasmids, genomic islands, antimicrobial resistance genes and virulence factors that may be associated with the adaptation of commensal and clinical bacteria. Our pipeline thus automates several tasks commonly performed in comparative genomic studies, in order to contribute to the elucidation of the biological mechanisms that associate enterococcal isolates with public health outcomes. Pipeline development began with the selection of bioinformatic tools used to identify elements associated with successful colonization and genomic plasticity of prokaryotes. Freely available tools were compared in order to select the most appropriate ones for the genetic study of the genus Enterococcus. To facilitate installation of the software dependencies of each tool and their subsequent integration into the pipeline, tools available on the Bioconda platform were used.
To ensure the reproducibility of data analysis, the workflow was built on the Snakemake framework, which has a readable definition language, and integrated with the Conda package manager, which encapsulates all software dependencies necessary for the execution of each tool. Currently, the JAMIRA platform includes the following genomic analysis tools: Abricate, RGI, PlasmidFinder, IslandPath-DIMOB and PhiSpy. A web application for JAMIRA is being implemented using the PHP Laravel framework for the program's internal structure and JavaScript, HTML and CSS for the graphical interface. Initially, the MySQL management system is used for data storage; this can be changed according to user demand. The application has a graphical interface that allows the analysis of genomic data files in FASTA format without requiring software installation or use of the command line to run the analysis and configure the workspace. JAMIRA is therefore an automated and easy-to-use workflow that will allow scientists with no background in bioinformatics to perform reproducible and trustworthy genomic data analyses, contributing to the understanding of the differences between commensal and clinical strains and to the elucidation of the biological mechanisms that associate bacteria of the genus Enterococcus with public health risks.

Speakers

Rafaella Santana Bueno

Undergraduate in Biomedical Informatics, UFCSPA - Federal University of Health Sciences of Porto Alegre

Tuesday July 28, 2020 17:25 - 17:30 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:30 MSK

A rigorous approach to UPGMA phylogeny by multidimensional scaling of pairwise distances and bioinformatic outcomes for a commercially significant geneset

This study aimed to rigorously determine a branching order for a set of interrelated housekeeping genes. Following extensive genomic sequencing of the algal triterpenoid-biofuel producer Botryococcus braunii, we rapidly obtained the target set of genes related to squalene synthase by using selective BLAST and then iterative SOAP for assembly of overlapping BLAST hits. In that way we exhaustively ascertained that there were only four homologues present. Introns were crossed by using 2-4 kb paired ends. Full-length genes were annotated, including potential alternative C-termini. We hypothesised that these four key biofuel genes had evolved by two successive gene duplications from squalene synthase, and that two code for proteins tethered to a membrane by their C-termini. In a novel approach to phylogeny, we first used MATLAB to obtain Needleman-Wunsch protein alignments that minimized the PAM-250 genetic distances between each pair of sequences. We stored each pairwise alignment in a matrix and selected the case with the minimal penalty when normalised. Multiple alignment was intentionally omitted, as all genes were true homologues, so all intergene pairs were valid comparisons. After pairwise alignment, the distance between sequences was computed using the Poisson model. To visualize these distances, multidimensional scaling (MDS) was used to create an optimally distance-preserving projection onto two axes, allowing direct visualization of the relative genetic distances between sequences. The novel MDS approach critically informs the succeeding steps in tree generation, and differs from prior applications of MDS in tree comparison. Use of the Poisson model guarantees an ultrametric tree in the subsequent phylogenetic construction. Phylogeny was analysed by hierarchical clustering, using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA).
UPGMA merges the two nearest clusters (initially single sequences) into one cluster C, and determines the new distance d(C,K) between C and each remaining cluster K; in UPGMA this distance is the average of the pairwise distances between the sequences in C and the sequences in K. The algorithm iterates, terminating when all sequences are merged into a single cluster, which becomes the root of the generated tree. Each merge operation represents one branch of the resulting phylogenetic tree. The root node is wherever the last merge is made. We suggest that the MDS method conducted may detect bioinformatic richness present in the sequences, relative to other phylogenetic methods that tend to treat AA or nucleotide columns as if they did not form part of a whole gene. By considering the gene as the critical unit upfront, by defining the gene only by its relationship to each other gene, and then by allowing distance to be multidimensional, the algorithm presented may respond to uniquely conserved areas sampled across two genes at a time, which may represent potential ancestral richness predating the pair. The phylogenetic branching order obtained from the tree for this gene set correlates well with observed synapomorphies (here motifs and introns) present across the gene set, giving us high confidence in the phylogenetic order of duplication of the constituent genes, and allowing us to infer biochemical signatures in the active-site pockets of this wider set of triterpenoid biosynthesis proteins.
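
The clustering step described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function names `poisson_distance` and `upgma`, the input format, and the toy distances in the usage note are assumptions made for illustration, and the MDS projection is not covered.

```python
import math
from itertools import combinations


def poisson_distance(p):
    """Poisson-corrected distance for a fraction p of differing aligned sites."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p)


def upgma(names, dist):
    """Cluster `names` by UPGMA.

    `dist` maps frozenset({a, b}) -> distance for every pair of names.
    Returns (tree, merges): tree is a nested tuple, merges lists
    (left, right, height) in the order the clusters were joined.
    """
    # Each active cluster: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {
        frozenset((i, j)): dist[frozenset((names[i], names[j]))]
        for i, j in combinations(range(len(names)), 2)
    }
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        pair = min(d, key=d.get)
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        height = d.pop(pair) / 2.0
        merges.append((ti, tj, height))
        # The size-weighted average equals the mean over all leaf pairs.
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree, merges
```

For four hypothetical genes with d(A,B)=2, d(A,C)=d(B,C)=6 and every distance to D equal to 10, `upgma` joins A and B first, then C, then D, so the last merge is the root of the tree.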

Speakers

Robert Moore

There are two speakers sharing the talk. This speaker Robert has experience in molecular microbiology, gene annotation, genome mining, phylogeny and taxonomy, and has worked in plant science, microbiology, and genetics fields. Currently in the environmental microbiology industry... Read More →

Michael Barnathan

Temple University

Tuesday July 28, 2020 17:30 - 17:35 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

17:35 MSK

Do multiple long-distance transfers shape TBEV spread pattern?

Tick-borne encephalitis (TBE) is a viral zoonosis transmitted by the bite of infected ticks. About 20 years ago, TBE virus (TBEV) was divided into three main subtypes based on phylogenetic analysis: European, Siberian, and Far-Eastern. The geographic distribution of the subtypes mostly corresponds to the region each is named after. However, some exceptions are known. Herein, 848 TBEV sequences (1028 nt E-gene fragments) were analyzed to identify all long-distance virus transfers that can be revealed from the sequence data. A threshold of 500 km was used to select long-distance virus transfers. Temporal estimates for these events were obtained using Bayesian evolutionary analysis. Notably, ticks are not able to spread the infection on their own over such a distance. In other words, these long-distance virus transfers were caused by vector-assisted transport of ticks. In all subtypes and most of the smaller groups within them, there were many recent long-distance virus transfers. Moreover, this appears to be a systematic pattern rather than a set of anecdotal events. The largest groups of known sequences of the European subtype were obtained in Switzerland (n=41 out of 178) and the Czech Republic (n=35 out of 178). The genetic diversity of viruses found within each of these two countries was comparable with the diversity of the whole subtype (n=178). At the same time, this subtype is distributed throughout Central and Eastern Europe, Altai, the Irkutsk Region (Russia), and South Korea. These arguments allow us to state that long-distance transfers may be considered a normal and abundant pattern in TBEV spread.
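
The abstract does not say how the distance behind the 500 km threshold was measured. As a minimal sketch, assuming great-circle distance between the sampling sites of two related sequences, the check could look like this in Python; `haversine_km`, `is_long_distance` and the coordinates in the usage note are hypothetical and not taken from the study.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius
LONG_DISTANCE_KM = 500.0   # threshold quoted in the abstract


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_long_distance(site_a, site_b, threshold_km=LONG_DISTANCE_KM):
    """True if two (lat, lon) sampling sites are farther apart than the threshold."""
    return haversine_km(*site_a, *site_b) > threshold_km
```

With approximate city coordinates, `is_long_distance((50.0755, 14.4378), (52.2870, 104.3050))` (Prague and Irkutsk) is true, while `is_long_distance((47.3769, 8.5417), (46.9480, 7.4474))` (Zurich and Bern) is false.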

Speakers

Nikita Bulantsev

First-year master's student, Applied Genomics Laboratory, SCAMT Institute, ITMO University, Saint Petersburg, Russia

I’m a biotechnologist and first-year master's student of molecular biology at ITMO University, bioinformatics cluster. My research areas are metagenomics and personalized medicine. I'm also interested in AI and machine learning.

Tuesday July 28, 2020 18:05 - 18:10 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:10 MSK

The search for genetic risk factors of ischemic stroke with the genome-wide association study and machine learning methods

Ischemic stroke (IS) is a neurological deficit of sudden onset due to brain infarction. It is the primary cause of acquired disability in adults and a leading cause of death. The disease is multifactorial, and genetic factors contribute to it. Numerous genetic polymorphisms are believed to increase the risk of IS, each having a small effect size. The advent of genome-wide genotyping triggered a wave of genome-wide association studies (GWAS). At least ten candidate genes associated with IS have been described and verified in different studies; however, many additional genomic regions are expected to be confirmed or identified. Machine learning (ML) approaches look quite promising here. In this research we present the results of identifying single nucleotide polymorphisms (SNPs) associated with the development of IS in individuals of Eastern Slavic ancestry using GWAS and ML approaches.

The case and control groups consisted of 1051 and 421 individuals, respectively. They were genotyped with DNA microarrays of different types. After combining the genotypic data and applying quality control, we obtained about 82 thousand SNPs for the investigation. The GWAS included an association test, Fisher's exact test and the Bayes factor method. The machine learning approaches involved Support Vector Machine, k-Nearest Neighbors, Random Forest, Logistic Regression (LR), Gradient Boosting, and Neural Network (NN). They were used to classify patients and healthy people from the genotypic data. The highest performance (ROC-AUC of 0.697) was achieved with the NN method. The effect of each SNP on the outcome variable was estimated with SHAP values for the LR model. The top-ranked SNPs identified were in good agreement with the results of GWAS.
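
One of the tests named above, Fisher's exact test, can be sketched for a single SNP as follows. This is an illustration only, not the study's pipeline: the function name `fisher_exact_two_sided` and the layout of the 2x2 table (rows for cases and controls, columns for the two alleles) are assumptions, and no counts from the study are used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: rows are cases and controls, columns are counts of
    the two alleles of one SNP.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        # when all table margins are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at most as likely as the observed one;
    # the small tolerance keeps exact ties despite floating-point rounding.
    p_value = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

In a genome-wide scan the function would be called once per SNP, and the resulting p-values would still need a multiple-testing correction, which is outside this sketch.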

In this research we also assessed the influence of missing genotypes on the results of both GWAS and ML methods. We compared different strategies for combining genotypes obtained with different DNA microarrays and provide some recommendations on the appropriate way of doing this. We also annotated the SNPs found in GWAS in terms of genes and speculate that the candidate genes may be associated not only with IS but also with other diseases (e.g., Alzheimer's disease, Parkinson's disease), suggesting common basic mechanisms in the development of brain injuries.

The study was funded by RFBR (Russian Foundation for Basic Research) according to the research project No 19-29-01151.

Speakers

Gennady V. Khvorykh

bioinformatician, Department of Molecular Bases of Human Genetics, Institute of Molecular Genetics of the Russian Academy of Sciences, Moscow, Russia

Medical and population genetics is within the scope of my interests. I search for the genetic variants that contribute to ischemic stroke, using GWAS and AI. Besides I search for the signals of natural selection, applying statistical approaches to the genotypes of several populations... Read More →

Tuesday July 28, 2020 18:10 - 18:15 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:15 MSK

VarQuest+: modification-tolerant database search of secondary metabolites mass spectra

Secondary metabolites (SMs) are at the center of attention for a wide range of researchers, from biologists and ecologists to pharmacologists and biomedical scientists [1]. Modern mass spectrometry instruments allow rapid and low-cost scanning of thousands of metabolites, which results in huge amounts of high-resolution data. Although these data represent a gold mine for future discoveries, their interpretation remains a bottleneck and requires appropriate computational methods [2]. Current software is either limited to specific classes of SMs, for example peptidic natural products (VarQuest [3]), or can perform only a standard database search, which allows identification of known SMs but fails to discover their novel variants (Dereplicator+ [4]).

Here we present VarQuest+, a database search tool capable of identifying novel variants of a wide range of known SMs, including polyketides, alkaloids, flavonoids, saponins, and many others. Algorithmic and software innovations make VarQuest+ much more efficient in running time and memory consumption than existing analogs. This efficiency allowed the implementation of a modification-tolerant search mode in VarQuest+, which is more challenging than a regular database search.

We benchmarked VarQuest+ on a Korean medicinal plants dataset (2.5 million mass spectra collected from 337 samples). The standard search of the KNApSAcK database (51,179 plant SMs [5]) resulted in the identification of 349 compounds. The VarQuest+ modification-tolerant search identified 4,253 SMs, an order of magnitude more than Dereplicator+. Using the same search parameters, VarQuest+ is twenty times faster than Dereplicator+ and four times more memory efficient.

The reported study was funded by RFBR, project number 20-04-01096.

References
[1] Cragg, G. M., & Newman, D. J. (2013) Natural products: a continuing source of novel drug leads. Biochimica et Biophysica Acta (BBA)-General Subjects, 1830(6), 3670-3695.
[2] Wang, M. et al. (2016) Sharing and community curation of mass spectrometry data with Global Natural Products Social molecular networking. Nat. Biotechnol., 34, 828.
[3] Gurevich, A. et al. (2018) Increased diversity of peptidic natural products revealed by modification-tolerant database search of mass spectra. Nat. Microbiol., 3, 319.
[4] Mohimani, H. et al. (2018) Dereplication of microbial metabolites through database search of mass spectra. Nat. Commun., 9, 4035.
[5] Afendi, F. M. et al. (2012) KNApSAcK Family Databases: Integrated Metabolite–Plant Species Databases for Multifaceted Plant Research. Plant and Cell Physiology, 53(2), e1.

Speakers

Alexey Gurevich

Senior Research Scientist, Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia

I am leading Natural Product Discovery research direction at CAB (http://cab.spbu.ru/research/antibiotics-discovery/). Together with the Center for Computational Mass Spectrometry at UCSD and Mohimani Lab at Carnegie Mellon University, we are creating software for identification of... Read More →

VarQuest+ Gurevich BiATA2020 pdf

Tuesday July 28, 2020 18:15 - 18:20 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:20 MSK

A zero inflated log-normal model for inference of sparse microbial association networks

The advent of metagenomics has prompted the development of efficient taxonomic profiling methods that measure the abundance of organisms in a wide range of environments. Multivariate abundance data further have the potential to enable inference of associations between microbial populations, but several technical issues need to be accounted for, such as the compositional nature of the data and their extreme sparsity.

The ecological network reconstruction problem is frequently cast into the paradigm of Gaussian graphical models (GGMs), for which efficient structure inference algorithms are available. Unfortunately, GGMs cannot properly account for the extremely sparse patterns occurring in real-world datasets. In particular, structural zeros corresponding to true absences of biological signals fail to be properly handled by most statistical methods.

We present here a zero-inflated log-normal graphical model specifically aimed at handling such "biological" zeros, and demonstrate significant performance gains over state-of-the-art statistical methods for the inference of association networks.
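
To make the idea of a zero-inflated log-normal concrete, the sketch below treats a single taxon in isolation: with probability `pi_zero` an abundance is a structural zero, otherwise it follows a log-normal distribution. This univariate simplification, the function names and the parameter names are assumptions for illustration; the graphical model presented in the talk couples many taxa and is not reproduced here.

```python
import math


def zi_lognormal_loglik(values, pi_zero, mu, sigma):
    """Log-likelihood of non-negative abundances under a zero-inflated log-normal.

    With probability pi_zero an observation is a structural zero;
    otherwise it is drawn from LogNormal(mu, sigma).
    """
    if not 0 < pi_zero < 1 or sigma <= 0:
        raise ValueError("need 0 < pi_zero < 1 and sigma > 0")
    total = 0.0
    for x in values:
        if x == 0:
            total += math.log(pi_zero)
        else:
            z = (math.log(x) - mu) / sigma
            total += (math.log1p(-pi_zero) - math.log(x) - math.log(sigma)
                      - 0.5 * math.log(2 * math.pi) - 0.5 * z * z)
    return total


def fit_zi_lognormal(values):
    """Closed-form maximum-likelihood estimates (pi_zero, mu, sigma)."""
    logs = [math.log(x) for x in values if x > 0]
    pi_zero = 1 - len(logs) / len(values)
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return pi_zero, mu, sigma
```

In this simplified setting the zero fraction and the log-normal parameters separate, so the estimates are just the share of zeros and the mean and standard deviation of the logged non-zero values.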

Speakers

Vincent Prost

CEA

Tuesday July 28, 2020 18:20 - 18:25 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:25 MSK

Single-cell ChIP-seq imputation with machine learning models leveraging bulk ENCODE data

Next generation sequencing is routinely used in biomedical research and the pharmaceutical industry. Applied in combination with chromatin immunoprecipitation (ChIP-seq), it provides detailed insights into genomic properties of the cell, such as chromatin accessibility and protein-DNA interactions, that play a key role in gene regulation and chromatin structure (ENCODE project consortium, 2012). Recently developed assays for single-cell ChIP-seq (scChIP-seq) enable the characterization of these molecular events at single-cell resolution. This allows the investigation of cell differentiation processes that are of crucial interest in many research fields, especially in cancer studies. Although the sequencing coverage can be as low as 1000 reads per single cell (Rotem, Assaf, et al. 2015), it was nevertheless possible to investigate relationships between drug-sensitive and resistant breast cancer cells (Grosselin, Kevin, et al. 2019). Such findings would not have been possible with bulk ChIP-seq data. However, the sparsity caused by the low signal obtained from an individual cell hampers further investigation, and there is a need for a dedicated imputation method for scChIP-seq. Furthermore, past publications based on sparse datasets from single-cell RNA-seq, which is more established, demonstrate that imputation methods strongly enhance research on such data (Peng, Tao, et al. 2019). Ultimately, the full potential of future scChIP-seq studies will not be captured without a dedicated imputation method to complete the data. To address this need we developed SIMPA, an algorithm for Single-cell chIp-seq iMPutAtion.

Based on a large collection of more than 2250 preprocessed bulk ChIP-seq datasets from the ENCODE data portal, SIMPA leverages statistical patterns within a reference set specified by the target, that is, the histone mark or transcription factor investigated in the scChIP-seq experiment. The existence of those patterns was confirmed by a cross-validation analysis of classification models. For a single cell, SIMPA trains numerous (~120,000 at 5 kb resolution) machine learning models to impute missing genomic regions while remaining sensitive to the sparse signal of the individual cell. Compared to another imputation strategy (Xiong, Lei, et al. 2019) that does not involve reference bulk data, SIMPA achieves better clustering by cell type. Using a KEGG pathway enrichment tool (Li, Shaojuan, et al. 2019), we could show that functionally related pathways were recovered in a cell-type-specific manner, but only on imputed results from SIMPA. Finally, randomization tests confirmed that both the single-cell signal and the target-specific reference data are used by SIMPA to achieve these meaningful imputations.

Our new imputation algorithm was validated on a set of more than 2600 B-cell and T-cell single cells for two different histone marks: H3K4me3 and H3K27me3 at 5kb and 50kb resolution, respectively. Indeed, this is so far the only scChIP-seq dataset available for human cells. In order to efficiently use resources, SIMPA was implemented with an MPI interface to distribute the computations to many cores possibly from different compute nodes. Software is available at https://github.com/salbrec/SIMPA

In conclusion, to address problems related to data sparsity in single-cell ChIP-seq, we developed the first dedicated imputation method that generates accurate and biologically relevant results.

Speakers

Steffen Albrecht

PhD Student, Johannes Gutenberg University Mainz

Hello, my name is Steffen Albrecht and I am from Mainz in Germany. Currently, I am a PhD student in the group Computational Biology and Data Mining and my main topics are machine learning and bioinformatics data integration. The application fields are imputation, e.g. for sparse data...

Tuesday July 28, 2020 18:25 - 18:30 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:30 MSK

Preliminary Analysis of Resistome in Mycobacterium abscessus

Mycobacterium abscessus (Mab), a complex of rapidly growing non-tuberculous mycobacteria, causes human infections that are difficult to treat because of its resistance to multiple antibiotics. The whole genome sequences of 1,581 Mab, downloaded from the NCBI FTP site, were used to infer phylogenetic relationships and investigate the resistome in silico. A total of 2,975 putative protein sequences of resistance genes from 32 distinct drug classes were detected using the Comprehensive Antibiotic Resistance Database (CARD) and ARG-ANNOT databases. The most abundant resistance genes detected were related to beta-lactams (1,962 genes), aminoglycosides (258 genes) and fluoroquinolones (205 genes). These genes encoded (i) many multidrug efflux pumps, such as a homolog of Pseudomonas aeruginosa MexAB-OprM involved in resistance to macrolides, fluoroquinolones, monobactams, carbapenems, cephalosporins, cephamycins, penams, tetracyclines, peptides, aminocoumarin, diaminopyrimidines, sulfonamides, phenicols and penems; (ii) different types of beta-lactamases, for instance KPC-type beta-lactamases that decrease susceptibility to monobactams, carbapenems, cephalosporins and penams; and (iii) various transferases, such as a homolog of mph(B) phosphotransferase from Escherichia coli that decreases susceptibility to macrolides. These findings give insight into the mechanisms of antibiotic resistance in Mab, especially resistance to the antibiotics commonly used to treat Mab infections.

Speakers

Shay Lee Chong

Faculty of Information Science and Technology, Multimedia University, Melaka, Malaysia

Tuesday July 28, 2020 18:30 - 18:35 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:35 MSK

Genome-wide inference of bacterial transcription factor binding sites: new method and its applications

None of the current bacterial genome annotation pipelines handles regulatory sequences. Transcription factor binding sites (TFBSs, or operators) are the most abundant regulatory elements and are critical for understanding genome function, yet methods for their fast genome-wide inference are currently lacking.
The method of bacterial TFBS inference we are developing is based on the analysis of 3D structures of transcription factor (TF)-operator complexes. We use TF residues contacting DNA bases as a tag (CR-tag) to link TFs with their operators. TFBSs can be inferred genome-wide via either (1) a fast automated CR-tag based genome scan with a library of CR-tagged experimentally characterised TFBS motifs or (2) a slow semi-automated de novo TFBS inference protocol combining CR-tag information with genome structure analysis.
The first approach allows reliable transfer of regulatory information between different species, not necessarily closely related. Even distantly related TFs of Gram-negative and Gram-positive bacteria can have the same CR-tags and hence recognise the same operators. However, direct regulatory information transfer is most efficient within the same taxonomic order (e.g. over 50% of TF orthologue pairs within Enterobacteriales have identical CR-tags).
The de novo protocol builds upon the well-established phylogenetic footprinting approach, replacing the assumption that similar TFs recognise similar operators with a strict 3D-structure based criterion (CR-tag), and is universally applicable to any bacterial species.
We illustrate the following applications of our approach:
1) Correcting poorly defined motifs.
For most TFs in a given species, just one or very few targets exist and proper TFBS models cannot be built. With our de novo TFBS inference protocol, orthologous operator sequences can be collected from other species that have TFs with the same CR-tag. This usually provides enough information for properly defining the motif and building a high-quality operator model. This approach can vastly improve the usability of the data from single-organism TFBS databases like RegulonDB.
2) Resolving regulation details for paralogous TFs.
Using our CR-tag based approach and experimental evidence, we show that paralogous quorum-sensing regulators in Pectobacterium spp. recognise the same operator sequence, although completely different operators have been suggested previously.
3) The advantages of full-scale genome-wide TFBS inference.
With the current collection of TFBS profiles, a genome-wide scan finds operators for the majority of transcription units in a typical enterobacterial genome. This helps to reveal unexpected regulators for many transcription units and allows regulatory cascades to be deciphered. We will provide examples of such inferred transcriptional cascades supported by experimental data.
4) A genome-wide TFBS scan can also be useful when correcting automated genome annotation, since finding an operator for a well-characterised TF can suggest functions for the downstream genes (doi:10.7717/peerj.2056).
The TFBS inference method described here has been added to version 2 of our existing application for TFBS analysis, which, together with a collection of TFBS profiles, is available at github.com/nikolaichik/SigmoID.
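The CR-tag lookup behind the fast scan can be pictured with a small sketch. It is only an illustration under stated assumptions, not the SigmoID code: the function names, the 0-based contact positions, the toy protein sequences and the motif identifier are all hypothetical, and the contact positions are assumed to be already known from an alignment to a TF with a solved 3D structure.

```python
from typing import Dict, List, Sequence


def cr_tag(tf_sequence: str, contact_positions: Sequence[int]) -> str:
    """Build a tag from the residues at DNA-contacting positions (0-based)."""
    return "".join(tf_sequence[i] for i in contact_positions)


def motifs_for_tf(
    tf_sequence: str,
    contact_positions: Sequence[int],
    library: Dict[str, List[str]],
) -> List[str]:
    """Return motif identifiers whose CR-tag matches this TF exactly."""
    return library.get(cr_tag(tf_sequence, contact_positions), [])


# Hypothetical library: CR-tag -> names of characterised TFBS motifs.
toy_library = {"RTL": ["motif_1"]}

# Two toy TFs that differ overall but share the contacting residues.
print(motifs_for_tf("MKRSTQLNE", [2, 4, 6], toy_library))
print(motifs_for_tf("AARATALGG", [2, 4, 6], toy_library))
```

Both toy proteins map to the same tag and therefore to the same motif, mirroring the point that distantly related TFs with identical CR-tags are expected to recognise the same operators.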

Speakers

Yevgeny Nikolaichik

Associate Professor, Belarusian State University

Tuesday July 28, 2020 18:35 - 18:40 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

18:40 MSK

Speakers

Yugo Lima-Melo

Postdoctoral researcher, Universidade Federal do Rio Grande do Sul

Tuesday July 28, 2020 18:55 - 19:00 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09

Q & A: Talks

19:00 MSK

Assembly and Annotation of Ashkenazi Reference Genome

We describe the assembly and annotation of a new, population-specific human reference genome. We used publicly available data for the HGP HG002 individual from an Ashkenazi Jewish trio, available from the Genome In A Bottle (GIAB) project. The new reference, which we call Ash1, is more complete than the human reference GRCh38. While GRCh38 is a mosaic of five different individual genomes, our reference represents a single individual. The Ashkenazi reference genome has 2,973,118,650 nucleotides placed on the chromosomes, compared to 2,937,639,212 in GRCh38. We annotated the genome by transferring the CHESS annotation from the GRCh38 genome. The new annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Alignment of DNA sequences from an unrelated part-Ashkenazi (~70%) individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.
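The figures quoted in the abstract can be put side by side with a few lines of arithmetic. The variable names are illustrative; the numbers are taken directly from the text above.

```python
# Nucleotides placed on chromosomes, as stated in the abstract.
ash1_bases = 2_973_118_650
grch38_bases = 2_937_639_212
extra_bases = ash1_bases - grch38_bases

# Protein-coding genes in the Ash1 annotation.
genes_total = 20_157
genes_near_identical = 19_563  # >99% identical to GRCh38 counterparts
share_near_identical = genes_near_identical / genes_total

print(f"{extra_bases:,} more bases placed on chromosomes in Ash1")
print(f"{share_near_identical:.1%} of protein-coding genes are >99% identical")
```

The difference is about 35.5 million bases, and roughly 97% of the annotated protein-coding genes are near-identical to their GRCh38 counterparts.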

Speakers