Tuesday, July 28 • 11:00 - 11:05
Installing and Searching BLAST Databases in a Data Science Framework

Data science embodies a pipeline of processes: acquisition, cleaning and organization of data, quality control and assurance, validation, and downstream visualization and analytics. Because of the overwhelming number of tools for each of these steps, the greatest challenge is often making those tools work in concert to facilitate a thorough and insightful analysis.
The BIRCH system (http://home.cc.umanitoba.ca/~psgendb/) is a framework consisting of hundreds of bioinformatics tools, unified through the BioLegato family of programmable graphical applications. Each BioLegato application represents a specific class of biological objects, packaging together the data and the methods for that class. We describe BioLegato applications for BLAST searches that implement data science principles. For example, in blncbi the user retrieves sequences from NCBI using a graphical Entrez query builder. Amino acid sequences matching the query pop up in blprotein, a BioLegato application that displays proteins and lets the user run protein-specific tasks. A protein can be selected for a BLAST search, and the output appears in blpfetch, a BioLegato spreadsheet object for protein hits. The blpfetch spreadsheet makes it easy to scan hundreds of hits, refining the list into one or more subsets for retrieval. Sequences are retrieved to a new blprotein object for downstream analysis. Because each object is a separate window with a small screen footprint, the user has more of a sense of working directly with the data than in typical web interfaces.
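As a rough illustration of the "refine the hit list into a subset for retrieval" step, here is a minimal Python sketch. It is not BioLegato or blpfetch code. It assumes the hits are available as BLAST+ tabular output (`-outfmt 6`) with the twelve default columns; the function names, the thresholds, and the example rows are hypothetical and chosen only for illustration.

```python
import csv
import io

# Default column names of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text):
    """Parse tab-separated BLAST hits into a list of dicts with numeric scores."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        hits.append(hit)
    return hits


def refine(hits, max_evalue=1e-10, min_identity=40.0):
    """Return subject IDs of hits passing both thresholds, best bit score first.

    The default thresholds are arbitrary illustration values, not recommendations.
    """
    kept = [h for h in hits
            if h["evalue"] <= max_evalue and h["pident"] >= min_identity]
    kept.sort(key=lambda h: h["bitscore"], reverse=True)
    return [h["sseqid"] for h in kept]


# Made-up rows, for illustration only (not real search results).
EXAMPLE = (
    "query1\tsubjA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "query1\tsubjB\t35.0\t180\t117\t2\t5\t184\t3\t180\t1e-12\t70\n"
    "query1\tsubjC\t60.0\t150\t60\t1\t10\t159\t12\t160\t1e-40\t160\n"
    "query1\tsubjD\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30\n"
)

if __name__ == "__main__":
    # IDs that would be passed on to a retrieval step.
    print(refine(parse_hits(EXAMPLE)))
```

The list of IDs returned by `refine` plays the role of the subset the user would select in the spreadsheet before retrieving the sequences.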
BioLegato gives the user flexibility at all steps in a pipeline. Because the output of each step appears in a new BioLegato object, there are no dead ends. Output from one step can be used directly as input for subsequent steps because BioLegato takes care of things like file format conversion, which is a tedious and sometimes error-prone part of using tools at the command line. We call this process ad hoc pipelining. Ad hoc pipelining enables the user to learn from each step before going to the next. We also describe blastdbkit, a Python script run from BioLegato, for downloading and managing BLAST databases on the user's computer.
Together, these tools provide an integrated point-and-click pipeline for sequence database searches, within the context of the larger BIRCH system. New programs can be added to any BioLegato application by creating a file using BioLegato's PCD language, which specifies parameters to be set and a shell command to run the program. In this way, the core BIRCH functions can be integrated seamlessly with locally installed bioinformatics software.
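The idea of describing a tool by its parameters plus a shell command can be sketched in Python as follows. This is not PCD syntax, which the abstract does not show. The `TOOL` dictionary layout, the `build_command` helper, and the default values are hypothetical. Only the `blastp` options used (`-query`, `-db`, `-evalue`, `-outfmt`) are real BLAST+ flags.

```python
import shlex

# Hypothetical tool description: NOT PCD syntax, only the two ingredients the
# abstract names (parameters to be set, and a shell command to run).
TOOL = {
    "name": "blastp search",
    "params": {
        "query": {"default": "query.fasta"},
        "db": {"default": "swissprot"},
        "evalue": {"default": "1e-5"},
    },
    "command": "blastp -query {query} -db {db} -evalue {evalue} -outfmt 6",
}


def build_command(tool, **overrides):
    """Fill the command template with defaults plus any user-set values."""
    unknown = set(overrides) - set(tool["params"])
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    values = {name: spec["default"] for name, spec in tool["params"].items()}
    values.update(overrides)
    # Quote every value so spaces or shell metacharacters cannot break the command.
    quoted = {name: shlex.quote(str(value)) for name, value in values.items()}
    return tool["command"].format(**quoted)


if __name__ == "__main__":
    print(build_command(TOOL))
    print(build_command(TOOL, query="my protein.fa", evalue="1e-20"))
```

Keeping the description as data, separate from the code that runs it, is what would let a new program be added without changing the application itself.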

Brian Fristensky

Associate Professor, University of Manitoba
Research: Phylogenomics of plant-pathogen interactions; development of bioinformatics software
Teaching: Cytogenetics; Plant Biotechnology; Bioinformatics

Tuesday July 28, 2020 11:00 - 11:05 MSK
Zoom Conference https://zoom.us/j/94321101353?pwd=QlJBb09uM0NVVnVyK0FkbTJ3Nkcrdz09