
Common Genomics Tools

Bioinformatics programmers rely on a growing array of tools written by research scientists. These tools were developed for diverse purposes and in different languages, and they vary in efficiency and reliability. Over time, the bioinformatics community has built up resources to identify the best packages and to handle any quirks or deficiencies they may have. Online resources like GitHub, StackOverflow, and Biostars answer questions and offer community support. Further, developers have provided new tools and methods to ease common problems.

A common problem is installation and dependency conflicts. While there is no perfect system for coordinating the installation of software, Bioconda comes close. Bioconda is a channel of bioinformatics software "recipes" that are built into conda packages. Conda is a language- and platform-independent package manager designed to simplify the distribution, installation, and management of software in stand-alone virtual environments, which isolate each project's versions and dependencies to avoid conflicts. Within a conda environment, end users can also install packages from other repositories, such as PyPI (via pip), CRAN, CPAN, and Bioconductor. We strongly recommend the use of conda and Bioconda, since they are emerging as the modern way to install and use genomics tools.

Common Tools, Packages, & Ecosystems

There exist thousands of tools utilized in bioinformatics and genomics. Some are very esoteric and used only by a few researchers. Others are much more ubiquitous. For some tasks, there may be dozens of options; for others, there may be only one. The following is a list of common tools that are used extensively and well-regarded as reliable, organized by generic types of tasks. This list provides only a starting point to learn about common tools.

Sequence Alignment

One of the earliest challenges of genomics was the task of sequence alignment. Sequence alignment at its most basic looks for similarity between two or more sequences. This is done in a high-throughput manner with tens of millions of reads in next-gen sequencing assays such as exome-seq, RNA-seq, ChIP-seq, and more.
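To make "looking for similarity" concrete, here is a minimal sketch of an exact local alignment score computed by dynamic programming (the Smith-Waterman approach). The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not parameters taken from any tool listed on this page. Production aligners use far faster, indexed methods.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are arbitrary illustrative choices (assumption).
    """
    # prev holds the previous row of the dynamic-programming matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Introduce a gap in one sequence or the other.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, so never drop below 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The cost grows with the product of the two sequence lengths, which is why heuristic and index-based aligners are needed when tens of millions of reads are involved.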

| Tool | Description | Use Cases |
|---|---|---|
| BLAST (web, CLI) | Basic Local Alignment Search Tool. BLAST finds regions of biological similarity between a query sequence and a database of sequences. It comes in several forms that handle DNA, RNA, or protein sequences, and it was developed to make alignment fast enough for large-scale database scans. BLAST uses a heuristic that approximates slower, exact methods, so it is not guaranteed to find the best alignment and may miss potential alignments. It first finds short subsequences, or "words", in the query sequence, ignoring low-complexity words. It then uses the best words as "seeds": only database sequences that contain an exact seed match are retained and tested, which greatly reduces the search space and speeds up the search. For each matched sequence, the region around the seed is extended in both directions and scored for similarity with the query sequence. Extension continues while the overall score increases and stops when it does not. BLAST retains the best score for each database sequence and reports those that pass a statistical test. BLAST has been used extensively to identify genes with similar functions across multiple species. | Flexible sequence alignment against a database of sequences |
| BLAT (web, CLI) | The BLAST-like Alignment Tool (BLAT) works like BLAST but is designed to search against genomes rather than discrete sequences. BLAT looks for short perfect or near-perfect matches and works well for sequences with a high degree of similarity. It gets its speed from an index of the genome or database that it keeps in memory. Compared to BLAST, BLAT combines more high-scoring segment pairs (HSPs) into larger alignments. | Identification of near-perfect matches for a query in a genome |
| BWA | Though foundational in the field, BLAST and BLAT were not designed for the large-scale alignment of millions of short sequences to a reference genome. The Burrows-Wheeler Aligner (BWA) was developed for exactly this task. It relies on the Burrows-Wheeler Transform (BWT), a reversible permutation of a string's characters derived from the sorted order of the string's rotations. The transformed text supports a space-saving index that speeds up alignment so that large-scale genomic studies can be completed in a timely manner. The reference genome is transformed once and then reused for every alignment task. BWA includes methods for mapping single-end and paired-end reads and is commonly used for short-read (70bp to a few kilobases) DNA mapping. | Short-read alignment (70bp to a few kb): WGS, WES, ChIP-seq, etc. |
| STAR | STAR is an RNA-seq aligner for transcripts that incorporates information about splicing. STAR uses knowledge about the location of exons when indexing the genome to refine the search for transcript reads. As a result, reads that span two exons across a splice site can be mapped accurately, which not all short-read aligners can do. | RNA-seq, spliced read alignment |
| minimap2 | Minimap2 is a versatile aligner for DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rates up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%. | Long, error-prone read alignment (PacBio, Nanopore, Iso-Seq, direct RNA) |
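The reversible transform behind BWA can be illustrated with a toy example. The sketch below builds the Burrows-Wheeler Transform by sorting all rotations of a string and then inverts it. It is a teaching-sized illustration under simplifying assumptions (a "$" end marker that sorts before every other character, and quadratic memory use), not how BWA itself constructs or stores its index. The function names are hypothetical.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy implementation)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    # Build every rotation of the string and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text from its transform (naive method)."""
    n = len(transformed)
    table = [""] * n
    # Repeatedly prepend the transformed column and re-sort; after n rounds
    # the table holds every rotation of the original string in sorted order.
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The rotation that ends with the terminator is the original string.
    for row in table:
        if row.endswith(terminator):
            return row[:-1]
    raise ValueError("terminator not found in transformed text")


if __name__ == "__main__":
    encoded = bwt("GATTACA")
    print(encoded, inverse_bwt(encoded))
```

The transform tends to group identical characters together, which helps compression, and real aligners pair it with auxiliary tables so that a read can be searched without ever reconstructing the rotations.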

File/Data Manipulation

Data munging is an unavoidable task in genomics, and large data sets in complex formats don't make it any easier. A solid grasp of basic command line utilities such as sed, awk, cut, and grep is a necessity, but there are also utilities that perform common manipulations of standardized data formats. See an incomplete list of such tools below.

| Tool | Description | File Formats | R/Python Libraries |
|---|---|---|---|
| Samtools | A set of utilities for common operations on SAM/BAM files. Samtools has functions to view, sort, index, merge, and assess the quality of the aligned sequences contained in a SAM/BAM file. It can also convert between SAM, BAM, and CRAM files. | SAM/BAM/CRAM | Rsamtools (R), pysam (Python) |
| tabix | A generic tool for indexing tab-delimited text files, allowing efficient seek and retrieval. Tabix was written by the author of samtools and can search through bgzip-compressed files. The indexes it produces allow genome viewers to display local regions efficiently. Tabix indexes sorted tab-delimited data into large intervals called bins. A search for records begins by looking in the bins that overlap the query interval. | GFF/BED/SAM/VCF | Rsamtools (R), pysam (Python) |
| bcftools | A set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart, BCF. All commands work transparently with both VCFs and BCFs, whether uncompressed or BGZF-compressed. | VCF/BCF | Rsamtools (R), pysam (Python) |
| bedtools | A set of tools for manipulating genomic interval data. It is designed to be used in pipelines and is commonly used to manipulate BED, BAM, GTF/GFF, and VCF files. bedtools has functions to intersect, merge, subtract, slop, window, and sort genomic intervals. It also has functions to calculate the distance between intervals, the coverage of intervals, and the number of intervals in a file. | BED/BAM/GTF/GFF/VCF | bedr (R), pybedtools (Python) |
| bedops | A toolkit for flexible manipulation of BED-like files. In particular, it has a variety of set operations that can be applied to any number of BED inputs and piped together to create complex manipulations. It is also highly memory efficient and can therefore complete operations that other tools may struggle with. | BED | bedr (R) |
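To show what interval operations such as merge and intersect mean in practice, here is a small pure-Python sketch. It assumes BED-style coordinates (0-based, half-open) held as (chrom, start, end) tuples, and it treats intervals that merely touch as mergeable. The function names are hypothetical, and the code illustrates the concepts only; it is not the implementation used by bedtools or bedops, which stream sorted files and scale to far larger inputs.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    # Sorting groups intervals by chromosome, then by start coordinate.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and no gap: extend the previous interval.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect_intervals(a, b):
    """Return the overlapping segments between two interval lists.

    Uses a simple all-pairs comparison, which is fine for small inputs.
    """
    overlaps = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            # Half-open coordinates: an overlap needs at least one base.
            if chrom_a == chrom_b and start < end:
                overlaps.append((chrom_a, start, end))
    return overlaps


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20)]
    print(merge_intervals(peaks))
    print(intersect_intervals(peaks, [("chr1", 180, 250)]))
```

For real data sets, the dedicated tools or their R and Python wrappers listed in the table are the better choice, since they handle file parsing, sorting requirements, and memory limits.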

Future Tools

  • Picard
  • Python libraries
    • Biopython
  • R libraries
    • Bioconductor
  • Perl libraries
    • BioPerl