Genomic File Formats

When working with genomics data for the first time, many engineers are surprised to find that genomics file formats are quite simple in structure. In fact, most of the widely used formats, such as SAM/BAM, VCF, BED, and GFF/GTF, along with many custom formats, are essentially tab-delimited text files with some form of compression and/or serialization. In this chapter, we cover the main file formats you are likely to encounter when working in the genomics domain, starting with common compression schemes.
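To make the "tab-delimited text plus compression" idea concrete, here is a minimal Python sketch using only the standard library. The helper name `read_tab_delimited`, the BED-like records, and the convention of skipping lines that start with `#` are illustrative assumptions, not part of any format specification. Real formats each have their own header rules.

```python
import gzip


def read_tab_delimited(raw_bytes):
    """Decompress gzip bytes and split each data line on tabs.

    Hypothetical helper for illustration: lines starting with '#'
    are treated as comments/headers and skipped.
    """
    text = gzip.decompress(raw_bytes).decode("utf-8")
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        rows.append(line.split("\t"))
    return rows


# Made-up BED-like records (chrom, start, end, name), for illustration only.
example = (
    "#chrom\tstart\tend\tname\n"
    "chr1\t100\t200\tfeature_a\n"
    "chr2\t150\t300\tfeature_b\n"
)

# Simulate a compressed file on disk by gzip-compressing the text in memory.
compressed = gzip.compress(example.encode("utf-8"))

rows = read_tab_delimited(compressed)
for row in rows:
    print(row)
```

Every field comes back as a string. Converting coordinates to integers, and interpreting them correctly, depends on the specific format.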

tip

  Mastery of Unix command line tools such as vim, grep, less, cut, join, sort, sed, and awk is essential to working effectively in this domain. Be sure to brush up on your Unix command line skills!