Software tools for genomics tertiary analysis

Disclaimer: I work for Paradigm4, the creators of SciDB. The views expressed in this blog are my own.

At Paradigm4 / SciDB, we do a lot of work on genomics data e.g.

This post is my brief subjective review of other genomics solutions.

Mainly, I was interested in knowing more about

available open-source tools for tertiary analysis in genomics (see image below); my notes from a web search are captured below
specialized resources for TCGA

Tertiary analysis tools out there

This Cloudera blog post provides a good explanation of tertiary analysis

three-levels-of-analysis

GATK and similar tools (?) are used for primary and secondary analysis, while (according to this Reddit post) the following tools are used for tertiary analysis (in alphabetical order)

ADAM (U.C. Berkeley)
GenomicsDB / TileDB (Broad Institute and Intel)
GQT (U. Utah)
Hail (Broad Institute) (successor to PLINK / SEQ)
SciDB (Paradigm4)

Some observations about these tools

Hail (from the Broad Institute) is the successor to PLINK (Harvard) [Link], the last version of which was released in 2014 [Link]
As of March 2018, GenomicsDB/TileDB was not integrated with Hail [Link]. But that might change; both tools are from the Broad Institute.
Found references to PCA on genomic data in multiple cases:
- Paper that used PLINK/SEQ for PCA [Link]. Also see footnote 1.
- Blog post from Cloudera that used Hail on Spark.
Found links on Hail integration with Databricks and Cloudera
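
For readers who have not seen PCA applied to genotype data, here is a minimal sketch of the idea in plain Python. It is not how PLINK or Hail implement PCA; those tools work at a very different scale and with their own normalizations. The toy genotype matrix (rows are samples, columns are variants, values are alternate-allele counts 0/1/2) is made up for illustration, and the function names `center_columns` and `first_principal_component` are hypothetical. The sketch uses power iteration to find only the first principal component.

```python
import math
import random


def center_columns(genotypes):
    """Subtract each variant's mean allele count from its column."""
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n_samples
             for j in range(n_variants)]
    return [[row[j] - means[j] for j in range(n_variants)]
            for row in genotypes]


def first_principal_component(genotypes, iterations=200, seed=0):
    """Return (scores, loadings) for PC1 using power iteration.

    scores   : one value per sample (its projection onto PC1)
    loadings : one value per variant (unit-length direction of PC1)
    """
    x = center_columns(genotypes)
    n_samples, n_variants = len(x), len(x[0])
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n_variants)]

    for _ in range(iterations):
        # Multiply v by X^T X in two steps: scores = X v, then w = X^T scores.
        scores = [sum(x[i][j] * v[j] for j in range(n_variants))
                  for i in range(n_samples)]
        w = [sum(x[i][j] * scores[i] for i in range(n_samples))
             for j in range(n_variants)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # degenerate input (no variation); keep the current v
        v = [c / norm for c in w]

    scores = [sum(x[i][j] * v[j] for j in range(n_variants))
              for i in range(n_samples)]
    return scores, v


# Made-up toy data: 6 samples x 5 variants, allele counts in {0, 1, 2}.
# The first three samples and the last three have different allele
# frequencies, mimicking two ancestry groups.
genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
]

scores, loadings = first_principal_component(genotypes)
for sample_index, score in enumerate(scores):
    print(sample_index, round(score, 3))
```

On this toy input the two groups of samples end up on opposite sides of zero along PC1, which is the kind of population structure PCA is typically used to expose. The sign of a principal component is arbitrary, so which group lands on the positive side carries no meaning.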

TCGA tools

TCGA-assembler is a set of user-friendly R functions to download data from the TCGA Firehose at the Broad. Their paper is quite popular, with 158 citations since 2014.
The XenaBrowser allows beautiful visualizations across multiple genomic and phenotypic data types. [Link].

xenabrowser snapshot

Aside

A beautiful cartoon explanation of GWAS from the Broad, using more images like the one below:

Footnotes

From Rivas, Manuel A., et al. “Insights into the genetic epidemiology of Crohn’s and rare diseases in the Ashkenazi Jewish population.” PLoS genetics 14.5 (2018)

“Principal Component Analysis (PCA) was done in each ancestry group using the 21,066 variants. Sample QC was done using the Hail software while PCA, differential missingness and sample relatedness analysis was done using PLINK. Hail is an open-source software framework for scalably and flexibly analyzing large-scale genetic data sets (https://github.com/broadinstitute/hail). Allele balance was calculated using PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq/)”

NOTE

I have only used thumbnails of images so that interested readers follow the hyperlinks to the original source. If you feel I am still violating some copyright terms, please let me know in the comments below, or via other channels on my home page.