Disclaimer: I work for Paradigm4, the creators of SciDB. The views expressed in this blog are my own.
At Paradigm4 / SciDB, we do a lot of work on genomics data, e.g.
- Global biobank engine for UK Biobank data and other biobanks
- Paper by Manuel Rivas’ group
- REVEAL-Genomics API
- PCA in SciDB
- SciDB SPARK comparison
This post is my brief subjective review of other genomics solutions.
Mainly, I was interested in knowing more about
- available open-source tools for tertiary analysis in genomics (see image below); I capture some notes from my web search further down
- specialized resources for TCGA
Tertiary analysis tools out there
This Cloudera blog post provides a good explanation of tertiary analysis.
GATK and similar tools (?) are used for primary and secondary analysis, while (according to this Reddit post) the following tools are used for tertiary analysis (in alphabetical order):
- ADAM (U.C. Berkeley)
- GenomicsDB / TileDB (Broad Institute and Intel)
- GQT (U. Utah)
- Hail (Broad Institute) (successor to PLINK / SEQ)
- SciDB (Paradigm4)
Some observations about these tools
- Hail (from the Broad Institute) is the successor to PLINK (Harvard) [Link], the last version of which was released in 2014 [Link]
- As of March 2018, GenomicsDB/TileDB was not integrated with Hail [Link]. But that might change; both tools are from the Broad Institute.
- Found references to PCA on genomic data in multiple cases (one example, from Rivas et al., is quoted at the end of this list)
- Found links on Hail integration with Databricks, Cloudera
- TCGA-assembler is a set of user friendly R functions to download data from TCGA firehose at Broad. Their paper is quite popular with 158 citations since 2014.
- The XenaBrowser allows beautiful visualizations across multiple genomic and phenotypic data types. [Link].
A beautiful cartoon explanation of GWAS from the Broad, using more images like the one below:
- From Rivas, Manuel A., et al. “Insights into the genetic epidemiology of Crohn’s and rare diseases in the Ashkenazi Jewish population.” PLoS genetics 14.5 (2018)
“Principal Component Analysis (PCA) was done in each ancestry group using the 21,066 variants. Sample QC was done using the Hail software while PCA, differential missingness and sample relatedness analysis was done using PLINK. Hail is an open-source software framework for scalably and flexibly analyzing large-scale genetic data sets (https://github.com/broadinstitute/hail). Allele balance was calculated using PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq/)”
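For readers who have not seen PCA applied to genotype data, here is a minimal sketch of the general idea using NumPy. It is not how Hail or PLINK implement PCA, and it is not the pipeline from the paper quoted above. The function name `genotype_pca`, the synthetic two-group data, and the assumption that missing calls are already imputed are all my own, purely for illustration.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project samples onto the top principal components of a genotype matrix.

    genotypes: array of shape (n_samples, n_variants) holding alternate-allele
    counts (0, 1 or 2). Assumption: missing calls have already been imputed.
    """
    g = np.asarray(genotypes, dtype=float)
    # Center each variant (column) on its mean allele count.
    centered = g - g.mean(axis=0)
    # Drop monomorphic variants; they carry no information about structure.
    keep = centered.std(axis=0) > 0
    centered = centered[:, keep]
    # SVD of the centered matrix; u * s gives the sample coordinates.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Synthetic data (hypothetical): two groups with shifted allele frequencies.
rng = np.random.default_rng(0)
n_per_group, n_variants = 50, 200
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, n_variants), 0.05, 0.95)
group_a = rng.binomial(2, freq_a, size=(n_per_group, n_variants))
group_b = rng.binomial(2, freq_b, size=(n_per_group, n_variants))
genotypes = np.vstack([group_a, group_b])

scores, explained = genotype_pca(genotypes)
print(scores.shape)   # one row per sample, one column per component
print(explained)      # fraction of variance captured by each component
```

Real tools typically also normalize each variant by its expected variance and prune variants in linkage disequilibrium before running PCA; this sketch skips both steps to stay short.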
I have only used thumbnails of images so that interested readers can follow the hyperlinks to the original sources. If you feel I am still violating some copyright terms, please let me know in the comments below, or via other channels on my home page.