Large Scale Microbiome Profiling in the Cloud
Camilo Valdes, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
Vitalii Stebliankin, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
Giri Narasimhan, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
Bacterial metagenomics profiling for whole metagenome sequencing (WGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes and do not scale well to larger reference collections. Larger reference collections, however, can provide a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient, and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.
We developed Flint, a metagenomics profiling pipeline built on top of the Apache Spark framework and designed for fast, real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl. Flint runs on Amazon's Elastic MapReduce service and can profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 seconds, an order of magnitude faster than the state of the art while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour at an hourly cluster cost of $8.00 USD, without the need to store large quantities of intermediate alignments.
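The throughput and cost figures quoted above can be related with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of Flint: the function and constant names are hypothetical, and the only inputs are the numbers stated in the abstract (1 million reads in 67 seconds, 55 million reads per hour, $8.00 per cluster hour).

```python
# Figures quoted in the abstract; all names here are illustrative, not Flint's API.
READS_PROFILED = 1_000_000            # paired-end reads in the benchmark run
BENCHMARK_SECONDS = 67                # wall-clock time on 64 machines
SUSTAINED_READS_PER_HOUR = 55_000_000 # quoted streaming rate
CLUSTER_COST_PER_HOUR_USD = 8.00      # quoted hourly cluster cost


def reads_per_hour(reads: int, seconds: float) -> float:
    """Extrapolate a single timed run to an hourly mapping rate."""
    return reads * 3600 / seconds


def cost_per_million_reads(cost_per_hour: float, hourly_rate: float) -> float:
    """Cluster cost (USD) of profiling one million reads at a given hourly rate."""
    return cost_per_hour / (hourly_rate / 1_000_000)


benchmark_rate = reads_per_hour(READS_PROFILED, BENCHMARK_SECONDS)
cost = cost_per_million_reads(CLUSTER_COST_PER_HOUR_USD, SUSTAINED_READS_PER_HOUR)

print(f"Extrapolated benchmark rate: {benchmark_rate / 1e6:.1f} million reads/hour")
print(f"Cost at the sustained rate: ${cost:.3f} per million reads")
```

Extrapolating the 67-second run gives roughly 53.7 million reads per hour, in the same range as the quoted sustained rate of 55 million, and at that sustained rate the cluster cost works out to about $0.145 per million reads.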
Flint is open source software, available under the MIT License. Source code is available at https://github.com/camilo-v/flint. Supplementary materials and data are available at http://biorg.cs.fiu.edu.
Learning a Mixture of Microbial Networks Using Minorization-Maximization
Sahar Tavakoli, University of Central Florida, United States
Shibu Yooseph, University of Central Florida, United States
Motivation: The interactions among the constituent members of a microbial community play a major role in determining the overall behavior of the community and the abundance levels of its members. These interactions can be modeled using a network whose nodes represent microbial taxa and whose edges represent pairwise interactions. A microbial network is typically constructed from a sample-taxa count matrix that is obtained by sequencing multiple biological samples and identifying taxa counts. From large-scale microbiome studies, it is evident that microbial community compositions and interactions are impacted by environmental and/or host factors. Thus, it is reasonable to expect that a sample-taxa matrix generated as part of a large study involving multiple environmental or clinical parameters can be associated with more than one microbial network. However, to our knowledge, microbial network inference methods proposed thus far assume that the sample-taxa matrix is associated with a single network.
Results: We present a mixture model framework to address the scenario in which the sample-taxa matrix is associated with K microbial networks. This count matrix is modeled using a mixture of K Multivariate Poisson Log-Normal distributions, and parameters are estimated using a maximum likelihood framework. Our parameter estimation algorithm is based on the Minorization-Maximization principle combined with gradient ascent and block updates. Synthetic datasets were generated to assess the performance of our approach on absolute count data, compositional data, and normalized data. We also addressed the recovery of sparse networks based on an ℓ1-penalty model.
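For readers unfamiliar with the model, a mixture of K Multivariate Poisson Log-Normal distributions can be written out as follows. This is the standard textbook form of the distribution, not a formulation taken from the paper itself; the symbols (c_i for the latent component of sample i, Z_i for its latent Gaussian vector, p for the number of taxa, and π_k, μ_k, Σ_k for the parameters of component k) are notation assumed here for illustration.

```latex
\begin{align*}
c_i &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K), \qquad \pi_k \ge 0,\ \sum_{k=1}^{K} \pi_k = 1, \\
Z_i \mid c_i = k &\sim \mathcal{N}_p(\mu_k, \Sigma_k), \\
Y_{ij} \mid Z_i &\sim \mathrm{Poisson}\!\left(e^{Z_{ij}}\right), \quad j = 1, \dots, p \ \text{(independently)}, \\
p(y_i) &= \sum_{k=1}^{K} \pi_k \int_{\mathbb{R}^p} \prod_{j=1}^{p} \frac{e^{z_j y_{ij}} \, e^{-e^{z_j}}}{y_{ij}!} \; \mathcal{N}_p(z; \mu_k, \Sigma_k) \, dz .
\end{align*}
```

The integral over the latent vector has no closed form, which is one reason an iterative scheme that optimizes a tractable lower bound, such as Minorization-Maximization, is a natural fit. In formulations of this kind, the network for component k is commonly read from the sparsity pattern of the precision matrix Σ_k^{-1}; the abstract does not spell this out, so treat it as an assumption.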
TADA: Phylogenetic augmentation of microbiome samples enhances phenotype classification
Erfan Sayyari, University of California San Diego, United States
Siavash Mirarab, University of California San Diego, United States
Ban Kawas, IBM, United States
Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data is high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data is often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from the microbiome. Data augmentation, which consists of building synthetic samples and adding them to the training data, is a technique that has proved helpful for many machine learning tasks.
Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples that augment existing ones, addressing the low-sample-size problem. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes.
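To make the idea of augmenting an unbalanced training set concrete, here is a minimal sketch that balances classes by multinomial resampling of observed count vectors. It is deliberately not TADA's method: TADA's generative model accounts for phylogenetic relationships, whereas this stand-in ignores the phylogeny entirely. The function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def resample_counts(counts, rng):
    """Draw a synthetic sample with the same total count as `counts`,
    using the observed relative abundances as multinomial probabilities."""
    taxa = list(counts)
    total = sum(counts.values())
    weights = [counts[t] for t in taxa]
    drawn = Counter(rng.choices(taxa, weights=weights, k=total))
    return {t: drawn.get(t, 0) for t in taxa}


def balance_by_augmentation(samples, labels, seed=0):
    """Add synthetic samples to every under-represented class until all
    classes are as large as the largest one. Phylogeny is ignored here."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())

    aug_samples, aug_labels = list(samples), list(labels)
    for label, group in by_class.items():
        for _ in range(target - len(group)):
            template = rng.choice(group)  # pick a real sample to perturb
            aug_samples.append(resample_counts(template, rng))
            aug_labels.append(label)
    return aug_samples, aug_labels


# Toy, made-up data: three "healthy" samples and one "disease" sample.
samples = [
    {"taxonA": 50, "taxonB": 30, "taxonC": 20},
    {"taxonA": 45, "taxonB": 35, "taxonC": 20},
    {"taxonA": 55, "taxonB": 25, "taxonC": 20},
    {"taxonA": 10, "taxonB": 20, "taxonC": 70},
]
labels = ["healthy", "healthy", "healthy", "disease"]

aug_samples, aug_labels = balance_by_augmentation(samples, labels)
print(Counter(aug_labels))
```

After augmentation both classes contain three samples, and each synthetic sample keeps the sequencing depth of the real sample it was drawn from.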
Key Dates - Deadlines
Abstract Submissions (for talks and posters)
Call for Abstracts Opens: Thursday, January 31, 2019
Abstracts Submission Deadline: Thursday, April 11, 2019
Late Poster Submissions Open: Monday, April 15, 2019
Talk and/or Poster Acceptance Notifications: Thursday, May 9, 2019
Late Poster Submissions Deadline: Wednesday, May 15, 2019
Late Poster Acceptance Notifications: Thursday, May 23, 2019
Proceedings Submission Deadline: Monday, January 28, 2019
Conditional Acceptance Notification: Wednesday, March 6, 2019
Revised Papers Deadline: Friday, March 22, 2019
Final Acceptance Notification: Monday, April 8, 2019
Microbiome COSI Session: Tuesday, July 23, 2019