Right data for right patient: the FDA-CPTAC multi-omics mislabeling challenge

Speaker: Bing Zhang, Baylor College of Medicine
Time: 9:00 AM

Abstract: In biomedical research, sample mislabeling has been a long-standing problem contributing to irreproducible results and invalid conclusions. In a clinical setting, it may lead to incorrect medical treatment. Sample mislabeling is particularly prevalent in large-scale multi-omics studies, in which multiple omics experiments are carried out on a large number of samples at different time periods and/or in different labs. However, parallel data sets from different omics platforms also provide more information to identify and correct mislabeled samples than traditional single-omics studies. The National Cancer Institute and the Food and Drug Administration, in coordination with the DREAM Challenges, organized this computational challenge to develop algorithms that can accurately detect and correct mislabeled samples in multi-omics studies.

Integrated Analysis of Multi-omics Data for Clinical Use

Speakers: Renke Pan, Hanying Feng, Hong Chen (Best Performer team), Sentieon, Inc., Mountain View, CA
Time: 9:20 AM

Abstract: We present the modeling methodology that achieved top scores in the PrecisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge. Integrated analysis of proteomic and transcriptomic data may be used to gain additional insights into biological systems. Sample mislabeling detection and correction is one of the problems that could be tackled by such integrated analysis. In the clinical setting, each patient sample has thousands of measurements from omics data such as mass spectrometry and RNA-Seq, but the number of patient samples is usually much smaller. To address the “curse of dimensionality” phenomenon typically associated with these high-dimension low-sample-size datasets, we propose the combination of univariate screening for feature number reduction and regularized fitting for model complexity control. In Subchallenge 1, for mislabeled sample detection, we apply an ensemble approach, combining statistical inference models such as Least Absolute Shrinkage and Selection Operator (LASSO), Nearest Shrunken Centroid (NSC), and k-Nearest Neighbors (k-NN), to identify mismatched clinical and protein profiles. In Subchallenge 2, for sample mislabeling correction, we use the same machine learning ensemble methodology to give accurate predictions of clinical labels based on the samples’ protein and RNA abundance. In addition, to jointly analyze measurements from both mass spectrometry and RNA-Seq, we build regression models for each gene as a bridge to map the two data types to each other, which enables a novel definition of a distance matrix between proteomic and transcriptomic profiling data. The sample mislabeling correction problem is thus reformulated into an optimization problem of matching proteomic and transcriptomic data with the shortest distance. This method achieves perfect correction, assuring “right data for right patient”.
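
The matching formulation described in this abstract can be illustrated with a small sketch. It is not the team's implementation: the per-gene model is assumed here to be a simple least-squares line, the distance is assumed to be a squared error, and all names and toy values (fit_line, distance_matrix, best_matching, rna_train, and so on) are hypothetical. The minimum-distance matching is found by brute force over permutations, which only works for a handful of samples; a real data set would need an assignment solver such as scipy.optimize.linear_sum_assignment.

```python
from itertools import permutations

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def fit_gene_models(rna, prot):
    """One regression per gene (column), fitted on correctly paired samples."""
    genes = range(len(rna[0]))
    return [fit_line([r[g] for r in rna], [p[g] for p in prot]) for g in genes]

def distance_matrix(rna, prot, models):
    """dist[i][j]: squared error between the protein profile predicted
    from RNA sample i and the observed protein profile of sample j."""
    predicted = [[a + b * v for (a, b), v in zip(models, row)] for row in rna]
    return [[sum((u - w) ** 2 for u, w in zip(pred, obs)) for obs in prot]
            for pred in predicted]

def best_matching(dist):
    """Permutation p minimizing sum(dist[i][p[i]]); brute force, small n only."""
    n = len(dist)
    return min(permutations(range(n)),
               key=lambda p: sum(dist[i][p[i]] for i in range(n)))

# Hypothetical toy data: rows are samples, columns are two genes.
rna_train = [[1, 1], [2, 3], [3, 2], [4, 5]]
prot_train = [[3, 4], [5, 2], [7, 3], [9, 0]]
# Test set in which the protein profiles of samples 1 and 2 were swapped.
rna_test = [[0, 0], [5, 1], [2, 4]]
prot_test = [[1, 5], [5, 1], [11, 4]]

models = fit_gene_models(rna_train, prot_train)
matching = best_matching(distance_matrix(rna_test, prot_test, models))
print(matching)  # matching[i] is the protein sample assigned to RNA sample i
```

Any position where matching[i] differs from i flags a candidate mislabeling, and the matching itself proposes the correction.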

Identifying mislabeled samples in multi-omics datasets

Speakers: Anders Carlsson, Patrik Edén, Björn Linse, Mattias Ohlsson, Carsten Peterson (Best Performer team), Lund University, Sweden
Time: 9:40 AM

Abstract: We developed an algorithm for detecting mislabeled omics clinical samples based upon a combination of RNA/protein correlation analysis and classifiers trained on proteomic or RNA data to predict gender and microsatellite instability (MSI) status. The 80 samples in the training data were used to identify a subset of RNAs and their protein products that displayed high correlation in the correctly labeled samples. Correlation analysis in the test set using the identified list of reporters revealed a number of samples that were correctly paired with a high degree of certainty. By ranking reporter correlation in all remaining sample pairs, events of sample duplication, shifts and swaps could be identified. Finally, to determine whether the proteomic or transcriptomic sample had wrong clinical labels, classifiers were trained with random forests/multilayer perceptrons and used to estimate which correction (proteomic or transcriptomic) minimized the classification error. The analysis was facilitated by outlier detection using deep learning autoencoders and imputation procedures for potentially missing data.

While our algorithm identified and correctly adjusted all mislabeling events, it should be noted that this was to some degree facilitated by the reasonable limitations set by how labeling errors were introduced in the challenge data.
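
The reporter-correlation step in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: reporters are assumed to be genes whose RNA and protein levels have a Pearson correlation above a chosen threshold across training samples, values are assumed to be standardized per gene before samples are compared, and all function names and toy numbers are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def column(matrix, g):
    return [row[g] for row in matrix]

def select_reporters(rna, prot, min_r):
    """Genes whose RNA and protein levels correlate across paired samples."""
    return [g for g in range(len(rna[0]))
            if pearson(column(rna, g), column(prot, g)) >= min_r]

def zscore_columns(matrix, genes):
    """Rows restricted to `genes`, each gene standardized across samples."""
    cols = []
    for g in genes:
        c = column(matrix, g)
        m = sum(c) / len(c)
        sd = sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        cols.append([(v - m) / sd for v in c])
    return [list(r) for r in zip(*cols)]

def best_partners(rna, prot, reporters):
    """For each RNA sample, the protein sample with the highest
    correlation across the reporter genes."""
    zr = zscore_columns(rna, reporters)
    zp = zscore_columns(prot, reporters)
    return [max(range(len(zp)), key=lambda j: pearson(r, zp[j])) for r in zr]

# Hypothetical training data (rows: samples, columns: four genes).
rna_train = [[1, 5, 2, 3], [2, 3, 4, 1], [3, 4, 1, 4], [4, 1, 5, 2], [5, 2, 3, 5]]
prot_train = [[2, 6, 6, 2], [4, 4, 12, 5], [6, 5, 3, 3], [8, 2, 15, 4], [10, 3, 9, 1]]
# Hypothetical test data in which protein samples 0 and 1 were swapped.
rna_test = [[1, 4, 2, 3], [2, 1, 6, 1], [6, 2, 1, 2]]
prot_test = [[4, 2, 18, 1], [2, 5, 6, 3], [12, 3, 3, 2]]

reporters = select_reporters(rna_train, prot_train, min_r=0.8)
partners = best_partners(rna_test, prot_test, reporters)
print(reporters, partners)
```

A sample whose best partner is not itself is a candidate for a swap, shift, or duplication; deciding which data type carries the wrong label would still require the clinical-attribute classifiers mentioned in the abstract.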

The challenge and promise of interpreting RNA sequence to unravel tumor mysteries

Speaker: Josh Stuart, University of California, Santa Cruz
Time: 10:00 AM

Abstract: DNA sequencing of cancerous tissue has revealed a complex landscape of mutations. Several mutated genes represent the “usual suspects” known to drive the disease. Still, many other mutations are of unknown significance because they reside in the non-coding parts of the genome or occur in less well-studied genes. RNA sequencing can provide information about a tumor’s cells-of-origin and its activated signaling pathways, and thus promises to help distinguish driver from passenger events. Yet detecting and interpreting RNA alterations has remained a challenge.

My talk will describe efforts to interpret the disruptions of rewired circuitry in tumors using RNA sequencing data. A DREAM challenge was launched to find the most accurate computational approaches to detect fusion products and rare isoforms. The Cancer Genome Atlas and other large consortia have investigated integrative approaches to reveal subtypes and possibly new treatment avenues based on expression profiling and RNA signature analysis. Open questions remain about how best to deconvolute bulk tumor data to resolve tumor subclones and participating “normal” cells in the microenvironment, such as immune cell populations. I will describe a couple of examples of how RNA-based information has been used to find n-of-1 treatment options. However, the field is young and many exciting challenges remain for which we need new ideas, algorithms, and training datasets to translate omics information to the oncology clinic.

Overview of the DREAM SMC-RNA Isoform Quantification and Fusion Detection Challenge

Speaker: Kyle Ellrott, Oregon Health & Science University
Time: 11:00 AM

Abstract: The ICGC-TCGA DREAM Somatic Mutation Calling RNA Challenge (SMC-RNA) was launched as an assessment of the accuracy of methods to perform isoform detection and fusion-protein detection in cancer RNA-Seq data analysis. This evaluation was constructed using a two-phase approach, with initial benchmarking done on synthetically generated tumor sequences followed by sequencing of cell lines with spiked-in fusion constructs. Cloud computing and reproducibility were important aspects of this challenge. Participants worked on the NCI Cancer Genomics Cloud, creating entries described using Docker and CWL. This allowed participants to submit their methods in a form that could be moved and applied to new datasets. With these submitted methods, the evaluators were able to refine the benchmark and do additional interrogation of the methods, even after the challenge had closed. The winners of the fusion detection challenge have demonstrated that one of the most critical aspects of the analysis is eliminating false positives and spurious signals. Using the results of this benchmark, large consortium data sets with tens of thousands of RNA-Seq profiles can be scanned for fusion events with a level of specificity that was not previously possible.

Arriba: a computational tool to detect gene fusions from next-generation sequencing data in personalized oncology

Speaker: Sebastian Uhrig, German Cancer Consortium and Heidelberg University
Time: 11:20 AM

Abstract: Next-generation sequencing (NGS) is becoming a standard tool in clinical practice. It enables oncologists to stratify cancer patients based on the individual mutational profiles of their tumors and match them precisely to drugs targeting the driver mutations at hand. However, accurate detection of somatic aberrations from short reads remains a challenging bioinformatics problem. Gene fusions are especially hard to identify reliably from NGS data. Existing algorithms suffer from a high false positive rate and miss therapeutically relevant events. Moreover, they take a long time to deliver results, which delays decision making on a patient’s therapy.

We developed Arriba, a novel computational tool to identify gene fusions from RNA-Seq data. It implements a highly efficient algorithm based on the STAR aligner. Compared to alternative methods, Arriba has extraordinary sensitivity without sacrificing specificity, and reduces the runtime from many hours to just a few minutes.

Our method has been field-tested in DKTK/NCT MASTER, an NGS-guided personalized oncology program. We applied Arriba to tumors of pancreatic cancer patients seeking new treatment options after developing resistance to standard therapy. Arriba discovered recurrent driver fusions with the NRG1 gene in three KRAS wild-type tumors. Upon treatment with targeted drugs inhibiting NRG1 signaling, two patients exhibited partial responses and the third patient showed disease stabilization.

Fast and accurate cancer fusion transcript detection by STAR-Fusion

Speaker: Brian Haas, Broad Institute of MIT and Harvard
Time: 11:40 AM

Abstract: Genomic rearrangements often fuse genes together in an unnatural context that can disrupt or alter gene functions. In the case of a fusion disrupting a tumor suppressor or activating an oncogene, the fusion gene can become a potent driver of cancer. Evidence for such gene fusions can be detected from transcriptome sequencing, leveraging RNA-Seq with specialized software to search the sequencing data for evidence of chimeric gene products. Many algorithms and software tools have been developed over the last decade to leverage RNA-Seq for fusion transcript detection, but there has remained much room for improvement in runtime performance and fusion prediction accuracy. We developed STAR-Fusion as a fast and accurate method for fusion transcript detection, leveraging the speed and accuracy of the STAR aligner with fast identification and effective filtering of candidate fusion predictions. STAR-Fusion demonstrates fast and accurate fusion prediction in benchmarking with comparisons to popular alternative methods. The current version of STAR-Fusion integrates FusionAnnotator and FusionInspector to leverage current knowledgebases of cancer biology and enable further exploration and visualization of evidence supporting fusion transcripts.

NCI Cancer Research Data Commons

Speaker: Allen Dearry, Cancer Research Data Commons, NCI
Time: 1:20 PM

Abstract: As -omics and other sciences increase the volume of data collection, the need for big data solutions in biomedical research intensifies. Biomedical informatics has reached a turning point where key innovations in data storage and distribution such as compression algorithms, indexing systems, and cloud platforms must be leveraged. In addition to the data curation and storage needs of modern biomedical research, other challenges include development of robust analytical tools as well as infrastructure and funding models to support these efforts. As data generation expands, local storage and computational solutions become less feasible. Thus, NCI has set out to build the NCI Cancer Research Data Commons (NCI CRDC), a cloud-based infrastructure in support of data sharing, tool development, and compute capacity to democratize big data analysis and to increase collaboration among researchers. NCI has sponsored recent initiatives that serve as the foundation for the Cancer Research Data Commons: the Genomic Data Commons (GDC) and three Cloud Resources. In addition, NCI has recently announced plans to expand NCI CRDC to include proteomics, imaging, and animal model data and to develop semantic resources to ensure interoperability. Both current and planned NCI CRDC activities will be discussed.

Generation and benchmarking of a 1000 Genomes gVCF resource for variant calling

Speaker: Jonathan Pevsner, Kennedy Krieger Institute and Johns Hopkins School of Medicine
Time: 1:30 PM

Abstract: HaplotypeCaller and GenotypeGVCFs (joint genotyping) are scalable and accurate variant calling algorithms introduced by the Genome Analysis Toolkit (GATK) and implemented by the Sentieon DNAseq suite. In this model, genotyping accuracy improves with increasing number of samples. Genomic variant call format (gVCF) files are inputs into joint genotyping. However, there is currently no public gVCF resource. To facilitate variant discovery in whole genome sequencing (WGS) studies with limited numbers of samples, we generated 2,530 gVCFs from 1000 Genomes data using Sentieon best practices, which mirror those of GATK, on the Seven Bridges Genomics Cancer Genomics Cloud. We benchmarked these gVCFs using Genome in a Bottle consortium samples as gold-standard datasets and assessed variant calling metrics for three different depths of sequencing (10x, 30x, and 50x, obtained through down-sampling). Variant calling performance improved with increasing numbers of samples at low depth of sequencing for HG001. At 10x depth of sequencing, we observed a modest gain in F-measure for SNVs, and improvements for indels up to 30x depth of sequencing. Most of the improvement from joint genotyping could also be achieved by including parental genotypes, as observed in HG002 trio data. Using gVCFs of samples from a different geographic origin (e.g. YRI gVCFs in the joint calling of a CEU sample) yielded sub-maximal performance. Researchers may use a subset of the gVCFs we generated, matched for geographic origin, to obtain a modest increase in genotyping accuracy. This study was enabled by the NCI Cancer Research Data Commons project, and other supported projects will be discussed.
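
For readers unfamiliar with the F-measure used in this benchmark, the following minimal sketch shows how it follows from a comparison of called variants against a truth set. It assumes variants can be compared as exact (chromosome, position, ref, alt) tuples, which is a simplification: dedicated benchmarking tools also reconcile differing variant representations and score genotypes. The function name and the toy variants are hypothetical.

```python
def f_measure(called, truth):
    """Precision, recall and F1 of a call set against a truth set.
    Variants are compared as exact (chrom, pos, ref, alt) tuples."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy variants, not taken from any real sample.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "A"), ("chr2", 900, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 40, "G", "A"), ("chr3", 7, "A", "T")}

precision, recall, f1 = f_measure(called, truth)
print(precision, recall, f1)
```

Because F1 is the harmonic mean of precision and recall, a gain in F-measure from joint genotyping reflects more true variants recovered, fewer false calls, or both.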

Indexing Massive NIH Datasets with NCBI-led Hackathons

Speaker: Ben Busby, Computational Biology Branch, NCBI
Time: 1:50 PM

Abstract: Over the past three years, NCBI has run or been involved in 34 data science "tool" hackathons. In these hackathons, participants assemble into teams of five or six to work collaboratively for three days on pre-scoped projects of general interest to the bioinformatics community. On average, about 80% of teams produce an alpha or beta working prototype, and approximately 10% ultimately publish a manuscript describing their work, typically built on cloud infrastructure. Thus, NCBI hackathons have generated over 160 products, about 50% of which are stable and/or continue to be developed. Some of these are available online. In addition to the production aspect, hackathons provide an immersive learning environment and promote networking opportunities. Expanding this program, we have embarked on a series of "data" hackathons: extracting stable metadata out of large numbers of primary datasets, developing novel derived and synthetic reference sets from these massive data, and distributing not only flexible indices, but ideally software sufficient to allow external collaborators to index their own large supersets. We are laying the groundwork for indexing not only genomic data but also many other datatypes relevant to biomedical and biological analyses.

Science Driven By Space Biology Omics Data Utilizing NASA’s GeneLab Platform

Speaker: Afshin Beheshti, WYLE Labs, NASA Ames Research Center
Time: 2:10 PM

Abstract: Determining the biological impact of spaceflight through novel approaches is essential to reduce the health risks to astronauts on long-term space missions. The currently established health risks due to spaceflight reflect only known symptomatic and physiologic responses and do not reflect early onset of other potential diseases. Many unknown variables still need to be identified to fully understand the health impacts of the environmental factors in space. One method to uncover potential novel biological mechanisms responsible for health risks in astronauts is to use NASA’s GeneLab Data Systems. GeneLab is a public repository that hosts multiple omics datasets generated from space biology experiments, including experiments flown in space, simulated cosmic radiation experiments, and simulated microgravity experiments. This presentation will provide examples of analysis and novel hypothesis generation that are being produced with GeneLab datasets. These examples will include novel data and work being generated with various scientists around the world involved with GeneLab’s Analysis Working Groups (AWG), who are assisting with the development of pipelines and advancing GeneLab to the next phase; a publication from GeneLab discovering a novel carbon dioxide effect due to rodent habitats; a publication from GeneLab discovering a potential master regulator responsible for health risks associated with spaceflight; and a final publication addressing potential cardiovascular risk from space radiation. These examples will allow the general scientific community to start generating novel findings related to space biology and assist NASA with guiding future experiments.

NCI CPTAC Big and Open Data Projects

Speaker: Henry Rodriguez, National Cancer Institute
Time: 2:30 PM

Abstract: The overarching goals of the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) are to increase our understanding of tumor biology, accelerate the translation of new molecular findings through public resources, and support clinically-relevant research projects that elucidate biological mechanisms of response, resistance, and/or toxicity – all involving the integration of proteomics with genomics. To achieve these goals, CPTAC has two coordinated programs – a Tumor Characterization Program and a Translational Research Program, each of which makes data (genomics, transcriptomics, proteomics, and imaging) available to the public to maximize utility and benefit. This seminar will highlight the CPTAC program, discuss how genomics, transcriptomics, and proteomics must all be brought together in the quest to better understand the etiology of cancer, and how CPTAC’s big data sets are being applied to crowdsourced computational challenges to leverage new knowledge from its studies.

Utility of proteogenomics data in immunotherapy

Speaker: Bing Zhang, Baylor College of Medicine
Time: 2:40 PM

Abstract: Using proteogenomics data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), I will present a few examples of how proteogenomic integration can expand our knowledge of cancer genes, prioritize cancer drivers, clarify puzzling genomic observations, and correct misinterpreted gene functions. I will also discuss our recent work on using proteogenomics to identify tumor antigens and understand immune evasion mechanisms.

Goals and Challenges of the NCI Human Tumor Atlas Network

Speaker: Shannon Hughes, Division of Cancer Biology, NCI
Time: 3:00 PM

Abstract: The goal of the NCI Human Tumor Atlas Network (HTAN) is the creation of multi-dimensional maps of human tumors that describe key transitions during cancer, including the progression from pre-cancer to malignancy, from locally invasive to metastatic disease, and the dynamic response to therapy and development of drug resistance. The collection and analysis of comprehensive, multi-modal datasets that include high resolution spatial information about the tumor and its microenvironment will be key in building such maps. Several challenges are inherent to building tumor atlases, including the true integration of disparate data types, with a focus on visualization and modeling of data for use by varying audiences that include scientists, clinicians, and the general public.

Challenges and opportunities in the analysis of single-cell data generated from tumor specimens

Speaker: Ken Lau, Vanderbilt University Medical Center
Time: 3:10 PM

Abstract: The goal of the Human Tumor Atlas Network (HTAN) is to map critical transitions from pre-malignant to metastatic lesions using spatial and single-cell data. These data provide an opportunity to examine the tumor microenvironment, cancer cell subpopulations, and clonal behaviors at an unprecedented resolution, facilitating the deconvolution of intra- and inter-tumoral heterogeneity, which contributes significantly to cancer morbidity and therapeutic resistance. However, several challenges are present in these data, including sparse information content on a per-cell basis, significant batch effects, and the absence of longitudinal sampling of individual lesions. In preparation for HTAN, we present a case study in which we use single-cell analysis to examine the landscapes of benign and advanced murine colonic tumors. We will demonstrate the challenges associated with such data and our approaches to inferring mechanisms associated with transitions from benign to malignant lesions.

The Importance of Large and Systematic Perturbational Studies with LINCS program as an example

Speaker: Ajay Pillai, NHGRI
Time: 3:30 PM

Abstract: The Library of Integrated Network-Based Cellular Signatures (LINCS) program aims to create a network-based understanding of human biology by cataloging changes in gene and protein expression, signaling processes, cell morphology, and epigenetic states, which occur when cells are exposed to a variety of perturbing agents. I will describe the challenges that are being addressed by the LINCS program to date and provide a discussion of the challenges that remain. In addition, I will summarize the LINCS resources available to the community and some that are in progress. I will show use cases of LINCS data in multiple contexts focused on drug discovery and target mechanism of action among others.

Integrative analysis of phenotypic and molecular responses to high-impact microenvironmental signals

Speaker: Laura M. Heiser, Oregon Health & Science University
Time: 3:40 PM

Abstract: The behaviors of normal and diseased cells and their responses to therapeutic agents are strongly influenced by the regulatory signals they receive from the microenvironments in which they reside. These signals come from direct interactions with insoluble extracellular matrix and cellular proteins as well as soluble proteins, peptides, or glycoproteins. The behavior of cells receiving these signals ultimately is determined by the interaction of multiple signals received within the regulatory networks intrinsic to the target cell. The NIH LINCS program has the goal of elucidating the molecular networks associated with response to various perturbagens and comprises participants with diverse expertise in ‘omics and imaging. Here, we describe a collaborative consortium-wide project in which we sought to understand the molecular and phenotypic responses of MCF10A cells to high-impact ligands. In preliminary studies of MCF10A cells, we identified 6 ligands that strongly perturb multiple cellular phenotypes, including proliferation, migration, differentiation status, and morphology. To elucidate the molecular networks associated with these phenotypic responses, we treated MCF10A cells with the 6 ligands, harvested cells at multiple time points from 1 to 48 hours after treatment, and then performed molecular profiling using the assays available in the LINCS consortium: live-cell and fluorescence imaging, L1000 (reduced transcriptomics panel), GCP (global chromatin profiling), ATACseq, RNAseq, cyclic immunofluorescence (image-based assessment of protein expression), and RPPA (reverse phase protein array). These data reveal striking phenotypic and molecular changes following treatment with diverse ligands, shed light on the integration of signals across molecular modalities, and allow identification of links between molecular and phenotypic changes.