jxtx/bioc2019.md

## bioc2019.md

      
    Day 1: 25 June 2019
BioC2019: Where Software and Biology Connect (Martin)


Martin providing some "brief logistics"

Inference after prediction (Jeffrey "John" Leek)

aka "What do we do after we have machine learned everything"


Work led by Siruo (Sara) Wang.
Starting from an old observation: 8 normal tissue samples but very obvious clustering

In this case, array batch effect: time
Now we know batch effects are everywhere
SVA, RUV, etc...


What makes primary cancer different from metastatic cancer?

Can we eliminate steps in the experiment -> data analysis pipeline to reduce the time to answers
e.g. recount2; preprocess 70k+ human RNA-seq samples into uniformly processed expression levels

but most of this data is missing phenotype information
even when phenotypes are provided, they may not be provided uniformly (no structured data models, lots of freeform text)


Predicting phenotypes: train using TCGA and GTEx, predict phenotypes in recount2 dataset
This is just one example, phenotype prediction, polygenic scores, enterotypes, ...


"I'm going to do a little math here" -- comparing actual and predicted data in a linear model

"Predictions are much skinnier" than the actual data -- underestimating variance due to prediction
Categorical case even worse
Seems circular, take X, predict Y, then fit relationship between X and predicted Y, but is common!

e.g. predict individual phenotype based on relatives, then perform GWAS on the predicted phenotype
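
A short worked version of the "skinnier" observation above. This is only an illustration and assumes a simple linear truth with independent noise and a predictor that recovers the regression line exactly; none of these symbols come from the talk.

$$ Y = \beta_0 + \beta_1 X + \varepsilon, \quad \mathrm{Var}(\varepsilon) = \sigma^2 \;\Rightarrow\; \mathrm{Var}(Y) = \beta_1^2 \mathrm{Var}(X) + \sigma^2 $$
$$ Y_p = \beta_0 + \beta_1 X \;\Rightarrow\; \mathrm{Var}(Y_p) = \beta_1^2 \mathrm{Var}(X) < \mathrm{Var}(Y) $$

Under these assumptions the predicted values drop the $\sigma^2$ term entirely, so a model fit to $Y_p$ in place of $Y$ sees too little residual variance and reports standard errors that are too small.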


Three models

Key observation:

Consider observed values vs predicted, looks like a simple regression model

Even with SVM, RF, Neural Net


Model Y (observed) conditional on Y_p (predicted) and estimate that relationship using a linear model

$$ g( E[ Y | Y_p ] ) = \gamma_0 + \gamma_1 Y_p $$
Predict, bootstrap, estimate
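
One possible reading of "predict, bootstrap, estimate", as a minimal sketch and not the authors' implementation. It assumes a continuous outcome, an identity link for g, a held-out test set with both observed and predicted values, and a validation set with predictions only. The function names and the simulated stand-in predictor are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (n - 2)) ** 0.5


def bootstrap_corrected_slope(y_test, yp_test, x_val, yp_val, n_boot=200, seed=0):
    """Return (slope estimate, bootstrap standard error) for the x -> y relationship."""
    rng = random.Random(seed)
    # Relationship model: observed y as a linear function of predicted y_p.
    g0, g1, sd = fit_line(yp_test, y_test)
    n = len(x_val)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        xs = [x_val[i] for i in idx]
        # Simulate plausible observed values from the relationship model.
        ys = [g0 + g1 * yp_val[i] + rng.gauss(0, sd) for i in idx]
        slopes.append(fit_line(xs, ys)[1])
    return statistics.median(slopes), statistics.stdev(slopes)


# Simulated data: true slope 2, and a "predictor" that returns the noiseless line.
data_rng = random.Random(1)


def simulate(n):
    xs = [data_rng.gauss(0, 1) for _ in range(n)]
    ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]
    yps = [2.0 * x for x in xs]  # stand-in for a trained prediction model
    return xs, ys, yps


_, y_test, yp_test = simulate(300)
x_val, _, yp_val = simulate(300)
est, se = bootstrap_corrected_slope(y_test, yp_test, x_val, yp_val)
```

Fitting `yp_val` directly against `x_val` here would give a residual standard deviation of essentially zero, which is the "standard errors are too low" problem; the bootstrap puts the noise estimated from the test set back in.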


E.g. in the categorical case, improves relationship and consistency of the t statistic comparison
and in the continuous case t statistics are very close to those from the real data


Application

Post mortem brain tissue: RNA degradation is a problem
Predict RNA quality in SRA
Uncorrected, standard errors are too low; the bootstrap correction approach successfully corrects them


JEFFBOT

Elli Papaemmanuil


Population studies -> Analytical tools -> Clinical decision support


Identify (genomic) biomarkers that can define disease, inform therapy


Requires large scale population studies

Thousands of samples + deep phenotyping + multiple assays
Needs well curated, structured, annotated data


Data generation, analysis, reporting lifecycle

"Control freak in me wanted to develop a framework to manage the data generation process"
Importantly, all data structured to serve not just initial project, but future meta-analysis for the community

Biggest catalyst for innovation in translational research


Questions

Small scale: where is the data, how was it analyzed, is there more data for this patient, ...
Large scale: 1 patient x 10 samples x 10 analyses x 10k individuals --> millions!

State of the art tools,
Data utilization, integration, meta-analysis to support R&D,
Digital biobanks! Major institutional asset


First, map out the process from generation, analysis, interpretation

"If we do something more than once it should be automated"


ISABL: a platform for scalable bioinformatics operations

Data management
Audit trail for everything: enhance reproducibility
Automation: reduce costs
Postgres DB, RESTful API (Django), User Interface (Vue.js, yay!)
Open source -- oh wait not yet but will be
Docker compose configuration provided, plug and play, PyPI
Patient centric model
Multiple ways to ingest data: Excel import, REST APIs (OpenAPI docs, yay!)
Integrations: e.g. integration with sequencing core submission, and return
Command line client: "completely pipeline agnostic"

"Tools are registered as applications", hmm... another tool definition format?
...and workflows


Putting this in a clinical setting -- too fast for me


Vue.js Web UI, looks very interactive/reactive

Embeds visualization of "raw" data, e.g. IGV.js, circos (in a clinical report?!)


In production: 200 projects, 23k analyses


"Isabl is becoming the platform for computational oncology at MSKCC"


Questions (actual questions are hard to hear)

"How do you handle who can access what"

"Anyone who works with us, anyone can access any data" WHAT?!
Do have "enabling of permissions, easily implemented in the backend"
Segregating data may not be a good idea: it may protect a PI in the short term but comes at a cost. Should advocate for data sharing


EPIC and getting data directly from the EHR

If the data is structured and API-enabled they can pull it through, but they have not worked with EPIC


Relationship to cBioPortal

They just talk to each other, data can be pushed to cBioPortal through API


Standardize metadata entry

Upon entry have a correction, validation process to make input uniform


Confidentiality

They are in research setting so this is PHI free, but link to clinical data exists


Developing software to build networks and perform data integration in precision oncology (Simina Boca)

@siminaboca

Project goal: CDGnet (Cancer Drug Gene Network)

Help researchers expand targeted therapies for individuals with cancer: e.g. more individualized therapies that have fewer side effects


Precision oncology: identify biomarkers (genetic, mRNA, protein levels) that allow tailoring interventions
Tumor molecular profiling is now becoming routine at the time of initial tumor identification

e.g. KRAS mutation and EGFR inhibitors
or ER+ breast cancer and tamoxifen, HER2+ and trastuzumab


In many cases molecular profiling is used after the patient has progressed on multiple lines of therapy / has few options left
How to include pathway information? Look at downstream targets of oncogenes
Therapy prioritization: 4 categories

FDA approved drug for their tumor type -- better evidence but fewer options
approved in other tumor types
drugs which target downstream of input oncogene ... pathway corresponding to this tumor type
... other tumor types -- worse evidence but more options


Goal: approach which is automated, transparent, ... (omg so fast)
Back to CDGnet. A shiny application

Can select cancer types, filter based on various criteria
e.g. looking for category 3 recommendations

Based on molecular profile, identify genes downstream in the network that have associated drugs
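
The downstream lookup described above can be sketched as a breadth-first walk over a directed pathway graph. This is a toy illustration only: the pathway edges and gene-drug pairs are hand-written stand-ins, not pulled from KEGG or DrugBank, and the function name is hypothetical rather than part of CDGnet.

```python
from collections import deque

# Toy inputs for illustration only; not taken from KEGG or DrugBank.
PATHWAY = {
    "KRAS": ["BRAF"],
    "BRAF": ["MAP2K1"],
    "MAP2K1": ["MAPK1"],
    "MAPK1": [],
}
DRUG_TARGETS = {"BRAF": ["vemurafenib"], "MAP2K1": ["trametinib"]}


def downstream_drugs(altered_gene, pathway, drug_targets):
    """Walk downstream of the altered gene; return {gene: drugs} for druggable hits."""
    seen = {altered_gene}
    queue = deque(pathway.get(altered_gene, []))
    hits = {}
    while queue:
        gene = queue.popleft()
        if gene in seen:
            continue
        seen.add(gene)
        if gene in drug_targets:
            hits[gene] = drug_targets[gene]
        queue.extend(pathway.get(gene, []))
    return hits


recommendations = downstream_drugs("KRAS", PATHWAY, DRUG_TARGETS)
```

The altered gene itself is excluded on purpose, since drugs that hit it directly would fall under the first two categories rather than the downstream ones.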


How to choose pathways?

Currently using KEGG

well curated by experts, but sometimes out of date, not all cancer types included
due to KEGG's data access policies, need to use some workarounds to get the data, not easy -- difficult to update


How to choose oncogenes?

Also provided by KEGG
Problem, oncogenes are an oversimplification of reality

Not necessarily an oncogene in all tissue types
Based on some discretization from a model
Consider using oncogenes from the pan cancer TCGA paper


How to get the gene drug connections?

DrugBank database: comprehensive and expert curated
Not tissue-specific. Maybe a problem?
Possible alternative tools in Bioconductor: Chemmine, paxtoolsr


How to get FDA approved therapies for Biomarkers?

Manual curation

First list of targeted therapies from NCI
Followed links manually to get drug labels and curated from "Indications and usage"
Does not scale, currently working on an approach to automate using NLP


bioRxiv 605261


Future

Categorize by therapy classes, rank and reduce number of unique recs
Expand beyond KEGG, allow user defined
Moar automation
Drill deeper into 1-2 cancer types


Questions

Does the FDA know you have to go through all this manual curation?

Some lists exist, but for targeted therapy they have not found a better way


Did you implement into molecular tumor board

Not yet but considering


Consider using pharmacogenomic data (didn't hear specific database)

No.


What about Tumor Suppressor Genes?

Not considered yet.


Break, then two sessions of contributed talks next.
I'm in "1a - Single-cell theme"

Orchestrating single-cell analysis with Bioconductor (Robert Amezquita)

bioRxiv

Introduction: Fred Hutch Bezos Family Immunotherapy clinic

Translational data science integrated research center: link between computational biologists and clinicians
Next generation immunotherapy

Isolate T cells, genetically modify cell surface receptors, reintroduce -- CAR-T cells "living drug"
How is this process being monitored? Profiling at single cell resolution
What tools are being used? Bioconductor of course.


Huge growth over last ~9 years in single cell tools in Bioconductor, organized around SingleCellExperiment
Book on single-cell orchestration coming soon

Analysis of multi-sample multi-group scRNA-seq data (Heather Crowell)


Cell type: permanent aspects of a cell's identity vs cell state: more transient aspects


Differential analysis: within a cluster (type prediction) find differences in state between groups (e.g. healthy/disease)


cell- vs sample-level approaches: aggregate across features


Simulation studies

Cluster of methods that all perform well and similarly

pseudobulk methods + edgeR


How well do methods agree?

Where a lot of methods agree, there is a lot of truth
Lots of false discoveries that are method specific (singletons)


Study: 8 mice, WT vs LPS treated, treatment seen as shift in pseudobulk


recommendation

Fast and scalable
use aggregation + established tests (edgeR, DESeq2, limma)
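
The aggregation step in the recommendation above amounts to summing raw counts over all cells that share a sample and a cluster. A minimal sketch, using made-up cell records and a hypothetical function name; the resulting per-sample totals are what would then be handed to a bulk method such as edgeR, DESeq2 or limma.

```python
from collections import defaultdict

# Hypothetical per-cell records: (sample, cluster, {gene: count}).
cells = [
    ("mouse1", "neuron", {"Fos": 2, "Actb": 10}),
    ("mouse1", "neuron", {"Fos": 1, "Actb": 7}),
    ("mouse1", "astro", {"Fos": 0, "Actb": 5}),
    ("mouse2", "neuron", {"Fos": 4, "Actb": 9}),
]


def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample, cluster) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for sample, cluster, counts in cells:
        for gene, count in counts.items():
            totals[(sample, cluster)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


bulk = pseudobulk(cells)
```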


Questions

How many cells do you need

More is of course better. Did not see certain methods affected more/less.
Reasonable results with 30 per cluster per sample


CellBench: A Framework for Evaluating Single Cell Analysis Pipelines (Shian Su)


Motivation: manage the complexity of benchmarking single-cell methods
Normalization, imputation, clustering, trajectories, diff exp, ...
Benchmarking pipelines as a whole, possible pipelines grow combinatorially
CellBench

Simple and flexible, easy to combine methods, tidyverse compatible
Framework

lists of datasets (SingleCellExperiment), methods, run apply_methods to expand out and evaluate combinations
method wrappers allow methods of a given type to have a uniform interface
swap apply_methods for time_methods to get timing information
sweep over different parameter values at each step
errors are propagated through rather than failing the whole analysis
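
A rough Python analogue of the framework idea above: expand every input against every method, chain steps, and carry errors forward as values. This is not the CellBench API (which is R); the function here only borrows the `apply_methods` name, and the toy "methods" are placeholders.

```python
from itertools import product


def apply_methods(inputs, methods):
    """Run every method on every input; keep exceptions as results instead of raising."""
    results = {}
    for (in_key, value), (m_name, fn) in product(inputs.items(), methods.items()):
        key = in_key + (m_name,)
        if isinstance(value, Exception):
            results[key] = value  # an earlier step failed: pass the error along
            continue
        try:
            results[key] = fn(value)
        except Exception as err:
            results[key] = err
    return results


def center(xs):
    mean = sum(xs) / len(xs)  # fails on an empty dataset
    return [x - mean for x in xs]


datasets = {("small",): [1.0, 2.0, 3.0], ("empty",): []}
normalise = {"center": center, "identity": list}
summarise = {"max": max}

step1 = apply_methods(datasets, normalise)
step2 = apply_methods(step1, summarise)
```

Keying results by a growing tuple of names keeps the full pipeline path attached to each outcome, so a failure in one combination can be traced without stopping the other combinations.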


Quick Application

Applying four different clustering methods across three different protocols
Sampling different numbers of cells to measure time complexity


Summary
Not single cell specific
Inspired by DSC python package
Mostly powered by BiocParallel
See also SummarizedBenchmarks

mbkmeans: fast clustering for single cell data using mini-batch k-means (Stephanie Hicks)

@stephaniehicks

Analyzing big data sets: 500k cells, 1M cells, ...

Might not even be able to load into memory, so what to do?


Want

Fast method to repeatedly cluster thousands to millions of cells in PCA space (may fit in memory)
Cluster data on disk, data never fully loaded into memory, e.g. on top of an HDF5 file
Sometimes speed is more important than accuracy (e.g. for normalization)


Good ol' k-means clustering
Mini-batch k-means (Sculley 2010)

At each iteration, only use a random subset of the data

Only the distance between the mini-batch elements and centroids need to be computed
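
A minimal sketch of the mini-batch update on 1-D data, following the per-center learning rate idea from Sculley (2010). This is an illustration in plain Python, not the mbkmeans implementation; the starting centers are passed in by hand and the simulated data are an assumption of the example.

```python
import random


def mini_batch_kmeans(points, centers, batch_size=20, n_iter=50, seed=0):
    """Mini-batch k-means on 1-D data with a per-center learning rate of 1/count."""
    rng = random.Random(seed)
    centers = list(centers)
    counts = [0] * len(centers)
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)  # only this subset is touched
        # Assign each batch point to its nearest center before moving any center.
        nearest = [
            min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            for x in batch
        ]
        for x, k in zip(batch, nearest):
            counts[k] += 1
            rate = 1.0 / counts[k]
            centers[k] = (1.0 - rate) * centers[k] + rate * x
    return centers


# Two well separated simulated groups around 0 and 10.
data_rng = random.Random(42)
points = [data_rng.gauss(0, 1) for _ in range(500)]
points += [data_rng.gauss(10, 1) for _ in range(500)]
fitted = sorted(mini_batch_kmeans(points, centers=[1.0, 9.0]))
```

Each iteration computes distances for `batch_size` points only, which is why the full matrix never has to be in memory at once.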


Existing implementations?

ClusterR exists but requires data to be stored in memory
MiniBatchKMeans exists in scikit-learn, can use an HDF5 file, but still loads the full matrix into memory


So... mbkmeans, a Bioconductor package. Input can be any matrix, e.g. DelayedArray, HDF5Array.
Preliminary results

Bivariate Gaussian in 2D, three groups, then project into high dimensional space

Accuracy improves with larger batches, good at 5% and overlapping full kmeans at 25%
At 5% the compute time is reduced, and further reduced as the number of data points increases
Memory savings come more from using mini batches than from not loading the data into memory!


Preliminary results using HCA data

12 mins for clustering full 333k x 4k matrix (DelayedMatrix), ...


Next

Other clustering methods, e.g. DBSCAN and BIRCH
More comparisons to other methods


Gene Set Enrichment Analysis with Multi-omics Data (Deepayan Sarkar)


Multi-omics experiments increasingly common, e.g. NCI-60 panel

60 cell lines, 9 types of cancer tissue
Data on mRNA, protein, micro-RNA, DNA methylation, etc.


Basic setup

Have performed per gene differential expression analysis, test statistic from standard pipelines (e.g. limma)
Have such values for "multiple omics"
Is a set of genes different as a group given multiple omics measures?


Method 1: average t-statistic values

Need to account for correlation
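
Why correlation matters for an averaged statistic, as a short worked formula. It assumes, for illustration, that the $m$ statistics being averaged each have unit variance under the null and pairwise correlations $\rho_{jk}$ (with $\rho_{kk} = 1$); the notation is not from the talk.

$$ \bar{T} = \frac{1}{m} \sum_{k=1}^{m} T_k, \qquad \mathrm{Var}(\bar{T}) = \frac{1}{m^2} \sum_{j=1}^{m} \sum_{k=1}^{m} \rho_{jk} = \frac{1 + (m-1)\bar{\rho}}{m} $$

Here $\bar{\rho}$ is the average off-diagonal correlation. With independent statistics the variance is $1/m$; positive correlation makes it larger, so treating $\sqrt{m}\,\bar{T}$ as standard normal would overstate significance.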


Method 2: multivariate linear model of T


Lunch


## bioc2019_2.md

      
    Challenges in label free mass spec (Lieven Clement)


MSqRob Workflow
Imputation is detrimental
Robust summarization avoids the need for imputation
Robust inference with linear models further improves performance
bioRxiv 668863
Implemented in MSnbase Bioconductor package

cleanUpdTSeq and InPAS for accurate identification and differential usage of polyA sites (Julie Zhu)


Polyadenylation -- addition of poly-A tail to nascent RNA
Alternative polyadenylation

widespread
generates transcript/protein diversity
post-transcriptional regulation (alternative 3' UTR)
highly regulated process


Sequencing approaches

PAS-seq, PolyA-seq

Oligo-dT priming, imperfect end identification


3P-seq

Direct ligation to 3' end


Measuring accuracy: how well can the polyA signal motif be identified
Better identification of internal priming -- false positive pA sites

Training sets of true and false sites
-> model selection
-> compare to heuristic filtering
3P-seq to define true set, RNA-seq to define false set (both must be in PAS-seq)
Training data represents expected populations
Algorithm features

Upstream: presence/absence of hexamers
Downstream: single nuc frequency, dinuc frequency, average As to pA site
Naive Bayes classifier
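
A minimal sketch of a naive Bayes classifier over presence/absence features, in the spirit of the feature list above. It is simplified on purpose: all features are binarized (the downstream nucleotide frequencies in the talk are continuous), the training rows are invented, and the function names are hypothetical rather than the cleanUpdTSeq API.

```python
import math


def train_bernoulli_nb(rows, labels, alpha=1.0):
    """rows: 0/1 feature tuples; labels: class names. Returns {class: (log_prior, probs)}."""
    model = {}
    n_features = len(rows[0])
    for cls in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == cls]
        prior = len(members) / len(rows)
        # Laplace-smoothed probability that each feature is present in this class.
        probs = [
            (sum(row[j] for row in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (math.log(prior), probs)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature tuple."""
    def score(cls):
        log_prior, probs = model[cls]
        return log_prior + sum(
            math.log(p if x else 1.0 - p) for x, p in zip(row, probs)
        )
    return max(model, key=score)


# Invented features: (AATAAA upstream, ATTAAA upstream, A-rich downstream).
rows = [
    (1, 0, 0), (1, 0, 0), (0, 1, 0), (1, 0, 1),  # true pA sites
    (0, 0, 1), (0, 0, 1), (0, 1, 1), (0, 0, 1),  # internal priming
]
labels = ["true_site"] * 4 + ["internal_priming"] * 4
model = train_bernoulli_nb(rows, labels)
```

A site with the canonical hexamer upstream and no A-rich stretch downstream scores as a true site, while a hexamer-free site followed by A-rich sequence scores as internal priming.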


Putative sites identified by PAS-seq only with this classifier

Common (with 3P) have higher read coverage, so likely just missed due to depth?


Biological validation in zebrafish: 98% accuracy vs 88%
InPAS: identify novel polyA sites from RNA-seq data

Learning cis-regulatory code at base-pair resolution (Anshul Kundaje)

@anshulkundaje

"Interpreting relatively blackbox deep learning models"
Types of data for profiling regulatory information: ChIP-seq, ATAC-seq/DNase-seq
Modeling framework

Given an assay like ChIP-seq or ATAC-seq
Consider genome bins, 200bp - 1kb
Assign either binary or quantitative label to each bin
Learn relationship between DNA sequence in bin and label


Here using (deep convolutional) neural networks, with one-hot encoding of the DNA
"Democratizing ML for genomics" -- kipoi.org a model zoo for genomics

Wrappers for running models, simplified model comparison
Usually performance is based on how you clean the data, train the models (e.g. batches), design the labels


ChIP-exo / nexus -- ultra high resolution TF binding profiles

BPNet: DNA sequence to base-pair resolution profile regression

Just predict + and - strand base pair level read counts

optimize linear combination of two loss functions
MSE loss for log total counts
multinomial loss for capturing distribution of counts across regions
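
The two-part loss above can be written out for a single region and strand. This is a sketch under stated assumptions, not BPNet's code: the `weight` parameter, the use of log(1 + total) for the count target, and the function names are all choices made for this example.

```python
import math


def multinomial_nll(counts, probs):
    """Negative log-likelihood of per-base counts under a multinomial with given probabilities."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    log_lik = log_coef + sum(
        k * math.log(p) for k, p in zip(counts, probs) if k > 0
    )
    return -log_lik


def profile_loss(counts, pred_probs, pred_log_total, weight=1.0):
    """Multinomial shape loss plus weighted squared error on log(1 + total counts)."""
    obs_log_total = math.log1p(sum(counts))
    shape_term = multinomial_nll(counts, pred_probs)
    total_term = (obs_log_total - pred_log_total) ** 2
    return shape_term + weight * total_term


counts = [0, 8, 1, 1]  # observed reads at four positions
peaked = [0.01, 0.79, 0.10, 0.10]  # prediction that matches the spike
flat = [0.25, 0.25, 0.25, 0.25]  # prediction with no positional information
true_log_total = math.log1p(sum(counts))

good_loss = profile_loss(counts, peaked, true_log_total)
flat_loss = profile_loss(counts, flat, true_log_total)
bad_total_loss = profile_loss(counts, peaked, 0.0)
```

Splitting the loss this way lets the model be rewarded for putting reads in the right positions separately from getting the overall occupancy right, which fits the note above that spike prediction and total count prediction behave differently.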


Performance metric

set SN threshold on observed data (binarize spikes)
allow some slack (1bp, 5bp, 20bp) and consider auPRC
as good as replicates for spikes
Less good for total count prediction

Spike prediction is more local, where total occupancy depends on more global features


Interpreting the "black box"

DeepLIFT: Back propagate importance through the network

e.g. Oct4 distal enhancer, footprints for all of Klf4, Nanog, Oct4, Sox2


TF-MODISCO

Inferring globally predictive motifs across all binding events

43 distinct motifs to explain the binding of 4 TFs!


Identifies TF binding sites in lone transposable elements


10.9bp periodic pattern around NANOG site -- major groove
Combinatorial binding
Predicting expression in MPRA using BPNet model


Tutorials: kundajelab.github.io/dragonn


Break

Three reasons to use Bioconductor in containers (Nitesh Turaga)


Containers

Completely isolated environments, don't bundle the full OS, lightweight, cross-platform
Bioconductor using Docker because it is popular, has lots of open-source tooling


Reason 1: No longer worry about system dependencies

Bioconductor packages depend on other Bioconductor packages, system dependencies, ...
Containers make it easier for developers to distribute packages and users to get them
bioconductor/bioconductor_full:RELEASE_3_9 -- container with full Bioconductor release
comes with RStudio inside


Reason 2: Reproducibility

Image stays in a steady state, can ensure you are always running in the same environment
Dockerfile tells you exactly how that image was built


Reason 3: Sharing your work.
Future

Starting a Google cloud cluster -- gcloud container clusters
Start kubernetes pod with Bioconductor and redis
Using BiocParallel, communicating with redis, across five workers


Bioconductor at "cloud scale" (Vince Carey)


Cloud oriented agenda for Bioc:

Simplify discovery of data and methods
Evaluate and optimize representations of data to foster cloud-scale data science
Reduce barriers to distributed/scalable analysis for learning from cloud-scale genomic data


Task for discovery: improved annotation

Annotation is often highly ad hoc
e.g. Bioconductor:Cancer43k Shiny app
Annotation: discrete or continuous

example 100 topic model from LDA on 988 cancer related abstracts


Suggestion for enhanced discovery

Use all available information, build annotations based on semantic distances


Data representation

HDF scalable data service + DelayedArray
HSDS server is multiplexed, concurrent
Need more work on the many different approaches to representing data in cloud...
(a million options presented here...)


Conclusions

Bioc has been cloud friendly for a long time
Need to help with discovery, representation, scalable analysis
Avenues to explore

Semantic neighborhoods
Unified representation
Container orchestration
Need more experimentation and communication!


Superintronic: tidy coverage analysis and range-wise diagnostics (Stuart Lee)

@_StuartLee

Core data structure of Bioconductor -- GRanges -- is a tidy data structure
plyranges -- add dplyr concepts to Bioconductor
superintronic: coverage and range based summarization

experimental design, summarization, visualization


Accounting for experimental design

design is a data frame; design %>% compute_coverage_long(source="bam")


Finding interesting regions of coverage

Scatterplot diagnostics (Tukey's)
Rather than look at large numbers of scatterplots, compute summaries of scatterplot shape/etc
For a given range index x set of keys; rangle( x, .var=score, .index=seqnames, .funs=ranglers() )

e.g. mean / var of sliding window means across all regions of interest

bumpy, lumpy, max mean shift, max variance shift, count distribution, flat spots, crossing points
visualize all summaries on a biplot
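
The "mean / var of sliding window means" summary above can be sketched in a few lines. This is a plain Python illustration, not superintronic's `rangle` or `ranglers` API; the function names and the toy coverage vectors are hypothetical.

```python
import statistics


def window_means(scores, width):
    """Mean coverage in every sliding window of the given width."""
    return [
        statistics.fmean(scores[i:i + width])
        for i in range(len(scores) - width + 1)
    ]


def summarize(scores, width=3):
    """Collapse one region's coverage into two shape summaries."""
    means = window_means(scores, width)
    return {"mean": statistics.fmean(means), "var": statistics.pvariance(means)}


flat_region = [5, 5, 5, 5, 5, 5]  # even coverage
bumpy_region = [0, 0, 9, 9, 0, 0]  # a single bump of coverage

flat_summary = summarize(flat_region)
bumpy_summary = summarize(bumpy_region)
```

Two regions with similar average coverage can differ sharply in the variance of their window means, which is the kind of contrast a biplot of many such summaries is meant to surface.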


Example: ranglers for intronic sequence coverage in RNA-seq

BiocProject: Bioconductor oriented approach to project management (Nathan Sheffield)

databio.org/slides

Most workflows require individual metadata organization
Then other tools require a different organization
Solution: represent the metadata in a common way, design adapters ("plugs") so each tool sees the data in the form it wants
Portable Encapsulated Project ("PEP")

Pipelines and workflows understand the format, but also could be used for sharing, etc.


peppy python package, pepr R package, geofetch to create a PEP from GEO, looper runs arbitrary commands over a PEP representation
Format: project_config.yaml (project level attributes) and a CSV file for sample level attributes
BiocProject integrates PEP into Bioconductor
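
The split between project-level and sample-level attributes, and the idea of running a command over every sample, can be sketched with the standard library alone. This does not use peppy, pepr or looper, and it skips YAML parsing by writing the project attributes as a dict; the attribute names, file paths and the `align` command are made up for the example.

```python
import csv
import io

# Hypothetical stand-ins for a project config and its sample table.
project = {"name": "demo", "genome": "hg38"}
sample_csv = """sample_name,protocol,read1
s1,RNA-seq,data/s1.fastq.gz
s2,ATAC-seq,data/s2.fastq.gz
"""


def load_samples(text, project_attrs):
    """Each sample inherits project-level attributes; sample-level values win on conflict."""
    reader = csv.DictReader(io.StringIO(text))
    return [{**project_attrs, **row} for row in reader]


def render_commands(samples, template):
    """Fill a command template once per sample, using its merged attributes."""
    return [template.format(**sample) for sample in samples]


samples = load_samples(sample_csv, project)
commands = render_commands(
    samples, "align --genome {genome} --in {read1} --out {sample_name}.bam"
)
```

Because every tool reads the same merged sample records, swapping the template is enough to point a different pipeline at the same project.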

Storage and analysis of microbiome quality control data (Karun Rajesh)

github.com/karunrajesh/MBQC_Phyloseq

Microbiome quality control

Only a small number of quality control and lab-to-lab variability studies exist


Process pipeline

Several types of samples: fresh samples, artificial samples, negative controls
15 different extraction labs (blinded)
9 different bioinformatics processing labs
Result: OTU tables
integrated OTU tables nearly 1TB


Here: compress all of this data into a phyloseq object, only 28 MB
Results:

Alpha diversity highly variable across the 15 labs: significant lab-to-lab variation
Oral mock community, 22 true species, different bioinformatics labs detect different taxa
PCoA: PC2 (13%) appears to largely be bioinformatics lab (is PC1 processing lab?)


PulmonDB: curated gene expression database of lung diseases (Ana Beatriz Villaseñor-Altamirano)


Talking about COPD, IPF; lung diseases associated with smoking

In COPD alveoli are partially destroyed
In IPF excess ECM is produced


Significant amount of transcriptome data available, thousands of transcriptome datasets for COPD in GEO
Goal: generate a COPD/IPF transcriptome database from public data
Extract data from recount2

filter for COPD/IPF and raw data availability
use "COMMAND" to curate datasets into a controlled vocabulary
store in a database: pulmonDB


3,000+ samples, across multiple platforms, 200+ from RNA-seq
Making it available

Website for data exploration
Availability in R environment

PulmonDB package allows creating SummarizedExperiment and annotation as Dataframes from the SQL database


COPD Case Study

Lung tissue, contrasts COPD/Healthy vs Healthy/Healthy

132 genes over/under expressed


Snapcount (Chris Wilks)

github.com/langmead-lab/snapr

"Are we doing the kind of research we could be doing if public data was available at scale"
SRA: 100's of TB of raw sequence data
Search engine "stack" for RNA-seq

Rail-RNA -- the crawler
recount2 -- the database
Snaptron -- the search engine


Snaptron database

splice junctions x samples x counts


Snapcount

Bioconductor interface to the Snaptron web services

Query all data in recount2 plus some additional studies
Region centric: slice a genomic region across all samples (~2000 studies)
Output: RangedSummarizedExperiment object


Other questions snapcount can answer

Relative usage of splicing patterns
Tissue enrichment
...


What's next: recount3!

redoing the whole alignment pipeline, annotation-free
rerun with new runs including MOUSE!
streamlined pipeline that users can run on their own samples and integrate


End of talks!