Day 1: 25 June 2019
- Martin providing some "brief logistics"
Inference after prediction (Jeff Leek)
- Work led by Siruo (Sara) Wang.
- Starting from an old observation: 8 normal tissue samples but very obvious clustering
- In this case, array batch effect: time
- Now we know batch effects are everywhere
- SVA, RUV, etc...
- What makes primary cancer different from metastatic cancer?
- Can we eliminate steps in the experiment -> data analysis pipeline to reduce the time to answers
- e.g. recount2: uniformly processed expression levels for 70k+ human RNA-seq samples
- but most of this data is missing phenotype information
- even when phenotypes are provided, they may not be provided uniformly (no structured data models, lots of freeform text)
- Predicting phenotypes: train using TCGA and GTEx, predict phenotypes in recount2 dataset
- This is just one example, phenotype prediction, polygenic scores, enterotypes, ...
- "I'm going to do a little math here" -- comparing actual and predicted data in a linear model
- "Predictions are much skinnier" than the actual data -- underestimating variance due to prediction
- Categorical case even worse
- Seems circular: take X, predict Y, then fit the relationship between X and the predicted Y -- but this is common!
- e.g. predict individual phenotype based on relatives, then perform GWAS on the predicted phenotype
- Three models
- Key observation:
- Consider observed values vs predicted, looks like a simple regression model
- Even with SVM, RF, Neural Net
- Condition Y (observed) on Y_p (predicted) and estimate that relationship using a linear model
- $$ g( E[ Y | Y_p ] ) = \gamma_0 + \gamma_1 Y_p $$
- Predict, bootstrap, estimate
- E.g. in the categorical case, improves relationship and consistency of the t statistic comparison
- and in the continuous case t statistics are very close to those from the real data
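The predict / bootstrap / estimate idea can be sketched in a toy simulation. This is a minimal continuous-outcome illustration with an identity link, not the authors' implementation; all variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a continuous outcome depending linearly on one covariate.
n = 2000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# Simple predictor fit on a training split; predictions for the test split.
train, test = slice(0, 1000), slice(1000, None)
beta = np.polyfit(x[train], y[train], 1)
y_p = np.polyval(beta, x[test])

# "Predictions are much skinnier" than the observed outcomes.
print("var(observed):", np.var(y[test]), "var(predicted):", np.var(y_p))

# Relationship model: E[Y | Y_p] = gamma_0 + gamma_1 * Y_p (identity link).
gamma = np.polyfit(y_p, y[test], 1)
resid_sd = np.std(y[test] - np.polyval(gamma, y_p))

# Bootstrap: resample, simulate plausible observed outcomes from the
# relationship model, refit the downstream regression (Y ~ x), and take
# the spread of the bootstrap slopes as a corrected standard error.
slopes = []
for _ in range(200):
    idx = rng.integers(0, y_p.size, y_p.size)
    y_sim = np.polyval(gamma, y_p[idx]) + rng.normal(scale=resid_sd, size=idx.size)
    slopes.append(np.polyfit(x[test][idx], y_sim, 1)[0])
print("corrected SE of slope:", np.std(slopes))
```

The first print shows the variance shrinkage; the spread of the bootstrap slopes plays the role of the corrected standard error for the downstream fit.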
- Application
- Post mortem brain tissue: RNA degradation is a problem
- Predict RNA quality in SRA
- Uncorrected, standard errors are too low; the bootstrap correction approach successfully corrects them
- JEFFBOT
- Population studies -> Analytical tools -> Clinical decision support
- Identify (genomic) biomarkers that can define disease, inform therapy
- Requires large scale population studies
- Thousands of samples + deep phenotyping + multiple assays
- Needs well curated, structured, annotated data
- Data generation, analyses, reporting lifecycle
- "Control freak in me wanted to develop a framework to manage the data generation process"
- Importantly, all data structured to serve not just initial project, but future meta-analysis for the community
- Biggest catalyst for innovation in translational research
- Questions
- Small scale: where is the data, how was it analyzed, is there more data for this patient, ...
- Large scale: 1 patient x 10 samples x 10 analyses x 10k individuals --> millions!
- State of the art tools,
- Data utilization, integration, meta-analysis to support R&D,
- Digital biobanks! Major institutional asset
- First, map out the process from generation through analysis to interpretation
- "If we do something more than once it should be automated"
- ISABL: a platform for scalable bioinformatics operations
- Data management
- Audit trail for everything: enhance reproducibility
- Automation: reduce costs
- Postgres DB, restful API (django), User Interface (Vue.js, yay!)
- Open source -- oh wait not yet but will be
- Docker compose configuration provided, plug and play, PyPI
- Patient centric model
- Multiple ways to ingest data: Excel import, REST APIs (OpenAPI docs, yay!)
- Integrations: e.g. integration with sequencing core submission, and return
- Command line client: "completely pipeline agnostic"
- "Tools are registered as applications", hmm... another tool definition format?
- ...and workflows
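The REST ingestion route might look roughly like the sketch below. The endpoint path, payload field names, and token header are hypothetical stand-ins, not Isabl's actual API; a real deployment would supply its own URL, schema, and credentials.

```python
import json
import urllib.request

# Hypothetical sample record -- field names are illustrative only.
sample = {
    "patient": "P-0001",
    "sample_class": "TUMOR",
    "assay": "WGS",
    "raw_data": ["reads_R1.fastq.gz", "reads_R2.fastq.gz"],
}

# Hypothetical endpoint and token; the request is built but not sent here.
req = urllib.request.Request(
    "https://isabl.example.org/api/v1/samples/",
    data=json.dumps(sample).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Token <api-token>"},
    method="POST",
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # actually sending is left out of this sketch
```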
- Putting this in a clinical setting -- too fast for me
- Vue.js Web UI, looks very interactive/reactive
- Embeds visualization of "raw" data, e.g. IGV.js, circos (in a clinical report?!)
- In production: 200 projects, 23k analyses
- "Isabl is becoming the platform for computational oncology at MSKCC"
- Questions (actual questions are hard to hear)
- "How do you handle who can access what"
- "Anyone who works with us, anyone can access any data" WHAT?!
- Do have "enabling of permissions, easily implemented in the backend"
- Segregating data may not be a good idea; it may benefit a PI in the short term but comes at a cost. Should advocate for data sharing
- EPIC and getting data directly from the EHR
- If the data is structured and API enabled they can pull it through, but they have not worked with EPIC
- Relationship to cBioPortal
- They just talk to each other, data can be pushed to cBioPortal through API
- Standardize metadata entry
- Upon entry have a correction, validation process to make input uniform
- Confidentiality
- They are in research setting so this is PHI free, but link to clinical data exists
Developing software to build networks and perform data integration in precision oncology (Simina Boca)
@siminaboca
- Project goal: CDGNet (Cancer Drug Gene Network)
- Help researchers expand targeted therapies for individuals with cancer: e.g. more individualized therapies that have fewer side effects
- Precision oncology: identify biomarkers (genetic, mRNA, protein levels) that allow tailoring interventions
- Tumor molecular profiling is now becoming routine at the time of initial tumor identification
- e.g. KRAS mutation and EGFR inhibitors
- or ER+ breast cancer and tamoxifen, HER2+ and trastuzumab
- In many cases molecular profiling is used after the patient has progressed on multiple lines of therapy / has few options left
- How to include pathway information? Look at downstream targets of oncogenes
- Therapy prioritization: 4 categories
- FDA approved drug for their tumor type -- better evidence but fewer options
- approved in other tumor types
- drugs which target downstream of input oncogene ... pathway corresponding to this tumor type
- ... other tumor types -- worse evidence but more options
- Goal: approach which is automated, transparent, ... (omg so fast)
- Back to CDGnet. A shiny application
- Can select cancer types, filter based on various criteria
- e.g. looking for category 3 recommendations
- Based on molecular profile, identify genes downstream in the network that have associated drugs
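The downstream-target lookup amounts to a graph traversal. A minimal sketch with toy data; the gene names and drug links below are made up for illustration, not pulled from KEGG or DrugBank:

```python
from collections import deque

# Toy directed pathway: edges point from a gene to its downstream targets.
downstream = {
    "EGFR": ["KRAS"],
    "KRAS": ["BRAF"],
    "BRAF": ["MEK1"],
    "MEK1": ["ERK1"],
}
drugs = {"BRAF": ["drug_A"], "MEK1": ["drug_B"]}  # illustrative only

def druggable_downstream(altered_gene):
    """BFS from an altered gene; collect downstream genes with known drugs."""
    hits, seen = {}, {altered_gene}
    queue = deque(downstream.get(altered_gene, []))
    while queue:
        g = queue.popleft()
        if g in seen:
            continue
        seen.add(g)
        if g in drugs:
            hits[g] = drugs[g]
        queue.extend(downstream.get(g, []))
    return hits

print(druggable_downstream("KRAS"))  # BRAF and MEK1 carry candidate drugs
```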
- How to choose pathways?
- Currently using KEGG
- well curated by experts, but sometimes out of date, not all cancer types included
- due to KEGG's data access policies, need to use some workarounds to get the data, not easy -- difficult to update
- How to choose oncogenes?
- Also provided by KEGG
- Problem: oncogenes are an oversimplification of reality
- Not necessarily an oncogene in all tissue types
- Based on some discretization from a model
- Consider using oncogenes from the pan cancer TCGA paper
- How to get the gene drug connections?
- DrugBank database: comprehensive and expert curated
- Not tissue-specific. Maybe a problem?
- Possible alternative tools in BioConductor: Chemmine, paxtoolsr
- How to get FDA approved therapies for Biomarkers?
- Manual curation
- First list of targeted therapies from NCI
- Followed links manually to get drug labels and curated from "Indications and usage"
- Does not scale, currently working on an approach to automate using NLP
- bioRxiv 605261
- Future
- Categorize by therapy classes, rank and reduce number of unique recs
- Expand beyond KEGG, allow user defined
- Moar automation
- Drill deeper into 1-2 cancer types
- Questions
- Does the FDA know you have to go through all this manual curation?
- Some lists exist, but for targeted therapy they have not found a better way
- Did you implement into molecular tumor board
- Not yet but considering
- Consider using pharmacogenomic data (didn't hear specific database)
- No.
- What about Tumor Suppressor Genes?
- Not considered yet.
Break, then two sessions of contributed talks next.
I'm in "1a - Single-cell theme"
- Introduction: Fred Hutch Bezos Family Immunotherapy clinic
- Translational data science integrated research center: link between computational biologists and clinicians
- Next generation immunotherapy
- Isolate T cells, genetically modify cell surface receptors, reintroduce -- CAR-T cells "living drug"
- How is this process being monitored? Profiling at single cell resolution
- What tools are being used? Bioconductor of course.
- Huge growth over last ~9 years in single cell tools in Bioconductor, organized around SingleCellExperiment
- Book on single-cell orchestration coming soon
- Cell type: permanent aspects of a cell's identity vs cell state: more transient aspects
- Differential analysis: within a cluster (type prediction) find differences in state between groups (e.g. healthy/disease)
- cell- vs sample-level approaches: aggregate across features
- Simulation studies
- Cluster of methods that all perform well and similarly
- pseudobulk methods + edgeR
- How well do methods agree?
- Where a lot of methods agree, there is a lot of truth
- Lots of false discoveries that are method specific (singletons)
- Study: 8 mice, WT vs LPS treated, treatment seen as shift in pseudobulk
- Recommendation
- Fast and scalable
- use aggregation + established tests (edgeR, DESeq2, limma)
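The aggregation step is simple: sum counts over cells within each (sample, cluster) pair, then hand the resulting pseudobulk matrix to established bulk tests. A hypothetical pandas sketch (the talk's actual tooling is R/Bioconductor, e.g. edgeR on the aggregated counts):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy counts: 6 genes x 12 cells, with per-cell sample and cluster labels.
counts = pd.DataFrame(rng.poisson(5, size=(6, 12)),
                      index=[f"gene{i}" for i in range(6)])
cells = pd.DataFrame({
    "sample": ["s1"] * 6 + ["s2"] * 6,
    "cluster": ["A", "A", "B"] * 4,
})

# Pseudobulk: one column per (sample, cluster) pair, summing over its cells.
groups = cells.groupby(["sample", "cluster"]).groups
pseudobulk = pd.DataFrame(
    {f"{s}_{c}": counts.iloc[:, list(idx)].sum(axis=1)
     for (s, c), idx in groups.items()}
)
print(pseudobulk)
```

Each column of `pseudobulk` behaves like a bulk RNA-seq library, which is why edgeR/DESeq2/limma apply directly downstream.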
- Questions
- How many cells do you need
- More is of course better. Did not see certain methods affected more/less.
- Reasonable results with 30 per cluster per sample
- Motivation: manage the complexity of benchmarking single-cell methods
- Normalization, imputation, clustering, trajectories, diff exp, ...
- Benchmarking pipelines as a whole, possible pipelines grow combinatorially
- CellBench
- Simple and flexible, easy to combine methods, tidyverse compatible
- Framework
- lists of datasets (SingleCellExperiment), methods, run apply_methods to expand out and evaluate combinations
- method wrappers allow methods of a given type to have a uniform interface
- s/apply_methods/time_methods/ gives timing information
- sweep over different parameter values at each step
- errors are propagated through rather than failing the whole analysis
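The expand-and-evaluate idea can be sketched in a few lines. This is a hypothetical Python analogue of CellBench's apply_methods (CellBench itself is R/tidyverse); all names here are mine:

```python
import itertools
import time

def apply_methods(datasets, methods):
    """Run every (dataset, method) combination, timing each run and
    keeping errors as result values instead of aborting the sweep."""
    results = []
    for (dname, data), (mname, fn) in itertools.product(
            datasets.items(), methods.items()):
        t0 = time.perf_counter()
        try:
            out = fn(data)
        except Exception as e:   # propagate the error through as a value
            out = e
        results.append({"dataset": dname, "method": mname,
                        "result": out, "seconds": time.perf_counter() - t0})
    return results

datasets = {"small": [3, 1, 2], "empty": []}
methods = {"max": max, "mean": lambda xs: sum(xs) / len(xs)}
for row in apply_methods(datasets, methods):
    print(row["dataset"], row["method"], row["result"])
```

The combinatorial growth the talk mentions is exactly the `itertools.product` here: pipelines chain several such stages, so combinations multiply at each step.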
- Quick Application
- Applying four different clustering methods across three different protocols
- Sampling different numbers of cells to measure time complexity
- Summary
- Not single cell specific
- Inspired by DSC python package
- Mostly powered by BiocParallel
- See also SummarizedBenchmark
@stephaniehicks
- Analyzing big data sets: 500k cells, 1M cells, ...
- Might not even be able to load into memory, so what to do?
- Want
- Fast method to repeatedly cluster thousands to millions of cells in PCA space (may fit in memory)
- Cluster data on disk, data never fully loaded into memory, e.g. on top of an HDF5 file
- Sometimes speed is more important than accuracy (e.g. for normalization)
- Good ol' k-means clustering
- Mini-batch k-means (Sculley 2010)
- At each iteration, only use a random subset of the data
- Only the distance between the mini-batch elements and centroids need to be computed
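The Sculley (2010) update is short: assign a random batch to its nearest centroids, then move each centroid toward its assigned points with a per-center learning rate of 1/count. A minimal numpy sketch (not the mbkmeans implementation; initialization is fixed here for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three well-separated 2D Gaussian blobs.
X = np.vstack([rng.normal(c, 0.3, size=(300, 2)) for c in (0, 4, 8)])

def minibatch_kmeans(X, k, init_idx, batch_size=50, n_iter=100, seed=0):
    """Sculley-style mini-batch k-means: each iteration touches only a
    random batch, so only batch-to-centroid distances are computed."""
    rng = np.random.default_rng(seed)
    centers = X[list(init_idx)].copy()
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        dist = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)           # nearest centroid per point
        for point, c in zip(batch, assign):
            counts[c] += 1
            centers[c] += (point - centers[c]) / counts[c]  # 1/count rate
    return centers

# One deterministic starting point per blob.
centers = minibatch_kmeans(X, 3, init_idx=[0, 300, 600])
print(np.sort(centers.mean(axis=1)))
```

Because each iteration only touches `batch_size` rows, the same loop works when `X` is an on-disk array read batch by batch, which is the setting mbkmeans targets.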
- Existing implementations?
- ClusterR exists but requires data to be stored in memory
- MiniBatchKMeans exists in scikit-learn, can use an HDF5 file, but still loads the full matrix into memory
- So... mbkmeans, a Bioconductor package. Input can be any matrix, e.g. DelayedArray, HDF5Array
- Preliminary results
- Bivariate Gaussian in 2D, three groups, then project into high dimensional space
- Accuracy improves with larger batches, good at 5% and overlapping full kmeans at 25%
- At 5% the compute time is reduced, and further reduced as number of data point increases
- Memory savings come more from using mini batches than from not loading the data into memory!
- Preliminary results using HCA data
- 12 mins for clustering full 333k x 4k matrix (DelayedMatrix), ...
- Next
- Other clustering methods, e.g. DBSCAN and BIRCH
- More comparisons to other methods
- Multi-omics experiments increasingly common, e.g. NCI-60 panel
- 60 cell lines, 9 types of cancer tissue
- Data on mRNA, protein, micro-RNA, DNA me, etc.
- Basic setup
- Have performed per gene differential expression analysis, test statistic from standard pipelines (e.g. limma)
- Have such values for "multiple omics"
- Is a set of genes different as a group, given multiple omics measures?
- Method 1: average t-statistic values
- Need to account for correlation
- Method 2: multivariate linear model of T
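The correlation adjustment in Method 1 can be illustrated directly: the variance of an average of k correlated t-statistics is (1/k^2) * 1' Sigma 1, not sigma^2/k, and ignoring the off-diagonal terms understates it. A toy numpy sketch (the simulation setup is my own assumption, not the speaker's data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy t-statistics: genes x omics platforms, with correlated columns
# induced by a shared per-gene signal.
g, k = 500, 3
base = rng.normal(size=(g, 1))
t = base + 0.5 * rng.normal(size=(g, k))

t_bar = t.mean(axis=1)

# Correlation-aware variance of the mean: (1/k^2) * 1' Sigma 1, with
# Sigma estimated from the observed t-statistics across genes.
sigma = np.cov(t, rowvar=False)
var_mean = np.ones(k) @ sigma @ np.ones(k) / k**2
z = t_bar / np.sqrt(var_mean)

# A naive SE would use only the diagonal of Sigma (independence).
print("naive var:", sigma.diagonal().mean() / k, "adjusted var:", var_mean)
```

With positively correlated platforms the adjusted variance is larger than the naive one, so unadjusted averaging would be anti-conservative.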
Lunch