@jxtx
Last active June 26, 2019 15:46

Day 1: 25 June 2019

BioC2019: Where Software and Biology Connect (Martin)

Inference after prediction (Jeffrey "John" Leek)

aka "What do we do after we have machine learned everything"

  • Work led by Siruo (Sara) Wang.
  • Starting from an old observation: 8 normal tissue samples but very obvious clustering
    • In this case, array batch effect: time
    • Now we know batch effects are everywhere
    • SVA, RUV, etc...
  • What makes primary cancer different from metastatic cancer?
    • Can we eliminate steps in the experiment -> data analysis pipeline to reduce the time to answers
    • e.g. recount2: 70k+ human RNA-seq samples preprocessed into uniformly processed expression levels
      • but most of this data is missing phenotype information
      • even when phenotypes are provided, they may not be provided uniformly (no structured data models, lots of freeform text)
    • Predicting phenotypes: train using TCGA and GTEx, predict phenotypes in recount2 dataset
    • This is just one example, phenotype prediction, polygenic scores, enterotypes, ...
  • "I'm going to do a little math here" -- comparing actual and predicted data in a linear model
    • "Predictions are much skinnier" than the actual data -- underestimating variance due to prediction
    • Categorical case even worse
    • Seems circular, take X, predict Y, then fit relationship between X and predicted Y, but is common!
      • e.g. predict individual phenotype based on relatives, then perform GWAS on the predicted phenotype
  • Three models
    • Key observation:
      • Consider observed values vs predicted, looks like a simple regression model
        • Even with SVM, RF, Neural Net
    • Model Y (observed) conditional on Y_p (predicted) and estimate that relationship using a linear model
      • $$ g( E[ Y | Y_p ] ) = \gamma_0 + \gamma_1 Y_p $$
      • Predict, bootstrap, estimate
    • E.g. in the categorical case, improves relationship and consistency of the t statistic comparison
    • and in the continuous case t statistics are very close to those from the real data
  • Application
    • Post-mortem brain tissue: RNA degradation is a problem
    • Predict RNA quality in SRA
    • Uncorrected, standard errors are too low; the bootstrap correction approach successfully corrects them
  • JEFFBOT
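
A small simulation can make the "skinnier predictions" point concrete. This is a sketch under stated assumptions, not the method from the talk: the data-generating model, the covariate names `x` and `z`, and the helper `slope_and_se` are all hypothetical, and the predictor is assumed to have learned the true conditional mean exactly. Even in that best case, predicted outcomes have less variance than observed ones, and a regression of the predictions on `x` reports a standard error that is too small.

```python
import random
import statistics


def slope_and_se(x, y):
    """Ordinary least squares slope of y on x, with its standard error."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, se


rng = random.Random(0)
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]  # covariate of interest
z = [rng.gauss(0, 1) for _ in range(n)]  # other inputs used by the predictor

# Observed outcome: signal plus irreducible noise.
y_obs = [2 * xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
# Predicted outcome: assumed perfect conditional mean, so the noise is absent.
y_pred = [2 * xi + zi for xi, zi in zip(x, z)]

var_obs = statistics.variance(y_obs)
var_pred = statistics.variance(y_pred)
slope_obs, se_obs = slope_and_se(x, y_obs)
slope_pred, se_pred = slope_and_se(x, y_pred)

print(f"variance: observed {var_obs:.2f}, predicted {var_pred:.2f}")
print(f"slope SE: observed {se_obs:.4f}, predicted {se_pred:.4f}")
```

Both slopes land near 2, but the standard error from the predicted outcome is understated, which is the problem the bootstrap correction in the notes is meant to address.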

Elli Papaemmanuil

  • Population studies -> Analytical tools -> Clinical decision support

  • Identify (genomic) biomarkers that can define disease, inform therapy

  • Requires large scale population studies

    • Thousands of samples + deep phenotyping + multiple assays
    • Needs well curated, structured, annotated data
  • Data generation, analysis, reporting lifecycle

    • "Control freak in me wanted to develop a framework to manage the data generation process"
    • Importantly, all data structured to serve not just initial project, but future meta-analysis for the community
      • Biggest catalyst for innovation in translational research
    • Questions
      • Small scale: where is the data, how was it analyzed, is there more data for this patient, ...
      • Large scale: 1 patient x 10 samples x 10 analyses x 10k individuals --> millions!
        • State of the art tools,
        • Data utilization, integration, meta-analysis to support R&D,
        • Digital biobanks! Major institutional asset
  • First, map out the process from generation, analysis, interpretation

    • "If we do something more than once it should be automated"
  • ISABL: a platform for scalable bioinformatics operations

    • Data management
    • Audit trail for everything: enhance reproducibility
    • Automation: reduce costs
    • Postgres DB, RESTful API (Django), User Interface (Vue.js, yay!)
    • Open source -- oh wait not yet but will be
    • Docker compose configuration provided, plug and play, PyPI
    • Patient centric model
    • Multiple ways to ingest data: Excel import, REST APIs (OpenAPI docs, yay!)
    • Integrations: e.g. integration with sequencing core submission, and return
    • Command line client: "completely pipeline agnostic"
      • "Tools are registered as applications", hmm... another tool definition format?
      • ...and workflows
  • Putting this in a clinical setting -- too fast for me

  • Vue.js Web UI, looks very interactive/reactive

    • Embeds visualization of "raw" data, e.g. IGV.js, circos (in a clinical report?!)
  • In production: 200 projects, 23k analyses

  • "Isabl is becoming the platform for computational oncology at MSKCC"

  • Questions (actual questions are hard to hear)

    • "How do you handle who can access what"
      • "Anyone who works with us, anyone can access any data" WHAT?!
      • Do have "enabling of permissions, easily implemented in the backend"
      • Segregating data may not be a good idea: it may protect a PI in the short term but comes at a cost. Should advocate for data sharing
    • EPIC and getting data directly from the EHR
      • If the data is structured and API-enabled, it can be pulled through, but they have not worked with EPIC
    • Relationship to cBioPortal
      • They just talk to each other, data can be pushed to cBioPortal through API
    • Standardize metadata entry
      • Upon entry have a correction, validation process to make input uniform
    • Confidentiality
      • They are in research setting so this is PHI free, but link to clinical data exists

Developing software to build networks and perform data integration in precision oncology (Simina Boca)

@siminaboca

  • Project goal: CDGnet (Cancer Drug Gene Network)
    • Help researchers expand targeted therapies for individuals with cancer: e.g. more individualized therapies that have fewer side effects
  • Precision oncology: identify biomarkers (genetic, mRNA, protein levels) that allow tailoring interventions
  • Tumor molecular profiling is now becoming routine at the time of initial tumor identification
    • e.g. KRAS mutation and EGFR inhibitors
    • or ER+ breast cancer and tamoxifen, HER2+ and trastuzumab
  • In many cases molecular profiling is used after the patient has progressed through multiple lines of therapy / has few options left
  • How to include pathway information? Look at downstream targets of oncogenes
  • Therapy prioritization: 4 categories
    1. FDA approved drug for their tumor type -- better evidence but fewer options
    2. approved in other tumor types
    3. drugs which target downstream of input oncogene ... pathway corresponding to this tumor type
    4. ... other tumor types -- worse evidence but more options
  • Goal: approach which is automated, transparent, ... (omg so fast)
  • Back to CDGnet. A shiny application
    • Can select cancer types, filter based on various criteria
    • e.g. looking for category 3 recommendations
      • Based on molecular profile, identify genes downstream in the network that have associated drugs
    • How to choose pathways?
      • Currently using KEGG
        • well curated by experts, but sometimes out of date, not all cancer types included
        • due to KEGG's data access policies, need to use some workarounds to get the data, not easy -- difficult to update
    • How to choose oncogenes?
      • Also provided by KEGG
      • Problem, oncogenes are an oversimplification of reality
        • Not necessarily an oncogene in all tissue types
        • Based on some discretization from a model
        • Consider using oncogenes from the pan cancer TCGA paper
    • How to get the gene drug connections?
      • DrugBank database: comprehensive and expert curated
      • Not tissue-specific. Maybe a problem?
      • Possible alternative tools in Bioconductor: ChemmineR, paxtoolsr
    • How to get FDA approved therapies for Biomarkers?
      • Manual curation
        • First list of targeted therapies from NCI
        • Followed links manually to get drug labels and curated from "Indications and usage"
        • Does not scale, currently working on an approach to automate using NLP
    • bioRxiv 605261
  • Future
    • Categorize by therapy classes, rank and reduce number of unique recs
    • Expand beyond KEGG, allow user defined
    • Moar automation
    • Drill deeper into 1-2 cancer types
  • Questions
    • Does the FDA know you have to go through all this manual curation?
      • Some lists exist, but for targeted therapy they have not found a better way
    • Did you implement into molecular tumor board
      • Not yet but considering
    • Consider using pharmacogenomic data (didn't hear specific database)
      • No.
    • What about Tumor Suppressor Genes?
      • Not considered yet.
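
The four therapy prioritization categories from the talk can be sketched as a small lookup over a pathway graph. This is only an illustration of the idea as noted above, not CDGnet's implementation: the function names, the dictionary layout, and all gene, drug, and tumor names (`GENE_A`, `drug_1`, `tumor_x`, ...) are hypothetical placeholders.

```python
from collections import deque


def downstream_genes(edges, start_genes):
    """All genes reachable from start_genes by following pathway edges."""
    seen = set()
    queue = deque(start_genes)
    while queue:
        gene = queue.popleft()
        for nxt in edges.get(gene, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def categorize(drug, tumor_type, altered_genes, pathways):
    """Return evidence category 1-4 for a drug, or None if no category applies."""
    target = drug["target"]
    if target in altered_genes:
        # Categories 1 and 2: the drug targets an altered gene directly.
        return 1 if tumor_type in drug["approved_for"] else 2
    if target in downstream_genes(pathways.get(tumor_type, {}), altered_genes):
        # Category 3: downstream target in the pathway for this tumor type.
        return 3
    for other_type, edges in pathways.items():
        if other_type != tumor_type and target in downstream_genes(edges, altered_genes):
            # Category 4: downstream target in a pathway for another tumor type.
            return 4
    return None


# Hypothetical toy pathways, keyed by tumor type: gene -> downstream genes.
pathways = {
    "tumor_x": {"GENE_A": ["GENE_B"], "GENE_B": ["GENE_C"]},
    "tumor_y": {"GENE_A": ["GENE_D"]},
}
drugs = [
    {"name": "drug_1", "target": "GENE_A", "approved_for": {"tumor_x"}},
    {"name": "drug_2", "target": "GENE_A", "approved_for": {"tumor_y"}},
    {"name": "drug_3", "target": "GENE_C", "approved_for": {"tumor_y"}},
    {"name": "drug_4", "target": "GENE_D", "approved_for": {"tumor_y"}},
    {"name": "drug_5", "target": "GENE_E", "approved_for": {"tumor_y"}},
]
categories = {
    d["name"]: categorize(d, "tumor_x", {"GENE_A"}, pathways) for d in drugs
}
print(categories)
```

Lower category numbers correspond to better evidence but fewer options, matching the ordering in the notes.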

Break, then two sessions of contributed talks next.

I'm in "1a - Single-cell theme"


Orchestrating single-cell analysis with Bioconductor (Robert Amezquita)

bioRxiv

  • Introduction: Fred Hutch Bezos Family Immunotherapy clinic
    • Translational data science integrated research center: link between computational biologists and clinicians
    • Next generation immunotherapy
      • Isolate T cells, genetically modify cell surface receptors, reintroduce -- CAR-T cells "living drug"
      • How is this process being monitored? Profiling at single cell resolution
      • What tools are being used? Bioconductor of course.
  • Huge growth over last ~9 years in single cell tools in Bioconductor, organized around SingleCellExperiment
  • Book on single-cell orchestration coming soon

Analysis of multi-sample multi-group scRNA-seq data (Heather Crowell)

  • Cell type: permanent aspects of a cell's identity vs cell state: more transient aspects

  • Differential analysis: within a cluster (type prediction) find differences in state between groups (e.g. healthy/disease)

  • cell- vs sample-level approaches: aggregate across features

  • Simulation studies

    • Cluster of methods that all perform well and similarly
      • pseudobulk methods + edgeR
    • How well do methods agree?
      • Where a lot of methods agree, there is a lot of truth
      • Lots of false discoveries that are method specific (singletons)
  • Study: 8 mice, WT vs LPS treated, treatment seen as shift in pseudobulk

  • recommendation

    • Fast and scalable
    • use aggregation + established tests (edgeR, DESeq2, limma)
  • Questions

    • How many cells do you need
      • More is of course better. Did not see certain methods affected more/less.
      • Reasonable results with 30 per cluster per sample
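
The aggregation step behind the pseudobulk recommendation can be sketched as summing counts over all cells that share a cluster and a sample. This is a minimal illustration, not the speaker's code: the record layout and the names `pseudobulk`, `gene_a`, `s1`, and so on are hypothetical, and the downstream test (edgeR, DESeq2, limma) is not shown.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum per-cell gene counts within each (cluster, sample) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["cluster"], cell["sample"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


# Hypothetical toy data: four cells, two clusters, two samples.
cells = [
    {"cluster": "c1", "sample": "s1", "counts": {"gene_a": 3, "gene_b": 0}},
    {"cluster": "c1", "sample": "s1", "counts": {"gene_a": 1, "gene_b": 2}},
    {"cluster": "c1", "sample": "s2", "counts": {"gene_a": 5, "gene_b": 1}},
    {"cluster": "c2", "sample": "s1", "counts": {"gene_a": 0, "gene_b": 4}},
]
bulk = pseudobulk(cells)
for key in sorted(bulk):
    print(key, bulk[key])
```

Each (cluster, sample) pair then behaves like one bulk RNA-seq library, which is why established bulk methods can be applied per cluster.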

CellBench: A Framework for Evaluating Single Cell Analysis Pipelines (Shian Su)

  • Motivation: manage the complexity of benchmarking single-cell methods
  • Normalization, imputation, clustering, trajectories, diff exp, ...
  • Benchmarking pipelines as a whole, possible pipelines grow combinatorially
  • CellBench
    • Simple and flexible, easy to combine methods, tidyverse compatible
    • Framework
      • lists of datasets (SingleCellExperiment), methods, run apply_methods to expand out and evaluate combinations
      • method wrappers allow methods of a given type to have a uniform interface
      • s/apply_methods/time_methods/ gives timing information
      • sweep over different parameter values at each step
      • errors are propagated through rather than failing the whole analysis
  • Quick Application
    • Applying four different clustering methods across three different protocols
    • Sampling different numbers of cells to measure time complexity
  • Summary
    • Not single cell specific
    • Inspired by DSC Python package
    • Mostly powered by BiocParallel
    • See also SummarizedBenchmark

mbkmeans: fast clustering for single cell data using mini-batch k-means (Stephanie Hicks)

@stephaniehicks

  • Analyzing big data sets: 500k cells, 1M cells, ...
    • Might not even be able to load into memory, so what to do?
  • Want
    • Fast method to cluster multiple times thousands to millions of cells in PCA space (may fit in memory)
    • Cluster data on disk, data never fully loaded into memory, e.g. on top of an HDF5 file
    • Sometimes speed is more important than accuracy (e.g. for normalization)
  • Good ol' k-means clustering
  • Mini-batch k-means (Sculley 2010)
    • At each iteration, only use a random subset of the data
      • Only the distance between the mini-batch elements and centroids need to be computed
  • Existing implementations?
    • ClusterR exists but requires data to be stored in memory
    • MiniBatchKMeans exists in scikit-learn, can use an HDF5 file, but still loads the full matrix into memory
  • So... mbkmeans, a Bioconductor package. Input can be any matrix, e.g. DelayedArray, HDF5Array.
  • Preliminary results
    • Bivariate Gaussian in 2D, three groups, then project into high dimensional space
      • Accuracy improves with larger batches, good at 5% and overlapping full kmeans at 25%
      • At 5% the compute time is reduced, and further reduced as the number of data points increases
      • Memory savings come more from using mini batches than from not loading the data into memory!
  • Preliminary results using HCA data
    • 12 mins for clustering full 333k x 4k matrix (DelayedMatrix), ...
  • Next
    • Other clustering methods, e.g. DBSCAN and BIRCH
    • More comparisons to other methods
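
The mini-batch update described above (Sculley 2010) can be sketched in plain Python. This is an in-memory illustration of the algorithm only, not the mbkmeans package: the function names are hypothetical, the toy data is two well-separated Gaussian groups, and the initial centers are passed in explicitly to keep the demo deterministic (the original algorithm picks them at random).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mini_batch_kmeans(points, init_centers, batch_size, n_iter, rng):
    """Mini-batch k-means with a per-center learning rate of 1 / count."""
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Cache the nearest center for every batch element before updating.
        nearest = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


rng = random.Random(1)
group_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
group_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]
points = group_a + group_b

# Batches of 50 are 5% of the data; only batch-to-center distances are computed.
centers = mini_batch_kmeans(
    points, [points[0], points[-1]], batch_size=50, n_iter=40, rng=rng
)
print(centers)
```

Only `batch_size` points are touched per iteration, which is what makes the approach attractive when the full matrix lives on disk.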

Gene Set Enrichment Analysis with Multi-omics Data (Deepayan Sarkar)

  • Multi-omics experiments increasingly common, e.g. NCI-60 panel
    • 60 cell lines, 9 types of cancer tissue
    • Data on mRNA, protein, micro-RNA, DNA me, etc.
  • Basic setup
    • Have performed per gene differential expression analysis, test statistic from standard pipelines (e.g. limma)
    • Have such values for "multiple omics"
    • Is a set of genes different as a group given multiple omics measures
  • Method 1: average t-statistic values
    • Need to account for correlation
  • Method 2: multivariate linear model of T
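
Why Method 1 needs to account for correlation can be seen from the variance of an average. As a sketch, assume each of the m statistics T_i being averaged has unit variance under the null, with pairwise correlations \rho_{ij} (this notation is an assumption, not taken from the talk):

$$ \mathrm{Var}\left( \frac{1}{m} \sum_{i=1}^{m} T_i \right) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \rho_{ij} = \frac{1 + (m - 1)\bar{\rho}}{m} $$

Here \bar{\rho} is the mean off-diagonal correlation. Under independence this reduces to 1/m, so any positive correlation among genes or omics layers makes the naive 1/m variance too small and the averaged statistic look more significant than it is.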

Lunch


Challenges in label free mass spec (Lieven Clement)

  • MSqRob Workflow
  • Imputation is detrimental
  • Robust summarization avoids the need for imputation
  • Robust inference with linear models further improves performance
  • bioRxiv 668863
  • Implemented in MSnbase Bioconductor package

cleanUpdTSeq and InPAS for accurate identification and differential usage of polyA sites (Julie Zhu)

  • Polyadenylation -- addition of poly-A tail to nascent RNA
  • Alternative polyadenylation
    • widespread
    • generates transcript/protein diversity
    • post-transcriptional regulation (alternative 3' UTR)
    • highly regulated process
  • Sequencing approaches
    • PAS-seq, PolyA-seq
      • Oligo-dT priming, imperfect end identification
    • 3P-seq
      • Direct ligation to 3' end
  • Measuring accuracy: how well can the polyA signal motif be identified
  • Better identification of internal priming -- false positive pA sites
    • Training sets of true and false sites
    • -> model selection
    • -> compare to heuristic filtering
    • 3P-seq to define true set, RNA-seq to define false set (both must be in PAS-seq)
    • Training data represents expected populations
    • Algorithm features
      • Upstream: presence/absence of hexamers
      • Downstream: single nuc frequency, dinuc frequency, average As to pA site
      • Naive bayes classifier
  • Putative sites identified by PAS-seq only with this classifier
    • Common (with 3P) have higher read coverage, so likely just missed due to depth?
  • Biological validation in zebrafish: 98% accuracy vs 88%
  • InPAS: identify novel polyA sites from RNA-seq data
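
The classifier features listed above (upstream hexamer presence/absence, downstream single-nucleotide and dinucleotide frequencies) can be sketched as a feature extractor. This is an illustration only, not cleanUpdTSeq's code: the function name, the choice of the two hexamers AATAAA and ATTAAA, and the toy sequences are assumptions, and the naive Bayes step itself is not shown.

```python
from collections import Counter


def pa_site_features(upstream, downstream, hexamers=("AATAAA", "ATTAAA")):
    """Features for one candidate pA site from its flanking sequences."""
    feats = {f"has_{h}": h in upstream for h in hexamers}

    # Single-nucleotide frequencies downstream of the candidate site.
    n = len(downstream)
    mono = Counter(downstream)
    for base in "ACGT":
        feats[f"freq_{base}"] = mono[base] / n

    # Dinucleotide frequencies over overlapping pairs.
    di = Counter(downstream[i:i + 2] for i in range(n - 1))
    for first in "ACGT":
        for second in "ACGT":
            feats[f"freq_{first}{second}"] = di[first + second] / (n - 1)
    return feats


# Hypothetical candidate: hexamer upstream, A-rich stretch downstream.
features = pa_site_features(upstream="GGAATAAAGG", downstream="AAAAAAAACG")
print(features["has_AATAAA"], features["freq_A"], features["freq_AA"])
```

An A-rich downstream region is the signature of internal priming by oligo-dT, so high `freq_A` and `freq_AA` values are the kind of evidence a classifier could use to flag a false positive site.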

Learning cis-regulatory code at base-pair resolution (Anshul Kundaje)

@anshulkundaje

  • "Interpreting relatively blackbox deep learning models"
  • Types of data for profiling regulatory information: ChIP-seq, ATAC-seq/DNase-seq
  • Modeling framework
    • Given an assay like ChIP-seq or ATAC-seq
    • Consider genome bins, 200bp - 1kb
    • Assign either binary or quantitative label to each bin
    • Learn relationship between DNA sequence in bin and label
  • Here using (deep convolutional) neural networks, with one-hot encoding of the DNA
  • "Democratizing ML for genomics" -- kipoi.org a model zoo for genomics
    • Wrappers for running models, simplified model comparison
    • Usually performance is based on how you clean the data, train the models (e.g. batches), design the labels
  • ChIP-exo / nexus -- ultra high resolution TF binding profiles
    • BPNet: DNA sequence to base-pair resolution profile regression
      • Just predict + and - strand base pair level read counts
        • optimize linear combination of two loss functions
        • MSE loss for log total counts
        • multinomial loss for capturing distribution of counts across regions
      • Performance metric
        • set SN threshold on observed data (binarize spikes)
        • allow some slack (1bp, 5bp, 20bp) and consider auPRC
        • as good as replicates for spikes
        • Less good for total count prediction
          • Spike prediction is more local, where total occupancy depends on more global features
  • Interpreting the "black box"
    • DeepLIFT: Back propagate importance through the network
      • e.g. Oct4 distal enhancer, footprints for all of Klf4, Nanog, Oct4, Sox2
    • TF-MODISCO
      • Inferring globally predictive motifs across all binding events
        • 43 distinct motifs to explain the binding of 4 TFs!
      • Identifies TF binding sites in lone transposable elements
    • 10.9bp periodic pattern around NANOG site -- major groove
    • Combinatorial binding
    • Predicting expression in MPRA using BPNet model
  • Tutorials: kundajelab.github.io/dragonn
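
The two-part loss noted for BPNet (a multinomial loss on the shape of the profile plus an MSE loss on log total counts, combined linearly) can be sketched for a single strand of a single region. This is not the BPNet implementation: the function names, the `count_weight` parameter, the use of log(1 + total), and the toy numbers are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def multinomial_nll(counts, logits):
    """Negative log-likelihood of per-base counts given profile logits."""
    n = sum(counts)
    log_p = log_softmax(logits)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    return -(log_coef + sum(k * lp for k, lp in zip(counts, log_p)))


def combined_loss(counts, profile_logits, pred_log_total, count_weight=1.0):
    """Profile (multinomial) loss plus weighted MSE on log total counts."""
    profile_loss = multinomial_nll(counts, profile_logits)
    count_loss = (math.log1p(sum(counts)) - pred_log_total) ** 2
    return profile_loss + count_weight * count_loss


# Hypothetical 5 bp window with a spike in the middle.
counts = [0, 1, 8, 1, 0]
true_log_total = math.log1p(sum(counts))
loss_peaked = combined_loss(counts, [-5.0, 0.0, 2.0, 0.0, -5.0], true_log_total)
loss_flat = combined_loss(counts, [0.0] * 5, true_log_total)
print(loss_peaked, loss_flat)
```

Separating the two terms lets the profile head focus on where reads fall within the region while the count head models how many there are, in line with the note that spike prediction is local and total occupancy depends on more global features.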

Break

Three reasons to Bioconductor on Containers (Nitesh Turaga)

  • Containers
    • Completely isolated environments, don't bundle the full OS, lightweight, cross-platform
    • Bioconductor using Docker because it is popular, has lots of open-source tooling
  • Reason 1: No longer worry about system dependencies
    • Bioconductor packages depend on other Bioconductor packages, system dependencies, ...
    • Containers make it easier for developers to distribute packages and users to get them
    • bioconductor/bioconductor_full:RELEASE_3_9 -- container with full Bioconductor release
    • comes with RStudio inside
  • Reason 2: Reproducibility
    • Image stays in a steady state, can ensure you are always running in the same environment
    • Dockerfile tells you exactly how that image was built
  • Reason 3: Sharing your work.
  • Future
    • Starting a Google cloud cluster -- gcloud container clusters
    • Start kubernetes pod with Bioconductor and redis
    • Using BiocParallel, communicating with redis, across five workers

Bioconductor at "cloud scale" (Vince Carey)

  • Cloud oriented agenda for Bioc:
    • Simplify discovery of data and methods
    • Evaluate and optimize representations of data to foster cloud-scale data science
    • Reduce barriers to distributed/scalable analysis for learning from cloud-scale genomic data
  • Task for discovery: improved annotation
    • Annotation is often highly ad hoc
    • e.g. Bioconductor:Cancer43k Shiny app
    • Annotation: discrete or continuous
      • example 100 topic model from LDA on 988 cancer related abstracts
  • Suggestion for enhanced discovery
    • Use all available information, build annotations based on semantic distances
  • Data representation
    • HDF scalable data service + DelayedArray
    • HSDS server is multiplexed, concurrent
    • Need more work on the many different approaches to representing data in cloud...
    • (a million options presented here...)
  • Conclusions
    • Bioc has been cloud friendly for a long time
    • Need to help with discovery, representation, scalable analysis
    • Avenues to explore
      • Semantic neighborhoods
      • Unified representation
      • Container orchestration
      • Need more experimentation and communication!

Superintronic: tidy coverage analysis and range-wise diagnostics (Stuart Lee)

@_StuartLee

  • Core data structure of Bioconductor -- GRanges -- is a tidy data structure
  • plyranges -- add dplyr concepts to Bioconductor
  • superintronic: coverage and range based summarization
    • experimental design, summarization, visualization
  • Accounting for experimental design
    • design is a data frame; design %>% compute_coverage_long(source="bam")
  • Finding interesting regions of coverage
    • Scatterplot diagnostics (Tukey's)
    • Rather than look at large numbers of scatterplots, compute summaries of scatterplot shape/etc
    • For a given range index x set of keys; rangle( x, .var=score, .index=seqnames, .funs=ranglers() )
      • e.g. mean / var of sliding window means across all regions of interest
        • bumpy, lumpy, max mean shift, max variance shift, count distribution, flat spots, crossing points
        • visualize all summaries on a biplot
  • Example: ranglers for intronic sequence coverage in RNA-seq
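
The idea of replacing many coverage plots with a few numeric summaries per range can be sketched as follows. This is not superintronic's implementation (an R package): the function names, the window width, and the definition of "max mean shift" used here (largest jump between consecutive window means) are assumptions made for illustration.

```python
import statistics


def window_means(scores, width):
    """Means of all sliding windows of the given width."""
    return [
        statistics.fmean(scores[i:i + width])
        for i in range(len(scores) - width + 1)
    ]


def summarize_range(scores, width):
    """A few shape summaries of a coverage vector for one range."""
    means = window_means(scores, width)
    shifts = [abs(b - a) for a, b in zip(means, means[1:])]
    return {
        "mean_of_window_means": statistics.fmean(means),
        "var_of_window_means": statistics.variance(means),
        "max_mean_shift": max(shifts),
    }


# Hypothetical per-base coverage for two ranges: one with a step, one flat.
coverage = {"range_1": [0, 0, 4, 4], "range_2": [2, 2, 2, 2]}
summaries = {
    name: summarize_range(scores, width=2) for name, scores in coverage.items()
}
for name, summary in summaries.items():
    print(name, summary)
```

Both ranges have the same average coverage, but the variance and shift summaries separate the stepped range from the flat one, which is the kind of contrast the summaries are meant to surface.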

BiocProject: Bioconductor oriented approach to project management (Nathan Sheffield)

databio.org/slides

  • Most workflows require individual metadata organization
  • Then other tools require a different organization
  • Solution: represent the metadata in a common way, design adapters ("plugs") so each tool sees the data in the form it wants
  • Portable Encapsulated Project ("PEP")
    • Pipelines and workflows understand the format, but also could be used for sharing, etc.
  • peppy python package, pepr R package, geofetch to create a PEP from GEO, looper runs arbitrary commands over a PEP representation
  • Format: project_config.yaml (project level attributes) and a CSV file for sample level attributes
  • BiocProject integrates PEP into Bioconductor

Storage and analysis of microbiome quality control data (Karun Rajesh)

github.com/karunrajesh/MBQC_Phyloseq

  • Microbiome quality control
    • Only a small number of quality control and lab-to-lab variability studies exist
  • Process pipeline
    • Several types of samples: fresh samples, artificial samples, negative controls
    • 15 different extraction labs (blinded)
    • 9 different bioinformatics processing labs
    • Result: OTU tables
    • integrated OTU tables nearly 1TB
  • Here: compress all of this data into a phyloseq object, only 28Mb
  • Results:
    • \alpha diversity highly variable across the 15 labs: significant lab-to-lab variation
    • Oral mock community, 22 true species, different bioinformatics labs detect different taxa
    • PCoA: PC2 (13%) appears to largely be bioinformatics lab (is PC1 processing lab?)
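
For readers less familiar with alpha diversity, one common index is Shannon diversity, computed per sample from an OTU count vector. The notes do not say which index the study used, so this is only an example of the general idea; the function name and the per-lab counts are hypothetical.

```python
import math


def shannon_diversity(otu_counts):
    """Shannon index: -sum(p * ln p) over OTUs with nonzero counts."""
    total = sum(otu_counts)
    return -sum(
        (c / total) * math.log(c / total) for c in otu_counts if c > 0
    )


# Hypothetical OTU counts for the same sample as reported by three labs.
otu_tables = {
    "lab_1": [10, 10, 10, 10],
    "lab_2": [37, 1, 1, 1],
    "lab_3": [20, 20, 0, 0],
}
alpha = {lab: shannon_diversity(counts) for lab, counts in otu_tables.items()}
for lab, value in alpha.items():
    print(lab, round(value, 3))
```

The same underlying sample yields different diversity values depending on how evenly each lab's table spreads counts across taxa, which is the sort of lab-to-lab variation reported above.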

PulmonDB: curated gene expression database of lung diseases (Ana Beatriz Villaseñor-Altamirano)

  • Talking about COPD, IPF; lung diseases associated with smoking
    • In COPD alveoli are partially destroyed
    • In IPF excess ECM is produced
  • Significant amount of transcriptome data available, thousands of transcriptome datasets for COPD in GEO
  • Goal: generate a COPD/IPF transcriptome database from public data
  • Extract data from recount2
    • filter for COPD/IPF and raw data availability
    • use "COMMAND" to curate datasets into a controlled vocabulary
    • store in a database: pulmonDB
  • 3,000+ samples, across multiple platforms, 200+ from RNA-seq
  • Making it available
    1. Website for data exploration
    2. Availability in R environment
      • PulmonDB package allows creating SummarizedExperiment and annotation as Dataframes from the SQL database
  • COPD Case Study
    • Lung tissue, contrasts COPD/Healthy vs Healthy/Healthy
      • 132 genes over/under expressed

Snapcount (Chris Wilks)

github.com/langmead-lab/snapr

  • "Are we doing the kind of research we could be doing if public data was available at scale"
  • SRA: 100's of TB of raw sequence data
  • Search engine "stack" for RNA-seq
    • Rail-RNA -- the crawler
    • recount2 -- the database
    • Snaptron -- the search engine
  • Snaptron database
    • splice junctions x samples x counts
  • Snapcount
    • Bioconductor interface to the Snaptron web services
      • Query all data in recount2 plus some additional studies
      • Region centric: slice a genomic region across all samples (~2000 studies)
      • Output: RangedSummarizedExperiment object
    • Other questions snapcount can answer
      • Relative usage of splicing patterns
      • Tissue enrichment
      • ...
  • What's next: recount3!
    • redoing the whole alignment pipeline, annotation-free
    • rerun with new runs including MOUSE!
    • streamlined pipeline that users can run on their own samples and integrate

End of talks!

