michaelmhoffman/bioc2018.md

## bioc2018.md

      
Conference info: https://bioc2018.bioconductor.org/
My first Bioconductor meeting, and I'm not a BioC or R expert so these notes are probably going to be naïve!
Contents


Developer Day
Day One

Developer Day

(I arrived at 1:00pm and missed the morning sessions)
⚡️Lightning talks II

Meetups (Aedin Culhane)


Boston first, then NYC
Bioconductor has a community Slack with a #meetups channel
Boston: 40 to 100 people turn up

BiocFileCache (Lori Shepherd)

Local file management. Cache files locally to avoid downloading from remote sources when not needed. Also aims to give a better way to organize files.
BiocFileCache() creates the cache, which is backed by a SQLite database.

bfcadd( rname=..., fpath=...) adds an existing file to track in the cache
bfcnew( rname=... ) gives a new path in the cache
bfcneedsupdate() checks whether a dataset has changed remotely and needs to be downloaded again
bfcquery(...) searches for datasets in the cache
bfcrpath(...) gives the local path of a file by id

Can also attach metadata to datasets.
VariantExperiment (Qian Liu)


Stored variants (genotypes, multiple assays, multiple individuals).
Extends RangedSummarizedExperiment.
Can construct from gds file or from a vcf file.
Subsetting and range slicing.
"Many statistical methods are defined". Example: hwe.

Scalable computing in Bioconductor (Nitesh Turaga)


map/reduce in R: lapply( X, FUN, ...).
BiocParallel: bplapply( ..., bpparam )
bpparam determines the BiocParallel backend to use
e.g. SerialParam or MulticoreParam
New: scaling across clusters, BatchtoolsParam( workers=..., cluster=... )

cluster might be SGE, SLURM, LSF, PBS, et cetera


Example: Salmon pseudoalignment

Instantiate BatchtoolsParam with an institution-specific template
Write a function that processes a single sample
Pass function to bplapply to run in parallel

Progress Bar!
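
The steps above follow a general pattern: write a function for one sample, then hand it to a parallel apply. A minimal sketch of that pattern in Python, not the BiocParallel API. The names `process_sample` and `run_all` and the sample IDs are made up for illustration, and a local thread pool stands in for the cluster backend:

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id):
    # Hypothetical stand-in for the per-sample work (e.g. one quantification run).
    return {"sample": sample_id, "name_length": len(sample_id)}


def run_all(sample_ids, workers=1):
    # workers <= 1 plays the role of a serial backend;
    # otherwise the same function is mapped over a pool of workers.
    if workers <= 1:
        return [process_sample(s) for s in sample_ids]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, like an apply over a list.
        return list(pool.map(process_sample, sample_ids))


samples = ["sampleA", "sampleB", "sampleC"]  # hypothetical sample IDs
serial = run_all(samples, workers=1)
parallel = run_all(samples, workers=3)
print(serial == parallel)
```

The point of the design is that only the backend argument changes between a serial run and a parallel one; the per-sample function stays the same.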


Benefits: easier cluster management, ...
https://github.com/nturaga/BatchtoolsParam_examples

Workshops (Levi Waldron)

How the materials for the conference workshops (running Thursday/Friday) were built.

https://github.com/Bioconductor/BiocWorkshops
What's different about (BioC conference) workshops this year?
Collection of Rmd files collated with bookdown.

130 package dependencies (ouch!)
Coordinated with GitHub issues+tags


Produces a single gh-pages site with all of the workshops (https://bioconductor.github.io/BiocWorkshops)
Building an AMI for all workshops

Using packer.io to build AMI with everything needed for every workshop


Organism.dplyr (Daniel van Twisk)


Alternative interface to the org.* packages, similar purpose to OrganismDbi
Any organism with both an org.* and a TxDb package can be used
src_organism( "org..." ) provides the interface. Compatible with all methods from dplyr
11 genomic coordinate extractor methods available, e.g. transcripts gets a GRanges, transcripts_tbl gets a tibble
Examples of a variety of complex filters (too small to read!)

Birds of a Feather

Parallel tracks, Peter Hickey on Effectively using DelayedArray and Levi on New Data Structures for Bioconductor. (Sorry @PeteHaitch but I can only choose one).
New Data Structures for Bioconductor (Levi Waldron)

First, a presentation "Why re-use core classes: A plea to developers of Bioconductor packages" (Levi).

What is Bioconductor? 1,400 packages on a backbone of data structures.

e.g.: GenomicRanges, SummarizedExperiment


Why do core classes matter?

Suppose you want to build a rocket powered bike. You could start from raw steel and forge your own frame.

But your frame has limited testing, and probably doesn't handle many use cases


It is easy to define a new S4 class in R

But it is very difficult to build a robust and flexible class for genomic data analysis


Example from phylogenetics / microbiome packages: not using common classes --> limits interoperability


What are the core classes?

SummarizedExperiment, GenomicRanges, Biostrings, GSEABase, MultiAssayExperiment, SingleCellExperiment, MSnbase...
https://bioconductor.org/developers/how-to/commonMethodsAndClasses/


Core classes represent years of work and maintenance and have been used by tens of thousands of users

Discussion


Q: What would you do instead of defining your own class (in the case of phyloseq)?


.MicrobiomeExperiment <- setClass("MicrobiomeExperiment", 
        contains="SummarizedExperiment", 
        representation( rowData="MicrobiomeFeatures" )
)


Gives you the benefits of SummarizedExperiment, compatible with MultiAssayExperiment


Exploring S4 classes

extends tells you what superclasses a given class extends

e.g. RangedSummarizedExperiment is a SummarizedExperiment, Vector, Annotated


showClass adds known subclasses and what slots it contains
methods tells you what methods are defined for a class

e.g. 100+ methods on SummarizedExperiment (but 54 of those are from parent classes)


Example that has "done things right": SingleCellExperiment

extends RangedSummarizedExperiment and defines additional methods...


PharmacoGx Updates (Petr Smirnov)

https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html


WIP: Fixing up to work better with Bioconductor objects


Drug sensitivity data: "treat a cell line with a drug and see how well it kills it"


Structure

molecularProfiles: List of ExpressionSet objects
sensitivity: List of a couple of data frames and an array

Initially Experiment IDs with dose/viability pairs

But drug combinations and other dimensions are not naturally a matrix


Solution: LongArray object

col.ids and row.ids: data.frames

Example (combination_name, drugA, drugB)


Get data as if from a list

"long array but behaves when you use a single bracket as if a matrix"


Slicing example

dcTest is a longArray object with "rows" across 2 variables and "columns" across 1 variable
Slicing: dcTest[c("5-FU", "Bortezomib", "Erlotinib"), "A2058"]
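
To make the "long array that behaves like a matrix under a single bracket" idea concrete, here is a minimal sketch in Python. It is not the LongArray implementation: the class name `LongTable`, the viability values and the second cell line are made up for illustration, and it uses one row variable rather than the two described in the talk.

```python
class LongTable:
    """Long-format (row_key, col_key, value) records with matrix-style lookup."""

    def __init__(self, records):
        self._records = list(records)

    def __getitem__(self, key):
        # key is a (rows, cols) pair; each part may be one label or a list of labels.
        rows, cols = key
        row_set = {rows} if isinstance(rows, str) else set(rows)
        col_set = {cols} if isinstance(cols, str) else set(cols)
        return [rec for rec in self._records
                if rec[0] in row_set and rec[1] in col_set]


# Hypothetical toy data: (drug, cell line, viability).
table = LongTable([
    ("5-FU", "A2058", 0.80),
    ("Bortezomib", "A2058", 0.20),
    ("Erlotinib", "A2058", 0.90),
    ("5-FU", "OTHER_LINE", 0.50),
])

# Same shape of request as the slicing example in the talk.
subset = table[["5-FU", "Bortezomib", "Erlotinib"], "A2058"]
print(subset)
```

Storage stays long, so combinations or extra dimensions only add records, while the lookup still reads like a matrix slice.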


Q: Why is this different from a SummarizedExperiment?

Multiple experiments, e.g. drug combinations of 2, 3, ... n drugs
Followup Q: what about MultiAssayExperiment?

Would lose quick subsetting through multiple dimensions (?)


TxRegInfra (Vince Carey)

https://bioconductor.org/packages/release/bioc/html/TxRegInfra.html

Investigator's idea: eQTL from GTEx, DHS,... from ENCODE, TFBS from FIMO... use this to interpret GWAS hits
Developers: do as little as possible to resolve and keep metadata
Existing resources that could help:

rtracklayer+tabix, GenomicRanges, RaggedExperiment, ...
mongolite


RaggedExperiment

(completely missed this and the slides blanked out)
But it seems to be an experiment where observations have different features, some shared, some not
I need to read this later: https://bioconductor.org/packages/release/bioc/manuals/RaggedExperiment/man/RaggedExperiment.pdf


Example

Data

collection of eQTL from GTEx
encode footprint (not sure what this actually is)
encode DHS hotspots


Documents in a mongodb database (RaggedMongoExperiment)

Every document has a genomic range, so can respond to range queries


Summary

Basic layout: genomic coordinates x sample/tissue type x assay type
MultiAssayExperiment: could work but not an immediate fit


"There is a competitor called Giggle"

https://github.com/ryanlayer/giggle


⚡️Lightning talks III

Bioconductor tricks for dealing with genome annotation (Michael Steinbaugh)

http://steinbaugh.com/basejump/

Recommended packages

GenomicRanges, rtracklayer (GTF -> GRanges), AnnotationHub, ensembldb, GenomicFeatures


basejump extends these tools
Rich metadata columns (GRanges), mcols(...)

iSEE (Charlotte Soneson)

https://f1000research.com/articles/7-741/v1

Interactively explore any data in a SummarizedExperiment object (or subclass)
Multiple panels with different visualizations, can see how they are linked

Managing project metadata with a standard project format (Nathan Sheffield)

https://pepkit.github.io; https://databio.org

Motivation: Most pipelines require individual metadata organization
PEP: a standard format for project metadata -- "Portable Encapsulated Project"
Ecosystem of tools:

format itself: project_config.yaml, samples.csv
peppy: Python package
pepr: R package
geofetch, looper -- map samples onto pipelines and run in different compute environments


Finding Bioconductor Packages (Shian Su)

github.com/shians/biocexplorer

Bioconductor packages are not that easy to find

Prioritization: can sort by title (alpha), author (alpha), not really that useful


Alternative: BiocExplorer:

Prioritize packages based on usage
Provides a graph of packages, prioritizing those that are widely used (not sure what


Recent cloud-scale innovations in Bioconductor (Vince Carey)

So fast... so  small...


Summary first

DelayedArray: seamless element level access to out-of-memory / remote array-like resources
SummarizedExperiment/MAE: a sort of query language for annotated omics resources
Current efforts: improve efficiency of statistical learning using Delayed* resources


DelayedArray backends: HDF5 server, BigTable, ...


HDF cloud / HDF Kita (Example using 10x)


BigTable (Example using OncoTk)


The point (I think): You can work with all of these types of remote data in the current version of Bioconductor


Sesame: a sensible way to analyze a DNAme array (Tim Triche)

https://www.bioconductor.org/packages/devel/bioc/html/sesame.html

Improves masking on hyper-polymorphic region (e.g. MHC)

Community Activities

Brainstorm and prioritize some products that can be produced in ~45 minutes (and then do the thing).
(Martin using slido.com to accumulate suggestions from the attendees)
Voting, winners are:

Come up with a data structure for PharmacoGx data -- 9
Strategies for posting and answering support site questions -- 8
Checks on SummarizedExperiment rownames, rowData()<- -- 7
name clashes between BiocGenerics, S4Vectors etc. and tidyverse -- 6
Pull requests to fix usage and other warnings in core packages, e.g., Rsamtools -- 6
Initiate collaborative development of ... (like iSEE at BiocEurope) -- 6

We'll see what happens...
Summaries of community activities

Support site


Main idea: Template for the "ask a question" box: provide some guidance for how to ask a good question

Reproducible example
Things you tried
sessionInfo


Other ideas (Google doc link)

Collaborative project


"Biocverse": visualizing the Bioconductor ecosystem
Use cases

New users: given a task, show me all packages, ranked by "importance"
Experienced users
New developers


SummarizedExperiment rownames

Fixing a problem in assigning rownames. I think.
PharmacoGx

(more) Discussion of how to store drug sensitivity data.
Store as a database. Write constructor functions that create a matrix / SummarizedExperiment from the database.
There's some code in github in a project called longArray but I can't read the user/org name.
Panel discussion: project directions and opportunities

Q&A for the project leadership team.
Q: "Can you think of ways to expand the network of project leaders, expand ownership, expand people who feel they are part of the project"

Martin: project has life outside the core. e.g. Single Cell developments largely outside the core.
Vince: (turns question back) Is there a lack of recognition or barriers to participation?
Aedin: Surprising how many R people don't know Bioconductor

Q: Mechanism to kick packages out?

M: There are obstacles that discourage people from participating, but the tradeoff is worth it. Quality of packages, having vignettes. Contrast in how tidyverse works with how Bioconductor works, different views on how software should work. Interesting to think about compromises in making those play along.
V: Advantages of putting your package in Bioconductor: development vs release branch (not available in CRAN), vignettes and examples. Not trying to sell Bioconductor methods to people not trying to do them.

Q: Synchronization between Bioconductor and CRAN

M: We communicate regularly. At a technical level there is communication. At a social level, much more restricted. CRAN has task views, but no overlap with Bioconductor

Q: How do you recommend new users learn the Bioconductor ecosystem

Wolfgang: need a new book. Something like a textbook. Challenge defining what it means to use/learn Bioconductor
A: A beginner's user guide to Bioconductor should start with SummarizedExperiment, basics... The thing I direct people to is the f1000 channel
If you want to learn Bioconductor there is a reason for that. You have some data.
V: Talking about content from the edX MOOC
Kasper: introduction should focus on the things that EVERY user of Bioconductor should know. Should streamline and cleanup what is presented.

Q: Funding mechanisms the Bioconductor community should be applying for


M: Historically get a lot of money into one big shop, easier than scrambling for smaller grants from a variety of places. More recently, have started to diversify. e.g. Human cell atlas grants written by a diverse group of people. More junior faculty wanting to expand participation but need funding to do so. But wants to call on James Taylor...


ACK IT ME: What works for Galaxy 1) Core group goes after diverse funding opportunities, one big pot but also lots of other pots 2) Full time funded community outreach. Multiple people across the project dedicated to this. Really makes a huge difference to have someone spending all their time on this so it is never lost/back-burnered [writing what I think not what I say right now]


Q: Increasing participation, diversity, gender balance

M: (I didn't capture this well so nothing here)

Q: Question about connections with Africa (H3A?) training and outreach

A: There are some efforts/connections
V: There is a foundation for Bioconductor devoted to charitable works, has some money, could potentially be used to expand training

(discussion about website here, needs to be simplified, streamlined, refreshed... couldn't hear everything)
(Back to diversity, a plug for participation in Girls who Code and other such groups, always looking for help)
END OF DEVELOPER DAY.
Day "1"

Starting with introductory remarks from Martin (thanking sponsors, organizers, logistics -- all on the conference website).
Last item, Code of Conduct (https://bioc2018.bioconductor.org/code_of_conduct), interesting, shorter than many.
Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome (@michaelhoffman)

https://www.biorxiv.org/content/early/2018/02/28/168419

Introduction

"Transcription over-simplified": TF binds DNA, recruits PolII, RNA is made (yup, that's all)
ChIP-seq, you might have heard of it...

Problem: ChIP-seq needs 10^6 to 10^8 cells ("Determined using 'Cunningham's Law'")
Solution: computational prediction of transcription factor binding

Old problem, originally entirely sequence based, current methods use open-chromatin and so a lot better

Michael thinks HINT works pretty well


How to move forward

Use experimental data from ChIP-seq in other cell types
Learn association between local TF binding and global cellular state as measured from transcribed RNA


Learning from the transcriptome

ChIP-seq data from some cell types, RNA-seq data from more cell types.

Bin genome and look for bins where ChIP-seq/RNA-seq correlate (I think ChIP-seq is 200bp bins and RNA-seq is gene level)

Correlation matrix is genome-wide, consider cases where p < 0.1
For a given bin, consider all genes with "significant" correlation, then compute the Spearman rank correlation between expression and correlation; this is the expression score (yes, correlations of correlations)
phastCons + chromatin accessibility + expression score + number of cell types with TF binding + motif score into a "very simple neural network" [MH: it's simple in that it is fully connected, nothing fancy in the architecture]

MLP... optimized # layers, size of layers, activation function, ... looks like lots of hyperparameters [MH: only 4 hyperparameters! 3×3×3×4 = 36 different possible values examined in the grid search]
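
A minimal sketch of the expression score step as these notes describe it, in plain Python. The inputs are assumptions: for one genomic bin, `gene_correlation` holds the previously learned ChIP-seq/RNA-seq correlation for each "significant" gene, and `gene_expression` holds the expression of the same genes in the new cell type. All numbers are made up for illustration, and the preprint is the authority on the actual definition.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


# Hypothetical values for the genes associated with one bin.
gene_correlation = [0.9, 0.7, -0.4, 0.2, -0.8]
gene_expression = [8.1, 5.0, 1.2, 3.3, 0.4]

expression_score = spearman(gene_expression, gene_correlation)
print(expression_score)
```

In this reading, a high score means the new cell type expresses the genes in the same rank order as their learned association with binding in that bin.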


Evaluation

(Describing cross-validation scheme, use of precision-recall curves)
ChIP-seq data from other cell types most important, then expression score, then chromatin accessibility


Performance

"A few" TFs where we do very well -- auPR >= 0.5 -- SMC3, CTCF, RAD21... some others, went by too fast [MH: performance plot is in preprint]
For 36 TFs MCC more than 0.3 in validation cell types (Roadmap)
Correctly predicts novel TF binding sites
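
For reference, MCC (Matthews correlation coefficient) is computed from the four cells of a confusion matrix. A minimal sketch in Python; the counts in the example are made up and are not results from the talk:

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical, heavily imbalanced example: few bound bins, many unbound ones.
print(mcc(tp=30, tn=9880, fp=20, fn=70))
```

MCC ranges from -1 to 1 and uses all four counts, which is why it is a common choice when bound sites are a tiny fraction of the genome.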


Trackhub available, didn't see a URL though, probably in the preprint [MH: should add to presentation. https://virchip.hoffmanlab.org/]
Future:

Position dependency -- adding time dimension to network


Questions:

Q: Now that we have 1600 cell lines with RNA-seq, would you be interested in inferring ChIP-seq for those cell lines

M: Yes, I would be interested in that, but open-chromatin data is important, not sure we can do without it


Q: Can this be applied to single cell transcriptome data?

M: Yes, could do something, again need measurement of open chromatin ("having single cell ATAC would be best")


Q: (I think the question is using the model to find most similar cell line)

M: (Not sure)


Q: What two assays would you do on cells from a donor (for a difficult to acquire tissue)

M: transcripts and open chromatin -- but still unclear for single cell


Enter the Matrix: Interpreting omics through matrix factorization  (@FertigLab)

https://www.biorxiv.org/content/early/2018/04/02/196915


First, an appeal to the audience: "Need more tools for visualization across different matrix factorization techniques"


Introduction

Pattern detection is critical in the genomics big data era
(Many types of) omics data can be represented in matrices

Focus here is mainly on transcription


Omics data can be interpreted through matrix factorization (PCA, ICA, NMF, ...)


Data = Amplitude x Pattern

D: molecules by samples
A: molecules by features
P: features by samples
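
The dimensions above can be checked with a toy example. A minimal sketch in plain Python with made-up numbers (3 molecules, 2 features, 4 samples); it only shows how the shapes fit together and is not a factorization algorithm:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(a_row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for a_row in a]


# A: molecules x features (3 x 2), hypothetical amplitudes.
A = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 2.0]]

# P: features x samples (2 x 4), hypothetical patterns.
P = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0]]

# D: molecules x samples (3 x 4).
D = matmul(A, P)
print(len(D), "x", len(D[0]))
```

Each row of A says how strongly a molecule contributes to each feature, and each row of P says how active that feature is in each sample.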


Focus: Smooth sparse NMF, Bioconductor package CoGAPS

$$A_{i,j} \sim \Gamma(\alpha^A_{i,j}, \lambda); \quad \alpha^A_{i,j} \sim \mathrm{Poisson}(\alpha)$$
Gamma yields sparsity constraint
Implementation: finds constrained non-negative sparse matrices using MCMC Gibbs sampler


Application: Biological model of therapeutic resistance

Cancer cell line initially sensitive, generate long term resistance, acquire time series (weekly) for cells acquiring resistance and controls
Generated gene expression data (bulk RNA-seq?)
Initial clustering

Gives information about treatment but not much about resistance


Matrix factorization for time-course analysis

Perform sparse NMF (CoGAPS) and view the Patterns over time

Reveals time dependent patterns in resistance


BUT: how does one make these abstract patterns useful?


Amplitude matrix allows mapping patterns back into gene expression space (or whatever original feature space)
Instead of finding genes most highly associated with each pattern, what are the genes associated with only one of the patterns

"Pattern marker genes"

Group that slowly increases with resistance, another group that slowly decreases, clearly groups treatment and controls


Can do standard GSEA on these marker genes


Relate this back to non cell-line data?

Take weights and project onto another dataset (ProjectR package on GitHub)
Human tumors treated with the same therapy

Found that the resistance patterns were elevated in the patient tumors that were resistant


"Resurgence of matrix factorization for single-cell data"

What's different?

Datasets are orders of magnitude larger
Cell types and timing of individual cells are unknown a priori


Showing UMAP of 10x 100k cell data, 10 different time points in mouse retina

Cell types are hand annotated to get "ground truth" (hrmmmmmm...)
scCoGAPS distinguishes cell types and trajectories

Looks like a rod pattern and a cell type pattern...


Conclusions

Matrix factorization has a long history in genomics
Adding new visualization and new statistics to the outputs of MF can enable robust pattern detection
Applicable to single-cell datasets


Q: 1. about manually classifying 100k cells, 2. (didn't get this one)

Research question: how do you use these factorizations to aid classification
Replicates: not able to replicate the whole time course (not enough $), but a collaborator had previously developed resistance in same cell line, found tremendous heterogeneity, sounds like generalizability still unclear


Q: Scalability, algorithms people should focus on in the face of HCA, 2M cell scale...

My algorithm won't converge on data at that scale, gradient algorithms will converge (but badly). Two approaches:

Parallelize (||) across different sets of genes, cells
Compaction approaches (group related cells, factorize in reduced space)


Q: "Are you aware of groups that have resurrected CUR decomposition... quantization approaches... where you hit the limit"

Haven't seen that. Surprised at the amount of reinventing the wheel. Need to go back to that literature.


(BREAK)
Analysis of high content microscopy data generated through automated yeast genetics (Brenda Andrews)


Introduction

Major challenge, predicting phenotype from genotype using genetic interactions
Using budding yeast because of its reagents for systematic genetics, including:

Yeast deletion collection: 5000 yeast strains each deleted for a single non-essential gene
1000 temperature sensitive alleles of essential genes


Need methods for detecting gene interactions

SGA (Synthetic genetic array): introduce any marked allele into an arrayed set of strains


Main phenotype is growth (colony size)

e.g. tested 23.4 million double mutants identifying 1.1M genetic interactions
generated "hierarchical model of cell function" (Costanzo et al. Science 2016)


~35% of nonessential query gene mutants exhibit weak genetic interaction profiles
Most of the time double mutants do not have a growth phenotype, but may have other phenotypes


"Marker project"

Introduce fluorescent markers for sub-cellular compartments into the arrayed strains
How does mutation of any gene influence sub-cellular compartments?
Developing a general phenotypic profiling pipeline

Make strain collection: Use SGA to introduce three markers: compartment of interest, nucleus, cytoplasm
Image: Opera Phenix automated confocal live cell
Data collection


Single cell images
Single cell morphological features

Cell Profiler, ~300 features, 10-50 PCs
VAE (variational autoencoder) to find latent feature vector


Phenotypic profiling


Detecting mutants -> penetrance

Finding outliers, one-class SVM, distance methods, ... "no one size fits all"


Classifying mutant phenotypes

Neural networks