jxtx/bioc2018.md

## bioc2018.md

      
Conference info: https://bioc2018.bioconductor.org/
My first Bioconductor meeting, and I'm not a BioC or R expert so these notes are probably going to be naïve!
Contents


Developer Day

⚡️talks II
BoFs
⚡️ III
Community Activities


Day One
Day Two

Developer Day

(I arrived at 1:00pm and missed the morning sessions)
⚡️Lightning talks II

Meetups (Aedin Culhane)


Boston first, then NYC
Bioconductor has a community Slack, there is a #meetups channel
Boston: 40 to 100 people turn up

BiocFileCache (Lori Shepherd)

Local file management. Cache files locally to avoid downloading from remote sources if not needed. Also, try to have a better way to organize files.
BiocFileCache(). Backed by a sqlite database.

bfcadd( rname=..., fpath=...) adds an existing file to track in the cache
bfcnew( rname=... ) gives a new path in the cache
bfcneedsupdate() checks whether a dataset has changed remotely and needs to be downloaded again
bfcquery(...) search for datasets in the cache
bfcrpath(...) gives the local path of file by id

Can also attach metadata to datasets.
VariantExperiment (Qian Liu)


Stored variants (genotypes, multiple assays, multiple individuals).
Extends RangedSummarizedExperiment.
Can construct from gds file or from a vcf file.
Subsetting and range slicing.
"Many statistical methods are defined". Example: hwe.

Scalable computing in Bioconductor (Nitesh Turaga)


map/reduce in R: lapply( X, FUN, ...).
BiocParallel: bplapply( ..., bpparam )
bpparam determines the BiocParallel backend to use
e.g. SerialParam or MulticoreParam
New: scaling across clusters, BatchtoolsParam( workers=..., cluster=... )

cluster might be SGE, SLURM, LSF, PBS, et cetera


Example: Salmon pseudoalignment

Instantiate BatchtoolsParam with an institution-specific template
Write a function that processes a single sample
Pass function to bplapply to run in parallel

Progress Bar!
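
(Sketch, not from the talk.) The three steps above are the usual "write a per-sample function, then map it" pattern. A rough illustration using only the Python standard library rather than BiocParallel; `process_sample` and the sample names are made up, and a thread pool stands in for whatever backend bpparam would select:

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Stand-in for per-sample work (e.g. running a quantifier on one sample)."""
    return sample, len(sample)


samples = ["sampleA", "sampleBB", "sampleCCC"]  # hypothetical sample names

# Serial version, analogous to lapply(X, FUN).
serial = [process_sample(s) for s in samples]

# Parallel version: same function and inputs; results come back in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(process_sample, samples))

print(parallel == serial)
```

The point is that only the "how to run it" part changes; the per-sample function stays the same.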


Benefits: easier cluster management, ...
https://github.com/nturaga/BatchtoolsParam_examples

Workshops (Levi Waldron)

How the materials for the conference workshops (which will run Thursday/Friday) were built.

https://github.com/Bioconductor/BiocWorkshops
What's different about (BioC conference) workshops this year?
Collection of Rmd files collated with bookdown.

130 package dependencies (ouch!)
Coordinated with GitHub issues+tags


Produces a single gh-pages site with all of the workshops (https://bioconductor.github.io/BiocWorkshops)
Building an AMI for all workshops

Using packer.io to build AMI with everything needed for every workshop


Organism.dplyr (Daniel van Twisk)


Alternative interface to the org.* packages, similar purpose to OrganismDbi
Any organism with both an org and a txdb package can be used
src_organism( "org..." ) provides the interface. Compatible with all methods from dplyr
11 genomic coordinate extractor methods available, e.g. transcripts gets a GRanges, transcripts_tbl gets a tibble
Examples of a variety of complex filters (too small to read!)

Birds of a Feather

Parallel tracks, Peter Hickey on Effectively using DelayedArray and Levi on New Data Structures for Bioconductor. (Sorry @PeteHaitch but I can only choose one).
New Data Structures for Bioconductor (Levi Waldron)

First, a presentation "Why re-use core classes: A plea to developers of Bioconductor packages" (Levi).

What is Bioconductor? 1,400 packages on a backbone of data structures.

e.g.: GenomicRanges, SummarizedExperiment


Why do core classes matter?

Suppose you want to build a rocket powered bike. You could start from raw steel and forge your own frame.

But your frame has limited testing, and probably doesn't handle many use cases


It is easy to define a new S4 class in R

But you shouldn't: it is very difficult to build a robust and flexible class for genomic data analysis


Example from phylogenetics / microbiome packages: not using common classes ⟶ limits interoperability


What are the core classes?

SummarizedExperiment, GenomicRanges, Biostrings, GSEABase, MultiAssayExperiment, SingleCellExperiment, MSnbase...
https://bioconductor.org/developers/how-to/commonMethodsAndClasses/


Core classes represent years of work and maintenance and have been used by tens of thousands of users

Discussion


Q: What would you do instead of defining your own class (in the case of phyloseq)?


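# As shown on the slide. Assumes library(SummarizedExperiment) is loaded and
# that a "MicrobiomeFeatures" class has already been defined.
.MicrobiomeExperiment <- setClass("MicrobiomeExperiment",
    contains = "SummarizedExperiment",
    representation(rowData = "MicrobiomeFeatures")
)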


Gives you the benefits of SummarizedExperiment, compatible with MultiAssayExperiment


Exploring S4 classes

extends tells you what superclasses a given class extends

e.g. RangedSummarizedExperiment is a SummarizedExperiment, Vector, Annotated


showClass adds known subclasses and what slots it contains
methods tells you what methods are defined for a class

e.g. 100+ methods on SummarizedExperiment (but 54 of those come from parent classes)


Example that has "done things right": SingleCellExperiment

extends RangedSummarizedExperiment and defines additional methods...


PharmacoGx Updates (Petr Smirnov)

https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html


WIP: Fixing up to work better with Bioconductor objects


Drug sensitivity data: "treat a cell line with a drug and see how well it kills it"


Structure

molecularProfiles: List of ExpressionSet objects
sensitivity: List of a couple of data frames and an array

Initially Experiment IDs with dose/viability pairs

But drug combinations and other dimensions are not naturally a matrix


Solution: LongArray object

col.ids and row.ids: data.frames

Example (combination_name, drugA, drugB)


Get data as if from a list

"long array but behaves when you use a single bracket as if a matrix"


Slicing example

dcTest is a longArray object with "rows" across 2 variables and "columns" across 1 variable
slicing dcTest[c("5-FU", "Bortezomib", "Erlotinib"), "A2058"]
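
(Sketch, not the PharmacoGx implementation.) To make "long array that behaves like a matrix under single-bracket indexing" concrete for myself: a toy long-format table that answers `[rows, cols]` lookups. The class name `LongTable`, the viability numbers, and the `HT29` column are invented for illustration; only the drug names and `A2058` come from the slide.

```python
class LongTable:
    """Long-format records that answer matrix-style [rows, cols] lookups."""

    def __init__(self, records):
        # records: iterable of (row_id, col_id, value) triples
        self._data = {(r, c): v for r, c, v in records}

    def __getitem__(self, key):
        rows, cols = key
        rows = rows if isinstance(rows, list) else [rows]
        cols = cols if isinstance(cols, list) else [cols]
        # Combinations that were never measured come back as None.
        return [[self._data.get((r, c)) for c in cols] for r in rows]


records = [
    ("5-FU", "A2058", 0.42),        # made-up viability values
    ("Bortezomib", "A2058", 0.10),
    ("Erlotinib", "HT29", 0.77),
]
table = LongTable(records)
block = table[["5-FU", "Bortezomib", "Erlotinib"], "A2058"]
print(block)
```

Storage stays long (one record per measurement), so sparse or irregular designs cost nothing, while the caller still gets a rectangular block back.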


Q: Why is this different from a SummarizedExperiment?

Multiple experiments, e.g. drug combinations of 2, 3, ... n drugs
Followup Q: what about MultiAssayExperiment?

Would lose quick subsetting through multiple dimensions (?)


TxRegInfra (Vince Carey)

https://bioconductor.org/packages/release/bioc/html/TxRegInfra.html

Investigator's idea: eQTL from GTEx, DHS,... from ENCODE, TFBS from FIMO... use this to interpret GWAS hits
Developers: do as little as possible to resolve and keep metadata
Existing resources that could help:

rtracklayer+tabix, GenomicRanges, RaggedExperiment, ...
mongolite


RaggedExperiment

(completely missed this and the slides blanked out)
But it seems to be an experiment where observations have different features, some shared, some not
I need to read this later: https://bioconductor.org/packages/release/bioc/manuals/RaggedExperiment/man/RaggedExperiment.pdf


Example

Data

collection of eQTL from GTEx
encode footprint (not sure what this actually is)
encode DHS hotspots


Documents in a mongodb database (RaggedMongoExperiment)

Every document has a genomic range, so can respond to range queries


Summary

Basic layout: genomic coordinates x sample/tissue type x assay type
MultiAssayExperiment: could work but not an immediate fit


"There is a competitor called Giggle"

https://github.com/ryanlayer/giggle


⚡️Lightning talks III

Bioconductor tricks for dealing with genome annotation (Michael Steinbaugh)

http://steinbaugh.com/basejump/

Recommended packages

GenomicRanges, rtracklayer (GTF -> GRanges), AnnotationHub, ensembldb, GenomicFeatures


basejump extends these tools
Rich metadata columns (GRanges), mcols(...)

iSEE (Charlotte Soneson)

https://f1000research.com/articles/7-741/v1

Interactively explore any data in a SummarizedExperiment object (or subclass)
Multiple panels with different visualizations, can see how they are linked

Managing project metadata with a standard project format (Nathan Sheffield)

https://pepkit.github.io; https://databio.org

Motivation: Most pipelines require individual metadata organization
PEP: a standard format for project metadata -- "Portable Encapsulated Project"
Ecosystem of tools:

format itself: project_config.yaml, samples.csv
peppy: Python package
pepr: R package
geofetch, looper -- map samples onto pipelines and run in different compute environments


Finding Bioconductor Packages (Shian Su)

github.com/shians/biocexplorer

Bioconductor packages are not that easy to find

Prioritization: can sort by title (alpha), author (alpha), not really that useful


Alternative: BiocExplorer:

Prioritize packages based on usage
Provides a graph of packages, prioritizing those that are widely used (not sure what


Recent cloud-scale innovations in Bioconductor (Vince Carey)

So fast... so  small...


Summary first

DelayedArray: seamless element level access to out-of-memory / remote array-like resources
SummarizedExperiment/MAE: a sort of query language for annotated omics resources
Current efforts: improve efficiency of statistical learning using Delayed* resources


DelayedArray backends: HDF5 server, BigTable, ...


HDF cloud / HDF Kita (Example using 10x)


BigTable (Example using OncoTk)


The point (I think): You can work with all of these types of remote data in the current version of Bioconductor


Sesame: a sensible way to analyze a DNAme array (Tim Triche)

https://www.bioconductor.org/packages/devel/bioc/html/sesame.html

Improves masking on hyper-polymorphic region (e.g. MHC)

Community Activities

Brainstorm and prioritize some products that can be produced in ~45 minutes (and then do the thing).
(Martin using slido.com to accumulate suggestions from the attendees)
Voting, winners are:

Come up with a data structure for PharmacoGx data -- 9
Strategies for posting and answering support site questions -- 8
Checks on SummarizedExperiment rownames, rowData()<- -- 7
name clashes between BiocGenerics, S4Vectors etc. and tidyverse -- 6
Pull requests to fix usage and other warnings in core packages, e.g., Rsamtools -- 6
Initiate collaborative development of ... (like iSEE at BiocEurope) -- 6

We'll see what happens...
Summaries of community activities

Support site


Main idea: Template for the "ask a question" box: provide some guidance for how to ask a good question

Reproducible example
Things you tried
sessionInfo


Other ideas (Google doc link)

Collaborative project


"Biocverse": visualizing the Bioconductor ecosystem
Use cases

New users: given a task, show me all packages, ranked by "importance"
Experienced users
New developers


SummarizedExperiment rownames

Fixing a problem in assigning rownames. I think.
PharmacoGx

(more) Discussion of how to store drug sensitivity data.
Store as a database. Write constructor functions that create matrix / SummarizedExperiment from the database.
There's some code in github in a project called longArray but I can't read the user/org name.
Panel discussion: project directions and opportunities

Q&A for the project leadership team.
Q: "Can you think of ways to expand the network of project leaders, expand ownership, expand people who feel they are part of the project"

Martin: project has life outside the core. e.g. Single Cell developments largely outside the core.
Vince: (turns question back) Is there a lack of recognition or barriers to participation?
Aedin: Surprising how many R people don't know Bioconductor

Q: Mechanism to kick packages out?

M: There are obstacles that discourage people from participating, but the tradeoff is worth it. Quality of packages, having vignettes. Contrast in how tidyverse works with how Bioconductor works, different views on how software should work. Interesting to think about compromises in making those play along.
V: Advantages of putting your package in Bioconductor: development vs release branch (not available in CRAN), vignettes and examples. Not trying to sell Bioconductor methods to people not trying to do them.

Q: Synchronization between Bioconductor and CRAN

M: We communicate regularly. At a technical level there is communication. At a social level, much more restricted. CRAN has task views, but no overlap with Bioconductor

Q: How do you recommend new users learn the Bioconductor ecosystem

Wolfgang: need a new book. Something like a textbook. Challenge defining what it means to use/learn Bioconductor
A: A beginners user guide to Bioconductor should start with SummarizedExperiment, basics... The thing I direct people to is the f1000 channel
If you want to learn Bioconductor there is a reason for that. You have some data.
V: Talking about content from the edX MOOC
Kasper: introduction should focus on the things that EVERY user of Bioconductor should know. Should streamline and cleanup what is presented.

Q: Funding mechanisms the Bioconductor community should be applying for


M: Historically get a lot of money into one big shop, easier than scrambling for smaller grants from a variety of places. More recently, have started to diversify. e.g. Human cell atlas grants written by a diverse group of people. More junior faculty wanting to expand participation but need funding to do so. But wants to call on James Taylor...


ACK IT ME: What works for Galaxy 1) Core group goes after diverse funding opportunities, one big pot but also lots of other pots 2) Full time funded community outreach. Multiple people across the project dedicated to this. Really makes a huge difference to have someone spending all their time on this so it is never lost/back-burnered [writing what I think not what I say right now]


Q: Increasing participation, diversity, gender balance

M: (I didn't capture this well so nothing here)

Q: Question about connections with Africa (H3A?) training and outreach

A: There are some efforts/connections
V: There is a foundation for Bioconductor devoted to charitable works, has some money, could potentially be used to expand training

(discussion about website here, needs to be simplified, streamlined, refreshed... couldn't hear everything)
(Back to diversity, a plug for participation in Girls who Code and other such groups, always looking for help)
END OF DEVELOPER DAY.
Day "1"

Starting with introductory remarks from Martin (thanking sponsors, organizers, logistics -- all on the conference website).
Last item, Code of Conduct (https://bioc2018.bioconductor.org/code_of_conduct), interesting, shorter than many.
Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome (@michaelhoffman)

https://www.biorxiv.org/content/early/2018/02/28/168419

Introduction

"Transcription over-simplified": TF binds DNA, recruits PolII, RNA is made (yup, that's all)
ChIP-seq, you might have heard of it...

Problem: ChIP-seq needs 10^6 to 10^8 cells ("Determined using 'Cunningham's Law'")
Solution: computational prediction of transcription factor binding

Old problem, originally entirely sequence based, current methods use open-chromatin and so a lot better

Michael thinks HINT works pretty well


How to move forward

Use experimental data from ChIP-seq in other cell types
Learn association between local TF binding and global cellular state as measured from transcribed RNA


Learning from the transcriptome

ChIP-seq data from some cell types, RNA-seq data from more cell types.

Bin genome and look for bins where ChIP-seq/RNA-seq correlate (I think ChIP-seq is 200bp bins and RNA-seq is gene level)

Correlation matrix is genome-wide, consider cases where p < 0.1
For a given bin, consider all genes with "significant" correlation, compute Spearman rank correlation between expression and correlation == expression score (yes, correlations of correlations)
phastCons + chromatin accessibility + expression score + number of cell types with TF binding + motif score into a "very simple neural network" [MH: it's simple in that it is fully connected, nothing fancy in the architecture]

MLP... optimized # layers, size of layers, activation function, ... looks like lots of hyperparameters [MH: only 4 hyperparameters! 3×3×3×4 = 36 different possible values examined in the grid search]
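
(Sketch of my reading of the expression score, not the authors' code.) For one bin, take the genes whose expression correlated significantly with binding in the training cell types, then rank-correlate the new cell type's expression of those genes against the stored correlations. All numbers below are invented; `ranks`, `pearson` and `spearman` are my own helper names.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values for four genes linked to one genomic bin.
expression_new_cell = [5.0, 1.0, 3.0, 0.5]       # expression in the new cell type
binding_correlation = [0.9, -0.2, 0.4, -0.6]     # learned expression/binding correlations

expression_score = spearman(expression_new_cell, binding_correlation)
print(expression_score)
```

A score near 1 means the new cell type expresses the positively associated genes highly and the negatively associated ones weakly, which is the situation where binding in that bin would be expected.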


Evaluation

(Describing cross-validation scheme, use of precision-recall curves)
ChIP-seq data from other cell types most important, then expression score, then chromatin accessibility


Performance

"A few" TFs where we do very well -- auPR >= 0.5 -- SMC3, CTCF, RAD21... some others, went by too fast [MH: performance plot is in preprint]
For 36 TFs MCC more than 0.3 in validation cell types (Roadmap)
Correctly predicts novel TF binding sites
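
(Reminder for myself, not from the slides.) MCC is the Matthews correlation coefficient, computed from the confusion matrix; it is a reasonable choice when bound bins are rare compared to unbound ones. The counts below are invented.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical genome bins: 50 truly bound, 950 unbound.
score = mcc(tp=30, fp=20, tn=930, fn=20)
print(round(score, 3))
```

With these made-up counts the accuracy would be 96%, but the MCC is much lower, which is why a threshold like 0.3 is more informative than accuracy on imbalanced data.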


Trackhub available, didn't see a URL though, probably in the preprint [MH: should add to presentation. https://virchip.hoffmanlab.org/]
Future:

Position dependency -- adding time dimension to network


Questions:

Q: Now that we have 1600 cell lines with RNA-seq, would you be interested in inferring ChIP-seq for those cell lines

M: Yes, I would be interested in that, but open-chromatin data is important, not sure can do without


Q: Can this be applied to single cell transcriptome data?

M: Yes, could do something, again need measurement of open chromatin ("having single cell ATAC would be best")


Q: (I think the question is using the model to find most similar cell line)

M: (Not sure)


Q: What two assays would you do on cells from a donor (for a difficult to acquire tissue)

M: transcripts and open chromatin -- but still unclear for single cell


Enter the Matrix: Interpreting omics through matrix factorization  (@FertigLab)

https://www.biorxiv.org/content/early/2018/04/02/196915


First, an appeal to the audience: "Need more tools for visualization across different matrix factorization techniques"


Introduction

Pattern detection is critical in the genomics big data era
(Many types of) omics data can be represented in matrices

Focus here is mainly on transcription


Omics data can be interpreted through matrix factorization (PCA, ICA, NMF, ...)


Data = Amplitude x Pattern

D: molecules by samples
A: molecules by features
P: features by samples
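
(Sketch to keep the shapes straight; toy numbers, plain Python instead of a matrix library.) With 3 molecules, 2 features and 4 samples, A is 3×2, P is 2×4, and their product D is 3×4:

```python
def matmul(a, p):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    return [
        [sum(a[i][k] * p[k][j] for k in range(len(p))) for j in range(len(p[0]))]
        for i in range(len(a))
    ]


# Amplitude: molecules x features (how strongly each gene uses each pattern).
A = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 2.0]]

# Pattern: features x samples (activity of each pattern in each sample).
P = [[1.0, 2.0, 3.0, 4.0],
     [4.0, 3.0, 2.0, 1.0]]

D = matmul(A, P)  # molecules x samples
for row in D:
    print(row)
```

Factorization methods go the other way: given D, find non-negative (and here sparse) A and P whose product approximates it.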


Focus: Smooth sparse NMF, Bioconductor package CoGAPS

$$A_{i,j} \sim \Gamma(\alpha^A_{i,j}, \lambda); \quad \alpha^A_{i,j} \sim \mathrm{Poisson}(\alpha)$$
Gamma yields sparsity constraint
Implementation: finds constrained non-negative sparse matrices using MCMC Gibbs sampler


Application: Biological model of therapeutic resistance

Cancer cell line initially sensitive, generate long term resistance, acquire time series (weekly) for cells acquiring resistance and controls
Generated gene expression data (bulk RNA-seq?)
Initial clustering

Gives information about treatment but not much about resistance


Matrix factorization for time-course analysis

Perform sparse NMF (CoGAPS) and view the Patterns over time

Reveals time dependent patterns in resistance


BUT: how does one make these abstract patterns useful?


Amplitude matrix allows mapping patterns back into gene expression space (or whatever original feature space)
Instead of finding genes most highly associated with each pattern, what are the genes associated with only one of the patterns

"Pattern marker genes"

Group that slowly increases with resistance, another group that slowly decreases, clearly groups treatment and controls


Can do standard GSEA on these marker genes


Relate this back to non cell-line data?

Take weights and project onto another dataset (ProjectR package on GitHub)
Human tumors treated with the same therapy

Found that the resistance patterns were elevated in the patient tumors that were resistant


"Resurgence of matrix factorization for single-cell data"

What's different?

Datasets are orders of magnitude larger
Cell types and timing of individual cells are unknown a priori


Showing UMAP of 10x 100k cell data, 10 different time points in mouse retina

Cell types are hand annotated to get "ground truth" (hrmmmmmm...)
scCoGAPS distinguishes cell types and trajectories

Looks like a rod pattern and a cell type pattern...


Conclusions

Matrix factorization has a long history in genomics
Adding new visualization and new statistics to the outputs of MF can enable robust pattern detection
Applicable to single-cell datasets


Q: 1. about manually classifying 100k cells, 2. (didn't get this one)

Research question: how do you use these factorizations to aid classification
Replicates: not able to replicate the whole time course (not enough $), but a collaborator had previously developed resistance in same cell line, found tremendous heterogeneity, sounds like generalizability still unclear


Q: Scalability, algorithms people should focus on in the face of HCA, 2M cell scale...

My algorithm won't converge on data at that scale, gradient algorithms will converge (but badly). Two approaches

|| across different sets of genes, cells
compaction approaches (group related cells, factorize in reduced space)


Q: "Are you aware of groups that have resurrected CUR decomposition... quantization approaches... where you hit the limit"

Haven't seen that. Surprised at the amount of reinventing the wheel. Need to go back to that literature.


(BREAK)
Analysis of high content microscopy data generated through automated yeast genetics (Brenda Andrews)


Introduction

Major challenge, predicting phenotype from genotype using genetic interactions
Using budding yeast because of its reagents for systematic genetics, including

Yeast deletion collection: 5000 yeast strains each deleted for a single non-essential gene
1000 temperature sensitive alleles of essential genes


Need methods for detecting gene interactions

SGA (Synthetic genetic array): introduce any marked allele into an arrayed set of strains


Main phenotype is growth (colony size)

e.g. tested 23.4 million double mutants identifying 1.1M genetic interactions
generated "hierarchical model of cell function" (Costanzo et al. Science 2016)


~35% of nonessential query gene mutants exhibit weak genetic interaction profiles
Most of the time double mutants do not have a growth phenotype, but may have other phenotypes


"Marker project"

Introduce fluorescent markers for sub-cellular compartments into the arrayed strains
How does mutation of any gene influence sub-cellular compartments?
Developing a general phenotypic profiling pipeline

Make strain collection: Use SGA to introduce three markers: compartment of interest, nucleus, cytoplasm
Image: Opera Phenix automated confocal live cell
Data collection


Single cell images
Single cell morphological features

Cell Profiler, ~300 features, 10-50 PCs
VAE (variational autoencoder) to find latent feature vector


Phenotypic profiling


Detecting mutants -> penetrance

Finding outliers, one-class SVM, distance methods, ... "no one size fits all"


Classifying mutant phenotypes

"Neural networks"


Application: Endocytosis

Four markers: actin patch, clathrin coat, late endosome, vacuole
Phenotype assignment and classification: two hidden layer MLP, "probabilistic" output layer
21 phenotypes: 4 WT, 17 mutants, ~88.5% accuracy
Marker penetrance: ~1230 genes with sig penetrance for at least one marker, ~50% of mutations affect more than one marker
What are the mechanisms leading to incomplete penetrance / cell heterogeneity?
(I missed a bit here but obviously lots of cool stuff you can do with systematic genetics at this scale!)


Q: (Can't hear, not using the microphone)


Q: Relationship between imaging features and growth phenotype

A: Typically get a morphological phenotype when there is a growth defect


scClustViz - Single-cell RNAseq Cluster Assessment and Interactive Visualisation (Brendan Innes)


Motivation

Emergence of adult neural stem cells using scRNA-seq

First question is a hard question, how many clusters?
Answer: keep clustering until there are no longer significant differences between clusters

FDR corrected Wilcoxon rank-sum test
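
(Sketch, assumptions flagged.) The notes only say "FDR corrected"; I am assuming Benjamini-Hochberg, the usual default, and the p-values below are invented stand-ins for per-gene Wilcoxon rank-sum results between two neighbouring clusters. The rank-sum test itself is not shown.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


gene_pvalues = [0.01, 0.04, 0.03, 0.5]  # hypothetical per-gene p-values
adjusted = benjamini_hochberg(gene_pvalues)
n_significant = sum(q < 0.05 for q in adjusted)
print(adjusted, n_significant)
```

Under the stopping rule described above, two clusters with `n_significant == 0` would be the signal that the clustering has gone one step too far.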


From this, built an R/Shiny app for visualizing clustering

Plots for evaluating number of genes

Number of differentially expressed genes (Tukey plot), next to silhouette
"Classic" tSNE and UMAP
Cell cycle -- "cycling" tool? (missed this, look up)

Answer from @PeteHaitch! Thanks: it was the cyclone() function in the scran Bioconductor package. Some detail in 'Classification of cell cycle phase' section of https://f1000research.com/articles/5-2122/v2 and based on https://www.ncbi.nlm.nih.gov/pubmed/26142758


Gene expression distribution (both detection rate and normalized expression in the cells where detected)

JT: I like Scanpy's dotplot for this, more compact visualization for selected markers


Manually select cells for DE testing


Q

A: right now handles Seurat objects. Can load in whatever single cell object you want


Differential Enriched Scan 2 (DEScan2): a fast pipeline for broad peak analysis (Dario Righelli)


Example: ATAC-seq, mouse (brain?) before/after ECT


DEScan2

Call and filter peaks + counts ⟶ SummarizedExperiment peak (range) x count

Peak calling based on Poisson likelihood without overdispersion
Filter score threshold, number of samples


Differential expression with edgeR, DESeq2
Integration: few samples, annotation, many samples, mixOmics


Future: comparing with other packages, testing on ChIP-seq, visualization, test on ATAC single cell.


Q: If we wanted to look at ATAC-seq, is it in Bioconductor experimental datasets?

A: No


Improving the accuracy of taxonomic classification for identifying taxa in microbiome samples (Eric Wright)


First, updates on DECIPHER (https://bioconductor.org/packages/release/bioc/html/DECIPHER.html)
Classification for Microbiome sequences

Two types of pipelines, 16s (or other marker genes) or WGS

Unsupervised: create OTUs and Phylogenetic trees, ...
Supervised: classify into known taxonomy (idTaxa in DECIPHER)


Problem for taxonomic classification: reference databases are incomplete

Evaluating accuracy on a mock community

When there are missing organisms in database BLAST has very high false positive rate. idTaxa does not (0.01% false positive rate?)
(Either I spaced out or there were no details on how it actually works)


DEsingle for detecting three types of differential expression in single-cell RNA-seq data (Zhun Miao)

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty332/4983067

Detect three types of DE between two groups in a count matrix
Using ZINB model, estimates proportion real vs drop-out zeros
DE types

DE status -- parameters \theta
DE abundance -- argh... slide is gone
"General DE"


Showing heatmaps... not using the microphone so I'm missing a lot here
Presenting tutorial of actually using DEsingle

Towards a gold standard for benchmarking gene set enrichment analysis (Ludwig Geistlinger)


Goal: benchmarking GSEA

GSA theory null hypotheses: 1) competitive (vs other set), 2) self-contained (no genes in the set of interest are DE)


Generations: 1) overrepresentation 2) functional classification 3) integration of network topology
Bioconductor package: EnrichmentBrowser
Enrichment analysis in practice: GO/KEGG overenrichment, dozens of methods claim improvements, how to evaluate?
Solution: GSEABenchmarkeR, standardized benchmarking of GSEA methods

Comprehensive real data compendium
Systematic and reproducible assessment


Benchmark Panel

Array Panel: Tarca12/13 19 diseases
RNA-seq: 33 datasets from TCGA


Assessment

Statistical sig

For each method how many genes found to be significant

Competitive methods 10-20%, Self-contained much higher, like 80%?


Phenotype relevance

Zooooooom....


(BREAK)
Workshops

Workshop content is online

e.g. Bioconductor for everyone

Day Two

Binding motifs for DNA and RNA binding proteins (Tim Hughes)


Introduction

What are TF/RBP binding motifs; Sequence logos; using a PFM to estimate binding affinity

Lots of ways to do this; log scale mis-represents what protein actually does, prefers to stay in linear domain
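
(Sketch of one common way to do this, not necessarily what was on the slide.) Scoring a site in the linear domain multiplies per-position frequencies; the log-odds version sums log ratios against a background. The matrix is made up and the function names are mine.

```python
import math

# Hypothetical 3-position frequency matrix (each position sums to 1).
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def linear_score(site):
    """Product of per-position frequencies (stays in the linear domain)."""
    score = 1.0
    for position, base in zip(PFM, site):
        score *= position[base]
    return score


def log_odds_score(site, background=0.25):
    """Sum of log2(frequency / background), the usual PWM score."""
    return sum(math.log2(position[base] / background)
               for position, base in zip(PFM, site))


best = linear_score("AGC")
worse = linear_score("CGC")
print(best, worse, best / worse, log_odds_score("AGC"))
```

In the linear domain the two sites differ by a ratio that can be read as a relative preference, whereas the log score compresses that into an additive difference.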


Where do motifs come from? 1) SELEX and variants, 2) PBMs, RNAcompete, and variants, 3) ChIP/CLIP-seq and variants

PBMs easier to deal with, more realistic
ChIP/CLIP "for the aficionados this is actually the worst way to obtain the motifs". Influence of chromatin, combinatorial relationships, ...
All data types amenable to k-mer counting, and learning PWMs in counting or discrimination mode

"Most people are using MEME approaches which are not trying to be quantitative, something to keep in mind"


Sources of motifs

TRANSFAC, HOCOMOCO, JASPAR, UniPROBE

In Bioconductor a panel of TF motif collections are collated as MotifDb

PLEASE cite original data for individual findings, not JUST aggregation


CIS-BP (http://cisbp.ccbr.utoronto.ca/)
Determining binding preferences of all human TFs first requires... a list of all human TFs!

Collaboration with Jussi Taipale and Matt Weirauch to manually reassess and collate human TFs

Pooling existing sources gives 2700 potential TFs; ~14% of human genes "way above anyone's reasonable estimate"
Review by two independent expert judges
PIs resolved disagreements (less than 100, mostly "a specific assay that somebody doesn't believe")
Final list of 1639 (actually 1638) TFs. 1105 with Motif, 104 with homolog motif

http://humantfs.ccbr.utoronto.ca/


Take home messages

New list has hundreds of TFs not in the last TF survey paper (2009) [+347, -49]
Most human TFs (74%) have at least one known motif and most that don't are C2H2 ZFs


Tim has clones for the remaining 426, "if you have an assay you want to run we will give you the clone"
C2H2 are not only largest and least well characterized, also most diverse, ...

Het ChIP-seq motifs for 131 C2H2-ZFs

Almost all have a different motif -- diversification
Many bind endogenous retro elements


So, duplicate easily, carry a KRAB domain around, diversify quickly.. largest within class diversity of any family by far


Cancer Biomarker Discovery: Building a Bridge Between Preclinical and Clinical Research (Benjamin Haibe-Kains)


Acknowledgements first, Yay!
Introduction

Genotype matched clinical trials

Only 18% to 39% of patients can be matched to a trial based on targeted sequencing


Biological materials for biomarker discovery

In situ: patient tumors
In vivo: patient derived xenografts, engineered mouse models
In vitro: immortalized cell lines, 2d cell lines from biopsy, organoids


Why cell lines?

No ethical issues, cheap high throughput, approved for drugs


Biomarker discovery using cell lines

Comprehensive molecular profiling
Drug sensitivity screening


Data sharing

Long history. NCI60/DTP........... CCLE and GDSC (2012)........ FIMM


Pharmacogenomics datasets require curation

Genesis of PharmacoGx
Cellosaurus to uniquely identify cell lines and tissues
Drugs annotated with PubChemID, InChiKey, SMILES


PharmacoGx

1691 cell lines, 41 tissues, 759 compounds, 650k drug response experiments, 200 million+ gene/drug associations
(Some block diagrams of the database model here but way too small to read anything)


PharmacoDB (http://pharmacodb.ca)

Web interface to mine data, cell line, tissue, compound, gene, ...

Response curves across multiple datasets

Lot of heterogeneity; even within one dataset!


"Real power is being able to combine across datasets"

Cell line overlaps, compound overlaps between datasets (showing UpSet plots)
Concordance of drug sensitivity

Correlation performs poorly
Disconnect between statistical analysis and how scientists look at the data


Defining modified concordance index greatly increases concordance between GDSC and CCLE

(I didn't really follow the details of the mCI unfortunately)


Preclinical <--> Clinical

Drug matching, integration with cBioPortal


Conclusions

Selecting right therapy for patient is crucial, need biomarkers predictive of drug response
cBioPortal/PharmacoDB can discover robust biomarkers and share them with clinicians
Challenge: how do we properly validate the pre-clinical biomarkers for clinical use?

Organoids? xenografts? clinical trials?


On reproducibility: GEO, GitHub, Bioconductor, Docker + Code Ocean

plyranges: a fluent interface to Bioconductor’s Ranges infrastructure (Stuart Lee @_StuartLee)

https://sa-lee.github.io/plyranges
"If our technology behaves" (RevealJS...)

Introduction

Bioconductor infra is powerful: GenomicRanges, IRanges

Enables developing new tools


However, R world has changed (tidyverse). A lot is expected of a new R user to analyze data with Bioconductor

More complexity, S4 system


How far can user get with just GRanges?

GRanges are (already) tidy!

Each range (row) is an observation, metadata columns are variables


plyranges: design a grammar

endomorphism (ops always give back a GRanges)
cohesion
consistency (least surprise)
expressiveness (functions describe what they do)


Include genomic semantics (bioRxiv 327841)

arithmetic (shift, resize, flank, coverage)
restrict (by metadata or by range queries)
aggregation (summarize over ranges)
merges


Examples

Changing coordinates; exons %>% mutate( width=2*width ) doubles the width, by default starting at the START

Anchoring: exons %>% anchor_center() %>% mutate( width=2*width )

[This seems really odd to me, why is anchoring not an argument to mutate? This needs to carry some state along magically]
[Stuart just explained it to me, anchoring is like group_by, it gives back an "AnchoredGRanges" which behaves like / wraps the original but also provides the anchoring info. Okay, makes sense, not that magical]
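To convince myself of what anchoring changes, here is a small language-agnostic sketch in Python (not plyranges code). `set_width` is a hypothetical helper working on 1-based closed intervals, and the rounding used for the centre anchor is my assumption.

```python
def set_width(start, end, new_width, anchor="start"):
    """Resize a 1-based closed interval [start, end] to new_width.

    anchor decides which part of the interval stays fixed:
    "start", "end" or "center".
    """
    if new_width < 1:
        raise ValueError("new_width must be at least 1")
    if anchor == "start":
        return start, start + new_width - 1
    if anchor == "end":
        return end - new_width + 1, end
    if anchor == "center":
        width = end - start + 1
        # Shift the start by half the change in width (floor division).
        new_start = start + (width - new_width) // 2
        return new_start, new_start + new_width - 1
    raise ValueError(f"unknown anchor: {anchor!r}")

exon = (101, 110)  # made-up exon of width 10
doubled_from_start = set_width(*exon, new_width=20)
doubled_from_center = set_width(*exon, new_width=20, anchor="center")
doubled_from_end = set_width(*exon, new_width=20, anchor="end")
```

The same "double the width" operation gives three different ranges depending on the anchor, which is why the anchor has to travel with the object before `mutate` is applied.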


Merging: treating genomic intersections like table joins

join_overlap_inner(a, b)
join_overlap_inner_within(a, b)
join_overlap_inner_directed(a, b)
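The "overlaps as table joins" idea can be sketched without any genomics library. This is an illustration in Python, not the plyranges implementation: `overlap_inner_join` is a hypothetical name, ranges are plain dicts, the quadratic scan stands in for whatever index a real implementation uses, and keeping the coordinates of `a` in the output is my assumption.

```python
def overlap_inner_join(a, b):
    """Inner join of two lists of ranges on interval overlap.

    Each range is a dict with 1-based closed 'start' and 'end' plus any
    metadata keys. One output row is produced per overlapping (a, b) pair,
    carrying the coordinates of the range from `a` and the metadata of both.
    """
    rows = []
    for ra in a:
        for rb in b:
            if max(ra["start"], rb["start"]) <= min(ra["end"], rb["end"]):
                b_meta = {k: v for k, v in rb.items() if k not in ("start", "end")}
                rows.append({**b_meta, **ra})
    return rows

# Made-up genes and peaks on a single chromosome.
genes = [
    {"start": 1, "end": 10, "gene": "g1"},
    {"start": 20, "end": 30, "gene": "g2"},
]
peaks = [
    {"start": 5, "end": 25, "peak": "p1"},
    {"start": 40, "end": 50, "peak": "p2"},
]
joined = overlap_inner_join(genes, peaks)
```

As with an inner join on a key column, ranges without a partner (peak `p2` here) drop out, and a range with several partners appears once per partner.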


Automatic metadata propagation for RNA-seq (Mike Love @mikelove)

bit.ly/tximeta

Motivating example, data re-use

From a paper: "we performed RNA-seq quantification with Salmon using hg38 and the RefSeq annotation"

Provided raw and summary data: we have a count matrix, know RefSeq and hg38


Reusing this data

read_tsv( "counts" ) ⟶ gene symbols
genes(TxDB...)
mapIds ⟶ lose some genes right away (3770)
Just work with the intersection...
Make a SummarizedExperiment
Import ChIP-seq peaks... don't know what genome they are from...
😡


Thus: reuse is difficult, time consuming


coldata = data.frame( files, names, condition, batch )
se = tximeta(coldata)

uses BiocFileCache
matches transcriptome automatically, adds transcript range data and metadata


Whether or not the original analyst kept track of it! [Awesome]


What's it actually doing?

Salmon hashes the contents of the transcriptome. Salmon index contains a hash of the transcriptome
Salmon quant outputs include this hash
tximeta will hash "all the transcriptomes"
If you post entire quantification directory
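The lookup idea is easy to sketch. This is not how Salmon or tximeta are implemented: the choice of SHA-256, the newline separator, and the `known` table with its fields are all assumptions made for illustration, and the sequences are toy data.

```python
import hashlib

def transcriptome_digest(sequences):
    """Digest of an ordered list of transcript sequences (SHA-256 assumed)."""
    h = hashlib.sha256()
    for seq in sequences:
        h.update(seq.encode("ascii"))
        h.update(b"\n")  # separator so ["AC", "GT"] differs from ["ACGT"]
    return h.hexdigest()

def identify(digest, known):
    """Return the metadata registered for a digest, or None if unknown."""
    return known.get(digest)

# Toy "reference transcriptome" and a hypothetical table of known digests.
reference = ["ACGTACGT", "GGGTTTCC"]
known = {
    transcriptome_digest(reference): {"source": "toy-annotation", "release": "1"},
}

# The digest stored with the quantification output is enough to recover the
# provenance, even if nobody wrote down which transcriptome was used.
match = identify(transcriptome_digest(reference), known)
no_match = identify(transcriptome_digest(["ACGTACGT"]), known)
```

A filtered or otherwise custom transcriptome hashes to something the table has never seen, which is the gap linkedTxomes is meant to fill.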


linkedTxomes

Link your custom (e.g. filtered) transcriptome to upstream source, e.g. FASTA sources + GTFs


A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs (Charlotte Soneson @CSoneson)

"Should be a pre-print any minute"


Motivating Example

ZADH2 data from GTEx. Has four isoforms, distinct 5' and 3' UTRs

Which isoforms are expressed? Given junction mapping data

GTEx expression data at this locus does not appear to be consistent with the junction evidence


Can we detect genes with conflicting isoform expression in an automated fashion?


The idea

Assume: the annotation is complete AND we know the abundance of all transcripts AND we can model library prep and sequencing bias
Then: should be able to predict the coverage of each genomic region

In particular the number of reads spanning each exon/exon junction


Thus: if we can't predict the coverage accurately, one of the assumptions is wrong (e.g. abundances)


In practice

Predict junction coverages from transcript abundance and bias models


Abundances can be estimated in many ways: Salmon, kallisto, Sailfish, RSEM, StringTie, ... whatever
Predict coverage from transcript coverage and bias models. Bioconductor package alpine

https://bioconductor.org/packages/release/bioc/html/alpine.html
From predicted coverage, weight and combine by transcript ⟶ expected junction coverage


Consider correlation of observed and predicted junction counts


Generally high correlation $r = 0.945$, $r_s = 0.903$

Compatibility score $JCC_i$: for each gene $i$ quantify deviation between predicted and observed junction coverage


Most genes have low $JCC$ scores

"Different abundance estimations give different scores, but approximately equally affected"

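A rough sketch of the comparison for one gene, in plain Python. The exact JCC formula was not captured in these notes, so `junction_deviation` below is a stand-in of my own (rescale predicted coverage to the observed total, then take the summed absolute difference as a fraction of the observed total), and the junction counts are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def junction_deviation(predicted, observed):
    """Hypothetical per-gene deviation score, 0 meaning perfect agreement.

    Predicted junction coverages are rescaled to the observed total so that
    only the relative pattern across junctions is compared.
    """
    total_pred, total_obs = sum(predicted), sum(observed)
    if total_pred == 0 or total_obs == 0:
        return float("nan")
    scale = total_obs / total_pred
    diff = sum(abs(scale * p - o) for p, o in zip(predicted, observed))
    return diff / total_obs

# Made-up coverages for three junctions of one gene.
predicted = [10, 20, 30]
consistent = [20, 40, 60]  # same pattern at twice the depth
conflicting = [60, 40, 20]  # reversed pattern

r = pearson(predicted, consistent)
good_score = junction_deviation(predicted, consistent)
bad_score = junction_deviation(predicted, conflicting)
```

A gene whose observed junction pattern matches the prediction scores 0 regardless of depth, while one where the "wrong" junctions carry the reads scores high and can be flagged.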

Back to ZADH2

Read coverage of exons explains why the "wrong" isoform is over-weighted


Choice of annotation affects the score

Some genes for which CHESS is better and others for which Ensembl is better.


Assembling missing transcripts improves the score

Run StringTie and include new transcripts in quantification, many more genes are better with StringTie

[Shouldn't we always do this?]


Summary

JCC score lets you flag genes for which tx abundance estimates are unreliable (regardless of underlying cause)
works on a sample by sample basis (don't need replicates)
requires only RNA-seq data


Unpacking signal from RNA-seq intron reads using Rsubread and limma packages (Charity Law)

https://www.biorxiv.org/content/early/2018/06/21/352823

Two different count types: exon (any annotated exon) and intron (uniquely to intron)

Rsubread: align using subjunc, summarize using featureCounts
Total RNA: 56% exon, 21% intron; polyA: 69%, 7%


Making use of intron reads
What are they useful for? No consensus... intron retention? Nascent transcription? Noise/useless?
Intron reads are informative: MDS plots using limma, samples separate by biological groups and library preparation
Looking at coverage patterns in bins across genes, metagene and individual genes
Final thoughts

Majority of expressed genes have both exon and intron signal
3' coverage bias may impact intron retention detection
Lots of intron reads in single cell data... Needs to be looked at


bcbioSingleCell: R package for bcbio single-cell analysis (M Steinbaugh)

Slides: https://github.com/steinbaugh/presentations/blob/master/2018-07-27/bioc2018.pdf

Motivation: cell population analysis facilitates cell type (clustering) and differentiation (trajectory, pseudotime) analysis
Technologies: InDrops, Chromium ("use this one if you can"), DropSeq, Seq-well
bcbio: best practice pipelines for lots of bioinformatics tasks...

bcbio single-cell pipeline

Not in R: bcl2fastq -> umis -> RapMap/Kallisto
bcbioSingleCell() -> QC: scater, scran, TRAJ: monocle, STREAM, SPRING, Cluster: Seurat, ...

Sparse matrix (Matrix) SingleCellExperiment. QC tools, visualization, DE testing


(Usage examples for the bcbioSingleCellExperiment stuff)

gwasurvivr: an R package to perform survival association testing on imputed genetic data (Abbas Rizvi @aarizvi)


Introduction: what's a GWAS?

Association between genotype and phenotype
Typically using arrays, 750k to 1M markers
Standard practice is to impute genotypes

Popular options: IMPUTE2, Minimac3, BEAGLE

Estimate haplotypes using a reference panel (1000 Genomes, etc)


Survival packages for GWAS.. a bunch


gwasurvivr

Fast survival analysis
model SNP by covariate interactions
Filter SNP imputation/quality metrics
Leverage existing Bioconductor packages, GWASTools, VariantAnnotation
Parallelize across compute cores, modify existing survival::coxph.fit

Goal: decrease number of iterations needed for convergence


(Usage examples, as usual these are too small to read)
Benchmarking

Runtime comparison, orders of magnitude faster
"Highly correlated results"

[Looks identical? Is correlation the right metric to use here?]


Data-Adaptive Estimation and Inference for Differential Methylation Analysis (Nima Hejazi)

methyvim package
I can't read these slides at all due to the colors ;( bit.ly/bioc_methyvim_2018

Data: Infinium EPIC methylation arrays; ~850k sites
Is disease state related to methylation?
First pass analysis, fit linear model to each site, get $\beta$, test significance

Does this answer the right question?
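For reference, the "first pass" amounts to something like the following per-site regression, sketched in plain Python. I am reading $\beta$ as the regression coefficient of methylation on disease state, which is an assumption, and `slope_and_t`, the site names and the values are all made up.

```python
import math

def slope_and_t(x, y):
    """Ordinary least squares of y on x: return (slope, t statistic)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    t = beta / se if se > 0 else math.inf
    return beta, t

# Disease state per sample (0 = control, 1 = case) and made-up methylation
# levels at two hypothetical CpG sites.
disease = [0, 0, 0, 1, 1, 1]
sites = {
    "site_1": [0.1, 0.2, 0.3, 0.5, 0.6, 0.7],
    "site_2": [0.4, 0.5, 0.6, 0.4, 0.5, 0.6],
}

# One independent model per site; each site is tested in isolation.
results = {name: slope_and_t(disease, values) for name, values in sites.items()}
```

Each site is fitted on its own here, so nothing in the model knows about neighbouring sites, and with ~850k sites the resulting tests would still need multiple-testing correction.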


Sites are not independent, consider effects of neighboring sites
Approach uses "causal inference and machine learning"

Filter sites with some evidence of differential methylation and cluster
Estimate variable importance measure at each site controlling for pattern at neighboring sites
"and of course we need to correct for multiple testing"


Data Adaptive Evaluation of Preprocessing Methods using Ensemble Machine Learning (Rachael Phillips)


Introduction

Arsenic: naturally occurring, acute and chronic health effects
Question: What is the effect of early life arsenic exposure on CpG methylation in adults?


Data

44 blood samples (21 exposed, 23 unexposed)
EPIC chip, 850k CpGs, batches...


Pre-processing

Raw -> QC -> Transform -> Filter -> Batch Correction -> Downstream

Lots of options for every step... "why not let my data tell me what approach might be the best?" "maximize my biological system"


Data Adaptive Methodology

Assess using positive control vars (age, exposure, smoking) and negative control variables (batch, date, who processed, ...)
Zoooooom... but basically maximize (average risk) prediction of positive controls, minimize prediction of negatives
Ensemble learning

~10 different base learners, ~10 different normalization strategies
5-fold cross validation


Future

Defining optimal choice
Refine screening to be more efficient
Limitations of data (confounding)
...


(END OF TALKS!)