@jxtx jxtx/
Last active Aug 22, 2018

#bioC 2018 Conference Notes

Conference info:

My first Bioconductor meeting, and I'm not a BioC or R expert so these notes are probably going to be naïve!


Developer Day

(I arrived at 1:00pm and missed the morning sessions)

⚡️Lightning talks II

Meetups (Aedin Culhane)

  • Boston first, then NYC
  • Bioconductor has a community Slack with a #meetups channel
  • Boston: 40 to 100 people turn up

BiocFileCache (Lori Shepherd)

Local file management. Cache files locally to avoid re-downloading from remote sources when not needed. Also aims to provide a better way to organize files.

BiocFileCache(). Backed by a sqlite database.

  • bfcadd( rname=..., fpath=...) adds an existing file to track in the cache
  • bfcnew( rname=... ) gives a new path in the cache
  • bfcneedsupdate() checks whether a dataset has changed remotely and needs to be downloaded again
  • bfcquery(...) search for datasets in the cache
  • bfcrpath(...) gives the local path of a file by id

Can also attach metadata to datasets.
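
Not from the talk: a minimal sketch of how the calls listed above fit together, assuming the BiocFileCache package is installed. The cache directory, file contents, and resource names ("example-notes", "example-table") are made up for illustration.

```r
library(BiocFileCache)

# Use a throwaway cache directory so the sketch does not touch a real cache.
cache_dir <- file.path(tempdir(), "bfc-sketch")
bfc <- BiocFileCache(cache_dir, ask = FALSE)

# Track an existing file (it is copied into the cache by default).
existing <- tempfile(fileext = ".txt")
writeLines("some data", existing)
bfcadd(bfc, rname = "example-notes", fpath = existing)

# Reserve a new path inside the cache and write to it ourselves.
new_path <- bfcnew(bfc, rname = "example-table", ext = ".rds")
saveRDS(data.frame(x = 1:3), new_path)

# Search by resource name, then resolve the local path by id.
hits <- bfcquery(bfc, "example-notes", field = "rname")
local_path <- bfcrpath(bfc, rids = hits$rid)

# For resources added from a URL, bfcneedsupdate(bfc, rid) reports whether
# the remote copy has changed; it is not meaningful for local files.
```

The point of going through rname/rid rather than raw paths is that the code never needs to know where the cache lives on disk.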

VariantExperiment (Qian Liu)

  • Stored variants (genotypes, multiple assays, multiple individuals).
  • Extends RangedSummarizedExperiment.
  • Can construct from gds file or from a vcf file.
  • Subsetting and range slicing.
  • "Many statistical methods are defined". Example: hwe.

Scalable computing in Bioconductor (Nitesh Turaga)

  • map/reduce in R: lapply( X, FUN, ...).
  • BiocParallel: bplapply( ..., bpparam )
  • bpparam determines the BiocParallel backend to use
  • e.g. SerialParam or MulticoreParam
  • New: scaling across clusters, BatchtoolsParam( workers=..., cluster=... )
    • cluster might be SGE, SLURM, LSF, PBS, et cetera
  • Example: Salmon pseudoalignment
    • instantiate BatchtoolsParam with institution specific template
    • Write a function that processes a single sample
    • Pass function to bplapply to run in parallel
      • Progress Bar!
  • Benefits: easier cluster management, ...
  • https://github/nturaga/BatchtoolsParam_examples
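
Not from the talk: a small sketch of the "write a function for one sample, then map it" pattern described above, using only base R and the bundled parallel package so it runs anywhere. The function `process_sample` and the sample names are hypothetical stand-ins; the BiocParallel calls are shown as comments because they need the package installed.

```r
# A toy per-sample function; in the Salmon example this would launch one
# quantification job. `process_sample` and the sample names are hypothetical.
process_sample <- function(sample_id) {
  nchar(sample_id)  # stand-in for real work
}
samples <- c("sampleA", "sampleB2", "sampleC33")

# Serial map with base R.
serial <- lapply(samples, process_sample)

# Same call shape on forked workers via the base 'parallel' package
# (forking is unavailable on Windows, so fall back to one core there).
cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- parallel::mclapply(samples, process_sample, mc.cores = cores)

# With BiocParallel installed, only the back end object changes:
#   library(BiocParallel)
#   bplapply(samples, process_sample, BPPARAM = SerialParam())
#   bplapply(samples, process_sample, BPPARAM = MulticoreParam(workers = 2))

# "Reduce" step: combine the per-sample results.
total <- Reduce(`+`, forked)
```

The per-sample function stays the same whichever back end runs it, which is the property that lets a cluster back end be swapped in later.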

Workshops (Levi Waldron)

How the materials for the conference workshops (running Thursday/Friday) were built.

Organism.dplyr (Daniel van Twisk)

  • Alternative interface to the org.* packages, similar purpose to OrganismDbi
  • Any organism with both an org and txdb package can be used
  • src_organism( "org..." ) provides the interface. Compatible with all methods from dplyr
  • 11 genomic coordinate extractor methods available, e.g. transcripts gets a GRanges, transcripts_tbl gets a tibble
  • Examples of a variety of complex filters (too small to read!)

Birds of a Feather

Parallel tracks, Peter Hickey on Effectively using DelayedArray and Levi on New Data Structures for Bioconductor. (Sorry @PeteHaitch but I can only choose one).

New Data Structures for Bioconductor (Levi Waldron)

First, a presentation "Why re-use core classes: A plea to developers of Bioconductor packages" (Levi).

  • What is Bioconductor? 1,400 packages on a backbone of data structures.
    • e.g.: GenomicRanges, SummarizedExperiment
  • Why do core classes matter?
    • Suppose you want to build a rocket powered bike. You could start from raw steel and forge your own frame.
      • But your frame has limited testing, and probably doesn't handle many use cases
    • It is easy to define a new S4 class in R
      • But you shouldn't: it is very difficult to build a robust and flexible class for genomic data analysis
    • Example from phylogenetics / microbiome packages: not using common classes ⟶ limits interoperability
  • What are the core classes?
  • Core classes represent years of work and maintenance and have been used by tens of thousands of users


  • Q: What would you do instead of defining your own class (in the case of phyloseq)?

    • .MicrobiomeExperiment <- setClass("MicrobiomeExperiment",
              contains="SummarizedExperiment",
              representation( rowData="MicrobiomeFeatures" ))
    • Gives you the benefits of SummarizedExperiment, compatible with MultiAssayExperiment
  • Exploring S4 classes

    • extends tells you what superclasses a given class extends
      • e.g. RangedSummarizedExperiment is a SummarizedExperiment, Vector, Annotated
    • showClass adds known subclasses and what slots it contains
    • methods tells you what methods are defined for a class
      • e.g. 100+ methods on SummarizedExperiment, but 54 of those from parent classes
  • Example that has "done things right": SingleCellExperiment

    • extends RangedSummarizedExperiment and defines additional methods...
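
Not from the talk: a base R sketch of the "extend, don't reinvent" idea and the S4 exploration helpers mentioned above. The classes `BaseExperiment` and `MicrobeExperiment` and the generic `nAssays` are made-up stand-ins for a core class and a package-specific subclass, so nothing here needs Bioconductor installed.

```r
library(methods)

# Toy stand-ins for a core class and a package-specific extension.
setClass("BaseExperiment", representation(assays = "list"))
setClass("MicrobeExperiment", contains = "BaseExperiment",
         representation(taxonomy = "character"))

# A method defined once on the parent class.
setGeneric("nAssays", function(x) standardGeneric("nAssays"))
setMethod("nAssays", "BaseExperiment", function(x) length(x@assays))

me <- new("MicrobeExperiment",
          assays = list(counts = matrix(1:4, 2)),
          taxonomy = c("Bacteroides", "Prevotella"))

extends("MicrobeExperiment")    # the class plus its superclasses
showClass("MicrobeExperiment")  # slots and what it extends
is(me, "BaseExperiment")        # the subclass is-a BaseExperiment
nAssays(me)                     # the inherited method works unchanged
```

The subclass gets `nAssays` for free, which is the small-scale version of inheriting the 100+ SummarizedExperiment methods.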

PharmacoGx Updates (Petr Smirnov)

  • WIP: Fixing up to work better with Bioconductor objects

  • Drug sensitivity data: "treat a cell line with a drug and see how well it kills it"

  • Structure

    • molecularProfiles: List of ExpressionSet objects
    • sensitivity: List of a couple of data frames and an array
      • Initially Experiment IDs with dose/viability pairs
        • But, drugs combinations, other dimensions, not naturally a matrix
      • Solution: LongArray object
        • col.ids and row.ids: data.frame's
          • Example (combination_name, drugA, drugB)
        • Get data as if from a list
          • "long array but behaves when you use a single bracket as if a matrix"
        • Slicing example
          • dcText is a longArray object with "rows" across 2 variables and "columns" across 1 variable
          • slicing dcTest[c("5-FU", "Bortezomib", "Erlotinib"), "A2058"]
  • Q: Why is this different from a SummarizedExperiment?

    • Multiple experiments, e.g. drug combinations of 2, 3, ... n drugs
    • Followup Q: what about MultiAssayExperiment?
      • Would lose quick subsetting through multiple dimensions (?)

TxRegInfra (Vince Carey)

  • Investigator's idea: eQTL from GTEx, DHS,... from ENCODE, TFBS from FIMO... use this to interpret GWAS hits
  • Developers: do as little as possible to resolve and keep metadata
  • Existing resources that could help:
    • rtracklayer+tabix, GenomicRanges, RaggedExperiment, ...
    • mongolite
  • RaggedExperiment
  • Example
    • Data
      • collection of eQTL from GTEx
      • encode footprint (not sure what this actually is)
      • encode DHS hotspots
    • Documents in a mongodb database (RaggedMongoExperiment)
      • Every document has a genomic range, so can respond to range queries
  • Summary
    • Basic layout: genomic coordinates x sample/tissue type x assay type
    • MultiAssayExperiment: could work but not an immediate fit
  • "There is a competitor called Giggle"

⚡️Lightning talks III

Bioconductor tricks for dealing with genome annotation (Michael Steinbaugh)

  • Recommended packages
    • GenomicRanges, rtracklayer (GTF -> GRanges), AnnotationHub, ensembldb, GenomicFeatures
  • basejump extends these tools
  • Rich metadata columns (GRanges), mcols(...)

iSEE (Charlotte Soneson)

  • Interactively explore any data in a SummarizedExperiment object (or subclass)
  • Multiple panels with different visualizations, can see how they are linked

Managing project metadata with a standard project format (Nathan Sheffield)

  • Motivation: Most pipelines require individual metadata organization
  • PEP: a standard format for project metadata -- "Portable Encapsulated Project"
  • Ecosystem of tools:
    • format itself: project_config.yaml, samples.csv
    • peppy: Python package
    • pepr: R package
    • geofetch, looper -- map samples onto pipelines and run in different compute environments

Finding Bioconductor Packages (Shian Su)

  • Bioconductor packages are not that easy to find
    • Prioritization: can sort by title (alpha), author (alpha), not really that useful
  • Alternative: BiocExplorer:
    • Prioritize packages based on usage
    • Provides a graph of packages, prioritizing those that are widely used (not sure what

Recent cloud-scale innovations in Bioconductor (Vince Carey)

So fast... so small...

  • Summary first

    • DelayedArray: seamless element level access to out-of-memory / remote array-like resources
    • SummarizedExperiment/MAE: a sort of query language for annotated omics resources
    • Current efforts: improve efficiency of statistical learning using Delayed* resources
  • DelayedArray backends: HDF5 server, BigTable, ...

  • HDF cloud / HDF Kita (Example using 10x)

  • BigTable (Example using OncoTk)

  • The point (I think): You can work with all of these types of remote data in the current version of Bioconductor

Sesame: a sensible way to analyze a DNAme array (Tim Triche)

  • Improves masking on hyper-polymorphic region (e.g. MHC)

Community Activities

Brainstorm and prioritize some products that can be produced in ~45 minutes (and then do the thing).

(Martin using to accumulate suggestions from the attendees)

Voting, winners are:

  • Come up with a data structure for PharmacoGx data -- 9
  • Strategies for posting and answering support site questions -- 8
  • Checks on SummarizedExperiment rownames, rowData()<- -- 7
  • name clashes between BiocGenerics, S4Vectors etc. and tidyverse -- 6
  • Pull requests to fix usage and other warnings in core packages, e.g., Rsamtools -- 6
  • Initiate collaborative development of ... (like iSEE at BiocEurope) -- 6

We'll see what happens...

Summaries of community activities

Support site

  • Main idea: Template for the "ask a question" box: provide some guidance for how to ask a good question
    • Reproducible example
    • Things you tried
    • sessionInfo
  • Other ideas (Google doc link)

Collaborative project

  • "Biocverse": visualizing the Bioconductor ecosystem
  • Use cases
    • New users: given a task, show me all packages, ranked by "importance"
    • Experienced users
    • New developers

SummarizedExperiment rownames

Fixing a problem in assigning rownames. I think.


(more) Discussion of how to store drug sensitivity data.

Store as a database. Write constructor functions that create matrix / SummarizedExperiment from the database.

There's some code in github in a project called longArray but I can't read the user/org name.

Panel discussion: project directions and opportunities

Q&A for the project leadership team.

Q: "Can you think of ways to expand the network of project leaders, expand ownership, expand people who feel they are part of the project"

  • Martin: project has life outside the core. e.g. Single Cell developments largely outside the core.
  • Vince: (turns question back) Is there a lack of recognition or barriers to participation?
  • Aedin: Surprising how many R people don't know Bioconductor

Q: Mechanism to kick packages out?

  • M: There are obstacles that discourage people from participating, but the tradeoff is worth it. Quality of packages, having vignettes. Contrast in how tidyverse works with how Bioconductor works, different views on how software should work. Interesting to think about compromises in making those play along.
  • V: Advantages of putting your package in Bioconductor: development vs release branch (not available in CRAN), vignettes and examples. Not trying to sell Bioconductor methods to people not trying to do them.

Q: Synchronization between Bioconductor and CRAN

  • M: We communicate regularly. At a technical level there is communication. At a social level, much more restricted. CRAN has task views, but no overlap with Bioconductor

Q: How do you recommend new users learn the Bioconductor ecosystem

  • Wolfgang: need a new book. Something like a textbook. Challenge defining what it means to use/learn Bioconductor
  • A: A beginners user guide to Bioconductor should start with SummarizedExperiment, basics... The thing I direct people to is the f1000 channel
  • If you want to learn Bioconductor there is a reason for that. You have some data.
  • V: Talking about content from the edX MOOC
  • Kasper: introduction should focus on the things that EVERY user of Bioconductor should know. Should streamline and cleanup what is presented.

Q: Funding mechanisms the Bioconductor community should be applying for

  • M: Historically get a lot of money into one big shop, easier than scrambling for smaller grants from a variety of places. More recently, have started to diversify. e.g. Human cell atlas grants written by a diverse group of people. More junior faculty wanting to expand participation but need funding to do so. But wants to call on James Taylor...

  • ACK IT ME: What works for Galaxy 1) Core group goes after diverse funding opportunities, one big pot but also lots of other pots 2) Full time funded community outreach. Multiple people across the project dedicated to this. Really makes a huge difference to have someone spending all their time on this so it is never lost/back-burnered [writing what I think not what I say right now]

Q: Increasing participation, diversity, gender balance

  • M: (I didn't capture this well so nothing here)

Q: Question about connections with Africa (H3A?) training and outreach

  • A: There are some efforts/connections
  • V: There is a foundation for Bioconductor devoted to charitable works, has some money, could potentially be used to expand training

(discussion about website here, needs to be simplified, streamlined, refreshed... couldn't hear everything)

(Back to diversity, a plug for participation in Girls who Code and other such groups, always looking for help)


Day "1"

Starting with introductory remarks from Martin (thanking sponsors, organizers, logistics -- all on the conference website).

Last item, Code of Conduct: interesting, shorter than many.

Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome (@michaelhoffman)

  • Introduction
    • "Transcription over-simplified": TF binds DNA, recruits PolII, RNA is made (yup, that's all)
    • ChIP-seq, you might have heard of it...
      • Problem: ChIP-seq needs 10^6 to 10^8 cells ("Determined using 'Cunningham's Law'")
      • Solution: computational prediction of transcription factor binding
        • Old problem, originally entirely sequence based, current methods use open-chromatin and so a lot better
          • Michael thinks HINT works pretty well
  • How to move forward
    1. Use experimental data from ChIP-seq in other cell types
    2. Learn association between local TF binding and global cellular state as measured from transcribed RNA
  • Learning from the transcriptome
    • ChIP-seq data from some cell types, RNA-seq data from more cell types.
      • Bin genome and look for bins where ChIP-seq/RNA-seq correlate (I think ChIP-seq is 200bp bins and RNA-seq is gene level)
        • Correlation matrix is genome-wide, consider cases where p < 0.1
        • For a given bin, consider all genes with "significant" correlation, compute Spearman rank correlation between expression and correlation == expression score (yes, correlations of correlations)
        • phastCons + chromatin accessibility + expression score + number of cell types with TF binding + motif score into a "very simple neural network" [MH: it's simple in that it is fully connected, nothing fancy in the architecture]
          • MLP... optimized # layers, size of layers, activation function, ... looks like lots of hyperparameters [MH: only 4 hyperparameters! 3×3×3×4 = 36 different possible values examined in the grid search]
  • Evaluation
    • (Describing cross-validation scheme, use of precision-recall curves)
    • ChIP-seq data from other cell types most important, then expression score, then chromatin accessibility
  • Performance
    • "A few" TFs where we do very well -- auPR >= 0.5 -- SMC3, CTCF, RAD21... some others, went by too fast [MH: performance plot is in preprint]
    • For 36 TFs MCC more than 0.3 in validation cell types (Roadmap)
    • Correctly predicts novel TF binding sites
  • Trackhub available, didn't see a URL though, probably in the preprint [MH: should add to presentation.]
  • Future:
    • Position dependency -- adding time dimension to network


  • Q: Now that we have 1600 cell lines with RNA-seq, would you be interested in inferring ChIP-seq for those cell lines
    • M: Yes, I would be interested in that, but open-chromatin data is important, not sure it can be done without it
  • Q: Can this be applied to single cell transcriptome data?
    • M: Yes, could do something, again need measurement of open chromatin ("having single cell ATAC would be best")
  • Q: (I think the question is using the model to find most similar cell line)
    • M: (Not sure)
  • Q: What two assays would you do on cells from a donor (for a difficult to acquire tissue)
    • M: transcripts and open chromatin -- but still unclear for single cell
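
Not from the talk: a base R sketch of the "correlation of correlations" expression score as I understood it from the Learning from the transcriptome bullets, for a single genomic bin. All data are simulated, the variable names are made up, and using Pearson correlation for the first step is my assumption; the real method is in the preprint.

```r
set.seed(1)
n_genes <- 50
n_train <- 12   # training cell types with both ChIP-seq and RNA-seq

# Simulated expression (genes x cell types) and ChIP-seq signal at one bin.
expr_train <- matrix(rnorm(n_genes * n_train), n_genes, n_train)
binding <- rnorm(n_train)
# Make the first 10 genes track binding at this bin.
expr_train[1:10, ] <- expr_train[1:10, ] + outer(rep(2, 10), binding)

# Step 1: per-gene correlation between expression and binding across the
# training cell types; keep genes passing the p < 0.1 cutoff.
gene_tests <- lapply(seq_len(n_genes),
                     function(g) cor.test(expr_train[g, ], binding))
gene_cor <- vapply(gene_tests, function(t) unname(t$estimate), numeric(1))
gene_p <- vapply(gene_tests, function(t) t$p.value, numeric(1))
keep <- gene_p < 0.1

# Step 2: in a new cell type, rank-correlate the expression of those genes
# with their step 1 correlation values. That number is the expression score.
expr_new <- rnorm(n_genes)
expr_new[1:10] <- expr_new[1:10] + 2
expression_score <- cor(expr_new[keep], gene_cor[keep], method = "spearman")
```

A high score says the new cell type expresses the genes that, in training, went up when this bin was bound.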

Enter the Matrix: Interpreting omics through matrix factorization (@FertigLab)

  • First, an appeal to the audience: "Need more tools for visualization across different matrix factorization techniques"

  • Introduction

    • Pattern detection is critical in the genomics big data era
    • (Many types of) omics data can be represented in matrices
      • Focus here is mainly on transcription
    • Omics data can be interpreted through matrix factorization (PCA, ICA, NMF, ...)
      • Data = Amplitude x Pattern
        • D: molecules by samples
        • A: molecules by features
        • P: features by samples
  • Focus: Smooth sparse NMF, Bioconductor package CoGAPS

    • $$A_{i,j} \sim \Gamma(\alpha^A_{i,j}, \lambda); \quad \alpha^A_{i,j} \sim \mathrm{Poisson}(\alpha)$$
    • Gamma yields sparsity constraint
    • Implementation: finds constrained non-negative sparse matrices using MCMC Gibbs sampler
  • Application: Biological model of therapeutic resistance

    • Cancer cell line initially sensitive, generate long term resistance, acquire time series (weekly) for cells acquiring resistance and controls
    • Generated gene expression data (bulk RNA-seq?)
    • Initial clustering
      • Gives information about treatment but not much about resistance
    • Matrix factorization for time-course analysis
      • Perform sparse NMF (CoGAPS) and view the Patterns over time
        • Reveals time dependent patterns in resistance
    • BUT: how does one make these abstract patterns useful?
      • Amplitude matrix allows mapping patterns back into gene expression space (or whatever original feature space)
      • Instead of finding genes most highly associated with each pattern, what are the genes associated with only one of the patterns
        • "Pattern marker genes"
          • Group that slowly increases with resistance, another group that slowly decreases, clearly groups treatment and controls
      • Can do standard GSEA on these marker genes
    • Relate this back to non cell-line data?
      • Take weights and project onto another dataset (ProjectR package on GitHub)
      • Human tumors treated with the same therapy
        • Found that the resistance patterns were elevated in the patient tumors that were resistant
  • "Resurgence of matrix factorization for single-cell data"

    • What's different?
      • Datasets are orders of magnitude larger
      • Cell types and timing of individual cells are unknown a priori
    • Showing UMAP of 10x 100k cell data, 10 different time points in mouse retina
      • Cell types are hand annotated to get "ground truth" (hrmmmmmm...)
      • scCoGAPS distinguishes cell types and trajectories
        • Looks like a rod pattern and a cell type pattern...
  • Conclusions

    • Matrix factorization has a long history in genomics
    • Adding new visualization and new statistics to the outputs of MF can enable robust pattern detection
    • Applicable to single-cell datasets
  • Q: 1. about manually classifying 100k cells, 2. (didn't get this one)

    • Research question: how do you use these factorizations to aid classification
    • Replicates: not able to replicate the whole time course (not enough $), but a collaborator had previously developed resistance in same cell line, found tremendous heterogeneity, sounds like generalizability still unclear
  • Q: Scalability, algorithms people should focus on in the face of HCA, 2M cell scale...

    • My algorithm won't converge on data at that scale, gradient algorithms will converge (but badly). Two approaches
      • Parallelize across different sets of genes, cells
      • compaction approaches (group related cells, factorize in reduced space)
  • Q: "Are you aware of groups that have resurrected CUR decomposition... quantization approaches... where you hit the limit"

    • Haven't seen that. Surprised at the amount of reinventing the wheel. Need to go back to that literature.
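
Not from the talk: a base R sketch of the Data = Amplitude x Pattern idea, with D as molecules by samples, A as molecules by features, and P as features by samples. To keep it short this uses plain Lee-Seung multiplicative updates for NMF, which is a different algorithm from the MCMC Gibbs sampler in CoGAPS. The data are simulated, and the 0.8 cutoff for calling a "pattern marker" is an arbitrary choice of mine, much cruder than whatever the package does.

```r
set.seed(42)
n_genes <- 30
n_samples <- 8
k <- 2

# Simulate D = A_true %*% P_true with one rising and one falling pattern.
A_true <- matrix(rexp(n_genes * k), n_genes, k)
P_true <- rbind(seq(0, 1, length.out = n_samples),
                seq(1, 0, length.out = n_samples))
D <- A_true %*% P_true

# Random non-negative start, then multiplicative updates, which keep
# every entry of A and P non-negative.
A <- matrix(runif(n_genes * k), n_genes, k)
P <- matrix(runif(k * n_samples), k, n_samples)
eps <- 1e-9
for (iter in 1:1000) {
  P <- P * (t(A) %*% D) / (t(A) %*% A %*% P + eps)
  A <- A * (D %*% t(P)) / (A %*% P %*% t(P) + eps)
}
rel_error <- norm(D - A %*% P, "F") / norm(D, "F")

# Crude "pattern marker" call: put the patterns on a common scale, then
# flag genes whose amplitude sits mostly in a single pattern.
A_scaled <- sweep(A, 2, apply(P, 1, max), "*")
share <- A_scaled / rowSums(A_scaled)
dominant <- max.col(share, ties.method = "first")
is_marker <- apply(share, 1, max) > 0.8
```

Rows of P are what you would plot over time; rows of A are how you get from a pattern back to genes.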


Analysis of high content microscopy data generated through automated yeast genetics (Brenda Andrews)

  • Introduction

    • Major challenge, predicting phenotype from genotype using genetic interactions
    • Using budding yeast because of its reagents for systematic genetics, including
      • Yeast deletion collection: 5000 yeast strains each deleted for a single non-essential gene
      • 1000 temperature sensitive alleles of essential genes
    • Need methods for detecting gene interactions
      • SGA (Synthetic genetic array): introduce any marked allele into an arrayed set of strains
    • Main phenotype is growth (colony size)
      • e.g. tested 23.4 million double mutants identifying 1.1M genetic interactions
      • generated "hierarchical model of cell function" (Costanzo et al. Science 2016)
    • ~35% of nonessential query gene mutants exhibit weak genetic interaction profiles
    • Most of the time double mutants do not have a growth phenotype, but may have other phenotypes
  • "Marker project"

    • Introduce fluorescent markers for sub cellular compartments into the arrayed strains
    • How does mutation of any gene influence sub-cellular compartments?
    • Developing a general phenotypic profiling pipeline
      1. Make strain collection: Use SGA to introduce three markers: compartment of interest, nucleus, cytoplasm
      2. Image: Opera Phenix automated confocal live cell
      3. Data collection
        • Single cell images
        • Single cell morphological features
          • Cell Profiler, ~300 features, 10-50 PCs
          • VAE (autoencoder) to find latent feature vector
      4. Phenotypic profiling
        • Detecting mutants -> penetrance
          • Finding outliers, one-class SVM, distance methods, ... "no one size fits all"
        • Classifying mutant phenotypes
          • "Neural networks"
  • Application: Endocytosis

    • Four markers: actin patch, clathrin coat, late endosome, vacuole
    • Phenotype assignment and classification: two hidden layer MLP, "probabilistic" output layer
    • 21 phenotypes: 4 WT, 17 mutants, ~88.5% accuracy
    • Marker penetrance: ~1230 genes with sig penetrance for at least one marker, ~50% of mutations affect more than one marker
    • What are the mechanisms leading to incomplete penetrance / cell heterogeneity?
    • (I missed a bit here but obviously lots of cool stuff you can do with systematic genetics at this scale!)
  • Q: (Can't hear, not using the microphone)

  • Q: Relationship between imaging features and growth phenotype

    • A: Typically get a morphological phenotype when there is a growth defect

scClustViz - Single-cell RNAseq Cluster Assessment and Interactive Visualisation (Brendan Innes)

  • Motivation

    • Emergence of adult neural stem cells using scRNA-seq
      • First question is a hard question, how many clusters?
      • Answer: keep clustering until there are no longer significant differences between clusters
        • FDR corrected Wilcoxon rank-sum test
  • From this, built an R/Shiny app for visualizing clustering

    • Plots for evaluating number of genes
    • Gene expression distribution (both detection rate and normalized expression in the cells where detected)
      • JT: I like Scanpy's dotplot for this, more compact visualization for selected markers
    • Manually select cells for DE testing
  • Q

    • A: right now handles seurat objects. Can load in whatever single cell object you want
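
Not from the talk: a base R sketch of the stopping rule from the Motivation bullets, i.e. test whether two clusters still differ using per-gene Wilcoxon rank-sum tests with FDR correction. The expression matrix is simulated, and the 0.05 cutoff and the "at least one gene" rule are my assumptions, not necessarily what scClustViz uses.

```r
set.seed(7)
n_cells <- 40    # cells per cluster
n_genes <- 100
cluster <- rep(c("c1", "c2"), each = n_cells)

# Simulated normalized expression (genes x cells); the first 5 genes are
# shifted upward in cluster c2.
expr <- matrix(rnorm(n_genes * 2 * n_cells), n_genes, 2 * n_cells)
expr[1:5, cluster == "c2"] <- expr[1:5, cluster == "c2"] + 3

# Per-gene Wilcoxon rank-sum test between the two clusters.
p <- apply(expr, 1, function(g) {
  wilcox.test(g[cluster == "c1"], g[cluster == "c2"])$p.value
})

# Benjamini-Hochberg FDR correction across genes.
fdr <- p.adjust(p, method = "fdr")
n_de <- sum(fdr < 0.05)

# Keep the two clusters separate only if something still distinguishes them.
keep_split <- n_de > 0
```

In a real run this check would be repeated for each pair of neighbouring clusters at each clustering resolution.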

Differential Enriched Scan 2 (DEScan2): a fast pipeline for broad peak analysis (Dario Righelli)

  • Example: ATAC-seq, mouse (brain?) before/after ECT

  • DEScan2

    • Call and filter peaks + counts ⟶ SummarizedExperiment peak (range) x count
      • Peak calling based on Poisson likelihood without overdispersion
      • Filter score threshold, number of samples
    • Differential expression with edgeR, DESeq2
    • Integration: few samples, annotation, many samples, mixOmics
  • Future: comparing with other packages, testing on ChIP-seq, visualization, test on ATAC single cell.

  • Q: If we wanted to look at ATAC-seq is it in Bioconductor experimental datasets

    • A: No

Improving the accuracy of taxonomic classification for identifying taxa in microbiome samples (Eric Wright)

  • First, updates on DECIPHER
  • Classification for Microbiome sequences
    • Two types of pipelines, 16s (or other marker genes) or WGS
      • Unsupervised: create OTUs and Phylogenetic trees, ...
      • Supervised: classify into known taxonomy (idTaxa in DECIPHER)
  • Problem for taxonomic classification: reference databases are incomplete
    • Evaluating accuracy on a Mock community
      • When there are missing organisms in database BLAST has very high false positive rate. idTaxa does not (0.01% false positive rate?)
      • (Either I spaced out or there were no details on how it actually works)

DEsingle for detecting three types of differential expression in single-cell RNA-seq data (Zhun Miao)

  • Detect three types of DE between two groups in a count matrix
  • Using ZINB model, estimates proportion real vs drop-out zeros
  • DE types
    • DE status -- parameters \theta
    • DE abundance -- argh... slide is gone
    • "General DE"
  • Showing heatmaps... not using the microphone so I'm missing a lot here
  • Presenting tutorial of actually using DEsingle

Towards a gold standard for benchmarking gene set enrichment analysis (Ludwig Geistlinger)

  • Goal: benchmarking GSEA
    • GSA theory null hypotheses: 1) competitive (vs other set), 2) self-contained (no genes in the set of interest are DE)
  • Generations: 1) overrepresentation 2) functional classification 3) integration of network topology
  • Bioconductor package: EnrichmentBrowser
  • Enrichment analysis in practice: GO/KEGG overenrichment, dozens of methods claim improvements, how to evaluate?
  • Solution: GSEABenchmarkeR, standardized benchmarking of GSEA methods
    • Comprehensive real data compendium
    • Systematic and reproducible assessment
  • Benchmark Panel
    • Array Panel: Tarca12/13 19 diseases
    • RNA-seq: 33 datasets from TCGA
  • Assessment
    • Statistical sig
      • For each method how many genes found to be significant
        • Competitive methods 10-20%, Self-contained much higher, like 80%?
  • Phenotype relevance
    • Zooooooom....
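
Not from the talk: a base R sketch of the first-generation method mentioned above, overrepresentation analysis, which tests a competitive-style null (is the gene set enriched for DE genes relative to the rest of the genome?). All counts are made up for illustration.

```r
# Made-up counts: genes measured, DE genes, gene set size, and how many
# genes of the set are DE.
universe <- 10000
n_de <- 500
set_size <- 100
overlap <- 15

# P(at least `overlap` DE genes in the set) under the hypergeometric null.
p_ora <- phyper(overlap - 1, m = set_size, n = universe - set_size,
                k = n_de, lower.tail = FALSE)

# The same test written as a one-sided Fisher exact test on the 2x2 table.
tab <- matrix(c(overlap,                              # in set, DE
                set_size - overlap,                   # in set, not DE
                n_de - overlap,                       # not in set, DE
                universe - set_size - n_de + overlap  # not in set, not DE
                ), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

With 500 of 10000 genes DE you would expect about 5 of a 100-gene set to be DE by chance, so 15 is a clear excess.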



Workshop content is online

Day Two

Binding motifs for DNA and RNA binding proteins (Tim Hughes)

  • Introduction
    • What are TF/RBP binding motifs; Sequence logos; using a PFM to estimate binding affinity
      • Lots of ways to do this; log scale mis-represents what protein actually does, prefers to stay in linear domain
    • Where do motifs come from? 1) SELEX and variants, 2) PBMs, RNAcompete, and variants, 3) ChIP/CLIP-seq and variants
      • PBMs easier to deal with, more realistic
      • ChIP/CLIP "for the aficionados this is actually the worst way to obtain the motifs". Influence of chromatin, combinatorial relationships, ...
      • All data types amenable to k-mer counting, and learning PWMs in counting or discrimination mode
        • "Most people are using MEME approaches which are not trying to be quantitative, something to keep in mind"
    • Sources of motifs
        • In Bioconductor a panel of TF motif collections are collated as MotifDb
          • PLEASE cite original data for individual findings, not JUST aggregation
  • CIS-BP
  • Determining binding preferences of all human TFs first requires... a list of all human TFs!
    • Collaboration with Jussi Taipale and Matt Weirauch to manually reassess and collate human TFs
      • Pooling existing sources gives 2700 potential TFs; ~14% of human genes "way above anyone's reasonable estimate"
      • Review by two independent expert judges
      • PIs resolved disagreements (less than 100, mostly "a specific assay that somebody doesn't believe")
      • Final list of 1639 (actually 1638) TFs. 1105 with Motif, 104 with homolog motif
  • Take home messages
    1. New list has hundreds of TFs not in the last TF survey paper (2009) [+347, -49]
    2. Most human TFs (74%) have at least one known motif and most that don't are C2H2 ZFs
    • Tim has clones for the remaining 426, "if you have an assay you want to run we will give you the clone"
    • C2H2 are not only largest and least well characterized, also most diverse, ...
      • Het ChIP-seq motifs for 131 C2H2-ZFs
        • Almost all have a different motif -- diversification
        • Many bind endogenous retro elements
      • So, duplicate easily, carry a KRAB domain around, diversify quickly.. largest within class diversity of any family by far
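
Not from the talk: a base R sketch of using a PFM to estimate relative binding affinity while staying in the linear domain, as mentioned in the Introduction bullets. It assumes positions contribute independently and that column frequencies are proportional to relative affinity. The count matrix and the helper `relative_affinity` are made up.

```r
# Made-up position frequency matrix: rows are bases, columns are positions.
pfm <- matrix(c(7, 1, 1, 1,
                1, 1, 7, 1,
                1, 7, 1, 1,
                1, 1, 1, 7),
              nrow = 4,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Convert counts to per-position probabilities.
ppm <- sweep(pfm, 2, colSums(pfm), "/")

# Relative affinity of a site: product of per-position probabilities,
# scaled so the best possible site scores 1 (no logs involved).
relative_affinity <- function(site, ppm) {
  bases <- strsplit(site, "")[[1]]
  idx <- cbind(match(bases, rownames(ppm)), seq_along(bases))
  prod(ppm[idx]) / prod(apply(ppm, 2, max))
}

best <- relative_affinity("AGCT", ppm)       # the consensus site
one_off <- relative_affinity("AGCA", ppm)    # one mismatch at position 4
```

In the linear domain a single mismatch here costs a factor of 7, which is easier to read as "7-fold weaker" than a difference in log-odds scores.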

Cancer Biomarker Discovery: Building a Bridge Between Preclinical and Clinical Research (Benjamin Haibe-Kains)

  • Acknowledgements first, Yay!
  • Introduction
    • Genotype matched clinical trials
      • Only 18% to 39% of patients can be matched to a trial based on targeted sequencing
    • Biological materials for biomarker discovery
      • In situ: patient tumors
      • In vivo: patient derived xenografts, engineered mouse models
      • In vitro: immortalized cell lines, 2d cell lines from biopsy, organoids
    • Why cell lines?
      • No ethical issues, cheap high throughput, approved for drugs
    • Biomarker discovery using cell lines
      1. Comprehensive molecular profiling
      2. Drug sensitivity screening
    • Data sharing
      • Long history. NCI60/DTP........... CCLE and GDSC (2012)........ FIMM
    • Pharmacogenomics datasets require curation
      • Genesis of PharmacoGx
      • Cellosaurus to uniquely identify cell lines and tissues
      • Drugs annotated with PubChemID, InChiKey, SMILES
  • PharmacoGx
    • 1691 cell lines, 41 tissues, 759 compounds, 650k drug response experiments, 200 million+ gene/drug associations
    • (Some block diagrams of the database model here but way too small to read anything)
  • PharmacoDB
    • Web interface to mine data, cell line, tissue, compound, gene, ...
      • Response curves across multiple datasets
        • Lot of heterogeneity; even within one dataset!
  • "Real power is being able to combine across datasets"
    • Cell line overlaps, compound overlaps between datasets (showing UpSet plots)
    • Concordance of drug sensitivity
      • Correlation performs poorly
      • Disconnect between statistical analysis and how scientists look at the data
    • Defining modified concordance index greatly increases concordance between GDSC and CCLE
      • (I didn't really follow the details of the mCI unfortunately)
  • Preclinical <--> Clinical
    • Drug matching, integration with cBioPortal
  • Conclusions
    • Selecting right therapy for patient is crucial, need biomarkers predictive of drug response
    • cBioPortal/PharmacoDB can discover robust biomarkers and share them with clinicians
    • Challenge: how do we properly validate the pre-clinical biomarkers for clinical use?
      • Organoids? xenografts? clinical trials?
  • On reproducibility: GEO, GitHub, Bioconductor, Docker + Code Ocean

plyranges: a fluent interface to Bioconductor’s Ranges infrastructure (Stuart Lee @_StuartLee)

"If our technology behaves" (RevealJS...)

  • Introduction
    • Bioconductor infra is powerful: GenomicRanges, IRanges
      • Enables developing new tools
    • However, R world has changed (tidyverse). A lot is expected of a new R user to analyze data with Bioconductor
      • More complexity, S4 system
  • How far can user get with just GRanges?
    • GRanges are (already) tidy!
      • Each range (row) is an observation, metadata columns are variables
  • plyranges: design a grammar
    • endomorphism (ops always give back a GRanges)
    • cohesion
    • consistency (least surprise)
    • expressiveness (functions describe what they do)
  • Include genomic semantics (bioRxiv 327841)
    • arithmetic (shift, resize, flank, coverage)
    • restrict (by metadata or by range queries)
    • aggregation (summarize over ranges)
    • merges
  • Examples
    • Changing coordinates; exons %>% mutate( width=2*width ) doubles the width, by default starting at the START
      • Anchoring: exons %>% anchor_center() %>% mutate( width=2*width )
        • [This seems really odd to me, why is anchoring not an argument to mutate? This needs to carry some state along magically]
        • [Stuart just explained it to me, anchoring is like group_by, it gives back an "AnchoredGRanges" which behaves like / wraps the original but also provides the anchoring info. Okay, makes sense, not that magical]
    • Merging: treating genomic intersections like table joins
      • join_overlap_inner(a, b)
      • join_overlap_inner_within(a, b)
      • join_overlap_inner_directed(a, b)
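
The effect of anchoring on the coordinates can be sketched in base R. This is only an illustration of the arithmetic, not plyranges itself: `resize_anchored` is a hypothetical helper, the exon coordinates are made up, and the rounding used for the center may differ from what plyranges does.

```r
# Toy exon table; coordinates are made up for illustration
exons <- data.frame(start = c(101, 501), end = c(200, 540))
exons$width <- exons$end - exons$start + 1

# Hypothetical helper (not a plyranges function): set a new width while
# holding either the start or the (approximate) center fixed
resize_anchored <- function(ranges, new_width, anchor = c("start", "center")) {
  anchor <- match.arg(anchor)
  grow <- new_width - ranges$width
  if (anchor == "center") {
    # Push the start back by half of the growth so the midpoint stays put
    ranges$start <- ranges$start - grow %/% 2
  }
  ranges$end <- ranges$start + new_width - 1
  ranges$width <- new_width
  ranges
}

doubled_from_start  <- resize_anchored(exons, 2 * exons$width, "start")
doubled_from_center <- resize_anchored(exons, 2 * exons$width, "center")
```

With the start anchored, the first exon becomes 101 to 300; with the center anchored, it becomes 51 to 250. Keeping the anchor as a separate step, rather than an argument to every verb, is what lets the same `mutate` call mean both things.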

Automatic metadata propagation for RNA-seq (Mike Love @mikelove)

  • Motivating example, data re-use
    • From a paper: "we performed RNA-seq quantification with Salmon using hg38 and the RefSeq annotation"
      • Provided raw and summary data: we have a count matrix, know RefSeq and hg38
    • Reusing this data
      • read_tsv( "counts" ) ⟶ gene symbols
      • genes(TxDb...)
      • mapIds() ⟶ lose some genes right away (3770)
      • Just work with intersection...
      • Make a SummarizedExperiment
      • Import ChIP-seq peaks... don't know what genome they are from...
      • 😡
    • Thus: reuse is difficult, time consuming
  • coldata = data.frame( files, names, condition, batch )
  • se = tximeta(coldata)
    1. uses BiocFileCache
    2. matches transcriptome automatically, adds transcript range data and metadata
    • Whether or not the original analyst kept track of it! [Awesome]
  • What's it actually doing?
    • Salmon hashes the contents of the transcriptome. Salmon index contains a hash of the transcriptome
    • Salmon quant outputs include this hash
    • tximeta will hash "all the transcriptomes"
    • If you post the entire quantification directory, others can identify the transcriptome from the hash
  • linkedTxomes
    • Link your custom (e.g. filtered) transcriptome to upstream source, e.g. FASTA sources + GTFs
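
The matching step described above amounts to a checksum lookup. Here is a minimal base R sketch of that idea; it is not tximeta's implementation. `checksum_of`, the transcriptome names, and the sequences are all hypothetical, and MD5 via `tools::md5sum` is only a stand-in for whatever hash Salmon actually records.

```r
# Hypothetical helper: checksum of a set of transcript sequences.
# MD5 (base R's tools::md5sum) is a stand-in; it is not claimed to be
# the hash that Salmon uses.
checksum_of <- function(seqs) {
  path <- tempfile(fileext = ".txt")
  on.exit(unlink(path))
  writeLines(seqs, path)
  unname(tools::md5sum(path))
}

# Toy "reference transcriptomes" (made-up names and sequences)
known_txomes <- list(
  toy_annotation_v1 = c("ACGTACGT", "GGGTTTAA"),
  toy_annotation_v2 = c("ACGTACGT", "GGGTTTAA", "CCCCAAAA")
)
lookup <- data.frame(
  source   = names(known_txomes),
  checksum = vapply(known_txomes, checksum_of, character(1)),
  stringsAsFactors = FALSE
)

# A quantification directory carries the checksum of the index it was run against
reported <- checksum_of(c("ACGTACGT", "GGGTTTAA", "CCCCAAAA"))

# Identify the transcriptome without any record kept by the analyst
matched <- lookup$source[match(reported, lookup$checksum)]
```

Because the lookup keys on content rather than on a file name or a note in the methods section, the match works even when nobody wrote down which annotation was used.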

A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs (Charlotte Soneson @CSoneson)

"Should be a pre-print any minute"

  • Motivating Example

    • ZADH2 data from GTEx. Has four isoforms, distinct 5' and 3' UTRs
      • Which isoforms are expressed? Given junction mapping data
        • GTEx expression data at this locus does not appear to be consistent with the junction evidence
      • Can we detect genes with conflicting isoform expression in an automated fashion
  • The idea

    • Assume: the annotation is complete AND we know the abundance of all transcripts AND we can model library prep and sequencing bias
    • Then: should be able to predict the coverage of each genomic region
      • In particular the number of reads spanning each exon/exon junction
    • Thus: if we can't predict the coverage accurately, one of the assumptions is wrong (e.g. abundances)
  • In practice

    1. Predict junction coverages from transcript abundance and bias models
    • Abundances can be estimated in many ways: Salmon, kallisto, Sailfish, RSEM, StringTie, ... whatever
    • Predict coverage from transcript coverage and bias models using the Bioconductor package alpine
    2. Consider correlation of observed and predicted junction counts
    • Generally high correlation $r = 0.945$, $r_s = 0.903$
    • Compatibility score $JCC_i$: for each gene $i$ quantify deviation between predicted and observed junction coverage
  • Most genes have low $JCC$ scores

    • "Different abundance estimations give different scores, but approximately equally affected"
  • Back to ZADH2

    • Read coverage of exons explains why the "wrong" isoform is over-weighted
  • Choice of annotation affects score

    • Some genes for which CHESS is better and others for which Ensembl is better.
  • Assembling missing transcripts improves the score

    • Run StringTie and include new transcripts in quantification, many more genes are better with StringTie
      • [Shouldn't we always do this?]
  • Summary

    • JCC score lets you flag genes for which tx abundance estimates are unreliable (regardless of underlying cause)
    • works on a sample by sample basis (don't need replicates)
    • requires only RNA-seq data
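
The comparison of observed and predicted junction counts can be sketched in base R for a single gene. The counts are made up, and `deviation_score` is a hypothetical normalized deviation chosen for illustration; these notes do not record the exact definition of $JCC_i$, so it should not be read as the published formula.

```r
# Toy junction counts for one gene (made up): observed reads spanning each
# junction, and counts predicted from transcript abundance estimates
observed  <- c(j1 = 120, j2 = 95, j3 = 0,  j4 = 60)
predicted <- c(j1 = 100, j2 = 90, j3 = 40, j4 = 45)

# Agreement across junctions, as in the correlations quoted above
pearson  <- cor(observed, predicted)
spearman <- cor(observed, predicted, method = "spearman")

# Hypothetical per-gene score: rescale predictions to the observed total,
# then divide the summed absolute deviation by the summed coverage.
# 0 means perfect agreement, 1 means no shared coverage at all.
deviation_score <- function(obs, pred) {
  if (sum(obs) == 0 || sum(pred) == 0) return(NA_real_)
  pred_scaled <- pred * sum(obs) / sum(pred)
  sum(abs(pred_scaled - obs)) / sum(pred_scaled + obs)
}

score <- deviation_score(observed, predicted)
```

In this toy gene the prediction puts 40 reads on a junction with no observed support, which is the kind of mismatch a per-gene score is meant to flag even when the overall correlation looks fine.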

Unpacking signal from RNA-seq intron reads using Rsubread and limma packages (Charity Law)

  • Two different count types: exon (any annotated exon) and intron (uniquely to intron)
    • Rsubread: align using subjunc, summarize using featureCounts
    • Total RNA: 56% exon, 21% intron; polyA: 69%, 7%
  • Making use of intron reads
  • What are they useful for? No consensus... intron retention? Nascent transcription? Noise/useless?
  • Intron reads are informative: MDS plots using limma, samples separate by biological groups and library preparation
  • Looking at coverage patterns in bins across genes, metagene and individual genes
  • Final thoughts
    • Majority of expressed genes have both exon and intron signal
    • 3' coverage bias may impact intron retention detection
    • Lots of intron reads in single cell data... Needs to be looked at

bcbioSingleCell: R package for bcbio single-cell analysis (M Steinbaugh)


  • Motivation: cell population analysis facilitates cell type (clustering) and differentiation (trajectory, pseudotime) analysis
  • Technologies: InDrops, Chromium ("use this one if you can"), DropSeq, Seq-well
  • bcbio: best practice pipelines for lots of bioinformatics tasks...
    • bcbio single-cell pipeline
      • Not in R: bcl2fastq -> umis -> RapMap/Kallisto
      • bcbioSingleCell() -> QC: scater, scran, TRAJ: monocle, STREAM, SPRING, Cluster: Seurat, ...
        • Sparse matrix (Matrix) SingleCellExperiment. QC tools, visualization, DE testing
  • (Usage examples for the bcbioSingleCellExperiment stuff)

gwasurvivr: an R package to perform survival association testing on imputed genetic data (Abbas Rizvi @aarizvi)

  • Introduction: what's a GWAS?
    • Association between genotype and phenotype
    • Typically using arrays, 750k to 1M markers
    • Standard practice is to impute genotypes
      • Popular options: IMPUTE2, Minimac3, BEAGLE
        • Estimate haplotypes using a reference panel (1000 genomes, etc)
    • Survival packages for GWAS.. a bunch
  • gwasurvivr
    • Fast survival analysis
    • model SNP by covariate interactions
    • Filter SNP imputation/quality metrics
    • Leverage existing Bioconductor packages, GWASTools, VariantAnnotation
    • || across compute cores, modify existing
      • Goal: decrease number of iterations needed for convergence
  • (Usage examples, as usual these are too small to read)
  • Benchmarking
    • Runtime comparison, orders of magnitude faster
    • "Highly correlated results"
      • [Looks identical? Is correlation the right metric to use here?]

Data-Adaptive Estimation and Inference for Differential Methylation Analysis (Nima Hejazi)

methyvim package

I can't read these slides at all due to the colors ;(

  • Data: Infinium EPIC methylation arrays; ~850k sites
  • Is disease state related to methylation?
  • First pass analysis, fit linear model to each site, get $\beta$, test significance
    • Does this answer the right question?
  • Sites are not independent, consider effects of neighboring sites
  • Approach uses "causal inference and machine learning"
    • Filter sites with some evidence of differential methylation and cluster
    • Estimate variable importance measure at each site controlling for pattern at neighboring sites
    • "and of course we need to correct for multiple testing"
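
The "first pass" analysis mentioned above (one linear model per site, then multiple testing correction) can be sketched in base R. This is the baseline the talk argues is not enough, not the methyvim approach. The data are simulated, and Benjamini-Hochberg is an assumed choice of correction.

```r
set.seed(1)  # simulated data, for illustration only
n_samples <- 20
n_sites   <- 50
disease   <- rep(c(0, 1), each = n_samples / 2)

# Rows are CpG sites, columns are samples; values stand in for methylation levels
meth <- matrix(rnorm(n_sites * n_samples), nrow = n_sites)
meth[1:5, disease == 1] <- meth[1:5, disease == 1] + 2  # five sites with a true shift

# Fit one linear model per site; keep the disease coefficient and its p-value
fit_site <- function(values) {
  coefs <- summary(lm(values ~ disease))$coefficients
  c(beta = coefs["disease", "Estimate"], p = coefs["disease", "Pr(>|t|)"])
}
results <- as.data.frame(t(apply(meth, 1, fit_site)))

# Correct for testing many sites at once (Benjamini-Hochberg, an assumed choice)
results$p_adj <- p.adjust(results$p, method = "BH")
significant <- which(results$p_adj < 0.05)
```

Each site is tested in isolation here, which is exactly the independence assumption the talk questions.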

Data Adaptive Evaluation of Preprocessing Methods using Ensemble Machine Learning (Rachael Phillips)

  • Introduction
    • Arsenic: naturally occurring, acute and chronic health effects
    • Question: What is the effect of early life arsenic exposure on CpG methylation in adults?
  • Data
    • 44 blood samples (21 exposed, 23 unexposed)
    • EPIC chip, 850k CpGs, batches...
  • Pre-processing
    • Raw -> QC -> Transform -> Filter -> Batch Correction -> Downstream
      • Lots of options for every step... "why not let my data tell me what approach might be the best?" "maximize my biological system"
  • Data Adaptive Methodology
    • Assess using positive control vars (age, exposure, smoking) and negative control variables (batch, date, who processed, ...)
    • Zoooooom... but basically maximize (average risk) prediction of positive controls, minimize prediction of negatives
    • Ensemble learning
      • ~10 different base learners, ~10 different normalization strategies
      • 5-fold cross validation
  • Future
    • Defining optimal choice
    • Refine screening to be more efficient
    • Limitations of data (confounding)
    • ...
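
The evaluation idea (prefer preprocessing that keeps positive controls predictable and makes negative controls unpredictable) can be sketched in base R with 5-fold cross-validation. Everything here is an assumption for illustration: the data are simulated, each preprocessing option is reduced to a single feature, and a plain linear model stands in for the ensemble of base learners used in the talk.

```r
set.seed(42)  # simulated data, for illustration only
n <- 40
age   <- rnorm(n, mean = 50, sd = 10)   # positive control: should stay predictable
batch <- rep(c(0, 1), length.out = n)   # negative control: should not be predictable

# Two hypothetical preprocessing outputs, each reduced to one summary feature
candidates <- list(
  option_a = 0.05 * age + 2.0 * batch + rnorm(n),  # batch effect left in
  option_b = 0.05 * age + rnorm(n)                 # batch effect removed
)

# 5-fold cross-validated mean squared error of predicting `target` from `feature`
cv_mse <- function(feature, target, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = length(target)))
  errors <- vapply(seq_len(k), function(i) {
    train <- data.frame(x = feature[fold != i], y = target[fold != i])
    test  <- data.frame(x = feature[fold == i], y = target[fold == i])
    fit <- lm(y ~ x, data = train)
    mean((test$y - predict(fit, newdata = test))^2)
  }, numeric(1))
  mean(errors)
}

# Lower error on the positive control and higher error on the negative
# control both count in favor of a preprocessing option
scores <- data.frame(
  option       = names(candidates),
  positive_mse = vapply(candidates, cv_mse, numeric(1), target = age),
  negative_mse = vapply(candidates, cv_mse, numeric(1), target = batch)
)
```

How to combine the two columns into a single "optimal choice" is left open here, as it was in the talk's list of future work.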



michaelmhoffman commented Jul 26, 2018

Thanks for taking these notes! I have forked this and added some comments on the notes for my talk. Unfortunately there doesn't seem to be a way to do a pull request for a Gist, but here are the changes:


PeteHaitch commented Jul 26, 2018

Re "Cell cycle -- "cycling" tool? (missed this, look up)"; it was the cyclone() function in the scran Bioconductor package. Some detail in 'Classification of cell cycle phase' section of and based on


hpages commented Jul 26, 2018

Yep, thanks for taking these notes. They're a great resource for those who couldn't make it to Toronto.


TapscottLab commented Jul 26, 2018

This is really awesome. Great resource. Thank you!!!


jxtx commented Jul 26, 2018

Thanks! I merged Michael's changes and added Peter's info on cyclone


lcolladotor commented Jul 27, 2018

These notes are really impressive! Thanks for sharing them James!!!


PeteHaitch commented Jul 27, 2018

Thanks, James. These notes are excellent and it was great to have you here.


moldach commented Aug 22, 2018

These notes were great, thank you!
