@michaelmhoffman
Forked from jxtx/bioc2018.md
Last active July 26, 2018 15:03
BioC 2018 Conference Notes

Conference info: https://bioc2018.bioconductor.org/

My first Bioconductor meeting, and I'm not a BioC or R expert so these notes are probably going to be naïve!

Developer Day

(I arrived at 1:00pm and missed the morning sessions)

⚡️Lightning talks II

Meetups (Aedin Culhane)

  • Boston first, then NYC
  • Bioconductor has a community Slack with a #meetups channel
  • Boston: 40 to 100 people turn up

BiocFileCache (Lori Shepherd)

Local file management. Caches files locally to avoid re-downloading from remote sources when not needed. Also aims to provide a better way to organize files.

BiocFileCache(). Backed by a sqlite database.

  • bfcadd( rname=..., fpath=...) adds an existing file to track in the cache
  • bfcnew( rname=... ) gives a new path in the cache
  • bfcneedsupdate() checks whether a dataset has changed remotely and needs to be downloaded again
  • bfcquery(...) searches for datasets in the cache
  • bfcrpath(...) gives the local path of a file by id

Can also attach metadata to datasets.
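
Not from the talk: to make the "SQLite index plus a directory of files" idea concrete for myself, here is a rough Python sketch. TinyFileCache and its method names are hypothetical and only loosely mirror bfcnew / bfcquery / bfcrpath; this is not how BiocFileCache is implemented.

```python
import os
import sqlite3
import tempfile


class TinyFileCache:
    """Toy file cache: a directory of files plus a SQLite index (hypothetical)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.db = sqlite3.connect(os.path.join(cache_dir, "index.sqlite"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS resource ("
            "rid INTEGER PRIMARY KEY, rname TEXT, rpath TEXT)"
        )

    def new(self, rname):
        """Register a resource name and return (id, fresh path inside the cache)."""
        cur = self.db.execute(
            "INSERT INTO resource (rname, rpath) VALUES (?, '')", (rname,)
        )
        rid = cur.lastrowid
        rpath = os.path.join(self.cache_dir, f"{rid}_{rname}")
        self.db.execute("UPDATE resource SET rpath = ? WHERE rid = ?", (rpath, rid))
        self.db.commit()
        return rid, rpath

    def query(self, pattern):
        """Return (id, name) pairs whose name contains `pattern`."""
        rows = self.db.execute(
            "SELECT rid, rname FROM resource WHERE rname LIKE ? ORDER BY rid",
            (f"%{pattern}%",),
        )
        return rows.fetchall()

    def rpath(self, rid):
        """Return the local path recorded for a resource id."""
        row = self.db.execute(
            "SELECT rpath FROM resource WHERE rid = ?", (rid,)
        ).fetchone()
        return row[0]


# Usage: ask the cache for a path, then write the file there.
cache = TinyFileCache(tempfile.mkdtemp())
rid, path = cache.new("counts.tsv")
with open(path, "w") as handle:
    handle.write("gene\tcount\n")
```

The point of the design, as I understood it, is that callers only ever hold a name or an id and ask the cache where the file lives.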

VariantExperiment (Qian Liu)

  • Stored variants (genotypes, multiple assays, multiple individuals).
  • Extends RangedSummarizedExperiment.
  • Can construct from gds file or from a vcf file.
  • Subsetting and range slicing.
  • "Many statistical methods are defined". Example: hwe.

Scalable computing in Bioconductor (Nitesh Turaga)

  • map/reduce in R: lapply( X, FUN, ...).
  • BiocParallel: bplapply( ..., BPPARAM=bpparam )
  • bpparam determines the BiocParallel backend to use
  • e.g. SerialParam or MulticoreParam
  • New: scaling across clusters, BatchtoolsParam( workers=..., cluster=... )
    • cluster might be SGE, SLURM, LSF, PBS, et cetera
  • Example: Salmon pseudoalignment
    • instantiate BatchtoolsParam with institution specific template
    • Write a function that processes a single sample
    • Pass function to bplapply to run in parallel
      • Progress Bar!
  • Benefits: easier cluster management, ...
  • https://github.com/nturaga/BatchtoolsParam_examples
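
Not from the talk: the pattern here (write one per-sample function, then choose the backend separately) is language independent, so here is a small Python sketch of it. process_sample and the two run_* helpers are made-up names, and threads stand in for a real cluster backend.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample_id):
    """Stand-in for per-sample work (e.g. quantifying one sample)."""
    return sample_id, len(sample_id)


def run_serial(func, items):
    """Serial backend: a plain map, handy for debugging."""
    return [func(item) for item in items]


def run_threaded(func, items, workers=4):
    """Threaded backend: same inputs, same outputs, same order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


samples = ["sampleA", "sampleB", "sampleC10"]
serial_results = run_serial(process_sample, samples)
threaded_results = run_threaded(process_sample, samples)
```

Because the per-sample function never knows which backend runs it, swapping backends is a one-line change at the call site.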

Workshops (Levi Waldron)

How the materials for the conference workshops (running Thursday/Friday) were built.

Organism.dplyr (Daniel van Twisk)

  • Alternative interface to the org.* packages, similar purpose to OrganismDbi
  • Any organism with both an org and a txdb package can be used
  • src_organism( "org..." ) provides the interface. Compatible with all methods from dplyr
  • 11 genomic coordinate extractor methods available, e.g. transcripts gets a GRanges, transcripts_tbl gets a tibble
  • Examples of a variety of complex filters (too small to read!)

Birds of a Feather

Parallel tracks, Peter Hickey on Effectively using DelayedArray and Levi on New Data Structures for Bioconductor. (Sorry @PeteHaitch but I can only choose one).

New Data Structures for Bioconductor (Levi Waldron)

First, a presentation "Why re-use core classes: A plea to developers of Bioconductor packages" (Levi).

  • What is Bioconductor? 1,400 packages on a backbone of data structures.
    • e.g.: GenomicRanges, SummarizedExperiment
  • Why do core classes matter?
    • Suppose you want to build a rocket powered bike. You could start from raw steel and forge your own frame.
      • But your frame has limited testing, and probably doesn't handle many use cases
    • It is easy to define a new S4 class in R
      • But you shouldn't: it is very difficult to build a robust and flexible class for genomic data analysis
    • Example from phylogenetics / microbiome packages: not using common classes --> limits interoperability
  • What are the core classes?
  • Core classes represent years of work and maintenance and have been used by tens of thousands of users

Discussion

  • Q: What would you do instead of defining your own class (in the case of phyloseq)?

    • .MicrobiomeExperiment <- setClass("MicrobiomeExperiment", 
              contains="SummarizedExperiment", 
              representation( rowData="MicrobiomeFeatures" )
      )
      
    • Gives you the benefits of SummarizedExperiment, compatible with MultiAssayExperiment
  • Exploring S4 classes

    • extends tells you what superclasses a given class extends
      • e.g. RangedSummarizedExperiment is a SummarizedExperiment, Vector, Annotated
    • showClass adds known subclasses and what slots it contains
    • methods tells you what methods are defined for a class
      • e.g. 100+ methods on SummarizedExperiment, but 54 of those come from parent classes
  • Example that has "done things right": SingleCellExperiment

    • extends RangedSummarizedExperiment and defines additional methods...

PharmacoGx Updates (Petr Smirnov)

https://bioconductor.org/packages/release/bioc/html/PharmacoGx.html

  • WIP: Fixing up to work better with Bioconductor objects

  • Drug sensitivity data: "treat a cell line with a drug and see how well it kills it"

  • Structure

    • molecularProfiles: List of ExpressionSet objects
    • sensitivity: List of a couple of data frames and an array
      • Initially Experiment IDs with dose/viability pairs
        • But drug combinations and other dimensions do not fit naturally into a matrix
      • Solution: LongArray object
        • col.ids and row.ids: data.frames
          • Example (combination_name, drugA, drugB)
        • Get data as if from a list
          • "long array but behaves when you use a single bracket as if a matrix"
        • Slicing example
          • dcTest is a longArray object with "rows" across 2 variables and "columns" across 1 variable
          • slicing dcTest[c("5-FU", "Bortezomib", "Erlotinib"), "A2058"]
  • Q: Why is this different from a SummarizedExperiment?

    • Multiple experiments, e.g. drug combinations of 2, 3, ... n drugs
    • Followup Q: what about MultiAssayExperiment?
      • Would lose quick subsetting through multiple dimensions (?)
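
Not from the talk: a rough Python sketch of what I understood by "long array that behaves like a matrix under single-bracket indexing". LongTable is a hypothetical stand-in, not the LongArray code; it uses a single row variable for brevity, "HT-29" is a cell line I added for illustration, and the viability numbers are invented.

```python
class LongTable:
    """Long-format records that can be sliced like a matrix (hypothetical)."""

    def __init__(self, records):
        # records: iterable of (row_key, col_key, value) tuples
        self.values = {(row, col): value for row, col, value in records}

    def __getitem__(self, key):
        rows, cols = key
        if isinstance(rows, str):
            rows = [rows]
        if isinstance(cols, str):
            cols = [cols]
        # Combinations that were never measured come back as None.
        return [[self.values.get((r, c)) for c in cols] for r in rows]


# Invented viability values, for illustration only.
table = LongTable([
    ("5-FU", "A2058", 0.42),
    ("Bortezomib", "A2058", 0.10),
    ("Erlotinib", "A2058", 0.87),
    ("5-FU", "HT-29", 0.55),
])
block = table[["5-FU", "Bortezomib", "Erlotinib"], "A2058"]
```

Storage stays long (one record per measurement), so unmeasured combinations cost nothing, while the caller still gets a rectangular block back.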

TxRegInfra (Vince Carey)

https://bioconductor.org/packages/release/bioc/html/TxRegInfra.html

  • Investigator's idea: eQTL from GTEx, DHS,... from ENCODE, TFBS from FIMO... use this to interpret GWAS hits
  • Developers: do as little as possible to resolve and keep metadata
  • Existing resources that could help:
    • rtracklayer+tabix, GenomicRanges, RaggedExperiment, ...
    • mongolite
  • RaggedExperiment
  • Example
    • Data
      • collection of eQTL from GTEx
      • encode footprint (not sure what this actually is)
      • encode DHS hotspots
    • Documents in a mongodb database (RaggedMongoExperiment)
      • Every document has a genomic range, so can respond to range queries
  • Summary
    • Basic layout: genomic coordinates x sample/tissue type x assay type
    • MultiAssayExperiment: could work but not an immediate fit
  • "There is a competitor called Giggle"
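
Not from the talk: "every document has a genomic range, so can respond to range queries" boils down to an interval overlap test. A minimal Python sketch with plain dictionaries standing in for database documents; the field names, coordinates, and tissues are invented, and a real store would use an index rather than a linear scan.

```python
def overlaps(doc, chrom, start, end):
    """True if a document's closed interval overlaps the query interval."""
    return doc["chrom"] == chrom and doc["start"] <= end and doc["end"] >= start


def range_query(docs, chrom, start, end):
    """Linear scan over documents, keeping those that overlap the query."""
    return [doc for doc in docs if overlaps(doc, chrom, start, end)]


# Invented documents: one genomic range plus assay and tissue metadata each.
documents = [
    {"chrom": "chr1", "start": 100, "end": 200, "assay": "eQTL", "tissue": "lung"},
    {"chrom": "chr1", "start": 150, "end": 400, "assay": "DHS", "tissue": "lung"},
    {"chrom": "chr2", "start": 100, "end": 200, "assay": "footprint", "tissue": "liver"},
]
hits = range_query(documents, "chr1", 180, 190)
```

Each hit keeps its assay and tissue fields, which matches the "genomic coordinates x sample/tissue type x assay type" layout from the summary.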

⚡️Lightning talks III

Bioconductor tricks for dealing with genome annotation (Michael Steinbaugh)

http://steinbaugh.com/basejump/

  • Recommended packages
    • GenomicRanges, rtracklayer (GTF -> GRanges), AnnotationHub, ensembldb, GenomicFeatures
  • basejump extends these tools
  • Rich metadata columns (GRanges), mcols(...)

iSEE (Charlotte Soneson)

https://f1000research.com/articles/7-741/v1

  • Interactively explore any data in a SummarizedExperiment object (or subclass)
  • Multiple panels with different visualizations, can see how they are linked

Managing project metadata with a standard project format (Nathan Sheffield)

https://pepkit.github.io; https://databio.org

  • Motivation: Most pipelines require individual metadata organization
  • PEP: a standard format for project metadata -- "Portable Encapsulated Project"
  • Ecosystem of tools:
    • format itself: project_config.yaml, samples.csv
    • peppy: Python package
    • pepr: R package
    • geofetch, looper -- map samples onto pipelines and run in different compute environments

Finding Bioconductor Packages (Shian Su)

github.com/shians/biocexplorer

  • Bioconductor packages are not that easy to find
    • Prioritization: can sort by title (alpha), author (alpha), not really that useful
  • Alternative: BiocExplorer:
    • Prioritize packages based on usage
    • Provides a graph of packages, prioritizing those that are widely used (not sure what

Recent cloud-scale innovations in Bioconductor (Vince Carey)

So fast... so small...

  • Summary first

    • DelayedArray: seamless element level access to out-of-memory / remote array-like resources
    • SummarizedExperiment/MAE: a sort of query language for annotated omics resources
    • Current efforts: improve efficiency of statistical learning using Delayed* resources
  • DelayedArray backends: HDF5 server, BigTable, ...

  • HDF cloud / HDF Kita (Example using 10x)

  • BigTable (Example using OncoTk)

  • The point (I think): You can work with all of these types of remote data in the current version of Bioconductor

Sesame: a sensible way to analyze a DNAme array (Tim Triche)

https://www.bioconductor.org/packages/devel/bioc/html/sesame.html

  • Improves masking on hyper-polymorphic regions (e.g. MHC)

Community Activities

Brainstorm and prioritize some products that can be produced in ~45 minutes (and then do the thing).

(Martin using slido.com to accumulate suggestions from the attendees)

Voting, winners are:

  • Come up with a data structure for PharmacoGx data -- 9
  • Strategies for posting and answering support site questions -- 8
  • Checks on SummarizedExperiment rownames, rowData()<- -- 7
  • name clashes between BiocGenerics, S4Vectors etc. and tidyverse -- 6
  • Pull requests to fix usage and other warnings in core packages, e.g., Rsamtools -- 6
  • Initiate collaborative development of ... (like iSEE at BiocEurope) -- 6

We'll see what happens...

Summaries of community activities

Support site

  • Main idea: Template for the "ask a question" box: provide some guidance for how to ask a good question
    • Reproducible example
    • Things you tried
    • sessionInfo
  • Other ideas (Google doc link)

Collaborative project

  • "Biocverse": visualizing the Bioconductor ecosystem
  • Use cases
    • New users: given a task, show me all packages, ranked by "importance"
    • Experienced users
    • New developers

SummarizedExperiment rownames

Fixing a problem in assigning rownames. I think.

PharmacoGx

(more) Discussion of how to store drug sensitivity data.

Store as a database. Write constructor functions that create matrix / SummarizedExperiment from the database.

There's some code in github in a project called longArray but I can't read the user/org name.

Panel discussion: project directions and opportunities

Q&A for the project leadership team.

Q: "Can you think of ways to expand the network of project leaders, expand ownership, expand people who feel they are part of the project"

  • Martin: project has life outside the core. e.g. Single Cell developments largely outside the core.
  • Vince: (turns question back) Is there a lack of recognition or barriers to participation?
  • Aedin: Surprising how many R people don't know Bioconductor

Q: Mechanism to kick packages out?

  • M: There are obstacles that discourage people from participating, but the tradeoff is worth it. Quality of packages, having vignettes. Contrast in how tidyverse works with how Bioconductor works, different views on how software should work. Interesting to think about compromises in making those play along.
  • V: Advantages of putting your package in Bioconductor: development vs release branch (not available in CRAN), vignettes and examples. Not trying to sell Bioconductor methods to people not trying to do them.

Q: Synchronization between Bioconductor and CRAN

  • M: We communicate regularly. At a technical level there is communication. At a social level, much more restricted. CRAN has task views, but no overlap with Bioconductor

Q: How do you recommend new users learn the Bioconductor ecosystem

  • Wolfgang: need a new book. Something like a textbook. Challenge defining what it means to use/learn Bioconductor
  • A: A beginners user guide to Bioconductor should start with SummarizedExperiment, basics... The thing I direct people to is the f1000 channel
  • If you want to learn Bioconductor there is a reason for that. You have some data.
  • V: Talking about content from the edX MOOC
  • Kasper: introduction should focus on the things that EVERY user of Bioconductor should know. Should streamline and cleanup what is presented.

Q: Funding mechanisms the Bioconductor community should be applying for

  • M: Historically get a lot of money into one big shop, easier than scrambling for smaller grants from a variety of places. More recently, have started to diversify. e.g. Human cell atlas grants written by a diverse group of people. More junior faculty wanting to expand participation but need funding to do so. But wants to call on James Taylor...

  • ACK IT ME: What works for Galaxy 1) Core group goes after diverse funding opportunities, one big pot but also lots of other pots 2) Full time funded community outreach. Multiple people across the project dedicated to this. Really makes a huge difference to have someone spending all their time on this so it is never lost/back-burnered [writing what I think not what I say right now]

Q: Increasing participation, diversity, gender balance

  • M: (I didn't capture this well so nothing here)

Q: Question about connections with Africa (H3A?) training and outreach

  • A: There are some efforts/connections
  • V: There is a foundation for Bioconductor devoted to charitable works, has some money, could potentially be used to expand training

(discussion about website here, needs to be simplified, streamlined, refreshed... couldn't hear everything)

(Back to diversity, a plug for participation in Girls who Code and other such groups, always looking for help)

END OF DEVELOPER DAY.

Day "1"

Starting with introductory remarks from Martin (thanking sponsors, organizers, logistics -- all on the conference website).

Last item, Code of Conduct (https://bioc2018.bioconductor.org/code_of_conduct), interesting, shorter than many.

Virtual ChIP-seq: predicting transcription factor binding by learning from the transcriptome (@michaelhoffman)

https://www.biorxiv.org/content/early/2018/02/28/168419

  • Introduction
    • "Transcription over-simplified": TF binds DNA, recruits PolII, RNA is made (yup, that's all)
    • ChIP-seq, you might have heard of it...
      • Problem: ChIP-seq needs 10^6 to 10^8 cells ("Determined using 'Cunningham's Law'")
      • Solution: computational prediction of transcription factor binding
        • Old problem, originally entirely sequence based, current methods use open-chromatin and so a lot better
          • Michael thinks HINT works pretty well
  • How to move forward
    1. Use experimental data from ChIP-seq in other cell types
    2. Learn association between local TF binding and global cellular state as measured from transcribed RNA
  • Learning from the transcriptome
    • ChIP-seq data from some cell types, RNA-seq data from more cell types.
      • Bin genome and look for bins where ChIP-seq/RNA-seq correlate (I think ChIP-seq is 200bp bins and RNA-seq is gene level)
        • Correlation matrix is genome-wide, consider cases where p < 0.1
        • For a given bin, consider all genes with "significant" correlation, compute the Spearman rank correlation between expression and correlation == expression score (yes, correlations of correlations)
        • phastCons + chromatin accessibility + expression score + number of cell types with TF binding + motif score into a "very simple neural network" [MH: it's simple in that it is fully connected, nothing fancy in the architecture]
          • MLP... optimized # layers, size of layers, activation function, ... looks like lots of hyperparameters [MH: only 4 hyperparameters! 3×3×3×4 = 36 different possible values examined in the grid search]
  • Evaluation
    • (Describing cross-validation scheme, use of precision-recall curves)
    • ChIP-seq data from other cell types most important, then expression score, then chromatin accessibility
  • Performance
    • "A few" TFs where we do very well -- auPR >= 0.5 -- SMC3, CTCF, RAD21... some others, went by too fast [MH: performance plot is in preprint]
    • For 36 TFs MCC more than 0.3 in validation cell types (Roadmap)
    • Correctly predicts novel TF binding sites
  • Trackhub available, didn't see a URL though, probably in the preprint [MH: should add to presentation. https://virchip.hoffmanlab.org/]
  • Future:
    • Position dependency -- adding time dimension to network
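
Not from the talk: since the expression score rests on a Spearman rank correlation, here is a small pure-Python sketch of that statistic (the Pearson correlation of the ranks, with ties sharing their average rank). The example numbers are invented, and how the real method selects genes and bins is in the preprint, not here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for one bin: expression of five genes in a new cell type,
# and each gene's previously learned correlation with binding at that bin.
expression = [5.0, 1.0, 3.0, 8.0, 2.0]
association = [0.6, -0.4, 0.1, 0.9, -0.2]
score = spearman(expression, association)
```

In this toy case the two vectors have identical rank order, so the score is at its maximum; a cell type whose expression ran opposite to the learned associations would score near -1.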

Questions:

  • Q: Now that we have 1600 cell lines with RNA-seq, would you be interested in inferring ChIP-seq for those cell lines
    • M: Yes, I would be interested in that, but open-chromatin data is important, not sure can do without
  • Q: Can this be applied to single cell transcriptome data?
    • M: Yes, could do something, again need measurement of open chromatin ("having single cell ATAC would be best")
  • Q: (I think the question is using the model to find most similar cell line)
    • M: (Not sure)
  • Q: What two assays would you do on cells from a donor (for a difficult to acquire tissue)
    • M: transcripts and open chromatin -- but still unclear for single cell

Enter the Matrix: Interpreting omics through matrix factorization (@FertigLab)

https://www.biorxiv.org/content/early/2018/04/02/196915

  • First, an appeal to the audience: "Need more tools for visualization across different matrix factorization techniques"

  • Introduction

    • Pattern detection is critical in the genomics big data era
    • (Many types of) omics data can be represented in matrices
      • Focus here is mainly on transcription
    • Omics data can be interpreted through matrix factorization (PCA, ICA, NMF, ...)
      • Data = Amplitude x Pattern
        • D: molecules by samples
        • A: molecules by features
        • P: features by samples
  • Focus: Smooth sparse NMF, Bioconductor package CoGAPS

    • $$A_{i,j} \sim \Gamma(\alpha^A_{i,j}, \lambda); \quad \alpha^A_{i,j} \sim \mathrm{Poisson}(\alpha)$$
    • Gamma yields sparsity constraint
    • Implementation: finds constrained non-negative sparse matrices using MCMC Gibbs sampler
  • Application: Biological model of therapeutic resistance

    • Cancer cell line initially sensitive, generate long term resistance, acquire time series (weekly) for cells acquiring resistance and controls
    • Generated gene expression data (bulk RNA-seq?)
    • Initial clustering
      • Gives information about treatment but not much about resistance
    • Matrix factorization for time-course analysis
      • Perform sparse NMF (CoGAPS) and view the Patterns over time
        • Reveals time dependent patterns in resistance
    • BUT: how does one make these abstract patterns useful?
      • Amplitude matrix allows mapping patterns back into gene expression space (or whatever original feature space)
      • Instead of finding genes most highly associated with each pattern, what are the genes associated with only one of the patterns
        • "Pattern marker genes"
          • Group that slowly increases with resistance, another group that slowly decreases, clearly groups treatment and controls
      • Can do standard GSEA on these marker genes
    • Relate this back to non cell-line data?
      • Take weights and project onto another dataset (ProjectR package on GitHub)
      • Human tumors treated with the same therapy
        • Found that the resistance patterns were elevated in the patient tumors that were resistant
  • "Resurgence of matrix factorization for single-cell data"

    • What's different?
      • Datasets are orders of magnitude larger
      • Cell types and timing of individual cells are unknown a priori
    • Showing UMAP of 10x 100k cell data, 10 different time points in mouse retina
      • Cell types are hand annotated to get "ground truth" (hrmmmmmm...)
      • scCoGAPS distinguishes cell types and trajectories
        • Looks like a rod pattern and a cell type pattern...
  • Conclusions

    • Matrix factorization has a long history in genomics
    • Adding new visualization and new statistics to the outputs of MF can enable robust pattern detection
    • Applicable to single-cell datasets
  • Q: 1. about manually classifying 100k cells, 2. (didn't get this one)

    • Research question: how do you use these factorizations to aid classification
    • Replicates: not able to replicate the whole time course (not enough $), but a collaborator had previously developed resistance in same cell line, found tremendous heterogeneity, sounds like generalizability still unclear
  • Q: Scalability, algorithms people should focus on in the face of HCA, 2M cell scale...

    • My algorithm won't converge on data at that scale, gradient algorithms will converge (but badly). Two approaches
      • || across different sets of genes, cells
      • compaction approaches (group related cells, factorize in reduced space)
  • Q: "Are you aware of groups that have resurrected CUR decomposition... quantization approaches... where you hit the limit"

    • Haven't seen that. Surprised at the amount of reinventing the wheel. Need to go back to that literature.
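
Not from the talk: to make "Data = Amplitude x Pattern" concrete, here is a tiny pure-Python NMF using the classic multiplicative updates. This is a deliberately swapped-in, much simpler technique than CoGAPS (no Bayesian priors, no MCMC, no sparsity constraint); the toy matrix and all function names are mine.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def frobenius_error(d, a, p):
    """Frobenius norm of D - A P."""
    approx = matmul(a, p)
    return sum((d[i][j] - approx[i][j]) ** 2
               for i in range(len(d)) for j in range(len(d[0]))) ** 0.5


def nmf(d, k, iterations=500, seed=0, eps=1e-9):
    """Factor D (molecules x samples) into A (molecules x k) and P (k x samples)."""
    rng = random.Random(seed)
    m, n = len(d), len(d[0])
    a = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    p = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # Update P with A fixed: P <- P * (A^T D) / (A^T A P)
        at = transpose(a)
        num, den = matmul(at, d), matmul(matmul(at, a), p)
        p = [[p[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # Update A with P fixed: A <- A * (D P^T) / (A P P^T)
        pt = transpose(p)
        num, den = matmul(d, pt), matmul(a, matmul(p, pt))
        a = [[a[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return a, p


# Toy data built from two known non-negative patterns (4 molecules x 4 samples).
true_a = [[1, 0], [2, 0], [0, 1], [0, 3]]
true_p = [[1, 2, 0, 1], [0, 1, 3, 1]]
data = matmul(true_a, true_p)
amplitude, pattern = nmf(data, k=2)
```

Rows of the pattern matrix are what would be plotted over samples or time, and columns of the amplitude matrix are where "pattern marker genes" would be read off.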

(BREAK)

Analysis of high content microscopy data generated through automated yeast genetics (Brenda Andrews)

  • Introduction
    • Major challenge, predicting phenotype from genotype using genetic interactions
    • Using budding yeast because of the reagents available for systematic genetics, including
      • Yeast deletion collection: 5000 yeast strains each deleted for a single non-essential gene
      • 1000 temperature sensitive alleles of essential genes
    • Need methods for detecting gene interactions
      • SGA (Synthetic genetic array): introduce any marked allele into an arrayed set of strains
    • Main phenotype is growth (colony size)
      • e.g. tested 23.4 million double mutants identifying 1.1M genetic interactions
      • generated "hierarchical model of cell function" (Costanzo et al. Science 2016)
    • ~35% of nonessential query gene mutants exhibit weak genetic interaction profiles
    • Most of the time double mutants do not have a growth phenotype, but may have other phenotypes
  • "Marker project"
    • Introduce fluorescent markers for sub-cellular compartments into the arrayed strains
    • How does mutation of any gene influence sub-cellular compartments?
    • Developing a general phenotypic profiling pipeline
      1. Make strain collection: Use SGA to introduce three markers: compartment of interest, nucleus, cytoplasm
      2. Image: Opera Phenix automated confocal live cell
      3. Data collection
      • Single cell images
      • Single cell morphological features
        • Cell Profiler, ~300 features, 10-50 PCs
        • VAE (variational autoencoder) to find latent feature vector
      4. Phenotypic profiling
      • Detecting mutants -> penetrance
        • Finding outliers, one-class SVM, distance methods, ... "no one size fits all"
      • Classifying mutant phenotypes
        • Neural networks
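
Not from the talk: a rough Python sketch of how I picture "detecting mutants -> penetrance" with a distance method. It scores each mutant cell by its distance from the wild-type median in units of the wild-type median absolute deviation (MAD), then reports the fraction of cells past a cutoff. The single feature, the cutoff of 3, and all numbers are my assumptions; the real pipeline works on many features per cell.

```python
import statistics


def robust_z(values, reference):
    """Distance from the reference median, in units of the reference MAD."""
    median = statistics.median(reference)
    mad = statistics.median([abs(v - median) for v in reference])
    return [abs(v - median) / mad for v in values]


def penetrance(mutant_cells, wildtype_cells, cutoff=3.0):
    """Fraction of mutant cells that look like outliers relative to wild type."""
    scores = robust_z(mutant_cells, wildtype_cells)
    return sum(score > cutoff for score in scores) / len(scores)


# Invented per-cell values of one morphological feature.
wildtype = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.05, 0.95]
mutant = [1.0, 1.1, 2.5, 3.0, 0.9, 2.8, 1.0, 2.6]
fraction_abnormal = penetrance(mutant, wildtype)
```

Here half of the mutant cells fall outside the wild-type range, which is the kind of partial penetrance a per-cell readout can see and a colony-size readout cannot.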