@jxtx
Last active May 17, 2017 22:05
Notes for 3D genome track at GLBio 2017

Keles -- Statistical Methods for profiling long range chromatin interactions from repetitive regions of the genome

  • Multi-mapping reads (multi-reads) are typically thrown out in many HTS analyses, including Hi-C
    • Assays predominantly rely on short reads (50-150bp), so multi-reads are common
    • Using ChIP-seq as an example, incorporating multi-reads finds peaks in regions where "uni-reads" do not
    • e.g. Perm-seq using DHS + ChIP-seq data and multi-reads. 27.3% more peaks compared to ENCODE uniform processing pipeline
  • How to combine this with Hi-C data?
    • Hi-C read processing
      • Typical pipelines: singletons, multi-mapping ends, low map quality, and unaligned all discarded
      • Evaluation of the impact of this using IMR90 and Plasmodium datasets
        • Impact on sequencing depth - high quality multimaps look like ~20% in all cases (interesting, need to check how repeat content varies)
        • Restriction fragment filtering (invalid read pairs) makes some multi-reads become single-reads
      • Continuing with normal Hi-C processing: Bin -> Raw contact maps -> Normalize -> Identify significant contacts
      • Overall impact of including multi-reads: ~5-9% of reads
      • Assigning reads that remain multi-reads after all filters needs modeling
    • Model for Hi-C multi-reads
      • Leverage other reads within the same vicinity
        • Observed $Y_{i,(j,k)} = 1$ if valid read pair $i$ aligns to bin pair $(j,k)$; $\sum_{(j,k)} Y_{i,(j,k)}$ may be more than one for multi-reads
        • Hidden $Z_{i,(j,k)} = 1$ marks the true origin of read pair $i$; $\sum_{(j,k)} Z_{i,(j,k)}$ must be 1
        • $Z_i \sim \mathrm{Multinomial}(\pi_{(1,2)}, \ldots, \pi_{(M,M-1)})$; the $\pi$'s get a Dirichlet prior based on genomic distance between bins.
        • (Fit-Hi-C like stuff here, I'm probably not capturing it perfectly)
        • Fit with EM, get posterior probabilities of read-pairs over each contact bin, threshold to get to counts
      • Evaluation
        • Number of significant contacts: Always gaining more contacts than losing
        • 41% more significant contacts at higher FDR, 31% specific to using multi-reads
        • Reproducibility (across replicates)
          • Common to Uni and Multi are highly reproducible
          • Specific to Multi more reproducible than specific to Uni
        • Novel enhancer/promoter interactions: 20.4% more EPIs that are reproducible using multi-reads (not sure how EPIs were called here)
    • Beta version "mHiC" available from yezheng@stat.wisc.edu
    • Concludes that multi-reads play an even bigger role in Hi-C data (than other data types)
    • Future: incorporate multi-mapping into interaction calls
    • https://github.com/keleslab
    • Questions
      • On structural variants, A: incorporate copy number parameter into model
      • Do 3D models change when incorporating multi-reads? A: not tested
      • (Not able to hear all questions)
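
A minimal sketch of the EM idea from the multi-read model above, not the mHiC implementation. Assumptions: the function names are hypothetical, a flat pseudocount stands in for the distance-based Dirichlet prior, and the Fit-Hi-C-like details are left out. Each read pair lists the bin pairs it aligns to (the observed Y), EM estimates the contact probabilities pi, and the posterior over the hidden Z is thresholded to get counts.

```python
from collections import defaultdict

def em_multiread_assignment(candidates, n_iter=50, pseudocount=1.0):
    """candidates[i] lists the bin pairs (j, k) that read pair i aligns to."""
    pairs = sorted({p for c in candidates for p in c})
    # Start from uniform contact probabilities over the observed bin pairs.
    pi = {p: 1.0 / len(pairs) for p in pairs}
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior of each candidate is pi renormalized over the read's candidates.
        expected = defaultdict(float)
        posteriors = []
        for c in candidates:
            total = sum(pi[p] for p in c)
            post = {p: pi[p] / total for p in c}
            posteriors.append(post)
            for p, w in post.items():
                expected[p] += w
        # M-step: expected counts plus a flat pseudocount (assumed stand-in for the prior).
        norm = sum(expected[p] + pseudocount for p in pairs)
        pi = {p: (expected[p] + pseudocount) / norm for p in pairs}
    return pi, posteriors

def threshold_counts(posteriors, cutoff=0.5):
    """Count a read pair toward its best bin pair only if the posterior exceeds cutoff."""
    counts = defaultdict(int)
    for post in posteriors:
        best, weight = max(post.items(), key=lambda kv: kv[1])
        if weight > cutoff:
            counts[best] += 1
    return dict(counts)

# Three uni-reads support bin pair (1, 2); one multi-read is ambiguous.
example = [[(1, 2)], [(1, 2)], [(1, 2)], [(1, 2), (3, 4)]]
pi, posteriors = em_multiread_assignment(example)
counts = threshold_counts(posteriors)
```

In this toy input the ambiguous read pair is pulled toward (1, 2) because the neighboring uni-reads raise that bin pair's probability, which is the "leverage other reads in the same vicinity" intuition.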

Rohan Paul -- Predicting topological domains from ChIP-seq data using pairwise feature extraction

  • Introducing TADs
    • Histone marks around boundaries (peaks and dips)
  • Predict TADs from histone marks (from ENCODE)
    • Classifiers: SVM, SGD, Random Forest (scikit-learn)
    • Extract 1D features from the 2 boundaries of each TAD
      • Two ways: Binarized and a continuous strength
      • Compute correlation (Pearson) across all marks (?)
    • Negative examples:
      • Case 1: sample another region of similar length (anywhere on the same chromosome?)
      • Case 2: Fix one boundary to real TAD
    • Single cell line test, 5 fold cross validation AUC ~0.9 (SVD)
      • CTCF is most important feature
    • Does this generalize across cell lines? AUC ~0.9 on held out cell line (RF)
  • "Bag of boundaries" approach
    • Hold out one cell line and train on the bag of boundaries
    • predict bag of boundaries with held-out features
    • "Enables TAD prediction in new cell line"
    • "Limited predictive power"
    • Basically, you get a set of boundaries from other cell lines and then predict whether they form a TAD in a new cell line
    • So, any boundaries that are new in that cell line will not be predicted
    • From questions (important) the TADs here are 200-300kb domains (WHAT IS A TAD?)
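
A rough sketch of how the two boundary encodings (binarized vs. continuous strength) might look for one candidate TAD. Everything here is an assumption for illustration: the function name, the flank size, the threshold, and the use of a mean over a small window around each boundary. The talk did not give these details.

```python
def boundary_features(signal, left, right, flank=1, threshold=1.0, binarize=False):
    """signal maps mark name -> per-bin values; left/right are boundary bin indices.

    Returns one feature per (mark, boundary), in sorted mark order.
    """
    features = []
    for mark in sorted(signal):
        track = signal[mark]
        for boundary in (left, right):
            lo = max(0, boundary - flank)
            hi = min(len(track), boundary + flank + 1)
            strength = sum(track[lo:hi]) / (hi - lo)  # mean signal around the boundary
            if binarize:
                features.append(1.0 if strength >= threshold else 0.0)
            else:
                features.append(strength)
    return features

# Toy tracks: CTCF peaks at both boundaries, H3K27me3 flat and low.
toy_signal = {
    "CTCF": [0.0, 3.0, 0.0, 0.0, 0.0, 3.0, 0.0],
    "H3K27me3": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
}
continuous = boundary_features(toy_signal, left=1, right=5)
binary = boundary_features(toy_signal, left=1, right=5, binarize=True)
```

A negative example for "Case 2" could be built the same way by keeping `left` fixed at a real boundary and moving `right` to a non-boundary bin.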

Jacob Schreiber -- Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture

  • Introducing Hi-C, distance effect, Fit-Hi-C, Splines, nulls, outliers
  • Would like to predict interactions (and such) without doing Hi-C, reduce cost / inform genetic basis
  • What features? DNA sequence and DNase hypersensitivity
  • What training data? 82M Fit-Hi-C contacts with q <= 0.1 (1kb on genome), 1-hot encoded DNA, binarized DHS signal
  • "Obvious classifier is neural networks"
    • DNA network C/P/C/P
    • DNase P/C
    • Combine and then C/C/P
    • Combine two arms + distance, D/D/Predict
  • Looks like AUC ~.8, better than genomic distance and other single features
  • To validate in cell types with lower resolution Hi-C data, convert from 1kb to 5kb resolution (how?)
    • Performs better than genomic distance or using GM12878 contact map in other cell types
  • Predictions can recreate insulation score (or anyway, good correlations), also good correlations with replication timing
  • Predicted structures cluster by cell function
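
For reference, a small sketch of the two input encodings mentioned above (1-hot encoded DNA, binarized DHS signal). The helper names and the threshold are assumptions; the network architecture itself is not reproduced here.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode each base as a 4-element indicator row; other symbols (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def binarize(signal, threshold):
    """Turn a per-position DHS signal into 0/1 calls at an assumed threshold."""
    return [1 if value >= threshold else 0 for value in signal]

encoded = one_hot("acgN")
dhs_calls = binarize([0.1, 2.0, 0.5], threshold=0.5)
```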

Shilu Zhang -- In silico prediction of high-resolution Hi-C interaction matrices

  • Introducing distal gene regulation, TAD disruption in disease, Hi-C contact maps
  • Hi-C-Reg: regression approach for predicting contact counts
    • Extract aggregated histone mark and DHS signal across cell lines
    • Pair regions (?)
    • Predict contact count for pair regions
    • RF regression correlation of ~.83
    • Window features (I believe these are the features from the genomic region between the endpoints) are "very helpful" for improving predictions
    • Window features also important for capturing domains
      • Picture of predicted map. No taddy domains without the window feature (or reduced), more clear with window features
    • Ensemble: average training across cell lines. As good or better than cross-cell line predictions
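
A sketch of how a feature vector for one region pair could be assembled, under my reading of the talk: features of each endpoint, plus "window" features averaged over the bins between them, plus distance. The function name and the exact aggregation (a plain mean) are assumptions, not the Hi-C-Reg code.

```python
def pair_features(tracks, i, j):
    """tracks: list of per-bin signal lists (one per mark/DHS); i < j are bin indices."""
    region_i = [track[i] for track in tracks]
    region_j = [track[j] for track in tracks]
    between = range(i + 1, j)
    # Window features: mean signal over the bins strictly between the two endpoints.
    window = [
        sum(track[b] for b in between) / len(between) if len(between) else 0.0
        for track in tracks
    ]
    return region_i + region_j + window + [j - i]

toy_tracks = [
    [1.0, 2.0, 4.0, 8.0],  # e.g. a histone mark
    [0.0, 1.0, 1.0, 0.0],  # e.g. DHS
]
distal = pair_features(toy_tracks, 0, 3)
adjacent = pair_features(toy_tracks, 0, 1)
```

Vectors like these, one per region pair, would be the rows of the design matrix for a regressor such as a random forest, with the observed contact count as the target.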

end of day 1

Sheng Zhong -- Mapping RNA-RNA interactions and RNA-chromatin interactions

  • Three parts:
    • Mapping RNA interactome in vivo (MARIO)
    • Mapping RNA-genome interactions (MARGI)
    • XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  • MARIO
    • Protocol
      • Cross-link RNA/protein complex
      • Attach to surface
      • Add linker DNA to 5' end of RNA
      • Double ligate to RNA-biotin+linker+RNA
      • RT into DNA complement of chimera and sequence
    • Advantages
      • Unbiased selection
      • Applicable to human tissue
    • Risk
      • Random ligation of RNA
        • Mitigate: extreme wash conditions, large distance between complexes on surface
    • Output
      • Paired-end reads with ends mapping anywhere on the genome (hopefully in a known RNA locus)
      • Use to create pairwise interaction network
    • Validation
      • Test co-localization with single molecule RNA imaging (two color labeling)
        • Appears to validate (a few images shown)
  • MARGI
    • https://t.co/rv86Uq1cOP
    • Protocol
      • Similar idea, RNA/DNA complex tethered on solid surface, add to RNA a ss/ds adapter, enriches for RNA/DNA interactions
        • How to determine which side was originally RNA vs DNA? phase the linker so the junction is very specific
        • Circularize/linearize to ensure the linker remains in the read
        • (How efficient is all this?)
      • Purify, RT, amplify, sequence
    • Figure: a bipartite genome browser showing links between RNA ends and DNA ends
    • What are the chromatin interaction non-coding RNAs?
      • snoRNA (~200 genes), miRNA (~100), misc, antisense, miRNA, pseudo, (big drop), linc (~20), proc transc
    • Where do they interact?
      • 80% proximal, 4% cis-distal, 16% trans
      • Distal and trans accumulate at TSSs, and density appears correlated with expression level
      • Inverse correlation between RNA attachment and H3K9me3, but no correlation with H3K4me3/H3K27ac (scatterplots but no correlation quantification here) (also showing RNA attachment peaks have high H3K9me3)
  • XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    • Oooooooooh... secret stuff here.

James Taylor -- Pretty pictures or chromosomes served two ways

Sushmita Roy -- Computational methods to study dynamics of gene regulation

  • Motivation: dynamics of regulatory networks in lineages (either cell lineage or species phylogeny), how do networks change
  • What controls cell-type specific regulation
  • Computational tools I: Comparing 3D organization across cell types / species
    • A graph is a natural representation of a Hi-C dataset: regions == nodes, interaction strengths == edges
    • Does graph clustering help? (to identify structures/domains)
    • Spectral clustering
      • Adjacency Matrix -> Laplacian -> Eigendecomposition -> k-means
    • Assessment: how good are the clusters? enrichment of genomic signals?
      • Spectral clustering tends to do better on different measures (compared to hierarchical and k-means)
    • Spectral clustering of Hi-C data for human ESCs -- 10 clusters
      • Two types, 1) associated with chromatin marks 2) associated with LADs and gene poor
    • Arboretum for clustering regulatory networks across species (existing work), how to adapt for Hi-C
      • Graph combines orthology maps (trees) for regions (genes) and interactions of regions (genes?) within each species
      • Assert: chromatin organization is more similar within species than between
      • Algorithm gives conserved signatures in matched clusters
      • Chromatin organization is conserved -- changes in clusters are between clusters of the same type (these are the two types from earlier)
      • Summary: Graph based methods may be more effective, Arboretum-Hi-C allows comparison of related datasets, organization is conserved across species
  • Tools II: Chromatin state dynamics across cell lineages
    • Data: characterizing chromatin state during reprogramming (MEF->IPS-C->IPS) 5+ marks and 3- marks.
    • Chromatin module: group of genomic loci that have the same chromatin state (where state == same chromatin marks)
    • What are the modules in each cell type and how do they transition?
    • CMINT: Chromatin Module INference on Trees: each module is a MV gaussian, group is mixture
    • Chromatin state during reprogramming defined by 15 different patterns (each is labeled by one dominant histone mark)
    • Transitions between modules
      • e.g. chromatin transition states of Oct4 -- switched completely in iPSC but not completely in pre-
  • Conclusions
    • Chromatin state can be studied in 1D and 3D
    • Predict EPIs
    • Compare chromatin state and 3D state across cell types and species
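
A minimal sketch of the spectral pipeline from the Hi-C graph clustering part (adjacency matrix -> Laplacian -> eigendecomposition). To stay short it swaps the final k-means step for a two-way split on the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue); the function name and the toy matrix are assumptions, and the unnormalized Laplacian is one of several possible choices.

```python
import numpy as np

def spectral_bipartition(adjacency):
    """Split graph nodes into two clusters using the Fiedler vector of the Laplacian."""
    A = np.asarray(adjacency, dtype=float)
    degree = A.sum(axis=1)
    laplacian = np.diag(degree) - A  # unnormalized graph Laplacian
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return (fiedler > 0).astype(int)

# Toy "contact map": two groups of three regions, strong contacts within, weak between.
n = 6
toy = np.ones((n, n))
toy[:3, :3] = 10.0
toy[3:, 3:] = 10.0
np.fill_diagonal(toy, 0.0)
labels = spectral_bipartition(toy)
```

For more than two clusters, the usual approach is to take the first k eigenvectors as node coordinates and run k-means on them, as in the notes.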

Michael Hoffman -- Novel inferences from Hi-C data with protein-coding gene data

  • I think we were just asked not to post (unpublished work)

Feng Yue -- HiCPlus: a deep convolutional neural network for Hi-C interaction matrix enhancement

  • Challenges in current Hi-C data
    • Expensive (10 day protocol), 6+ billion reads for kb resolution
    • Most datasets are low (40kb) resolution; too low to infer EPIs
  • Introducing deep learning for resolution enhancement in the context of images
  • Convolutional net: Low res -> Feature extraction (low) -> Fully connected mapping -> High resolution features -> Output
  • Chromatin interactions are predictable from neighboring regions, hence can impute
  • Training on chromosomes 1-17, test on 18: Local average has correlation ~0.8, larger matrices do better (decays with distance)
  • Prediction
    • Down-sample to 1/16 of reads, create Hi-C map (noisy)
    • Enhanced matrix is highly similar to original matrix -- enhanced is very close to correlation with biological replicate (probably a ceiling on performance)
  • Across cell types: train on GM12878, IMR90, K562 all perform well in correlation measures
  • Identification of interactions in HiCPlus enhanced matrices
    • Enhanced and hi-res recover similar numbers of interactions, low-depth misses many
    • Recovers 50% of peaks from ChIA-PET, most missed in low-resolution Hi-C, similar results with Capture Hi-C data
  • Summary
    • Convolutional net to impute hi-res from low-res Hi-C
    • Works with 1/16 to 1/25 depth
    • Will be available at 3DGenome.org ("If you haven't tried it try it, it's very fast")
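
A sketch of the down-sampling step used to make the noisy low-depth input (1/16 of reads). It assumes a symmetric matrix of integer read counts and keeps each read independently with the given probability; the function name and seed handling are mine, and this is not the HiCPlus code.

```python
import random

def downsample(matrix, fraction=1 / 16, seed=0):
    """Binomially thin a symmetric integer contact matrix to roughly `fraction` of its reads."""
    rng = random.Random(seed)
    n = len(matrix)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Keep each of the matrix[i][j] reads independently with probability `fraction`.
            kept = sum(1 for _ in range(matrix[i][j]) if rng.random() < fraction)
            out[i][j] = out[j][i] = kept
    return out

full = [
    [400, 160, 32],
    [160, 400, 160],
    [32, 160, 400],
]
low_depth = downsample(full)
```

The enhanced matrix predicted from `low_depth` would then be compared against `full`, with the replicate-to-replicate correlation as the practical ceiling.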

lunch

Ferhat Ay -- Gene regulation via 3D chromatin organization in eukaryotic nuclei

  • Existing work: Fit-Hi-C: Assigning statistical confidence estimates to chromatin contact maps
    • Software available in Python (more scalable) and R
    • Captures 3C validated cell specific enhancer-promoter contacts
    • Model works for other chromatin conformation capture assays (e.g. PLAC-seq)
  • Three distinct diseases
    • Malaria
      • Plasmodium falciparum
        • 3D reconstruction of genome. Centromeres colocalize in 3D, so do telomeres
        • Virulence gene clusters also colocalize
          • Plasmodium has 60 var genes (isoforms of same gene), at ends of chromosomes, exactly one expressed per cell
          • Colocalization confirmed using DNA FISH
        • Genes close together have correlated expression profiles
        • Telomeres have a repressive effect (closer you are to telomere in 3d lower expression)
        • H2A.Z is depleted in these regions
      • Newer work
        • Major changes in genome organization between transmission stages (PCA plot stratifying stages shown)
        • Gametocyte specific super-domain formation (similar to X chromosome inactivation)
        • Other parasites (vivax, knowlesi [monkey], Toxoplasma [feline], yoelii, berghei [mus])
          • Other Plasmodium show telomeres with low expression, opposite phenomenon (centromeres) in Toxo?
    • Asthma
      • GWAS locus known on chr17 (also associated with nearly every other immune disorder)
        • Increased ORMDL3 expression in 17q21 locus when asthma risk variants present
        • SNPs overlap DHS, switch a CTCF binding site
          • Risk allele creates interactions with open chromatin sites far away
            • Gene normally has a nearby enhancer, which is lost with the risk allele; this creates an interaction in a different place == aberrant expression
    • Cancer
      • Chromosomal rearrangements of all kinds common in cancer (and transformed cell lines)
      • HiCtrans: detect chromosomal translocations from Hi-C data (ENCODE 3D Nucleome group)
      • HiCnv: detect copy number variations

Abhijit Chakraborty -- A versatile pipeline to simulate Hi-C data with genomic rearrangements (AVeSim)

  • Hi-C provides "clue" to find genomic alterations (rearrangements of various types)
  • Use CNV information to setup a Hi-C matrix simulation pipeline
    • (I think the point here is that local amplifications / copy number variation can be seen in Hi-C matrices under simulation)
  • Two different simulation approaches: random counts and scaled observed counts
  • Lots of examples of how different events look in simulated Hi-C maps
  • Preprint available and code on github

Kimberly MacKay -- A Logical Approach to Modelling the Three-Dimensional Genome

  • Choco: Predicting Chromosomal Organization using Constraints
  • Hypothesis: Use constraint logic programming (ECLiPSe; eclipseclp.org)
  • Finding a scalable representation of the "3D genome reconstruction problem" in CLP.
  • Model organism: Yeast (small genome, haploid, etc)
  • For each row of interaction map, select one representative cell
    • So, list with N column numbers and list with N frequency values
    • Intra and inter problems performed independently (1, 2, 3 in cis, 1+2, 2+3, 1+3 trans)
  • Cytoscape visualization
    • Centromeres and telomeres are clustered (more validation experiments in progress)

Tao Yang -- HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient

  • Pearson correlation as a metric for reproducibility may be misleading
    • Sometimes non-pairs have higher correlation
    • Distance dependence affects (really, dominates) correlation
  • Steps
  • Evaluation
    • SCC differentiates pseudo-replicate pairs, biological replicate pairs, and non-replicate
    • Differentiates biological replicates from non-replicates
    • SCC + clustering allows reconstruction of the true relationship between cells
  • hicrep on github
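
The steps were not captured in these notes, so here is a simplified sketch of the stratum-adjusted idea only: compute Pearson correlation separately within each genomic-distance stratum (each diagonal), then combine with weights. Assumptions: the weight is stratum size times the product of the two standard deviations, and the smoothing and variance-stabilization steps of the real hicrep package are omitted. The function name is hypothetical.

```python
import math

def stratum_adjusted_correlation(a, b, max_distance):
    """Weighted average of per-diagonal Pearson correlations between matrices a and b."""
    n = len(a)
    weighted_sum, total_weight = 0.0, 0.0
    for d in range(max_distance + 1):
        x = [a[i][i + d] for i in range(n - d)]
        y = [b[i][i + d] for i in range(n - d)]
        if len(x) < 2:
            continue
        mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
        var_x = sum((v - mean_x) ** 2 for v in x) / len(x)
        var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
        if var_x == 0 or var_y == 0:
            continue  # correlation undefined in a constant stratum
        cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / len(x)
        scale = math.sqrt(var_x * var_y)
        weight = len(x) * scale
        weighted_sum += weight * (cov / scale)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else float("nan")

map_a = [[0, 5, 1, 2], [5, 0, 3, 1], [1, 3, 0, 4], [2, 1, 4, 0]]
map_b = [[0, 3, 1, 2], [3, 0, 5, 1], [1, 5, 0, 4], [2, 1, 4, 0]]
same = stratum_adjusted_correlation(map_a, map_a, max_distance=3)
different = stratum_adjusted_correlation(map_a, map_b, max_distance=3)
```

Because each stratum is centered on its own mean, the shared distance-decay trend that inflates plain Pearson correlation does not contribute to the score.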

Tim Kunz -- Visualizing and exploring chromatin interactions using the self-organizing map

  • SOM: grid of nodes in output space, each of which maps to a point in data space (constrained by the grid, sort of a manifold)
  • Example: 50x50 grid, trained on inter-chromosomal interaction frequencies
    • Each node contains a set of genomic loci
    • Genomic datasets can then be projected on the map
  • Projected six sub compartments onto map
    • Compartments are non overlapping but are split up on the map
    • Chromosomes cluster on map
  • Project epigenetic marks, etc
    • Use Gini coefficient for measuring level of segregation on the map
    • CTCF and cohesin friends are low on segregation scale, znf??? is high (probably znf274)
    • Compartment associations with histone marks from Rao et al. 2014 are recapitulated in the maps
  • github.com/seqcode/somatic
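
A small sketch of a Gini coefficient as a segregation measure. It assumes the input is one non-negative value per SOM node (e.g. how much of a projected dataset lands on that node); how the talk's tool actually computes the per-node values was not stated, and the function name is mine.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = evenly spread, near 1 = concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

even_spread = gini([1, 1, 1, 1])    # signal spread equally over all nodes
concentrated = gini([0, 0, 0, 1])   # signal on a single node
```

Under this reading, a factor like CTCF that lands everywhere on the map would score low, and a factor confined to a few nodes would score high.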

(Gotta go)
