jxtx/GLBio_3D.md

## GLBio_3D.md

      
    Keles -- Statistical Methods for profiling long range chromatin interactions from repetitive regions of the genome


Multi-mapping reads (multi-reads) are typically thrown out in many HTS analyses, including Hi-C

Assays predominantly rely on short reads (50-150bp) so multi-reads are common
Using ChIP-seq as an example, incorporating multi-reads finds peaks in regions where "uni-reads" do not
e.g. Perm-seq using DHS + ChIP-seq data and multi-reads. 27.3% more peaks compared to ENCODE uniform processing pipeline


How to combine this with Hi-C data?

Hi-C read processing

Typical pipelines: singletons, multi-mapping ends, low map quality, and unaligned all discarded
Evaluation of the impact of this using IMR90 and Plasmodium datasets

Impact on sequencing depth - high quality multimaps look like ~20% in all cases (interesting, need to check how repeat content varies)
Restriction fragment filtering (invalid read pairs) makes some multi-reads become single-reads


Continuing with normal Hi-C processing: Bin -> Raw contact maps -> Normalize -> Identify significant contacts
Overall impact of including multi-reads: ~5-9% of reads
Assigning reads that remain multi-reads after all filters needs modeling
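
A minimal sketch of the "Bin -> Raw contact maps" step, assuming read pairs arrive as (position, position) tuples on a single chromosome and bins have a fixed size. The function name, the toy coordinates, and the 40kb bin size are illustrative assumptions, not details from the talk.

```python
from collections import Counter

def bin_read_pairs(pairs, bin_size):
    """Count read pairs per (bin_i, bin_j) with bin_i <= bin_j.

    Hypothetical helper: `pairs` holds (pos_a, pos_b) coordinates on one
    chromosome; real pipelines also track chromosome, strand, and filters.
    """
    counts = Counter()
    for pos_a, pos_b in pairs:
        # Order the two bins so (i, j) and (j, i) land in the same cell.
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts

# Toy coordinates, chosen only for illustration.
pairs = [(1_200, 45_300), (44_000, 2_500), (81_000, 83_000)]
contact_map = bin_read_pairs(pairs, bin_size=40_000)
```

Only uniquely placed pairs can be counted this way; pairs that still have several candidate bins after filtering are the ones that need the model described next.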


Model for Hi-C multi-reads

Leverage other reads within the same vicinity

Observed $Y_{i,(j,k)} = 1$ if valid read pair $i$ aligns to bin pair $(j,k)$; $\sum_{(j,k)} Y_{i,(j,k)}$ can exceed 1 for multi-reads
Hidden $Z_{i,(j,k)} = 1$ if read pair $i$ truly originates from bin pair $(j,k)$; $\sum_{(j,k)} Z_{i,(j,k)}$ must be 1

$Z_i \sim \mathrm{Multinomial}(\pi_{(1,2)}, \ldots, \pi_{(M,M-1)})$, $\pi$'s get a Dirichlet prior, based on genomic distance between bins.
(Fit-Hi-C like stuff here, I'm probably not capturing it perfectly)
Fit with EM, get posterior probabilities of read-pairs over each contact bin, threshold to get to counts
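
A rough sketch of the EM allocation idea, under strong simplifying assumptions: each read pair comes with its list of candidate bin pairs (where $Y = 1$), and the Dirichlet prior enters only as optional pseudo-counts. The distance-based prior from the talk is not modeled, and `em_allocate` is a hypothetical name, not the mHiC interface.

```python
def em_allocate(candidates, pseudo=None, n_iter=50):
    """Estimate bin-pair probabilities and per-read posteriors by EM.

    candidates: one list per read pair, holding the distinct bin-pair keys
        it could have come from (a uni-read has exactly one candidate).
    pseudo: optional dict of non-negative pseudo-counts, standing in for
        (alpha - 1) of a Dirichlet prior. This is an assumption of the sketch.
    n_iter: number of EM iterations, must be at least 1.
    """
    keys = sorted({c for cand in candidates for c in cand})
    pseudo = pseudo or {}
    pi = {k: 1.0 / len(keys) for k in keys}
    for _ in range(n_iter):
        totals = {k: pseudo.get(k, 0.0) for k in keys}
        posteriors = []
        # E-step: split each read pair across its candidates in proportion to pi.
        for cand in candidates:
            norm = sum(pi[c] for c in cand)
            post = {c: pi[c] / norm for c in cand}
            posteriors.append(post)
            for c, p in post.items():
                totals[c] += p
        # M-step: renormalize expected counts (plus pseudo-counts) into pi.
        z = sum(totals.values())
        pi = {k: totals[k] / z for k in keys}
    return pi, posteriors

A, B = (0, 1), (5, 9)
# Three uni-reads support A, one supports B, and one multi-read fits both.
reads = [[A], [A], [A], [B], [A, B]]
pi, posteriors = em_allocate(reads)
multi_read_posterior = posteriors[-1]
# Thresholding step: here simply take the most probable bin pair.
assigned = max(multi_read_posterior, key=multi_read_posterior.get)
```

In this toy case the multi-read drifts toward the bin pair that already has more uniquely mapped support, which is the "leverage other reads within the same vicinity" intuition.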


Evaluation

Number of significant contacts: Always gaining more contacts than losing
41% more significant contacts at higher FDR, 31% specific to using multi-reads
Reproducibility (across replicates)

Common to Uni and Multi are highly reproducible
Specific to Multi more reproducible than specific to Uni


Novel enhancer/promoter interactions: 20.4% more EPIs that are reproducible using multi-reads (not sure how EPIs were called here)


Beta version "mHiC" available from yezheng@stat.wisc.edu

Concludes that multi-reads play an even bigger role in Hi-C data (than other data types)
Future: incorporate multi-mapping into interaction calls
https://github.com/keleslab
Questions

On structural variants, A: incorporate copy number parameter into model
Do 3D models change when incorporating multi-reads? Not tested
(Not able to hear all questions)


Rohan Paul -- Predicting topological domains from ChIP-seq data using pairwise feature extraction


Introducing TADs

Histone marks around boundaries (peaks and dips)


Predict TADs from histone marks (from ENCODE)

Classifiers: SVM, SGD, Random Forest (scikit-learn)
Extract 1D features from the 2 boundaries of each TAD

Two ways: Binarized and a continuous strength
Compute correlation (Pearson) across all marks (?)


Negative examples:

Case 1: sample another region of similar length (anywhere on the same chromosome?)
Case 2: Fix one boundary to real TAD


Single cell line test, 5 fold cross validation AUC ~0.9 (SVD)

CTCF is most important feature


Does this generalize across cell lines? AUC ~0.9 on held out cell line (RF)


"Bag of boundaries" approach

Hold out one cell line and train on the bag of boundaries
predict bag of boundaries with held-out features
"Enables TAD prediction in new cell line"
"Limited predictive power"
Basically, you get a set of boundaries from other cell lines and then predict whether they form a TAD in a new cell line
So, any boundaries that are new in that cell line will not be predicted
From questions (important) the TADs here are 200-300kb domains (WHAT IS A TAD?)


Jacob Schreiber -- Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture


Introducing Hi-C, distance effect, Fit-Hi-C, Splines, nulls, outliers
Would like to predict interactions (and such) without doing Hi-C, reduce cost / inform genetic basis
What features? DNA sequence and DNase hypersensitivity
What training data? 82M Fit-Hi-C contacts with q <= 0.1 (1kb on genome), 1-hot encoded DNA, binarized DHS signal
"Obvious classifier is neural networks"

DNA network C/P/C/P
DNase P/C
Combine and then C/C/P
Combine two arms + distance, D/D/Predict


Looks like AUC ~.8, better than genomic distance and other single features
To validate in cell types with lower resolution Hi-C data, convert from 1kb to 5kb resolution (how?)

Performs better than genomic distance or using GM12878 contact map in other cell types


Predictions can recreate insulation score (or anyway, good correlations), also good correlations with replication timing
Predicted structures cluster by cell function

Shilu Zhang -- In silico prediction of high-resolution Hi-C interaction matrices


Introducing distal gene regulation, TAD disruption in disease, Hi-C contact maps
HiC-Reg: regression approach for predicting contact counts

Extract aggregated histone mark and DHS signal across cell lines
Pair regions (?)
Predict contact count for pair regions
RF regression correlation of ~.83
Window features (I believe these are the features from the genomic region between the endpoints) are "very helpful" for improving predictions
Window features also important for capturing domains

Picture of predicted map. No taddy domains without the window feature (or reduced), more clear with window features


Ensemble: average training across cell lines. As good or better than cross-cell line predictions


end of day 1
Sheng Zhong -- Mapping RNA-RNA interactions and RNA-chromatin interactions


Three parts:

Mapping RNA interactome in vivo (MARIO)
Mapping RNA-genome interactions (MARGI)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


MARIO

Protocol

Cross-link RNA/protein complex
Attach to surface
Add linker DNA to 5' end of RNA
Double ligate to RNA-biotin+linker+RNA
RT into DNA complement of chimera and sequence


Advantages

Unbiased selection
Applicable to human tissue


Risk

Random ligation of RNA

Mitigate: extreme wash conditions, large distance between complexes on surface


Output

Paired-end reads with ends mapping anywhere on the genome (hopefully in known RNA loci)
Use to create pairwise interaction network


Validation

Test co-localization with single molecule RNA imaging (two color labeling)

Appears to validate (a few images shown)


MARGI

https://t.co/rv86Uq1cOP
Protocol

Similar idea, RNA/DNA complex tethered on solid surface, add to RNA a ss/ds adapter, enriches for RNA/DNA interactions

How to determine which side was originally RNA vs DNA? phase the linker so the junction is very specific
Circularize/linearize to ensure the linker remains in the read
(How efficient is all this?)


Purify, RT, amplify, sequence


Figure: a bipartite genome browser showing links between RNA ends and DNA ends
What are the chromatin interaction non-coding RNAs?

snoRNA (~200 genes), miRNA (~100), misc, antisense, miRNA, pseudo, (big drop), linc (~20), proc transc


Where do they interact?

80% proximal, 4% cis-distal, 16% trans
Distal and trans accumulate at TSSs, and density appears correlated with expression level
Inverse correlation between RNA attachment and H3K9me3, but no correlation with H3K4me3/H3K27ac (scatterplots but no correlation quantification here) (also showing RNA attachment peaks have high H3K9me3)


XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Oooooooooh... secret stuff here.


James Taylor -- Pretty pictures or chromosomes served two ways


https://speakerdeck.com/jxtx/glbio-2017-3d-genome-session

Sushmita Roy -- Computational methods to study dynamics of gene regulation


Motivation: dynamics of regulatory networks in lineages (either cell lineage or species phylogeny), how do networks change
What controls cell-type specific regulation
Computational tools I: Comparing 3D organization across cell types / species

A graph is a natural representation of a Hi-C dataset: regions == nodes, interaction strengths == edges
Does graph clustering help? (to identify structures/domains)
Spectral clustering

Adjacency matrix -> Laplacian -> eigendecomposition -> k-means
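
A small sketch of the pipeline above in its two-cluster special case (spectral bisection), using only the standard library. Assumptions to flag: the eigenvector is found by power iteration rather than a full eigendecomposition, and the k-means step is replaced by a split on the sign of the second eigenvector, which only makes sense for k = 2. The toy adjacency matrix and function name are invented for illustration.

```python
import math

def fiedler_vector(adj, n_iter=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue.

    adj: symmetric matrix (list of lists) of non-negative interaction
    strengths for a connected graph. Runs power iteration on
    (shift * I - L), with L = D - A, after projecting out the constant
    eigenvector.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    shift = 2.0 * max(deg)  # upper bound on the largest eigenvalue of L
    v = [i - (n - 1) / 2.0 for i in range(n)]  # non-constant, zero-mean start
    for _ in range(n_iter):
        # w = (shift * I - D + A) v
        w = [(shift - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(w) / n
        w = [x - mean for x in w]  # remove the trivial constant component
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy "contact map": two tightly connected groups of regions, one weak link.
adj = [
    [0, 10, 10, 0, 0, 0],
    [10, 0, 10, 0, 0, 0],
    [10, 10, 0, 1, 0, 0],
    [0, 0, 1, 0, 10, 10],
    [0, 0, 0, 10, 0, 10],
    [0, 0, 0, 10, 10, 0],
]
vec = fiedler_vector(adj)
labels = [int(x > 0) for x in vec]  # sign split stands in for k-means
```

For more than two clusters one would keep several eigenvectors and run k-means on the rows, as in the notes; a numerical library would be the practical choice there.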


Assessment: how good are the clusters? enrichment of genomic signals?

Spectral clustering tends to do better on different measures (compared to hierarchical and k-means)


Spectral clustering of Hi-C data for human ESCs -- 10 clusters

Two types, 1) associated with chromatin marks 2) associated with LADs and gene poor


Arboretum for clustering regulatory networks across species (existing work), how to adapt for Hi-C

Graph combines orthology maps (trees) for regions (genes) and interactions of regions (genes?) within each species
Assert: chromatin organization is more similar within species than between
Algorithm gives conserved signatures in matched clusters
Chromatin organization is conserved -- changes in clusters are between clusters of the same type (these are the two types from earlier)
Summary: Graph based methods may be more effective, Arboretum-Hi-C allows comparison of related datasets, organization is conserved across species


Tools II: Chromatin state dynamics across cell lineages

Data: characterizing chromatin state during reprogramming (MEF->IPS-C->IPS) 5+ marks and 3- marks.
Chromatin module: group of genomic loci that have the same chromatin state (where state == same chromatin marks)
What are the modules in each cell type and how do they transition?
CMINT: Chromatin Module INference on Trees: each module is a multivariate Gaussian, group is a mixture
Chromatin state during reprogramming defined by 15 different patterns (each is labeled by one dominant histone mark)
Transitions between modules

e.g. chromatin transition states of Oct4 -- switched completely in iPSC but not completely in pre-


Conclusions

Chromatin state can be studied in 1D and 3D
Predict EPIs
Compare chromatin state and 3D state across cell types and species


Michael Hoffman -- Novel inferences from Hi-C data with protein-coding gene data


I think we were just asked not to post (unpublished work)

Feng Yue -- HiCPlus: a deep convolutional neural network for Hi-C interaction matrix enhancement


Challenges in current Hi-C data

Expensive (10 day protocol), 6+ billion reads for kb resolution
Most datasets are low (40kb) resolution; too low to infer EPIs


Introducing deep learning for resolution enhancement in the context of images
Convolutional net: low res -> feature extraction (low) -> fully connected mapping -> high-resolution features -> output
Chromatin interactions are predictable from neighboring regions, hence can impute
Training on chromosomes 1-17, test on 18: Local average has correlation ~0.8, larger matrices do better (decays with distance)
Prediction

Down-sample to 1/16 of reads, create Hi-C map (noisy)
Enhanced matrix is highly similar to original matrix -- enhanced is very close to correlation with biological replicate (probably a ceiling on performance)
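
A minimal sketch of how a low-depth map could be simulated by down-sampling, assuming each read pair in a contact-map cell is kept independently with probability 1/16 (binomial thinning). The talk did not say how the down-sampling was done, so this scheme, the function name, and the toy counts are all assumptions.

```python
import random

def downsample_counts(contact_map, fraction, seed=0):
    """Keep each read pair independently with probability `fraction`.

    contact_map: dict mapping (bin_i, bin_j) to an integer read-pair count.
    Returns a new dict; cells that end up empty are dropped.
    """
    rng = random.Random(seed)
    thinned = {}
    for key, count in contact_map.items():
        # One Bernoulli draw per original read pair in this cell.
        kept = sum(1 for _ in range(count) if rng.random() < fraction)
        if kept:
            thinned[key] = kept
    return thinned

# Toy counts, for illustration only.
full = {(0, 1): 160, (0, 2): 48, (1, 2): 320}
low_depth = downsample_counts(full, fraction=1 / 16)
```

The thinned map plays the role of the noisy input; the original map is the target the enhanced matrix is compared against.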


Across cell types: train on GM12878, IMR90, K562 all perform well in correlation measures
Identification of interactions in HiCPlus enhanced matrices

Enhanced and hi-res recover similar numbers of interactions, low-depth misses many
Recovers 50% of peaks from ChIA-PET, most missed in low-resolution Hi-C, similar results with Capture Hi-C data


Summary

Convolutional net to impute hi-res from low-res Hi-C
Works with 1/16 to 1/25 depth
Will be available at 3DGenome.org ("If you haven't tried it try it, it's very fast")


lunch
Ferhat Ay -- Gene regulation via 3D chromatin organization in eukaryotic nuclei


Existing work: Fit-Hi-C: Assigning statistical confidence estimates to chromatin contact maps

Software available in Python (more scalable) and R
Captures 3C-validated cell-specific enhancer-promoter contacts
Model works for other chromatin conformation capture assays (e.g. PLAC-seq)


Three distinct diseases

Malaria

Plasmodium falciparum

3D reconstruction of genome. Centromeres colocalize in 3D, so do telomeres
Virulence gene clusters also colocalize

Plasmodium has 60 var genes (isoforms of same gene), at ends of chromosomes, exactly one expressed per cell
Colocalization confirmed using DNA FISH


Genes close together have correlated expression profiles
Telomeres have a repressive effect (closer you are to telomere in 3d lower expression)
H2A.Z is depleted in these regions


Newer work

Major changes in genome organization between transmission stages (PCA plot stratifying stages shown)
Gametocyte specific super-domain formation (similar to X chromosome inactivation)
Other parasites (vivax, knowlesi [monkey], Toxoplasma [feline], yoelii, berghei [mus])

Other Plasmodium show telomeres with low expression, opposite phenomenon (centromeres) in Toxo?


Asthma

GWAS locus known on chr17 (and nearly every other immune disorder)

Increased ORMDL3 expression in 17q21 locus when asthma risk variants present
SNPs overlap DHS, switch a CTCF binding site

Risk allele creates interactions with open chromatin sites far away

Gene normally has a nearby enhancer, which is lost with the risk allele; creates an interaction in a different place == aberrant expression


Cancer

Chromosomal rearrangements of all kinds common in cancer (and transformed cell lines)
HiCtrans: detect chromosomal translocations from Hi-C data (ENCODE 3D Nucleome group)
HiCnv: detect copy number variations


Abhijit Chakraborty -- A versatile pipeline to simulate Hi-C data with genomic rearrangements (AVeSim)


Hi-C provides "clue" to find genomic alterations (rearrangements of various types)
Use CNV information to setup a Hi-C matrix simulation pipeline

(I think the point here is that local amplifications / copy number variation can be seen in Hi-C matrices under simulation)


Two different simulation approaches: random counts and scaled observed counts
Lots of examples of how different events look in simulated Hi-C maps
Preprint available and code on github

Kimberly MacKay -- A Logical Approach to Modelling the Three-Dimensional Genome


Choco: Predicting Chromosomal Organization using Constraints
Hypothesis: Use constraint logic programming (ECLiPSe; eclipseclp.org)
Finding a scalable representation of the "3D genome reconstruction problem" in CLP.
Model organism: Yeast (small genome, haploid, etc)
For each row of interaction map, select one representative cell

So, list with N column numbers and list with N frequency values
Intra and inter problems performed independently (1, 2, 3 in cis, 1+2, 2+3, 1+3 trans)


Cytoscape visualization

Centromeres and telomeres are clustered (more validation experiments in progress)


Tao Yang -- HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient


Pearson correlation as a metric for reproducibility may be misleading

Sometimes non-pairs have higher correlation
Distance dependence affects (really, dominates) the correlation


Steps

Smoothing

2d mean filter

(Can't read the equation but looks like a 2D local average. weighted?)
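
A sketch of a plain 2D mean filter, assuming an unweighted average over a square window of half-width `h` that is truncated at the matrix edges. Whether the method on the slide weights the neighbors or treats edges this way is unknown; `mean_filter` is an illustrative name.

```python
def mean_filter(matrix, h):
    """Replace each cell by the unweighted mean of its (2h+1) x (2h+1) window.

    Windows are clipped at the matrix border, so edge cells average
    over fewer neighbors (an assumption of this sketch).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            window = [
                matrix[a][b]
                for a in range(max(0, i - h), min(n_rows, i + h + 1))
                for b in range(max(0, j - h), min(n_cols, j + h + 1))
            ]
            out[i][j] = sum(window) / len(window)
    return out

# A single isolated count gets spread over its neighborhood.
raw = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]
smoothed = mean_filter(raw, h=1)
```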


Stratification and aggregation

Analogy to CMH statistic (https://en.wikipedia.org/wiki/Cochran%E2%80%93Mantel%E2%80%93Haenszel_statistics)
Stratum-adjusted correlation coefficient
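
A simplified sketch of a stratum-adjusted correlation: contacts are stratified by genomic distance (matrix diagonal), a Pearson correlation is computed within each stratum, and the strata are combined with weights. The weights used here (stratum size times the product of the two standard deviations) are an assumption for illustration and may differ from the hicrep implementation; smoothing is left out.

```python
import math

def stratum_stats(x, y):
    """Return (covariance, sd_x, sd_y) using population formulas."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov, sx, sy

def stratum_adjusted_correlation(m1, m2, max_dist):
    """Weighted average of per-distance Pearson correlations.

    m1, m2: square contact matrices of equal size (lists of lists).
    Stratum d holds the cells (i, i + d); constant strata are skipped.
    """
    n = len(m1)
    num = den = 0.0
    for d in range(1, max_dist + 1):
        x = [m1[i][i + d] for i in range(n - d)]
        y = [m2[i][i + d] for i in range(n - d)]
        if len(x) < 2:
            continue
        cov, sx, sy = stratum_stats(x, y)
        if sx == 0 or sy == 0:
            continue  # correlation undefined for a constant stratum
        weight = len(x) * sx * sy  # assumed weighting, see lead-in
        num += weight * (cov / (sx * sy))
        den += weight
    return num / den if den else float("nan")

a = [
    [9, 5, 2, 1, 0],
    [5, 8, 4, 3, 1],
    [2, 4, 9, 6, 2],
    [1, 3, 6, 7, 3],
    [0, 1, 2, 3, 8],
]
doubled = [[2 * v for v in row] for row in a]
scc_self = stratum_adjusted_correlation(a, a, max_dist=2)
scc_scaled = stratum_adjusted_correlation(a, doubled, max_dist=2)
```

Because each stratum is correlated separately, the shared decay of contact frequency with distance no longer inflates the score the way it does for a single matrix-wide Pearson correlation.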


Evaluation

SCC differentiates pseudo-replicate pairs, biological replicate pairs, and non-replicate
Differentiates biological replicates from non-replicates
SCC + clustering allows reconstruction of the true relationship between cells


hicrep on github

Tim Kunz -- Visualizing and exploring chromatin interactions using the self-organizing map


SOM: grid of nodes in output space, each of which maps to a point in data space (constrained by the grid, sort of a manifold)
Example: 50x50 grid, trained on inter-chromosomal interaction frequencies

Each node contains a set of genomic loci
Genomic datasets can then be projected on the map


Projected six sub compartments onto map

Compartments are non overlapping but are split up on the map
Chromosomes cluster on map


Project epigenetic marks, etc

Use Gini coefficient for measuring level of segregation on the map
CTCF and cohesin friends are low on segregation scale, znf??? is high (probably znf274)
Compartment associations with histone marks from Rao et al. 2014 are recapitulated in the maps
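
A short sketch of a Gini coefficient as a segregation measure, assuming the input is one non-negative signal value per map node (for example, how much of a mark's signal lands on each node). How the talk aggregated signal per node was not stated, so the inputs here are made up.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0 means the signal is spread evenly over nodes; values near 1 mean
    it is concentrated on a few nodes.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula on the sorted values.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

evenly_spread = gini([1, 1, 1, 1])   # signal on every node equally
concentrated = gini([0, 0, 0, 4])    # all signal on one node
```

A factor that binds almost everywhere on the map would score low, while one confined to a few nodes would score high, matching the low/high segregation contrast in the notes.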


github.com/seqcode/somatic

(Gotta go)