Day 1: 25 June 2019

BioC2019: Where Software and Biology Connect (Martin)

Inference after prediction (Jeffrey "John" Leek)

aka "What do we do after we have machine learned everything"

@jxtx
jxtx / bioc2018.md
Last active August 22, 2018 14:30
#bioC 2018 Conference Notes
jxtx / using-hifive-re.md
Last active December 4, 2017 19:16
Using HiFive on restriction digest bulk Hi-C

Working through running HiFive on a Hi-C dataset.

First, a note on memory and performance: bin size influences everything. Starting with a bin size of 40 kb, loading data in hg38 seems to stay under ~16 GB. At fend-level resolution, memory requirements approach ~32 GB and running time increases several-fold.

Dealing with restriction fragment details

HiFive stores a fend file with information on the locations of restriction fragments in the genome. We need to get the locations of the RE sites into a BED
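
One way to locate restriction enzyme sites and express them as BED intervals is sketched below. This is a minimal illustration, not HiFive's own code: the function names are hypothetical, the HindIII recognition sequence (AAGCTT) is only an assumed example because the notes do not name the enzyme, and the sketch handles only exact, non-degenerate motifs on an in-memory sequence.

```python
import re
from typing import Iterator, List, Tuple

Site = Tuple[str, int, int]


def find_re_sites(chrom: str, sequence: str, motif: str) -> Iterator[Site]:
    """Yield (chrom, start, end) for each motif occurrence.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    """
    # A zero-width lookahead reports overlapping occurrences as well.
    pattern = re.compile("(?=" + re.escape(motif.upper()) + ")")
    for match in pattern.finditer(sequence.upper()):
        start = match.start()
        yield chrom, start, start + len(motif)


def to_bed_lines(sites) -> List[str]:
    """Format sites as tab-separated BED lines (chrom, start, end)."""
    return ["%s\t%d\t%d" % site for site in sites]


if __name__ == "__main__":
    # Toy sequence; AAGCTT (HindIII) is an assumed example motif.
    toy = "ccAAGCTTggaagcttAAGCTT"
    for line in to_bed_lines(find_re_sites("chrTest", toy, "AAGCTT")):
        print(line)
```

For a real genome, the same function could be applied per chromosome while streaming a FASTA file, writing the lines out as they are produced.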

jxtx / GLBio_3D.md
Last active May 17, 2017 22:05
Notes for 3D genome track at GLBio 2017

Keles -- Statistical Methods for profiling long range chromatin interactions from repetitive regions of the genome

  • Multi-mapping reads (multi-reads) are typically thrown out in many HTS analyses, including Hi-C
    • Assays predominantly rely on short reads (50-150 bp), so multi-reads are common
    • Using ChIP-seq as an example, incorporating multi-reads finds peaks in regions where "uni-reads" do not
    • e.g. Perm-seq using DHS + ChIP-seq data and multi-reads. 27.3% more peaks compared to ENCODE uniform processing pipeline
  • How to combine this with Hi-C data?
    • Hi-C read processing
      • Typical pipelines discard singletons, multi-mapping ends, low-mapping-quality reads, and unaligned reads
  • Evaluation of the impact of this using IMR90 and Plasmodium datasets
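
The filtering that typical Hi-C pipelines apply can be sketched roughly as follows. This illustrates the discard categories listed above and is not the method presented in the talk: the `ReadEnd` record, the category names, and the MAPQ threshold of 30 are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReadEnd:
    n_alignments: int  # number of reported alignment locations (0 = unaligned)
    mapq: int          # mapping quality of the reported alignment


def classify_pair(end1: ReadEnd, end2: ReadEnd, min_mapq: int = 30) -> str:
    """Assign a read pair to one discard category, or 'kept'."""
    ends = (end1, end2)
    aligned = [e for e in ends if e.n_alignments > 0]
    if not aligned:
        return "unaligned"      # neither end aligned
    if len(aligned) == 1:
        return "singleton"      # only one end aligned
    if any(e.n_alignments > 1 for e in ends):
        return "multi-mapping"  # at least one end maps to several places
    if any(e.mapq < min_mapq for e in ends):
        return "low-mapq"       # uniquely placed, but low confidence
    return "kept"


if __name__ == "__main__":
    # Hypothetical pairs, one per category.
    pairs = [
        (ReadEnd(1, 60), ReadEnd(1, 60)),
        (ReadEnd(1, 60), ReadEnd(3, 0)),
        (ReadEnd(1, 60), ReadEnd(0, 0)),
        (ReadEnd(1, 10), ReadEnd(1, 60)),
        (ReadEnd(0, 0), ReadEnd(0, 0)),
    ]
    print(Counter(classify_pair(a, b) for a, b in pairs))
```

Tallying the categories this way shows how many pairs a conventional pipeline throws away, which is the fraction that multi-read rescue would aim to recover.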

Why is it called Galaxy

Once upon a time there was the Genome ALignment and Annotation database or GALA, which allowed for analysis of genomic elements alongside comparative genomic information. However, this tool supported only a few analyses. What-would-be-galaxy was born from the idea of being able to easily take any existing analysis tool and quickly integrate it into this platform. But what should we call this next direction? Bob Harris suggested the use of X/Y to represent this "next dimension" of analysis. GALA + XY ⟶ GALAXY ⟶ Galaxy.

Or at least this is how I remember it.

#usegalaxy

# Mostly based on this:
# https://github.com/Homebrew/linuxbrew/wiki/Standalone-Installation
# But I started with nothing (no ruby, no gcc)
# Ruby and GCC will go here
mkdir bootstrap
# Get GCC 4.4 and install under bootstrap
# We also need libstdc++ when we get to building gcc-4.9 because somebody decided it was a good idea to start writing GCC in C++
wget http://ftp1.scientificlinux.org/linux/scientific/55/x86_64/SL/gcc44-4.4.0-6.el5.x86_64.rpm
/**
* usage: node scrape_gs.js USERKEY
*
* Determine h-index for papers published AFTER each year found in a Google
* scholar profile. The USERKEY is found in your Google scholar citations
* page url.
*/
var request = require('request');
var cheerio = require('cheerio');
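
The calculation described in the header comment can be sketched independently of the scraping step. The example below is in Python rather than the script's Node.js, uses made-up (year, citations) pairs instead of a real profile, and assumes "after" means strictly later than the given year. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_index_after_each_year(papers: List[Tuple[int, int]]) -> Dict[int, int]:
    """Map each year in the profile to the h-index of papers published after it."""
    years = sorted({year for year, _ in papers})
    return {
        y: h_index(c for year, c in papers if year > y)  # strictly after y
        for y in years
    }


if __name__ == "__main__":
    # Hypothetical (publication year, citation count) pairs.
    papers = [(2010, 10), (2012, 8), (2012, 5), (2015, 4), (2018, 3)]
    for year, h in h_index_after_each_year(papers).items():
        print(year, h)
```
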
jxtx / blast.c
Last active October 14, 2020 17:29
Oldest nucleotide blast.c I can find...
/*
* BLAST - Search two DNA sequences for locally maximal segment pairs. The basic
* command syntax is
*
* BLAST sequence1 sequence2
*
* where sequence1 and sequence2 name files containing DNA sequences. Lines
* at the beginnings of the files that don't start with 'A', 'C', 'T' or 'G'
* are discarded. Thus a typical sequence file might begin:
*