Skip to content

Instantly share code, notes, and snippets.

View lh3's full-sized avatar

Heng Li lh3

View GitHub Profile
@lh3
lh3 / human_missing.txt
Created July 3, 2014 17:47
Supposed human sequences having no hits to GRCh38+ALT+hs37d5+EBV
>670363126:1315126374:97-3576:131-3480:1-3350 no hit?!
TTGAACAAGTTCAATTTAAAATATTTCATAAGCTTGTATGATATCATTGGGGCTAGAGTAGACGTGATACTAAATGTAGGAAATAAATGGAATAAACACTCAGAGAGGAACTATATGTTTTGTCTTAAAACTTCATTAAGTCTAAGAATTGATATTTTGTTGCTGGATGAATTGGTGGTTTGTGAGTTATAGTGAATATATCTCCCTGTTACATTCCTGTCGTTGTCCTATCCATTGTCCTATTGTTGGATGTTTCACTTGTGTATCCCATGTTACATAATTTTATCATAGAGAGCAAAAGAATTTGACTACAGTAAACTTAATGCTTTGCCATGTACAGATGTATGACAAAGAACATGTTATTTTGAAAGTTTTTGCTCTTATATTTTCTTTGCAGTCTGACTTCTTAATATTTGCTTCATTTTCATGTATGTGTTTTTACAAAGCATTCTCTTGGTATCACTAACCAAATACCTGTTTAGAGAAACAAAAGGCCAACTGGGGATGAAAACAGTTTATTTACTTACTGCACAGTATGTTTCTAGCTCTCTGTTATAAGAACAACAGTGTATTTGGTGTGTGTGTAGAACATACTTATTAAAAGTTTGATTCCTCTTTCCCCCGCAGATTCCATGATGAAGCCTTGTGATTTTGTGTCTTTACAAACTGTTCACTGATCTGGCAATGAAAATTGCCTCGATGTTAATAAATATTAGTTCATTAGCTATGCTTAATTAACACAAATGAAACTGATGAACCCAGTGTTGTGTGTTATTGCAAGCTTTCTATTTATATCAGCAAACCTCAGCGTTTATGATTTATTAAATAATTTATTCAATAGGCATTAATTGAGTATCAAGTATCAGTCAGAGACTTTGCTAAATGTTGCAAACCCAAATATGTCTTAAAGGAGTATACTGTCTTTCAAAAAATAGATGGCTAAAAAATAGTTATAATATAGATGATATA
// gcc -g -O2 -Wall fmr-rld.c mrope.c rle.c rope.c rld0.c -o fmr-rld -pthread
#include <stdio.h>
#include <stdint.h>
#include "rle.h"
#include "rld0.h"
#include "mrope.h"
int main(int argc, char *argv[])
{
@lh3
lh3 / kfreq.c
Last active August 29, 2015 14:05
#include <zlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)
unsigned char seq_nt6_table[256] = {
0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
@lh3
lh3 / 00_cmd.sh
Last active August 29, 2015 14:06
# generate the list of novel segments for each sample
seqtk seq -L200 001*.mag.gz | ./fermi2 match -dk101 hs38a.fmd /dev/stdin | k8 hs38d.js fm2seg -p 001 | gzip > 001.novel.gz
k8 hs38d.js tab2fa 001.novel.gz | bwa-r868 mem s38a.fa - | k8 hs38d.js dropsim | gzip -1 > 001.dropped.gz
tabtk isct -c 001.dropped.gz 001.novel.gz | gzip -1 > 001.div1.gz
#
zcat div1/*.div1.gz | cut -f1-5 | sort -k5,5 -k1,1 -S60G | k8 hs38d.js fixname | gzip -1 > 01ori.txt.gz
# tabtk isct -c gbk.novel.dropped gbk.novel.gz | sort -k5,5 -k1,1 -S60G | grep -v NNNNNNNNNN | gzip -1 > 01ori.txt.gz
tabtk isct -d_ -c contaminated.txt 01ori.txt.gz | gzip -1 > 02contam.txt.gz
# zcat 01ori.txt.gz|grep -vFf DoNotUseList_May2011.list | gzip -1 > 02contam.txt.gz
@lh3
lh3 / duprate.tex
Last active August 29, 2015 14:07
\documentclass[pdftex,10pt]{article}
\usepackage{amsmath}
\usepackage{amssymb}
\title{PCR and mapping duplicate rate in NGS}
\author{Heng Li}
\begin{document}
>957080628:743840991 3786 1068121565,93;3390644509,73;3390644511,73; 769693427,69;1379073786,70;1790286632,84;2297804583,67;
ATCACACACATCCATATACACAGCTGGACACCACGTGCACACAATCACACACATCCATATACACAGCTGGAC
ACCACGTGCACATACACACATAATTACATACATCCATAAACACACATGGACACCACATGCACACATACAATC
ACACACATCCATATATGCACATGGACACCACACACACACACACACATCCATAAACACAGCTGGACACAACAT
GCACACACACAATTACACACAACCATATACACACATGGATACCACATGCACACACACAATCACACACATCTA
TATACGCACATGGACACCACACATGCACACACACACATCCATATACACACATGGATACCACATGCACACACA
CTATCACATCTATATATGCACATGGACACCACATGCGCGCACACACAATTACACACATATACATACACAGAC
AACACATGCACACACAACCACACACATTTACATATGCACGTGGACACCACATACACAAACATATATGCATGT
GGTCATTACATGCACATGGACAAATACACACGGACACCACATATACACACAATCACATACATATATGTACAT
GGACACATGCACACACACACAAACATGCACATCCATATGTGCACACGGTCACATGCACAATTACACACATCC
@lh3
lh3 / alt-demo.graffle
Created November 19, 2014 15:00
Example figure in bwakit/README.md, created by OmniGraffle
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>ActiveLayerIndex</key>
<integer>0</integer>
<key>ApplicationVersion</key>
<array>
<string>com.omnigroup.OmniGraffle</string>
<string>139.18.0.187838</string>
@lh3
lh3 / 00fastq_parsing.md
Last active August 29, 2015 14:10
On FASTQ parsing performance

As we were talking about FASTQ parsing on Twitter, I dig out an old experiement and redid it with the latest kseq.h and SeqAn-1.4.2. The task is simple: to count the number of bases in a FASTQ files containing 2 million 100bp short reads. The file is put on the RAM disk. The following gives timing for a few programs/settings:

User time (s) Sys time (s) Command line
0.98 0.16 kseq-len /dev/shm/tmp.fq
4.24 0.11 kseq-len /dev/shm/tmp.fq.gz # gzip'd
5.53 0.13 seqtk fqchk /dev/shm/tmp.fq.gz # more than counting
12.03 0.12 seqan2-len /dev/shm/tmp.fq
14.11 0.28 seqan-len /dev/shm/tmp.fq
#include <bzlib.h>
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct zfile_s {
int is_bz;
gzFile fpz;
FILE *fp; // for bzip2
BZFILE *fpbz;
@lh3
lh3 / ksfile.c
Created December 29, 2014 01:54
Peek file content
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#define KO_BUFSIZE 0x40000
typedef struct {
FILE *fp_in, *fp_out;