Skip to content

Instantly share code, notes, and snippets.

View lh3's full-sized avatar

Heng Li lh3

View GitHub Profile
#include <bzlib.h>
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
typedef struct zfile_s {
int is_bz;
gzFile fpz;
FILE *fp; // for bzip2
BZFILE *fpbz;
@lh3
lh3 / 00fastq_parsing.md
Last active August 29, 2015 14:10
On FASTQ parsing performance

As we were talking about FASTQ parsing on Twitter, I dig out an old experiement and redid it with the latest kseq.h and SeqAn-1.4.2. The task is simple: to count the number of bases in a FASTQ files containing 2 million 100bp short reads. The file is put on the RAM disk. The following gives timing for a few programs/settings:

User time (s) Sys time (s) Command line
0.98 0.16 kseq-len /dev/shm/tmp.fq
4.24 0.11 kseq-len /dev/shm/tmp.fq.gz # gzip'd
5.53 0.13 seqtk fqchk /dev/shm/tmp.fq.gz # more than counting
12.03 0.12 seqan2-len /dev/shm/tmp.fq
14.11 0.28 seqan-len /dev/shm/tmp.fq
@lh3
lh3 / inthash.c
Last active December 11, 2022 07:27
Invertible integer hash functions
/*
For any 1<k<=64, let mask=(1<<k)-1. hash_64() is a bijection on [0,1<<k), which means
hash_64(x, mask)==hash_64(y, mask) if and only if x==y. hash_64i() is the inversion of
hash_64(): hash_64i(hash_64(x, mask), mask) == hash_64(hash_64i(x, mask), mask) == x.
*/
// Thomas Wang's integer hash functions. See <https://gist.github.com/lh3/59882d6b96166dfc3d8d> for a snapshot.
uint64_t hash_64(uint64_t key, uint64_t mask)
{
key = (~key + (key << 21)) & mask; // key = (key << 21) - key - 1;
@lh3
lh3 / inthash.md
Last active September 5, 2023 21:10 — forked from badboy/inthash.md
@lh3
lh3 / alt-demo.graffle
Created November 19, 2014 15:00
Example figure in bwakit/README.md, created by OmniGraffle
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>ActiveLayerIndex</key>
<integer>0</integer>
<key>ApplicationVersion</key>
<array>
<string>com.omnigroup.OmniGraffle</string>
<string>139.18.0.187838</string>
@lh3
lh3 / eigen.c
Created November 8, 2014 04:53
Eigenvalues/vectors for symmetric and asymmetric matrices
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#define RADIX 2.0
/*************************
* balance a real matrix *
*************************/
@lh3
lh3 / crc32.c
Last active January 21, 2021 17:30
Source code for computing CRC32 and SHA1 checksum
#include <zlib.h>
#include <fcntl.h>
#include <stdio.h>
#define BUF_LEN 0x100000
int main(int argc, char *argv[])
{
uLong crc = crc32(0L, Z_NULL, 0);
unsigned char buf[BUF_LEN];
@lh3
lh3 / 0.00README.md
Last active April 28, 2022 21:04
Mapping short reads with a ~50bp INDEL

This is a small experiment on the alignment of ~50bp INDELs. The query sequences are shown in 0.01.fq below, where seq_ori is a 204bp sequence extracted from the human reference genome, seq_del54 contains a 54bp deletion in the middle, seq_del84 contains a 84bp deletion in a 120bp read, and seq_ins40 contains a 40bp insertion in a 140bp read. These four short sequences were mapped to the human reference genome with Bowtie2, BWA-MEM, LAST, Novoalign, SNAP and Stampy with default settings. Non-default scoring functions were also tested for Bowtie2 (--rdg 5,1 --rfg 5,1), BWA-MEM (-A2 -E1) and LAST (-r2 -q4). The output by various mappers/settings can be found in this gist. The following table gives my summary:

Mapper Setting -84bp -54bp +40bp
BBMAP default Yes Yes Yes
Bowtie2 default No No No
Bowtie2 --rdg 5,1 --rfg 5,1 as insertion as insertion Yes
BWA-MEM default as split Yes Yes
BWA-MEM -A2 -E1 Yes Yes Yes
LAST default as split as split
>957080628:743840991 3786 1068121565,93;3390644509,73;3390644511,73; 769693427,69;1379073786,70;1790286632,84;2297804583,67;
ATCACACACATCCATATACACAGCTGGACACCACGTGCACACAATCACACACATCCATATACACAGCTGGAC
ACCACGTGCACATACACACATAATTACATACATCCATAAACACACATGGACACCACATGCACACATACAATC
ACACACATCCATATATGCACATGGACACCACACACACACACACACATCCATAAACACAGCTGGACACAACAT
GCACACACACAATTACACACAACCATATACACACATGGATACCACATGCACACACACAATCACACACATCTA
TATACGCACATGGACACCACACATGCACACACACACATCCATATACACACATGGATACCACATGCACACACA
CTATCACATCTATATATGCACATGGACACCACATGCGCGCACACACAATTACACACATATACATACACAGAC
AACACATGCACACACAACCACACACATTTACATATGCACGTGGACACCACATACACAAACATATATGCATGT
GGTCATTACATGCACATGGACAAATACACACGGACACCACATATACACACAATCACATACATATATGTACAT
GGACACATGCACACACACACAAACATGCACATCCATATGTGCACACGGTCACATGCACAATTACACACATCC
@lh3
lh3 / duprate.tex
Last active August 29, 2015 14:07
\documentclass[pdftex,10pt]{article}
\usepackage{amsmath}
\usepackage{amssymb}
\title{PCR and mapping duplicate rate in NGS}
\author{Heng Li}
\begin{document}