@obenshaindw
obenshaindw / Split VCF by Chromosome
Last active February 4, 2022 09:03
Split VCF by Chromosome
seq 1 22 | xargs -n1 -P4 -I {} /usr/bin/vcftools/vcftools_0.1.11/bin/vcftools --gzvcf largevcf.vcf.gz --chr {} --recode --recode-INFO-all --out /split_by_chr/largevcf.chr{}
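Before launching 22 vcftools jobs it can help to preview what xargs will run. A minimal sketch, assuming the same paths as the command above; `preview_split` is a hypothetical helper name, and `echo` stands in for the real call so nothing is executed:

```shell
#!/bin/bash
# Hypothetical helper: print the vcftools command for each autosome
# instead of running it. Paths are copied from the command above.
preview_split() {
    seq 1 22 | xargs -I {} echo \
        /usr/bin/vcftools/vcftools_0.1.11/bin/vcftools \
        --gzvcf largevcf.vcf.gz --chr {} \
        --recode --recode-INFO-all \
        --out /split_by_chr/largevcf.chr{}
}

preview_split
```

vcftools appends `.recode.vcf` to the `--out` prefix, so each job writes a file such as `largevcf.chr1.recode.vcf`.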
obenshaindw / Fix Chromosome Name in a VCF
Last active August 29, 2015 14:14
Fix Chromosome Name in VCF
/usr/bin/htslib/bcftools/bcftools view vcf_with_chr.vcf | sed -e 's/^chr//' -e 's/^##contig=<ID=chr/##contig=<ID=/' | /usr/bin/htslib/htslib/bgzip -c > BCM_hg19.reheader.no_chr.vcf.gz
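The substitution is worth anchoring, because an unanchored `s/chr//g` removes the string anywhere on a line, not only in the CHROM column. A small sketch on made-up data (the sample lines and the `NOTE` tag are invented for illustration) compares the two:

```shell
#!/bin/bash
# Made-up sample: one contig header line and one record whose INFO field
# happens to contain the text "chr".
sample=$(printf '##contig=<ID=chr1>\nchr1\t100\t.\tA\tG\t.\t.\tNOTE=chr_dup\n')

# Unanchored pattern: strips "chr" everywhere, including inside INFO.
unanchored=$(printf '%s\n' "$sample" | sed 's/chr//g')

# Anchored patterns: only touch the CHROM column and contig IDs.
anchored=$(printf '%s\n' "$sample" |
    sed -e 's/^chr//' -e 's/^##contig=<ID=chr/##contig=<ID=/')

printf '%s\n' "$unanchored"
printf '%s\n' "$anchored"
```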
obenshaindw / Reheader a VCF file
Last active August 29, 2015 14:14
Reheader VCF
# -h prints only the header (-H would suppress it)
/usr/bin/htslib/bcftools/bcftools view -h vcf_with_bad_header.vcf > vcf_header.vcf
vim vcf_header.vcf
#Make changes to header
/usr/bin/htslib/bcftools/bcftools reheader -h vcf_header.vcf -o reheadered.vcf vcf_with_bad_header.vcf
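When the header change is a simple substitution, the manual editing step can be scripted. A minimal sketch, assuming the fix is renaming a sample column; the header text and the names `OLD_NAME` and `NEW_NAME` are made up for illustration:

```shell
#!/bin/bash
# Made-up two-line header standing in for the extracted header file.
header=$(printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tOLD_NAME\n')

# Hypothetical fix: rename the sample column OLD_NAME to NEW_NAME,
# touching only the #CHROM line.
fixed=$(printf '%s\n' "$header" | sed '/^#CHROM/ s/OLD_NAME$/NEW_NAME/')

printf '%s\n' "$fixed"
```

For sample renames specifically, `bcftools reheader` also accepts a `-s` samples file, which may be simpler than editing the header text.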
obenshaindw / Add dbSNP IDs to a VCF file
Last active August 21, 2023 21:47
Add dbSNP IDs to a VCF file that doesn't have them.
#GATK Method <- Slower and keeps original ID plus dbSNP rsID
# R=Reference FASTA
# V=VCF file to add IDs to
# --dbsnp = dbsnp VCF -- download from NCBI FTP
java -jar GenomeAnalysisTK.jar -R /reference/Homo_sapiens_assembly19.fasta -T VariantAnnotator -V vcf_to_add_id_to.vcf --dbsnp /reference/dbsnp_137.b37.vcf.gz --out /data/Broad.chr1.annotated.vcf
#bcftools Method <- Faster, replaces existing ID with dbSNP rsID
/usr/bin/htslib/bcftools/bcftools annotate -a /reference/dbsnp_137.b37.vcf.gz -c ID vcf_to_add_id_to.vcf
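One way to check whether either method did anything is to count records whose ID column is still `.` before and after annotation. A rough sketch on made-up records; `count_missing_ids` is a hypothetical helper, and it assumes uncompressed, tab-delimited VCF text on stdin:

```shell
#!/bin/bash
# Count records whose ID column (field 3) is "." in VCF text on stdin.
count_missing_ids() {
    awk -F'\t' '!/^#/ && $3 == "." { n++ } END { print n + 0 }'
}

# Made-up sample: one header line, one record with an rsID, one without.
printf '#CHROM\tPOS\tID\tREF\tALT\n1\t100\trs123\tA\tG\n1\t200\t.\tC\tT\n' |
    count_missing_ids
```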
obenshaindw / Stream VCF from S3
Last active April 6, 2023 09:45
Stream VCF file from AWS s3 and do stuff (sort, gzip, index, subset for specific region)
#!/usr/bin/bash
#
# make_gz.sh
#
# Call this script with a list of s3 locations with VCF files to parse
# aws --profile NDAR s3 ls s3://S3_URL/ | awk '{print $4}' | xargs -n1 -P4 sh make_gz.sh
# xargs -n1 -P4 accepts one argument and runs 4 parallel processes
#
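The pipeline in the header comment can be tried without AWS access. A sketch on a made-up listing in the four-column shape that `aws s3 ls` prints for objects (date, time, size, key), with `echo would process` standing in for `sh make_gz.sh`:

```shell
#!/bin/bash
# Made-up listing: date, time, size, key.
listing=$(printf '2015-02-04 15:47:00 1234 a.vcf\n2015-02-04 15:48:00 5678 b.vcf\n')

# Field 4 is the object key; xargs -n1 hands one key to each command,
# and -P4 allows up to four commands to run at once.
# sort makes the output order stable, since parallel jobs may finish in any order.
printf '%s\n' "$listing" | awk '{print $4}' | xargs -n1 -P4 echo would process | sort
```

Keys containing spaces would be split by `awk '{print $4}'`, so this shape only suits simple file names.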
obenshaindw / extract-genotypes.pl
Created February 4, 2015 15:47
Extract genotypes from multisample VCF file using vcftools
use strict;
use warnings;
use Vcf;
my $filename = $ARGV[0];
open ( my $handle, "<", $filename) or die "Cannot open $filename: $!";
my $vcf = Vcf->new(fh=>$handle);
$vcf->parse_header();
vcf_iterate();
obenshaindw / Zip files in s3
Last active August 29, 2015 14:14
get files from s3, zip, and put back into s3
echo $1
# Use grep REGEX to extract portion of s3 URL to reuse as zip file name.
folder=$(echo "$1" | grep -Eio '\/([0-9]+)\/$' | grep -Eio '([0-9]+)')
mkdir "./$folder"
echo s3cmd get --recursive "$1" "./$folder"
s3cmd get --recursive "$1" "./$folder"
echo zip -r "$folder" "./$folder"/*
zip -r "$folder" "./$folder"/*
echo rm -rf "./$folder/"
rm -rf "./$folder/"
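The folder name can also be pulled out with shell parameter expansion instead of two grep calls. A minimal sketch; `numeric_folder` is a hypothetical helper and the URL is made up. It fails when the URL does not end in `/digits/`, which gives the caller a chance to stop before `mkdir` and `rm -rf` run with an empty name:

```shell
#!/bin/bash
# Hypothetical helper: print the trailing numeric path component of a URL
# such as s3://example-bucket/submission/12345/ and fail otherwise.
numeric_folder() {
    case $1 in
        */) ;;              # like the grep pattern, require a trailing slash
        *) return 1 ;;
    esac
    local url="${1%/}"      # drop the trailing slash
    local last="${url##*/}" # keep the text after the last remaining slash
    case $last in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
        *) printf '%s\n' "$last" ;;
    esac
}

numeric_folder "s3://example-bucket/submission/12345/"
```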
obenshaindw / gist:bb6c2b4cf2aa7028813a
Created August 6, 2015 17:51
Stream large files from s3 (i.e., FASTQ)
#!/bin/bash
# Pass in s3 URL=$1
# Set up Pathing
## Drop s3://
pname=${1#*//}
## Drop Bucket Name, i.e., NDAR_Central*, NDAR_Results, etc.
pname=${pname#*/}
## Get text after last /
fname=${1##*/}
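The three expansions are easiest to follow on a concrete value. A sketch with a made-up URL (the bucket and key names are invented), using `url` in place of `$1`:

```shell
#!/bin/bash
# Made-up URL for illustration only.
url="s3://example-bucket/submission_1/sample.fastq.gz"

pname=${url#*//}    # remove shortest prefix matching "*//": drops "s3://"
pname=${pname#*/}   # remove shortest prefix matching "*/": drops the bucket name
fname=${url##*/}    # remove longest prefix matching "*/": keeps the file name

printf '%s\n%s\n' "$pname" "$fname"
```

A single `#` removes the shortest matching prefix and `##` the longest, which is why the same `*/` pattern yields the key path in one case and the bare file name in the other.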
obenshaindw / refresh_nda_token.sh
Created March 14, 2018 04:47
Bash function to update AWS FederationToken provided by NIMH Data Archive
#!/bin/bash
## NDA AWS Token Generator
## Author: NIMH Data Archives
## http://ndar.nih.gov
## License: MIT
## https://opensource.org/licenses/MIT
##############################################################################
#
# Script to retrieve generated AWS Tokens from NIMHDA
obenshaindw / mff-zipper.sh
Last active January 15, 2019 20:36
Package MFF files into zip files
#!/bin/bash
MFF_DIRECTORY=$1
for mffzip in "$MFF_DIRECTORY"*.mff.zip; do
echo "Renaming $mffzip directories to just ${mffzip%.zip}"
mv "$mffzip" "${mffzip%.zip}";
done
for mff in *.mff; do