Download a variety of genomic regions that are commonly excluded from genomics analyses
#!/usr/bin/env bash
# download segmental duplications
seg_dups='genomicSuperDups'
seg_dups_final=$seg_dups.sorted.bed.gz
if [[ ! -e "$seg_dups_final" ]]; then
  echo "downloading, unzipping, bed-ifying segmental duplication data..."
  # download the database schema, which documents the table's columns
  wget "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/$seg_dups.sql"
  touch "$seg_dups.sql"
  # extract the chrom, chromStart, and chromEnd columns to make a valid bed file
  curl "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/$seg_dups.txt.gz" | \
    gunzip --stdout | \
    cut -f 2,3,4 | \
    sed 's/^chr//' | \
    sort -k1,1 -k2,2n | \
    uniq | \
    bgzip --stdout > "$seg_dups_final"
  tabix --force "$seg_dups_final"
fi
# install the mysql client on Google Colab, if necessary
if ! command -v mysql > /dev/null; then
  cat /etc/os-release # this should yield Ubuntu on Google Colab
  # --yes avoids the interactive confirmation prompt, which would stall a notebook cell
  apt-get install --yes mysql-client
else
  echo "mysql already available"
fi
table=rmsk # RepeatMasker
database=hg38
chromosome=genoName
start_coordinate=genoStart
end_coordinate=genoEnd
# check for the compressed file that is actually written below,
# so that the download is skipped on subsequent runs
if [[ ! -e ${table}.bed.gz ]]; then
  echo "downloading ${table} from ${database} from UCSC..."
  # https://genome.ucsc.edu/goldenPath/help/mysql.html
  # https://genome.ucsc.edu/cgi-bin/hgTables
  mysql --user=genome --host=genome-mysql.soe.ucsc.edu \
    --port=3306 --skip-column-names --batch --no-auto-rehash \
    --execute="SELECT ${chromosome}, ${start_coordinate}, ${end_coordinate} FROM ${table};" ${database} \
    | bedtools sort -i stdin \
    | bgzip -c > ${table}.bed.gz
fi
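
The column-extraction pipeline used for the segmental duplications can be tried offline before any download. The sketch below is illustrative only: `ucsc_rows_to_bed` and `sample_rows` are hypothetical names that do not appear in the script above, the sample rows are made up, and the first column is assumed to be the UCSC bin index, as in the table dump the script reads. `LC_ALL=C` is an added assumption to make the sort order reproducible across locales.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the original script): read tab-separated
# UCSC table rows on stdin, where column 1 is assumed to be the bin index and
# columns 2-4 are chrom, chromStart, chromEnd, and write sorted, de-duplicated
# BED3 records with the "chr" prefix stripped.
ucsc_rows_to_bed() {
  cut -f 2,3,4 \
    | sed 's/^chr//' \
    | LC_ALL=C sort -k1,1 -k2,2n \
    | uniq
}

# Made-up rows standing in for the contents of genomicSuperDups.txt.gz.
sample_rows() {
  printf '585\tchr2\t500\t900\tdupA\n'
  printf '585\tchr1\t10000\t20000\tdupB\n'
  printf '586\tchr1\t200\t300\tdupC\n'
  printf '587\tchr1\t10000\t20000\tdupD\n'
}

# Prints three intervals: the two identical chr1 intervals collapse into one.
sample_rows | ucsc_rows_to_bed
```

Running `uniq` after `cut` matters here: rows that differ only in the dropped columns become identical intervals, and sorting first places them on adjacent lines so that `uniq` can collapse them.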