@cbp44
Last active November 17, 2021 04:18
Create NCBI RefSeq Select BED12 file

This gist shows how to create a BED12 file containing every protein-coding NCBI RefSeq Select gene, with each gene's exons annotated as BED blocks.

  1. First, download a TSV file of NCBI RefSeq Select genes

    1. Go to the UCSC Table Browser
    2. Select these parameters
      • Assembly: Dec. 2013 (GRCh38/hg38)
      • Group: Genes and Gene Predictions
      • Track: NCBI RefSeq
      • Table: RefSeq Select and MANE (ncbiRefSeqSelect)
      • Output format: all fields from selected table
      • Output filename: GRCh38.ncbiRefSeqSelect.tsv.gz
      • File type returned: gzip compressed
    3. Click "get output"
  2. Save the awk script below as ncbi_refseq_tsv_to_exon_bed12.awk, then use it to convert the TSV into BED12 format:

    zcat GRCh38.ncbiRefSeqSelect.tsv.gz \
      | tail -n +2 \
      | awk -f ncbi_refseq_tsv_to_exon_bed12.awk - \
      > GRCh38.ncbiRefSeqSelect.genes.bed
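
# ncbi_refseq_tsv_to_exon_bed12.awk
# Convert UCSC ncbiRefSeqSelect rows (genePred layout with a leading bin column) to BED12.
BEGIN {
    FS = "\t"; OFS = "\t";
}
{
    # exonStarts ($10) and exonEnds ($11) are comma-terminated lists,
    # so loop over exonCount ($9) rather than the split() count.
    split($10, exon_starts, ",");
    split($11, exon_ends, ",");
    exon_sizes_str = exon_ends[1] - exon_starts[1];
    # BED12 blockStarts are offsets from chromStart (txStart, $5), not absolute positions.
    exon_starts_str = exon_starts[1] - $5;
    for (i = 2; i <= $9; i++) {
        exon_sizes_str = exon_sizes_str "," (exon_ends[i] - exon_starts[i]);
        exon_starts_str = exon_starts_str "," (exon_starts[i] - $5);
    }
    # chrom, txStart, txEnd, name2 (gene symbol), score, strand,
    # cdsStart, cdsEnd, itemRgb, exonCount, blockSizes, blockStarts
    print $3, $5, $6, $13, 0, $4, $7, $8, "33,33,33", $9, exon_sizes_str, exon_starts_str
}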
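
Before using the resulting file, it may be worth checking it against the BED12 block rules: each record has 12 tab-separated fields, blockCount matches the lengths of blockSizes and blockStarts, the first blockStart is 0 (block starts are relative to chromStart), and the last block ends exactly at chromEnd. The sketch below is one way to do that. The function name check_bed12 is hypothetical and not part of the original gist.

```shell
# check_bed12 FILE: hypothetical helper, not part of the original gist.
# Prints one line per record that breaks a BED12 block rule and
# returns non-zero if any record fails.
check_bed12() {
  awk -F '\t' '
    {
      problem = ""
      if (NF != 12) {
        problem = "expected 12 fields, found " NF
      } else {
        # Tolerate an optional trailing comma on the block lists.
        sizes_str = $11; sub(/,$/, "", sizes_str)
        starts_str = $12; sub(/,$/, "", starts_str)
        n_sizes = split(sizes_str, sizes, ",")
        n_starts = split(starts_str, starts, ",")
        if (n_sizes != $10 || n_starts != $10) {
          problem = "blockCount does not match block lists"
        } else if (starts[1] != 0) {
          problem = "first blockStart is not 0"
        } else if ($2 + starts[n_starts] + sizes[n_sizes] != $3) {
          problem = "last block does not end at chromEnd"
        }
      }
      if (problem != "") { print "line " NR ": " problem; bad++ }
    }
    END { exit (bad > 0) }
  ' "$1"
}
```

For example, `check_bed12 GRCh38.ncbiRefSeqSelect.genes.bed && echo "BED12 looks consistent"` prints the message only when every record passes.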