@disulfidebond
Created July 29, 2019 15:59
BED file parsing

Overview

A BED file needs to be parsed and reformatted into a CSV file. Broadly speaking, there are two options: use a GUI, or use scripting.

GUI

Use BBEdit, Atom, or Apple's TextEdit (see caution) to search and replace. An example is shown here:

[Screenshot: gui_parsing]

  • Caution: Apple's TextEdit has an extremely useful GUI for search and replace that even simplifies regular expressions. However, it may replace some characters, such as the straight double-quote character, with variants that not all text editors recognize. You've been warned.

Scripting

I haven't been able to find a one-step process, but the following commands accomplish the task and can also serve as a cookbook for similar tasks.

  • Parse out only the track line and the following line:

      grep -A1 'track' radCohort_geneList.bed.txt > radCohort_geneList.unparsed.txt 
      # truncated output:
      # track name="SOMETHING" description=""
      # chr1  87863625	87864548 
      # --
    
  • Remove the '--' group separators that grep inserts between non-adjacent matches:

      perl -pe 's/--\n//' radCohort_geneList.unparsed.txt > tmp
    
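  • Append each BED line to its track line by deleting the newline between them. The -0 switch makes perl read the whole file as one record, and a regex negative lookahead preserves any newline that is followed by five lowercase letters (here, 'track') or by the end of the input: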

      perl -0pe 's/\n(?!([a-z]{5}|$))//g' tmp > tmp2
      # truncated output:
      # track name="ATM" description="ATM serine/threonine kinase [Source:HGNC Symbol;Acc:HGNC:795]" chr11      108222484 
    
  • Convert tabs to commas and keep only the first comma-separated field, which drops the BED start and end coordinates:

      perl -pe 's/\t/,/g' tmp2 | cut -d, -f1 > tmp3
      # commands can be piped into each other, example:
      perl -pe 's/\t/,/g' tmp2 | sed 's/itemRgb=\"On\"/chromosome=\"/g' | cut -d, -f1 | sed 's/$/\"/' > tmp4
      # truncated output from the above piped command is:
      # track name="ATM" description="ATM serine/threonine kinase [Source:HGNC Symbol;Acc:HGNC:795]" chromosome="chr11"
    
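  • Convert the space-delimited fields into comma-separated fields. In this case the pattern '" ' (a double quote followed by a space) is replaced, but this can be modified as necessary. Note that the replacement consumes the closing quote of each field it separates: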

      sed 's/\" /,/g' tmp4 > radCohort_geneList.txt
    
  • Finally, remove the 'track ' string from the beginning of each line:

      sed 's/^track //g' radCohort_geneList.txt > radCohort_geneList.csv
    