Skip to content

Instantly share code, notes, and snippets.

@crazyhottommy
Last active November 17, 2021 22:13
Show Gist options
  • Save crazyhottommy/dad1706198037c6641c3d8332e9951fd to your computer and use it in GitHub Desktop.
Save crazyhottommy/dad1706198037c6641c3d8332e9951fd to your computer and use it in GitHub Desktop.

Three gotchas when using R for Genomic data analysis

During my daily work with R for genomic data analysis, I encountered several instances that R gives me some (bad) surprises.

1. The devil 1 and 0 coordinate system

read detail here https://github.com/crazyhottommy/DNA-seq-analysis#tips-and-lessons-learned-during-my-dna-seq-data-analysis-journey

some files such as bed file is 0 based. Two genomic regions:

chr1    0    1000
chr1    1000    2000

when you import that bed file into R using rtracklayer::import(), it will become

chr1     1    1000
chr1    1001    2000

The function converts it to 1 based internally (R is 1 based unlike python).

The problem is that when you read the bed file with read.table and use GenomicRanges::makeGRangesFromDataFrame() to convert it to a GRanges object, do not forget to add 1 to the start before doing it.

Similarily, when you write a GRanges object to disk using rtracklayer::export, you do not need to worry, R will convert it back to 0 based in file.

However, if you make a dataframe out of the GRanges object, you need to remember do start -1 before writing to a file.

2. read_tsv column types

If you use read_tsv from readr, it will use the first 1000 rows to determine the column types (integer, charater etc). For genomic data, however, especially for the chromosome column, you may or may not have chr prepending.

1    0    1000    
1    1000    2000
.
.
.
X
Y
MT

you may fail to read rows for chromosome X, Y and MT. (To make things worse, UCSC uses chrM rather than chrMT...)

The solution is that read in all the data as characters.

library(readr)
challenge2 <- read_tsv("my.bed", 
  col_types = cols(.default = col_character())
)

see http://r4ds.had.co.nz/data-import.html

3. Scientific notation for genomic coordinates

This is kind of related to 2. 1200000 will be written as 1.2e6 in a dataframe if R thinks it is an integer. So, you will need to read in the columns all as characters, or if you convert the character to numeric and wants to write to a file, add options(scipen=500) on the top of your script.

The scientific notation can not be disabled in write_csv: tidyverse/readr#671

one more gotcha with colnames

Base R will change the - to . for the column names. e.g. TCGA=06-ABCD will be changed to TCGA.06.ABCD. This can cause troubles when you use the colnames to match samples. readr will maintain the -.

@fpbarthel
Copy link

Two and three remind of MS Excel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment