Below are the first few lines(*) of two files exported from the Agilent CytoGenomics Software. These files are plain text files with a few "metadata" rows before the actual tab-delimited data begins. The tab-delimited data appears to have a header row with tab-delimited column names followed by corresponding tab-delimited data rows. But it also appears as if there are "random" TABs (ASCII 0x09 = '\t') added to the end of either to header row or the data rows. What's up with that?
PS. (*) I cut out the top few lines by reading the files as a binary file in order to remove the risk of it being me/my editor introducing these TAB outliers. Also, I've verified using binary diff that these example files are identicial to the top of the full files.
> countTABs <- function(x) sum(charToRaw(x) == charToRaw("\t"))
> data <- readLines("ProbeBasedReport_foo.xls")[20:23]
> data
[1] "AberrationNo\tCytoBand\tChrName\tProbeName\tStart\tStop\tDescription\tGenes\t\tLogratio\tAmplification\tDeletion"
[2] "1.1\tp36.23\tchr1\tA_16_P35022894\t8788292\t8788351\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t-0.4210719\t\t-0.32680264\t\t"
[3] "1.2\tp36.23\tchr1\tA_16_P35022909\t8797211\t8797270\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t-0.32013443\t\t-0.32680264\t\t"
[4] "1.3\tp36.23\tchr1\tA_18_P10026487\t8803013\t8803072\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t0.058376655\t\t-0.32680264\t\t"
## Header has 12 columns, data rows 14
> sapply(data, FUN=countTABs, USE.NAMES=FALSE) + 1L
[1] 12 14 14 14
How can we read this in R?
> read.table("ProbeBasedReport_foo.xls", header=TRUE, sep="\t", skip=19L)
Error in read.table("ProbeBasedReport_foo.xls", header = TRUE, sep = "\t",
more columns than column names
I don't think it's possible to read the header and the data rows at the same time; one needs to parse them separately :(
Update: What's up with that empty column #9 with an empty column name and no data (also no data in the full file with 400,000+ rows).
> data <- readLines("CGH_Log_Ratio_foo.txt")[2:5]
> data
[1] "Probe Name\tChrName\tStart\tStop\tFeatureNum\tDescription\tName of Gene\tNormalized log ratio(010214_foo_252185099999_1_1)\t"
[2] "A_16_P15000916\t1\t564424\t564483\t300219\tUnknown\tENST00000503254\t-0.6407301"
[3] "A_18_P10001325\t1\t746608\t746667\t79122\tUnknown\tENST00000435300\t0.15713063"
[4] "A_16_P30000295\t1\t752717\t752764\t338141\tUnknown\tENST00000435300\t0.46870643"
## Header has 9 columns, data rows 8
> sapply(data, FUN=countTABs, USE.NAMES=FALSE) + 1L
[1] 9 8 8 8
How to read in R?
> read.table("CGH_Log_Ratio_foo.txt", header=TRUE, sep="\t", skip=1L, fill=TRUE)
Probe.Name ChrName Start Stop FeatureNum Description Name.of.Gene
1 A_16_P15000916 1 564424 564483 300219 Unknown ENST00000503254
2 A_18_P10001325 1 746608 746667 79122 Unknown ENST00000435300
3 A_16_P30000295 1 752717 752764 338141 Unknown ENST00000435300
Normalized.log.ratio.010214_foo_252185099999_1_1. X
1 -0.6407301 NA
2 0.1571306 NA
3 0.4687064 NA
So in the case where there are more columns in the data rows than in the column names header, fill=TRUE
will do the trick. But still, what's up with that last empty column?!?
Nice. I encounter this way too often with biological data!
As an alternative to count TABS I use:
should I be weary of this (it's never caused problems thus far...)?