HenrikBengtsson/Agilent_and_TABs.md

## Agilent_and_TABs.md

      
    Raw
  

              Agilent_and_TABs.md
            
          
    Agilent adds random TABs to end of data rows?!?

Below are the first few lines(*) of two files exported from the Agilent CytoGenomics Software.  These files are plain text files with a few "metadata" rows before the actual tab-delimited data begins.  The tab-delimited data appears to have a header row with tab-delimited column names followed by corresponding tab-delimited data rows.  But it also appears as if there are "random" TABs (ASCII 0x09 = '\t') added to the end of either to header row or the data rows.  What's up with that?
PS. (*) I cut out the top few lines by reading the files as a binary file in order to remove the risk of it being me/my editor introducing these TAB outliers.  Also, I've verified using binary diff that these example files are identicial to the top of the full files.
Function to count TABs in a string

> countTABs <- function(x) sum(charToRaw(x) == charToRaw("\t"))
Example file #1 - Too few column names

> data <- readLines("ProbeBasedReport_foo.xls")[20:23]
> data
[1] "AberrationNo\tCytoBand\tChrName\tProbeName\tStart\tStop\tDescription\tGenes\t\tLogratio\tAmplification\tDeletion"
[2] "1.1\tp36.23\tchr1\tA_16_P35022894\t8788292\t8788351\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t-0.4210719\t\t-0.32680264\t\t"
[3] "1.2\tp36.23\tchr1\tA_16_P35022909\t8797211\t8797270\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t-0.32013443\t\t-0.32680264\t\t"
[4] "1.3\tp36.23\tchr1\tA_18_P10026487\t8803013\t8803072\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t0.058376655\t\t-0.32680264\t\t"

## Header has 12 columns, data rows 14
> sapply(data, FUN=countTABs, USE.NAMES=FALSE) + 1L
[1] 12 14 14 14
How can we read this in R?
> read.table("ProbeBasedReport_foo.xls", header=TRUE, sep="\t", skip=19L)
Error in read.table("ProbeBasedReport_foo.xls", header = TRUE, sep = "\t",
  more columns than column names
I don't think it's possible to read the header and the data rows at the same time; one needs to parse them separately :(
Update: What's up with that empty column #9 with an empty column name and no data (also no data in the full file with 400,000+ rows).
Example file #2 - Too many column names

> data <- readLines("CGH_Log_Ratio_foo.txt")[2:5]
> data
[1] "Probe Name\tChrName\tStart\tStop\tFeatureNum\tDescription\tName of Gene\tNormalized log ratio(010214_foo_252185099999_1_1)\t"
[2] "A_16_P15000916\t1\t564424\t564483\t300219\tUnknown\tENST00000503254\t-0.6407301"
[3] "A_18_P10001325\t1\t746608\t746667\t79122\tUnknown\tENST00000435300\t0.15713063"
[4] "A_16_P30000295\t1\t752717\t752764\t338141\tUnknown\tENST00000435300\t0.46870643"

## Header has 9 columns, data rows 8
> sapply(data, FUN=countTABs, USE.NAMES=FALSE) + 1L
[1] 9 8 8 8
How to read in R?
> read.table("CGH_Log_Ratio_foo.txt", header=TRUE, sep="\t", skip=1L, fill=TRUE)
      Probe.Name ChrName  Start   Stop FeatureNum Description    Name.of.Gene
1 A_16_P15000916       1 564424 564483     300219     Unknown ENST00000503254
2 A_18_P10001325       1 746608 746667      79122     Unknown ENST00000435300
3 A_16_P30000295       1 752717 752764     338141     Unknown ENST00000435300
  Normalized.log.ratio.010214_foo_252185099999_1_1.  X
1                                        -0.6407301 NA
2                                         0.1571306 NA
3                                         0.4687064 NA
So in the case where there are more columns in the data rows than in the column names header, fill=TRUE will do the trick.  But still, what's up with that last empty column?!?

  
## CGH_Log_Ratio_foo.txt
AMADID	021850	hg19
Probe Name	ChrName	Start	Stop	FeatureNum	Description	Name of Gene	Normalized log ratio(010214_foo_252185099999_1_1)
A_16_P15000916	1	564424	564483	300219	Unknown	ENST00000503254	-0.6407301
A_18_P10001325	1	746608	746667	79122	Unknown	ENST00000435300	0.15713063
A_16_P30000295	1	752717	752764	338141	Unknown	ENST00000435300	0.46870643

## ProbeBasedReport_foo.xls
Genome: hg19
Aberration Algorithm: ADM-1
Threshold: 6.0
Fuzzy Zero: OFF
GC Correction: OFF
Centralization (legacy): ON
Centralization (legacy) Bin Size: 10
Centralization (legacy) Threshold: 6.0
Diploid Peak Centralization: OFF
Combine Replicates (Intra Array): OFF
Array Level Filter: NONE
Metric Set Filter: NONE
Aberration Filter:   Minimum Number of Probes for Amplification >= 3 AND Nesting Level <= 100 AND Minimum Avg. Absolute Log Ratio for Amplification >= 0.25 AND Minimum Size (Kb) of Region for Amplification >= 0.0 AND Minimum Size (Kb) of Region for Deletion >= 0.0 AND Minimum Number of Probes for Deletion >= 3 AND Minimum Avg. Absolute Log Ratio for Deletion >= 0.25; Include matching values=true
Feature Level Filter:   gIsSaturated = true OR rIsSaturated = true OR gIsFeatNonUnifOL = true OR rIsFeatNonUnifOL = true OR LogRatio = 0; Include matching values=false
Design Level Filter: NONE
Genomic Boundary: OFF
Array Design Info: 010214_foo_252185099999_1_1[021850_20111015]

011414_CC-206M_Qubit_252185021525_1_1
AberrationNo	CytoBand	ChrName	ProbeName	Start	Stop	Description	Genes		Logratio	Amplification	Deletion
1.1	p36.23	chr1	A_16_P35022894	8788292	8788351	ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.	RERE		-0.4210719		-0.32680264
1.2	p36.23	chr1	A_16_P35022909	8797211	8797270	ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.	RERE		-0.32013443		-0.32680264
1.3	p36.23	chr1	A_18_P10026487	8803013	8803072	ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.	RERE		0.058376655		-0.32680264
1.4	p36.23	chr1	A_16_P15021671	8807706	8807765	ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.	RERE		-0.39028245		-0.32680264
	AMADID 021850 hg19
	Probe Name ChrName Start Stop FeatureNum Description Name of Gene Normalized log ratio(010214_foo_252185099999_1_1)
	A_16_P15000916 1 564424 564483 300219 Unknown ENST00000503254 -0.6407301
	A_18_P10001325 1 746608 746667 79122 Unknown ENST00000435300 0.15713063
	A_16_P30000295 1 752717 752764 338141 Unknown ENST00000435300 0.46870643
	Genome: hg19
	Aberration Algorithm: ADM-1
	Threshold: 6.0
	Fuzzy Zero: OFF
	GC Correction: OFF
	Centralization (legacy): ON
	Centralization (legacy) Bin Size: 10
	Centralization (legacy) Threshold: 6.0
	Diploid Peak Centralization: OFF
	Combine Replicates (Intra Array): OFF
	Array Level Filter: NONE
	Metric Set Filter: NONE
	Aberration Filter: Minimum Number of Probes for Amplification >= 3 AND Nesting Level <= 100 AND Minimum Avg. Absolute Log Ratio for Amplification >= 0.25 AND Minimum Size (Kb) of Region for Amplification >= 0.0 AND Minimum Size (Kb) of Region for Deletion >= 0.0 AND Minimum Number of Probes for Deletion >= 3 AND Minimum Avg. Absolute Log Ratio for Deletion >= 0.25; Include matching values=true
	Feature Level Filter: gIsSaturated = true OR rIsSaturated = true OR gIsFeatNonUnifOL = true OR rIsFeatNonUnifOL = true OR LogRatio = 0; Include matching values=false
	Design Level Filter: NONE
	Genomic Boundary: OFF
	Array Design Info: 010214_foo_252185099999_1_1[021850_20111015]

	011414_CC-206M_Qubit_252185021525_1_1
	AberrationNo CytoBand ChrName ProbeName Start Stop Description Genes Logratio Amplification Deletion
	1.1 p36.23 chr1 A_16_P35022894 8788292 8788351 ref\|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE -0.4210719 -0.32680264
	1.2 p36.23 chr1 A_16_P35022909 8797211 8797270 ref\|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE -0.32013443 -0.32680264
	1.3 p36.23 chr1 A_18_P10026487 8803013 8803072 ref\|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE 0.058376655 -0.32680264
	1.4 p36.23 chr1 A_16_P15021671 8807706 8807765 ref\|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE -0.39028245 -0.32680264