Skip to content

Instantly share code, notes, and snippets.

@HenrikBengtsson
Last active September 23, 2015 19:32
Show Gist options
  • Save HenrikBengtsson/dabc383aaa958c0ed49a to your computer and use it in GitHub Desktop.
Save HenrikBengtsson/dabc383aaa958c0ed49a to your computer and use it in GitHub Desktop.
Agilent adds random TABs to end of data rows?!?

Agilent adds random TABs to end of data rows?!?

Below are the first few lines(*) of two files exported from the Agilent CytoGenomics Software. These files are plain text files with a few "metadata" rows before the actual tab-delimited data begins. The tab-delimited data appears to have a header row with tab-delimited column names followed by corresponding tab-delimited data rows. But it also appears as if there are "random" TABs (ASCII 0x09 = '\t') added to the end of either to header row or the data rows. What's up with that?

PS. (*) I cut out the top few lines by reading the files as a binary file in order to remove the risk of it being me/my editor introducing these TAB outliers. Also, I've verified using binary diff that these example files are identicial to the top of the full files.

Function to count TABs in a string

> countTABs <- function(x) sum(charToRaw(x) == charToRaw("\t"))

Example file #1 - Too few column names

> data <- readLines("ProbeBasedReport_foo.xls")[20:23]
> data
[1] "AberrationNo\tCytoBand\tChrName\tProbeName\tStart\tStop\tDescription\tGenes\t\tLogratio\tAmplification\tDeletion"
[2] "1.1\tp36.23\tchr1\tA_16_P35022894\t8788292\t8788351\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t-0.4210719\t\t-0.32680264\t\t"
[3] "1.2\tp36.23\tchr1\tA_16_P35022909\t8797211\t8797270\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t-0.32013443\t\t-0.32680264\t\t"
[4] "1.3\tp36.23\tchr1\tA_18_P10026487\t8803013\t8803072\tref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.\tRERE\t\t0.058376655\t\t-0.32680264\t\t"

## Header has 12 columns, data rows 14
> sapply(data, FUN=countTABs, USE.NAMES=FALSE) + 1L
[1] 12 14 14 14

How can we read this in R?

> read.table("ProbeBasedReport_foo.xls", header=TRUE, sep="\t", skip=19L)
Error in read.table("ProbeBasedReport_foo.xls", header = TRUE, sep = "\t",
  more columns than column names

I don't think it's possible to read the header and the data rows at the same time; one needs to parse them separately :(

Update: What's up with that empty column #9 with an empty column name and no data (also no data in the full file with 400,000+ rows).

Example file #2 - Too many column names

> data <- readLines("CGH_Log_Ratio_foo.txt")[2:5]
> data
[1] "Probe Name\tChrName\tStart\tStop\tFeatureNum\tDescription\tName of Gene\tNormalized log ratio(010214_foo_252185099999_1_1)\t"
[2] "A_16_P15000916\t1\t564424\t564483\t300219\tUnknown\tENST00000503254\t-0.6407301"
[3] "A_18_P10001325\t1\t746608\t746667\t79122\tUnknown\tENST00000435300\t0.15713063"
[4] "A_16_P30000295\t1\t752717\t752764\t338141\tUnknown\tENST00000435300\t0.46870643"

## Header has 9 columns, data rows 8
> sapply(data, FUN=countTABs, USE.NAMES=FALSE) + 1L
[1] 9 8 8 8

How to read in R?

> read.table("CGH_Log_Ratio_foo.txt", header=TRUE, sep="\t", skip=1L, fill=TRUE)
      Probe.Name ChrName  Start   Stop FeatureNum Description    Name.of.Gene
1 A_16_P15000916       1 564424 564483     300219     Unknown ENST00000503254
2 A_18_P10001325       1 746608 746667      79122     Unknown ENST00000435300
3 A_16_P30000295       1 752717 752764     338141     Unknown ENST00000435300
  Normalized.log.ratio.010214_foo_252185099999_1_1.  X
1                                        -0.6407301 NA
2                                         0.1571306 NA
3                                         0.4687064 NA

So in the case where there are more columns in the data rows than in the column names header, fill=TRUE will do the trick. But still, what's up with that last empty column?!?

AMADID 021850 hg19
Probe Name ChrName Start Stop FeatureNum Description Name of Gene Normalized log ratio(010214_foo_252185099999_1_1)
A_16_P15000916 1 564424 564483 300219 Unknown ENST00000503254 -0.6407301
A_18_P10001325 1 746608 746667 79122 Unknown ENST00000435300 0.15713063
A_16_P30000295 1 752717 752764 338141 Unknown ENST00000435300 0.46870643
Genome: hg19
Aberration Algorithm: ADM-1
Threshold: 6.0
Fuzzy Zero: OFF
GC Correction: OFF
Centralization (legacy): ON
Centralization (legacy) Bin Size: 10
Centralization (legacy) Threshold: 6.0
Diploid Peak Centralization: OFF
Combine Replicates (Intra Array): OFF
Array Level Filter: NONE
Metric Set Filter: NONE
Aberration Filter: Minimum Number of Probes for Amplification >= 3 AND Nesting Level <= 100 AND Minimum Avg. Absolute Log Ratio for Amplification >= 0.25 AND Minimum Size (Kb) of Region for Amplification >= 0.0 AND Minimum Size (Kb) of Region for Deletion >= 0.0 AND Minimum Number of Probes for Deletion >= 3 AND Minimum Avg. Absolute Log Ratio for Deletion >= 0.25; Include matching values=true
Feature Level Filter: gIsSaturated = true OR rIsSaturated = true OR gIsFeatNonUnifOL = true OR rIsFeatNonUnifOL = true OR LogRatio = 0; Include matching values=false
Design Level Filter: NONE
Genomic Boundary: OFF
Array Design Info: 010214_foo_252185099999_1_1[021850_20111015]
011414_CC-206M_Qubit_252185021525_1_1
AberrationNo CytoBand ChrName ProbeName Start Stop Description Genes Logratio Amplification Deletion
1.1 p36.23 chr1 A_16_P35022894 8788292 8788351 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE -0.4210719 -0.32680264
1.2 p36.23 chr1 A_16_P35022909 8797211 8797270 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE -0.32013443 -0.32680264
1.3 p36.23 chr1 A_18_P10026487 8803013 8803072 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE 0.058376655 -0.32680264
1.4 p36.23 chr1 A_16_P15021671 8807706 8807765 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA. RERE -0.39028245 -0.32680264
@npjc
Copy link

npjc commented Jun 16, 2015

Nice. I encounter this way too often with biological data!

As an alternative to count TABS I use:

stringi::stri_count_fixed(x, '\t')

should I be weary of this (it's never caused problems thus far...)?

@hadley
Copy link

hadley commented Sep 23, 2015

@npjc you should definitely be weary of it, but I don't think you need to be wary 😛

@HenrikBengtsson
Copy link
Author

Recent updates to hadley/readr now handles both types of corrupt data, cf. tidyverse/readr#189.

Example:

# Install the developers version
> source("http://callr.org/install#hadley/readr")

> library(readr)

> data <- read_tsv("ProbeBasedReport_foo.xls", skip=19L)
Warning: 4 parsing failures.
row col   expected     actual
  1  -- 12 columns 14 columns
  2  -- 12 columns 14 columns
  3  -- 12 columns 14 columns
  4  -- 12 columns 14 columns
> data
  AberrationNo CytoBand ChrName      ProbeName   Start    Stop
1          1.1    36.23       1 A_16_P35022894 8788292 8788351
2          1.2    36.23       1 A_16_P35022909 8797211 8797270
3          1.3    36.23       1 A_18_P10026487 8803013 8803072
4          1.4    36.23       1 A_16_P15021671 8807706 8807765

         Description
1 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.
2 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.
3 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.
4 ref|Homo sapiens arginine-glutamic acid dipeptide (RE) repeats (RERE), transcript variant 2, mRNA.
  Genes   NA    Logratio Amplification   Deletion
1  RERE <NA> -0.42107190          <NA> -0.3268026
2  RERE <NA> -0.32013443          <NA> -0.3268026
3  RERE <NA>  0.05837665          <NA> -0.3268026
4  RERE <NA> -0.39028245          <NA> -0.3268026

> data <- read_tsv("CGH_Log_Ratio_foo.txt", skip=1L)
Warning: 3 parsing failures.
row col  expected    actual
  1  -- 9 columns 8 columns
  2  -- 9 columns 8 columns
  3  -- 9 columns 8 columns
> data
      Probe Name ChrName  Start   Stop FeatureNum Description Name of Gene
1 A_16_P15000916       1 564424 564483     300219     Unknown       503254
2 A_18_P10001325       1 746608 746667      79122     Unknown       435300
3 A_16_P30000295       1 752717 752764     338141     Unknown       435300
  Normalized log ratio(010214_foo_252185099999_1_1)   NA
1                                        -0.6407301 <NA>
2                                         0.1571306 <NA>
3                                         0.4687064 <NA>

Perfect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment