Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@benmarwick
Last active November 3, 2018 17:19
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save benmarwick/9278490 to your computer and use it in GitHub Desktop.
Save benmarwick/9278490 to your computer and use it in GitHub Desktop.
Convert a single Excel file (one text per row) into separate text files. A function in R.
#' Making many text files from a single Excel file
#'
#' Convert a single Excel file (one text per row) into
#' separate text files. A function in R. Requires Java Runtime
#' Environment (JRE), version 6.0 or higher
#'
#' To use this function for the first time run:
#' install.packages("devtools")
#' then thereafter you just need to load the function
#' fom github like so:
#' library(devtools) # windows users need Rtools installed, mac users need XCode installed
#' source_url("https://gist.github.com/benmarwick/9278490/raw/xls2txts.R")
#'
#' Here's how to set the arguments to the function
#'
#' mydir is the full path of the folder that contains your Excel file
#' for example "C:/Downloads/myxlsfile" Note that it must have
#' quote marks around it and forward slashes, which are not default
#' in windows. This should be a folder with a *single* Excel file in it.
#' The file can be xls or xlsx. Note that this function doesn't
#' work if your Excel file is in use - close the file before running this.
#'
#' sheet is the name of the sheet in the Excel file that your data is
#' in. The sheet name should be given with quote marks around it
#' like this "Sheet 1" Note that this function will combine text in
#' all the columns (from column 2 onwards, we assume column 1 is a
#' document name) into one long character string. If your sheet has a
#' comlicated structure you may want to use the XLconnect package directly
#' in R, it's very flexible.
#'
#' labels refers to the column number with the document labels. The
#' default is the first column, but in can your want to use a different
#' column you can set it like so (for example if col 2 has the labels):
#' labels = 2
#'
#' header refers to the first row. If the first row of your EXcel file
#' does not contain column headings then you should add
#' header = FALSE
#' into the function. If you do have column headers you don't need to
#' add anything, the default is header = TRUE
#'
#'
#' A full example, assuming you've sourced the
#' function from github already:
#' xls2txt("C:/Downloads/myxlsfile", "Sheet 1", labels = 2)
#' and after a moment you'll get a message in the R console
#' saying 'Your texts files can be found in C:/Downloads/myxlsfile'
xls2txt <- function(mydir, sheet, labels = 1, header = TRUE){
# function to check for package and download if not found
# from http://stackoverflow.com/a/19870272/1036500
# convert arguments to vector
install_load <- function(x){
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE, repos = "http://cran.rstudio.com" )
# Load package after installing
require( i , character.only = TRUE )
}
}
}
# Then try/install packages...
install_load("XLConnect")
# CRAN version wasn't loading, emtpy DESCRIPTION,
# so let's try github...
# install_github("xlconnect", username = "miraisolutions", ref = "master")
# allocate lots of memory to Java
options( java.parameters = "-Xmx4g" )
# Get the name of the xls file, if more than one, just get the first
myxlsfile <- list.files(mydir, full.names = TRUE, pattern = "*.xls|.xlsx|.XLS|.XLSX")[[1]]
# Load Excel workbook into R
message("Loading Excel file into R, this may take few moments...")
wb <- loadWorkbook(myxlsfile, create = FALSE)
# make dataframe from workbook
suppressWarnings(myxlsdata <- readWorksheet(wb, sheet = sheet, header = header))
message("Finished loading Excel file")
# free memory
xlcFreeMemory()
# combine all except the first column together into
# one long character string for each row
mytxtsconcat <- apply(myxlsdata[-(1:labels)], 1, paste, collapse=" ")
# make a dataframe with the file names and texts
mytxtsdf <- data.frame(filename = myxlsdata[,labels], # get the first col for the text file names
fulltext = mytxtsconcat)
# Now write one text file for each row of the csv
# use 'invisible' so we don't see anything in the console
setwd(mydir)
message("Writing text files into your folder, this may take a few moments...")
invisible(lapply(1:nrow(mytxtsdf), function(i) write.table(mytxtsdf[i,2],
file = paste0(mytxtsdf[i,1], ".txt"),
row.names = FALSE, col.names = FALSE,
quote = FALSE)))
message("Done writing text files into your folder")
# now check your folder to see the txt files
message(paste0("Your text files can be found in ", getwd()))
}
@upimple
Copy link

upimple commented Oct 12, 2018

I am getting an XLconnect related error
"Loading Excel file into R, this may take few moments...
Error: NoSuchMethodError (Java): org.apache.poi.hssf.usermodel.HSSFWorkbook.getNumCellStyles()I "

How I can fix it ?

Thanks

@keilabcs
Copy link

keilabcs commented Nov 3, 2018

invisible(lapply(1:nrow(mytxtsdf), function(i) write.table(mytxtsdf[1:49,1],
file = paste(mytxtsdf[i,0], "1.txt"),
row.names = FALSE, col.names = FALSE,
quote = FALSE)))

it saves all the line in a single txt, would like to save in separate txt... help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment