Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Convert a single Excel file (one text per row) into separate text files. A function in R.
#' Making many text files from a single Excel file
#'
#' Convert a single Excel file (one text per row) into
#' separate text files. A function in R. Requires Java Runtime
#' Environment (JRE), version 6.0 or higher
#'
#' To use this function for the first time run:
#' install.packages("devtools")
#' then thereafter you just need to load the function
#' fom github like so:
#' library(devtools) # windows users need Rtools installed, mac users need XCode installed
#' source_url("https://gist.github.com/benmarwick/9278490/raw/xls2txts.R")
#'
#' Here's how to set the arguments to the function
#'
#' mydir is the full path of the folder that contains your Excel file
#' for example "C:/Downloads/myxlsfile" Note that it must have
#' quote marks around it and forward slashes, which are not default
#' in windows. This should be a folder with a *single* Excel file in it.
#' The file can be xls or xlsx. Note that this function doesn't
#' work if your Excel file is in use - close the file before running this.
#'
#' sheet is the name of the sheet in the Excel file that your data is
#' in. The sheet name should be given with quote marks around it
#' like this "Sheet 1" Note that this function will combine text in
#' all the columns (from column 2 onwards, we assume column 1 is a
#' document name) into one long character string. If your sheet has a
#' comlicated structure you may want to use the XLconnect package directly
#' in R, it's very flexible.
#'
#' labels refers to the column number with the document labels. The
#' default is the first column, but in can your want to use a different
#' column you can set it like so (for example if col 2 has the labels):
#' labels = 2
#'
#' header refers to the first row. If the first row of your EXcel file
#' does not contain column headings then you should add
#' header = FALSE
#' into the function. If you do have column headers you don't need to
#' add anything, the default is header = TRUE
#'
#'
#' A full example, assuming you've sourced the
#' function from github already:
#' xls2txt("C:/Downloads/myxlsfile", "Sheet 1", labels = 2)
#' and after a moment you'll get a message in the R console
#' saying 'Your texts files can be found in C:/Downloads/myxlsfile'
xls2txt <- function(mydir, sheet, labels = 1, header = TRUE){
# function to check for package and download if not found
# from http://stackoverflow.com/a/19870272/1036500
# convert arguments to vector
install_load <- function(x){
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE, repos = "http://cran.rstudio.com" )
# Load package after installing
require( i , character.only = TRUE )
}
}
}
# Then try/install packages...
install_load("XLConnect")
# CRAN version wasn't loading, emtpy DESCRIPTION,
# so let's try github...
# install_github("xlconnect", username = "miraisolutions", ref = "master")
# allocate lots of memory to Java
options( java.parameters = "-Xmx4g" )
# Get the name of the xls file, if more than one, just get the first
myxlsfile <- list.files(mydir, full.names = TRUE, pattern = "*.xls|.xlsx|.XLS|.XLSX")[[1]]
# Load Excel workbook into R
message("Loading Excel file into R, this may take few moments...")
wb <- loadWorkbook(myxlsfile, create = FALSE)
# make dataframe from workbook
suppressWarnings(myxlsdata <- readWorksheet(wb, sheet = sheet, header = header))
message("Finished loading Excel file")
# free memory
xlcFreeMemory()
# combine all except the first column together into
# one long character string for each row
mytxtsconcat <- apply(myxlsdata[-(1:labels)], 1, paste, collapse=" ")
# make a dataframe with the file names and texts
mytxtsdf <- data.frame(filename = myxlsdata[,labels], # get the first col for the text file names
fulltext = mytxtsconcat)
# Now write one text file for each row of the csv
# use 'invisible' so we don't see anything in the console
setwd(mydir)
message("Writing text files into your folder, this may take a few moments...")
invisible(lapply(1:nrow(mytxtsdf), function(i) write.table(mytxtsdf[i,2],
file = paste0(mytxtsdf[i,1], ".txt"),
row.names = FALSE, col.names = FALSE,
quote = FALSE)))
message("Done writing text files into your folder")
# now check your folder to see the txt files
message(paste0("Your text files can be found in ", getwd()))
}
@upimple

This comment has been minimized.

Copy link

upimple commented Oct 12, 2018

I am getting an XLconnect related error
"Loading Excel file into R, this may take few moments...
Error: NoSuchMethodError (Java): org.apache.poi.hssf.usermodel.HSSFWorkbook.getNumCellStyles()I "

How I can fix it ?

Thanks

@keilabcs

This comment has been minimized.

Copy link

keilabcs commented Nov 3, 2018

invisible(lapply(1:nrow(mytxtsdf), function(i) write.table(mytxtsdf[1:49,1],
file = paste(mytxtsdf[i,0], "1.txt"),
row.names = FALSE, col.names = FALSE,
quote = FALSE)))

it saves all the line in a single txt, would like to save in separate txt... help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.