Last active
November 3, 2018 17:19
-
-
Save benmarwick/9278490 to your computer and use it in GitHub Desktop.
Convert a single Excel file (one text per row) into separate text files. A function in R.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#' Making many text files from a single Excel file | |
#' | |
#' Convert a single Excel file (one text per row) into | |
#' separate text files. A function in R. Requires Java Runtime | |
#' Environment (JRE), version 6.0 or higher | |
#' | |
#' To use this function for the first time run: | |
#' install.packages("devtools") | |
#' then thereafter you just need to load the function | |
#' fom github like so: | |
#' library(devtools) # windows users need Rtools installed, mac users need XCode installed | |
#' source_url("https://gist.github.com/benmarwick/9278490/raw/xls2txts.R") | |
#' | |
#' Here's how to set the arguments to the function | |
#' | |
#' mydir is the full path of the folder that contains your Excel file | |
#' for example "C:/Downloads/myxlsfile" Note that it must have | |
#' quote marks around it and forward slashes, which are not default | |
#' in windows. This should be a folder with a *single* Excel file in it. | |
#' The file can be xls or xlsx. Note that this function doesn't | |
#' work if your Excel file is in use - close the file before running this. | |
#' | |
#' sheet is the name of the sheet in the Excel file that your data is | |
#' in. The sheet name should be given with quote marks around it | |
#' like this "Sheet 1" Note that this function will combine text in | |
#' all the columns (from column 2 onwards, we assume column 1 is a | |
#' document name) into one long character string. If your sheet has a | |
#' comlicated structure you may want to use the XLconnect package directly | |
#' in R, it's very flexible. | |
#' | |
#' labels refers to the column number with the document labels. The | |
#' default is the first column, but in can your want to use a different | |
#' column you can set it like so (for example if col 2 has the labels): | |
#' labels = 2 | |
#' | |
#' header refers to the first row. If the first row of your EXcel file | |
#' does not contain column headings then you should add | |
#' header = FALSE | |
#' into the function. If you do have column headers you don't need to | |
#' add anything, the default is header = TRUE | |
#' | |
#' | |
#' A full example, assuming you've sourced the | |
#' function from github already: | |
#' xls2txt("C:/Downloads/myxlsfile", "Sheet 1", labels = 2) | |
#' and after a moment you'll get a message in the R console | |
#' saying 'Your texts files can be found in C:/Downloads/myxlsfile' | |
xls2txt <- function(mydir, sheet, labels = 1, header = TRUE){ | |
# function to check for package and download if not found | |
# from http://stackoverflow.com/a/19870272/1036500 | |
# convert arguments to vector | |
install_load <- function(x){ | |
for( i in x ){ | |
# require returns TRUE invisibly if it was able to load package | |
if( ! require( i , character.only = TRUE ) ){ | |
# If package was not able to be loaded then re-install | |
install.packages( i , dependencies = TRUE, repos = "http://cran.rstudio.com" ) | |
# Load package after installing | |
require( i , character.only = TRUE ) | |
} | |
} | |
} | |
# Then try/install packages... | |
install_load("XLConnect") | |
# CRAN version wasn't loading, emtpy DESCRIPTION, | |
# so let's try github... | |
# install_github("xlconnect", username = "miraisolutions", ref = "master") | |
# allocate lots of memory to Java | |
options( java.parameters = "-Xmx4g" ) | |
# Get the name of the xls file, if more than one, just get the first | |
myxlsfile <- list.files(mydir, full.names = TRUE, pattern = "*.xls|.xlsx|.XLS|.XLSX")[[1]] | |
# Load Excel workbook into R | |
message("Loading Excel file into R, this may take few moments...") | |
wb <- loadWorkbook(myxlsfile, create = FALSE) | |
# make dataframe from workbook | |
suppressWarnings(myxlsdata <- readWorksheet(wb, sheet = sheet, header = header)) | |
message("Finished loading Excel file") | |
# free memory | |
xlcFreeMemory() | |
# combine all except the first column together into | |
# one long character string for each row | |
mytxtsconcat <- apply(myxlsdata[-(1:labels)], 1, paste, collapse=" ") | |
# make a dataframe with the file names and texts | |
mytxtsdf <- data.frame(filename = myxlsdata[,labels], # get the first col for the text file names | |
fulltext = mytxtsconcat) | |
# Now write one text file for each row of the csv | |
# use 'invisible' so we don't see anything in the console | |
setwd(mydir) | |
message("Writing text files into your folder, this may take a few moments...") | |
invisible(lapply(1:nrow(mytxtsdf), function(i) write.table(mytxtsdf[i,2], | |
file = paste0(mytxtsdf[i,1], ".txt"), | |
row.names = FALSE, col.names = FALSE, | |
quote = FALSE))) | |
message("Done writing text files into your folder") | |
# now check your folder to see the txt files | |
message(paste0("Your text files can be found in ", getwd())) | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
invisible(lapply(1:nrow(mytxtsdf), function(i) write.table(mytxtsdf[1:49,1],
file = paste(mytxtsdf[i,0], "1.txt"),
row.names = FALSE, col.names = FALSE,
quote = FALSE)))
it saves all the line in a single txt, would like to save in separate txt... help!