@briatte
Last active September 21, 2017 17:07
lookfor -- an equivalent to the -lookfor- Stata command that supports foreign and memisc data objects (part of the questionr package)
lookfor <- function(data,
                    keywords = "weight|sample",
                    labels = TRUE,
                    ignore.case = TRUE) {
  # search scope
  n <- names(data)
  if (!length(n)) stop("there are no names to search in that object")
  # search function: a vector of keywords is collapsed into one pattern
  look <- function(x) {
    grep(paste(keywords, collapse = "|"), x, ignore.case = ignore.case)
  }
  # names search
  x <- look(n)
  variable <- n[x]
  # foreign objects
  l <- attr(data, "variable.labels")
  if (is.null(l)) l <- attr(data, "var.labels")
  # memisc objects; class() can return several values, hence any()
  if (any(grepl("data.set|importer", class(data)))) {
    suppressMessages(suppressWarnings(require(memisc)))
    l <- as.vector(description(data))
  }
  if (length(l) && labels) {
    # search labels
    y <- look(l)
    # remove duplicates, reorder
    x <- sort(c(x, y[!(y %in% x)]))
    # add variable labels
    variable <- n[x]
    label <- l[x]
    variable <- cbind(variable, label)
  }
  # output, with the matching column positions as row names
  if (length(x)) return(as.data.frame(variable, row.names = x))
  else message("Nothing found. Sorry.")
}

lookfor: a function to search datasets for keywords

The lookfor function emulates the lookfor Stata command in R. It searches for one or more keywords in the variable names of a dataset. It can also search the variable labels of datasets imported into R with the foreign and memisc packages.

Installation

install.packages("devtools")
library(devtools)

# install
source_gist("https://gist.github.com/briatte/4699960")

# recommended 
install.packages("memisc")

Syntax

lookfor(data, keywords = "weight|sample", labels = TRUE, ignore.case = TRUE)
  • data is a data frame, which can be annotated by the read.dta or read.spss functions of the foreign package, or built by the data.set or importer methods of the memisc package.
  • keywords is a character string, which can be formatted as a regular expression, or a vector of character strings; the syntax of regular expression patterns must be that of the grep function.
  • labels indicates whether or not to search variable labels, as passed through attributes by either the foreign or the memisc methods; labels = TRUE by default.
  • ignore.case indicates whether or not case is ignored when matching the keywords; ignore.case = TRUE by default, which means that, as in the Stata lookfor command, matching is case insensitive.
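The handling of the keywords and ignore.case arguments can be sketched in base R. As in the function body, a vector of keywords is joined with "|" before being passed to grep, so a vector and an alternation pattern give the same result. The labels_demo vector is made up for this illustration.

```r
# Hypothetical labels, made up for this illustration.
labels_demo <- c("Sampling weight", "Respondent age", "SAMPLE identifier")

keywords <- c("weight", "sample")

# A keyword vector is collapsed into a single alternation pattern.
pattern <- paste(keywords, collapse = "|")

# Positions of the matching labels, ignoring case.
hits_vector <- grep(pattern, labels_demo, ignore.case = TRUE)

# The same search written directly as a regular expression.
hits_regex <- grep("weight|sample", labels_demo, ignore.case = TRUE)

# A case-sensitive search misses the uppercase "SAMPLE".
hits_case <- grep(pattern, labels_demo, ignore.case = FALSE)
```

Here hits_vector and hits_regex are identical, while hits_case drops the label written in capitals.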

Examples

The lookfor function requires a dataset and usually takes one or more keywords as character strings. In datasets with no variable descriptions, only variable names are searched, as below.

lookfor(iris, "petal")
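For a data frame without labels such as iris, the search reduces to a case-insensitive grep over the column names. The following base R lines are a rough equivalent of that step, not the function itself.

```r
# Base R sketch of the name search, using the built-in iris data.
hits <- grep("petal", names(iris), ignore.case = TRUE)

# Names of the matching columns.
matched <- names(iris)[hits]
matched
```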

The memisc package offers a simple way to try out the command on a richer form of dataset. The following chunk loads the data file of the [American National Election Study of 1948](http://www.electionstudies.org/studypages/1948prepost/1948prepost.htm) in SPSS format.

require(memisc)
nes1948.por <- UnZip("anes/NES1948.ZIP","NES1948.POR", package="memisc")
nes1948 <- spss.portable.file(nes1948.por)

The lookfor function accepts either a single keyword, a vector of keywords, or a regular expression that matches the syntax of a grep pattern.

# Look for single keyword.
lookfor(nes1948, "truman")

# Look for a vector of keywords.
lookfor(nes1948, c("truman", "dewey"))

# Look for a regular expression.
lookfor(nes1948, "truman|dewey")

Variable labels can be excluded from the search scope. For the previous examples, a search restricted to variable names finds nothing. Likewise, making the search case sensitive fails to find anything.

lookfor(nes1948, "truman", labels = FALSE)
lookfor(nes1948, "truman", ignore.case = FALSE) 

The next examples require downloading the data file for the General Social Survey of 2010 in Stata format. The data are first imported with the memisc package.

# Download the GSS 2010.
if(!file.exists(zip <- "2010.zip")) {
  url <- "http://publicdata.norc.org/GSS/DOCUMENTS/OTHR/2010_stata.zip"
  download.file(url, zip)
}

# Load as a memisc object.
gss <- UnZip(zip, "2010.dta")
gss <- Stata.file(gss)

If no keyword is specified, the lookfor function searches for the 'sample' and 'weight' keywords by default, assuming that the user might start by looking for sampling and weighting variables.

lookfor(gss)

The lookfor function will also read variable labels from objects loaded with the foreign package.

require(foreign)
unzip(zip)

# Load as a data frame with attributes.
gss <- read.dta("2010.dta")

# Look for a single keyword.
lookfor(gss, "homosex")

Notes

Variable labels are searched by looking at the var.labels and variable.labels attributes in all data frames, and at the results of the description function in memisc objects.
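The attribute lookup described here can be tried without any imported file. The sketch below attaches a variable.labels attribute by hand to a toy data frame and searches it; the column names and labels are made up, and real foreign objects may store their labels slightly differently.

```r
# Toy data frame; the column names and labels are made up for illustration.
survey <- data.frame(v1 = c(1.2, 0.8), v2 = c(34, 51))

# Attach labels as a plain attribute, similar to what foreign-style objects carry.
attr(survey, "variable.labels") <- c(v1 = "Sampling weight", v2 = "Respondent age")

# Read the labels back, falling back to var.labels if needed.
labs <- attr(survey, "variable.labels")
if (is.null(labs)) labs <- attr(survey, "var.labels")

# Search the labels and report the matching variables.
hits <- grep("weight", labs, ignore.case = TRUE)
found <- data.frame(variable = names(survey)[hits], label = unname(labs[hits]))
found
```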

The query function of the memisc package also allows searching for keywords in a data file, and supports fuzzy search via agrep. It also covers value labels, which makes it 'wider' than lookfor.
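Fuzzy matching of this kind can be tried directly with the agrep function in base R. This is only a sketch of approximate matching on a made-up vector of names, not a reproduction of how the memisc query function works internally.

```r
# Hypothetical variable names, made up for this illustration.
vars <- c("weight", "sample", "region")

# Exact matching misses the misspelled keyword.
exact <- grep("weght", vars)

# Approximate matching tolerates one edit (here, the missing "i").
fuzzy <- agrep("weght", vars, max.distance = 1)
vars[fuzzy]
```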

Please send comments and suggestions through this Gist or by email.
