@treysp
Created February 22, 2017 21:19
tibble::glimpse() with variable labels!
Title: glimpse_labels() -- glimpse() with variable labels!
Author: Trey Spiller
Date: February 22, 2017
Output: html_document

glimpse_labels() -- glimpse() with variable labels!

Background

Many data analysts use non-R statistics software packages such as Stata, SPSS, and SAS.

All of those packages allow one to specify certain pieces of metadata about the contents of a dataset. One type is the "variable label," a brief description of what a variable is or means. For data from a survey questionnaire, the variable label might include the text of the question from which the variable came. In such situations, having the variable labels attached to the data can be quite useful. Some packages include functions that display the data contents with the labels (e.g., Stata's describe function).

Fortunately, the tidyverse package haven imports datasets created by the three stats packages listed above. An imported dataset is of the tbl_df class, and each variable has a label attribute that contains its variable label.

The function glimpse_labels() is a first pass at including variable labels in the printout generated by tibble::glimpse(). This gist shows how it works and points out some problems that would need to be solved for it to be fully functional.

Toy data

Before we get going, let's create some data with variable labels to work with. We create a small tbl_df and add a label to each variable by assigning text to its label attribute. For now, only the x variable gets actual label text; the other variables get an empty string.

Note that the code that assigns the label text simultaneously creates the label attribute.

# create some labelled data
# (tibble::tibble() replaces data_frame(), which is deprecated in current tibble releases)
dat1 <- tibble::tibble(x = 1:20, y = 21:40, z = letters[1:20], a = 41:60)

attributes(dat1[["x"]])$label <- "this is the x label"
attributes(dat1[["y"]])$label <- ""
attributes(dat1[["z"]])$label <- ""
attributes(dat1[["a"]])$label <- ""
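
As a small aside, the same thing can be done with base R's attr() getter and setter, which some people find easier to read. This is only a sketch on a plain vector (the name v is made up for illustration), not the code used elsewhere in this gist:

```r
# A plain integer vector starts out with no label attribute
v <- 1:20
had_label_before <- !is.null(attr(v, "label", exact = TRUE))

# Assigning to attr() creates the attribute and sets its text in one step
attr(v, "label") <- "this is the x label"

# Read the label back; exact = TRUE avoids partial matching on the name
v_label <- attr(v, "label", exact = TRUE)
print(v_label)
```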

tibble::glimpse()

First, let's see what the regular glimpse() function shows:

tibble::glimpse(dat1)
## Observations: 20
## Variables: 4
## $ x <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ y <int> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, ...
## $ z <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", ...
## $ a <int> 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, ...

The printout shows each variable's name, its type, and its first few values. As expected, however, the existence of the label attribute is not reflected in the printout.

Viewing variable labels now

Currently, in order to view the variable labels for this tbl_df, we'd do something like the following, which gives us a named character vector of the variable labels.

# Ugly character vector output
vapply(dat1, function(x) attributes(x)[["label"]], character(1))
##                     x                     y                     z 
## "this is the x label"                    ""                    "" 
##                     a 
##                    ""

It's clear how much less useful this is than what we get from tibble::glimpse().
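
One more wrinkle: the vapply() call above only works when every variable has a label attribute. For a variable without one, attributes(x)[["label"]] returns NULL and vapply() stops with an error. Here is a rough sketch of a more forgiving lookup that returns a two-column data frame instead. The helper name label_table() and the toy data frame are hypothetical and are not part of spillr:

```r
# Hypothetical helper: collect variable labels into a data frame,
# treating a missing (or non-string) label attribute as "".
label_table <- function(df) {
  labels <- vapply(
    df,
    function(col) {
      lab <- attr(col, "label", exact = TRUE)
      if (is.character(lab) && length(lab) == 1L) lab else ""
    },
    character(1)
  )
  data.frame(
    variable = names(df),
    label = unname(labels),
    stringsAsFactors = FALSE
  )
}

# Toy data: only x carries a label attribute, y has none at all
toy <- data.frame(x = 1:3, y = 4:6)
attr(toy[["x"]], "label") <- "this is the x label"

print(label_table(toy))
```

This still lacks the type and data preview that glimpse() provides, but it does not fail on unlabelled variables.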

glimpse_labels() setup

The glimpse_labels() function is currently in the R package spillr, available on GitHub. You can install it from GitHub as below, or copy the function code directly from here.

If you want to install the package, make sure you have devtools and run:

library(devtools)
install_github("treysp/spillr")

Short labels

Now, let's see the printout for glimpse_labels():

library(spillr)

glimpse_labels(dat1)
## Observations: 20
## Variables: 4
## $ x this is the x label <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ y                     <int> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, ...
## $ z                     <chr> "a", "b", "c", "d", "e", "f", "g", "h", ...
## $ a                     <int> 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...

The label text is now displayed between the variable name x and its class <int>. The other variables don't have label text, so they have blank space of the same width as the x variable's label. If the other variables didn't have a label attribute at all, the printout would still look like this.

Wrapped labels

By default, the width of the printout is the width of the console. Depending on the width of your console when you run it and the length of the variable labels, a label may be longer than the console is wide. Let's see what happens when that occurs:

dat2 <- dat1
attributes(dat2[["x"]])$label <- paste0(rep("[x-label]", 8), collapse = " ")
attributes(dat2[["y"]])$label <- paste0(rep("[y-label]", 3), collapse = " ")
attributes(dat2[["z"]])$label <- paste0(rep("[z-label]", 1), collapse = " ")
 
glimpse_labels(dat2)
## Observations: 20
## Variables: 4
## $ x [x-label] [x-label] [x-label] [x-label] [x-label] [x-label] [x-
##     label] [x-label]              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...
## $ y [y-label] [y-label] [y-label] <int> 21, 22, 23, 24, 25, 26, 27, 28...
## $ z [z-label]                     <chr> "a", "b", "c", "d", "e", "f", ...
## $ a                               <int> 41, 42, 43, 44, 45, 46, 47, 48...

We can see that x's label was wrapped over two lines. Because its last line is shorter than y's label, it is padded so the variable classes are lined up (x's <int> is directly above y's <int>).

When there is a wrapped label, glimpse_labels() identifies the longest label that does NOT wrap, makes the label column that wide, and pads any wrapped labels shorter than that width to make everything line up.
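
The rule described above can be sketched roughly as follows. This is not the spillr source: it uses base R's strwrap() as a stand-in for stringr::str_wrap(), and the function name align_labels() and its return structure are made up for illustration. Because strwrap() only breaks on whitespace, the exact wrap points can differ from the printouts in this gist.

```r
# Sketch of the alignment rule (assumed names; strwrap() is a stand-in)
align_labels <- function(labels, wrap_width) {
  # Split each label into the lines it would occupy at wrap_width
  wrapped <- lapply(labels, function(lab) {
    lines <- strwrap(lab, width = wrap_width)
    if (length(lines) == 0L) "" else lines  # guard against an empty result
  })
  n_lines <- lengths(wrapped)
  last_line <- vapply(wrapped, function(l) l[length(l)], character(1))

  # Label column width = longest label that fits on a single line
  single <- n_lines == 1L
  col_width <- if (any(single)) max(nchar(last_line[single])) else 0L

  # Pad last lines shorter than the column; longer ones are left to overflow
  pad <- pmax(col_width - nchar(last_line), 0L)
  list(
    col_width = col_width,
    last_line = paste0(last_line, strrep(" ", pad)),
    overflows = nchar(last_line) > col_width
  )
}

labels <- c(
  x = paste(rep("[x-label]", 7), collapse = " "),
  y = paste(rep("[y-label]", 3), collapse = " "),
  z = "[z-label]",
  a = ""
)
res <- align_labels(labels, wrap_width = 55)
print(res$col_width)
print(res$overflows)
```

The overflows element flags the case where a wrapped label's last line is wider than the label column, which is where the alignment breaks down.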

However, sometimes the last line of a wrapped label is longer than the label column - here's what happens then:

dat3 <- dat2
attributes(dat3[["x"]])$label <- paste0(rep("[x label]", 10), collapse = " ")

glimpse_labels(dat3)
## Observations: 20
## Variables: 4
## $ x [x label] [x label] [x label] [x label] [x label] [x label] [x
##     label] [x label] [x label] [x label] <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ y [y-label] [y-label] [y-label] <int> 21, 22, 23, 24, 25, 26, 27, 28...
## $ z [z-label]                     <chr> "a", "b", "c", "d", "e", "f", ...
## $ a                               <int> 41, 42, 43, 44, 45, 46, 47, 48...

The x variable's class and data display have now been pushed over to the right such that they don't line up with those for the other variables.

Things get messy

Things start to get messy when there are multiple wrapped labels in close proximity:

dat4 <- dat3
attributes(dat4[["z"]])$label <- paste0(rep("[z-label]", 11), collapse = " ")
attributes(dat4[["a"]])$label <- "[a-label]"

glimpse_labels(dat4)
## Observations: 20
## Variables: 4
## $ x [x label] [x label] [x label] [x label] [x label] [x label] [x
##     label] [x label] [x label] [x label] <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ y [y-label] [y-label] [y-label] <int> 21, 22, 23, 24, 25, 26, 27, 28...
## $ z [z-label] [z-label] [z-label] [z-label] [z-label] [z-label] [z-
##     label] [z-label] [z-label] [z-label] [z-label] <chr> "a", "b", "c"...
## $ a [a-label]                     <int> 41, 42, 43, 44, 45, 46, 47, 48...

The variable classes are vertically aligned for variables y and a, but it's very hard to tell.

One potential solution would be to put the variable class and data display on the next row whenever a wrapped label's last line is longer than the label column width. In the example above, the x class and data display would go on a new line before the y variable, and the z class and data display would go on a new line before the a variable.
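
To make the "next row" idea concrete, here is a rough sketch of how a single variable's rows might be assembled. Nothing here exists in spillr: the function render_row(), its arguments, and the layout details (a "$ name " prefix, one space before the class) are assumptions based on the printouts in this gist.

```r
# Hypothetical sketch: build the printed rows for one variable.
# label_lines holds the already-wrapped label (at least one element).
render_row <- function(name, label_lines, type, col_width) {
  prefix <- paste0("$ ", name, " ")
  indent <- strrep(" ", nchar(prefix))
  # First label line gets "$ name ", continuation lines get matching indent
  rows <- paste0(c(prefix, rep(indent, length(label_lines) - 1L)), label_lines)
  last <- label_lines[length(label_lines)]
  if (nchar(last) > col_width) {
    # Overflow: put the class on a fresh row, blank under the label column
    c(rows, paste0(indent, strrep(" ", col_width), " ", type))
  } else {
    # Fits: pad the last label line out to the column and append the class
    pad <- strrep(" ", col_width - nchar(last))
    rows[length(rows)] <- paste0(rows[length(rows)], pad, " ", type)
    rows
  }
}

x_rows <- render_row(
  "x",
  c("[x label] [x", "label] [x label] [x label] [x label]"),
  "<int>", col_width = 29
)
y_rows <- render_row("y", "[y-label] [y-label] [y-label]", "<int>", col_width = 29)
writeLines(c(x_rows, y_rows))
```

With this layout the class marker starts in the same column for every variable, at the cost of one extra line for each overflowing label.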

Another option would be to add spacing lines between variable rows, perhaps depending on whether a variable's label wrapped.

Suggested solutions welcome!

Known limitations

The current implementation will break if stringr::str_wrap() can't wrap the label such that it fits in the console width. The wrapping algorithm is based on text with spaces or hyphens between words, so I think this would only happen if there were words longer than the console width.
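
If that guess is right, a simple pre-check could catch problem labels before printing. The sketch below assumes words are separated by whitespace or hyphens, as described above. The function name has_unwrappable_word() is made up and nothing like it exists in spillr as far as this gist shows:

```r
# Hypothetical pre-check: does any single word exceed the available width?
has_unwrappable_word <- function(label, width) {
  # Split on runs of hyphens and/or whitespace (assumed break points)
  words <- strsplit(label, "[-[:space:]]+")[[1]]
  any(nchar(words) > width)
}

check_ok   <- has_unwrappable_word("a label made of short words", width = 20)
check_long <- has_unwrappable_word(strrep("a", 30), width = 20)
print(c(check_ok, check_long))
```

A caller could use such a check to truncate the label, or to fall back to the plain glimpse() output, rather than fail.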
