Skip to content

Instantly share code, notes, and snippets.

@smach
Last active December 11, 2019 00:50
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save smach/f83952fbe9b2ac95c2a8b6bebec5aee2 to your computer and use it in GitHub Desktop.
Save smach/f83952fbe9b2ac95c2a8b6bebec5aee2 to your computer and use it in GitHub Desktop.
Reshaping some very messy data
library(dplyr)
library(tidyr)
library(janitor)
library(stringr)
starwars_garbage_data1 <- data.frame(
stringsAsFactors = FALSE,
v1 = c(
"Character Name", "C-3PO", "Person-film ID", "2218529825", "7731900678",
"123598423", "238952395", "6232048034", "3036308047",
"Species:", "Character Name", "Darth Vader", "Person-film ID",
"7731900678", "2072161410", "5775513390", "5583683275",
"Species:", "Character Name", "Yoda", "Person-film ID",
"6562229951", "9476002994", "958505842", "6754035462", "2138742347",
"Species:"
),
v2 = c(
NA, NA, "Homeworld", "Tattooine", "Tattooine", "Tattooine", "Tattooine",
"Tattooine", "Tattooine", "Droid", NA, NA, "Homeworld", NA, NA, NA,
NA, "Human", NA, NA, "Homeworld", "Tattooine", "Tattooine", "Tattooine",
"Tattooine", "Tattooine", "Yoda's Species"
),
v3 = c(
"Eye Color", "Yellow", "Film", "Attack of the Clones", "The Phantom Menace",
"Revenge of the Sith", "Return of the Jedi",
"The Empire Strikes Back", "A New Hope", "", "Eye Color", "Yellow", "Film",
"Revenge of the Sith", "Return of the Jedi",
"The Empire Strikes Back", "A New Hope", "", "Eye Color", "Brown", "Film",
"Attack of the Clones", "The Phantom Menace",
"Revenge of the Sith", "Return of the Jedi", "The Empire Strikes Back", ""
),
v4 = c(
"Height", "167", "Arbitrary Number", "150000", "300000", "300000", "300000",
"400000", "500000", "", "Height", "202", "Arbitrary Number",
"100000", "150000", "350000", "500000", "", "Height", "66",
"Arbitrary Number", "160000", "160000", "200000", "200000", "200000",
""
)
)
reshape_main_blocks <- function(TheNewColumn, TheOldColumn, TheValue, mydf) {
mydf[[TheNewColumn]] <- ifelse(mydf[[TheOldColumn]] == TheValue, lead(mydf[[TheOldColumn]]), "")
for(i in 1:nrow(mydf)) {
mydf[[TheNewColumn]][i] = ifelse(mydf[[TheNewColumn]][i] == "", mydf[[TheNewColumn]][i - 1], mydf[[TheNewColumn]][i])
}
return(mydf)
}
tidydf <- reshape_main_blocks("TheCharacter", "v1", "Character Name", starwars_garbage_data1)
tidydf <- reshape_main_blocks("TheEyeColor", "v3", "Eye Color", tidydf)
tidydf <- reshape_main_blocks("TheHeight", "v4", "Height", tidydf)
tidydf <- mutate(tidydf, TheSpecies = ifelse(v1 == "Species:", v2, ""))
for(i in nrow(tidydf):1) {
tidydf$TheSpecies[i] = ifelse(tidydf$TheSpecies[i] == "", tidydf$TheSpecies[i + 1], tidydf$TheSpecies[i])
}
find_homeworld <- function(mycolumn) {
x <- janitor::tabyl(mycolumn)
names(x)[1] <- "Value"
if(max(x$n) > 1){
myresult <- x$Value[which.max(x$n)]
} else {
myresult = ""
}
}
tidydf <- tidydf %>%
group_by(TheCharacter) %>%
mutate(
Homeworld = find_homeworld(v2)
)
tidy_IDs <- tidydf %>%
group_by(TheCharacter) %>%
filter(stringr::str_detect(v1, "^\\d\\d\\d")) %>%
select(TheCharacter, PersonFilmID = v1, Film = v3, ArbitraryNumber = v4)
# tidydf <- select(tidydf, -c(v1:v4) )
# Per @jmcastagnetto, final code to create desired data frame could be:
tidydf <- tidydf %>%
select(-c(v1:v4) ) %>%
distinct() %>%
full_join(
tidy_IDs,
by = "TheCharacter"
) %>%
select(
PersonFilmID,
Homeworld,
Film,
ArbitraryNumber,
TheCharacter,
TheEyeColor,
TheHeight,
TheSpecies
)
@jmcastagnetto
Copy link

jmcastagnetto commented Dec 10, 2019

Your solution works, but I would change the last line to get the dataframe needed:

tidydf <- tidydf %>% 
  select(-c(v1:v4) ) %>% 
  distinct() %>% 
  full_join(
    tidy_IDs,
    by = "TheCharacter"
  ) %>% 
  select(
    PersonFilmID,
    Homeworld,
    Film,
    ArbitraryNumber,
    TheCharacter,
    TheEyeColor,
    TheHeight,
    TheSpecies
  )

Refs:

Desired dataframe

@smach
Copy link
Author

smach commented Dec 11, 2019

@jmcastagnetto Thanks! I figured she could get the rest from my code, but good idea to complete it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment