@knbknb
Last active June 21, 2019 10:12
Clean tweets .json files taken from twitter streaming API.
#!/bin/sh
# knb 2019-06 -- untested
#
# Clean tweets .json files taken from twitter streaming API.
# (can probably also remove non-tweet-objects). One JSON object per line is important.
#
# Some tweets might have been corrupted by errors in the collecting application
# or in the operating system.
# Every tweet must be well-formed JSON
# so that R or some other postprocessing app can read the file quickly.
#
# infile: 1 tweet/line, most of them well-formed.
infile_orig=tweets_file.json
infile=some_file.json
outfile=some_other_file.json
# remove adjacent duplicate tweets (uniq only compares neighbouring lines)
uniq < "$infile_orig" > "$outfile"
mv "$outfile" "$infile"
# use jq to pretty-print the well-formed tweets
# (not-well-formed lines will stay in place, long lines)
jq -R -r '. as $line | try fromjson catch $line' < "$infile" > "$outfile"
# find un-pretty printed lines, and their line numbers in the file
# (remove them manually if possible).
# They typically start with {" .
# Pretty-printed objects have their opening { alone on a line.
perl -ne '/^\{"/ && print qq($. $_)' < "$outfile"
# once the bad lines are removed from $outfile, make it the new infile
mv "$outfile" "$infile"
# use jq again to check if parsing errors remain
jq . < "$infile" 1>/dev/null
# compact infile, back to one-tweet-per-row format
jq -c . < "$infile" > "$outfile"
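
The `uniq` step near the top of the script only collapses identical lines that sit directly next to each other. If the same tweet can reappear later in the stream, an order-preserving filter may be a better fit. The following is a minimal sketch, not part of the original script; `dedupe_lines` is a hypothetical helper name, and it assumes the file is small enough for awk to keep every distinct line in memory.

```shell
#!/bin/sh
# Sketch: drop repeated lines even when the copies are not adjacent.
# dedupe_lines is a hypothetical helper, not part of the original script.
dedupe_lines() {
    # seen[$0]++ evaluates to 0 (false) the first time a line appears,
    # so only the first copy of each line is printed; order is preserved.
    awk '!seen[$0]++' "$@"
}

# Possible use with the file names from the script:
#   dedupe_lines "$infile_orig" > "$outfile"
```

Unlike `sort -u`, this keeps the tweets in their original arrival order, at the cost of holding one copy of every distinct line in memory.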
knbknb commented Jun 21, 2019


# In R, read JSON file in like this:
library(purrr)
library(jsonlite)

tw_dir <- "some_dir"
tw_file <- "infile.json"

# full path to the cleaned one-tweet-per-line file
tweets_path <- file.path(tw_dir, tw_file)

# read newline-delimited JSON; with simplify = FALSE each tweet stays a nested list
read_my_tweets <- function(x, simplify = FALSE) {
  con <- file(x, "r")
  on.exit(close(con))  # close the connection when the function returns
  jsonlite::stream_in(con,
                      simplifyVector = simplify,
                      simplifyDataFrame = simplify,
                      simplifyMatrix = simplify)
}

# remove remaining small tweet fragments of length 1
is_tweet <- as_mapper(~ length(.) > 1)
tweet_list <- read_my_tweets(tweets_path, FALSE) %>%
  keep(is_tweet)
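
The length-1 filter above could also be applied on the shell side, before R ever sees the file. This is only a sketch under assumptions: `keep_tweets` is a hypothetical helper name, `jq` is assumed to be installed, and "fragment" is taken to mean a JSON object with a single top-level key, mirroring `length(.) > 1` in the R code. Lines that do not parse as JSON are dropped silently rather than reported.

```shell
#!/bin/sh
# Sketch: keep only lines that parse as JSON objects with more than one key.
# keep_tweets is a hypothetical helper, not part of the original gist.
keep_tweets() {
    # -R reads each line as a raw string; fromjson? parses it and
    # produces no output for lines that are not valid JSON.
    # length of an object is its number of top-level keys.
    jq -R -c 'fromjson? | select(type == "object" and length > 1)' "$@"
}

# Possible use:
#   keep_tweets some_file.json > some_other_file.json
```

Because malformed lines vanish without a message here, the line-number listing from the script is still the better choice when the bad lines need to be inspected by hand.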
