How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976
What does it mean to compare the amount of text versus metadata?
Let's start with raw size of the data that comes over the wire from Twitter.
## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% cat tweets.2011-05-19 | grep -P '"text":' | head -100000 > 100k_tweets
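## (The grep on "text": also drops the stream's non-status messages,
## e.g. delete notices, which have no text field.)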
## Full tweet size.
% cat 100k_tweets | wc
100000 3869324 211077132
^^^^^^^^^
size: ~2kb/tweet
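## (That is, 211077132 bytes / 100000 tweets ~= 2111 bytes each.)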
## Example tweet.
% head -1 100k_tweets
{"in_reply_to_status_id_str":null,"text":"Se Deus, que \u00e9 Deus n\u00e3o te julga, porque se importar com que os outros pensam?","contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"geo":null,"source":"web","coordinates":null,"in_reply_to_user_id":null,"truncated":false,"entities":{"hashtags":[],"urls":[],"user_mentions":[]},"place":null,"favorited":false,"created_at":"Thu May 19 00:15:11 +0000 2011","user":{"contributors_enabled":false,"profile_sidebar_fill_color":"","profile_image_url":"http:\/\/a3.twimg.com\/profile_images\/1240716730\/feiaaaa_normal.jpg","follow_request_sent":null,"profile_background_tile":true,"url":"http:\/\/www.orkut.com.br\/Main#Profile?uid=13310717337747944298","screen_name":"Natalyyia","profile_link_color":"fa5573","description":"Falsidade \u00e9 caracter\u00edstica de fracos. Os fortes falam na cara, e n\u00e3o se importam de escutar o que a outra pessoa pensa ao seu respeito....\r\n","show_all_inline_media":false,"verified":false,"geo_enabled":false,"favourites_count":6,"profile_sidebar_border_color":"181A1E","listed_count":0,"time_zone":"Santiago","followers_count":52,"location":"Bras\u00edlia DF","is_translator":false,"notifications":null,"profile_use_background_image":true,"lang":"en","statuses_count":1947,"friends_count":195,"profile_background_color":"1A1B1F","protected":false,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/253006506\/tumblr_ll3kpikZm91qe4nyno1_500_large.jpg","created_at":"Thu Oct 07 23:39:29 +0000 2010","name":"Nat\u00e1lya","default_profile_image":false,"default_profile":false,"id":199887606,"id_str":"199887606","following":null,"utc_offset":-14400,"profile_text_color":"0a060a"},"retweet_count":0,"in_reply_to_status_id":null,"id":71005947086110720,"id_str":"71005947086110720","in_reply_to_screen_name":null}
## Extract text. (Quick and dirty; JSON parser would be cleaner, but slower.)
% cat 100k_tweets | grep -Po '"text":.*?[^\\]",' | perl -pe 's/"text"://; s/^"//; s/",$//' > 100k_texts
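## An alternative sketch with a real JSON parser (Python 2 style, to
## match the era). Note it decodes the \uXXXX escapes and keeps only
## the top-level text, so byte counts come out smaller than with the
## grep version above.
% python -c '
import sys, json
for line in sys.stdin:
    print json.loads(line)["text"].encode("utf-8")
' < 100k_tweets > 100k_texts_parsed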
## Example texts.
% head -5 100k_texts
Se Deus, que \u00e9 Deus n\u00e3o te julga, porque se importar com que os outros pensam?
@gunslikeana kd clara ana DD:
\u041a\u0440\u0430\u0441\u0438\u0432\u043e!!!RT@avvlas \u0413\u0434\u0435-\u0442\u043e \u0432 \u0417\u0430\u043f\u0430\u0434\u043d\u043e\u0439 \u0410\u0444\u0440\u0438\u043a\u0435 http:\/\/nblo.gs\/i2tug
@bamin27 i already tried
I just took \"[1-4] Justin Bieber hid camera's in 'Never Say Never' Poster's of himself ...\" and got: Part 1 <3! Try it: http:\/\/bit.ly\/jm6wHX
## Text size.
% cat 100k_texts | wc
129499 1184935 10553809
^^^^^^^^
size: ~100 bytes/tweet
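## (10553809 bytes / 100000 tweets ~= 106 bytes. Also note wc counts
## 129,499 lines for 100k tweets; presumably retweets, whose embedded
## retweeted_status carries a second "text" field, match the grep too.)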
## Percent text.
> 10553809/211077132
[1] 0.04999977
Note that we're using the JSON-escaped version of the text. Is this defensible?
If we parsed the JSON then measured the UTF-8 size, it would be smaller.
But it's not like UTF-8 is the "true" representation either: if we used UCS-2
or -4 it would most likely be bigger. So let's just use something close to
the format we get over the wire from Twitter.
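## Concretely: "\u00e9" is 6 bytes escaped, 2 bytes in UTF-8, and 2 or
## 4 bytes in UCS-2/UCS-4. (The check below assumes a UTF-8 terminal.)
% printf 'é' | wc -c
2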
This suggests the next experiment: sizes under compression. This is a fairer
comparison because the algorithm should normalize for different levels of
redundancy in different representations. A perfect compression algorithm
would tell us their true information contents.
Life is short, so we use gzip.
% cat 100k_tweets | gzip -9 | wc -c
36191951
% cat 100k_texts | gzip -9 | wc -c
4625716
So under compression:
* 362 bytes / tweet
* 46 bytes of text / tweet
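## Compressed text as a fraction of the compressed full tweet:
> 4625716/36191951
[1] 0.1278106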
This implies text is roughly 13% of the information content, if you believe
DEFLATE (gzip's algorithm: LZ77 plus Huffman coding) does a reasonable job at
language modeling. It doesn't entirely, but I might hazard it's good enough
for the general comparison against metadata. I once did a class project where
PPM with a 3-gram Markov model outperformed gzip on multilingual text by a
relative 7ish%, so let's say the text is at most ~12% the information content
of the entire tweet.
The other consideration is whether DEFLATE is good at compressing metadata. I
bet it is, especially at all those repetitive key/value pairs. Then again, the
metadata contains plenty of language, location, and time data too, for which
one could imagine building good probabilistic models and thus better
compression. This might be interesting to investigate further. As a start: the
full tweet compresses to 17% of its raw size, versus 44% for the text alone.
The metadata compresses much better.
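## Checking those ratios:
> round(36191951/211077132, 2)
[1] 0.17
> round(4625716/10553809, 2)
[1] 0.44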
Data caveat: the sample covers only a tiny slice of time. I wonder if time
zone effects make different language communities more or less active (one half
of the globe is asleep), and therefore skew the text sizes (since texts in
Asian languages take more bytes, and there may be other geographic,
community-specific variation in text size too).
iceMBD commented Mar 11, 2014:

Hello, thank you for this approach, but what if I want to grep three elements
from each tweet? For example, I need 'text', 'location', and 'geo'. Thank you!
