Skip to content

Instantly share code, notes, and snippets.

@Nakilon
Last active November 19, 2019 12:54
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Nakilon/2d7c0540a649b4d4e7dd7cba5098486d to your computer and use it in GitHub Desktop.
Save Nakilon/2d7c0540a649b4d4e7dd7cba5098486d to your computer and use it in GitHub Desktop.
mix "happy" and "cursing" words stats together for final ranking
require "bundler/inline"
gemfile do
source :rubygems
gem "pcbr"
end
require "open-uri"
require "csv"
happy, cursing = %w{ happy cursing }.map do |filename|
CSV.parse(
open("https://raw.githubusercontent.com/Dobiasd/programming-language-subreddits-and-their-choice-of-words/master/analysis/#{filename}.csv", &:read),
headers: true, converters: :numeric
).values_at("subreddit", "sum").to_h
end
pcbr = PCBR.new
(happy.keys & cursing.keys).each do |sub|
pcbr.store sub, [-happy[sub], cursing[sub]]
end
puts "From best to worst:"
pcbr.table.sort_by(&:last).chunk(&:last).each_with_index do |g, i|
puts "\t%2s. %s" % [i + 1, g.last.map(&:first).join(", ")]
end
From best to worst:
1. objectivec
2. lua
3. clojure
4. lisp
5. ruby
6. golang
7. haskell
8. visualbasic, mathematica, matlab, scala, swift
9. csharp, perl, rust
10. python, c_programming
11. sql
12. javascript
13. cpp, java, php
@Nakilon
Copy link
Author

Nakilon commented Nov 19, 2019

Note: original bar charts in @Dobiasd's repo README are misleading. For example, as you can see from CSV Lua had 33 "shit" and 0 "shitty" while Mathematica had 0 "shit" and 19 "shitty". All cursings were added together and only some sort of top-4 were taken for the chart:
misleading chart
This is obviously a bad approach because according to CSV Lua and Mathematica are close in cursing but they are twice much different in chart but no one spotted the mistake, so this also proves that /r/dataisbeautiful subreddit is shitty too.
At least /r/programming had a suspecting comment thread.

@Dobiasd
Copy link

Dobiasd commented Nov 19, 2019

Hi @Nakilon,

thanks for reminding me of this project from back then. I had quite some fun with it. 🙂

The implementation was a quick hack and not meant to be confused with actual science. Sorry if this caused confusion or anger.

Your concern with "shit" and "shitty" is totally valid. There are probably many more similar cases. I can assure you, I did not make these mistakes on purpose. It was merely nonattention.

In case you are interested in repeating the experiment but this time with more meaningful methods of measuring, please let me know. I'd be interested in seeing the results.

@Nakilon
Copy link
Author

Nakilon commented Nov 19, 2019

Oh, hello.

It was not confusion or anger but rather disappointment, because when I see charts I rarely check the numbers and use to share them with others in chats, etc.
Here I spotted the problem only because I was mixing two tables together and the result was not exactly how I imagined it. And here is why. While I agree that it might be interesting to see the ratio of word_A/word_B per language, if I wanted to rank the languages I would use the last CSV column -- "sum". And that's what I did.

I believe there was another research of the same kind made recently (maybe in winter or spring of 2019). Not sure what was the data source -- Reddit or Github, but probably it had more data, that would be nice since your CSVs have lots of "0".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment