require "bundler/inline"
gemfile do
  source "https://rubygems.org"  # `source :rubygems` is deprecated
  gem "pcbr"
end
require "open-uri"
require "csv"
# Map each subreddit to the value of its "sum" column, once per CSV file.
happy, cursing = %w{ happy cursing }.map do |filename|
  CSV.parse(
    # URI.open, because Kernel#open no longer opens URLs in Ruby 3
    URI.open("https://raw.githubusercontent.com/Dobiasd/programming-language-subreddits-and-their-choice-of-words/master/analysis/#{filename}.csv", &:read),
    headers: true, converters: :numeric
  ).values_at("subreddit", "sum").to_h
end
pcbr = PCBR.new
# Only subreddits present in both tables are compared.
(happy.keys & cursing.keys).each do |sub|
  pcbr.store sub, [-happy[sub], cursing[sub]]
end
puts "From best to worst:"
pcbr.table.sort_by(&:last).chunk(&:last).each_with_index do |g, i|
  puts "\t%2s. %s" % [i + 1, g.last.map(&:first).join(", ")]
end
From best to worst:
1. objectivec
2. lua
3. clojure
4. lisp
5. ruby
6. golang
7. haskell
8. visualbasic, mathematica, matlab, scala, swift
9. csharp, perl, rust
10. python, c_programming
11. sql
12. javascript
13. cpp, java, php
Hi @Nakilon,
thanks for reminding me of this project from back then. I had quite some fun with it. 🙂
The implementation was a quick hack and not meant to be confused with actual science. Sorry if this caused confusion or anger.
Your concern with "shit" and "shitty" is totally valid. There are probably many more similar cases. I can assure you, I did not make these mistakes on purpose. It was merely inattention.
In case you are interested in repeating the experiment but this time with more meaningful methods of measuring, please let me know. I'd be interested in seeing the results.
Oh, hello.
It was not confusion or anger but rather disappointment, because when I see charts I rarely check the numbers and I tend to share them with others in chats, etc.
Here I spotted the problem only because I was mixing two tables together and the result was not exactly how I imagined it. And here is why. While I agree that it might be interesting to see the ratio of word_A/word_B per language, if I wanted to rank the languages I would use the last CSV column -- "sum". And that's what I did.
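For anyone who wants to try ranking by "sum" without hitting the network, here is a minimal sketch. The rows and subreddit names are made up purely to show the shape of the data; only the "subreddit" and "sum" column names come from the real CSVs.

```ruby
require "csv"

# Made-up rows, only to show the shape of the data; these are NOT the real counts.
SAMPLE = <<~ROWS
  subreddit,shit,shitty,sum
  foo,3,1,4
  bar,0,2,2
  baz,5,4,9
ROWS

# Map each subreddit to its "sum" column, then order by that value.
sums = CSV.parse(SAMPLE, headers: true, converters: :numeric)
          .values_at("subreddit", "sum").to_h
ranking = sums.sort_by { |_sub, sum| sum }.map(&:first)

puts ranking.join(", ")
```

Sorting ascending puts the least-cursing subreddit first; reverse it for the opposite order.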
I believe there was another study of the same kind made recently (maybe in winter or spring of 2019). Not sure what the data source was -- Reddit or GitHub -- but it probably had more data, which would be nice, since your CSVs have lots of "0".
Note: the original bar charts in @Dobiasd's repo README are misleading. For example, as you can see from the CSV, Lua had 33 "shit" and 0 "shitty" while Mathematica had 0 "shit" and 19 "shitty". All cursings were added together and only some sort of top-4 was taken for the chart.
This is obviously a bad approach: according to the CSV, Lua and Mathematica are close in cursing, yet in the chart they differ by about a factor of two. No one spotted the mistake, so this also proves that the /r/dataisbeautiful subreddit is shitty too.
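To make the effect concrete, here is a small sketch using only the two counts quoted in the note. Which words the chart actually kept is an assumption on my part: the "chart-like" case below pretends it kept "shit" and dropped "shitty". All other words are left out, and `totals` is just a helper name for this example.

```ruby
# Counts quoted in the note above; every other word is left out of this sketch.
COUNTS = {
  "lua"         => { "shit" => 33, "shitty" => 0 },
  "mathematica" => { "shit" => 0,  "shitty" => 19 },
}

# Sum the counts per language, optionally restricted to a list of words.
def totals(counts, words = nil)
  counts.transform_values do |by_word|
    (words ? by_word.slice(*words) : by_word).values.sum
  end
end

all_words  = totals(COUNTS)            # both spellings counted
chart_like = totals(COUNTS, ["shit"])  # assumption: the chart kept "shit" only

puts "all words:  #{all_words}"
puts "chart-like: #{chart_like}"
```

With both spellings counted the two languages land at 33 and 19; once "shitty" is dropped, Mathematica falls to 0, which is the kind of distortion a word cutoff can introduce.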
At least /r/programming had a skeptical comment thread.