Last active
May 8, 2021 20:38
-
-
Save anthony-cros/74811b85f9634f3e5646eed71ad7aa20 to your computer and use it in GitHub Desktop.
Reproducing example from https://medium.com/@thijser/doing-cool-data-science-in-java-how-3-dataframe-libraries-stack-up-5e6ccb7b437# with Gallia
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
object MediumArticle extends App { // reproducing example from https://medium.com/@thijser/doing-cool-data-science-in-java-how-3-dataframe-libraries-stack-up-5e6ccb7b437# | |
import gallia._ // see https://github.com/galliaproject/gallia-core/blob/init/README.md#dependencies | |
// --------------------------------------------------------------------------- | |
/* | |
for reference, pandas way provided in the article: | |
data = pd.read_csv('urb_cpop1_1_Data.csv') | |
filtered = data.drop(data[data.Value == ":"].index) | |
filtered['key'] = filtered['CITIES'] + ':' + filtered['INDIC_UR'] | |
filtered['Value'] = pd.to_numeric(filtered['Value']) | |
cities = filtered.pivot_table(index='key', columns='TIME', | |
values='Value', aggfunc="mean") | |
# Top 10 cities by pop in 2017: | |
cities.filter(like='January, total',axis=0). | |
sort_values(by=[2017], ascending=False).head(10) | |
# Highest growth cities: | |
cities["growth"] = (cities[2016] / cities[2010] - 1) * 100 | |
cities.filter(like='January, total',axis=0). | |
sort_values(by=["growth"], ascending=False).head(10) | |
*/ | |
// --------------------------------------------------------------------------- | |
/* | |
INPUT except: | |
$ cat /data/urb_cpop1_1_Data.csv | head | csv2tsv | column -nts$'\t' | |
TIME CITIES INDIC_UR Value Flag and Footnotes | |
2008 Germany Population on the 1st of January, total 82217837 | |
2008 Germany Population on the 1st of January, male 40274292 | |
2008 Germany Population on the 1st of January, female 41943545 | |
2008 Germany Population on the 1st of January, 0-4 years, total 3469044 | |
2008 Germany Population on the 1st of January, 0-4 years, male 1780414 | |
2008 Italy Population on the 1st of January, total 59619290 | |
2008 Italy Population on the 1st of January, male 28949747 | |
2008 Italy Population on the 1st of January, female 30669543 | |
2008 Italy Population on the 1st of January, 0-4 years, total 2807626 | |
*/ | |
val cities = | |
"/data/urb_cpop1_1_Data.csv" | |
.stream(_.table.noArraySeparator) // could also provide schema explicitely | |
.filterBy("Value").matches(_ != ":") | |
.convert ("Value").toInt | |
.generate("key").from("CITIES", "INDIC_UR").using(_ + ":" + _) // types are optional for trivial cases | |
.pivot(_.int("Value")).usingMean.rows("key").column("TIME").asNewKeys((2008 to 2017).map(_.toString)) | |
// Top 10 cities by pop in 2017: | |
cities | |
.filterBy(_.string("key")).matches(_.contains("January, total")) | |
.sortBy("2017", descending = true, missingLast = true) | |
.take(10) | |
.printTable | |
// Highest growth cities: | |
cities | |
.generate("growth") | |
// note that these are optional (denoted by '_' in Gallia) | |
.from(_.double_("2016"), _.double_("2010")) | |
// leaves it as missing if either from/to are missing | |
.using { (x, y) => for { to <- x; from <- y } yield (to / from - 1) * 100 } | |
.filterBy(_.string("key")).matches(_.contains("January, total")) | |
.sortBy("growth", descending = true, missingLast = true) | |
.take(10) | |
.printTable | |
/* | |
OUTPUT for Top 10 cities by pop in 2017 (formatted): | |
key 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 | |
United Kingdom:Population on the 1st of January, total 64853393 65379044 65844142 | |
Portugal:Population on the 1st of January, total 10627250 10637713 10573479 10572721 10542398 10487289 10427301 10374822 10341330 10309573 | |
London (greater city):Population on the 1st of January, total 7668300 7753600 8002000 8173941 8256400 8362500 8477600 8606201 8730803 8797330 | |
Slovakia:Population on the 1st of January, total 5412254 5424925 5435273 5397036 5404322 5410836 5415949 5421349 5426252 5435343 | |
Greater Manchester:Population on the 1st of January, total 2650800 2682528 2693800 2708600 2723900 2744508 2769152 2789822 | |
West Midlands urban area:Population on the 1st of January, total 2390000 2419500 2431200 2446600 2462300 2479550 2500093 2527245 | |
Latvia:Population on the 1st of January, total 2270894 2261294 2070371 2044813 2023825 2001468 1986096 1968957 1950116 | |
Lisboa (greater city):Population on the 1st of January, total 1790389 1784236 1857112 1863069 1860256 1849472 1837852 1835785 1835894 1842352 | |
Birmingham:Population on the 1st of January, total 1019200 1028700 1055600 1073045 1079900 1088900 1096800 1106334 1117938 1132600 | |
Greater Glasgow:Population on the 1st of January, total 986575 996545 | |
OUTPUT for Highest growth cities (formatted): | |
key 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 growth | |
Bournemouth:Population on the 1st of January, total 164600 176800 183491 185100 187700 190100 268124.5 271606 269698 53.62330316742081 | |
Oulu:Population on the 1st of January, total 131585 133550 139133 141671 143909 190847 193798 196291 198525 42.687212954511146 | |
Derry & Strabane Local Government District:Population on the 1st of January, total 109100 109600 108400 107877 108400 108600 108900 149336 149808 150320 38.19926199261994 | |
Southampton:Population on the 1st of January, total 234100 231600 236882 237600 240800 243700 311890 316571.5 316379 36.688903281519856 | |
Blackpool:Population on the 1st of January, total 140600 142700 142065 142000 141700 141000 194661.5 194388.5 195034 36.22179397337071 | |
Valencia:Population on the 1st of January, total 807200 814208 809267 792054 797028 792303 786424 1085048.5 1089284.5 34.60137383582922 | |
Granada:Population on the 1st of January, total 239154 241003 239017 237818 237540 317253.5 317160 32.617476605032735 | |
Pamplona/Iru?a:Population on the 1st of January, total 197275 198491 197488 195943 197604 196955 196166 257629 257984 30.63274730616543 | |
Milano (greater city):Population on the 1st of January, total 3154102 3854555 3875801 3925767 4038864 4061382 4074585 29.18367890448692 | |
Stoke-on-trent:Population on the 1st of January, total 239300 238900 246600 249008 249300 250100 250600 314612 316315 318791 28.270478507704787 | |
*/ | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
changed columns to column to reflect t210303101932 (commit 719bcfd)