Skip to content

Instantly share code, notes, and snippets.

@anthony-cros
Last active May 8, 2021 20:38
Show Gist options
  • Save anthony-cros/74811b85f9634f3e5646eed71ad7aa20 to your computer and use it in GitHub Desktop.
Save anthony-cros/74811b85f9634f3e5646eed71ad7aa20 to your computer and use it in GitHub Desktop.
object MediumArticle extends App { // reproducing example from https://medium.com/@thijser/doing-cool-data-science-in-java-how-3-dataframe-libraries-stack-up-5e6ccb7b437#
import gallia._ // see https://github.com/galliaproject/gallia-core/blob/init/README.md#dependencies
// ---------------------------------------------------------------------------
/*
for reference, pandas way provided in the article:
data = pd.read_csv('urb_cpop1_1_Data.csv')
filtered = data.drop(data[data.Value == ":"].index)
filtered['key'] = filtered['CITIES'] + ':' + filtered['INDIC_UR']
filtered['Value'] = pd.to_numeric(filtered['Value'])
cities = filtered.pivot_table(index='key', columns='TIME',
values='Value', aggfunc="mean")
# Top 10 cities by pop in 2017:
cities.filter(like='January, total',axis=0).
sort_values(by=[2017], ascending=False).head(10)
# Highest growth cities:
cities["growth"] = (cities[2016] / cities[2010] - 1) * 100
cities.filter(like='January, total',axis=0).
sort_values(by=["growth"], ascending=False).head(10)
*/
// ---------------------------------------------------------------------------
/*
INPUT except:
$ cat /data/urb_cpop1_1_Data.csv | head | csv2tsv | column -nts$'\t'
TIME CITIES INDIC_UR Value Flag and Footnotes
2008 Germany Population on the 1st of January, total 82217837
2008 Germany Population on the 1st of January, male 40274292
2008 Germany Population on the 1st of January, female 41943545
2008 Germany Population on the 1st of January, 0-4 years, total 3469044
2008 Germany Population on the 1st of January, 0-4 years, male 1780414
2008 Italy Population on the 1st of January, total 59619290
2008 Italy Population on the 1st of January, male 28949747
2008 Italy Population on the 1st of January, female 30669543
2008 Italy Population on the 1st of January, 0-4 years, total 2807626
*/
val cities =
"/data/urb_cpop1_1_Data.csv"
.stream(_.table.noArraySeparator) // could also provide schema explicitely
.filterBy("Value").matches(_ != ":")
.convert ("Value").toInt
.generate("key").from("CITIES", "INDIC_UR").using(_ + ":" + _) // types are optional for trivial cases
.pivot(_.int("Value")).usingMean.rows("key").column("TIME").asNewKeys((2008 to 2017).map(_.toString))
// Top 10 cities by pop in 2017:
cities
.filterBy(_.string("key")).matches(_.contains("January, total"))
.sortBy("2017", descending = true, missingLast = true)
.take(10)
.printTable
// Highest growth cities:
cities
.generate("growth")
// note that these are optional (denoted by '_' in Gallia)
.from(_.double_("2016"), _.double_("2010"))
// leaves it as missing if either from/to are missing
.using { (x, y) => for { to <- x; from <- y } yield (to / from - 1) * 100 }
.filterBy(_.string("key")).matches(_.contains("January, total"))
.sortBy("growth", descending = true, missingLast = true)
.take(10)
.printTable
/*
OUTPUT for Top 10 cities by pop in 2017 (formatted):
key 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
United Kingdom:Population on the 1st of January, total 64853393 65379044 65844142
Portugal:Population on the 1st of January, total 10627250 10637713 10573479 10572721 10542398 10487289 10427301 10374822 10341330 10309573
London (greater city):Population on the 1st of January, total 7668300 7753600 8002000 8173941 8256400 8362500 8477600 8606201 8730803 8797330
Slovakia:Population on the 1st of January, total 5412254 5424925 5435273 5397036 5404322 5410836 5415949 5421349 5426252 5435343
Greater Manchester:Population on the 1st of January, total 2650800 2682528 2693800 2708600 2723900 2744508 2769152 2789822
West Midlands urban area:Population on the 1st of January, total 2390000 2419500 2431200 2446600 2462300 2479550 2500093 2527245
Latvia:Population on the 1st of January, total 2270894 2261294 2070371 2044813 2023825 2001468 1986096 1968957 1950116
Lisboa (greater city):Population on the 1st of January, total 1790389 1784236 1857112 1863069 1860256 1849472 1837852 1835785 1835894 1842352
Birmingham:Population on the 1st of January, total 1019200 1028700 1055600 1073045 1079900 1088900 1096800 1106334 1117938 1132600
Greater Glasgow:Population on the 1st of January, total 986575 996545
OUTPUT for Highest growth cities (formatted):
key 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 growth
Bournemouth:Population on the 1st of January, total 164600 176800 183491 185100 187700 190100 268124.5 271606 269698 53.62330316742081
Oulu:Population on the 1st of January, total 131585 133550 139133 141671 143909 190847 193798 196291 198525 42.687212954511146
Derry & Strabane Local Government District:Population on the 1st of January, total 109100 109600 108400 107877 108400 108600 108900 149336 149808 150320 38.19926199261994
Southampton:Population on the 1st of January, total 234100 231600 236882 237600 240800 243700 311890 316571.5 316379 36.688903281519856
Blackpool:Population on the 1st of January, total 140600 142700 142065 142000 141700 141000 194661.5 194388.5 195034 36.22179397337071
Valencia:Population on the 1st of January, total 807200 814208 809267 792054 797028 792303 786424 1085048.5 1089284.5 34.60137383582922
Granada:Population on the 1st of January, total 239154 241003 239017 237818 237540 317253.5 317160 32.617476605032735
Pamplona/Iru?a:Population on the 1st of January, total 197275 198491 197488 195943 197604 196955 196166 257629 257984 30.63274730616543
Milano (greater city):Population on the 1st of January, total 3154102 3854555 3875801 3925767 4038864 4061382 4074585 29.18367890448692
Stoke-on-trent:Population on the 1st of January, total 239300 238900 246600 249008 249300 250100 250600 314612 316315 318791 28.270478507704787
*/
}
@anthony-cros
Copy link
Author

changed columns to column to reflect t210303101932 (commit 719bcfd)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment