Hello,

I am twiikuu, and I work on the technical side of TF2PL.com.
We're getting quite close to releasing the EU counterpart of our system; here are some insights into what I'm currently doing to prepare for that.
UPDATE: This post was delayed because I ended up quite busy. It is mostly technical; there used to be feature announcements, but they were redacted in favor of this: http://tf2pl.com/news.html. There are a few things I'd like to work on that weren't mentioned there yet, including turning the core logic of our product into a service that other Hubs could use; alerting admins whenever "Call For Admin" is pressed on a match page (there currently isn't an easy way to hook into it, and the Admin UI is quite bad); and finally, backing up and analysing demo files to help admins find evidence for reports, but also to index good moments (similarly to this: https://twitter.com/ESEA/status/981663795142635521) or integrate with logs.tf.

FACEIT offers Elo-based, third-party matchmaking. During the TF2PL NA release, we quickly realized that introducing a big population all at once produces low-quality games until every player has "grinded out" their appropriate rank.
We decided to shorten this process by requesting that FACEIT manually set players' ranks; the level requested depended on ESEA League experience or on proof of dominating performance in our lower divisions. ESEA does not expose any data publicly, making this process completely manual, so the evaluation had to be quick and foolproof.
ETF2L, however, does expose a massive part of its data publicly, allowing us to automate a huge part of the process. We can build a dataset where, for each player, we know how many matches they've played, whether they won, whether the match was a playoff, and how old it is. From that, we'll be able to decide with our EU admin how to slice the player base for rank distribution.

I've used the following tools: bash, a bunch of coreutils, curl, jq, and PostgreSQL.
The following code snippets are licensed under MIT: https://opensource.org/licenses/MIT
Here is the documentation and base URL for ETF2L's API: http://api.etf2l.org
Please don't spam ETF2L's API; if you'd just like the dataset or certain parts of it, ask me instead.
The documentation specifies that most listable sets are paginated. It turns out the API returns the URL of the next page, which is exactly what you're looking for; props to them for setting that up. We can take advantage of this by bundling requests into a file, in this case pages, and refilling that file with the next pages whenever they appear.

# Fetch every URL listed in the `pages` file (2 downloads in parallel),
# append the JSON responses to `output`, then refill `pages` with the
# next-page URLs found in those responses; stop when none are left.
paginated() {
	while true; do
		rm -rf out
		mkdir -p out
		# each response lands in its own temporary file under out/
		cat pages | xargs -P 2 -n 1 -I URL bash -c 'curl -s "URL" >> $(mktemp -p out)'
		cat out/* >> output
		# select(.) filters out responses without a next_page_url
		jq -r '.page.next_page_url | select(.)' out/* > pages
		if [ ! -s pages ]; then
			break
		fi
	done
}

The best way to build this dataset would have been to fetch the entire history of 6v6 league matches along with the players who participated in each, using something like /results?since=0; that could have been composed nicely into the dataset we want without any further requests. Sadly, the result model offered by this API does not include the list of players.
The only way to find out whether a player was part of a match is to go through that player's results history, which is what /player/:id/results lets us do.
This is quite suboptimal: in total we'll have requested 12 times as many results as needed, since each match shows up once per participant (kinda bad, but that's the only possible way). Still, it beats naively going through the result history of each of the 131616 registered players (very not okay).
We can reduce the work by ignoring anyone who has never been part of a 6v6 team; we can find everyone who has through the endpoint /transfers?type=6on6. At 100 transfers per page, that's still 1924 pages to go through.

echo "http://api.etf2l.org/transfers.json?type=6on6&since=0&per_page=100" > pages
paginated
mv output transfers

This already gives us a hefty set, more than 140 megabytes. Next, let's set up the bigger data fetch: match results for each player, all 32566 of them.

# collect every unique player id seen in the transfer history
jq -s '.[].transfers[].who.id' transfers | sort -u > player_ids
mkdir players
for id in $(cat player_ids); do
	# skip ids we already fetched, so the loop can be resumed
	grep -qx "$id" done 2>/dev/null && continue
	echo "http://api.etf2l.org/player/$id/results.json?since=0&per_page=100" > pages
	paginated
	mv output players/$id
	echo $id >> done
done
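
This takes a long while, so it's handy to watch progress from another shell; a quick sketch, assuming the loop above is running and the done file holds the finished ids:

comm -23 player_ids done | wc -l  # players left to fetch (both files are in sorted order)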

Imagine how much of a hassle it would have been if I hadn't gone through with making a function to follow next pages; a great lesson in life, in all honesty. Now that we have all of this data, let's condense it into something more queryable. Time to start PostgreSQL.

export PGDATA=/tmp/etf2l_dataset
pg_ctl -o "-A trust" init  # throwaway cluster, trust auth so no passwords needed
pg_ctl -w start            # -w waits until the server is ready
createdb $USER
psql -c 'create table match_participation
	( user_id int
	, match_id int
	, primary key (user_id, match_id)
	, ts timestamp
	, competition_id int
	, competition text
	, tier int
	, score int
	, as_merc boolean
	, week int
	, round text
	);'
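
Not strictly needed at this scale, but since some checks below aggregate by match_id (and the primary key index leads with user_id), an optional index speeds those up:

psql -c "create index on match_participation (match_id);"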

Now that the database knows how to host the data, let's import it.

# turn each player's results into CSV rows matching the table above
for i in $(cd players; ls); do
	jq -rs --argjson i $i '
		.[]
			| select(.results)
			| .results[]
			| select(.competition.category == "6v6 Season")
			# score difference from the point of view of the player, positive meaning a win
			| (if .clan1.was_in_team then .r2 - .r1 else .r1 - .r2 end) as $score
			| [$i, .id, (if .time then (.time|todate) else .time end), .competition.id, .competition.name, .division.tier, $score, .merced, .week, .round]
			| @csv
		' players/$i >> import.csv
done
psql -c "copy match_participation from '$PWD/import.csv' with (format csv)"

Some of the tiers are missing! Let's manually fix that. The decisions are at my discretion, based on similarities to the rest of the dataset, but they can be re-evaluated if they appear to be causing issues.

psql -c "update match_participation set tier = 0 where tier is null and competition like '%Premiership%';"
psql -c "update match_participation set tier = 1 where tier is null and competition like '%High%';"
psql -c "update match_participation set tier = 2 where tier is null and competition like '%Mid%';"
psql -c "update match_participation set tier = 3 where tier is null and competition like '%Low%';"
psql -c "update match_participation set tier = 4 where tier is null and competition like '%Open%';"
psql -c "update match_participation set tier = 1 where tier is null and competition like '%Division 1%';"
psql -c "update match_participation set tier = 2 where tier is null and competition like '%Division 2%';"
psql -c "update match_participation set tier = 3 where tier is null and competition like '%Division 3%';"
psql -c "update match_participation set tier = 4 where tier is null and competition like '%Division 4%';"
psql -c "update match_participation set tier = 5 where tier is null and competition like '%Division 5%';"
psql -c "update match_participation set tier = 6 where tier is null and competition like '%Division 6%';"

Here are some cool facts about the dataset: https://twitter.com/twiikuu/status/977546327029637120
I mentioned in one of the tweets that 10 players never joined a team and only ever merced; that statement is false. The assumption came from the total number of match_participation rows being 10 short of a multiple of 12 (228878 + 10 = 19074 * 12). What I did later instead was list the matches whose participation count wasn't equal to 12. It turns out 4060 matches are affected, or 21%, but only 5449 participations are missing, or 2.3%. I am almost certain those missing participations are caused by mis-reporting of status screenshots or by mercs who never registered, meaning there's nothing I can really do about it.

psql -c "select f.* from (select match_id, count(*) as players from match_participation group by match_id) as f where f.players != 12;"

The last thing is to compile easily digestible sums for each player alongside their name and Steam ID, which I will then send to our EU admin. Notice how I lack the skills to craft decent views. Let's also barf on statistical theory by throwing random calculations at the data to help sort through it. Long story short, for bogo_score: playing gives a point and winning gives another; playing in playoffs gives 37 points and winning in playoffs gives another 37; we sum all that and scale it by 1/n, where n is the tier + 1 (the multipliers below are 100/n, rounded). The magic number 37 comes from the fact that a match is 37 times less likely to be a playoffs game. Unless agreed upon by the admin, this score won't be decisive; it's good enough for sorting through stuff.
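
To make the arithmetic concrete, take a hypothetical Mid player (tier 2, so n = 3 and the multiplier is 33) with 10 matches played, 2 of them in playoffs, and 6 wins, 1 of them in playoffs:

psql -c "select (10 + 6 + (2 + 1) * 37) * 33 as bogo_score_example;"  # 4191

And here are the actual table and view: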

psql -c 'create table player
	( user_id int
	, primary key (user_id)
	, name text
	, steam_id text
	);'
jq -rs '.[].transfers[].who | [.id, .name, .steam.id64] | @csv' transfers | sort | uniq >> players.csv
psql -c "copy player from '$PWD/players.csv' with (format csv)"
psql -c "create materialized view player_stats as
	with stats as (
		select
			user_id,
			sum((score > 0 and round not like 'Week%')::int) as won_in_playoffs,
			sum((round not like 'Week%')::int) as played_in_playoffs,
			sum((score > 0)::int) as won,
			count(1) as played,
			sum((tier = 0 and score > 0 and round not like 'Week%')::int) as won_in_t0_playoffs,
			sum((tier = 0 and round not like 'Week%')::int) as played_in_t0_playoffs,
			sum((tier = 0 and score > 0)::int) as won_in_t0,
			sum((tier = 0)::int) as played_in_t0,
			sum((tier = 1 and score > 0 and round not like 'Week%')::int) as won_in_t1_playoffs,
			sum((tier = 1 and round not like 'Week%')::int) as played_in_t1_playoffs,
			sum((tier = 1 and score > 0)::int) as won_in_t1,
			sum((tier = 1)::int) as played_in_t1,
			sum((tier = 2 and score > 0 and round not like 'Week%')::int) as won_in_t2_playoffs,
			sum((tier = 2 and round not like 'Week%')::int) as played_in_t2_playoffs,
			sum((tier = 2 and score > 0)::int) as won_in_t2,
			sum((tier = 2)::int) as played_in_t2,
			sum((tier = 3 and score > 0 and round not like 'Week%')::int) as won_in_t3_playoffs,
			sum((tier = 3 and round not like 'Week%')::int) as played_in_t3_playoffs,
			sum((tier = 3 and score > 0)::int) as won_in_t3,
			sum((tier = 3)::int) as played_in_t3,
			sum((tier = 4 and score > 0 and round not like 'Week%')::int) as won_in_t4_playoffs,
			sum((tier = 4 and round not like 'Week%')::int) as played_in_t4_playoffs,
			sum((tier = 4 and score > 0)::int) as won_in_t4,
			sum((tier = 4)::int) as played_in_t4,
			sum((tier = 5 and score > 0 and round not like 'Week%')::int) as won_in_t5_playoffs,
			sum((tier = 5 and round not like 'Week%')::int) as played_in_t5_playoffs,
			sum((tier = 5 and score > 0)::int) as won_in_t5,
			sum((tier = 5)::int) as played_in_t5,
			sum((tier = 6 and score > 0 and round not like 'Week%')::int) as won_in_t6_playoffs,
			sum((tier = 6 and round not like 'Week%')::int) as played_in_t6_playoffs,
			sum((tier = 6 and score > 0)::int) as won_in_t6,
			sum((tier = 6)::int) as played_in_t6
		from match_participation
		group by user_id
	)
	select
		p.name,
		p.steam_id,
		(0 + (played_in_t0 + won_in_t0 + (played_in_t0_playoffs + won_in_t0_playoffs) * 37) * 100
		   + (played_in_t1 + won_in_t1 + (played_in_t1_playoffs + won_in_t1_playoffs) * 37) * 50
		   + (played_in_t2 + won_in_t2 + (played_in_t2_playoffs + won_in_t2_playoffs) * 37) * 33
		   + (played_in_t3 + won_in_t3 + (played_in_t3_playoffs + won_in_t3_playoffs) * 37) * 25
		   + (played_in_t4 + won_in_t4 + (played_in_t4_playoffs + won_in_t4_playoffs) * 37) * 20
		   + (played_in_t5 + won_in_t5 + (played_in_t5_playoffs + won_in_t5_playoffs) * 37) * 17
		   + (played_in_t6 + won_in_t6 + (played_in_t6_playoffs + won_in_t6_playoffs) * 37) * 14
		) as bogo_score,
		s.*
	from stats s
	join player p on s.user_id = p.user_id;"
psql -c "copy (select * from player_stats order by bogo_score desc) to '$PWD/out.csv' with csv header;"

This concludes the technical explanation; I think I've managed to do a good enough job. I would say that planning ahead helped a lot of this process go faster, especially the "paginated" function. Yep. Oh, here's the final output; feel free to trash your friends for having a lower "bogo_score" than you, and hopefully you realize what bogo or bogus means. https://docs.google.com/spreadsheets/d/1UQ7VtDTBdAtZQ7ycqlbzpTE-ToJdWINkOZrn4lkKLJU/edit?usp=sharing
