In this article, I demonstrate some basic techniques using the pandas data analytics library for Python 3. I will be using the League of Legends 2018 Summer Split data from Oracle's Elixir. This was written on the 23rd of August, 2018.
First, I'd like to be able to view the past games for a team, and some interesting data about them. The dataset is set up so that each game appears twice, once from the viewpoint of each team.
For example, say we are reviewing Echo Fox vs Clutch Gaming, with a gameid of 1002620109. There will be two rows; from Echo Fox's point of view, they got 10 kills and suffered 5 deaths. To Clutch Gaming, that is 5 kills and 10 deaths. Data for each player is shown in a single row, which means 10 rows for each game, plus an extra two for the "team", which shows things like teamkills and teamdeaths. In this case the player field is simply Team.
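To make the layout concrete, here is a small sketch built from made-up rows. The column names gameid, team, player, teamkills and teamdeaths follow the description above; the player names and the reduced set of columns are placeholders, not the real dataset.

```python
import pandas as pd

# Toy rows mimicking the layout described above (illustrative, not real data).
rows = [
    {"gameid": 1002620109, "team": "Echo Fox", "player": "Player A",
     "teamkills": 10, "teamdeaths": 5},
    {"gameid": 1002620109, "team": "Echo Fox", "player": "Team",
     "teamkills": 10, "teamdeaths": 5},
    {"gameid": 1002620109, "team": "Clutch Gaming", "player": "Player B",
     "teamkills": 5, "teamdeaths": 10},
    {"gameid": 1002620109, "team": "Clutch Gaming", "player": "Team",
     "teamkills": 5, "teamdeaths": 10},
]
df = pd.DataFrame(rows)

# Keep only the per-team summary rows: one per team, so two per game.
team_rows = df[df["player"] == "Team"]
print(team_rows[["team", "teamkills", "teamdeaths"]])
```

Filtering on player == "Team" is what separates the two team-level rows from the ten player rows.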
This means that getting both participants into a single row requires some work. First, some basic cleaning of the data is in order:
https://gist.github.com/64a2b3b36fbd599c4a4f829fd37da753
Some of the 0 and 1 values, for example first blood, are saved as strings instead of numbers. I convert them all to floats, and also set blue and red to be integer values.
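A minimal sketch of this kind of cleaning, on a toy frame. The column names fb and side, and the "Blue"/"Red" labels, are assumptions about how the file is laid out; the actual code is in the gist linked above.

```python
import pandas as pd

# Toy frame: "fb" (first blood) and the "Blue"/"Red" labels are assumed names.
df = pd.DataFrame({
    "side": ["Blue", "Red", "Blue", "Red"],
    "fb": ["1", "0", "0", "1"],
})

# Flags stored as strings such as "1" become floats.
df["fb"] = df["fb"].astype(float)

# Encode side as an integer: blue is 0, red is 1.
df["side"] = df["side"].map({"Blue": 0, "Red": 1})

print(df)
```

Numeric flags make it possible to average a column or feed it to a correlation later on.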
Next, get all the games for a single team with a team_games function.
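One way such a helper could look, as a sketch on toy data. The signature and the filtering logic are my assumptions based on the row layout described earlier, and the game ids and player names are placeholders.

```python
import pandas as pd

def team_games(df, team):
    """Return the team-level rows (player == "Team") for one team."""
    return df[(df["player"] == "Team") & (df["team"] == team)]

# Toy data: two games, with placeholder game ids.
df = pd.DataFrame({
    "gameid": ["g1", "g1", "g1", "g2", "g2"],
    "team": ["Echo Fox", "Echo Fox", "Clutch Gaming", "Echo Fox", "TSM"],
    "player": ["Team", "Player A", "Team", "Team", "Team"],
    "teamkills": [10, 10, 5, 7, 12],
})

fox = team_games(df, "Echo Fox")
print(fox)
```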
https://gist.github.com/4e01fa148a9764d274bc50b307071646
Running this gives a table like this:
https://gist.github.com/2dce6cc47f3f1a653ee2eab970bbe3b7
With a ton more data. Now we want the opposite of team_games, opponent_games:
https://gist.github.com/4a0a7b714f3394fdde5cbc09d2eecebd
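As a rough sketch of the idea, opponent_games can select the team-level rows that share a gameid with the chosen team but belong to the other side, and a merge on gameid can then line both teams up in one row. The function bodies, the "_opp" suffix and the toy data are assumptions for illustration, not the code from the gist.

```python
import pandas as pd

def team_games(df, team):
    """Team-level rows for `team`."""
    return df[(df["player"] == "Team") & (df["team"] == team)]

def opponent_games(df, team):
    """Team-level rows for whoever played against `team`."""
    gameids = team_games(df, team)["gameid"]
    teams_only = df[df["player"] == "Team"]
    same_game = teams_only["gameid"].isin(gameids)
    return teams_only[same_game & (teams_only["team"] != team)]

# Toy data with placeholder game ids.
df = pd.DataFrame({
    "gameid": ["g1", "g1", "g1", "g2", "g2"],
    "team": ["Echo Fox", "Echo Fox", "Clutch Gaming", "Echo Fox", "TSM"],
    "player": ["Team", "Player A", "Team", "Team", "Team"],
    "teamkills": [10, 10, 5, 7, 12],
})

# One row per game: our columns plus the opponent's, suffixed with "_opp".
summary = team_games(df, "Echo Fox").merge(
    opponent_games(df, "Echo Fox"), on="gameid", suffixes=("", "_opp")
)
print(summary[["gameid", "team", "teamkills", "team_opp", "teamkills_opp"]])
```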
Putting it all together:
https://gist.github.com/0213ef83a8d80fd481b229002b6e114b
This gives us a nice summary of the results to date:
https://gist.github.com/0a2becd8add2abacead9bc7a5f3fd225
Now that we have a feel for the data, let's do some actual analysis.
Often a strong early game dictates the result - or does it? Let's investigate. We can grab the percentage of first bloods/turrets easily, using our team_games function.
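Because the flags are stored as 0/1, the mean of a column is the share of games in which the team got that objective. A small sketch, assuming the hypothetical column names fb (first blood) and ft (first turret) and a toy frame standing in for the output of team_games:

```python
import pandas as pd

# Toy stand-in for one team's games; column names are assumed.
games = pd.DataFrame({
    "fb": [1.0, 0.0, 1.0, 1.0],
    "ft": [0.0, 0.0, 1.0, 0.0],
})

# Mean of a 0/1 column, times 100, is the percentage of games with that event.
pct = games[["fb", "ft"]].mean() * 100
print(pct)
```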
https://gist.github.com/b2d6932266a373ee26ddc1108b9f55b0
Which shows us:
https://gist.github.com/5faabf360dd41e057e5fb4652878bc60
There are a number of ways to interpret this: TSM doesn't prioritize dragon? Perhaps they have weak lanes (and so rarely get first turret)? How about we check out the rest of the league? First, make games_by_league and teams_by_league functions:
https://gist.github.com/03be1e7c2e2b62881753d4df19b16eda
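These two helpers could be as simple as a filter and a unique call. The sketch below assumes a league column with values such as "NALCS" and "LCK"; both the column name and the labels are assumptions about the file, and the LCK team name is a placeholder.

```python
import pandas as pd

def games_by_league(df, league):
    """All rows belonging to one league."""
    return df[df["league"] == league]

def teams_by_league(df, league):
    """Distinct team names that appear in one league."""
    return games_by_league(df, league)["team"].unique()

# Toy data; league labels and the LCK team name are placeholders.
df = pd.DataFrame({
    "league": ["NALCS", "NALCS", "LCK", "NALCS"],
    "team": ["TSM", "Team Liquid", "LCK Team A", "TSM"],
})

print(teams_by_league(df, "NALCS"))
```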
Now we can loop over each team to collect the stats. Each stat is assigned a key in a dictionary, and the value is a list of the percentages for each team in that category.
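A sketch of that loop on toy team-level rows. The stat names fb and ft are assumed column names, and the teams are placeholders so the numbers are not mistaken for real results.

```python
import pandas as pd

# Toy team-level rows: one row per team per game.
df = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B"],
    "fb": [1.0, 0.0, 1.0, 1.0],
    "ft": [0.0, 0.0, 1.0, 0.0],
})

# One key per stat; each value is a list with one percentage per team.
stats = {"team": [], "fb": [], "ft": []}
for team in df["team"].unique():
    games = df[df["team"] == team]
    stats["team"].append(team)
    stats["fb"].append(games["fb"].mean() * 100)
    stats["ft"].append(games["ft"].mean() * 100)

table = pd.DataFrame(stats).set_index("team")
print(table)
```

The same table can also be produced with df.groupby("team")[["fb", "ft"]].mean() * 100; the explicit loop just mirrors the description above.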
https://gist.github.com/69b90b4ced019dae5414671eb4ac13ec
...or not. We get an error. It turns out one game was cancelled for technical reasons, and since there was no first blood, the column is blank. We can fix this easily, by first changing all whitespace into np.nan, then using dropna to get rid of those rows.
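A minimal sketch of that fix on toy data, assuming the blank lives in a column called fb. Restricting dropna to that column with subset is my choice here; the gist may drop rows differently.

```python
import numpy as np
import pandas as pd

# Toy data: the second game has a blank first-blood value.
df = pd.DataFrame({
    "gameid": ["g1", "g2", "g3"],
    "fb": ["1", " ", "0"],
})

cleaned = (
    df.replace(r"^\s*$", np.nan, regex=True)  # whitespace-only cells become NaN
      .dropna(subset=["fb"])                  # drop rows with no first-blood value
      .astype({"fb": float})                  # the conversion now succeeds
)
print(cleaned)
```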
https://gist.github.com/82d0d4a5dd9d88adfd18981bfe863182
We get this nice table:
https://gist.github.com/9ef47f57418388f604d69c0a3bf7de66
To get some more context, let's add in the total wins for each team and sort by that:
https://gist.github.com/cc113a46b276714e9fa283626f7869d4
Now we get:
https://gist.github.com/8201f3766b55809b1f087026c98cd72d
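Adding a wins column and sorting by it might look like the sketch below. It assumes a result column where 1 means a win, and uses placeholder teams and numbers rather than the real standings.

```python
import pandas as pd

# Toy per-team percentages, indexed by team (placeholder values).
table = pd.DataFrame({"fb": [50.0, 75.0]}, index=["Team A", "Team B"])

# Toy team-level rows; result == 1 is assumed to mean a win.
games = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B", "Team B"],
    "result": [1, 0, 1, 1, 1],
})

# Summing a 0/1 result column per team gives total wins.
wins = games.groupby("team")["result"].sum()

# Assignment aligns on the index (team name), then sort with most wins first.
table["wins"] = wins
table = table.sort_values("wins", ascending=False)
print(table)
```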
First baron is pretty consistent in the top 7. All of them are within three games of each other, but I did expect a bit of a larger gap. First dragon, turret and blood appear completely uncorrelated though (we will see later on, using pandas' corr method, that they are indeed not highly correlated).
Golden Guardians is a bit of an outlier - a lot of first dragons, second only to Team Liquid. This could be for a number of reasons, such as the type of dragon, or some teams favoring early game junglers, for example.
Let's see if there is a relationship between first dragon, turret, etc, and actually winning the game. Create a do_correlation function:
https://gist.github.com/1e195bc7b9dbb53dacc20c9a6886bd07
This gives us:
https://gist.github.com/3768dbe023df25f0c9f1737d60face92
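Correlation ranges from -1 to +1. A value of +1 indicates a perfect positive relationship, -1 a perfectly opposite one, and values near 0 little or no linear relationship. For example, if first blood had a correlation of 1 with result, that would mean a team wins every game in which they get first blood and loses every game in which they don't.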
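To see the two extremes in practice, here is a sketch of a do_correlation helper built on DataFrame.corr, run on toy data. The signature and the column names fbaron and fb are assumptions; the values are constructed so that one column matches result exactly and another is its exact opposite.

```python
import pandas as pd

def do_correlation(games, columns):
    """Pairwise Pearson correlation between the given columns."""
    return games[columns].corr()

# Toy data built to show both extremes.
games = pd.DataFrame({
    "result": [1, 0, 1, 0, 1, 0],
    "fbaron": [1, 0, 1, 0, 1, 0],  # identical to result
    "side":   [0, 1, 0, 1, 0, 1],  # exact opposite of result
    "fb":     [1, 1, 0, 0, 1, 0],  # only loosely related to result
})

corr = do_correlation(games, ["result", "fbaron", "side", "fb"])
print(corr.round(2))
```

In this toy frame fbaron correlates with result at +1 and side at -1, while fb lands somewhere in between.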
As expected, getting the first baron has a high correlation with winning the game, 0.76. First blood also has a high correlation with first dragon - maybe the team gets first blood botlane a lot, then transitions into dragon?
We defined blue side as 0, and red side as 1. Notice everything has a negative correlation with side. That means as side goes up, everything else goes down. In other words, red side is much worse, at least for Team Liquid. Let's see if this tendency extends to other teams.
https://gist.github.com/a9e25ee03ef9f6f6842615b0264f5643
Gives us:
https://gist.github.com/113772610e335ab91880af579437dbc7
First baron is still a solid indicator of who is going to win. First dragon and first turret, however, are largely unrelated - a correlation of 0.11 is weak. However, there is a strong relationship between first blood and first turret - perhaps NA teams have a tendency to play towards bot, often getting first blood (which leads to first turret)? How about compared to Korea's LCK, the strongest region?
https://gist.github.com/46ec0f4be5b8e977aaf802934cd8b127
First turret and first baron both have higher correlations. Perhaps the LCK is better at pushing an advantage? First blood, again, has no obvious relationship to the result. Blue side is slightly favoured, still.
Although machine learning libraries are the latest and greatest tools sweeping the data science community, you can draw some solid conclusions using a simpler library like pandas. I plan to write a follow-up article using scikit-learn to train some models to predict things like first blood, who wins a game, and so forth.
Even if I intend to build a predictive model using a machine learning library, I usually pull in pandas and explore the data first, to get a good intuition for what I'm working with and what kind of model I want to train.
Some areas to explore further are generating graphs, which pandas supports (using matplotlib under the hood), and analyzing single players across multiple splits or even seasons. Perhaps TL's dominant bot lane followed Doublelift when he transferred from TSM? Pandas is the perfect tool for this kind of high level analysis.