Skip to content

Instantly share code, notes, and snippets.

@rvanbruggen
Last active November 15, 2022 15:28
Show Gist options
  • Save rvanbruggen/f3c3b6b49a0a0ca94f16ef6dad93285d to your computer and use it in GitHub Desktop.
Save rvanbruggen/f3c3b6b49a0a0ca94f16ef6dad93285d to your computer and use it in GitHub Desktop.
WorldCup2022 in Qatar

A 2nd, better way to WorldCupGraph

Hours after publishing my previous blogpost about the WorldCup Graph, I actually found a better, and more up to date dataset that contained all the data of the actual squads that are going to play in the actual World Cup in Qatar. I found it on this wikipedia page, which lists all the tables with the actual squads, some player details, coaches etc. as they were announced on 10th/11th of November.

So: I figured it would be nice to revisit the WorldcupGraph, and show a simpler and faster way to achieve the results of the previous exercise. So: I have actually put this data in this spreadsheet, and then downloaded a .csv version:

These two files are super nice and simple, and therefore we can actually use the Neo4j Data Importer toolset to import these really easily.

Using the importer to import the dataset

As you can see from the screenshot below, I built a very similar, but slightly different, datamodel with the data importer tool. If you go to the link above, and then download the .zip file that contains the dataset as well as the model definition, you can load the data in a heartbeat:

The results appear within seconds:

Once we have that, we can start doing some of the queries that we had shared in the previous blogpost.

Let's get the query juices going again

Let's build this up nice and slowly.

Simple query: Let's find "Belgium"

All we do is lookup the NationalTeam node:

MATCH (t:NationalTeam {name: "Belgium"}) 
RETURN t;

Next: Things connected to Belgium

Then we look at a path that explores direct and indirect connections to the "Belgium" node:

  • Direct connections we find like this:

    MATCH path = (t:NationalTeam {name: "Belgium"})--(conn) 
    RETURN path;

  • indirect connections we find like this:

    MATCH path = (t:NationalTeam {name: "Belgium"})-[*..2]-(conn) 
    RETURN path;

    The result looks like this subgraph:

Next, let's look at some Player-centric queries.

Paths between players

First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:

MATCH path = ALLSHORTESTPATHS ((p1:Player {name: "Lionel Messi"})-[*..5]-(p2:Player {short_name: "Kevin De Bruyne"}))
RETURN path
LIMIT 100;

Find matches with players from same club on two sides

The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?

So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):

MATCH path = (c)--(p1:Player)--(t1:NationalTeam {name: "Belgium"})--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path;

We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:

MATCH path = (c)--(p1:Player)--(t1:NationalTeam)--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path

Last but not least: load a database from a backup via http

Again, just a matter of hours after the previous post, I got a tweet from Christoph, with another amazing suggestion.

Would be nice to publish a backup on Github as well so we can use CREATE DATABASE FROM https://.... in Neo4j 5 ;-)

— Christophe Willemsen (@ikwattro) November 15, 2022
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

I had not even heard of this, but it's SO AMAZING, that I really do have to mention it.

Instead of going through the above import procedure, why not just restore a backup to your server, over http???

Here's how you do that:

  • in your neo4j.conf, you have to enable the loading of a database backup from a URI: use this entry to allow that dbms.databases.seed_from_uri_providers=URLConnectionSeedProvider

  • then, all you do is run this statement:

    CREATE DATABASE worldcup22 
    OPTIONS { 
     existingData: "use", 
     seedUri: "https://downloads.graphaware.com/neo4j-db-seeds/world-cup-2022-neo4j.backup"
            }

And then you wait a few seconds for the database to come online, and you are done!

That's all I have for now. I hope this was an interesting update for you, and that you can easily experiment with this yourself as well.

Have fun!

Cheers

Rik

No WorldCup without a WorldCupGraph!

Sources of data

As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:

To make life a bit easier for myself, I selected a couple of the most interesting .csv files abd put them together in a Google Spreadsheet.

There's a couple of worksheets in this document, but I will only use some of them:

  • squad_stats with the definition of the different participating teams
  • players_22 with the different players' information
  • Qatar_group_stage with the definitions of the group stage matches
  • historical_win-loose-draw_ratios_qatar2022_teams with the historical data with regards to the different team's matches

So, now we can proceed to actually importing the data into Neo4j, and have some fun!


Constructing a model

The model will look like this:

Some interesting components to this:

  • on the left hand side of the model you find the player data, including the Clubs, Leagues, Positions and Traits of the Players.
  • in the center of the model is the NationalTeam that the Players play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between the Player and the NationalTeam.
    • HAS_NATIONALITY means that the Player is a potential player for the NationalTeam, but has not necessarily debuted yet.
    • PLAYS_FOR means that the Player has already played for the NationalTeam
  • on the right hand side of the model we find the Matches and the Phases of the Worldcup2022 Tournament.
  • on the far right you will find the stats of the historic Results of the Matches that have been scheduled.

Importing the WorldcupGraph, 2022 edition

Setting up the indexes

CREATE INDEX team_name FOR (t:NationalTeam) ON t.name;
CREATE INDEX player_short_name FOR (p:Player) ON p.short_name;

Running the import

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1122940013" AS csv
CREATE (t:NationalTeam)
SET t = csv;

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1368794561" AS csv
WITH csv
WHERE csv.in_qatar = "TRUE"
CREATE (p:Player)
SET p = csv;

MATCH (p:Player)
SET p.dob = datetime(p.dob)
SET p.club_joined = datetime(p.club_joined);

MATCH (p:Player)
UNWIND split(p.player_positions,", ") as positions
MERGE (po:Position {name: positions})
CREATE (p)-[:PLAYS_TEAM_POSITION]->(po);

MATCH (p:Player)
MERGE (l:League {name: coalesce(p.league_name,"UNKNOWN")})
SET l.level = coalesce(p.league_level,"UNKNOWN")
MERGE (cl:Club {id: coalesce(p.club_team_id,"UNKNOWN")})
SET cl.name = coalesce(p.club_name,"UNKNOWN")
CREATE (p)-[:PLAYS_FOR {position: p.club_position, jersey_number: p.club_jersey_number, join_date: p.club_joined, contract_valid_until: p.club_contract_valid_until}]->(cl)
MERGE (cl)-[:PART_OF_LEAGUE]->(l);

MATCH (p:Player), (t:NationalTeam)
WHERE  p.nationality_name = t.name
AND p.nation_position IS NOT NULL
MERGE (p)-[:PLAYS_FOR]->(t);

MATCH (p:Player), (t:NationalTeam)
WHERE  p.nationality_name = t.name
AND p.nation_position IS NULL
MERGE (p)-[:HAS_NATIONALITY]->(t);

MATCH (p:Player)
UNWIND split(p.player_traits,", ") as traits
MERGE (tr:Trait {name: traits})
CREATE (p)-[:HAS_TRAIT]->(tr);

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=320760693" AS csv
MERGE (t:NationalTeam {name: csv.team})
SET t.date_of_score = csv.date
SET t.rank = csv.rank
SET t.goalkeeper_score = csv.goalkeeper_score
SET t.defense_score = csv.defense_score
SET t.offense_score = csv.offense_score
SET	t.midfield_score = csv.midfield_score;

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=256106829" AS csv
MERGE (t:Tournament {name:"Qatar2022"})
MERGE (ph:Phase {name: csv.phase})
MERGE (ph)-[:PHASE_OF]->(t)
CREATE (m:Match)-[:IN_PHASE]->(ph)
SET m.date = datetime(csv.date)
SET m.id = csv.match
SET m.home_team = csv.country1
SET m.away_team = csv.country2
WITH m
MATCH (t:NationalTeam {name: m.home_team})
CREATE (m)-[:IS_HOME_TEAM]->(t)
with m
MATCH (t:NationalTeam {name: m.away_team})
CREATE (m)-[:IS_AWAY_TEAM]->(t);

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=272034195" AS csv
MATCH (t:Tournament {name:"Qatar2022"})--(ph:Phase)--(m:Match),
(t1:NationalTeam {name: csv.country1})--(m)--(t2:NationalTeam {name: csv.country2})
CREATE (m)-[:HAS_HISTORIC_RESULTS]->(r:RESULTS)
SET r = csv;

Some queries to get the juices going

Simple query: Let's find "Belgium"

MATCH (t:NationalTeam {name: "Belgium"}) 
RETURN t;

Next: Things connected to Belgium

  • Direct connections we find like this:

    MATCH path = (t:NationalTeam {name: "Belgium"})--(conn) 
    RETURN path;
  • indirect connections we find like this:

    MATCH path = (t:NationalTeam {name: "Belgium"})-[*..2]-(conn) 
    RETURN path;

Paths between players

Shortest paths:

MATCH path = ALLSHORTESTPATHS ((p1:Player {short_name: "L. Messi"})-[*..5]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;

All paths, but with a limit:

MATCH path =  ((p1:Player {short_name: "L. Messi"})-[:PLAYS_FOR]-()-[*..4]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;

Finally, I also wanted to take a look at how the matches are going to be played, and how the team composition might have an impact on this.

Find matches with players from same club on two sides

So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):

MATCH path = (c)--(p1:Player)--(t1:NationalTeam {name: "Belgium"})--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path;

We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:

MATCH path = (c)--(p1:Player)--(t1:NationalTeam)--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path
LIMIT 100;

Have fun!

Cheers

Rik

No WorldCup without a WorldCupGraph!

Last week I was having a conversation with one of my dear Neo4j colleagues, and we were talking about the fact that Graphs are simply so much fun to play around with, and that there's nothing like a great interesting dataset to have people really experiment and acquaint themselves with the technology. I know that to be extremely true, and I think I have demonstrated this elaborately over the years on this runaway blog of mine.

Then the conversation turned to a topic that I know very little about: the FIFA World Cup in Qatar that is starting next week. Now, reading this blog you may know that I am a little addicated to my 2 wheeled #mentalhealthmachine, and that chasing a ball across a field seems like a little bit of a game to me - but hey, that's ok! And with this conversation it actually dawned on me that at Neo4j, we had done "Worldcup Graphs" both in 2014 and in 2018: our friend and former colleague Mark Needham was the driving force behind both of those efforts.

You can still see some of the work that Mark did at the time on Github and Medium. It was truly another example of how a cool and timely dataset would get people to explore the wonderful world of graphs and get to know the technology in a fun and interesting way.

So: I decide that it would be nice to do that again. With all the new tech that is coming out of Neo4j with the release of Neo4j 5, that could not be very difficult, right? Let's take a look.

Looking for the world cup data

As with much of the sports data, it's not always that easy to find up to data interesting data. I think that's because much of that data is probably safely hidden behind copyright protection and paywalls - but there's always some stuff around. So after a bit of searching, I found a couple of interesting data sources:

To make life a bit easier for myself, I selected a couple of the most interesting .csv files abd put them together in a Google Spreadsheet.

There's a couple of worksheets in this document, but I will only use some of them:

  • squad_stats with the definition of the different participating teams
  • players_22 with the different players' information
  • Qatar_group_stage with the definitions of the group stage matches
  • historical_win-loose-draw_ratios_qatar2022_teams with the historical data with regards to the different team's matches

So, now we can proceed to actually importing the data into Neo4j, and have some fun!

Constructing a model

Whenever we do any kind of data wrangling, it's always important to ask yourself the question what kind of a data model you want to import the data into. The nice thing about graphs is of course that this super flexible and adaptable, but still you want to take some care to do the import appropriately. After going throught the data, I decided on the following structure:

Some interesting components to this:

  • on the left hand side of the model you find the player data, including the Clubs, Leagues, Positions and Traits of the Players.
  • in the center of the model is the NationalTeam that the Players play for, based on their nationality. Now, the player dataset actually includes a lot of players that will never get selected to their national team - so there's two potential relationships between the Player and the NationalTeam.
    • HAS_NATIONALITY means that the Player is a potential player for the NationalTeam, but has not necessarily debuted yet.
    • PLAYS_FOR means that the Player has already played for the NationalTeam
  • on the right hand side of the model we find the Matches and the Phases of the Worldcup2022 Tournament.
  • on the far right you will find the stats of the historic Results of the Matches that have been scheduled.

So let's get cracking with the import now.

Importing the WorldcupGraph, 2022 edition

Setting up the indexes

Quite a trivial thing to do, but it's always a good idea to set up some indexes before making some imports, as it will make the import transactions more efficient, especially if you are planning to do a lot of MERGEs.

CREATE INDEX team_name FOR (t:NationalTeam) ON t.name;
CREATE INDEX player_short_name FOR (p:Player) ON p.short_name;

Running the import

Once we have that, the import transactions can be done in a large multi-statement import:

Let's take you through these statements quickly.

Importing the national teams in Qatar2022

Using the .csv export of the spreadsheet, we can just point LOAD CSV to that URL and create the NationalTeam nodes:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1122940013" AS csv
CREATE (t:NationalTeam)
SET t = csv;

Importing all the FIFA players whose national team is in Qatar2022

Same process, except that this time we will filter on a specific part of the loaded .csv file, specifically the players of countries that are actually present at the 2022 WorldCup tournament:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=1368794561" AS csv
WITH csv
WHERE csv.in_qatar = "TRUE"
CREATE (p:Player)
SET p = csv;

As a formality, we also changed the date properties on these Player nodes, so that we could use them later on using the right datatypes.

MATCH (p:Player)
SET p.dob = datetime(p.dob)
SET p.club_joined = datetime(p.club_joined);

Connecting players to positions that they play for their teams

Next, we are creating the Position nodes and associating the players with these nodes. That involves splitting and unwiding these positions, as they are currently all put together ion the player_positions field of the Player nodes. Here's how that works:

MATCH (p:Player)
UNWIND split(p.player_positions,", ") as positions
MERGE (po:Position {name: positions})
CREATE (p)-[:PLAYS_TEAM_POSITION]->(po);

Connecting players to Leagues and Clubs

We also want to make sure that we split off the League and Club nodes of the Players - as that will be interesting to use later on when we do some queries around the capabilities of players in the tournament.

Note that we use the COALESCE function quite a bit at this stage of our import: this allows us to efficiently deal with potential NULL values in our dataset. If we know the league or club of a player, we use that value - but if we don't we use the "UNKNOWN" value for that player.

MATCH (p:Player)
MERGE (l:League {name: coalesce(p.league_name,"UNKNOWN")})
SET l.level = coalesce(p.league_level,"UNKNOWN")
MERGE (cl:Club {id: coalesce(p.club_team_id,"UNKNOWN")})
SET cl.name = coalesce(p.club_name,"UNKNOWN")
CREATE (p)-[:PLAYS_FOR {position: p.club_position, jersey_number: p.club_jersey_number, join_date: p.club_joined, contract_valid_until: p.club_contract_valid_until}]->(cl)
MERGE (cl)-[:PART_OF_LEAGUE]->(l);

Connecting Players to Qatar2022 teams

There's two scenarios in this case:

  1. A first scenario, where the player has a Position recorded for the NationalTeam. Then we create the PLAYS_FOR relationship between the Player and the NationalTeam.
MATCH (p:Player), (t:NationalTeam)
WHERE  p.nationality_name = t.name
AND p.nation_position IS NOT NULL
MERGE (p)-[:PLAYS_FOR]->(t);
  1. A second scenarion where we don't have a Position recorded for the Player in the NationalTeam. In that case, we just record that the Player has the nationality of the team - with the HAS_NATIONALITY relationship.
MATCH (p:Player), (t:NationalTeam)
WHERE  p.nationality_name = t.name
AND p.nation_position IS NULL
MERGE (p)-[:HAS_NATIONALITY]->(t);

Connecting Players to traits

Because we have the info, we may as well connect the Player to the Traits in our dataset:

MATCH (p:Player)
UNWIND split(p.player_traits,", ") as traits
MERGE (tr:Trait {name: traits})
CREATE (p)-[:HAS_TRAIT]->(tr);

Adding Ranking information to Teams of Qatar2022

We can also add the ranking information to the NationalTeam nodes like this:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=320760693" AS csv
MERGE (t:NationalTeam {name: csv.team})
SET t.date_of_score = csv.date
SET t.rank = csv.rank
SET t.goalkeeper_score = csv.goalkeeper_score
SET t.defense_score = csv.defense_score
SET t.offense_score = csv.offense_score
SET	t.midfield_score = csv.midfield_score;

Adding Qatar2022 matches

The last exercise is to add the actual Match information of the tournament to the database. The only thing special here is that every match will also have a IS_HOME_TEAM and an IS_AWAY_TEAM relationship to the teams.

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=256106829" AS csv
MERGE (t:Tournament {name:"Qatar2022"})
MERGE (ph:Phase {name: csv.phase})
MERGE (ph)-[:PHASE_OF]->(t)
CREATE (m:Match)-[:IN_PHASE]->(ph)
SET m.date = datetime(csv.date)
SET m.id = csv.match
SET m.home_team = csv.country1
SET m.away_team = csv.country2
WITH m
MATCH (t:NationalTeam {name: m.home_team})
CREATE (m)-[:IS_HOME_TEAM]->(t)
with m
MATCH (t:NationalTeam {name: m.away_team})
CREATE (m)-[:IS_AWAY_TEAM]->(t);

As a very last step in the import process, I have also added historical results to every one of the matches - so that we can potentially better understand likely outcomes of these matches:

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1VdtKb2KF_VvNxKQFELKk0Zpho1mtppi8ckz7gwNrfWQ/export?format=csv&gid=272034195" AS csv
MATCH (t:Tournament {name:"Qatar2022"})--(ph:Phase)--(m:Match),
(t1:NationalTeam {name: csv.country1})--(m)--(t2:NationalTeam {name: csv.country2})
CREATE (m)-[:HAS_HISTORIC_RESULTS]->(r:RESULTS)
SET r = csv;

Note that this actually adds two sets of RESULTS to every match, one always being the inverse of the other.

Now that we have the data loaded into our graph - we can start doing some queries.

Some queries to get the juices going

Let's build this up slowly and nicely.

Simple query: Let's find "Belgium"

All we do is lookup the NationalTeamnode:

MATCH (t:NationalTeam {name: "Belgium"}) 
RETURN t;

Next: Things connected to Belgium

Then we look at a path that explores direct and indirect connections to the "Belgium" node:

  • Direct connections we find like this:

    MATCH path = (t:NationalTeam {name: "Belgium"})--(conn) 
    RETURN path;
  • indirect connections we find like this:

    MATCH path = (t:NationalTeam {name: "Belgium"})-[*..2]-(conn) 
    RETURN path;

The result looks like this subgraph:

Next, let's look at some Player-centric queries.

Paths between players

First, let's find the shortest paths between Lionel Messi and Kevin De Bruyne:

MATCH path = ALLSHORTESTPATHS ((p1:Player {short_name: "L. Messi"})-[*..5]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;

The results are unsurprising, through some of the Trait nodes.

If we look at the paths that are not necessarily the shortest ones, we get very different subgraphs, of course.

MATCH path =  ((p1:Player {short_name: "L. Messi"})-[:PLAYS_FOR]-()-[*..4]-(p2:Player {short_name: "K. De Bruyne"}))
RETURN path
LIMIT 100;

Finally, I also wanted to take a look at how the matches are going to be played, and how the team composition might have an impact on this.

Find matches with players from same club on two sides

The basic idea behind this query is simple: would there be any WorldCup matches, where there would be players playing for National teams, that are actually on the same team during the rest of the season. That could be interesting in many different ways, right?

So here's a simple query specifically for Belgium, essentially looking for a cyclic pattern (starting and ending with the Club node):

MATCH path = (c)--(p1:Player)--(t1:NationalTeam {name: "Belgium"})--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path;

We can also make this more generic, but then we have to limit the results if we want to keep the results within the realms of what the Neo4j Browser can display:

MATCH path = (c)--(p1:Player)--(t1:NationalTeam)--(m:Match)--(t2:NationalTeam)--(p2:Player)--(c:Club)
RETURN path
LIMIT 100;

That's all I have for now. I hope this was an interesting set of ideas and queries for you, and that you can easily experiment with this yourself as well.

Have fun!

Cheers

Rik

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment