Skip to content

Instantly share code, notes, and snippets.

@rvanbruggen
Last active August 2, 2021 13:12
Show Gist options
  • Save rvanbruggen/405b58e4f2b00a31b97bdbe5ae334ae6 to your computer and use it in GitHub Desktop.
Save rvanbruggen/405b58e4f2b00a31b97bdbe5ae334ae6 to your computer and use it in GitHub Desktop.
Summer fun with Musicbrainz: the real "Six degrees of Kanye West"

Summer fun with Musicbrainz: the "real" Six Degrees of Kanye West (part 1/3)

Last year, I had a lot of fun working with a fantastic little tool that a colleague of mine created, to analyse and enhance Spotify playlists using Neo4j. While I was working on that blogpost, and I was experimenting with what little I know of Python code, I came across another example project that Spotify actually highlighted on their website: it's called Six Degrees of Kanye West and it's simply amazing.

The idea behind this site seems to be similar to the "Six Degrees of Kevin Bacon": if you have ever worked with Kevin directly, your Bacon Number is 1. If you have worked with someone that has worked with Kevin, then your Bacon Number is 2. Etc etc - and then this is applied to the idea of musicians working together on songs.

For example, if you go there and you look up some unknown artist (like then inimitable Belgian schlager singer, Helmut Lotti), like this: You then get a really interesting result: Meaning that our dear friend Helmut is only 4 "hops" removed from the big Kanye. Helmut's Kanye Number is therefore: 4.

So, what I am going to try to do in this blogpost series, is to replicate that Using the Musicbrainz dataset. MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public. You can access it freely on their website, but you can also just download a copy of the data and take it for a spin. Guess what I am going to do?

Setting up the local Musicbrainz database

Musicbrainz uses a relational infrastructure to do what it's doing - specifically Postgres. After a few failed attempts using the dump of the RDBMS, I finally ended up installing it on my laptop using from this process - which uses Docker. Here's how this worked.

Installation of the local Docker-based MusicBrainz replica

First, I had to download the Github repository and change current working directory with:

git clone https://github.com/metabrainz/musicbrainz-docker.git
cd musicbrainz-docker

Then I could build the required Docker images with a simple command

sudo docker-compose build

and continue with the creation of the local Postgres database in the Docker container.

sudo docker-compose run --rm musicbrainz createdb.sh -fetch

Once this was finished, I have a full Docker setup running - without too much trouble:

Once I had this, I needed to publish ports of all services to the rest of my environment - but in order to do that I first had to go through through the small step of upgrading bash on OSX. Not too problematic - and then I could just run this command:

admin/configure add publishing-all-ports

and proceed with starting the system.

sudo docker-compose up -d

There we go - we have a local Musicbrainz database, which I can explore in DBeaver just to check if everything is working. Sure enough - all is good! So now we can get towork with the MusicBrainz schema, which you can find over here. Here's what (part of) the model looks like: the model Then parts that we will be most interested in to establish our Kanye Number will be the artist and recording tables.

In the next episode, we will go and start the import process of the MusicBrainz database into our beloved graph database, Neo4j.

CU then,

Rik Van Bruggen

Summer fun with Musicbrainz: the "real" Six Degrees of Kanye West (part 2/3)

In the first article of this series we talked about our mission to recreate the Six Degrees of Kanye West website in Neo4j - and how we are going to use the (Musicbrainz database)[www.musicbrainz.org] to do that. We have a running postgres database, and now we can start the import of part of the dataset into Neo4j to understand what the infamous Kanye Number of artists would be.

Loading data into Neo4j

There's lots of different approaches to loadhing the data, but when I started looking at the model in a bit more detail: the model If we wanted to create our Kanye Number graph out of this, we would need to work that into something like this: To go about that, I started to look into the detail of the schema, and I noticed that the artist and recording tables are linked together to something called the artistcredit. That's what's providing the links between the artists and the recording - the credit. On top of that, as explained on the Musicbrainz website, there are so-called "link tables" that already provide a number of precalculated links between entities. So that lead me to think of a 3 or 4 step import process:

  1. import the artists from the artist table;
  2. import the recordings from the recording table;
  3. import the relationships between artist and recording based on the already present, precalculated link table l_artist_recording;
  4. last but not least, infer a new set of relationsips between artist and recording, based on the artist_credit_name table.

So let's get cracking with that.

Setting up the JDBC connection

Conceptually, what we are going to do is to setup a JDBC connection between our Neo4j database server, and the Musicbrainz Postgres server. Using that connection, we can use apoc.load.jdbc to pull data from the Postgres server, and work with the resultset of that connection to create and/or update our graph database. Take a look at some of the examples on the website, and read up on the documentation. It's really quite easy to work with, especially once we have the connection set up.

The first thing that we need to do, is to dowload the postgres driver from the postgres site and put the .jar file into the plugins directory of Neo4j server. Then we need to restart the Neo4j server, and load the JDBC driver in cypher, so that apoc.load.jdbc can make use of it.

This does the trick:

call apoc.load.driver('org.postgresql.Driver')

Next, we can start loading data through the JDBC driver into our Neo4j graph database.

Part 1: Load the nodes into the graph

Looking back at the four step process above, we are going to start with the loading of the artist nodes first. Here's how we can do that, at scale, by batching our updates into 10000 record chunks using apoc.periodic.iterate - more info on that is over here.

CALL apoc.periodic.iterate(
  'CALL apoc.load.jdbc("jdbc:postgresql://localhost:5432/musicbrainz_db?user=musicbrainz&password=musicbrainz","artist") yield row',
  'CREATE (a:Artist) SET a += row',
  { batchSize:10000, parallel:true})

Next we can load the recordings in a very similar way:

CALL apoc.periodic.iterate(
  'CALL apoc.load.jdbc("jdbc:postgresql://localhost:5432/musicbrainz_db?user=musicbrainz&password=musicbrainz","recording") yield row',
  'CREATE (r:Recording) SET r += row',
  { batchSize:25000, parallel:true})

Now that we have that, we can start looking at loading the relationships into the graph.

Part 2: Load the relationships into the graph

To add the relationships, not only will we need to look up info from the two tables mentioned above (l_artist_recording and artist_credit_name) in the Postgress database, but we will actually also need to look up corresponding Artist and Recording nodes in the graph to create the links. To speed up these lookups, we will want to create some indexes in the graph database. We do that as follows (note that the syntax for this has actually recently changed, and that we now also have our wonderful now relationship property indexes):

create index for (a:Artist) on (a.id);
create index for (a:Artist) on (a.name);
create fulltext index fulltext_artist_name for (a:Artist) on each [a.name];
create index for (a:Artist) on (a.kanyenr);
create index for (r:Recording) on (r.id);
create index for (r:Recording) on (r.name);
create fulltext index fulltext_recording_name for (r:Recording) on each [r.name];
create index for (r:Recording) on (r.artist_credit);
create index for (ral:RecordingArtistLink) on (ral.entity0);
create index for (ral:RecordingArtistLink) on (ral.entity1);
create index for (acn:ArtistCreditName) on (acn.artist);
create index for (acn:ArtistCreditName) on (acn.artist_credit);
create index for ()-[r:RECORDED]->() on (r.source);

Running that is quick, but not that populating the indexes may take some time, especially with higher volumes:

Now we are ready to proceed.

Create the relationship between artist and recording via the l_artist_recording link table

This will be a three step process:

  1. first we actually create temporary nodes in our graph database based on the l_artist_recording table, and call these the (:RecordingArtistLink) nodes.
  2. then we create the relationships between (:Recording) and (:Artist) by reading that (:RecordingArtistLink) node and its .entity0 and .entity1 properties. Based on that, we can create the [:RECORDED] relationship, adding the .source property to it as well.
  3. then we remive the (:RecordingArtistLink) nodes

The Cypher statement for these operations go like this:

CALL apoc.periodic.iterate(
    'CALL apoc.load.jdbc("jdbc:postgresql://localhost:5432/musicbrainz_db?user=musicbrainz&password=musicbrainz","l_artist_recording") yield row',
    'create (link:RecordingArtistLink) set link += row',
    { batchSize:15000, parallel:true});

CALL apoc.periodic.iterate(
  'match (link:RecordingArtistLink) return link',
  'match (r:Recording), (a:Artist)
  where a.id = link.entity0 and r.id = link.entity1
  merge (a)-[:RECORDED {source:"linktable"}]->(r)',
  { batchSize:5000, parallel:false});

CALL apoc.periodic.iterate(
  'match (link:RecordingArtistLink) return link',
  'delete link',
  { batchSize:25000, parallel:true});

Next, we can do something very similar through the other link table.

Create the relationship between artist and recording via the artist_credit_name table

As we saw in the datamodel above, there is another way to infer the link between artists and recordings, which is through the artist_credit_name table. We will actually apply a very similar, 3-step process here:

  1. first we actually create temporary nodes in our graph database based on the artist_credit_name table, and call these the (:ArtistCreditName) nodes.
  2. then we create the relationships between (:Recording) and (:Artist) by reading that (:ArtistCreditName) node and its .artist and .artist_credit properties. The .artist identifier corresponds to the .id property on the (:Artist) nodes, and the .artist_credit property corresponds to the .artist_credit property on the (:Recording) nodes. Based on that, we can create the [:RECORDED] relationship, adding the .source property to it as well.
  3. then we remive the (:ArtistCreditName) nodes

The Cypher statements for these operations are like this:

CALL apoc.periodic.iterate(
  'CALL apoc.load.jdbc("jdbc:postgresql://localhost:5432/musicbrainz_db?user=musicbrainz&password=musicbrainz","artist_credit_name") yield row',
  'CREATE (acn:ArtistCreditName) SET acn += row',
  { batchSize:15000, parallel:true});
CALL apoc.periodic.iterate(
"match (acn:ArtistCreditName) return acn",
"match (r:Recording), (ar:Artist)
where acn.artist = ar.id and r.artist_credit = acn.artist_credit
merge (r)<-[rec:RECORDED]-(ar)
on create set rec.source='artistcredit'",
{batchSize:10000, parallel:false});
CALL apoc.periodic.iterate(
  'match (link:ArtistCreditName) return link',
  'delete link',
  { batchSize:25000, parallel:true});

Now we actually have a fully functioning Musicbrainz graph, ready to rumble.

And as you can see it's quite a cool and sizeable dataset:

match ()-[r:RECORDED]->() return count(r);

amd

match (n) return distinct labels(n), count(n);

In the next part of this blogpost series, we will explore the querying of this graph together.

Hope this was fun and useful.

Cheers

Rik Van Bruggen

Summer fun with Musicbrainz: the "real" Six Degrees of Kanye West (part 3/3)

So: now that we have had some fun setting up our local Musicbrainz database (part 1), and importing the data into our Neo4j database (part 2), we can now start having some fun. That means: checking if that actual 6 degrees of Kanye West, and the actual "Kanye Number", is findable and reproducible in our Neo4j database, in an efficient way. Let's take a look at that.

Note: part of this effort was actually motivated by the fact that I have noticed that the python code that powers the above website, actually caches the results (see the github repo for more info) rather than calculate the Kanye Number in real time like we will do here. I guess that speaks to the power of graph databases, right?

But let's take a look at some queries.

Find other artists that worked together with Kanye

Let's start with some simple

match (kanye:Artist {name: "Kanye West"})--(r:Recording)--(a2:Artist)
return kanye,r,a2
limit 100

That gives you a bit of a peek already: Kanye's co-recorders

Now let's continue on our Kanye Number mission.

Find Kanye Number for Helmut Lotti

Here's the query for that:

match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains "Helmut Lotti"
with kanye, a
match path = shortestpath((kanye)-[*]-(a))
return a.name, length(path)/2 as kanyenumber;

Helmut Lotti KanyeNR

Of course, you can also look at that graphically, and return ALL the shortest paths as paths:

match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains "Helmut Lotti"
with kanye, a
match path = allshortestpaths((kanye)-[*]-(a))
return path;

Helmut Lotti KanyeNr as a graph

Find Kanye Number for Bruce Springsteen using fulltext

Now, you will have noticed that I had to spell Helmut Lotti with a capital H and a capital L. That's not very handy - as we never really know how an artist may get spelled. So therefore, I actually implemented a so-called fulltext index on the .name property of the Artists. This allows us to ignore the case of the name of the artists, and even get a bit of a benefit from using some of the smart Lucene indexes that Neo4j provides for fulltext indexing of text.

Let's use a search for Bruce Springsteen, uncapitalised, as an example. The query for the Kanye Number then looks a little different, as we have to look up The Boss a little differently:

match (kanye:Artist {name: "Kanye West"})
with kanye
CALL db.index.fulltext.queryNodes("fulltext_artist_name", "bruce springsteen") YIELD node
with kanye, node
limit 5
match path = shortestpath((kanye)-[*]-(node))
return node.name, length(path)/2 as kanyenumber;

Bruce Springsteen Kanyenr fulltext

So that worked really well as well :) ... and it works in near real time for most, if not all, artists in the Musicbrainz database. Pretty neat!

Now, let's see how we could actually mimic the process that the [6degreesofkanyewest.com](6 Degrees of Kanye West website) uses, ie. precalculating the Kanye Number for every artist. In my mind, this is not a great approach (as we would need to rerun this with every new release, right?), but it's very much possible to do this in Neo4j. Let's see how that would work.

Calculate the Kanye Number in Batch for ALL artists

Essentially, we would need / want to create a process that runs over all the artists, runs the shortestpath algo, and then writes back the found number to the database. Here's a query that does that, for 500 artists at a time:

call apoc.periodic.iterate(
  "match (a:Artist), (k:Artist {name:'Kanye West'})
  return a,k",
  "match path = shortestpath((k)-[*]-(a))
    set a.kanyenr = length(path)/2",
  {batchSize:500, parallel:false});

Calculate Kanyenr in batch

Now, there's no magic here: running that query takes hours - I had to let it churn away at the dataset for an entire night (38406500ms / 1000 / 60 / 60 = 10,66 hours), but then it was done:

Result of batch kanyenr calculation

Of course, the big plus then is that you can now just read the kanyenumber from the property:

match (a:Artist)
where a.name contains "Bruce Springsteen"
return a.name, a.kanyenr

Reading kanyenr from property

There's a few more interesting experiments that I wanted to take a look at.

What are the most important artists with lowest Kanyenr

This was something I wanted to know. Artists with a low Kanyenr (ie. they have directly worked with the Big K himself) are not always equal: some artists have actually worked with Kanye a LOT, and others just once or twice. So the number of relationships matters here. Let's run that query:

match (kanye:Artist {name: "Kanye West"})--(r:Recording)--(a2:Artist)
return distinct a2.name, count(r) as nrofrecordings, a2.kanyenr
order by nrofrecordings desc
limit 100;

artists with low kanyenr and lots of recordings

So we see that Jay-Z and Kanye are actually great friends, it seems. Or, as I just found out (there goes my popmusic street credibility!!!), they were friends, then they weren't, and now they are again? OMG!

Recordings with the most artists

Another interesting thought that came in my mind. What recording, would actually have the most participating artists? I am thinking Band Aid or USA for Africa style.

Well, let's take a look at the latter first: USA for Africa That's interesting - there's about 54 artists in that subgraph - but is that it? Are there recordings with more artists?

Let's explore:

match (r:Recording)
with r, apoc.node.degree(r) as degree
order by degree desc
limit 10
match (a:Artist)-[rec:RECORDED]->(r)
return r.name as Recording, degree as NrOfArtists, apoc.coll.sort(collect(a.name));

Recordings with most artists in recording There's super interesting. Lots of Asian songs in the top, but also this We Have Seen God's Glory song, which indeed does host a very large number of participating artists!

Last but not least: exploring MusicBrainz in Bloom

Getting to the end of our exploration here, I decided to take the dataset for a spin inside Neo4j Bloom as well. So I created a perspective and a few search phrases as well:

Helmut Lotti's Kanyenr in Bloom using a search phrase

This search phrase allows me to look at Helmut Lotti's Kany Number:

match (kanye:Artist {name: "Kanye West"})
with kanye
match (a:Artist)
where a.name contains $artistname
with kanye, a
match path = allshortestpaths((kanye)-[*]-(a))
return path

Helmut Lotti in Bloom

Pretty neat!

Co-recordings of artist in Bloom

And this one allows me to look at the co-recordings of an artist:

match path = (a:Artist)-[:RECORDED]->(r:Recording)<-[:RECORDED]-(a2:Artist)
where a.name contains $artist
return path

Co-recordings of artist in Bloom

There's obviously endless additional possibilities, and we can share more of them in the Github repo.

In any case I hope you have enjoyed reading my experiments here - and I look forward to your feedback.

Cheers

Rik Van Bruggen

{"name":"KanyeNrPerspective","id":"9189d010-eec8-11eb-be64-0ff2af080e7e","categories":[{"id":1,"name":"Artist","color":"#FFE081","size":1,"icon":"DB188874-D25F-4B34-A296-E5E950072319","labels":["Artist"],"properties":[{"name":"type","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"gid","exclude":false,"isCaption":false,"dataType":"string"},{"name":"id","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"edits_pending","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"sort_name","exclude":false,"isCaption":false,"dataType":"string"},{"name":"name","exclude":false,"isCaption":true,"dataType":"string"},{"name":"ended","exclude":false,"isCaption":false,"dataType":"boolean"},{"name":"comment","exclude":false,"isCaption":false,"dataType":"string"},{"name":"area","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"begin_area","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"last_updated","exclude":false,"isCaption":false,"dataType":"LocalDateTime"},{"name":"begin_date_month","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"gender","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"kanyenr","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"begin_date_day","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"begin_date_year","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"end_area","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"end_date_year","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"end_date_month","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"end_date_day","exclude":false,"isCaption":false,"dataType":"bigint"}],"caption":[""],"createdAt":1627383076776,"lastEditedAt":1627453304919},{"id":2,"name":"Recording","color":"#C990C0","size":1,"icon":"4286028B-94A1-4832-9B18-E294D67AD42C","labels":["Recording"],"properties":[{"name":"video","exclude":false,"isCaption":false,"dataType":"boolean"},{"name":"artist_credit","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"name","exclude":false,"isCaption":true,"dataType":"string"},{"name":"id","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"length","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"edits_pending","exclude":false,"isCaption":false,"dataType":"bigint"},{"name":"last_updated","exclude":false,"isCaption":false,"dataType":"LocalDateTime"},{"name":"comment","exclude":false,"isCaption":false,"dataType":"string"},{"name":"gid","exclude":false,"isCaption":false,"dataType":"string"}],"caption":[""],"createdAt":1627383076776,"lastEditedAt":1627383355871}],"labels":{"Artist":[{"propertyKey":"type","type":"Artist","dataType":"bigint"},{"propertyKey":"gid","type":"Artist","dataType":"string"},{"propertyKey":"id","type":"Artist","dataType":"bigint"},{"propertyKey":"edits_pending","type":"Artist","dataType":"bigint"},{"propertyKey":"sort_name","type":"Artist","dataType":"string"},{"propertyKey":"name","type":"Artist","dataType":"string"},{"propertyKey":"ended","type":"Artist","dataType":"boolean"},{"propertyKey":"comment","type":"Artist","dataType":"string"}],"Recording":[{"propertyKey":"video","type":"Recording","dataType":"boolean"},{"propertyKey":"artist_credit","type":"Recording","dataType":"bigint"},{"propertyKey":"name","type":"Recording","dataType":"string"},{"propertyKey":"id","type":"Recording","dataType":"bigint"},{"propertyKey":"length","type":"Recording","dataType":"bigint"},{"propertyKey":"edits_pending","type":"Recording","dataType":"bigint"},{"propertyKey":"last_updated","type":"Recording","dataType":"LocalDateTime"},{"propertyKey":"comment","type":"Recording","dataType":"string"},{"propertyKey":"gid","type":"Recording","dataType":"string"}]},"relationshipTypes":[{"properties":[{"propertyKey":"source","type":"RECORDED","dataType":"string"}],"name":"RECORDED","id":"RECORDED"}],"palette":{"colors":["#FFE081","#C990C0","#F79767","#57C7E3","#F16667","#D9C8AE","#8DCC93","#ECB5C9","#4C8EDA","#FFC454","#DA7194","#569480","#848484","#D9D9D9"],"currentIndex":2},"createdAt":1627383076753,"lastEditedAt":1627383076753,"templates":[{"name":"Corecordings","id":"tmpl:1627453802317","createdAt":1627453802317,"text":"Corecordings of $artist","cypher":"match path = (a:Artist)-[:RECORDED]->(r:Recording)<-[:RECORDED]-(a2:Artist)\nwhere a.name contains $artist\nreturn path","params":[{"name":"$artist","dataType":"String","collapsed":false,"suggestionLabel":"Artist","suggestionProp":"name","suggestionBoolean":false,"cypher":null}],"hasCypherErrors":false},{"name":"Kanyenr for artist","id":"tmpl:1627383231912","createdAt":1627383231912,"text":"Kanyenr for $artistname","cypher":"match (kanye:Artist {name: \"Kanye West\"})\nwith kanye\nmatch (a:Artist)\nwhere a.name contains $artistname\nwith kanye, a\nmatch path = allshortestpaths((kanye)-[*]-(a))\nreturn path","params":[{"name":"$artistname","dataType":"String","collapsed":false,"suggestionLabel":"Artist","suggestionProp":"name","suggestionBoolean":false,"cypher":null}],"hasCypherErrors":false},{"name":"Artist by name","id":"tmpl:1627383122580","createdAt":1627383122580,"text":"Oeuvre of $param","cypher":"match path = (a:Artist)-[rec]-(r:Recording)\nwhere a.name = $param\nreturn path","params":[{"name":"$param","dataType":"String","collapsed":false,"suggestionLabel":null,"suggestionProp":null,"suggestionBoolean":false,"cypher":null}],"hasCypherErrors":false}],"hiddenRelationshipTypes":[],"hiddenCategories":[],"hideUncategorisedData":false,"isAuto":true,"parentPerspectiveId":false,"metadata":{"pathSegments":[{"source":"Artist","relationshipType":"RECORDED","target":"Recording"}],"indexes":[{"type":"full-text","state":"ONLINE","tokenNames":["Artist"],"name":"fulltext_artist_name","propertyNames":["name"],"provider":{"key":"fulltext","version":"1.0"}},{"type":"full-text","state":"ONLINE","tokenNames":["Recording"],"name":"fulltext_recording_name","propertyNames":["name"],"provider":{"key":"fulltext","version":"1.0"}},{"label":"Recording","type":"native","propertyKeys":["id","name"]},{"label":"Artist","type":"native","propertyKeys":["id","name"]}],"stats":{"labels":{},"relationshipTypes":{"RECORDED":35538187}}},"version":"1.7.0"}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment