
Clustering Pollock

I am not very passionate about art: when I visit a museum, I am a casual tourist who walks around observing paintings and sculptures, trying to learn as much as possible, unfortunately without appreciating the depth behind an art piece.

A few years ago I was at a cocktail party organized inside the Peggy Guggenheim Collection museum in Venice, and I was lucky enough to see a Pollock painting, Alchemy, live for the first time. Jackson Pollock was an influential American painter, and the leading force behind the abstract expressionist movement in the art world [[1]](https://www.jackson-pollock.org/). Pollock is well known for his use of the Drip Painting technique, a form of abstract art in which paint is dripped or poured onto the canvas [[2]](https://en.wikipedia.org/wiki/Pollock).

I remember being really fascinated by the fact that, in some way, my mind was completely caught by a painting made of nothing but colors randomly dripped on a canvas. I probably realized in that moment (while drinking a very expensive Aperol Spritz) that I was not looking at randomness, but at something that was created to be beautiful.

This introduction about my very superficial art knowledge is just to explain why and when I became curious about Pollock and, as a consequence, why I decided to spend a few hours on the analysis below.

How did Pollock's usage of colors evolve over time? To answer this, I decided to do some experiments with clustering, applying a few algorithms and plotting some charts.

Data

To do clustering on Pollock's paintings I needed a reliable source from which I could download the paintings and extract other data. A quick search on Google about the artist led me to this [website](https://www.jackson-pollock.org/) (it is not officially related to Pollock himself). I scraped the paintings from the website, also saving the year and the name of each artwork.

Pollock created many masterpieces of very different sizes, so I decided to retrieve the size of each painting: to understand the artist's usage of color, it might be interesting to include the canvas size in the mix. Unfortunately, this information was not available on jackson-pollock.org, so I had to search for each piece on Google... manually. I ended up with a csv file and a bunch of jpg images. Of course, my dataset contains only some of Pollock's paintings and not his whole production; anyway, it's enough for what's coming next.

Pollock's paintings used in the analysis

The Analysis

With all the information needed, we can now start to have some fun with Python.

First of all, we need to decide whether to rescale the images according to their original size (i.e. the canvas size) or not. This is a crucial decision due to the fairly high variability of painting dimensions. We have at least two options:

  • Rescale the images according to the original size: we would give more weight to the colors contained in bigger canvases (e.g. Autumn Rhythm (Number 30) is a 14-square-meter piece). Doing so, we would be evaluating "how many buckets" of each color Pollock used in his career.
  • Rescale the images to a fixed dimension, the same for each painting: every painting will have the same weight. What we would evaluate is the proportion of each color across the paintings. The smaller the images, the faster the algorithm, but the poorer the clustering results.

Pollock's paintings size distribution

I think it's more interesting to analyze the proportions, so we'll go for the second route: we will reshape the images to a 200x200 pixel square.

The following code is a simple snippet that resizes the images to the desired shape (if the size parameter is None, the rescaling is done according to the data present in the csv file). The SCALE_FACTOR constant helps to reduce the complexity of the clustering algorithms; here it is set to 1. The actual rescaling is done using OpenCV.

https://gist.github.com/67d494f7cd6bb62992738a96839f1de6
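Since the gist is not reproduced here, here is a minimal sketch of what such a snippet could look like, assuming OpenCV (`cv2`) and pandas; the function name and the csv columns (`name`, `width_cm`, `height_cm`) are my own placeholders, not the original code:

```python
import cv2
import pandas as pd

SCALE_FACTOR = 1  # shrinks the real canvas size; 1 keeps the original proportions

def resize_painting(image_path, size=None, sizes_df=None, name=None):
    """Resize a painting to a fixed square (size) or, if size is None,
    proportionally to its real canvas size taken from the csv."""
    img = cv2.imread(image_path)
    if size is not None:
        # Fixed dimension, the same for every painting (e.g. 200x200)
        return cv2.resize(img, (size, size))
    # Rescale according to the canvas size stored in the csv
    row = sizes_df[sizes_df["name"] == name].iloc[0]
    width = int(row["width_cm"] * SCALE_FACTOR)
    height = int(row["height_cm"] * SCALE_FACTOR)
    return cv2.resize(img, (width, height))
```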

As I wrote earlier, what we want to do is analyze how Pollock varied his usage of colors during his activity. Of course we can't use every single color we find in the dataset (we could potentially end up with 16M+ possible values). What we'll do instead is perform a clustering to reduce the number of data points to plot; to do so, we are going to use kmeans.

kmeans aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster (Wikipedia). The algorithm tries to minimize the Euclidean distance between the cluster centers and the points in the cluster. In our case, the prototypes will represent the colors that we will track across the years: the intuition behind this choice is that kmeans will (hopefully) group similar colors together; therefore, by tracking a prototype, we will be tracking the many colors that are similar to it.

To follow a color across the years, we'll need to perform the clustering considering all the paintings at once: if we ran kmeans on each single image, we would end up with 54 different models, each with its own set of prototypes.

The first step, though, is to read every single image and stack them all into one dataset. For the clustering task we are going to use the RGB (Red-Green-Blue) color space, so the dataset will have:

  • 1 row for each single pixel in our images
  • 3 columns, one for the Red channel, one for the Green and another for the Blue

Not the most efficient way to stack images, but it will do the trick: https://gist.github.com/5b879f2000eabfde70302dc51622429c
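As a hedged sketch of that stacking step (the folder name `paintings_resized/` is an assumption of mine): read every resized painting, flatten it to one row per pixel, and pile everything into a single array.

```python
import glob
import cv2
import numpy as np

pixel_blocks = []
for path in glob.glob("paintings_resized/*.jpg"):
    img = cv2.imread(path)                      # OpenCV loads images as BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # switch to the RGB color space
    pixel_blocks.append(img.reshape(-1, 3))     # one row per pixel, 3 channels

stacked_images = np.vstack(pixel_blocks)        # shape: (total_pixels, 3)
```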

In the stacked_images variable, we have exactly the dataset that I described earlier.

Before fitting the model, we still need a bit of preprocessing: we are loading the images in RGB format, where each channel value can range from 0 to 255, so we'll rescale everything into the [0,1] range.

https://gist.github.com/dd8690d5f221efc610ccf3013429d5af
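A sketch of the rescaling plus the fit itself, assuming scikit-learn's `KMeans` as the implementation (the original gist may use something different); 20 clusters matches the number of prototypes mentioned below:

```python
import numpy as np
from sklearn.cluster import KMeans

# Bring every channel from [0, 255] into the [0, 1] range
X = stacked_images.astype(np.float64) / 255.0

# Fit kmeans with 20 clusters: each cluster center is a "prototype" color
kmeans = KMeans(n_clusters=20, random_state=42)
kmeans.fit(X)

# Prototypes back in the familiar 0-255 RGB range
prototypes = (kmeans.cluster_centers_ * 255).astype(np.uint8)
```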

After this step, we should have identified the 20 main colors used by Pollock in his paintings! First of all, let's see them.

https://gist.github.com/41071139ac320bc76154759a8f884788

To create the image below, I converted the pixels to the HSV color space (Hue-Saturation-Value) and sorted the colors according to their Hue, Saturation and Value (in this order); each segment's size shows the proportion of that color across (all) the paintings.
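As a rough, non-authoritative sketch of that chart (using `colorsys` and matplotlib, both my own choices): compute each cluster's share of all pixels, sort the prototypes by (Hue, Saturation, Value), and draw one horizontal segment per cluster.

```python
import colorsys
import numpy as np
import matplotlib.pyplot as plt

# Share of all pixels assigned to each of the 20 clusters
labels = kmeans.labels_
proportions = np.bincount(labels, minlength=20) / len(labels)

# Sort clusters by Hue, then Saturation, then Value
hsv = [colorsys.rgb_to_hsv(*(p / 255.0)) for p in prototypes]
order = sorted(range(20), key=lambda i: hsv[i])

fig, ax = plt.subplots(figsize=(10, 1))
left = 0.0
for i in order:
    ax.barh(0, proportions[i], left=left, color=prototypes[i] / 255.0)
    left += proportions[i]
ax.axis("off")
plt.show()
```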

Overall colors with proportions

If we take a look at some of the clusters, we can see that the algorithm is kind of working; there is still some noise, i.e. colors that appear "not very similar" to the cluster's prototype color. One of the reasons behind this is the distance metric we used to decide when to group two colors into the same bucket; kmeans works with the Euclidean distance, so it could be worth trying other distance metrics and other algorithms to see how the results change (I tried [spherical kmeans](https://www.jstatsoft.org/article/view/v050i10/v50i10.pdf), but with poor results).

Anyway, the results are not that bad.

Cluster Example

Cluster Example

Cluster Example

Now that we have the fitted model, we can just feed it all the paintings, one by one, and see the proportion of each prototype in every artwork (some examples below).
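A minimal sketch of that scoring step, with a hypothetical helper of my own (not the author's gist): predict the cluster of every pixel of a single resized painting and turn the counts into proportions.

```python
import numpy as np

def color_proportions(img_rgb, kmeans, n_clusters=20):
    """Fraction of the painting's pixels assigned to each prototype."""
    pixels = img_rgb.reshape(-1, 3).astype(np.float64) / 255.0
    labels = kmeans.predict(pixels)
    return np.bincount(labels, minlength=n_clusters) / len(labels)

# e.g. proportions = color_proportions(resized_alchemy, kmeans)
```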

The Flame

Eyes in the heat

Visualizations

At this point, we have all that we need to play around with some visualizations. The data that we are going to use are:

  • The cluster prototypes and their RGB components.
  • A bunch of csv files, one for each artwork, recording how many times each color prototype has been used:

https://gist.github.com/a6f12b0e02d6a80b82f0a8acab8ae9c7
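Only as a sketch of how those per-artwork files might be assembled; the folder `proportions/` and the columns `cluster` and `count` are assumptions about the export format:

```python
import glob
import os
import pandas as pd

frames = []
for path in glob.glob("proportions/*.csv"):
    df = pd.read_csv(path)                      # assumed columns: cluster, count
    df["painting"] = os.path.splitext(os.path.basename(path))[0]
    frames.append(df)

proportions_df = pd.concat(frames, ignore_index=True)
```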

Let's pick some paintings (my favourites among those that we are analyzing) and see which colors kmeans sees in them:

Alchemy

Going west

Out of the web

The idea, though, was to see the evolution of Pollock's colors across the years. What we can do is analyze the data along the time dimension: we already have all the proportions for each image, so we only need to group the resulting dataframes by the associated year.

https://gist.github.com/2256038cf9e80f339ed7ab93914b5f9e
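A sketch of that aggregation with pandas, assuming a `year` column joined in from the scraped metadata (again, column names are my own placeholders):

```python
# Sum the counts per (year, cluster), then normalize each year to proportions
yearly = (proportions_df
          .groupby(["year", "cluster"])["count"]
          .sum()
          .unstack(fill_value=0))
yearly = yearly.div(yearly.sum(axis=1), axis=0)
```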

A river chart can show how the color usage evolved over time.
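One possible way to draw it, sketched with matplotlib's `stackplot` and the `yearly` table from the previous snippet; the actual chart below may have been produced with a different tool:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 5))
ax.stackplot(yearly.index,
             yearly.T.values,                                      # one band per prototype
             colors=[prototypes[c] / 255.0 for c in yearly.columns],
             baseline="wiggle")                                     # gives the "river" look
ax.set_xlabel("Year")
plt.show()
```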

Proportion over years

Conclusions

What can I say? From the analysis above, it appears that Jackson Pollock progressively abandoned saturated colors (yellows, reds), focusing his attention on more faded, beige-ish shades, with a prevalence of greys.

Of course, we only considered a subset of his paintings, and I wouldn't be surprised to see different results when analyzing the full portfolio. Note also that we had fewer samples from the 1930s, so it's expected that the colors in the later years appear more uniformly distributed.

Sometimes I am not really happy with the results of kmeans, but I think the performance would improve by increasing the size of the reshaped images (those on which we ran the clustering); that said, consider that with 54 images of 200x200 pixels, we already need to cluster 2.16 million points.

Finally

If you enjoyed the article and want to take some time to share it, you'd make me really happy! If you didn't like it, please tell me why in the comments!

Feel free to connect with me on [LinkedIn](https://www.linkedin.com/in/andreaialenti/)!

About me

I am a Data Scientist with expertise in Machine Learning, Business Intelligence and Software Development. Artificial Intelligence enthusiast with experience in different Advanced Analytics tasks (Classification, Regression, NLP, Image Analysis). I enjoy explaining Data Science to business users.
