Instagram case study II
In the last Instagram case study we analyzed a few metrics that can be used to gauge the quality of an Instagram profile, the most important of which are the engagement rate and the followers-to-following ratio.
At the end of that post we mentioned that we were about to add 6.5 million profiles to our database, a 130% increase from the 5 million profiles we had. In the meantime, however, we also made numerous improvements to our social engines, and in particular we radically changed the way we discover new profiles. As a result, we now have not the 11.5 million profiles we expected, but 24 million.
In this post we will investigate the geographical data associated with the profiles, the relations between some important hashtags, posting frequency, and audience metrics.
The geographical location of a profile is estimated by taking into account a variety of factors, such as the biography of the profile, its posts and the location tags on each post. The profiles in our database are distributed according to the following choropleth map:
As can be seen in the image, the countries with the highest number of Instagram profiles are the United States, Canada, Australia, the United Kingdom and Indonesia. Right behind are Russia, India, the principal European countries and South American countries like Argentina, Venezuela and Colombia.
Location data can also be used to infer the language spoken by Instagram users (only official languages are considered):
English and Spanish are the most common by a wide margin. French is in third place thanks to Canada and all the African countries where it is an official language, and Portuguese follows with similar numbers, being Brazil's official language. Arabic, Russian and Indonesian also have a strong presence in the top ten.
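The language breakdown above can be sketched as a lookup from estimated country to official languages. This is only an illustration: the `OFFICIAL_LANGUAGES` table, the function name and the choice to count a profile once per official language of its country are all assumptions, not a description of our production code.

```python
from collections import Counter

# Hypothetical lookup table: ISO country code -> official languages.
# Only a few entries are shown; a real table would cover every country.
OFFICIAL_LANGUAGES = {
    "CA": ["English", "French"],
    "BR": ["Portuguese"],
    "CO": ["Spanish"],
    "ID": ["Indonesian"],
    "RU": ["Russian"],
}

def count_languages(profile_countries):
    """Count profiles per official language of their estimated country."""
    counts = Counter()
    for country in profile_countries:
        # Assumption: a profile from a multilingual country is counted
        # once for each official language of that country.
        for language in OFFICIAL_LANGUAGES.get(country, []):
            counts[language] += 1
    return counts

# Toy input: one estimated country code per profile.
language_counts = count_languages(["CA", "CA", "BR", "CO", "ID"])
print(language_counts.most_common())
```

Counting every official language is what lets a country like Canada contribute to both English and French, which matches the observation about French above.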
These results are in line with the data provided by Statista, suggesting that our software collects profiles in a uniform manner:
In particular, we can focus on the profiles from the United States:
California leads the rankings with almost one-third of all the Instagram profiles from the US. Texas, New York and Florida follow closely. Once again we are satisfied with these results, since they mirror the way the population is distributed across US states: if the population is concentrated in a handful of states, clearly the influencers will be as well. This is also the reason why there are so few influencers from the Midwest states.
Our software collects hashtags from all the posts it analyzes. So far we have collected more than 20 million unique hashtags. While a lot of those are obscure and fairly meaningless, the most frequent ones in our database are also the most frequent ones on Instagram, with #repost among the first five.
In this analysis we focus our attention on the tags that signal sponsored or promoted content. The most used ones include #sponsored. The next chart shows the number of profiles that have used these hashtags in their posts (note: the bars indicate the number of profiles, not the number of posts containing a particular hashtag, which can easily be looked up on Instagram if desired).
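The distinction between counting profiles and counting posts can be made concrete with a short sketch. The input format (pairs of profile id and hashtag list) and the function name are assumptions made for this example.

```python
from collections import defaultdict

def profiles_per_hashtag(posts, tags_of_interest):
    """Return {hashtag: number of distinct profiles that used it at least once}."""
    users = defaultdict(set)
    for profile_id, hashtags in posts:
        for tag in hashtags:
            tag = tag.lower()
            if tag in tags_of_interest:
                # A set stores each profile once, however often it used the tag.
                users[tag].add(profile_id)
    return {tag: len(users[tag]) for tag in tags_of_interest}

# Toy data: "alice" uses #ad in two posts but is counted as one profile.
posts = [
    ("alice", ["#ad", "#fitness"]),
    ("alice", ["#ad"]),
    ("bob", ["#sponsored", "#ad"]),
    ("carol", ["#travel"]),
]
profile_counts = profiles_per_hashtag(posts, {"#ad", "#sponsored", "#sp"})
print(profile_counts)
```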
We wondered how these hashtags were related to each other, and it turns out that the connections are extremely complicated. In the image below, which has been severely simplified in an attempt to maintain readability, one can observe which topics are most frequent with the most common promotional hashtags.
The blue circles represent the hashtags most frequently used in combination with the ones that indicate sponsored content. There is not enough space to label them all, so we just summarized the most common topics.
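A graph like this one can be built from co-occurrence counts: for every post, each sponsored hashtag is paired with every other hashtag in the same post. The sketch below shows one simple way to do that. The function name and input format are assumptions, and the set of sponsored tags only contains the ones discussed in this post.

```python
from collections import Counter, defaultdict

# Sponsored-content hashtags discussed in this post.
SPONSORED_TAGS = {"#ad", "#sponsored", "#collaboration", "#ambassador", "#partner", "#sp"}

def cooccurrence(posts_hashtags):
    """Map each sponsored tag to a Counter of the other hashtags it appears with."""
    result = defaultdict(Counter)
    for hashtags in posts_hashtags:
        unique = {tag.lower() for tag in hashtags}  # ignore repeats within a post
        for sponsored in unique & SPONSORED_TAGS:
            for other in unique - SPONSORED_TAGS:
                result[sponsored][other] += 1
    return result

# Toy data: one hashtag list per post.
graph = cooccurrence([
    ["#ad", "#fitness"],
    ["#ad", "#fitness", "#beach"],
    ["#partner", "#nature", "#ad"],
])
print(graph["#ad"].most_common(3))
```

Each counter gives the edge weights between one sponsored hashtag and its neighbors, so keeping only the heaviest edges is one way to simplify the picture.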
The #ad hashtag is used in pretty much all contexts, and the same is true, to a lesser degree, for the #collaboration hashtag. On the other hand, some less frequently used hashtags for sponsored content appear in very specific circumstances. In the beauty and cosmetics sector we see a predominance of hashtags such as #collaboration. The #ambassador hashtag seems to be used almost exclusively for fitness, sports and summertime topics. The #partner hashtag appears for the most part in posts related to photography, nature and, once again, fitness.
The #sp hashtag is a separate story. It is shorthand for "sponsored", and some influencers use it to disclose promoted content (a practice that the FTC in the US has deemed deceitful). However, we discovered unusually strong connections with Brazil: it turns out that Brazilian influencers also use it to tag posts mentioning São Paulo, further adding to the confusion (similarly, #rj is used for Rio de Janeiro).
We include a quick analysis of the posting frequency, since this update also added this new metric for all the profiles. We start with an overview of the distribution of the number of posts per week across all the profiles:
The great majority of the profiles post a few times a week on average. One may wonder if major influencers post more than regular profiles. That is indeed the case: the biggest influencers post 4.2 times per week on average, while smaller profiles have an average of 3.2 posts per week.
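For reference, here is one possible way to turn a list of post timestamps into a posts-per-week figure. The definition used here (posts after the first one, divided by the weeks elapsed between the first and last post) is an assumption for the sake of the example, as is the function name.

```python
from datetime import datetime, timedelta

def posts_per_week(timestamps):
    """Average posts per week between the first and the last post.

    Returns None when a rate cannot be estimated (fewer than two posts,
    or all posts at the same instant).
    """
    if len(timestamps) < 2:
        return None
    span = max(timestamps) - min(timestamps)
    weeks = span.total_seconds() / timedelta(weeks=1).total_seconds()
    if weeks == 0:
        return None
    # n posts delimit n - 1 intervals over the observed span.
    return (len(timestamps) - 1) / weeks

# Toy data: four posts exactly one week apart.
start = datetime(2019, 1, 1)
weekly = [start + timedelta(days=7 * i) for i in range(4)]
rate = posts_per_week(weekly)
print(rate)
```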
Lastly, we investigated whether posting frequency had a visible relation with engagement rate. Short answer: it does. In the scatter plot below, one can see that the engagement tends to get lower as the posting frequency increases:
The two charts suggest that for major influencers this diminishing effect is not as strong, but the inverse relation is still visible. As to why a higher posting frequency correlates with a lower engagement rate there could be several factors at play, like:
- more posting corresponds to lower quality overall;
- too much posting is penalized by Instagram's feed algorithms.
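An inverse relation of this kind can be quantified with a correlation coefficient. The sketch below computes the Pearson coefficient by hand on made-up numbers. The data points are purely illustrative and are not taken from our database.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative values only: posts per week and engagement rate (%).
weekly_posts = [1, 2, 3, 4, 5]
engagement = [5.0, 4.0, 3.5, 2.0, 1.5]
r = pearson(weekly_posts, engagement)
print(round(r, 3))
```

A negative coefficient is consistent with the downward trend in the scatter plot, although it cannot tell the explanations listed above apart.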
Bonus: audience metrics
This update also includes the long-awaited audience metrics. From now on, our scrapers will be able to estimate the demographics of an Instagram influencer's audience.
The technology behind this is proprietary, so this post cannot contain too many details. However, we can present a very general overview of the way it works. The computing unit in charge of estimating audience metrics is essentially a machine-learning pipeline composed of three different convolutional neural networks that work together to estimate age and gender from a profile picture. Of course, this is only possible if the profile picture is a selfie. This is where the first neural network comes into play: we use the other two models for prediction only if we detect a face in the picture.
This process is repeated for the users belonging to a particular influencer's audience, and then the results are aggregated. The final metrics can be summarized in a chart like the following one:
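The overall flow (a face-detection gate, followed by age and gender prediction, followed by aggregation) can be sketched as follows. The three models are passed in as plain callables and replaced with trivial stand-ins here. The function names, the age brackets and the output format are all hypothetical and say nothing about the actual proprietary implementation.

```python
from collections import Counter

def age_bucket(age):
    """Hypothetical age brackets, chosen only for illustration."""
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    return "35+"

def audience_metrics(pictures, detect_face, predict_age, predict_gender):
    """Aggregate age and gender estimates over pictures that contain a face."""
    ages, genders = Counter(), Counter()
    for picture in pictures:
        if not detect_face(picture):  # the first model acts as a gate
            continue
        ages[age_bucket(predict_age(picture))] += 1
        genders[predict_gender(picture)] += 1
    total = sum(genders.values())
    if total == 0:
        return None  # no usable profile pictures in this audience
    return {
        "sample_size": total,
        "age": {bucket: n / total for bucket, n in ages.items()},
        "gender": {label: n / total for label, n in genders.items()},
    }

# Stand-in "models": each picture is a dict that already carries its labels.
pictures = [
    {"face": True, "age": 24, "gender": "female"},
    {"face": False, "age": None, "gender": None},
    {"face": True, "age": 41, "gender": "male"},
    {"face": True, "age": 29, "gender": "female"},
]
metrics = audience_metrics(
    pictures,
    detect_face=lambda p: p["face"],
    predict_age=lambda p: p["age"],
    predict_gender=lambda p: p["gender"],
)
print(metrics)
```

Returning the sample size alongside the shares makes it easy to withhold the metrics when too few usable pictures are available.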
These metrics are extremely powerful and can provide deep insights into the interests of an influencer's audience. Audience metrics will be rolled out in batches and at first only for the biggest profiles, since these metrics cannot be estimated reliably when the audience size is too small.
In this blog post we presented the results from analyses that we performed on very recently collected Instagram data. In particular, we observed that the Instagram profiles are distributed across locations as one would expect from third-party data, and then we explored the differences between the most common hashtags used to signal promoted content.
Our engineering team continues to improve the infrastructure for data collection and analysis. We will provide new updates and analyses as our database grows further. Hashtags and their relationships seem to be particularly interesting, so in the future we may publish a post that goes into more detail about them.