@vecna
Created December 28, 2018 15:55
slide 1 (title):
Analyzing Facebook, reclaiming the algorithm: that's what this talk is about.
Instead of taking pictures during the talk, because I have 71 slides, I suggest you take a picture of this screen: the slides are available as a .pdf on GitHub, at the URL below.
slide 2 (who am I):
I'm Claudio Agosti, @_vecna is my twitter handle, and if you want to talk about this project, the hashtag is #fbTREX
slide 3 (big logo):
I guess some of you are off Facebook: well done, I respect that approach and I'm happy you can do it. But in this talk we deal with the impact of Facebook on society, and that's why we address Facebook as the first platform.
The project name is facebook.tracking.exposed; we address Facebook first because it is the most massive social media of our time. But the model I'm going to talk about, the methodology, can apply to any platform which personalizes your perception.
slide 4 (emotion manipulation)
Personalization is made by algorithms. It is clear Facebook has many different algorithms and procedures, but in this talk I will just refer to "the algorithm" as the sum of all the secret logic whose outcome is the personalization of your Facebook experience.
The first time Facebook confirmed the creepiness of this was in 2014, when they released the results of an experiment run on unaware users.
More than six hundred thousand people were part of this experiment. Facebook acted through the content showing up in their timelines. Behind the scenes, Facebook attributes metadata assessing the sentiment expressed by a post: positive or negative. Half of the users had the positive posts removed, the other half had the negative ones removed. The study looked at whether they would behave differently: they did.
This tells us two things: the algorithm might be a tool of social control and influence, and there is metadata attributed to our posts which we are not aware of.
slide 5 (ferguson)
Zeynep Tufekci kept her eyes open the night the Ferguson revolt began. She noticed a stark difference between what was happening and what her Facebook timeline was displaying.
Takeaway: algorithms have a social and collective impact, and this is why they matter. We as educated technologists believe we can escape the influence, but we cannot. In an unfair society where perception can be controlled by an information monopoly, we suffer too.
slide 6 (Caryn Vaiano)
And this is a personal story shared by Caryn Vaiano: she and a group of friends were in touch via Facebook. When one of their friends was admitted to a hospital and announced it on Facebook, none of his friends saw the post; nobody expressed any support or visited him, and he died alone.
This is an effective example to explain how the algorithm has an impact on our individual lives, but we need data. These, so far, are anecdotal experiences.
slide 7 (bring people close)
As a reaction to Facebook being exploited to spread disinformation, the company announced it would trim its own algorithm. This confirms that a hidden logic is used to regulate our interactions.
slide 8 (mark's quoted)
This is the commitment of Mark Zuckerberg, but a question is left unanswered: how can Facebook know what matters to me? What is meaningful?
Their delusion comes from profiling users and leveraging this surveillance to try to guess what we might like.
(In this presentation, when you see Comic Sans it is because Facebook is talking.)
On the right, an advertisement I saw in Berlin.
slide 9 (phenomena and lobbyists)
As you might have seen, politicians and journalists are trying to catch up with the impact of algorithms: the first by regulating, the second by reporting. In the last year a variety of organizations, projects, and research groups have sprouted. Some of them are directly financed by the algorithm monopolists, some of them are simply corporate friendly.
With some field expertise, you can spot a lobbyist dressed up as an independent project. My suggestion is to look at their publications and resolutions, and check whether power is going to shift.
If the power stays at the center of the network, it is corporate friendly. If it moves into the hands of connected people, it is the political activism of the digital age.
slide 10 (the guardian)
This is an article published 10 days ago.
slide 11 (table of political runners)
The two people in blue are the current vice prime ministers, Luigi Di Maio and Matteo Salvini. What I don't like about this report is what it implies: "engagement means winning". It is not like that, and this should not be the frame in which the story is told.
First, this narrative reinforces Facebook as the place in which you have to invest, giving Facebook more importance than it deserves.
Second, it reinforces companies like Cambridge Analytica, because their services are exactly a promise of more engagement.
But worst of all is the quality of the metric. Engagement is computed as the sum of the likes, comments, and shares of the posts produced by the political campaigns.
slide 12 (different variables)
Political debate on social media might depend on three variables, and each of them has different political implications.
It might be the investment in the marketing campaign: in this case you'll have engagement, but it just expresses how much the candidate has invested.
It might be the hidden logic of Facebook, which we need to know: this is the logic which defines how we, as a society, perceive the world around us. It has, in this case, all the impact of a public policy, and therefore should be known by us.
It might be due to what connected citizens actually want, and this is the only legitimate variable a campaigner might want to know. But engagement per se is the sum of these different natures; they can't be summed together.
And of course, engagement is a metric which can be fooled by a botnet faking activity.
slide 13 (hackers don't look at likes)
If we use the metrics controlled by the system of oppression, we have already lost: it means playing on their field.
Whoever defines the metric is also defining the game, the field, and likely selecting the winners.
slide 14 (components)
We should split this chain into components. There are three different kinds, and our goal is to isolate the variable of the Facebook black box.
slide 15 (passive actor)
On the left you see hypothetical sources, such as Facebook friends or pages, producing content; they publish material during the day.
In the evening, the user connects and gets a selected reduction of the available content, plus some advertising.
The user might engage with those posts, and only with those. It is unlikely they would visit every individual page to check all the shared material. Our attention is limited, and if content does not surface, despite not being actively censored, it is algorithmically penalized to the extent that it will never become popular.
slide 16 (black background)
We want to expose how this business logic has a negative impact on society.
slide 17 (evidence)
To do this, any volunteer like you or your friends can install a browser extension which records what Facebook is sending to them: not what they share with Facebook, but what Facebook has selected.
slide 18 (is hard)
And so what? We can only watch what Facebook selected for individuals and compare, in order to find recurring patterns and behaviors. But there are too many differences among users, and the data becomes hard to compare.
slide 19 (black-box)
We reduced the variables to isolate the one of the Facebook black box. We created a bunch of accounts with zero friends and ran them over a selected window of time: the months between January and March, during the electoral campaign.
These accounts were following the same thirty pages, so they would potentially be exposed to the same content.
slide 20 (bot, but)
They only liked different content, to let Facebook believe they belonged to a specific political orientation.
This is not a statistically representative analysis, but it is helpful to see how differently Facebook behaves among users whose only difference is in their likes, or the lack thereof.
slide 21 (autoscroller)
This is a visualization of our collection: every row represents an hour of the day. The bots accessed Facebook 13 times during the daytime, once per hour. They scrolled automatically to capture what Facebook selected for them.
A small technicality: the differing number of posts is due to the size of the individual content; the scroll works by pixel count.
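As an illustration (mine, not the fbTREX source), a pixel-count autoscroller can be as simple as the sketch below; the step size, delay, and step count are assumptions, not the values the bots used.

```javascript
// Hypothetical autoscroller sketch: advance the page by a fixed number
// of pixels, pausing between steps so the feed has time to load.
// STEP_PX, DELAY_MS and MAX_STEPS are illustrative values only.
const STEP_PX = 600;
const DELAY_MS = 2000;
const MAX_STEPS = 20;

async function autoscroll() {
  for (let i = 0; i < MAX_STEPS; i++) {
    window.scrollBy(0, STEP_PX);
    await new Promise((resolve) => setTimeout(resolve, DELAY_MS));
  }
}

autoscroll();
```

Because the budget is measured in pixels and not in posts, a feed of tall posts (e.g. videos) yields fewer captured posts per session than a feed of short text posts, which is exactly the technicality above.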
slide 22 (patterns)
And finally, we can start to compare our collections. We tried many approaches, but a successful output came from counting the different media types selected by Facebook.
We call "information diet" the selection making up your daily informative experience. Imagine the media type as an ingredient: are you going to have more video, text, or pictures?
For us, this is a pattern because it keeps repeating with a certain constancy across the days. The strangest bot is Michele, the one which likes the far right: somehow it gets more pictures than text. The undecided bot is the one getting more videos than the others.
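A minimal sketch of this computation, under the assumption that every collected post carries a date and a media type (the { date, mediaType } shape is illustrative, not the real fbTREX schema):

```javascript
// Sketch of the "information diet" metric: the daily share of each
// media type among the posts Facebook selected for one profile.
function informationDiet(posts) {
  const byDay = {};
  for (const { date, mediaType } of posts) {
    byDay[date] = byDay[date] || {};
    byDay[date][mediaType] = (byDay[date][mediaType] || 0) + 1;
  }
  // Convert per-day counts into percentages.
  const diet = {};
  for (const [date, counts] of Object.entries(byDay)) {
    const total = Object.values(counts).reduce((a, b) => a + b, 0);
    diet[date] = {};
    for (const [type, n] of Object.entries(counts)) {
      diet[date][type] = Math.round((100 * n) / total);
    }
  }
  return diet;
}

const sample = [
  { date: '2018-02-08', mediaType: 'photo' },
  { date: '2018-02-08', mediaType: 'photo' },
  { date: '2018-02-08', mediaType: 'video' },
];
console.log(informationDiet(sample));
// -> { '2018-02-08': { photo: 67, video: 33 } }
```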
slide 23 (day)
We also checked whether the same percentage holds within a single day.
slide 24 (day expanded)
And it does: it looks like Facebook has a percentage in mind for every user.
slide 25 (we have a metric)
The usage of bots is a kind of synthetic experiment, but once we found a metric, we could apply the same to the connected citizens who participate in our collection.
In this graph, every row represents a timeline; they belong to two users we can see only by pseudonym.
papaya-shawarma-icecream is getting mostly pictures, while pasta-asparagus-chocolate gets videos and text.
slide 26 (another metric)
Another interesting metric we found comes from the number of repeated posts: how often the same post gets repeated across refreshes. In the graph, you can see a certain number of posts seen once, then a smaller number seen twice, a bit fewer seen three times, and so on.
Curiously enough, the bot which put its likes on the far right is the one with the most repetition.
On Andrea, instead, you can see the greatest content diversity: that bot got exposed to much more diverse content than anybody else.
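A minimal sketch of how such a repetition histogram can be computed, assuming each sighting of a post is recorded with the post's unique id:

```javascript
// Sketch of the repetition metric: how many posts were seen once,
// twice, three times, ... across a profile's refreshes.
// `impressions` is a list of post ids, one entry per sighting.
function repetitionHistogram(impressions) {
  const timesSeen = {};
  for (const postId of impressions) {
    timesSeen[postId] = (timesSeen[postId] || 0) + 1;
  }
  const histogram = {};
  for (const count of Object.values(timesSeen)) {
    histogram[count] = (histogram[count] || 0) + 1;
  }
  return histogram;
}

console.log(repetitionHistogram(['a', 'b', 'a', 'c', 'a', 'b']));
// -> { 1: 1, 2: 1, 3: 1 }  ('c' seen once, 'b' twice, 'a' three times)
```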
slide 27 (another metric, second slide)
This is the same analysis, two weeks later. The pattern changes, and we can't know whether this happened because Facebook changed something or for some other reason.
We used this kind of research, and others, to develop ways of visualizing the algorithm. We don't yet provide a working UX for adopters; this is part of our research phase.
slide 28 (algorithm monopoly)
Let's look at the organizations which rely on Facebook to do business.
slide 29
Using the now-closed Facebook API, we downloaded 100% of the content produced by the 30 pages in the analysis. Here you see highlighted the three major newspapers included in the analysis.
The size of the bars represents how active they were in publishing new content: "il giornale" produced double what "la repubblica" did, with "il fatto quotidiano" in between.
Now, if you want to bet with yourself:
if they were treated fairly, this ratio would be the same in the content selected for our bots;
if you think the "filter bubble" leads, we should expect the bot liking green to see green, the bot liking yellow to see yellow, and so on.
slide 30 (8-14 February)
But the reality is complicated: it seems the blue one, "la repubblica", was favored much more than the others, regardless of the bots' liking patterns.
slide 31 (19-26 February)
Two weeks later the situation gets a bit more extreme.
It is curious because, when I show this graph, Italians who know the differences among these outlets tend to justify it. It is also interesting because you can see if, and what kind of, behavior people expect from the algorithm.
Some justify this because "la repubblica" is the most bipartisan outlet. But if it is true that bipartisan views are preferred by Facebook, then the algorithm is flattening society, uniforming our perception against extremism.
Some justify this because "la repubblica" has more likes. But if this were the logic, the most effective media companies nowadays would be Cristiano Ronaldo, Shakira, and Vin Diesel.
Some justify this by saying that if "il giornale" is spamming compared to the others, it probably gets penalized. Maybe, but what are the criteria? Those outlets are doing business.
Some who are aware of Italian political discourse noticed how right-wing content populated the bot of the Five Star Movement.
We don't have a definitive answer, but at least now you have data and a method to assess the effective influence of the Facebook black box.
slide 32 (empowering)
Our goal is not to release a report, but rather to enable others. On your left you see Copernicus, author of the heliocentric theory. On the right, Galileo Galilei, who enabled everyone to test that theory, shown in front of the Inquisitors of the Roman church.
Nowadays, the closest thing to the Roman church is Google: you see below when they took down our web extension for trademark infringement. We were treated as sinners for some months, until we got forgiven. But this raises an additional issue: we want to build culture around algorithms while platforms control how culture circulates.
slide 33 (shared knowledge)
We want to make criticism of algorithms mainstream.
slide 34 (using election)
Before Snowden, talking about privacy was harder because you had to feel the political issue of being surveilled and targeted. And this issue depends on the political context. In Europe, talking about privacy was sometimes harder because there is no perception of the conflict: the vast majority are fed, have a day job, live in peaceful times, and don't perceive the risk.
Advocating for algorithm accountability has the same issue, but the strategy here is to engage and research with communities and contexts which feel part of a conflict; they are probably using Facebook to do their advocacy.
slide 35 (simple method)
If you understand this issue and you know someone living in a conflict: you are the project lead.
Your goal could be to run an experiment and find out the right content, keywords, pages, and habits the community has. The number of people involved might not be representative, but that does not make the analysis less true for you.
We should show, to ourselves and to anyone else who wants to use algorithms, how society has diverse needs. Our uniqueness is the one parameter which can't scale the way platforms do business.
And if Facebook flattens society, making meaningful only the experiences of the majority, we should look for all the corner cases and disprove their model.
slide 36 (research group)
We are working on a more flexible UX to let people join a research team. At the moment, the only method is to mark the contribution with a tag, and with that tag download the metadata associated with the profiles.
You define the methodology, and whether you are using humans, bots, or a cyborg hybrid.
slide 37 (Southpark)
Of course, the smartest way to challenge Facebook was suggested by South Park, season 21 episode 4; but just in case you can't implement it...
slide 38 (don't delete, donate)
Don't delete your Facebook profile, but give it to science. Every profile is a unique point of observation of the network.
slide 39 (responsibilities)
Of course, we are asking people to trust us, and we should not behave like Facebook, which plays with people and apologizes later. We are doing our best, and this is a process. Let's look at our ethical mandate.
slide 40 (fbTREX != socmint)
We only look at the news feed; we don't want to enable social media intelligence.
slide 41 (respect people's choices)
We only look at public posts: if something is shared with friends only, it is not acquired.
We consider the timeline personal data. Although it is composed of public posts, their aggregation depends on your sources and on the unknown profile Facebook has about you.
slide 42 (respect people's choices)
The only client-side check is on the privacy settings, and on top of each post a bar appears telling you whether the content has been acquired or not.
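A minimal sketch of such a client-side gate; the isPublicPost() test, the banner text, and the collector endpoint are all hypothetical stand-ins, not the real extension code:

```javascript
// Sketch: acquire a post only if its privacy setting is public,
// and show a banner on top of the post reporting the decision.
// The audience marker and the endpoint are hypothetical.
function isPublicPost(postNode) {
  return postNode.dataset.audience === 'public';
}

function sendToCollector(html) {
  fetch('https://collector.example.org/api/post', { method: 'POST', body: html });
}

function maybeAcquire(postNode) {
  const banner = document.createElement('div');
  if (isPublicPost(postNode)) {
    sendToCollector(postNode.outerHTML);
    banner.textContent = 'fbTREX: content acquired';
  } else {
    banner.textContent = 'fbTREX: private post, not acquired';
  }
  postNode.prepend(banner);
}
```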
slide 43 (respect people's choices)
The tests we made involved only profiles created by us, without any personal traits, following only high-visibility sources during the electoral period, so we released that data publicly. But what you collect is protected.
slide 44 (adopters have )
The data observed by your profile is yours, and it will not be visible outside of your profile.
You will know how it is used in aggregated analyses.
slide 45 (adopters have )
There are still important things to improve: features which should be exposed with a proper UX. Now we have a budget to do it, and it is on our roadmap. In practice, we want to make data retention fully customizable by adopters, make the delete operation disintermediated, and define properly how we treat adopters. At the moment they are not users with a login and password, but just a publicKey trusted with TOFU (trust on first use).
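A minimal sketch of what publicKey-with-TOFU identification amounts to on the server side; the storage and names are illustrative, not the fbTREX implementation:

```javascript
// Trust on first use (TOFU): the first public key seen for an adopter
// is remembered, and later requests must present the same key.
const knownAdopters = new Map(); // adopterId -> publicKey

function identify(adopterId, publicKey) {
  if (!knownAdopters.has(adopterId)) {
    knownAdopters.set(adopterId, publicKey); // first contact: trust it
    return true;
  }
  return knownAdopters.get(adopterId) === publicKey; // afterwards: must match
}

console.log(identify('adopter-1', 'key-A')); // true  (first use)
console.log(identify('adopter-1', 'key-A')); // true  (same key)
console.log(identify('adopter-1', 'key-B')); // false (key mismatch)
```

No password is involved: possession of the key is the identity.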
slide 46 (adopters share)
An adopter might export a specific selection of their timeline. If comparison with peers is the key feature we want to support, you should be able to control what you are sharing with whom.
slide 47 (we want to show)
We want to let people compare their own informative experiences using Venn diagrams. The keywords you see in this example come out of semantic analysis; the technology we use looks for the Wikipedia entries which fit the provided text.
The overlaps in the Venn diagram might differ either because of the algorithm or because we actually follow different sources.
slide 48
We do not want to permit anyone to access our database directly. The code running on it is the one published in our GitHub repository.
Still, it is possible to run queries on the collective dataset if we are sure they help in understanding a public phenomenon, or in analyzing the algorithm. We do not want to permit any query which might leak, or be linked to, individual behaviors.
This can't be formally verified, so we process these queries by hand, doing a privacy assessment.
We should also provide a synthetic dataset to let researchers experiment safely.
slide 49 (this is how a )
This question posed by Wolfie Christl is an example of a query that uses the entire database, investigates a phenomenon, and is not privacy-harmful. So it has been implemented.
slide 50 (January 2018)
Analyzing circa 20k timelines, we can state that the average amount of advertising is around 10%.
slide 51 (February 2018)
The rounding has been shifted to three percent; 12% seems to be the most frequent ratio between advertising and content.
slide 52 (March 2018)
In March the percentage seems to be the same; we are considering only timelines with more than five impressions.
slide 53 (April 2018)
In April, the advertising seems to increase.
slide 54 (May 2018)
It looks the same in May.
Curiously, around 30% advertising there is always a small spike.
I do wonder whether the users subject to that 30% are a small group or a larger one.
These kinds of queries are privacy-preserving, and you can see the code of this very simple MongoDB script. We want to let other researchers implement their own research queries, and we will do the privacy assessment. This procedure has not yet started.
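This is not the original script, but a minimal sketch of what such a privacy-preserving aggregation could look like in the mongo shell, assuming a collection of impressions where each document carries a timelineId and a sponsored flag (names are assumptions, not the real schema):

```javascript
// Per-timeline percentage of sponsored posts, reduced to a histogram,
// so no individual behavior leaves the database.
db.impressions.aggregate([
  { $group: {
      _id: '$timelineId',
      total: { $sum: 1 },
      ads: { $sum: { $cond: ['$sponsored', 1, 0] } }
  } },
  // Keep only timelines with more than five impressions.
  { $match: { total: { $gt: 5 } } },
  // Round the advertising share down to an integer percentage...
  { $project: {
      adPercent: { $floor: { $multiply: [{ $divide: ['$ads', '$total'] }, 100] } }
  } },
  // ...and count how many timelines fall into each percentage bucket.
  { $group: { _id: '$adPercent', timelines: { $sum: 1 } } },
  { $sort: { _id: 1 } }
]);
```

The output pairs an advertising percentage with a count of timelines, which is exactly the shape of the histograms in the previous slides.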
slide 55 (we are committed)
Facebook displays a commitment to transparency. Super, right?
slide 56 (improving enforcement)
These are some of the many posts the company released, but is it true?
slide 57 (Jake Creps)
Facebook has started to actually make scraping harder: a quite long Twitter thread documents some of their tactics, and someone else found a dedicated team working to hinder this activity.
slide 58 (worse than)
Facebook has been adopting some of these techniques for more than a year, but the worst part is that now every post contains the same string as a decoy in the HTML, because Facebook does not want to let third parties recognize which posts are sponsored content.
slide 59 (this is what)
Some days I wake up, look at the statistics, and see a parser extracting zero metadata. This is a sign the parser is outdated. My intervention then consists of fixing the parser, re-running it over the last collected posts, and restoring the metadata extraction. It is an effort; we found some solutions, but...
slide 60 (corner case)
We publish everything under a free software license, but the powers in place are not really helping here. The new parser logic I develop costs me many days, and I would feel quite stupid releasing it only to have Facebook watch it and break it.
I am quite sure Facebook does not look at us, because sometimes I think of Facebook as Jupiter: a gaseous giant, the most massive and ancient planet of the solar system, and of us as a fart in space.
But still, in theory, if a large enough community worked with us, we would be numerous enough to react to Facebook's changes. At the moment it is not so, and I've separated the parser components into a repository which is not publicly indexed. What happens in the long term is unknown.
slide 61 (algorithm might)
Algorithms hold all the right cards to become a legitimized discriminatory technology, accountable only ex post and protected as a corporate secret, despite their impact on public discourse.
slide 62 (elections canada)
These are screenshots of how many groups are trying to use artificial intelligence as a means to decide what is true and what is false. If these technologies get deployed with the current rate of imprecision, lack of appeal mechanisms, and centralization of the values defining what is true and what is not, it will look like a dystopian world.
slide 63 (algorithm diversity)
We advocate for what we call "algorithm diversity": the idea that individuals might decide, program, share, remix, and customize their own algorithm. If a person wants to change, they can also change their algorithm. A personalization algorithm implements your priorities, and only you can know what is better for you.
In theory, at least. This is the same fallacy we fell into with the distributed network utopia. If everyone is exposed to any information, does everyone have the right tools to judge its quality and put it in context?
We already see the creation of segregated bubbles where extreme speech and anti-scientific behavior damage society. The so-called new wave of populism would not exist if public speech were mediated.
On the other hand, there is an algorithm monopoly which defines what deserves to be seen and what doesn't, and this is not acceptable. We advocate for algorithm diversity because we have to shift this status quo.
But in the long term, some middle-ground solution will be found. As a society we should be capable of sustaining an informed discussion about it, understanding the limitations of the available spectrum. At this moment, we are only playing in the red field, where every option confirms the exploitative business model of the data monopolists.
slide 64 (critical thinking)
For these reasons, we should be informed about the upcoming debate. We can't leave this choice in the hands of lobbyists. As a hacker community, we should be ready to offer alternative tools and approaches.
slide 65 (simple, stateless)
And our goal now is to aim for simple, stateless tools.
By stateless I mean: no training, no machine learning, idempotent functions. They would not be the best in the beginning, but if the issue is content indexing and content retrieval, they might be enough to curate your information diet.
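As an illustration (mine, not a fbTREX component), a stateless curation tool can be as small as an idempotent keyword filter over collected posts:

```javascript
// Illustrative stateless curation: a pure, idempotent function that
// selects posts matching a user-chosen keyword list. No model, no
// training, no stored state: the same input always gives the same output.
function curate(posts, keywords) {
  const wanted = keywords.map((k) => k.toLowerCase());
  return posts.filter((post) =>
    wanted.some((k) => post.text.toLowerCase().includes(k))
  );
}

const timeline = [
  { text: 'New study on air pollution in Berlin' },
  { text: 'Celebrity gossip of the day' },
];
console.log(curate(timeline, ['pollution']));
// -> [{ text: 'New study on air pollution in Berlin' }]
```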
slide 66 (European elections)
Next year the European elections give us an opportunity to scale up the experiments, the analyses, and the stories we publish.
Facebook has already failed at securing political speech, and this gives us an opportunity to explore new methods as an independent observer: not of the public discourse, but of the platform which guides it.
slide 67 (greatest gift)
Politicians prefer social media because they believe themselves to be disintermediated. They contribute to making social media more valuable, and potentially enable the platform to exert more influence.
Although this is not a good starting point, it also justifies us in taking more creative steps.
slide 68 (special answer)
The simplest algorithm is a selection criterion. From that URL you'll find access to RSS feeds which export the public posts passively collected by the fbTREX adopters.
The keywords are Wikipedia entries only, and at the moment only five languages are supported. It is an unstable, experimental feature, but it is a good example of how people can control their own information diet.
slide 69 (ambitious plan)
We have a pretty ambitious plan ahead: collect information from a network of volunteers, give them personal analytics capabilities, protect the rights of all the people involved, use the dataset in the public interest, and show what a democratic technology might look like. We need a large and diverse community.
slide 70 (thanks)
slide 71 (references)