Ayms/gistpublic.md Secret

## gistpublic.md

      
    Target

Protect bittorrent users from monitoring spies by making their activity much less visible: change the way they connect to a torrent and define a method to establish and maintain dynamic blocklists.
Abstract

Previous research has focused mainly on discovering monitors using trackers. This study focuses on tracking and blocking the monitors using only the bittorrent peer and content discovery system (the DHT).
The overall result is that the spies are organized to automatically monitor whatever exists in the bittorrent network. They are easy to find but difficult to follow, since they may change their IP addresses and they pollute the DHT with existing peers unrelated to monitoring activities.
By spreading in the bittorrent network a torrent that does not exist, we show that the spies are organized in two levels. The first level is responsible for attracting the users and redirecting them to a final spy, which waits for them to connect and start the bittorrent handshake containing the torrent they are requesting. As we demonstrate, this is not enough to prove that the users did download the related torrent, yet the copyright enforcers might issue take-down notices on this basis alone. Indeed, the spies never answer the bittorrent handshake, so they cannot know whether the torrent was downloaded. We cannot exclude that, after a peer is detected, other spies connect to it to make sure that it is downloading the said torrent, but the activity of those spies would then necessarily appear suspicious and would be detected by the method.
The first level spies, which are not really dangerous but constantly change, are well distributed in the bittorrent space, since they position themselves according to what is learned from the network. The final spies are not as numerous as people might think when they envision enormous blocklists being needed to catch them.
The first level spies do not only return real spies: they also substitute the identity of existing peers that have nothing to do with monitoring activities, probably to make the detection of the final spies more difficult.
Unexpectedly, ...undisclosed...
All this is of course highly questionable from a legal standpoint.
Static blocklists are not enough and the method shows how to create and maintain dynamic blocklists.
Finally, we show that a method to defeat most of the spies is already specified but is not implemented, or only partially and not generally enforced. We suggest some changes to the bittorrent protocol and to the bittorrent clients to better protect the users and to allow a fair relationship between them and the copyright holders.
Legal mention

This study was performed as research work and does not take a position. While it focuses on protecting the privacy of the users and on detecting and blocking the monitors, it also provides some thoughts on how the copyright holders could benefit from the P2P network, for example by envisioning some means for the users to pay something. No such means exists today, which prevents people from legally using the power of the bittorrent network (update: ...undisclosed...). It also envisions the development of a new bittorrent client following the recommendations of this study, protecting the users much more and fair to all parties.
The rationale for performing this study, which is to solve an unfair situation in the context of the Peersm project [13], is explained at the end.
The IP addresses of bittorrent users cannot be hidden and can easily be seen in any bittorrent client. We do not disclose any of the IP addresses encountered during this study, whether for peers or monitors. The data used do not present any individual privacy issue, since they were never analyzed case by case but only in bulk for statistical computation. The data will be destroyed when they are no longer required.
This study only partially covers the specific case of monitors behaving quasi normally in a torrent, since studying it fully would require us to participate in the torrent. We therefore did not participate in any copyright-infringing activity, nor did we download any file during this study.
Background - Quick reminder about the bittorrent peer and content discovery system

The peer and content discovery system is the Distributed Hash Table (DHT). Each peer has a nodeID and each content has a reference called the infohash; a mathematical operation (xor) gives a distance between them. Each peer maintains a routing table of the peers it knows. It first registers the peers closest to its nodeID by recursively asking others (find_node requests), starting with some well known bootstrap nodes, and then registers the peers it encounters during its lifetime. The routing table is split into 160 buckets, each corresponding to a distance range from its nodeID, and each new peer is registered in the bucket matching its distance from the nodeID.
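The xor distance can be sketched as follows. This is a minimal Python illustration, not torrent-live code: the helper names `xor_distance` and `closest` are hypothetical, identifiers are assumed to be 20-byte strings, and `k=8` is only an illustrative default.

```python
from typing import Iterable, List


def xor_distance(a: bytes, b: bytes) -> int:
    """XOR distance between two 160-bit identifiers, as an integer."""
    if len(a) != 20 or len(b) != 20:
        raise ValueError("identifiers must be 20 bytes (160 bits)")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest(target: bytes, node_ids: Iterable[bytes], k: int = 8) -> List[bytes]:
    """Return the k nodeIDs closest to target, nearest first."""
    return sorted(node_ids, key=lambda n: xor_distance(target, n))[:k]
```

A smaller integer means a closer identifier, so two identifiers that differ only in their last bit are at distance 1, while two that differ in their first bit are at distance 2^159 or more.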
When a peer wants to download or announce a content, it looks recursively for the nodes closest to the content's infohash by sending get_peers requests, then sends an announce_peer request to those closest nodes. The second message must contain the token returned in the answer to the first get_peers request, and the IP address of the querying party must be the same for both messages. This mechanism makes it difficult for someone to announce something on behalf of somebody else.
Peers answer get_peers requests with values (peers that announced having the requested infohash) and/or nodes (peers known by the queried party to be close to the infohash).
The set of all peers participating in a given torrent is called a "swarm".
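The token check that ties an announce_peer message to an earlier get_peers request from the same IP address can be sketched as below. This is only an illustration under assumptions: the class `TokenIssuer` is hypothetical, and deriving the token as an HMAC-SHA1 of the IP address with a local secret is one possible choice, not necessarily what any given client does (real implementations typically also rotate the secret over time).

```python
import hashlib
import hmac
import os
from typing import Optional


class TokenIssuer:
    """Issues and verifies announce tokens bound to the requester's IP."""

    def __init__(self, secret: Optional[bytes] = None) -> None:
        # Hypothetical: a random per-process secret when none is given.
        self._secret = secret if secret is not None else os.urandom(16)

    def token_for(self, ip: str) -> bytes:
        """Token returned in the answer to a get_peers request from ip."""
        return hmac.new(self._secret, ip.encode("ascii"), hashlib.sha1).digest()

    def accept_announce(self, ip: str, token: bytes) -> bool:
        """Accept announce_peer only if the token was issued to this ip."""
        return hmac.compare_digest(self.token_for(ip), token)
```

A peer that spoofs another IP address for get_peers never receives the answer, so it cannot present a token that matches that address in the subsequent announce_peer.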
Related work

The trackers, which are servers registering the peers and referencing the contents, are out of the scope of this study. There are many research papers about monitoring the bittorrent network ([3] and subsequent references), mainly using trackers; trackers are now obsolete and should not be used.
The above references sometimes mention the DHT, but only to a limited extent. Some work exists about monitoring the spies ([3], [2] and subsequent references), but the topics are generally more about monitoring the users than about monitoring the spies. We are not aware of studies on detecting, tracking and following the spies using the DHT only.
We decided to explore all the ways the spies can monitor the bittorrent network using the DHT only. The first part is more empirical, aiming to understand the general behavior of the spies and to collect data; the second part studies the DHT distribution more precisely and finalizes the method.
The spies: definition and methods

A spy is a peer whose function is to monitor what the bittorrent users are doing and possibly to prevent them from doing it. This experiment targets the spies associated with copyright enforcement policies.
A spy has several ways to monitor a torrent:

it can set a nodeID close to the monitored torrent and look for the get_peers and announce_peer requests to see who is downloading it
it can announce itself in the DHT for the related torrent and wait for peers to connect to it.
it can just monitor get_peers requests in the DHT to see who is looking for a given torrent.
it can reply to get_peers requests with other spies that have nothing to do with the requested torrent but whose sole purpose is to invite you to connect to them.
it can participate normally or quasi normally in a torrent.
it can connect to peers it has identified; this case is out of the scope of this study since it does not apply to torrent-live peers, which are totally freeriding.
it can participate in a swarm and get other peers in that swarm via the Peer Exchange Protocol. The Peer Exchange Protocol is neither precisely documented nor specified, and the bittorrent clients implement it the way they like. The principle is that peers participating in a swarm send to others the peers they are connected to. We disregard it in this study since it cannot prove that a peer really participated in a torrent: it is easy for some peers to send fake peers to others via this protocol.

The last categories are called direct monitoring and are covered by [3], although, as this study shows, we do not agree with that paper's conclusion that the spies only monitor notorious torrents. The others are called indirect monitoring.
We started this study on the assumption that spies must implement some automatic methods to discover new torrents and monitor them. The principle of this experiment, partially inspired by discussion [4], is therefore to spread in the DHT the existence of a torrent that only we have (i.e. that does not exist in the DHT), referenced by a "fake" infohash, in order to attract the spies and detect them.
We used our own bittorrent client torrent-live [1] built from other open source modules to perform these experiments.
Initial method

The real infohash is the infohash of the torrent the user wants to download.
Torrent-live user's manual states:
To start building its own blocklist before starting to use torrent-live (or another bittorrent client allowing blocklists), it is advised to run several instances of torrent-live with infohashes corresponding to well known and monitored torrents and the 'findspiesonly' option.
There is an obvious correlation between the notoriety of a torrent and the number of spies monitoring it, so notorious torrents will allow to discover spies more quickly.
Torrent-live performs the following steps/tasks:

set a fake infohash close to the real one; barring really bad luck, the fake infohash does not exist in the bittorrent network
walk the DHT periodically looking for the fake infohash, respond to queries (freerider option set to false, please see below)
change the nodeID to a random one at each new walk, so the path through the DHT changes each time
do not announce the fake infohash, otherwise all torrent-live users would blacklist each other and this would disturb the DHT lookups
register the spies found in a blocklist and save them in a file; no exception is made for Tor exit nodes or VPNs, they will be blocked too
start the real torrent after 30 s if a blocklist exists (average time to get the closest nodes) or after 5 min otherwise, and use the closest nodes (not in the blocklist) found during the fake infohash lookup to retrieve the peers for the real infohash; this saves the user from walking the DHT again and telling everybody what they are really looking for
enable the freerider option: do not advertise anything, do not answer queries, do not share anything
connect to the first 20 peers not in the blocklist
maintain a swarm of 20 peers; if one disconnects, replace it with another one from the peer list that is not in the blocklist
due to the freerider option some peers might disconnect, but the main seeders usually don't, so the swarm will oscillate around 20 peers and stabilize after some time with supposedly good seeders (i.e. not spies)
the periodic check of the DHT keeps running while the torrent is downloaded/streamed, to remove newly found spies in real time and extend the blocklist
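The first step, setting a fake infohash close to the real one, can be sketched as follows. This is a hedged illustration and not the torrent-live implementation: the function name `fake_infohash` and the default of 32 shared leading bits are assumptions, since the text only says that the fake infohash is "close" to the real one.

```python
import os


def fake_infohash(real: bytes, shared_bits: int = 32) -> bytes:
    """Random 20-byte identifier sharing exactly shared_bits leading bits with real."""
    if len(real) != 20:
        raise ValueError("infohash must be 20 bytes")
    if not 0 <= shared_bits < 160:
        raise ValueError("shared_bits must be in 0..159")
    tail_bits = 160 - shared_bits
    low_mask = (1 << tail_bits) - 1
    real_int = int.from_bytes(real, "big")
    rand_int = int.from_bytes(os.urandom(20), "big")
    # Keep the prefix of the real infohash, randomize everything after it.
    fake = (real_int & ~low_mask) | (rand_int & low_mask)
    # Force the first bit after the prefix to differ, so the result is
    # never the real infohash and the shared prefix is exactly shared_bits.
    flip = 1 << (tail_bits - 1)
    fake = (fake & ~flip) | ((real_int ^ flip) & flip)
    return fake.to_bytes(20, "big")
```

With 128 random trailing bits in this sketch, a collision with an existing torrent is the "really bad luck" case mentioned in the first step.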

In this process the spies may be the peers that pretend to have the fake infohash (called level 2 spies) or those that return them (called level 1 spies); other spy categories will also be considered in what follows.
Experimentation 1

Two campaigns were run over two weeks using different notorious torrents. These campaigns only collected level 2 spies and were stopped when they reached 250 000 spies for the first one and 350 000 spies for the second, since it appeared obvious that the number would continue to increase indefinitely.
The naive assumption was that while seeing an unknown infohash (the fake infohash), level 2 spies would announce themselves in the DHT with this infohash to see what it is and who is connecting to them to request it.
Surprisingly we noticed that well known non spies (like the usual DHT bootstrap nodes) did end up in the blocklist.
By checking the logs to see who returned the DHT bootstrap nodes we only found 40 nodes, but it just proves that some nodes are returning on purpose nodes that have nothing to do with what was requested.
The intersection between the spies of both campaigns gave only 15 500 spies, which could corroborate the above statement.
So there may be level 1 spies that answer get_peers requests with colluding level 2 spies or with dummy peers.
Experimentation 2

We extracted the level 1 spies from the logs of the two campaigns (50 GB file) and found 90 000 of them.
If level 1 spies are returning colluding peers, this probably means that level 2 peers do not make a lot of efforts to announce themselves for the fake infohash.
To check this, we operated some nodes with a nodeID close to the real infohash (and therefore to the fake infohash) that announced the fake infohash in the DHT. We did not see anybody announcing the fake infohash to these nodes, but we did see quite a lot of get_peers and announce_peer requests coming from the peers downloading the real infohash, which would have put us in a good position to watch them had this been our intent. This is probably what some spies are doing too: just setting a nodeID close to the infohash they want to monitor and waiting for peers' requests.
Those that respond to a get_peers request with values (i.e. peers that are supposed to have the fake infohash content) are either spies (level 1) or non spies to which level 2 spies announced themselves.
The peers returned (level 2) are likely to be colluding spies with level 1 spies, but the experiment shows that some nodes are returning some other peers that are not spies (for example the usual bittorrent DHT bootstrap nodes).
When sending get_peers requests to level 1 spies, it appears that they sometimes reply with some peers and sometimes do not reply. They may learn from you: if you have detected a level 1 spy with a given IP/port/nodeID and you try to query it with a different IP/port/nodeID, the level 1 spy is likely not to answer.
Now, if you keep trying with the initial IP/port/nodeID and send get_peers requests for a given infohash, the level 1 spy usually answers with the same peers, but sometimes it replies with different peers or does not reply.
If you change the infohash randomly it will keep answering the same way, unless the random infohash is too far from the initial one, in which case it will not answer.
So these peers are definitely spies, because it is impossible that the level 2 spies announced all possible infohashes to them.
Most probably they are addresses belonging to some systems, which we will clarify later in this study, operated by those controlling the level 2 spies or by some party cooperating with them.
We can now assume that the level 1 spies just randomly return some peers they have encountered in the DHT, with a true or random port, whether live level 2 spies or dead ones.
To sort these peers we connect to them, by "connect" we mean to establish a TCP or uTP connection and start the bittorrent handshake.
So far we have not encountered any level 2 spy answering the bittorrent handshake correctly, whether with the fake infohash or the real one. We therefore assume that if the TCP/uTP connection to a given IP/port is successful, the peer is indeed a level 2 spy waiting for you to connect in order to check which infohash you are requesting; if not, it is probably a random non working peer or a former level 2 spy.
This is surprising, because it means that the spies do not even check that you will request pieces. They apparently just assume that connecting to them with a given infohash is enough to conclude that you are downloading the associated torrent, and maybe to generate automatic take-down notices, while the user could have connected to other peers just to find out what the torrent was about.
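To make concrete what a peer reveals by merely starting the handshake, here is a minimal sketch of the standard 68-byte bittorrent handshake message (length byte, protocol string, 8 reserved bytes, 20-byte infohash, 20-byte peer id). The function names are hypothetical and the reserved bytes are simply left at zero in this illustration.

```python
from typing import Optional

PSTR = b"BitTorrent protocol"
HANDSHAKE_LEN = 1 + len(PSTR) + 8 + 20 + 20  # 68 bytes


def build_handshake(infohash: bytes, peer_id: bytes) -> bytes:
    """Build the handshake a connecting peer sends first."""
    if len(infohash) != 20 or len(peer_id) != 20:
        raise ValueError("infohash and peer_id must be 20 bytes each")
    return bytes([len(PSTR)]) + PSTR + bytes(8) + infohash + peer_id


def infohash_from_handshake(data: bytes) -> Optional[bytes]:
    """What the receiving side learns: the requested infohash, or None."""
    if len(data) < HANDSHAKE_LEN or data[0] != len(PSTR) or data[1:20] != PSTR:
        return None
    return data[28:48]
```

The receiver can read the infohash without ever replying, which is consistent with the observation that the spies learn what was requested but nothing about whether any piece was downloaded.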
When we checked the intersection between level 2 spies (350 000) and level 1 spies (90 000), we only found 500 in common.
Then we checked some well known level 2 spies and looked in the logs of the two campaigns for who returned them. We found 11 000 nodes from all countries, which looked like a high number for colluding spies.
So we checked again the above, we selected some of these nodes and sent to them several get_peers requests with random infohashes, most of the time they did return some well known spies as peers having the random infohashes, but not always the same nodes, sometimes they did not answer, and sometimes they replied with peers from other countries.
Our peer sending the requests did not do anything inside the DHT previously, it just sent directly the get_peers request to the level 1 spies with the random infohashes, which then could not be known by anybody, therefore, again, some level 1 nodes are indeed spies, or the level 2 spies did announce all the possible infohashes to them, which is impossible.
For unknown reasons, the level 2 spies sometimes accept the TCP/uTP connections, sometimes don't, in any case they almost never answer to the bittorrent handshake message and close the connection.
Experimentation 3

Experimentation 2 showed that spies do not announce themselves to a node close to the fake infohash.
They don't do this because they would then be easy to detect, you would just have to walk the DHT to find the closest nodes to the fake infohash and ask them who is pretending to have it, the returned values would be obvious spies.
Then maybe, while seeing an unknown infohash (the fake infohash), they just send get_peers requests to the closest nodes with the fake infohash to monitor who is downloading it.
We did set a nodeID close to a famous torrent while another process was sending requests for a fake infohash close to the torrent infohash.
Again, we did see a lot of requests for the torrent infohash but absolutely none for the fake infohash, so the spies are not doing this, probably because they would be easy to detect as well, only obvious spies would be sending requests for the fake infohash.
Experimentation 4

Given the high number of level 1 spies detected for well known spies, maybe they are not all spies, which would mean that level 2 spies sometimes announce themselves with the fake infohash; but as we have seen before, they don't trivially do this to the closest nodes.
For this experiment the target was to catch a spy announcing the fake infohash to our torrent-live peer, which is not easy given the size of the bittorrent network and given the fact that we don't know on what criteria the spies might choose the peers to announce to.
We selected a large torrent infohash, ran torrent-live waiting for a notorious spy to be returned by a peer for the associated fake infohash, then we set a nodeID close to the nodeID of this peer to set a realistic distance between the fake infohash and our nodeID (not too far, not too close), launched torrent-live with this nodeID (peer 1) and a second process (peer 2) running torrent-live normally, so spreading widely in the DHT the existence of a fake infohash, hoping that some well known spies will choose peer 1 to announce the fake infohash.
A peer 3 running with random nodeIDs and making lookups for random infohashes was used to get the latest information about our targeted well known spies, ie to retrieve their ports and nodeIDs from the nodes information returned by the get_peers requests.
Then, to motivate our targeted spies, peer 1 periodically sent them some find_node requests with dummy values for nodeIDs to make sure that they knew about its nodeID.
After some time we could observe that well known (and targeted) level 2 spies started to announce infohashes to peer 1, they are announcing infohashes corresponding to real torrents and corresponding to nothing, or dead torrents for a non negligible part.
Note:
While running our peer 2 above we did notice that it was detected by a usual bittorrent client that we asked to open the fake infohash torrent.
This is not supposed to be possible since peer 2 does not announce anything.
This is normal in fact, the bittorrent client sent a get_peers request to peer 1 followed by an announce_peer request for the fake infohash, then peer 2 detected it as a spy and started a bittorrent handshake with it, so peer 2 appeared in the client list for the fake infohash torrent of the bittorrent client.
It has to be noted that the bittorrent clients announce themselves right away for a given torrent, even if they don't know what it is and have no pieces.
Note 2: interesting to know too, if we ask directly an infohash to a usual bittorrent client that has it, this one does not return itself but other nodes to query.
(end notes)
We were not successful in attracting the well known spies to announce the fake infohash to peer 1, the conclusion is that they announce some torrents that exist or did exist but they probably do not announce the fake infohash until they get some indications that it should be monitored.
In case we suspect that the spies could spoof the DHT to avoid being easily detected, by making fake announcements and inflating the apparent base of level 2 spies, it must be noted that it is difficult for a peer to announce someone other than itself (peer x). It would first have to spoof the IP address of peer x to send a get_peers request to another peer, then send the subsequent announce_peer message with the right token received in the get_peers answer, which the initiating peer would have a hard time receiving since it spoofed the IP address.
The conclusion is that level 2 spies are returned only by level 1 spies for fake infohashes.
Spies in swarms

There can be another category of spies: those that participate in a swarm correctly (i.e. behaving like normal peers and sending pieces) or incorrectly (i.e. always sending the same piece, sending incorrect ones or doing something equivalent). Their behavior is described in [3], which shows that most probably none of them behave normally, since it is unlikely that they would contribute to the copyright infringement. We can suspect that the monitoring companies are using the same level 2 spies for this job.
Running torrent-live on different highly monitored torrents with our real-time blocklist, we found approximately 1% of spies inside the swarms, but this is just an indication since we did not test this extensively.
Summary

Level 1 spies return level 2 spies and other peers, and they are definitely spies. While in some cases the correlation between level 1 and level 2 spies is obvious (same subnet, same companies operating the IP addresses), in most cases it is not: as we have seen, the intersection between the two categories is very small. We will explain this later.
Level 2 spies seem not to announce any fake infohashes; they announce real infohashes, including outdated or maybe non-existent ones. So to detect a level 1 spy for sure we just need to send it some get_peers requests with different non-existent infohashes: if it answers several requests, it is a spy.
Level 2 spies do not send get_peers requests for infohashes they don't know.
Level 2 spies accept TCP/uTP connections but never answer to the bittorrent handshake, they can only know what was requested and can not make sure the querying peer will download the torrent.
Level 3 spies might exist: they are passive nodes setting a nodeID close to the monitored torrent and watching for get_peers and announce_peer requests from the peers downloading the torrent, but this is not enough to prove that these peers did try to download the torrent.
Level 4 spies might exist: they are peers behaving normally or quasi normally in a swarm, or peers not participating in swarms but just connecting to peers detected by level 2 spies. They cannot be detected unless we can correlate their activity among different swarms and detect it as suspicious, as described in [3]. But we can suspect that they are also level 2 spies, since level 2 spies do announce infohashes, and are therefore detectable, even if this study might suggest the contrary since level 2 spies never answer the handshakes correctly, even with the real infohashes; we have not tested the latter widely.
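The level 1 test summarized above (a node that returns values for several infohashes that cannot exist is a spy) can be sketched as follows. Everything here is an assumption made for illustration: `get_peers` stands for any function that sends a get_peers request to the suspected node and returns the list of values it answered with, and the number of probes, the answer threshold and the two shared leading bytes are arbitrary choices. Probes are kept near a base infohash because level 1 spies were observed not to answer for infohashes too far from the initial one.

```python
import os
from typing import Callable, Sequence, Tuple

Peer = Tuple[str, int]
GetPeers = Callable[[bytes], Sequence[Peer]]


def nearby_infohash(base: bytes, shared_bytes: int = 2) -> bytes:
    """Random infohash keeping the first shared_bytes bytes of base."""
    return base[:shared_bytes] + os.urandom(20 - shared_bytes)


def looks_like_level1_spy(get_peers: GetPeers, base: bytes,
                          probes: int = 5, min_answers: int = 2) -> bool:
    """True if the node returns values for several non-existent infohashes."""
    answers = 0
    for _ in range(probes):
        # A freshly drawn random infohash cannot have been announced by anyone.
        if get_peers(nearby_infohash(base)):
            answers += 1
    return answers >= min_answers
```

A normal node would only return "nodes" (closer peers) for such probes, never "values", so it would not reach the threshold.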
Changes in the initial method

Add the following steps to the initial method:

walk the DHT with the fake infohash
ignore the peers returned (by level 1 spies) until you reach the closest peers, register both in a temporary list
send get_peers requests to the level 1 spies in the temporary list with different infohashes not too far from the fake infohash, if the level 1 peers keep returning peers, register them in the blocklist, add the returned peers in the temporary list as level 2 spies
try to connect several times to each level 2 spies in the temporary list, if it succeeds register the spy in the blocklist
eliminate from the blocklist the spies among the closest nodes
send get_peers requests with the real infohash to the remaining closest nodes

or

among the closest nodes eliminate those that seem too far (avoid spies that might have announced the fake infohash) and too close (avoid level 3 spies) and send get_peers requests with the real infohash to them.

This last precaution is probably not necessary, we can expect that the peers acting as level 3 or level 4 spies are acting too as level 2 spies, so have been detected previously.
Then we launched a third campaign based on this model.
Results

After one day we got 35 000 addresses for level 1 and level 2 spies, but the total was still not converging and kept increasing.
We concentrated on level 1 spies. They seem to be ephemeral: they reply at the time they are detected but no longer reply some time after, new ones that were unknown in all of our campaigns keep appearing, and very few are in common with the level 2 spies.
So we believe that they belong to some systems that can use a wide range of IP addresses and change them quickly, which we will clarify later in this study, each address being responsible for monitoring some part of the DHT and for returning the level 2 spies, all this under the supervision of the copyright enforcers.
We decided to skip the level 1 spies and to register only level 2 spies in the above method, since level 1 spies are not intrinsically dangerous: they can only know that you might have looked for a given content but cannot know what happened next. We say "might" because, in the context of torrent-live, which does not send announce messages, someone else could have spoofed your IP address and sent a get_peers message on your behalf. They are not dangerous for normal users either, for the reason explained above. The method becomes:

walk the DHT with the fake infohash
ignore the level 1 spies
try to connect several times to each level 2 spies returned by the level 1 spies, if it succeeds or if the level 2 spy closes normally the connection, register the spy in the blocklist
eliminate from the blocklist the spies among the closest nodes
send get_peers requests with the real infohash to the remaining closest nodes

After three days, the list counted 45 000 addresses and was still increasing, only 900 addresses were in common with the intersection of level 2 spies of the first two campaigns, 1300 addresses were in common with the first campaign and 2300 with the second campaign, and 2000 for the third campaign.
This means that the level 2 spies might be ephemeral too and might change their IP addresses.
We started a fifth campaign identical to the previous one, with the exception that a check was added to periodically test the level 2 spies (every hour, retrying 3 times for each if necessary). The test was again to initiate a bittorrent handshake with the latest IP/port known for them and see whether we could at least connect to them; if not, the spies were removed from the list. The difficulty is that they might change only their port numbers, and not their complete IP address, so we might remove some spies by mistake. But in this process, while encountering the same spies again and again, we kept updating the port numbers the spies were listening on.
We ran two processes with notorious torrents on different servers during 3 days and observed that the lists of alive spies did stabilize and oscillate between 1000 and 2000 spies with a peak over the week-end.
The intersection of both lists only gave 30 addresses in common, and we noticed that some well known spies were missing and had been removed by the periodic check. This is due to what we observed in experimentation 2, where level 2 spies sometimes accept a connection and sometimes don't. In addition, it seems that they become suspicious if you try to connect to them on ports they are not listening on, and then refuse all connections from you.
But we detected a mistake in our method: while torrent-live launched several swarms with different infohashes, the level 2 spies were only tested with the infohash of the first swarm. Testing them again showed that they monitor a specific part of the DHT and do not answer if the requested infohash is outside of it.
We modified torrent-live so the level 2 spies are tested with the infohash that was used to retrieve them with a periodical check every 4 hours and 3 attempts for each if necessary.
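The periodic check can be sketched as a small blocklist structure. This is a minimal illustration and not the torrent-live code: the class `Blocklist` is hypothetical, `connect` stands for any function that tries to open a TCP/uTP connection to a spy and start the handshake with the given infohash, and scheduling the recheck (every 4 hours in the text) is left to the caller.

```python
from typing import Callable, Dict, List, Tuple

# connect(ip, port, infohash) -> True if the connection could be established
Connect = Callable[[str, int, bytes], bool]


class Blocklist:
    """Level 2 spies, each tested with the infohash that revealed it."""

    def __init__(self, connect: Connect, attempts: int = 3) -> None:
        self._connect = connect
        self._attempts = attempts
        self._spies: Dict[str, Tuple[int, bytes]] = {}

    def add(self, ip: str, port: int, infohash: bytes) -> None:
        """Register a spy, or update its latest known port."""
        self._spies[ip] = (port, infohash)

    def __contains__(self, ip: str) -> bool:
        return ip in self._spies

    def recheck(self) -> List[str]:
        """Drop spies that fail all connection attempts; return removed IPs."""
        removed = []
        for ip, (port, infohash) in list(self._spies.items()):
            alive = any(self._connect(ip, port, infohash)
                        for _ in range(self._attempts))
            if not alive:
                del self._spies[ip]
                removed.append(ip)
        return removed
```

Calling `add` again for an IP already present overwrites its port, which mirrors the way port numbers were kept up to date as the same spies were encountered again.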
A sixth campaign was started with two different servers using 5 torrents each among the top 100 torrents of The Pirate Bay [6] website.
The number of spies stabilized around 10000 for both servers with a constant average of 600 spies in common and a quasi linear progression of the level 1 and level 2 spies discovered over two days but, again, some well known spies got removed by the periodical check.
The following graph shows the evolution of alive spies (blocklist) and the total discovered so far for both servers.
Spies distribution in the DHT

We deduce from the previous section that the level 2 spies are sensitive to infohashes too: they watch a dedicated part of the DHT, which is probably why the intersection between the spies detected by both servers is far from the total for each.
Then we looked again at the well-known spies. They are returned as "nodes" by other nodes (answering to get_peers requests) with different ports and different nodeIDs (close to the requested infohash) for the same IP. A normal peer could run different bittorrent clients and have the same behavior but this does not look like a normal use. This is dynamic and their ports/nodeIDs keep changing, that's maybe why they do not answer any longer after some time and got removed by the periodical check, unless they detected that the same IP (us) was polling them periodically.
In addition, when they are returned as "values" for a given infohash, the same (IP,port) advertised in the DHT as a node close to the given infohash is used for the bittorrent protocol, but we can observe that a well known spy does not trivially return another well known spy as "values".
This may not reflect the behavior of all the spies, but doing this requires some organization that would probably be difficult to operate if their IP addresses kept changing, so we decided to add the following rule:

tag as permanent spies the level 2 spies that use several ports/nodeIDs for the same IP, and exclude them from the periodic check
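This rule can be sketched as a grouping of observations by IP address. The sketch rests on assumptions: the function name `permanent_spies` is hypothetical, "several" is read as at least 2, and an IP is tagged only when both its ports and its nodeIDs vary. The input is meant to contain only sightings of already identified level 2 spies, since several ordinary clients behind one address could show the same pattern.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Observation = Tuple[str, int, bytes]  # (ip, port, nodeID)


def permanent_spies(observations: Iterable[Observation],
                    min_distinct: int = 2) -> Set[str]:
    """IPs seen with several distinct ports and several distinct nodeIDs."""
    ports: Dict[str, Set[int]] = defaultdict(set)
    node_ids: Dict[str, Set[bytes]] = defaultdict(set)
    for ip, port, node_id in observations:
        ports[ip].add(port)
        node_ids[ip].add(node_id)
    return {ip for ip in ports
            if len(ports[ip]) >= min_distinct
            and len(node_ids[ip]) >= min_distinct}
```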

Now, what do we mean by "close" and how can the spies be distributed to watch all the DHT?
The more bits a nodeID and an infohash have in common in their prefix, the closer they are, knowing that beyond a certain number of bits in common it becomes unlikely that the peers are real ones.
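Counting the bits in common can be sketched as follows, together with a histogram and an average over a set of (nodeID, infohash) pairs. This is an illustration only; the function names are hypothetical and identifiers are assumed to be 20-byte strings.

```python
from collections import Counter
from typing import Iterable, Tuple


def common_prefix_bits(a: bytes, b: bytes) -> int:
    """Number of leading bits shared by two 160-bit identifiers."""
    if len(a) != 20 or len(b) != 20:
        raise ValueError("identifiers must be 20 bytes (160 bits)")
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # The highest set bit of the xor marks the first differing bit.
    return 160 - diff.bit_length()


def prefix_distribution(pairs: Iterable[Tuple[bytes, bytes]]) -> Counter:
    """Histogram: shared prefix length -> number of (nodeID, infohash) pairs."""
    return Counter(common_prefix_bits(n, i) for n, i in pairs)


def mean_prefix_bits(dist: Counter) -> float:
    """Average shared prefix length over a non-empty histogram."""
    total = sum(dist.values())
    if total == 0:
        raise ValueError("empty distribution")
    return sum(bits * count for bits, count in dist.items()) / total
```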
For the bittorrent network, article [7] estimates the size of the DHT at ~8.4 M nodes (2^20 * 8), based on the assumption that a typical routing table has about 20 of its top buckets full, which means that the key space is dense enough to contain 8 nodes for every combination of the top 20 bits of nodeIDs. It concludes that a spy with 8.4 M IPs could watch any infohash, knowing that a /8 IP block [8] gives access to 16.7 million IPs.
We did extract from the logs of all the campaigns (150 GB) the "nodes" answers to get_peers requests that we recorded walking the DHT with different nodeIDs, giving us a file of 22 GB, then we compiled the number of bits in common between the nodeID and the infohash for each request and got the following distribution: prefix 160
The number of nodeIDs is not really relevant since it appears that 71 different infohashes were used and a lot of duplicates are there but that's comparable to the level 1 spies distribution below, ignoring the first 4 bits since we assume that these peers are returned by peers that don't have their routing tables correctly populated yet or are misbehaving (like the spies), we see an average of 13.17 bits in common.
Eliminating the duplicates, the curve gives: prefix 160
Since the closest nodes are the same from a certain point of the lookups onward, this curve represents mainly the distance between the infohashes and the first nodes of the lookups.
We also compiled the 20 closest nodes retrieved for 220 000 random infohashes, and got the following distribution: prefix_closest
Out of 2 M closest nodeIDs found by get_peers requests and 1 M closest nodeIDs found by find_node requests, 95% are in the range of 20 to 25 bits in common and 99% in the range of 20 to 27 bits, the average being 22.2 bits.
Both distributions are identical. An extrapolation is that the size of the DHT (which is not the subject of this study) is at least 2^20 * 20 (~20 M), since the space is dense enough to get at least 20 bits in common for each of the 20 closest nodes. Empirically we can see that usually 30 to 40 nodes can be found with 20 bits in common, so the size of the DHT should be at least 20 to 40 million nodes.
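
The arithmetic behind these estimates can be checked with a few lines. This is only a worked example of the figures quoted above; the function name is an assumption.

```python
def dht_size_estimate(nodes_per_prefix: int, prefix_bits: int = 20) -> int:
    """Lower bound on the DHT size if every prefix of `prefix_bits` bits
    is populated by at least `nodes_per_prefix` nodes."""
    return (2 ** prefix_bits) * nodes_per_prefix

estimate_bep42 = dht_size_estimate(8)    # estimate of [7]: 8 nodes per 20-bit prefix
estimate_low = dht_size_estimate(20)     # 20 closest nodes with 20 bits in common
estimate_high = dht_size_estimate(40)    # up to 40 nodes observed with 20 bits in common
slash8_addresses = 2 ** (32 - 8)         # IPv4 addresses in a /8 block

print(estimate_bep42, estimate_low, estimate_high, slash8_addresses)
# prints: 8388608 20971520 41943040 16777216
```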
Now, looking at the level 1 spies (ie those returning values to get_peers requests for fake infohashes), we looked at the distance between their nodeID and the requested infohashes (ie their nodeID returned as a node following a get_peers request for a given fake infohash) and got the following distribution: prefix_level 1
On a set of 60 000 level 1 spies (retrieved with the same method used for campaign 7 below, running 64 fixed infohashes and walking the DHT with random nodeIDs), ignoring the first 4 bits, we found an average of 15.41 bits in common with the infohashes and 60% in the range of 15 to 21 bits in common.
We can see that the curve differs from the average distribution: very few have 5 bits in common, whereas this is the majority for the usual first nodes of the lookups. This indicates that the level 1 spies are rarely among the first nodes of the lookups, if we leave aside those that have less than 4 bits in common.
They are "better" distributed in the DHT than the average distribution and therefore cannot be detected based on their nodeID only. The exception is those in the range of the first 4 bits in common, which are usually returned by the DHT bootstrap servers and announce peers right away while their nodeID is very far from the requested infohash (usually only the first bit in common). They can be eliminated on this basis, but this is not really useful since the majority have nodeIDs that match the infohashes. When starting the DHT lookup we can observe that the nodeIDs are very far from the infohashes at the beginning (first 4 bits of the curve), then come closer and closer with time (the 20 to 24 bits of the curve). The level 1 spies are probably learning from us and positioning themselves in the DHT with a plausible nodeID.
We extracted from the logs the nodeIDs of the well-known level 2 spies returned as "nodes" to get_peers requests, with the associated infohashes, over one week (11 MB, 21901 nodeIDs). Ignoring the first 4 bits again (14420 remaining nodeIDs), the distribution of the common bits in the prefix showed an average of 6.45 bits, with 96% between 5 and 10 bits.
See the graph: nodeIDs distribution
So those spies must be implementing fewer than 1024 (2^10) different nodeIDs.
This gives them almost no chance of being among the 20 closest nodes, unless they set specific nodeIDs for highly monitored torrents.
They probably do not care about appearing in the closest nodes, since they will announce the highly monitored torrents or be returned by a level 1 spy, and then just wait for peers to connect to them.
We do not know how often they rotate their nodeIDs (they do rotate them, otherwise our periodic check would not have removed them from the blocklist and we would not have collected so many different nodeIDs). Taking a snapshot over 12 hours of the nodeIDs detected for some given well-known spies (by get_peers and find_node queries; remember that while we only run 5 notorious infohashes per server, torrent-live crawls the DHT changing its nodeID each time) gave an average of 25 nodeIDs per spy. Knowing that there are about 40 IP addresses of well-known spies, this gives something like 1000 nodeIDs over 12 hours, and a rotation every 3 to 6 hours.
So, running torrent-live with 1024 infohashes filling a prefix of 10 bits should give us good chances to discover most of the level 2 spies, but since the level 1 spies are very well distributed maybe we don't need as many infohashes.
We first launched a seventh campaign with two servers, one running 64 infohashes (filling the first 6 bits of the prefix) and the other running 256 infohashes (filling the first 8 bits of the prefix).
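
A sketch of what "filling" a prefix could look like: one infohash per possible value of the first n bits, the remaining bits being random. How torrent-live actually builds its infohashes is not described here, so the random tail and the function name are assumptions.

```python
import os
from typing import List

def infohashes_filling_prefix(prefix_bits: int) -> List[bytes]:
    """Return 2**prefix_bits infohashes, one for each value of the first
    `prefix_bits` bits; the remaining bits are random (assumption)."""
    if not 0 <= prefix_bits <= 160:
        raise ValueError("prefix_bits must be between 0 and 160")
    tail_bits = 160 - prefix_bits
    hashes = []
    for prefix in range(2 ** prefix_bits):
        # Random value that fits in the remaining tail_bits bits.
        tail = int.from_bytes(os.urandom(20), "big") >> prefix_bits
        hashes.append(((prefix << tail_bits) | tail).to_bytes(20, "big"))
    return hashes

server1 = infohashes_filling_prefix(6)   # 64 infohashes
server2 = infohashes_filling_prefix(8)   # 256 infohashes
```

With 10 bits the same function would give the 1024 infohashes mentioned above.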
The results are shown on these graphs: campaign 7 level 2 and campaign 7 all
Note that the level 1 spies and the total of level 2 spies discovered again follow a surprisingly quasi-linear curve. All of the well-known French spies were in the server 2 list.
Again we see that the number of alive spies stabilized after some time, around 10 000 for server 1 and 30 000 for server 2. During this campaign we lost some data between hours 40 and 50 for server 2. At the end of the campaign we found 3000 spies in common.
But our goal was to find more of the server 1 level 2 spies in common with the server 2 level 2 spies, which would tell us that the method catches all the spies.
This was not the case, so we launched an eighth campaign, identical except that server 2 was now running 1024 to 2048 infohashes covering the first 11 bits of the prefix.
The first 140 hours of the campaign were a bit erratic, since the server 2 processes were crashing too often to test the level 2 spies periodically (each hour) correctly. Nevertheless the overall picture of the campaign still shows a linear increase of the total spies discovered (slightly disturbed on the graph by the repeated crashes): campaign 8
Server 2 stabilized after hour 140 and we can see that the number of alive level 2 spies oscillated around 12 000 for server 1, 18 000 for server 2, and between 2500 and 3000 for the spies in common between both servers: campaign 8
Server 2 crashed at hour 184 and we lost the logs. It was restarted running 2048 infohashes, with a fix so that it does not crash again.
While waiting for the results, we compared on server 1 the level 2 spies over periods of 5 and 3 days (between hours 118 and 238): we got respectively 2475 and 3910 in common, and comparing both sets we found 2157 in common.
Comparing the latter with the level 2 spies in common between server 1 and server 2 at hour 238, we got 704 spies in common.
And then comparing the 2475 and 3910 spies with the level 2 spies in common at hour 238, we got respectively 848 and 1198 spies in common.
This seems to suggest that there is a solid kernel of around 1000 level 2 spies and that there are no more than 3000 of them, which is the number that shows up repeatedly for the spies in common between both servers in all of our stats.
This is confirmed again by the rest of the campaign: campaign 8 - continuation, campaign 8 - level 2 and campaign 8 - common spies
We see that the total number of level 2 spies oscillates around 10 000 for both servers, so adding infohashes on server 2 did not really improve what we knew from server 1. The number of spies in common between both servers oscillates around 2500.
The last graph, where we compared different things regarding the spies in common on both servers and between them, shows that their IP addresses change, unless the total number of level 2 spies is less than ~1000.
At last, did we get them all?

As we have seen before, one way to detect a level 2 spy is to check that it accepts a TCP or uTP connection but does not answer the handshake.
However, normal bittorrent clients behave in exactly the same way if they receive a handshake for an infohash that they do not have.
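
A minimal sketch of such a check over TCP only (uTP is not covered). The 68-byte handshake layout is the standard bittorrent one; the function names, the timeout and the result labels are assumptions. As noted above, a "silent" result alone does not prove that the peer is a spy.

```python
import os
import socket

PSTR = b"BitTorrent protocol"

def build_handshake(infohash: bytes, peer_id: bytes) -> bytes:
    """Standard 68-byte handshake: pstrlen, pstr, 8 reserved bytes, infohash, peer_id."""
    if len(infohash) != 20 or len(peer_id) != 20:
        raise ValueError("infohash and peer_id must be 20 bytes each")
    return bytes([len(PSTR)]) + PSTR + bytes(8) + infohash + peer_id

def probe_peer(ip: str, port: int, infohash: bytes, timeout: float = 5.0) -> str:
    """Send a handshake and report how the peer reacts.

    Returns "unreachable", "silent" (accepted, no answer before the timeout),
    "closed" (accepted, then closed) or "answered".
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            sock.sendall(build_handshake(infohash, os.urandom(20)))
            try:
                data = sock.recv(68)
            except socket.timeout:
                return "silent"
            except ConnectionError:
                return "closed"
            return "answered" if data else "closed"
    except OSError:
        return "unreachable"

handshake = build_handshake(bytes(20), b"-XX0000-" + bytes(12))
```

A peer that stays "silent" for an infohash that does not exist behaves like the level 2 spies described here, but so does an ordinary client, hence the need for the additional checks that follow.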
The conclusion is that we do not need to perform any more campaigns to get all the spies: as we have seen, the number of spies in common between both servers always oscillates around 3000, whatever number of infohashes we use to detect them.
The spies are in there; the others, which are not in common between both servers, are probably ordinary bittorrent clients returned by the level 1 spies.
This means that the level 1 spies are substituting the identity of normal users, which we thought up to now was a marginal phenomenon for legal reasons, but apparently it is not. They could claim that these peers did announce our infohashes to them, but this is quite unlikely since the infohashes we are using do not exist in the bittorrent network, so we can question the legal aspects of what they are doing.
To prove this, we recalled the Torrentfreak article [11] and used BTindex [12], which allows estimating how many torrents can be linked to a given IP address. We will not comment on the privacy aspects of BTindex, but testing the IP addresses of supposed level 2 spies, in or out of the 3000 addresses common to server 1 and server 2, that we suspected not to be real spies usually gives something like: Not spy
And testing IP addresses that we really think are spies (here one of the rare peers that pretend to have whatever torrent we sent) gives something like: Spy
As we can see, the first peer is participating normally in a few torrents while the other one is participating in too many torrents to be a real peer. Even if we could suspect that BTindex did not catch everything, the differences between both peers in the above examples are so big that the conclusion is obvious. Since normal peers are usually behind a NAT, they can change their IP address, so the BTindex history might not correspond to the same peer and the BTindex snapshot might not correspond to the time we identified the peer, but we tested this with many peers and the BTindex history gives evidence that they were never spies.
As explained previously, it is a priori not easy to say who among the 3000 supposed spies is a spy and who is not, since they behave identically. But if we trust that BTindex is really crawling the DHT efficiently, we could sort them again using it or an equivalent service. Such a service can be used to do the contrary too, ie identifying you as a potential downloader without any proof of it or, worse, unveiling to everybody what you downloaded.
Checking BTindex for the spies in common between both servers at hour 268 of campaign 8 (2387 spies), 1513 were unknown to BTindex, 218 were related to less than 100 torrents and 545 to less than 1000 torrents, which leaves ~300 to 650 spies related to an abnormal number of torrents. Among those, the well-known spies are indeed related to thousands of torrents.
Anyway, here is our dynamic blocklist, even if some peers in it are not monitoring spies at all and even if we might be blocking large ranges of users, for example for ISPs behind a NAT or VPNs.
We cannot prove this firmly, but we think that we got them all at this stage: all along the campaigns we saw addresses from all countries, and the same numbers in the range of 1000 to 3000 kept coming back.
But we cannot exclude that the spies might act depending on your IP location, which we should test by running torrent-live from different countries.
Now, running the same method twice on the same server at the same time gives the very same results: the spies in common between both processes are fewer than 3000, which suggests that there is no correlation between the level 2 spies returned and the IP address used.
We do not think that the spies filter based on location but, even if we saw previously that adding infohashes did not seem to improve our knowledge of the spies, we cannot exclude that some of them position themselves to monitor dedicated torrents.
Therefore our crawlers should continuously explore the 2^20 infohashes/nodeIDs space (20 bits in common in the prefix) of the bittorrent network, possibly faster than the monitors switch IPs. This does not prevent missing a newcomer until it is detected, but combining the dynamic blocklist with the torrent method makes the probability of encountering a spy almost null.
Who is Number One?

We know who the level 2, 3 or 4 spies are: they are monitoring companies using IP addresses that they own, otherwise the result of their monitoring would have no value. But we do not know who the level 1 spies are.
We have shown above that they are well distributed, too well distributed in fact, which seems to suggest that there are millions of them.
...undisclosed...
... in order to attract the requesting peers and redirect them toward a level 2 spy, this looks very similar to what is described in [9].
But since the beginning of this study we have been wondering about them: they seem to behave like normal peers and new ones show up all the time, and while trying to correlate their addresses, ASes, traceroutes, countries, etc, we could not find anything showing that they could be linked.
...undisclosed...
Undisclosed

...undisclosed...
Deanonymizing the VPN, Tor and proxies peers

...undisclosed...
Level 2 spies polling

We observed that the level 2 spies announce periodically, which again cannot be the behavior of a normal peer. They do this to make sure that they appear first in the answers to get_peers requests, so we add this last rule to the method:

disregard the first peers returned by the selected closest nodes (5/10% of the swarm)
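
A sketch of this rule, assuming the peers are kept in the order in which the closest nodes returned them. The function name, the integer percentage and the rounding up are assumptions.

```python
def drop_first_peers(peers, percent=10):
    """Drop the first `percent` % of the peers (rounded up), keeping the order.

    The rule above suggests 5 to 10% of the swarm.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    skip = (len(peers) * percent + 99) // 100   # integer ceiling
    return list(peers[skip:])

swarm = [("198.51.100.%d" % i, 6881) for i in range(40)]
kept = drop_first_peers(swarm, percent=10)
```

With a swarm of 40 peers and 10%, the first 4 peers returned are ignored and the remaining 36 are kept.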

DHT Security extension

...undisclosed...

ignore the peers that do not follow the "DHT security extension (bep42)", ie the peers whose nodeIDs are not tied to their IP addresses
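
A sketch of how such a check could look for IPv4, following our reading of the extension described in [7]: the top 21 bits of the nodeID must equal the top 21 bits of the CRC32C of the masked IP address combined with 3 bits taken from the last byte of the nodeID. The pure-Python CRC32C and the function names are assumptions, and the exemptions that the extension grants to local network addresses are not handled.

```python
import ipaddress

def _crc32c_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _crc32c_table()

def crc32c(data: bytes) -> int:
    """CRC32C (Castagnoli), bitwise-reflected, table-driven."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

IPV4_MASK = 0x030F3FFF

def expected_prefix(ip: str, r: int) -> int:
    """Top 21 bits that the nodeID of `ip` must have for the 3-bit value `r`."""
    masked = (int(ipaddress.IPv4Address(ip)) & IPV4_MASK) | ((r & 0x7) << 29)
    return crc32c(masked.to_bytes(4, "big")) >> 11

def follows_bep42(ip: str, node_id: bytes) -> bool:
    """True if the first 21 bits of `node_id` are tied to `ip`."""
    if len(node_id) != 20:
        return False
    r = node_id[19] & 0x7
    return int.from_bytes(node_id[:3], "big") >> 3 == expected_prefix(ip, r)
```

A client applying the rule would then simply skip any (ip, nodeID) pair for which `follows_bep42` returns False.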

Tor:

If normal bittorrent users are trying to hide using the Tor network [6], maybe the spies are doing the same.
Tor does not support UDP, so it cannot talk to the DHT, and a Tor user can only connect to others, not wait for someone to connect to it, unless UDP or TCP is tunnelled through Tor to a SOCKS proxy (see [17] for an example). This seems a useless additional effort for the spies since they already use direct proxies, so we do not really see what use the Tor network could be for the spies considered in this study, or for any of them in fact: the Tor network is far too small compared to the number of proxies.
Out of curiosity we checked how many Tor nodes we encountered during this study: among the most stable Tor [6] exit nodes (+/- 1000), only 4 were found as level 2 spies during all of our campaigns and absolutely none as level 1 spies.
The Tor exit nodes found were returned by level 1 spies choosing a peer that was trying to hide with the Tor network; this indicates again that level 1 spies are substituting the identity of normal users.
Maintain the blocklist

The IP addresses and ports of the spies change. It is obvious from this study that usual blocklists like [10] cannot work and will end up blacklisting all the IP addresses. In order to maintain the blocklist, torrent-live must run continuously to renew and test the spies.
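
A minimal sketch of such a maintenance loop, assuming an hourly re-test of the non-permanent spies as in the campaigns above. The class, its method names and the `is_still_spy` callback are hypothetical and do not describe torrent-live's implementation.

```python
import time

class DynamicBlocklist:
    """Blocklist whose entries are re-tested periodically, except permanent ones."""

    def __init__(self, recheck_interval=3600.0):
        self.recheck_interval = recheck_interval
        self.entries = {}   # (ip, port) -> {"permanent": bool, "checked": float}

    def add(self, ip, port, permanent=False, now=None):
        now = time.time() if now is None else now
        self.entries[(ip, port)] = {"permanent": permanent, "checked": now}

    def recheck(self, is_still_spy, now=None):
        """Re-test the entries that are due; drop those that no longer look like spies.

        `is_still_spy(ip, port)` is a hypothetical callback, for example a
        handshake probe.
        """
        now = time.time() if now is None else now
        for key, entry in list(self.entries.items()):
            if entry["permanent"] or now - entry["checked"] < self.recheck_interval:
                continue
            if is_still_spy(*key):
                entry["checked"] = now
            else:
                del self.entries[key]

    def is_blocked(self, ip):
        return any(entry_ip == ip for entry_ip, _ in self.entries)
```

Calling `recheck` from a timer keeps the list dynamic: addresses that stop behaving like spies leave it, while the spies tagged as permanent stay.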
Conclusion

What the spies are doing might not be enough to detect precisely what the bittorrent users are doing: they accept connections to receive the first message of the bittorrent handshake, which tells them what the users are requesting, but they do not answer it. During all the campaigns we only saw a dozen spies answering the handshake, and only a very few claiming in their answer to have the fake infohash we were requesting.
They watch some dedicated parts of the bittorrent space and change IP addresses constantly. They pollute the DHT by returning the identity of peers that have nothing to do with what was requested or with monitoring activities.
The level 1 spies are well distributed ...undisclosed... in order to attract the users and redirect them toward a final spy.
The number of final spies is difficult to estimate precisely, but it is less than 3000, with probably a kernel of about 1000 spies. They watch some dedicated parts of the bittorrent space and might change their IP addresses.
So they need to be constantly renewed/tested with the method resulting from our experiments, which shows that crawling the ~1 M (2^20) infohashes/nodeIDs space of the bittorrent network does allow following all of them.
This demonstrates again that the usual static blocklists cannot work and are not enough.
We do not see how a sophisticated spy could have escaped this experiment, except a level 4 spy behaving like a normal peer in the swarm, which includes sending correct pieces; this is difficult to detect unless we can correlate some abnormal behavior as explained in [3]. Another exception would be a level 4 spy not participating in the swarm but just connecting to a peer detected by a level 2 spy in order to make sure that it is downloading the torrent.
Regarding torrent-live's users, the method protects them only to a certain extent, but they are much less exposed than usual:

they would not be detected by level 1 spies, since they never request the real infohash until they have reached the closest nodes and eliminated the detected spies, unless an undetected level 1 spy is part of the closest nodes, which can be detected by sending it non-existing infohashes abnormally close to the target one.
they might be detected by a level 2 spy if, unluckily, an undetected one is among the 20 peers they connect to, but the probability is low (since the user connects to a reduced set of random peers, from which the first peers returned, which are likely to be spies, were removed, and since the method is supposed to have detected the level 2 spies). As we have repeatedly stated, this is not enough to prove that the user did download the torrent, nor that the querying IP address is really the user's.
they might be detected by level 3 spies if one of them is part of the closest nodes, but level 3 spies are not dangerous either.
they might be detected by level 4 spies if, unluckily, one of them is among the 20 peers they connect to. torrent-live will block those that behave abnormally, but it might be too late; this is probably a continuation of this study and of [3]. However, we have seen that level 2 spies do announce torrents, so they might be the level 4 spies as well. We believe that they are supposed to be stable peers that do not change their IP addresses all the time, so they are likely to be blacklisted at some point: it is difficult to imagine that they really participate in copyright infringement over a long time, so their behavior will become suspicious at some point and they will be blocked.
they cannot be caught by spies that would attempt to connect to them, since they refuse all connections.
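
The decoy infohashes mentioned in the first point can be sketched as follows: a random value that shares exactly a chosen number of leading bits with the real infohash. The function names, the default of 30 shared bits and the `get_peers` callback (returning the "values" given by one node) are assumptions.

```python
import os

def fake_infohash_near(real: bytes, shared_bits: int) -> bytes:
    """Random infohash sharing exactly `shared_bits` leading bits with `real`."""
    if len(real) != 20 or not 0 <= shared_bits < 160:
        raise ValueError("need a 20-byte infohash and 0 <= shared_bits < 160")
    tail_bits = 160 - shared_bits
    real_int = int.from_bytes(real, "big")
    prefix = (real_int >> tail_bits) << tail_bits
    # Flip the first bit after the shared prefix so the decoy never equals the real one.
    flipped = (((real_int >> (tail_bits - 1)) & 1) ^ 1) << (tail_bits - 1)
    rest = int.from_bytes(os.urandom(20), "big") >> (shared_bits + 1)
    return (prefix | flipped | rest).to_bytes(20, "big")

def returns_values_for_decoys(get_peers, real: bytes, shared_bits: int = 30, probes: int = 3) -> bool:
    """True if the node answers with "values" for infohashes that do not exist.

    `get_peers(infohash)` is a hypothetical callback querying a single node
    and returning the list of peers it gave as "values".
    """
    return any(get_peers(fake_infohash_near(real, shared_bits)) for _ in range(probes))
```

A closest node for which `returns_values_for_decoys` is True behaves like a level 1 spy and can be dropped before the real infohash is ever sent.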

Now, if everybody used torrent-live's freerider option by default, the bittorrent network would stop working, so we propose in the next section some changes in the bittorrent protocol to better protect the users.
Toward a new or modified bittorrent protocol?

We have seen that it is not difficult to intrude into the bittorrent network. The bittorrent protocol is not designed for privacy and immediately leaks what you are doing, even if in the end you do not download anything. It does not force all peers to participate and easily allows freeriding.
Some basic changes could be made to better secure the bittorrent protocol:

while running these experiments we thought about the well-known means of complicating the task of sybils in a P2P network, ie the spies would have a really hard time doing their job if they could not choose their nodeID and if their nodeID was linked to their IP address. This is exactly what is explained in [7], where the prefix of the nodeID is computed from the IP address and a random number using the CRC32C algorithm. As we have seen, it is partially implemented but not used for what it was intended for.
...undisclosed...
subsequently, the client should ignore any "values" returned until it reaches the closest nodes
and/or
the bittorrent clients should implement the "do not say to the whole world what you are looking for" feature, which consists in requesting a fake infohash close to the real one until the user reaches the closest nodes, and only then querying them with the real infohash. This would prevent everybody in the path from knowing what the user is really requesting, and it does not hurt anything in the DHT. As we have seen, the level 1 spies position themselves in the DHT according to what they want to monitor or what they see. For unknown torrents far from notorious ones it takes them some time to do so, but we can suspect that for notorious torrents they have positioned themselves correctly to be eligible for the closest nodes, so the next proposal must be added.
once the user reaches the closest nodes, it should ask for different infohashes abnormally close to the real one (like the fake infohash), ie for infohashes that do not exist. If a closest node replies with values, then it is a level 1 spy and should be ignored.
the bittorrent client should check that the closest nodes have a nodeID that is plausible for the requested infohash. According to the distribution above, we could suggest disregarding all nodes that are outside the range of 20 to 24 bits in common.
disregard the first peers returned by the selected closest nodes (5/10% of the swarm)
the bootstrap servers do not seem to really follow the protocol: they answer find_node requests with random nodes, ie they do not return the closest nodes they know to the requested nodeID (yours). Changing this, combined with the first proposal, will make it more difficult for the spies to pollute the bootstrap nodes.
the peers should not announce themselves right away after sending a get_peers request. They should at least wait until they have received the metadata for the requested content, with a prompt or equivalent ('You are going to download ... do you want to proceed?', or, why not, 'Do you want to pay something?'), because it is only from this point that we can deem that they know what they are downloading.
the peers should respond to get_peers requests with themselves if they have the requested content. This would spare the other peers useless lookups in the DHT, which possibly tell everybody what they are looking for, if they already know the peers that have the content.
of course, trackers and the peer exchange protocol must be deactivated in the bittorrent clients
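
The plausibility check on the closest nodes proposed in this list can be sketched as a simple filter on the number of bits in common. The 20 to 24 range is the one suggested above; the function name and the (nodeID, address) tuple layout are assumptions.

```python
def plausible_closest_nodes(nodes, infohash, min_bits=20, max_bits=24):
    """Keep only the nodes whose nodeID shares between `min_bits` and
    `max_bits` leading bits with `infohash`.

    `nodes` is an iterable of (node_id, address) tuples with 20-byte nodeIDs.
    """
    target = int.from_bytes(infohash, "big")
    kept = []
    for node_id, address in nodes:
        common = 160 - (int.from_bytes(node_id, "big") ^ target).bit_length()
        if min_bits <= common <= max_bits:
            kept.append((node_id, address))
    return kept

infohash = bytes(20)
nodes = [
    (b"\x00\x00\x08" + bytes(17), ("198.51.100.1", 6881)),   # 20 bits in common
    (b"\x04" + bytes(19), ("198.51.100.2", 6881)),           # 5 bits: too far
    (bytes(5) + b"\x80" + bytes(14), ("198.51.100.3", 6881)), # 40 bits: abnormally close
]
kept = plausible_closest_nodes(nodes, infohash)
```

Only the first node is kept in this example; the second is too far to be a closest node and the third is closer than the observed distribution makes plausible.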

Optional:

check whether the closest nodes are listed as spam sources or as infected/exploited hosts using DNSBL, and ignore them if this is the case.
check whether the peers returned by the closest nodes have been seen in an abnormal number of torrents using a service such as BTindex, and ignore them if this is the case.
refuse connections from any peer that is not part of the initially selected swarm, to prevent a level 4 spy not participating in the swarm from connecting to a peer detected by a level 2 spy to check that it is downloading the torrent
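
For the DNSBL check, the usual convention is to reverse the octets of the IPv4 address, append the zone of the list, and resolve the resulting name: a listed address resolves to an address in 127.0.0.0/8, an unlisted one does not resolve. A sketch, where the zone name is a placeholder and the function names are assumptions:

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Name to resolve to look up an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip: str, zone: str) -> bool:
    """True if the DNSBL `zone` lists `ip` (performs a DNS lookup)."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.")

# "dnsbl.example.org" is a placeholder zone, not a real list.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
```

A client would call `is_listed` for each closest node with the zone of the list it trusts and ignore the nodes for which it returns True.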

Some of these concepts are used in the Peersm project [13], an anonymous "JS Tor" bittorrent inside browsers, which in addition is designed for streaming and forbids freeriding.
Toward a new bittorrent client?

...undisclosed...
The idea would be to develop a new open source anti-spy bittorrent client following the method defined in this study (but without the freerider option as the default, since it is a deviant behavior), which would protect the privacy of the users much better.
Maybe it is naive thinking, but this client could allow the users to pay for the content if they are willing to, ...undisclosed...
Does our method work and does it disturb the DHT?

For the second question, we do not see how it could: the sybils we are using are all ephemeral and constantly renewed, so they will not be kept in the routing tables of the other peers; they are passive and do nothing other than crawl the DHT.
Regarding the first question, we were contacted by some people who used to get DMCA notices in the US at least once a month. Using our dynamic blocklist alone, they have not received any for months, which suggests that the blocklist alone is enough to protect them. Adding to it the methods described in this study and implemented in torrent-live would therefore make the probability of encountering a monitor almost null.
References

[1] A. Vitte - torrent-live - making torrents more private and live streaming inside browsers, github repository
[2] G. Siganos, J. M. Pujol, P. Rodriguez - Monitoring the Bittorrent Monitors: A bird’s eye view
[3] Tom Chothia, Marco Cova, Chris Novakovic, and Camilo Gonzalez Toro - The Unbearable Lightness of Monitoring: Direct Monitoring in BitTorrent, School of Computer Science, University of Birmingham, UK, September 2012
[4] p2p-hackers How do BitTorrent block lists get created?
[5] The Pirate Bay - http://www.thepiratebay.se
[6] The Tor project - http://www.torproject.org
[7] A. Norberg http://libtorrent.org/dht_sec.html BitTorrent DHT security extension and http://www.bittorrent.org/beps/bep_0042.html
[8] http://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xhtml IANA IPv4 Address Space Registry
[9] S. Rouibia B. Casalta http://www.google.com/patents/US20120166541 Systems and methods for collecting information over a peer to peer network
[10] iblocklist - https://www.iblocklist.com/
[11] Torrentfreak https://torrentfreak.com/btindex-exposes-ip-addresses-of-bittorrent-users-140807/
[12] BTindex - http://www.btindex.org
[13] Peersm - https://github.com/Ayms/node-Tor/blob/master/README.md#anonymous-serverless-p2p-inside-browsers---peersm-specs
[14] Peersm clients - https://github.com/Ayms/node-Tor/tree/master/install
...undisclosed...