@nothingmuch
Last active November 10, 2022 15:06
Notes for a presentation on Bitcoin privacy for the Des Femmes mentorship program

Bitcoin Privacy Introduction

Why privacy matters in Bitcoin

Celsius’s recent court filings caused a terrible and arguably avoidable loss of privacy for many of its customers, and will likely have chilling ripple effects for a long time to come. The goal of this presentation-turned-writeup is to give some context so that the reader can understand the consequences of such incidents, where to find more detailed information in order to mitigate such risks, and the tradeoffs that entails.

Individual perspective

Privacy is a requirement for individual agency. Financial privacy in particular is important in that in its absence one is much more vulnerable to being taken advantage of, manipulated or controlled.

In Bitcoin, specifically as an emerging internet based technology, there are novel ways to attack privacy and to weaponize the breaking of privacy, and unlike ephemeral interactions in the real world, a record of all exchanges will be preserved indefinitely. The more people are burdened with mitigating, managing or recovering from exposure to these risks the less useful Bitcoin will be for them.

Social perspective

Even though society is made up of individuals and privacy is something that directly affects individuals, there are good reasons to protect privacy that arise purely from thinking about the system as a whole.

Fungibility

Abstractly, fungibility means that the individual units of something are interchangeable. Applied to currency, it implies that every unit of a currency is as good as any other.

Applied descriptively, Bitcoin trivially fails at this simplistic notion of fungibility because individual coins are uniquely identified on the blockchain. Cash does too, because of serial numbers. But in the real world fungibility is arguably more about people’s attitudes, so a more nuanced approach might be examining whether or not units in circulation are interchangeable in practice, or whether something (e.g. censorship) is creating a discount on some coins.

From a prescriptive point of view, fungibility implies that unlike possessions, stolen money which has changed hands should not be returned to the victim even if the thief is apprehended and the money can be traced. The justification for this is that if merchants in an economy were burdened with verifying the provenance of every unit of currency, transaction costs would skyrocket because of this added friction. Alternatively, they would need to accept a lot of risk unless they restrict their customer base. In order to improve the efficiency of an economy a currency must be (at least reasonably) fungible.

Game theory suggests that this is reasonable: the introduction of a transferable utility greatly increases the ease with which desirable outcomes can be obtained by all participants by allowing payoffs to be redistributed, but a non-fungible currency would not satisfy that definition.

Censorship Resistance

Suppose Bitcoin were perfectly anonymous, and therefore afforded the maximal degree of privacy possible. Regardless of how you define “perfect”, the ability to enforce any restrictions on fungibility would be severely limited in this scenario, for precisely the same reason that it makes transaction censorship impractical.

Conversely, if privacy is completely absent we can assume that censoring transactions for any reason will be made easier, and consequently we should expect more deviations in the value of circulating units.

https://blockfi.com/prohibited-uses/

How can privacy be measured?

Privacy implies not standing out in a crowd, which requires a crowd.

The academic literature often considers systems where messages are exchanged between participants, and an adversary can observe or sometimes interfere with (some of) these messages.

The goal of the adversary is to correctly guess what users did in the system, for example to decide who the sender and recipient of a message were. The adversary’s ability to guess correctly can be analyzed in terms of the information it has about the actions of all participants of the system.

Anonymity Sets

The anonymity sets associated with a message provide an upper bound on the set of possible senders or receivers. If we make assumptions about the adversary’s capabilities, we can infer bounds on what this set might be (namely a lower bound on the upper bound).

If the adversary’s best strategy is to simply pick an element of the set at random, in other words there is no information other than the anonymity set, this is described by the $k$-anonymity model, where $k$ is the size of the set. There are also ways of dealing with non-uniform probability distributions over the set using more sophisticated models.
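One common way to handle the non-uniform case is to measure the Shannon entropy of the adversary's probability distribution and convert it back into an "effective" set size. The sketch below is only an illustration of that idea; the function name and the example distributions are made up for this writeup.

```python
import math

def effective_anonymity_set(probabilities):
    """Return 2**H, where H is the Shannon entropy (in bits) of the distribution.

    For a uniform distribution over k candidates this equals k, matching the
    k-anonymity model. For a skewed distribution it is smaller than k.
    """
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

# Eight candidates, all equally likely from the adversary's point of view.
uniform = [1 / 8] * 8
# Eight candidates, but side information makes one of them far more likely.
skewed = [0.65] + [0.05] * 7

uniform_size = effective_anonymity_set(uniform)
skewed_size = effective_anonymity_set(skewed)
print(f"uniform: {uniform_size:.2f}, skewed: {skewed_size:.2f}")
```

Both sets nominally contain eight members, but in the skewed case the adversary's uncertainty is equivalent to a uniform choice among fewer than four.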

Quasi-identifiers

Quasi-identifiers are a concept related to anonymity and explained in the $k$-anonymity introduction linked above. In contrast with unique identifiers, quasi-identifiers are not as specific: the value of a quasi-identifier attribute might be shared between multiple users, but each user might still be uniquely identified by the combination of such attributes.

IP addresses, temporal patterns, transaction fingerprints etc. are all examples of quasi-identifiers, and are described in a little more detail below.
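To make the combination effect concrete, here is a small sketch with entirely hypothetical users and attributes (the names, time zones, wallet labels and fee styles are invented). Each attribute alone leaves every user in a group of two, but two attributes together single everyone out.

```python
from collections import Counter

# Hypothetical observations an adversary might collect about four users.
users = [
    {"name": "alice", "timezone": "UTC+1", "wallet": "A", "fee_style": "round"},
    {"name": "bob",   "timezone": "UTC+1", "wallet": "B", "fee_style": "exact"},
    {"name": "carol", "timezone": "UTC-5", "wallet": "A", "fee_style": "exact"},
    {"name": "dave",  "timezone": "UTC-5", "wallet": "B", "fee_style": "round"},
]

def anonymity_set_sizes(records, attributes):
    """Map each user's name to the number of users sharing the same values
    for the given attributes (the size of that user's anonymity set)."""
    def key(record):
        return tuple(record[a] for a in attributes)
    counts = Counter(key(r) for r in records)
    return {r["name"]: counts[key(r)] for r in records}

by_timezone = anonymity_set_sizes(users, ["timezone"])
by_wallet = anonymity_set_sizes(users, ["wallet"])
combined = anonymity_set_sizes(users, ["timezone", "wallet"])
print(by_timezone, by_wallet, combined)
```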

Usability

An interesting perspective on privacy in practice was presented in a paper entitled Anonymity Loves Company: usability, and in particular reducing unnecessary choices for users, can make a big difference for anonymity, for two main reasons.

The first is that better usability lowers the barriers to entry. The more users are able to use a system the more impactful any privacy enhancing technology integrated into the system can be. This is crucial because if only vulnerable people turn to privacy technologies they will be stigmatized and far less effective.

The second is that too many knobs can result in inadvertent introduction of privacy leaks if that results in a detectable fingerprint. In this regard, the most vulnerable users might be practically better off with theoretically weaker guarantees, but blending into a larger crowd.

How can privacy be protected?

The universe believes in encryption

This quote from Julian Assange famously made the observation that computationally it seems that it’s easier to defend than to attack in cryptography. He later continues:

Strong cryptography can resist an unlimited application of violence. No amount of coercive force will ever solve a math problem.

Indeed, most privacy enhancing systems rely on encryption and zero knowledge proofs so that a diverse group of mutually suspicious strangers can cooperate within that system and be reasonably assured that unless everyone else is a spy privacy was indeed realized by blending into a crowd.

Phil Rogaway has expounded on the nature of power imbalances, and on how cryptography works to redistribute power in the manner that Assange describes. Like Assange, Rogaway makes very explicit the moral imperative to repair power imbalance with cryptographic work.

Encryption allows the contents of messages to be hidden, and has naturally been used to build things like onion routing (Tor is described below), mixnets, etc. with varying degrees of sophistication through different encryption schemes. Very broadly, we have symmetric cryptography, public key cryptography, and partially or even fully homomorphic encryption. Partially homomorphic encryption has the property that C(a) + C(b) = C(a + b), where C denotes encrypting or committing to a value, and this property has been widely exploited for blind signatures and homomorphic values in blockchains supporting confidential transactions.
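The homomorphic property can be demonstrated with a toy Pedersen-style commitment. This is a sketch under loud assumptions: the modulus and generators below are chosen for readability and are not secure, whereas deployed systems use elliptic curve groups with carefully derived generators. The group operation written as + in the text is multiplication modulo P here.

```python
# Toy parameters, NOT secure: real systems use elliptic curve groups and
# derive H so that nobody knows its discrete logarithm relative to G.
P = 2**127 - 1  # a Mersenne prime, used here only as a convenient modulus
G = 3
H = 7

def commit(value, blinding):
    """Pedersen-style commitment C = G^value * H^blinding mod P."""
    return (pow(G, value, P) * pow(H, blinding, P)) % P

def combine(c1, c2):
    """The group operation on commitments (written as + in the text)."""
    return (c1 * c2) % P

a, r_a = 5, 1234567
b, r_b = 7, 7654321

# Combining two commitments gives a commitment to the sum of the values
# (with the blinding factors summed as well).
lhs = combine(commit(a, r_a), commit(b, r_b))
rhs = commit(a + b, r_a + r_b)
print(lhs == rhs)
```

This is what lets a verifier check that hidden input amounts and hidden output amounts balance without learning any of them.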

Zero knowledge proofs allow us to prove statements about information that might be encrypted or only committed to, without directly revealing it. Again in the context of confidential transactions, part of the authorization in a transaction is proving that the hidden amounts are still constrained by the protocol rules.

Is Bitcoin private?

Aspects of Bitcoin privacy can be broadly split into the network protocol, which subsumes the blockchain data, and external information. The external information breaks down into metadata pertaining to the protocol (temporal leaks, network level identifiers, etc) and things like PII (KYC information on exchanges, information given to counterparties, physical proximity, device identifiers, …) linked to payments or services.

As a general rule, attacks on privacy compose non-linearly. For example a history intersection only needs a logarithmic number of intersections to pin down a single element in a set, and even a few intersections can dramatically reduce the size of an anonymity set, amplifying other attacks. In a domain where the blockchain and its associated leaks only grow in time, adversaries can only get stronger, often compounding.
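The logarithmic behaviour of intersection attacks can be seen in a small simulation. This is a sketch with made-up numbers: a population of 10,000 users, and an adversary who repeatedly observes a crowd of 1,000 that is random except that it always contains the target.

```python
import random

def intersection_attack(population, target, crowd_size, rng):
    """Intersect observed crowds (each containing the target) until a single
    candidate remains. Returns the number of observations and the size of the
    candidate set after each one."""
    others = [u for u in population if u != target]
    candidates = set(population)
    sizes = [len(candidates)]
    rounds = 0
    while len(candidates) > 1:
        crowd = set(rng.sample(others, crowd_size - 1)) | {target}
        candidates &= crowd
        rounds += 1
        sizes.append(len(candidates))
    return rounds, sizes

rng = random.Random(2022)  # fixed seed so the run is repeatable
population = list(range(10_000))
rounds, sizes = intersection_attack(population, target=42, crowd_size=1_000, rng=rng)
print(rounds, sizes)
```

Because each observation keeps roughly one candidate in ten, the set shrinks by about an order of magnitude per round, so only a handful of observations are needed even though every individual crowd looks large.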

Diversity of Nodes and Wallets on Bitcoin Network

The main takeaway for this section is that Bitcoin Core, specialized clients like lightning nodes, and light clients (especially mobile light clients) all vary greatly in their inherent privacy leaks, and in the degree to which they let users control or mitigate those leaks.

Bitcoin Core

When a Bitcoin Core node starts for the first time it synchronizes with the blockchain by downloading and verifying all blocks from genesis. Already configured wallets’ pre-existing transactions will be detected, and additional wallets can be found by rescanning (if the node is not pruned). In either case there is no (known) pattern of network activity which would allow peers on the network to detect which historical transactions were saved in the wallet(s) of a full node.

When new wallet transactions are received or sent, Bitcoin Core’s rebroadcasting behavior may be an issue in some threat models. The wallet will by default rebroadcast transactions it cares about. To avoid this behavior, wallet broadcasting can be disabled with -walletbroadcast=0, with alternative tools used to broadcast transactions (see below).

Light Clients

In contrast to full nodes, light clients rely on external services in order to avoid processing the entire blockchain.

Electrum based

Electrum protocol based wallets connect to a server and query it with hashes of output scripts (roughly like individual addresses), and the server responds with the relevant transactions.
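As a sketch of what such a query identifier looks like: in the Electrum protocol a "script hash" is the SHA-256 digest of the output script, byte-reversed and hex encoded. The 20-byte public key hash below is just an example value, and the function name is made up for illustration.

```python
import hashlib

def electrum_scripthash(script_pubkey_hex):
    """SHA-256 of the raw output script, byte-reversed and hex encoded."""
    digest = hashlib.sha256(bytes.fromhex(script_pubkey_hex)).digest()
    return digest[::-1].hex()

# P2PKH output script:
#   OP_DUP OP_HASH160 <20-byte pubkey hash> OP_EQUALVERIFY OP_CHECKSIG
example_pubkey_hash = "62e907b15cbf27d5425399ebf6f0fb50ebb88f18"  # example value
script = "76a914" + example_pubkey_hash + "88ac"

scripthash = electrum_scripthash(script)
print(scripthash)
```

The hash is deterministic and the server indexes every output script on the chain, so it offers no privacy: the server can trivially map each queried hash back to an address.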

This reveals to the server sets of linked addresses, confirming precise wallet clusters (sets of transactions related to each other), and the leak continues so long as additional addresses are checked, which depends on the gap limit. These addresses are monitored by the server and the client is notified of new unconfirmed transactions when the server learns of them.

A common usage pattern for improved privacy is to use Bitcoin Core with electrum personal server or bwt, together with electrum or some other wallet supporting the electrum protocol. Operating this way brings the network level privacy of a light client up to that of a full node, without needing to operate a full electrum server that maintains an index of all transactions. For users with more storage capacity, many node-in-a-box solutions also support a fully fledged electrum server, which can also be used to power a private instance of a blockchain explorer in order to avoid leaking information by searching or browsing public ones.

A common method of doing this securely is using a Tor hidden service or some kind of overlay VPN (e.g. wireguard/tailscale/headscale, onioncat, GNUnet, Yggdrasil, ZeroTier, etc.) to connect a phone or other devices to an electrum server over an end to end encrypted connection, even if the service is behind a NAT. Note that headscale is an unofficial self hosted version of tailscale, and not all ZeroTier clients are fully open source.

BIP 157-8 based

Wallets based on BIP 158 block filters fare somewhat better from a privacy standpoint. Instead of downloading all blocks, these wallets download block filters, which are significantly smaller. Once downloaded, wallets can check each filter to find out whether there’s a good chance an output of theirs was created or spent in the corresponding block. Assuming the filters make no omissions, only blocks of interest need to be downloaded.

Depending on the number of nodes controlled by a local covert adversary, or on whether the threat model includes a global passive adversary, the pattern of block downloading that the wallet performs may or may not be private. But because no specific information is shared with the network, it is categorically more private than using a server.
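The client side check can be sketched as follows. This is a simplified stand-in and not the BIP 158 encoding: real filters are Golomb-Rice coded and use SipHash keyed by the block hash with a multiply-and-shift range reduction, whereas this toy keeps a plain Python set and uses SHA-256 with a modulo. Only the parameter M is taken from the BIP; all function names are hypothetical.

```python
import hashlib

M = 784931  # false positive parameter used by BIP 158 basic filters

def hash_to_range(item: bytes, key: bytes, modulus: int) -> int:
    # Assumption: SHA-256 and a modulo stand in for the keyed SipHash and
    # multiply-and-shift reduction that BIP 158 actually specifies.
    digest = hashlib.sha256(key + item).digest()
    return int.from_bytes(digest[:8], "big") % modulus

def build_filter(block_scripts, key):
    """Server side: map every script in the block into the range [0, N*M)."""
    modulus = max(1, len(block_scripts) * M)
    return {hash_to_range(s, key, modulus) for s in block_scripts}, modulus

def may_contain(filter_set, modulus, wallet_scripts, key):
    """Client side: True if any wallet script might appear in the block.
    False positives are possible (about 1 in M per script), false negatives are not."""
    return any(hash_to_range(s, key, modulus) in filter_set for s in wallet_scripts)

block_key = b"example-block-hash"
block_scripts = [f"script-{i}".encode() for i in range(100)]
block_filter, modulus = build_filter(block_scripts, block_key)

wallet_scripts = [b"my-unrelated-script", b"script-17"]
needs_download = may_contain(block_filter, modulus, wallet_scripts, block_key)
print(needs_download)
```

The important point is that the matching happens locally: the peer serving the filter never learns which scripts the wallet tested against it.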

Filters are only available for blocks so detecting unconfirmed funds still relies on transaction gossip.

Transaction Fingerprinting

At the protocol level, transaction data may contain patterns observable with varying degree of certainty.

The types of inputs and outputs are clearly observable since the scripts are different, and this is one of the most overt fingerprints.

Within the signature script or witness data, different clients may also produce signatures differently, for example some clients grind different nonces to produce shorter signatures which reduces their size, raising their effective feerate for a modest computational cost. However, even clients that don’t do so will occasionally produce a shorter signature by chance occurrence, so only an observation of a long signature is strong evidence of a wallet not performing this optimization.

The transaction nVersion and nLocktime fields, as well as individual inputs’ nSequence fields, all contain different patterns based on whether or not the wallet supports or uses different features. For example, Core will always put in an nLocktime value whereas many clients leave it at 0. Some wallets opt in to BIP-125 replace by fee by default, for others it must be enabled, and others still do not support it, producing different nSequence values. Use of relative lock times implicitly opts in to BIP-125 RBF.
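These fields can be read off mechanically. The following sketch applies the thresholds from BIP 125 (an input signals replaceability when its nSequence is below 0xfffffffe) and BIP 68 (relative lock times apply for version 2 or higher when bit 31 of nSequence is clear); the function and key names are made up for illustration.

```python
SEQUENCE_FINAL = 0xFFFFFFFF
MAX_BIP125_RBF_SEQUENCE = 0xFFFFFFFD
SEQUENCE_LOCKTIME_DISABLE_FLAG = 1 << 31

def describe_transaction(version, locktime, sequences):
    """Summarize what the version, nLocktime and nSequence values reveal."""
    return {
        # BIP 125: any input with nSequence < 0xfffffffe signals opt-in RBF.
        "signals_rbf": any(s <= MAX_BIP125_RBF_SEQUENCE for s in sequences),
        # BIP 68: relative lock time applies when bit 31 is clear and version >= 2.
        "uses_relative_locktime": version >= 2
        and any(not (s & SEQUENCE_LOCKTIME_DISABLE_FLAG) for s in sequences),
        "sets_locktime": locktime != 0,
        # nLocktime is only enforced if at least one input is non-final.
        "locktime_enforced": locktime != 0
        and any(s != SEQUENCE_FINAL for s in sequences),
    }

rbf_with_locktime = describe_transaction(2, 760_000, [0xFFFFFFFD])
plain = describe_transaction(1, 0, [0xFFFFFFFF])
relative_lock = describe_transaction(2, 0, [144])
print(rbf_with_locktime, plain, relative_lock)
```

The last example shows the implicit opt-in mentioned above: an nSequence of 144 encodes a relative lock time, and since it is far below the BIP 125 threshold it also signals replaceability.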

If transactions include or exclude SegWit inputs, this may indicate a requirement for txid stability. Lightning nodes and other layer 2 tech generally need to base their off chain transactions on SegWit only transactions so that the pre-signed offchain transactions spending a known txid can be prepared and signed before signing the funding transaction.

Finally, ordering of inputs/outputs by amount, type, lexicographically, or otherwise may also provide various clues, like whether the producing wallet uses BIP 69, sorting or shuffling some other way, or simply leaving the inputs and outputs in the order they were created, in which case payment outputs can be heuristically detected.
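As an illustration of the ordering clue, the check below tests whether a transaction is consistent with BIP 69, which sorts inputs by previous txid and then output index, and outputs by amount and then script. It is a sketch that assumes txids are given in the byte order usually displayed by block explorers; the data is invented.

```python
def is_bip69_sorted(inputs, outputs):
    """inputs: list of (txid_hex, vout); outputs: list of (amount_sats, script_hex).

    Returns True if both lists are already in BIP 69 order.
    """
    input_keys = [(bytes.fromhex(txid), vout) for txid, vout in inputs]
    output_keys = [(amount, bytes.fromhex(script)) for amount, script in outputs]
    return input_keys == sorted(input_keys) and output_keys == sorted(output_keys)

txid_a = "00" * 31 + "0a"
txid_b = "00" * 31 + "0b"

sorted_tx = is_bip69_sorted(
    inputs=[(txid_a, 1), (txid_b, 0)],
    outputs=[(10_000, "0014" + "11" * 20), (25_000, "0014" + "22" * 20)],
)
unsorted_tx = is_bip69_sorted(
    inputs=[(txid_b, 0), (txid_a, 1)],
    outputs=[(25_000, "0014" + "22" * 20), (10_000, "0014" + "11" * 20)],
)
print(sorted_tx, unsorted_tx)
```

A single match is weak evidence, since a transaction with few inputs and outputs will often be in sorted order by chance; the signal comes from observing the pattern consistently across many transactions of the same wallet.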

https://github.com/achow101/wallet-fingerprinting

https://b10c.me/observations/

Network Layer & Transaction Broadcast

In terms of how the transaction graph interacts with the network, the main considerations from a user privacy point of view are initial broadcast, and rebroadcast behaviors if any.

Broadcast

If a node is connected over TCP to its peers and broadcasts to all of them then it is relatively easy for an adversary to narrow down the set of nodes from which the transaction originated to the point of concern, since IP information is potentially linkable to real world identities. Indeed, block explorer services have presented geolocation data for transactions’ presumed origins for many years.

Since the Bitcoin protocol has no authentication or encryption (but see also BIP 324), broadcast potentially reveals the transaction details to the user’s ISP (or VPN, or hosting provider) or to nation level adversaries (“global passive adversary”).

The main body of work concerned with improving the privacy of regular broadcast is Dandelion and Dandelion++, which modify the pattern of broadcast so as to obscure the origins by having an initial phase with low fan-out. This has not been deployed on the network.

Rebroadcast (Core specific)

Amiti Uttarwar’s work on improving transaction rebroadcast greatly reduced Core’s attack surface, and she has also spoken about and documented the problem extensively. A particularly interesting nuance is dust attacks, and how they relate not only to input selection and wallet clustering, but could also be used to attack privacy via the rebroadcasting logic.

As a more drastic measure, the walletbroadcast=0 configuration option can also be used to prevent wallet rebroadcasting entirely.

Tor based broadcast

When connecting to a service over Tor, the tor daemon will build so called circuits. It does this by connecting to a guard relay, through it to an intermediate relay, and finally through that in turn to a third relay. The client encrypts the relayed packets in layers, so that the guard node can’t see the encrypted payload intended for the third node, for example.
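The layering can be illustrated with a toy model. This is not Tor's actual cryptography: the cipher below is a simple XOR with a SHAKE-256 keystream, the keys are fixed strings rather than the result of a per-hop key exchange, and there is no integrity protection. It only shows that each relay can remove exactly one layer.

```python
import hashlib

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR with a SHAKE-256 keystream.
    Applying it twice with the same key recovers the input."""
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))

def wrap(payload: bytes, hop_keys):
    """Client side: add one layer per relay, innermost layer for the last relay."""
    for key in reversed(hop_keys):
        payload = xor_stream(key, payload)
    return payload

# Hypothetical per-hop keys; in Tor these come from a handshake with each relay.
hop_keys = [b"guard-key", b"middle-key", b"exit-key"]
message = b"tx broadcast"

cell = wrap(message, hop_keys)
after_guard = xor_stream(hop_keys[0], cell)           # guard peels its layer
after_middle = xor_stream(hop_keys[1], after_guard)   # middle relay peels the next
after_exit = xor_stream(hop_keys[2], after_middle)    # last relay recovers the payload
print(after_exit)
```

The guard knows who the client is but only ever sees data still wrapped in two layers, while the last relay sees the payload but only knows it came from the middle relay.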

Circuits are used to create private connections called streams, similar to TCP connections. When connecting to the regular internet (“clearnet”) the last node on the circuit serves as an exit node, and makes regular TCP connections on behalf of the client.

Considering again the lack of authentication or encryption, exit nodes may present more of a risk than hidden services, but assuming all peers are adversarial this doesn’t make much of a difference.

When connecting to a Tor hidden service, multiple circuits are required. First, the hidden service directory must be queried for information about the hidden service’s introducer nodes. The hidden service maintains circuits to those nodes, where clients can submit rendezvous requests. To actually connect, both the client and the server build circuits that are joined at a rendezvous relay node, providing mutual privacy. If the service doesn’t need to actually be hidden it can be configured to talk directly to the rendezvous node.

As long as at least one relay node on the circuit is not compromised, correlating the traffic between the client and the final destination requires the capabilities of a global passive adversary and traffic analysis (Tor is not designed to protect against such an adversary). When connecting to peers on the Bitcoin network this can provide strong guarantees against the linkability of specific messages, e.g. block download requests or transaction broadcasts.

An important detail to consider when a wallet uses Tor is whether or not its connections are isolated to different circuits. Without isolation, different activities may be correlated more easily.

A simple way of broadcasting a transaction from a browser is to use a web based form, such as the one on Blockstream’s onion service, with Tor Browser.

Privately broadcasting transactions directly to the Bitcoin network usually involves connecting to a random peer (onion service or through an exit node) with an isolated circuit, relaying the transaction and disconnecting. bitcoin-submittx provides this as standalone functionality, and some wallets such as JoinMarket and Wasabi do this for their broadcasting.

Transaction Structure

Considering the transaction graph proper, two main heuristics form the basis of a lot of the studies of the Bitcoin blockchain (e.g. RS11, MPJ+13). Both heuristics tacitly assume that transactions correspond to payments or transfers between different entities.

The first is the common input ownership heuristic, which says that when there are multiple inputs to a transaction, they are controlled by the same user.
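A minimal sketch of how this heuristic turns into wallet clusters, using a union-find structure over input addresses. The transactions and address labels are invented, and real analyses work on outputs and scripts rather than short strings.

```python
def cluster_by_common_inputs(transactions):
    """transactions: iterable of lists of input addresses.
    Returns a list of clusters (sets of addresses assumed to share an owner)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for inputs in transactions:
        for address in inputs:
            find(address)  # register addresses that only ever appear alone
        for other in inputs[1:]:
            union(inputs[0], other)  # all inputs of one transaction are merged

    clusters = {}
    for address in list(parent):
        clusters.setdefault(find(address), set()).add(address)
    return list(clusters.values())

transactions = [["a", "b"], ["b", "c"], ["d"], ["e", "f"]]
clusters = cluster_by_common_inputs(transactions)
print(sorted(sorted(c) for c in clusters))
```

Note how "a" and "c" end up in the same cluster without ever appearing together: the heuristic is transitive, which is why a single careless consolidation can link otherwise separate histories. It is also why transactions with inputs from several users produce false merges under this heuristic.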

The second has many variations, but the general idea is to heuristically distinguish payment outputs from change outputs, whether by script type, analyzing the amount, ordering, or some other means.

These heuristics trace back to the Bitcoin paper, and multiple papers have developed and applied them. Two works in particular are more empirical in nature and worth noting. First, in the “A Fistful of Bitcoins” paper the authors sent money through a mixer, and reported on what they observed. Second, Jonas Nick’s thesis is notable not only for the BIP 37 privacy leaks he discovered, but also for the use of this leak to validate the effectiveness of the two heuristics. His work shows that, at least for the transaction footprint of 2015 era light clients, the heuristics were very powerful, with the vast majority of clusters correctly identified (80% recall rate for public key clustering).

With the prior knowledge of the naive payment structure and the heuristics that are based on it, we can think of privacy tech in bitcoin as breaking down into overt and covert ways of introducing ambiguity into individual transactions as well as the graph structure between transactions.

For example, payjoin transactions are deliberately constructed to appear to be naive payments, but suggest counterfactual conclusions about wallet clustering as the most natural interpretation. CoinJoin transactions introduce ambiguity and do so overtly, which makes them more censorable and subject to taint analysis. CoinSwaps on the other hand are disjoint swaps of histories on the transaction graph and as such have much larger anonymity sets, and technically better privacy but in practice may have more variable risks associated with them (receiving coins which are tainted possibly in some unknown way so the risk is hard to account for).

Finally with lightning, the incentives for scaling and privacy are more or less aligned, so its on chain footprint is more naturally resistant to analysis of individual payments or flows between parties (but some of that information can be recovered through other layers).

Lightning

This subject is too broad to be in scope for an introduction, but a good and detailed overview is provided in lnbook’s chapter on privacy.

Privacy Wallets

Similarly, specific privacy techniques are too detailed a subject. Several have been proposed and implemented. The history of CoinJoins is richest, with multiple deployed implementations and a number of studies (especially those by Möser et al.).

Specifically with regards to CoinJoin transactions, Maurer et al. propose the most comprehensive framework for analyzing privacy with arbitrary amounts (as opposed to CoinJoins with $k$ outputs of an identical value and script type), but even it needs some modification to be applied to real world protocols. LaurentMT’s link probability matrix is essentially the same and predates it, but is slightly less precise.

Although teleport-tx is the only CoinSwap implementation that appears to be nearing mainnet, I think it is well worth studying, especially because CoinSwap and lightning payments rely on many of the same concepts. Comparing CoinSwap, and in particular the teleport-tx approach, with routing etc. is an interesting stepping stone to contrast with LN privacy discussions.

Additional material

In the next few days I will try to summarize and expand on some of the CoinJoin related stuff I brought up in the Q&A:

  • SCRIPT
  • CoinJoin
  • overt/covert/disjoint on chain footprints
  • interactive-tx proposal
  • moon math based blockchain privacy (unlikely in Bitcoin itself)
  • … things I’m working on?

as well as more details on lightning privacy.
