nothingmuch/Bitcoin Privacy Intro.org

## Bitcoin Privacy Intro.org

      
    Bitcoin Privacy Introduction

Why privacy matters in Bitcoin

Celsius’s recent court filings caused a terrible and arguably avoidable loss of
  privacy for many of its customers, and will likely have chilling ripple effects
  for a long time to come. The goal of this presentation-turned-writeup is to give
  some context so that the reader can understand the consequences of such
  incidents, where to find more detailed information in order to mitigate such
  risks, and the tradeoffs that entails.
Individual perspective

Privacy is a requirement for individual agency. Financial privacy in particular
  is important in that in its absence one is much more vulnerable to being taken
  advantage of, manipulated or controlled.
In Bitcoin, specifically as an emerging internet based technology, there are
  novel ways to attack privacy and to weaponize the breaking of privacy, and
  unlike ephemeral interactions in the real world, a record of all exchanges will
  be preserved indefinitely. The more people are burdened with mitigating,
  managing or recovering from exposure to these risks the less useful Bitcoin will
  be for them.
Social perspective

Even though society is made up of individuals and privacy is something that
  directly affects individuals, there are good reasons to protect privacy that
  arise purely from thinking about the system as a whole.
Fungibility

Abstractly, fungibility means that the individual units of something are
  interchangeable. Applied to currency, it implies that every unit of a currency
  is as good as any other.
Applied descriptively, Bitcoin trivially fails at this simplistic notion of
  fungibility because individual coins are uniquely identified on the blockchain.
  Cash does too, because of serial numbers. But in the real world fungibility is
  arguably more about people’s attitudes, so a more nuanced approach might be
  examining whether or not units in circulation are interchangeable in
  practice, or whether something (e.g. censorship) is creating a discount on some coins.
From a prescriptive point of view, fungibility implies that unlike possessions,
  stolen money which has exchanged hands should not be returned to the victim even
  if the thief is apprehended and the money can be traced. The justification for
  this is that if it had to be returned, merchants in an economy would be burdened
  with verifying the provenance of every unit of currency, and transaction costs
  would skyrocket because of this added friction. Alternatively, they would need to
  accept a lot of risk unless they restricted their customer base. In order to
  improve the efficiency of an economy a currency must be (at least reasonably)
  fungible.
Game theory suggests that this is reasonable: the introduction of a transferable
  utility greatly increases the ease with which desirable outcomes can be obtained
  by all participants by allowing payoffs to be redistributed, but a non-fungible
  currency would not satisfy that definition.
Censorship Resistance

Suppose Bitcoin were perfectly anonymous, and therefore afforded the maximal
  degree of privacy possible. Regardless of how you define “perfect”, the ability
  to enforce any restrictions on fungibility would be severely limited in this
  scenario, for precisely the same reason that it makes transaction censorship
  impractical.
Conversely, if privacy is completely absent we can assume that censoring
  transactions for any reason will be made easier, and consequently we should
  expect more deviations in the value of circulating units.
https://blockfi.com/prohibited-uses/
How can privacy be measured?

Privacy implies not standing out in a crowd, which requires a crowd.
The academic literature often considers systems where messages are exchanged
  between participants, and an adversary can observe or sometimes interfere with
  (some of) these messages.
The goal of the adversary is to correctly guess what users did in the system,
  for example to decide who the sender and recipient of a message were. The
  adversary’s ability to guess correctly can be analyzed in terms of the
  information it has about all participants of the system.
Anonymity Sets

The anonymity sets associated with a message provide an upper bound on the set
  of possible senders or receivers. If we make assumptions about the adversary’s
  capabilities, we can infer bounds on what this set might be (namely a
  lower bound on the upper bound).
If the adversary’s best strategy is to simply pick an element of the set at
  random, in other words there is no information other than the anonymity set, this
  is described by the $k$-anonymity model, where $k$ is the size of the set. There
  are also ways of dealing with non-uniform probability distributions over the
  set using more sophisticated models.
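As a rough illustration of the uniform and non-uniform cases, the sketch below computes the adversary’s best-guess success probability and the Shannon entropy of a distribution over an anonymity set. The function names and both example distributions are made up for illustration, and summarizing a non-uniform distribution by its "effective set size" (2 to the power of the entropy) is only one of several possible choices.

```python
import math

def best_guess_probability(probs):
    """Chance that an adversary who picks the most likely candidate is right."""
    return max(probs)

def entropy_bits(probs):
    """Shannon entropy (in bits) of the adversary's distribution over candidates."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def effective_set_size(probs):
    """Size of a uniform anonymity set with the same entropy (2 ** H)."""
    return 2 ** entropy_bits(probs)

# Uniform case: k-anonymity with k = 4, every candidate equally likely.
uniform = [0.25, 0.25, 0.25, 0.25]

# Non-uniform case: still 4 candidates, but one is far more likely than the rest.
skewed = [0.85, 0.05, 0.05, 0.05]

print(effective_set_size(uniform), best_guess_probability(uniform))
print(effective_set_size(skewed), best_guess_probability(skewed))
```

In the skewed case the set still nominally has four members, but its effective size is below two, which is why counting members alone can overstate privacy.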
Quasi-identifiers

Quasi-identifiers are a concept related to anonymity and explained in the
  $k$-anonymity introduction linked above. In contrast with unique identifiers,
  quasi-identifiers are not as specific: the value of a quasi-identifier attribute
  might be shared between multiple users, but each user might still be uniquely
  identified by the combination of such attributes.
IP addresses, temporal patterns, transaction fingerprints etc. are all examples
  of quasi-identifiers that are described in a little more detail below.
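A small sketch of why combinations matter, assuming a hypothetical table of observations (the users, attribute names and values are all invented). For each user it counts how many users share the same values for a chosen subset of attributes.

```python
from collections import Counter

# Hypothetical observations: (user, timezone guess, wallet fingerprint, fee style)
observations = [
    ("alice", "UTC+1", "wallet-A", "round-fee"),
    ("bob",   "UTC+1", "wallet-A", "exact-fee"),
    ("carol", "UTC+1", "wallet-B", "round-fee"),
    ("dave",  "UTC-5", "wallet-A", "round-fee"),
]

def anonymity_set_sizes(rows, columns):
    """For each user, count the users sharing their values in `columns`."""
    def key(row):
        return tuple(row[c] for c in columns)
    counts = Counter(key(row) for row in rows)
    return {row[0]: counts[key(row)] for row in rows}

by_timezone = anonymity_set_sizes(observations, [1])     # a single attribute
by_all = anonymity_set_sizes(observations, [1, 2, 3])    # the combination
print(by_timezone)
print(by_all)
```

In this toy data no single attribute singles out alice, bob or carol, but the combination of all three attributes is unique for every user.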
Usability

An interesting perspective on privacy in practice was presented in a paper
  entitled Anonymity Loves Company: Usability and the Network Effect. Usability,
  and in particular reducing unnecessary choices for users, can make a big
  difference for anonymity, for two main reasons.
The first is that better usability lowers the barriers to entry. The more
  users are able to use a system the more impactful any privacy enhancing
  technology integrated into the system can be. This is crucial because if only
  vulnerable people turn to privacy technologies they will be stigmatized and far
  less effective.
The second is that too many knobs can result in the inadvertent introduction of
  privacy leaks if the chosen settings produce a detectable fingerprint. In this
  regard, the most vulnerable users might be practically better off with
  theoretically weaker guarantees, but blending into a larger crowd.
How can privacy be protected?


  The universe believes in encryption

With this quote Julian Assange famously observed that, computationally, it
  seems to be easier to defend than to attack in
  cryptography. He later continues:

  Strong cryptography can resist an unlimited application of violence. No amount
    of coercive force will ever solve a math problem.

Indeed, most privacy enhancing systems rely on encryption and zero knowledge
  proofs so that a diverse group of mutually suspicious strangers can cooperate
  within that system and be reasonably assured that unless everyone else is a spy
  privacy was indeed realized by blending into a crowd.
Phil Rogaway has expounded on the nature of power imbalances, and on how
  cryptography works to redistribute power in the manner that Assange describes.
  Like Assange, Rogaway makes very explicit the moral imperative to repair power
  imbalances with cryptographic work.
Encryption allows the contents of messages to be hidden, and has naturally been
  used to build things like onion routing (Tor is described below), mixnets, etc.
  with varying degrees of sophistication through different encryption schemes.
  Very broadly, we have symmetric and public key cryptography, and partially or even
  fully homomorphic encryption. Partially homomorphic encryption has the property
  that C(a) + C(b) = C(a + b), where C denotes encrypting or committing to a
  value, and this property has been widely exploited for blind signatures and
  homomorphic values in blockchains supporting confidential transactions.
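The homomorphic property can be sketched with a toy Pedersen-style commitment. This is an illustration under loud assumptions: the modulus and generators are arbitrary toy choices and not secure, and because the group here is written multiplicatively, the "+" in C(a) + C(b) becomes a multiplication of commitments. Blinding factors add up alongside the values.

```python
# Toy parameters, for illustration only (NOT a secure choice).
P = 2**127 - 1   # a Mersenne prime used as the modulus
G, H = 3, 5      # toy generators; a real scheme needs log_G(H) to be unknown

def commit(value, blinding):
    """Commit to `value`, hidden by the random `blinding` factor."""
    return (pow(G, value, P) * pow(H, blinding, P)) % P

a, r_a = 20, 123456789
b, r_b = 22, 987654321

# Combining two commitments yields a commitment to the sum of the values.
combined = (commit(a, r_a) * commit(b, r_b)) % P
print(combined == commit(a + b, r_a + r_b))
```

This is the kind of check that lets a verifier confirm hidden amounts balance without learning them, although real confidential transaction schemes use elliptic curve groups and additional range proofs.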
Zero knowledge proofs allow us to prove statements about information that might
  be encrypted or only committed to, without directly revealing it. Again in the
  context of confidential transactions, part of the authorization in a transaction
  is proving that the hidden amounts are still constrained by the protocol rules.
Is Bitcoin private?

Aspects of Bitcoin privacy can be broadly split into the network protocol, which
  subsumes the blockchain data, and external information. The external information
  breaks down into metadata pertaining to the protocol (temporal leaks, network
  level identifiers, etc) and things like PII (KYC information on exchanges,
  information given to counterparties, physical proximity, device identifiers,
  …) linked to payments or services.
As a general rule, attacks on privacy compose non-linearly. For example a
  history intersection only needs a logarithmic number of intersections to pin
  down a single element in a set, and even a few intersections can dramatically
  reduce the size of an anonymity set, amplifying other attacks. In a domain where
  the blockchain and its associated leaks only grow in time, adversaries can only
  get stronger, often compounding.
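The logarithmic behavior of intersection attacks can be sketched with an idealized example. The assumption here (made up for illustration) is that every observation happens to rule out exactly half of the remaining candidates; real observations are noisier, but the shrinkage is similarly rapid.

```python
def intersect_observations(population_size, observations):
    """Intersect candidate sets, recording the anonymity set size after each step."""
    candidates = set(range(population_size))
    sizes = [len(candidates)]
    for observed in observations:
        candidates &= observed
        sizes.append(len(candidates))
    return candidates, sizes

population = 1024
target = 0

# Idealized observations: observation i contains every user whose i-th bit
# matches the target's, i.e. the target plus half of the remaining candidates.
observations = [
    {u for u in range(population) if (u >> i) & 1 == (target >> i) & 1}
    for i in range(10)
]

candidates, sizes = intersect_observations(population, observations)
print(sizes)
```

Ten observations (log2 of 1024) are enough to isolate the target in this setting, and after only three the anonymity set has already shrunk from 1024 to 128.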
Diversity of Nodes and Wallets on Bitcoin Network

The main takeaway for this section is that Core, specialized clients like
  lightning nodes, and light clients more generally and especially mobile light
  clients all vary greatly in their inherent privacy leaks, and the degree to
  which they let users control or mitigate those leaks.
Bitcoin Core

When a Bitcoin Core node starts for the first time it synchronizes with the
  blockchain by downloading and verifying all blocks from genesis. Already
  configured wallets’ pre-existing transactions will be detected, and additional
  wallets can be found by rescanning (if the node is not pruned). In either case
  there is no (known) pattern of network activity which would allow peers on the
  network to detect which historical transactions were saved in the wallet(s) of a
  full node.
When new wallet transactions are received or sent, Bitcoin Core’s rebroadcasting
  behavior may be an issue in some threat models. The wallet will by default
  rebroadcast transactions it cares about. To avoid this behavior, wallet
  broadcasting can be disabled (-walletbroadcast=0), with alternative tools used
  to broadcast transactions (see below).
Light Clients

In contrast to full nodes, light clients rely on external services in order to
  avoid processing the entire blockchain.
Electrum based

Electrum protocol based wallets connect to a server and query it with
  hashes of output scripts (roughly like individual addresses), and the server
  responds with the relevant transactions.
This reveals to the server sets of linked addresses, confirming precise wallet
  clusters (sets of transactions related to each other), and the leak continues so
  long as additional addresses are checked, which depends on the gap limit. These
  addresses are monitored by the server and the client is notified of new
  unconfirmed transactions when the server learns of them.
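To make the leak concrete, here is a minimal sketch of how such queries are derived. In the Electrum protocol a script hash is the SHA-256 of the output script, byte-reversed and hex encoded, and it is what gets sent in requests such as `blockchain.scripthash.subscribe`. The key hashes below are placeholders standing in for a wallet’s consecutive addresses, not real keys.

```python
import hashlib

def electrum_scripthash(script_pubkey: bytes) -> str:
    """SHA-256 of the output script, byte-reversed and hex encoded."""
    return hashlib.sha256(script_pubkey).digest()[::-1].hex()

def p2pkh_script(pubkey_hash: bytes) -> bytes:
    """OP_DUP OP_HASH160 <20-byte hash> OP_EQUALVERIFY OP_CHECKSIG."""
    if len(pubkey_hash) != 20:
        raise ValueError("expected a 20-byte public key hash")
    return bytes.fromhex("76a914") + pubkey_hash + bytes.fromhex("88ac")

# Placeholder key hashes standing in for consecutive wallet addresses.
wallet_key_hashes = [bytes([i]) * 20 for i in range(3)]

# Everything in this list is sent to (and linkable by) the same server.
queries = [electrum_scripthash(p2pkh_script(h)) for h in wallet_key_hashes]
for q in queries:
    print(q)
```

Hashing does not hide anything here, since the server indexes every script on the blockchain by the same hash; the set of queries arriving over one connection is what reveals the cluster.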
A common usage pattern for improved privacy is to use Bitcoin Core with electrum
  personal server or bwt, and use electrum or some other electrum protocol
  supporting wallet. Operating this way reduces the network level privacy of a
  light client to the privacy of a full node without needing to operate a full
  electrum server that maintains an index of all transactions. For users with more
  storage capacity, many node-in-a-box solutions also support a fully fledged
  electrum server, which can also be used to power a private instance of a
  blockchain explorer in order to avoid leaking information by searching or
  browsing public ones.
A common method of doing this securely is using a Tor hidden service or some
  kind of overlay VPN (e.g. wireguard/tailscale/headscale, onioncat, GNUnet,
  Yggdrasil, ZeroTier, etc.[fn::note that headscale is an unofficial self hosted
  version of tailscale and not all ZeroTier clients are fully open source]) to
  connect a phone or other devices to an electrum server over an end to end
  encrypted connection even if the service is behind a NAT.
BIP 157-8 based

BIP 158 block filter based wallets fare somewhat better from a privacy
  standpoint. Instead of downloading all blocks, these wallets download block
  filters, which are significantly smaller. Once downloaded, wallets can check
  each filter to find out whether there’s a good chance an output of theirs was
  created or spent in the corresponding block. Assuming the filters make no
  omissions, only blocks of interest need to be downloaded.
Depending on the number of nodes controlled by a local covert adversary, or on
  whether the threat model includes a global passive adversary, the pattern of
  block downloading that the wallet performs may or may not be private, but because
  no specific information is shared with the network it is categorically more
  private than using a server.
Filters are only available for blocks so detecting unconfirmed funds still
  relies on transaction gossip.
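The client-side matching idea can be sketched with a deliberately simplified filter. This is not the BIP 158 encoding (which uses SipHash and Golomb-coded sets); the truncated SHA-256 items, the `bits` parameter and all names are assumptions chosen to keep the example short. What it preserves is the important property: matching happens locally, false positives are possible, and false negatives are not.

```python
import hashlib

def _item(block_hash, script, bits):
    """Map a script to a short, block-specific value (collisions are possible)."""
    digest = hashlib.sha256(block_hash + script).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)

def toy_filter(block_hash, scripts, bits=16):
    """Build a toy filter from every output script touched by the block."""
    return {_item(block_hash, s, bits) for s in scripts}

def may_contain(filter_items, block_hash, wallet_scripts, bits=16):
    """True if any wallet script might appear in the block."""
    return any(_item(block_hash, s, bits) in filter_items for s in wallet_scripts)

block_hash = hashlib.sha256(b"hypothetical block").digest()
block_scripts = [b"script-1", b"script-2", b"wallet-script"]
block_filter = toy_filter(block_hash, block_scripts)

# The wallet checks its own scripts locally; nothing about them is sent anywhere.
wallet_scripts = [b"wallet-script", b"another-wallet-script"]
should_download = may_contain(block_filter, block_hash, wallet_scripts)
print(should_download)
```

The only thing peers can observe is which blocks the wallet subsequently requests, which is the residual leak described above.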
Transaction Fingerprinting

At the protocol level, transaction data may contain patterns observable with
  varying degree of certainty.
The types of inputs and outputs are clearly observable since the scripts are
  different, and are one of the most overt fingerprints.
Within the signature script or witness data, different clients may also produce
  signatures differently. For example, some clients grind different nonces to
  produce shorter signatures, which reduces the transaction’s size, raising its
  effective feerate for a modest computational cost. However, even clients that
  don’t do so will occasionally produce a shorter signature by chance, so only
  an observation of a long signature is strong evidence of a wallet not performing
  this optimization.
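The asymmetry of this evidence can be sketched with a simple Bayesian update. The model is an assumption made for illustration: a grinding wallet never produces a long signature, a non-grinding wallet produces a short one by chance about half the time, and the prior is an even split.

```python
def posterior_grinds(num_short, num_long, prior=0.5, p_short_by_chance=0.5):
    """Posterior probability that a wallet grinds, given observed signature sizes."""
    # Simplified model: a grinding wallet never emits a long signature.
    like_grind = 1.0 if num_long == 0 else 0.0
    # A non-grinding wallet emits short signatures only by chance.
    like_plain = (p_short_by_chance ** num_short) * (
        (1 - p_short_by_chance) ** num_long
    )
    evidence = prior * like_grind + (1 - prior) * like_plain
    return prior * like_grind / evidence

print(posterior_grinds(1, 0))    # one short signature: weak evidence
print(posterior_grinds(10, 0))   # ten short signatures: strong evidence
print(posterior_grinds(3, 1))    # a single long signature settles it
```

Under these assumptions a single short signature only moves the estimate from one half to two thirds, while a single long signature rules grinding out entirely.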
The transaction nVersion and nLocktime fields, as well as individual inputs’
  nSequence fields, all contain different patterns based on whether or not the
  wallet supports or uses different features. For example, Core will always put
  in an nLocktime value whereas many clients leave it at 0. Some wallets opt in
  to BIP-125 replace by fee by default, for others it must be enabled, and others
  still do not support it, producing different nSequence values. Use of relative
  lock times implicitly opts in to BIP-125 RBF.
If transactions include or exclude SegWit inputs, this may indicate a
  requirement for txid stability. Lightning nodes and other layer 2 tech generally
  need to base their off chain transactions on SegWit only transactions so that
  the pre-signed offchain transactions spending a known txid can be prepared and
  signed before signing the funding transaction.
Finally, ordering of inputs/outputs by amount, type, lexicographically, or
  otherwise may also provide various clues, like whether the producing wallet uses
  BIP 69, sorts or shuffles some other way, or simply leaves the inputs and
  outputs in the order they were created, in which case payment outputs can be
  heuristically detected.
https://github.com/achow101/wallet-fingerprinting
https://b10c.me/observations/
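A few of the signals described in this section can be sketched as a small feature extractor. `TxSummary` and the feature names are hypothetical simplifications rather than any wallet’s or library’s API, and only the output half of BIP 69 is checked. The BIP 125 rule used is that a transaction signals replaceability when any input has an nSequence below 0xfffffffe.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TxSummary:
    version: int
    locktime: int
    sequences: List[int]               # one nSequence value per input
    outputs: List[Tuple[int, bytes]]   # (amount in satoshis, scriptPubKey)

def signals_rbf(tx: TxSummary) -> bool:
    """BIP 125: any input with nSequence < 0xfffffffe signals replaceability."""
    return any(seq < 0xFFFFFFFE for seq in tx.sequences)

def outputs_bip69_sorted(tx: TxSummary) -> bool:
    """BIP 69 orders outputs by amount, then by scriptPubKey bytes."""
    return tx.outputs == sorted(tx.outputs)

def fingerprint(tx: TxSummary) -> dict:
    """Collect a handful of observable features into one comparable record."""
    return {
        "version": tx.version,
        "nonzero_locktime": tx.locktime != 0,
        "signals_rbf": signals_rbf(tx),
        "outputs_bip69_sorted": outputs_bip69_sorted(tx),
    }

# A made-up transaction summary for illustration.
example = TxSummary(
    version=2,
    locktime=800000,
    sequences=[0xFFFFFFFD],
    outputs=[(50_000, b"\x00\x14" + bytes(20)),
             (120_000, b"\x00\x14" + bytes([1]) * 20)],
)
print(fingerprint(example))
```

Each feature is weak on its own (two outputs are sorted by chance half the time, for instance), which is why they behave like quasi-identifiers: it is the combination across many transactions that narrows down the producing wallet.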
Network Layer & Transaction Broadcast

In terms of how the transaction graph interacts with the network, the main
  considerations from a user privacy point of view are initial broadcast, and
  rebroadcast behaviors if any.
Broadcast

If a node is connected over TCP to its peers and broadcasts to all of them then
  it is relatively easy for an adversary to narrow down the set of nodes from
  which the transaction originated to the point of concern, since IP information
  is potentially linkable to real world identities. Indeed, block explorer
  services have presented geolocation data for transactions’ presumed origins for
  many years.
Since the Bitcoin protocol has no authentication or encryption (but see also BIP
  324), broadcast potentially reveals the transaction details to the user’s ISP
  (or VPN, or hosting provider) or to nation level adversaries (“global passive
  adversary”).
The main body of work concerned with improving the privacy of regular broadcast
  is Dandelion and Dandelion++, which modify the pattern of broadcast so as to
  obscure the origins by having an initial phase with low fan-out. This has not
  been deployed on the network.
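The low fan-out idea can be sketched as a toy simulation. This is a loose simplification rather than the Dandelion or Dandelion++ specification: the ring topology, the per-hop coin flip and the `fluff_probability` parameter are all assumptions. In the stem phase the transaction is forwarded to a single peer at a time; the node where the stem ends is the one that starts ordinary broadcast.

```python
import random

def dandelion_stem(peers, origin, fluff_probability=0.1, rng=None):
    """Return the stem path; its last node is the one that starts the fluff phase."""
    if rng is None:
        rng = random.Random()
    path = [origin]
    current = origin
    while True:
        current = rng.choice(peers[current])   # stem: forward to one peer only
        path.append(current)
        if rng.random() < fluff_probability:   # this relay switches to fluff
            return path

# Hypothetical 6-node ring topology in which each node knows two neighbours.
peers = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

path = dandelion_stem(peers, origin=0, fluff_probability=0.25, rng=random.Random(1))
fluff_source = path[-1]
print(path, fluff_source)
```

An observer who sees the broadcast fan out from `fluff_source` learns only where the stem ended, not where it began, unless it also controls nodes along the stem.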
Rebroadcast (Core specific)

Amiti Uttarwar’s work on improving transaction rebroadcast greatly reduced
  Core’s attack surface, and she has also spoken about and documented the problem
  extensively. A particularly interesting nuance is dust attacks, and how they
  relate not only to input selection and wallet clustering, but could also be used
  to attack privacy via the rebroadcasting logic.
As a more drastic measure, wallet broadcasting can be disabled in the
  configuration (walletbroadcast=0) to prevent it entirely.
Tor based broadcast

When connecting to a service over Tor, the tor daemon will build so called
  circuits. It does this by connecting to a guard relay, through it to an
  intermediate relay, and finally through that in turn to a third relay. The client
  encrypts the relayed packets in layers, so that, for example, the guard node
  can’t read the payload intended for the third node.
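The layering can be sketched with a toy cipher. Everything here is an assumption for illustration: the XOR keystream built from SHA-256 is not secure and is not what Tor uses, and the per-hop keys are hard-coded rather than negotiated. Because XOR layers commute, this toy also does not enforce the order in which layers are removed, which real constructions do.

```python
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor_layer(key: bytes, data: bytes) -> bytes:
    """Add or remove one layer (XOR with a keystream is its own inverse)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def wrap(payload: bytes, hop_keys) -> bytes:
    """Client side: apply the last hop's layer first, so the guard's is outermost."""
    for key in reversed(hop_keys):
        payload = xor_layer(key, payload)
    return payload

hop_keys = [b"guard-key", b"middle-key", b"exit-key"]   # hypothetical per-hop keys
payload = b"tx broadcast"

cell = wrap(payload, hop_keys)
after_guard = xor_layer(hop_keys[0], cell)          # guard peels its layer
after_middle = xor_layer(hop_keys[1], after_guard)  # middle relay peels the next
after_exit = xor_layer(hop_keys[2], after_middle)   # third relay recovers the payload
print(after_exit)
```

Only after the third relay removes its layer does the payload become readable; the guard and middle relays handle data that is still encrypted under keys they do not hold.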
Circuits are used to create private connections called streams, similar to TCP
  connections. When connecting to the regular internet (“clearnet”) the last node
  on the circuit serves as an exit node, and makes regular TCP connections on
  behalf of the client.
Considering again the lack of authentication or encryption, exit nodes may
  present more of a risk than hidden services, but assuming all peers are
  adversarial this doesn’t make much of a difference.
When connecting to a Tor hidden service, multiple circuits are required. First,
  the hidden service directory must be queried for information about the hidden
  service’s introducer nodes. The hidden service maintains circuits to those
  nodes, where clients can submit rendezvous requests. To actually connect, both
  the client and the server build circuits that are joined at a rendezvous relay
  node, providing mutual privacy. If the service doesn’t need to actually be
  hidden it can be configured to talk directly to the rendezvous node.
As long as at least one relay node on the circuit is not compromised,
  correlating the traffic between the client and the final destination requires
  the capabilities of a global passive adversary and traffic analysis (Tor is not
  designed to protect against such an adversary). When connecting to peers on the
  Bitcoin network this can provide strong guarantees against the linkability of
  specific messages, e.g. block download requests or transaction broadcasts.
An important detail to consider when a wallet uses Tor is whether or not its
  connections are isolated to different circuits. Without isolation, different
  activities may be correlated more easily.
A simple way of broadcasting a transaction is to use a web based form, such as
  the one on Blockstream’s onion service, in Tor Browser.
Privately broadcasting transactions directly to the Bitcoin network usually
  involves connecting to a random peer (onion service or through an exit node)
  with an isolated circuit, relaying the transaction and disconnecting.
  bitcoin-submittx provides this as standalone functionality, and some wallets
  such as JoinMarket and Wasabi do this for their broadcasting.
Transaction Structure

Considering the transaction graph proper, two main heuristics form the basis of
  a lot of the studies of the Bitcoin blockchain (e.g. RS11, MPJ+13). Both
  heuristics tacitly assume that transactions correspond to payments or transfers
  between different entities.
The first is the common input ownership heuristic, which says that when there
  are multiple inputs to a transaction, they are controlled by the same user.
The second has many variations, but the general idea is to heuristically
  distinguish payment outputs from change outputs, whether by script type,
  analyzing the amount, ordering, or some other means.
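The first heuristic is commonly implemented as a union-find over addresses. The sketch below is a minimal version with made-up transactions, each represented only by the list of addresses funding its inputs; it is not taken from any particular analysis tool.

```python
def cluster_addresses(transactions):
    """Merge all input addresses of each transaction into one cluster (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for inputs in transactions:
        for addr in inputs:
            find(addr)                      # register single-input addresses too
        for addr in inputs[1:]:
            union(inputs[0], addr)          # co-spent inputs share an owner

    clusters = {}
    for addr in parent:
        clusters.setdefault(find(addr), set()).add(addr)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical transactions, each given as the addresses funding its inputs.
transactions = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
clusters = cluster_addresses(transactions)
print(clusters)
```

Note how A and C end up in the same cluster without ever being spent together, because B links them. The same transitivity is why a single transaction that violates the heuristic's assumption can merge otherwise unrelated clusters.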
These heuristics trace back to the Bitcoin paper, and multiple papers have
  developed and applied them. Two works in particular are more empirical in nature
  and worth noting. First, in the A Fistful of Bitcoins paper the authors sent
  money through a mixer, and reported on what they observed. Second, Jonas Nick’s
  thesis is notable not only for the BIP 37 privacy leaks he discovered, but also
  for the use of this leak to validate the effectiveness of the two heuristics. His
  work shows that at least for 2015 era light clients’ transaction footprint the
  heuristics were very powerful, with the vast majority of clusters correctly
  identified (80% recall rate for public key clustering).
With the prior knowledge of the naive payment structure and the heuristics that
  are based on it, we can think of privacy tech in bitcoin as breaking down into
  overt and covert ways of introducing ambiguity into individual transactions as
  well as the graph structure between transactions.
For example, payjoin transactions are deliberately constructed to appear to be
  naive payments, but suggest counterfactual conclusions about wallet clustering
  as the most natural interpretation.  CoinJoin transactions introduce ambiguity
  and do so overtly, which makes them more censorable and subject to taint
  analysis. CoinSwaps on the other hand are disjoint swaps of histories on the
  transaction graph and as such have much larger anonymity sets, and technically
  better privacy but in practice may have more variable risks associated with them
  (receiving coins which are tainted possibly in some unknown way so the risk is
  hard to account for).
Finally with lightning, the incentives for scaling and privacy are more or less
  aligned, so its on chain footprint is more naturally resistant to analysis of
  individual payments or flows between parties (but some of that information can
  be recovered through other layers).
Lightning

This subject is too broad to be in scope for an introduction, but a good and
  detailed overview is provided in lnbook’s chapter on privacy.
Privacy Wallets

Similarly, specific privacy techniques are too detailed a subject.
  Several have been proposed and implemented. The history of CoinJoins is richest,
  with multiple deployed implementations and a number of studies (especially those
  by Möser et al), etc.
Specifically with regard to CoinJoin transactions, Maurer et al. propose the
  most comprehensive framework for analyzing privacy with arbitrary amounts (as
  opposed to CoinJoins with $k$ outputs of an identical value and script type),
  but even it needs some modification to be applied to real world protocols.
  LaurentMT’s link probability matrix is essentially the same and predates it,
  but is slightly less precise.
Although teleport-tx is the only coinswap implementation that appears
  to be nearing mainnet, I think it is well worth the study, especially
  because CoinSwap and lightning payments rely on many of the same
  concepts, and comparing CoinSwap and in particular the teleport-tx
  approach with routing etc is an interesting stepping stone to contrast
  with LN privacy discussions.
Additional material

In the next few days I will try to summarize and expand on some of the CoinJoin
  related stuff I brought up in the Q&A:

  SCRIPT
  CoinJoin
  overt/covert/disjoint on chain footprints
  interactive-tx proposal
  moon math based blockchain privacy (unlikely in Bitcoin itself)
  … things i’m working on?

as well as more details on lightning privacy.