Instantly share code, notes, and snippets.

# LaurentMT/gist:e758767ca4038ac40aaf

Last active April 29, 2024 07:45
Show Gist options
• Save LaurentMT/e758767ca4038ac40aaf to your computer and use it in GitHub Desktop.
Bitcoin Transactions & Privacy (part 1)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

### AdamISZ commented Mar 4, 2019

If we envision the bitcoin txs graph as a graph of interlinked utxos, your model has a focus on the ambiguity introduced by the nodes of the graph ("who controls which utxos?") while Boltzmann has a focus on the ambiguity introduced by the edges of the graph ("probability of a link between 2 utxos")

This is a nice simple way to state it. "Link entropy" vs "Node entropy".

Actually, the approach used for Boltzmann is really similar to the concept of Entropy used in thermodynamics with a Boltzmann configuration being the equivalent of a microstate as defined in thermodynamics. From this, we derive the Entropy score by using Gibbs Entropy formula, with the assumption that the probability of each configuration is the same (i.e. we don't have additionnal information allowing to ponder a given microstate with a higher probability).

Sure, but I don't think this changes between the two models. What probably does change: with "node entropy" the probabilities of individual "microstates" are in practice wildly different, whereas in a coinjoin scenario (choosing between multiple subset sum linkages) that isn't the case, so the assumption of "all states equiprobable", while the simplest, would be a bit shitty for "node entropy".

Just one question: according to your metrics, what would be the number of interpretations for this transaction?

What I described above would imply 15 possibilities for any 2-in 2-out, no matter the btc amounts (but of course I haven't proposed anything sophisticated like reasons to discount specific possibilities, etc.).

Re: the various metrics, sure, that looks reasonable, but let me state my principal concern/issue with the "Link Entropy" approach:

Consider the case of a simple tree of transactions, each of which is a payment from one of the two outputs of its parent. So, every transaction is a bog-standard bitcoin payment. Since all of them would fit Example 1 here, all would have zero entropy. This means that a normal bitcoin transaction graph has zero entropy.

I understand your reasoning - that there is no ambiguity in linkages, and it's assumed that each transaction has a deterministic flow, but I fear that this is massively misleading to consumers of this metric.

Since there is, in fact, massive ambiguity in the actual flow of funds. Change heuristics are usually dubious, and at best they're still probabilistic. So even in the absolutely least private case (well - I ignore address reuse), it concerns me that an entropy measure of 0 will be misused and will help parties who want to sell snake-oil "tracing" to people.

Thanks again, this is an interesting topic and a lot clearer to me now.

Sure, but I don't think this changes between the two models.

Agreed. It's just a matter of variables selected for the definition of a "microstate".

What probably does change: with "node entropy" the probabilities of individual "microstates" are in practice wildly different, whereas in a coinjoin scenario (choosing between multiple subset sum linkages) that isn't the case, so the assumption of "all states equiprobable", while the simplest, would be a bit shitty for "node entropy".

I tend to agree and this is also why I believe that the distinction between intrinsic and actual entropies is important. The intrinsic version is a kind of ideal case assuming the equiprobability. The computation of actual entropy is equivalent to removing the equiprobability hypothesis. A corollary is that in Bitcoin, the intrinsic entropy is the max possible value for the entropy score. In best case scenario, this value will stay the same as time passes but it will often decrease if new information are collected, allowing to associate a null probability to some microstates.

I understand your reasoning - that there is no ambiguity in linkages, and it's assumed that each transaction has a deterministic flow, but I fear that this is massively misleading to consumers of this metric.

IMO, it depends on the purpose of the tool. As you noted, if you're an analytics company it may be misleading but if the goal is to alert users about potential privacy issues, displaying the "worst case scenario" seems to me like a good conservative approach.

# On Entropy

Basing it on Shannon entropy is a creative intuition, but there is no need to resort to any intuition at all, as in 2017, Maurer et al. in the paper titled Anonymous CoinJoin Transactions with Arbitrary Values worked out the mathematical basis of it.

Here the entropy depends on solely the number of valid non-complete sub-mappings (`N`), but notice what you want to examine is the strength of input - input, input-output and output-output relationships. With these, weighted by the values you can come up with a measure for entropy that does not rely on intuition. Also the work on top of this entropy can be mostly ported to the new entropy metric, so it's not like everything is lost by changing the intuition axiom with a mathematical axiom.

You may be wondering how the strength of the links of input-input, input-output and output-output are determined: take all the valid sub-mappings and count them. This would be with your notation `N + 1` (as @AdamISZ noted the possibility of the tx not being a CJ shall also not be ruled out, at least not by default.) Let's call it `|M|`, which is `= N + 1`. Next select a link and count how many valid sub-mappings the link is in the same partition, and let's call this number `|L|`. Finally you can get the probability to the link by dividing them: `|L|/|M|`. Then you get all the links and from there on getting your axiomatic entropy metric is obvious.

## Feasibility Concern

Unless you optimized something since to estimate `N`, it is not computationally more expensive as you are relying on identifying the sub-mappings themselves, just like Maurer et al.

## User Friendly Exmplanation

You can find a more user friendly introduction and some suggestions on how to build other metrics out of it here.

## Example

Take tx1 `1, 1, 100 -> 1, 1, 100` and tx2 `100, 100 -> 100, 100`. Shannon entropy gives the same result for both transactions, while an entropy built on top of the Knapsack paper has more accurate score, because in Knapsack you must factors in that the `100 -> 100` partition of tx1 can be found in every valid sub-mapping, while in Shannon you're only concerned with the number of sub-mappings.

### LaurentMT commented Mar 12, 2020

as in 2017, Maurer et al. in the paper titled Anonymous CoinJoin Transactions with Arbitrary Values worked out the mathematical basis of it.

It's been a while since I've read this paper but my souvenir it that it was indeed an interesting paper. For at least 2 reasons:

• The approach is very similar to the one used by Boltzmann and the paper provides a description far better than these gists (academic quality),

Basing it on Shannon entropy is a creative intuition

Actually, the use of the concept of Entropy wasn't based on an intuition but on the certainty that it was the right tool for the task that motivated the design of the algo.

TL/DR:

Sometimes, in statistical physics or in thermodynamics, entropy is interpretated as a measure of uncertainty (as in "a measure of our ignorance about the system"). For instance, we can observe a macrostate and we know that this macrostate may correspond to N different microstates. We aren't able to know the exact microstate but at least, we can evaluate our lack of knowledge, measured as the Entropy.

The specific form of the formula used to compute the entropy will depend on the fact that microstates have:

• an equal probability => formula proposed by Boltzmann (the scientist... not the algo),
• unequal probabilities => (generalized form proposed by Gibbs).

In the case of bitcoin transactions, the algo was initially designed under the angle of blockchain analysis. Thus, having (among others things) a metrics that measures our lack of knowledge about the right interpretation was a win '"Knowledge of our lack of knowledge is still useful knowledge).

In a nutshell we have these correspondences:

• macrostate => the transaction that we can observe on the blockhain,
• microstates => all potential interpretations of the transaction,
• Boltzmann's (the scientist) entropy => Boltzmann's (the algo) entropy

In details, the algo enumerates all valid interpretations. From this, it then derives:

• the entropy,
• the link probabilities matrix (probabilities of links between inputs and outputs)

At this point, tyou may wonder why we use the Boltzmann's form and not the Gibb's form to compute the entropy.

Well, The reason is pretty simple. When we analyze a transaction in isolation (without additional info that we may collect in the blockchain or by others means), all interpretations are equiprobable. Thus the choice of this form. This is what I called the "Intrinsic Entropy" in this gist.

But here's the most interesting part, IMO. If we can collect additionnal info (e.g: 2 inputs are controlled by the same entiy), then we can rule out some of the interpretations , which is roughly equivalent to setting a null probability for these interpretations in the Gibb's forms while considering that others interpretations have the same probability. This is what I called the "Actual Entropy" in this gist.

Under the hood, this is how the Boltzmann algo is used by OXT. At first, OXT computes the "Intrinsic" entropy of the tx but if new information are gathered later from new transactions, then OXT recomputes the entropy with these new information. This is what is displayed by OXT.

Take tx1 1, 1, 100 -> 1, 1, 100 and tx2 100, 100 -> 100, 100. Shannon entropy gives the same result for both transactions, while an entropy built on top of the Knapsack paper has more accurate score, because in Knapsack you must factors in that the 100 -> 100 partition of tx1 can be found in every valid sub-mapping, while in Shannon you're only concerned with the number of sub-mappings.

I'm not sure to understand what you mean by Shannon's entropy in this context. All I can say for sure is that Boltzmann (with the hypothesis of equiprobability of the interpretations) returns 2 different results:

• tx1: Entropy = 3 (i.e. 8 interpretations)
• tx2: Entropy = 1.584963 (i.e. 3 interpretations)

### nopara73 commented Mar 14, 2020

Wrong example and probably wrong about Shannon naming of the entropy, it's just AFAIK Shannon's intuition was using the logarithm and ended up working in practice, but as I'm trying to look it up, I can see there's much more to this, anyway I won't spend much time on this, as it's less relevant to the conversation. Rather I'd get back to my tailor made example that was wrong and can't help but wondering if it's not that simple to construct such example, then maybe maximizing the subsets could give roughly the same result in practice as the Knapsack way where probabilities between i/o, i/i, o/o and come up with entropy from that.

Anyway, as a side note, the Knapsack paper's goal was to create unequal output mixing and only as a side effect they had to work out a mathematical model on how to score mixes. IMO this spinoff research is more applicable than the original goal itself, as the original goal suffers from lack of further research:

• How to do it trustlessly?
• Is it really more efficient than equal amount mixes across transaction chains?

Regarding the trustlessness aspect I'm not sure if it's even possible as mixing networks rely on standard package sizes in one way or another.
Regarding always creating unequal amounts to maximize Knapsack entropy in a single transaction, it makes sense but does it makes sense when we extend it to transaction chains and compare it with standard amount mixes? Remixes are frequent and creating standard output amounts facilitates mixes on the inputs of the remixes. Thus it may very well be the case where the Knapsack entropy of a tx chain would be ultimately greater with standard amounts than the Knapsack entropy maximizing method on single transaction.

Side note aside, what would be the Knapsack entropy anyway?

With Knapsack we can calculate the probabilities of links, but from there we have to get to entropy somehow.

1. Maybe the most obvious improvement would be that those links would be weighed by amounts as a 30% probable link between 1BTC and 100BTC isn't equivalent to a 30% link between 1BTC and 2BTC.
2. Then maybe we could say that the total volume of a mix would be 1, instead of 47.5BTC so we can compare different mixes regardless the volume.

Example. So for tx `1, 2 -> 1, 1, 1` we would get the following:

``````Sub mappings:
1,2 -> 1,1,1
1 -> 1 | 2 -> 1,1
1 -> 1 | 2 -> 1,1
1 -> 1 | 2 -> 1,1

Input match probabilities:
1 - inputs: 2(0.25) | outputs: 1(0.5) 1(0.5) 1(0.5)
2 - inputs: 1(0.25) | outputs: 1(0.75) 1(0.75) 1(0.75)

Output match probabilities:
1 - outputs: 1(0.5) 1(0.5) | inputs: 1(0.5) 2(0.75)
1 - outputs: 1(0.5) 1(0.5) | inputs: 1(0.5) 2(0.75)
1 - outputs: 1(0.5) 1(0.5) | inputs: 1(0.5) 2(0.75)
``````

This is how far Knapsack goes. Next thing could be to normalize the amounts to 1 based on the total volume for comparisions between transactions, which is `3BTC`, so then the amounts would be `1/3, 2/3 -> 1/3, 1/3, 1/3`.

And then maybe add the amounts together and multiply them with the link probabilities?

• 1BTC input links to other inputs and outputs: `(1/3 + 2/3) * 0.25 + (1/3 + 1/3) * 0.5 + (1/3 + 1/3) * 0.5 + (1/3 + 1/3) * 0.5`
• 2BTC input links to other inputs and outputs: `(2/3 + 1/3) * 0.25 + (2/3 + 1/3) * 0.75 + (2/3 + 1/3) * 0.75 + (2/3 + 1/3) * 0.75`
• 3 times 1BTC output links to other outputs: `3 * ((1/3 + 1/3) * 0.5 + (1/3 + 1/3) * 0.5)`

The result of this would be `5.75` = `(1/3 + 2/3) * 0.25 + (1/3 + 1/3) * 0.5 + (1/3 + 1/3) * 0.5 + (1/3 + 1/3) * 0.5 + (2/3 + 1/3) * 0.25 + (2/3 + 1/3) * 0.75 + (2/3 + 1/3) * 0.75 + (2/3 + 1/3) * 0.75 + 3 * ((1/3 + 1/3) * 0.5 + (1/3 + 1/3) * 0.5)`.

And the Boltzmann (not Shannon, I hope I name it properly now) entropy for 3 interpretations would be `1.58`.

I'm far from sure my calculations are correct, but it communicates the idea. Is this how entropy based on link probabilities should be calculated? Or there's a better way to do it?