
@volhovm
Last active November 14, 2017 14:09

Block retrieval optimization

This document aims to provide a detailed overview of a certain development problem and to suggest and compare several possible solutions. We will mostly talk about speeding up data deserialization/verification and the diffusion interface.

Terms and definitions

The construction/deserialization routine for an arbitrary datatype T usually consists of the following steps:

  • (D1) Read bytes (from network or any type of storage)
  • (D2) Check the CBOR structure (the bytestring is correctly CBOR-encoded data)
  • (D3) Check datatype canonicity
  • (D4) Construct/deserialize datatype T
    • (D4.1) An option is to deserialize only part of the data (e.g. only the header of a block)
  • (D5) Validate T.
    • (D5.1) Validate parts of T.

The current codebase only provides an interface for doing all of this at once. The user runs D1 and feeds the bytestring to the decoder. The decoder performs D2-D3, parses the subfields, and then constructs the datatype (D4) with a special constructor (D5).

It is currently impossible to perform (D4.1) or (D5.1), though both can be implemented.
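For concreteness, here is a minimal sketch of this all-at-once flow, written against a cborg-style decoder; T, UnsafeT, mkT and the two Int fields are illustrative stand-ins, not the real codebase types:

import qualified Codec.CBOR.Decoding as CBOR

-- Illustrative type with one internal invariant (tFrom <= tTo).
data T = UnsafeT { tFrom :: Int, tTo :: Int }

-- Smart constructor: the only sanctioned way to build a T.
mkT :: Int -> Int -> Either String T
mkT from to
    | from <= to = Right (UnsafeT from to)
    | otherwise  = Left "mkT: 'from' must not exceed 'to'"

decodeT :: CBOR.Decoder s T
decodeT = do
    _    <- CBOR.decodeListLen        -- D2: check CBOR structure
    from <- CBOR.decodeInt            -- D4: parse the subfields
    to   <- CBOR.decodeInt
    either fail pure (mkT from to)    -- D5: validate while constructing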

Now let's describe features that we'd like to have. We'll formulate them in terms of use cases:

  • (U1) Respond to block requests quickly.
    • Only D1 is needed (just get those bytes and send them).
  • (U2) Diffusion (as designed for Shelley) of blocks: receive a block and distribute it on the fly.
    • (U2.1) D1, D2, D3 and D4.1 are needed. Regarding blocks, we don't want to deserialize the payload at all; checking the header is enough.
    • (U2.2) One can argue that we may want D4 + D5.1 instead of D4.1, if that is not significantly slower.
  • (U3) Apply received block(s) to GState.
    • D1-D5 (the complete procedure) are needed.
  • (U4) Reading data (sometimes lots of it) from the DB for various node-local actions (LRC, GState checks, the API for Middleware -- all of them require DB traversal or LCA computations).
    • D1 and D4 are needed (we trust the local database, so we want to turn off as many checks as possible).
    • AFAIU @avieth assumes D2-D3 are also needed and that we shouldn't optimize here because it's too dangerous anyway.

Solutions and comparison

First of all, it should be said that solving U1 is completely orthogonal to solving U2-U4: we will simply modify the database typeclass so that it is possible to read raw data instead of an already deserialized value.
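A hedged sketch of such a change; MonadDBRead and dbGet are modeled loosely on the codebase's DB classes, and dbGetRaw is a hypothetical addition:

import Data.ByteString (ByteString)

class Bi v  -- stand-in for the codebase's serialization class

class Monad m => MonadDBRead m where
    -- existing style of read: deserializes (and thus validates) on the way out
    dbGet    :: Bi v => ByteString -> m (Maybe v)
    -- hypothetical raw read: D1 only, bytes can be relayed untouched
    dbGetRaw :: ByteString -> m (Maybe ByteString)

With U1 out of the way, let's move on to U2-U4.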

The current state of the CSL-1399 PR aims to cover U4 only, as described in the YT task. It does so in a very compact and concise way, though it prevents us from implementing U2. Decoupling (de)serialization from datatype verification would give us more flexibility to solve U2 -- at least it seems so. Before moving on to other solutions, I find it important to note that the CSL-1399 PR idea can be applied to get rid of D2-D3, as required by U4. Still, if we go with the decoupling approach, it might be a good idea to keep the ReaderT (or patch cborg) so that the heavy D3 can be skipped. Let's call this suggestion Q1 so we can refer to it later.

Let's take a closer look at the validation process itself. Currently we have a bunch of types T that conform to the following structure:

  1. Defined as data T = UnsafeT { .. }
  2. There is a mkT function that works the same way the constructor does, except that it performs several checks defining the internal validity of T.
  3. We maintain the invariant "every instance of type T is valid" by not using UnsafeT anywhere it would be possible to construct an invalid T (see the sketch below).
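A minimal sketch of the export discipline behind point (3), with illustrative names; outside this module, mkT is the only way to obtain a T:

module T
       ( T    -- exported abstractly: UnsafeT stays private
       , mkT
       ) where

data T = UnsafeT { tField :: Int }

mkT :: Int -> Either String T
mkT x
    | x >= 0    = Right (UnsafeT x)
    | otherwise = Left "mkT: tField must be non-negative"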

Pros of this approach:

  • It's easy to implement
  • It integrates nicely into the decoder, so we never have to worry that T's subfield S might be invalid: if it was decoded, it was instantiated, and hence it is valid. We can simply feed S to mkT.

The con of this approach is that it combines verification and deserialization, which is exactly what we want to avoid.

The first dichotomy I would like to highlight is:

  • (Q2) Is it alright to break invariant (3)? Rephrasing: may we assume that a type T can be valid to different degrees? Or should we create a separate Ti for each validation step (T0 -- no checks enforced, T1 -- some checks enforced, ..., Tn -- fully valid)? If yes, which n should we take?

We need to figure out whether it's alright to accept Q2; it's a tough tradeoff, since rejecting Q2 can lead to massive code duplication and unnecessary over-abstraction.

There's another issue with T verification. Imagine it is now completely separate and we already have T deserialized, but not verified (T or T0 in the notation of Q2). What would verifyT look like? Consider the fact that you are now obliged to perform all the subfield checks yourself; one can only hope this stays easy to maintain. Just a note.
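A hedged sketch of what such a standalone verifier entails; S, T and their fields are illustrative:

import Control.Monad (unless)

data S = S { sVal :: Int }
data T = T { tSub :: S, tFrom :: Int, tTo :: Int }

verifyS :: MonadFail m => S -> m ()
verifyS s = unless (sVal s >= 0) $ fail "verifyS: negative sVal"

verifyT :: MonadFail m => T -> m ()
verifyT t = do
    verifyS (tSub t)               -- every subfield re-checked by hand
    unless (tFrom t <= tTo t) $    -- plus T's own invariants
        fail "verifyT: 'from' exceeds 'to'"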

I claim that if we're aiming to solve U2.2, then the only thing we need to agree on is Q2: U1 is irrelevant, and U4 is solved by deserializing to an unsafe T or T0 (and maybe skipping D2/D3 as suggested by Q1).

If we aim to do U2.1, I think we can implement it using newtypes. E.g. if S is a subfield of T and we only want to deserialize S, we can create newtype SAsT = SAsT S, which will be encoded as an S but decoded from a serialized T, skipping T's other fields.
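A hedged sketch of such a decoder, assuming T is encoded as a CBOR list whose first element is S; the trailing fields are parsed generically and thrown away instead of being deserialized into their real types (S and decodeS are illustrative stand-ins):

import           Control.Monad       (replicateM_)
import qualified Codec.CBOR.Decoding as CBOR
import qualified Codec.CBOR.Term     as CBOR (decodeTerm)

type S = Int   -- illustrative stand-in for the real header type

decodeS :: CBOR.Decoder s S
decodeS = CBOR.decodeInt

newtype SAsT = SAsT S

decodeSAsT :: CBOR.Decoder s SAsT
decodeSAsT = do
    len <- CBOR.decodeListLen                 -- T's outer list
    s   <- decodeS                            -- the subfield we want
    replicateM_ (len - 1) CBOR.decodeTerm     -- discard the rest
    pure (SAsT s)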

Another suggestion is to use data SAsT = ..., effectively duplicating T but containing only the one relevant subfield S.

And yes, maybe Haskell's default laziness can help us here too (though I'm not sure). What if the S subfield of T is not strict?

Solution 1: approach with "RawX and X"

Described in input-output-hk/cardano-sl#1823 (comment)

It solves the problem by rejecting Q2. There is, however, a serious drawback: you must propagate "Raw" through all record fields. Imagine you have the following datatypes:

data C = C { ... }
data B = B { bC :: C, ... }
data A = A { aB :: B, ... }

Now imagine you want to turn off validation of C and A. It is impossible to do this without propagating "Raw" to B:

data C = C { ... }
data RawC = RawC { ... }

data B = B { bC :: C, ... }
data RawB = RawB { bC :: RawC, ... }

data A = A { aB :: B, ... }
data RawA = RawA { aB :: RawB, ... }

You should now see where this is going. Imagine you want a RawB1 (which doesn't check anything) and a RawB2 (which checks only bC but not, say, bD): how would you implement that without duplicating datatypes?

Solution 2: using GADT/DataKinds

Another Q2-rejecting solution, suggested in input-output-hk/cardano-sl#1823 (comment).

It has the very same drawback as the first solution. It is really a cosmetic improvement that introduces another tradeoff (see the sketch after the list):

  1. Now you can't mistakenly define RawA and A in different modules (which would feel like chaos).
  2. But now you have to carry the parameter v everywhere you use a block.
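For concreteness, a hedged sketch of the GADT variant; Header, Payload and the constructor names are illustrative assumptions:

{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE GADTs          #-}
{-# LANGUAGE KindSignatures #-}

data Verification = Unverified | Verified

data Header  = Header
data Payload = Payload

data Block (v :: Verification) where
    -- Separate constructors per tag: nothing forces the two to keep
    -- the same shape, which is exactly the divergence risk noted below.
    UnverifiedBlock :: Header -> Payload -> Block 'Unverified
    VerifiedBlock   :: Header -> Payload -> Block 'Verified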

Solution 3: phantom types

Same as Solution 2, but instead of having two constructors you now have:

data Block (v :: Verification) = Block {..}

This removes the (undesirable) possibility of having different representations of the block for the verified and unverified parameters.
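A hedged sketch of the phantom-type variant; verifyBlock and the field names are illustrative. A single constructor is shared by both tags, so the representations cannot diverge, and verifyBlock is the only way to mint a Block 'Verified:

{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}

data Verification = Unverified | Verified

data Header  = Header
data Payload = Payload

data Block (v :: Verification) = Block
    { blkHeader  :: Header
    , blkPayload :: Payload
    }

verifyBlock :: MonadFail m => Block 'Unverified -> m (Block 'Verified)
verifyBlock (Block h p) = do
    -- ... run all invariant checks here, calling 'fail' on violation ...
    pure (Block h p)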

Solution 4: watching your code

A Q2-accepting solution that works essentially as follows:

  1. Use T, but use implicit function invariants (e.g. "this function works on valid T only").
  2. Use verifyT or partial verifications.

The main reason for confidence that this will work is that most of the CSL core code already operates on valid blocks, while unsafe blocks will be used only in the diffusion layer, so these two sets of modules intersect minimally, which also minimizes the opportunity for mistakes.

@gromakovsky

> It does it in a wery compact and

very

> Pros of this approach:

Another advantage is that it combines nicely with the Q1 suggestion.

> (Q2) is it alright to break invariant (3)?

What is invariant (3)?

@gromakovsky commented Nov 14, 2017

> There's another issue with T verification. Imagine it is now completely separate and we already have T deserialized, but not verified (T or T0 in the notation of Q2). What would verifyT look like? Consider the fact that you are now obliged to perform all the subfield checks yourself; one can only hope this stays easy to maintain. Just a note.

Yes, that's another advantage of the current approach. For instance, we have:

mkTx
    :: MonadFail m
    => NonEmpty TxIn -> NonEmpty TxOut -> TxAttributes -> m Tx

Inside mkTx we can assume that the TxIns and TxOuts are valid, so we only need to check the invariants of Tx itself.
If we instead have verifyTx :: MonadFail m => Tx -> m (), then we should also call verifyTxIn for all TxIns and verifyTxOut for all TxOuts.
One can argue that TxIn and TxOut are simple types without any invariants. However:

  1. TxOut contains Coin, which has an internal invariant, so with the verify approach we also have to have verifyTxOut.
  2. TxIn doesn't have nested invariants, so we indeed don't need verifyTxIn, but can we be sure that we will never add any checks to TxIn? If we do, we may forget to use verifyTxIn in verifyTx and the code will still compile (unless we do something sophisticated, but I don't know what).

Even if we call verifyX for every field X of some type Y, we may later add another field to Y and forget to add the corresponding verify call to verifyY.
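A hedged sketch of the hazard being described; verifyCoin, checkTxInvariants and txOutValue are assumed helpers/fields, not verified against the codebase:

verifyTxOut :: MonadFail m => TxOut -> m ()
verifyTxOut txOut = verifyCoin (txOutValue txOut)

verifyTx :: MonadFail m => Tx -> m ()
verifyTx tx = do
    mapM_ verifyTxOut (_txOutputs tx)
    -- TxIn needs no check today; if it ever gains one, nothing here
    -- will fail to compile when we forget to add the call.
    checkTxInvariants tx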

P. S. Perhaps it can be solved using TH.

@gromakovsky commented Nov 14, 2017

Upd: after I wrote the comment above, I read the article in full and remembered the verified/unverified types. I see that it's not that bad.
I also remembered that the key problem with what we have now is that we can't have an unverified value, but we need one for the diffusion layer.

data Block (v :: Verification) = Block {..}

Does v apply to all fields also? Will we have

data Tx (v :: Verification) = Tx
    { _txInputs     :: !(NonEmpty TxIn)  -- ^ Inputs of transaction.
    , _txOutputs    :: !(NonEmpty (TxOut v)) -- ^ Outputs of transaction.
    , _txAttributes :: !TxAttributes     -- ^ Attributes of transaction
    } deriving (Eq, Ord, Generic, Show, Typeable)

?
