
@volhovm
Last active November 14, 2017 14:09

Block retrieval optimization

This document aims to provide a detailed overview of a certain development problem and to suggest and compare several possible solutions. We will mostly talk about speeding up data deserialization/verification and the diffusion interface.

Terms and definitions

The construction/deserialization routine for an arbitrary datatype T usually consists of the following steps:

  • (D1) Read bytes (from network or any type of storage)
  • (D2) Check the CBOR structure (the bytestring is correctly CBOR-encoded data)
  • (D3) Check datatype canonicity
  • (D4) Construct/deserialize datatype T
    • (D4.1) An option is to deserialize only part of the data (e.g. only the header of a block)
  • (D5) Validate T.
    • (D5.1) Validate parts of T.

The current codebase only provides an interface for doing all of this at once. The user runs D1 and feeds the bytestring to the decoder. The decoder performs D2-D3, parses the subfields, and then constructs the datatype (D4) with a special constructor (D5).

It is currently impossible to perform (D4.1) or (D5.1), though both can be implemented.
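For concreteness, here is a minimal sketch of this all-at-once flow, written against a cborg-style decoder; T, UnsafeT, mkT and the two Int fields are illustrative stand-ins, not the real codebase types:

import qualified Codec.CBOR.Decoding as CBOR

-- Illustrative type with one internal invariant (tFrom <= tTo).
data T = UnsafeT { tFrom :: Int, tTo :: Int }

-- Smart constructor: the only sanctioned way to build a T.
mkT :: Int -> Int -> Either String T
mkT from to
    | from <= to = Right (UnsafeT from to)
    | otherwise  = Left "mkT: 'from' must not exceed 'to'"

decodeT :: CBOR.Decoder s T
decodeT = do
    _    <- CBOR.decodeListLen        -- D2: check CBOR structure
    from <- CBOR.decodeInt            -- D4: parse the subfields
    to   <- CBOR.decodeInt
    either fail pure (mkT from to)    -- D5: validate while constructing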

Now let's describe features that we'd like to have. We'll formulate them in terms of use cases:

  • (U1) Respond to block requests quickly.
    • Only D1 is needed (just get those bytes and send them).
  • (U2) Diffusion (as designed for Shelley) of blocks: receive a block and distribute it on the fly.
    • (U2.1) D1, D2, D3 and D4.1 are needed. Regarding blocks, we don't want to deserialize the payload at all; checking the header is enough.
    • (U2.2) One can argue that we may want D4 + D5.1 instead of D4.1, if that is not significantly slower.
  • (U3) Apply received block(s) to GState.
    • D1-D5 (the complete procedure) are needed.
  • (U4) Reading data (sometimes lots of it) from the DB for various node-local actions (LRC, GState checks, the API for Middleware -- all of them require DB traversal or LCA computations).
    • D1 and D4 are needed (we trust the local database, so we want to turn off as many checks as possible).
    • AFAIU @avieth assumes D2-D3 are also needed and that we shouldn't optimize here because it's too dangerous anyway.

Solutions and comparison

First of all, it should be said that solving U1 is completely orthogonal to solving U2-U4: we will simply modify the database typeclass so that it is possible to read raw data instead of an already deserialized value.
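A hedged sketch of such a change; MonadDBRead and dbGet are modeled loosely on the codebase's DB classes, and dbGetRaw is a hypothetical addition:

import Data.ByteString (ByteString)

class Bi v  -- stand-in for the codebase's serialization class

class Monad m => MonadDBRead m where
    -- existing style of read: deserializes (and thus validates) on the way out
    dbGet    :: Bi v => ByteString -> m (Maybe v)
    -- hypothetical raw read: D1 only, bytes can be relayed untouched
    dbGetRaw :: ByteString -> m (Maybe ByteString)

With U1 out of the way, let's move on to U2-U4.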

The current state of the CSL-1399 PR aims to cover U4 only, as described in the YT task. It does so in a very compact and concise way, though it prevents us from implementing U2. Decoupling (de)serialization from datatype verification would give us more flexibility to solve U2 -- at least it seems so. Before moving on to other solutions, I find it important to note that the CSL-1399 PR idea can be applied to get rid of D2-D3, as required by U4. Still, if we go with the decoupling approach, it might be a good idea to keep the ReaderT (or patch cborg) so that the heavy D3 can be skipped. Let's call this suggestion Q1 so we can refer to it later.

Let's take a closer look at the validation process itself. Currently we have a bunch of types T that conform to the following structure:

  1. Defined as data T = UnsafeT { .. }
  2. There is a mkT function that works the same way the constructor does, except that it performs several checks defining the internal validity of T.
  3. We maintain the invariant "every instance of type T is valid" by not using UnsafeT anywhere it would be possible to construct an invalid T (see the sketch below).
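A minimal sketch of the export discipline behind point (3), with illustrative names; outside this module, mkT is the only way to obtain a T:

module T
       ( T    -- exported abstractly: UnsafeT stays private
       , mkT
       ) where

data T = UnsafeT { tField :: Int }

mkT :: Int -> Either String T
mkT x
    | x >= 0    = Right (UnsafeT x)
    | otherwise = Left "mkT: tField must be non-negative"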

Pros of this approach:

  • It's easy to implement
  • It integrates nicely into the decoder, so we never have to worry that T's subfield S might be invalid: if it was decoded, it was instantiated, and hence it is valid. We can simply feed S to mkT.

The con of this approach is that it combines verification and deserialization, which is exactly what we want to avoid.

The first dichotomy I would like to highlight is:

  • (Q2) Is it alright to break invariant (3)? Rephrasing: may we assume that a type T can be valid to different degrees? Or should we create a separate Ti for each validation step (T0 -- no checks enforced, T1 -- some checks enforced, ..., Tn -- fully valid)? If yes, which n should we take?

We need to figure out whether it's alright to accept Q2; it's a tough tradeoff, since rejecting Q2 can lead to massive code duplication and unnecessary over-abstraction.

There's another issue with T verification. Imagine it is now completely separate and we already have T deserialized, but not verified (T or T0 in the notation of Q2). What would verifyT look like? Consider the fact that you are now obliged to perform all the subfield checks yourself; one can only hope this stays easy to maintain. Just a note.
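A hedged sketch of what such a standalone verifier entails; S, T and their fields are illustrative:

import Control.Monad (unless)

data S = S { sVal :: Int }
data T = T { tSub :: S, tFrom :: Int, tTo :: Int }

verifyS :: MonadFail m => S -> m ()
verifyS s = unless (sVal s >= 0) $ fail "verifyS: negative sVal"

verifyT :: MonadFail m => T -> m ()
verifyT t = do
    verifyS (tSub t)               -- every subfield re-checked by hand
    unless (tFrom t <= tTo t) $    -- plus T's own invariants
        fail "verifyT: 'from' exceeds 'to'"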

I claim that if we're aiming to solve U2.2, then the only thing we need to agree on is Q2: U1 is irrelevant, and U4 is solved by deserializing to an unsafe T or T0 (and maybe skipping D2/D3 as suggested by Q1).

If we aim to do U2.1, I think we can implement it using newtypes. E.g. if S is a subfield of T and we only want to deserialize S, we can create newtype SAsT = SAsT S, which will be encoded as an S but decoded from a serialized T, skipping T's other fields.
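A hedged sketch of such a decoder, assuming T is encoded as a CBOR list whose first element is S; the trailing fields are parsed generically and thrown away instead of being deserialized into their real types (S and decodeS are illustrative stand-ins):

import           Control.Monad       (replicateM_)
import qualified Codec.CBOR.Decoding as CBOR
import qualified Codec.CBOR.Term     as CBOR (decodeTerm)

type S = Int   -- illustrative stand-in for the real header type

decodeS :: CBOR.Decoder s S
decodeS = CBOR.decodeInt

newtype SAsT = SAsT S

decodeSAsT :: CBOR.Decoder s SAsT
decodeSAsT = do
    len <- CBOR.decodeListLen                 -- T's outer list
    s   <- decodeS                            -- the subfield we want
    replicateM_ (len - 1) CBOR.decodeTerm     -- discard the rest
    pure (SAsT s)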

Another suggestion is to use data SAsT = ..., effectively duplicating T but containing only the one relevant subfield S.

And yes, maybe Haskell's default laziness can help us here too (though I'm not sure). What if the S subfield of T is not strict?

Solution 1: approach with "RawX and X"

Described in input-output-hk/cardano-sl#1823 (comment)

It solves the problem by rejecting Q2. There is, however, a serious drawback: you must propagate "Raw" through all record fields. Imagine you have the following datatypes:

data C = C { ... }
data B = B { bC :: C, ... }
data A = A { aB :: B, ... }

Now imagine you want to turn off validation of C and A. It is impossible to do this without propagating "Raw" to B:

data C = C { ... }
data RawC = RawC { ... }

data B = B { bC :: C, ... }
data RawB = RawB { bC :: RawC, ... }

data A = A { aB :: B, ... }
data RawA = RawA { aB :: RawB, ... }

You should now see where this is going. Imagine you want a RawB1 (which doesn't check anything) and a RawB2 (which checks only bC but not, say, bD): how would you implement that without duplicating datatypes?

Solution 2: using GADT/DataKinds

Another Q2-rejecting solution, suggested in input-output-hk/cardano-sl#1823 (comment).

It has the very same drawback as the first solution. It is really a cosmetic improvement that introduces another tradeoff (see the sketch after the list):

  1. Now you can't mistakenly define RawA and A in different modules (which would feel like chaos).
  2. But now you have to carry the parameter v everywhere you use a block.
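For concreteness, a hedged sketch of the GADT variant; Header, Payload and the constructor names are illustrative assumptions:

{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE GADTs          #-}
{-# LANGUAGE KindSignatures #-}

data Verification = Unverified | Verified

data Header  = Header
data Payload = Payload

data Block (v :: Verification) where
    -- Separate constructors per tag: nothing forces the two to keep
    -- the same shape, which is exactly the divergence risk noted below.
    UnverifiedBlock :: Header -> Payload -> Block 'Unverified
    VerifiedBlock   :: Header -> Payload -> Block 'Verified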

Solution 3: phantom types

Same as Solution 2, but instead of having two constructors you now have:

data Block (v :: Verification) = Block {..}

This removes the (undesirable) possibility of having different representations of the block for the verified and unverified parameters.
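A hedged sketch of the phantom-type variant; verifyBlock and the field names are illustrative. A single constructor is shared by both tags, so the representations cannot diverge, and verifyBlock is the only way to mint a Block 'Verified:

{-# LANGUAGE DataKinds      #-}
{-# LANGUAGE KindSignatures #-}

data Verification = Unverified | Verified

data Header  = Header
data Payload = Payload

data Block (v :: Verification) = Block
    { blkHeader  :: Header
    , blkPayload :: Payload
    }

verifyBlock :: MonadFail m => Block 'Unverified -> m (Block 'Verified)
verifyBlock (Block h p) = do
    -- ... run all invariant checks here, calling 'fail' on violation ...
    pure (Block h p)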

Solution 4: watching your code

A Q2-accepting solution that works essentially as follows:

  1. Use T, but use implicit function invariants (e.g. "this function works on valid T only").
  2. Use verifyT or partial verifications.

The main reason for confidence that this will work is that most of the CSL core code already operates on valid blocks, while unsafe blocks will be used only in the diffusion layer, so these two sets of modules intersect minimally, which also minimizes the opportunity for mistakes.

@gromakovsky

> It does it in a wery compact and

very

> Pros of this approach:

Another advantage is that it combines nicely with the Q1 suggestion.

> (Q2) is it alright to break invariant (3)?

What is invariant (3)?

@gromakovsky commented Nov 14, 2017

> There's another issue with T verification. Imagine it is now completely separate and we already have T deserialized, but not verified (T or T0 in the notation of Q2). What would verifyT look like? Consider the fact that you are now obliged to perform all the subfield checks yourself; one can only hope this stays easy to maintain. Just a note.

Yes, that's another advantage of the current approach. For instance, we have:

mkTx
    :: MonadFail m
    => NonEmpty TxIn -> NonEmpty TxOut -> TxAttributes -> m Tx

Inside mkTx we can assume that the TxIns and TxOuts are valid, so we only need to check the invariants of Tx itself.
If we instead have verifyTx :: MonadFail m => Tx -> m (), then we should also call verifyTxIn for all TxIns and verifyTxOut for all TxOuts.
One can argue that TxIn and TxOut are simple types without any invariants. However:

  1. TxOut contains Coin, which has an internal invariant, so with the verify approach we also have to have verifyTxOut.
  2. TxIn doesn't have nested invariants, so we indeed don't need verifyTxIn, but can we be sure that we will never add any checks to TxIn? If we do, we may forget to use verifyTxIn in verifyTx and the code will still compile (unless we do something sophisticated, but I don't know what).

Even if we call verifyX for every field X of some type Y, we may later add another field to Y and forget to add the corresponding verify call to verifyY.
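A hedged sketch of the hazard being described; verifyCoin, checkTxInvariants and txOutValue are assumed helpers/fields, not verified against the codebase:

verifyTxOut :: MonadFail m => TxOut -> m ()
verifyTxOut txOut = verifyCoin (txOutValue txOut)

verifyTx :: MonadFail m => Tx -> m ()
verifyTx tx = do
    mapM_ verifyTxOut (_txOutputs tx)
    -- TxIn needs no check today; if it ever gains one, nothing here
    -- will fail to compile when we forget to add the call.
    checkTxInvariants tx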

P. S. Perhaps it can be solved using TH.

@gromakovsky commented Nov 14, 2017

Upd: after I wrote the comment above, I read the article in full and remembered the verified/unverified types. I see that it's not that bad.
I also remembered that the key problem with what we have now is that we can't have an unverified value, but we need one for the diffusion layer.

data Block (v :: Verification) = Block {..}

Does v apply to all fields also? Will we have

data Tx (v :: Verification) = Tx
    { _txInputs     :: !(NonEmpty TxIn)  -- ^ Inputs of transaction.
    , _txOutputs    :: !(NonEmpty (TxOut v)) -- ^ Outputs of transaction.
    , _txAttributes :: !TxAttributes     -- ^ Attributes of transaction
    } deriving (Eq, Ord, Generic, Show, Typeable)

?
