FDB Random Notes
jzhou77 commented Nov 11, 2020

DD does a two-phase transfer of ownership from an old team to a new team after an SS failure (sketched below):

  1. Write mutations to both the old and new teams. The new SSes fetch data from the old SSes (range reads). Once they have all the data, they signal DD that the copy is complete.
  2. DD reassigns ownership to the new team and tells the TLogs to remove the tag of the failed SS.

If the failed SS comes back hours after this process starts, it is far behind and hard for it to catch up, because the data it needs to read is on disk rather than in memory. So it is better to keep treating this SS as failed.
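A minimal sketch of the two-phase handoff for one shard, in illustrative C++ (not FDB code; Team, Shard, applyMutation, and completeHandoff are hypothetical names):

    // Phase 1: dual-write while the new team copies existing data; Phase 2: flip ownership.
    #include <iostream>
    #include <map>
    #include <string>

    struct Team {
        std::string name;
        std::map<std::string, std::string> data;  // key -> value held by this team
    };

    struct Shard {
        Team* owner;     // team currently serving the shard
        Team* incoming;  // new team being seeded during phase 1 (nullptr otherwise)
    };

    // Phase 1: every new mutation is written to BOTH teams so nothing is lost
    // while the new team range-reads the existing data from the old team.
    void applyMutation(Shard& s, const std::string& k, const std::string& v) {
        s.owner->data[k] = v;
        if (s.incoming) s.incoming->data[k] = v;
    }

    // Phase 2: once the new team signals it has all the data, DD flips ownership
    // (and, in real FDB, tells the TLogs to drop the failed SS's tag).
    void completeHandoff(Shard& s) {
        s.owner = s.incoming;
        s.incoming = nullptr;
    }

    int main() {
        Team oldTeam{"old", {{"a", "1"}, {"b", "2"}}}, newTeam{"new", {}};
        Shard shard{&oldTeam, &newTeam};

        newTeam.data = oldTeam.data;     // bulk copy via range reads
        applyMutation(shard, "c", "3");  // a concurrent write lands on both teams
        completeHandoff(shard);          // DD reassigns the owner

        std::cout << shard.owner->name << " now owns " << shard.owner->data.size() << " keys\n";
    }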

jzhou77 commented Nov 11, 2020

HA

  • Proxy: adds primary SS tags, remote SS tags, and an LR (log router) tag.
  • Primary TLog: indexes only the primary SS tags and the LR tag.
  • Satellite TLog: indexes only the LR tag.

The LR uses an ephemeral tag; it pops data only after all remote TLogs have popped the ephemeral tags.

Remote TLogs pull data from all LRs and merge the streams. This requires the LRs to be in sync; otherwise a remote SS has to wait for the slowest LR. (A small sketch of this tag routing follows.)
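A toy sketch of the tag routing above, in illustrative C++ (not FDB code; TagKind and the index predicates are made-up names):

    #include <iostream>
    #include <vector>

    enum class TagKind { PrimarySS, RemoteSS, LogRouter };
    struct Tag { TagKind kind; int id; };

    // The proxy attaches primary SS tags, remote SS tags, and one LR tag to a commit.
    std::vector<Tag> proxyTagsForCommit() {
        return { {TagKind::PrimarySS, 1}, {TagKind::PrimarySS, 2},
                 {TagKind::RemoteSS, 7}, {TagKind::LogRouter, 0} };
    }

    // Each log class indexes only a subset of the attached tags.
    bool primaryTLogIndexes(const Tag& t)   { return t.kind == TagKind::PrimarySS || t.kind == TagKind::LogRouter; }
    bool satelliteTLogIndexes(const Tag& t) { return t.kind == TagKind::LogRouter; }

    int main() {
        int primary = 0, satellite = 0;
        for (const Tag& t : proxyTagsForCommit()) {
            primary   += primaryTLogIndexes(t);
            satellite += satelliteTLogIndexes(t);
        }
        // The primary TLog indexes 3 of the 4 tags; the satellite TLog indexes only the LR tag.
        std::cout << "primary=" << primary << " satellite=" << satellite << "\n";
    }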

If FDB cannot recruit enough processes in the primary region, it automatically fails over to the remote region. The CC needs to talk to a majority of Coordinators.

Client reads in the remote region can hit future_version errors if the remote region lags behind. The load-balancing algorithm can direct reads to remote SSes. Reducing the replication degree can lower storage cost, but can worsen hot spots, because there are fewer SS replicas.

  • Failures
    • Remote DC failure: TLogs spill for 24 hours; ~16 hours into spilling, the remote DC should be dropped.
      • The remote SS assignment remains the same, but primary sharding can change due to hot spots.
    • A remote SS becomes slow: its TLog queues more data -> slows down the LR -> the primary/satellite TLogs queue more data.
    • Exercise failover: demo -> production

WAN Latency

  • Commit slow

  • Remote falling behind: DD has a problem moving data in the primary; it impacts the primary's capability for moves and load balancing, because the remote lags

  • Failure monitoring

  • Asymmetric networking

  • Scalability

    • CC tracks all processes in two regions
    • Master tracks remote TLogs: a remote TLog failure potentially does not cause a primary recovery; the remote TLogs are just replaced

Region configuration: add regions; add processes; change replication -> DD issues moves -> one SS pulls from the primary and the other two copy from that first SS; configure automatic failover.

Q: how to do distributed txn?

jzhou77 commented Nov 12, 2020

Data Distribution

  • Background actors
    • BgDDMountainChopper: moves data away from the most utilized server
    • BgDDValleyFiller: moves data to the least utilized server
    • Directly moving data from the most to the least utilized server can cause skewed data distribution (sketched below)
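A toy illustration of the two balancing directions, in illustrative C++ only (the real actors pick teams with randomization and many more constraints):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // One step: source = most utilized server, destination = least utilized, move one unit.
    void rebalanceStep(std::vector<int>& bytesPerServer) {
        auto src = std::max_element(bytesPerServer.begin(), bytesPerServer.end());
        auto dst = std::min_element(bytesPerServer.begin(), bytesPerServer.end());
        if (src != dst && *src > *dst) { --*src; ++*dst; }
    }

    int main() {
        std::vector<int> servers = {90, 50, 10};   // logical bytes per server
        for (int i = 0; i < 40; ++i) rebalanceStep(servers);
        for (int b : servers) std::cout << b << " ";  // prints: 50 50 50
        std::cout << "\n";
        // If every mover always pairs the single hottest server with the single coldest
        // one, all movement funnels through those two servers; this is the kind of skew
        // the note above warns about, and randomizing the choice spreads the work.
    }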

Tracker: tracks failures(?) and splits shards

Queue: on source servers

  • dataDistributionRelocator: processes the queue head
    • Before a move finishes, in-flight move stats are added for the SS.
    • teamCollection: one per region
    • Across the WAN, a random SS is picked to copy the data, which then seeds the other SSes in the team.
  • When an SS finishes fetching, it signals transferComplete so that DD can start the next move, even though the SS still takes time to persist the data.

The SS has a queue for fetching keys and keeps logical-byte stats from samples. DD tries to balance logical bytes among the SSes (a sketch of the sampling idea follows).
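A sketch of sampling-based size estimation, in illustrative C++ (not FDB code; bytesPerSample is an assumed knob): sample each write with probability proportional to its size and weight it by 1/probability, so the estimate of total logical bytes stays unbiased.

    #include <algorithm>
    #include <iostream>
    #include <random>

    int main() {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> unit(0.0, 1.0);

        const double bytesPerSample = 1000.0;   // expected bytes between samples (assumed knob)
        double actualBytes = 0.0, estimatedBytes = 0.0;

        for (int i = 0; i < 100000; ++i) {
            double kvSize = 20 + 200 * unit(rng);              // size of one key-value write
            actualBytes += kvSize;
            double p = std::min(1.0, kvSize / bytesPerSample); // sampling probability, proportional to size
            if (unit(rng) < p)
                estimatedBytes += kvSize / p;                  // 1/p weight keeps the estimate unbiased
        }
        // The two numbers agree closely, using only a small fraction of the entries.
        std::cout << "actual=" << actualBytes << " estimated=" << estimatedBytes << "\n";
    }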

  • DataDistribution.actor.cpp

  • MoveKeys: mechanism for reassigning shards, implemented via transactions.

    • Moves data from the src team to the dst team.

    • checkFetchingState: polls the dst SSes about their fetching state via waitForShardReady.

    • finishMoveKeys: polls the dst servers until the move has finished, so it can remove the src. If an SS is in both src and dst, ... The key range map is changed at the end.

    • moveKeyLock: makes sure only one DD is active.

    • krmGetRanges: krm means keyRangeMap, a data structure mapping a key range to its owners (see the sketch after this list).

    • krmSetRangeCoalescing: applyMetadataMutations sees changes under keyServersPrefix and serverKeysPrefix. When an SS sees the privatized mutation in applyPrivateData, it knows it owns that key range. AddingShard buffers mutations during the move. fetchComplete: after the fetch completes, the SS needs to wait the 5s MVCC window for the data to become durable.
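A sketch of the keyRangeMap idea behind krmGetRanges, in illustrative C++ (not the FDB data structure): a sorted map from range-start key to the owning servers, so the owner of any key is the entry at or before it.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using Owners = std::vector<std::string>;   // SS ids owning the range

    // Each entry maps a range [startKey, nextStartKey) to its owners.
    std::map<std::string, Owners> keyRangeMap = {
        {"",  {"ss1", "ss2", "ss3"}},
        {"m", {"ss4", "ss5", "ss6"}},
    };

    Owners ownersOf(const std::string& key) {
        auto it = keyRangeMap.upper_bound(key);  // first range starting AFTER key
        --it;                                    // step back to the range containing key
        return it->second;
    }

    int main() {
        for (const char* k : {"apple", "zebra"}) {
            std::cout << k << " ->";
            for (const auto& ss : ownersOf(k)) std::cout << " " << ss;
            std::cout << "\n";
        }
        // apple -> ss1 ss2 ss3
        // zebra -> ss4 ss5 ss6
    }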

SS: the fetch uses the SS's own version, which could be too old to ever catch up. If fetching at a newer version, the SS needs to wait until it has caught up to that version. FetchInjectionInfo. In the update loop, fk.send() runs the rest of the fetch loop (sketched below).
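A conceptual sketch of that version handshake, in illustrative C++ (not FDB code; fetchRangeAt and the version numbers are made up):

    #include <algorithm>
    #include <iostream>

    struct StorageServer { long version; };

    // Hypothetical helper: pretend to range-read the shard at 'fetchVersion'.
    void fetchRangeAt(long fetchVersion) {
        std::cout << "fetched shard data at version " << fetchVersion << "\n";
    }

    void fetchKeys(StorageServer& ss, long latestVersion) {
        // If the SS is far behind, fetching at its own (old) version may never
        // catch up, so fetch at a recent version instead...
        long fetchVersion = std::max(ss.version, latestVersion);
        fetchRangeAt(fetchVersion);

        // ...but the shard must stay unreadable until the SS's update loop
        // (where fk.send() hands control back to the fetch) reaches fetchVersion.
        while (ss.version < fetchVersion)
            ++ss.version;   // stand-in for applying mutations from the TLogs

        std::cout << "shard readable at version " << ss.version << "\n";
    }

    int main() {
        StorageServer ss{100};
        fetchKeys(ss, 250);
    }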

jzhou77 commented Dec 11, 2020

NIC and kernel TCP tuning

  • Increase the Intel NIC ring buffer size to absorb bursty traffic patterns (default is 512):
    * "ethtool -G eth0 rx 4096 tx 4096"
  • Disable flow control for the Intel NIC:
    * "ethtool -A eth0 rx off tx off"
  • Increase the socket buffer default size from 212992 (i.e., 208 KB) to 1 MB (see the sketch after this list):
    * "sysctl -w net.core.rmem_default=1048576"
    * "sysctl -w net.core.rmem_max=1048576"
    * "sysctl -w net.core.wmem_default=1048576"
    * "sysctl -w net.core.wmem_max=1048576"

jzhou77 commented May 26, 2021

Anti-quorum is not used: if you let a TLog fall behind the other TLogs, all of the storage servers that read from that TLog will also fall behind, and then clients trying to read from those storage servers will get future_version errors.

jzhou77 commented Sep 1, 2021

Docker on Mac without docker-desktop

brew install docker docker-machine docker-credential-helper docker-compose virtualbox

(make sure ~/.docker/config.json has "credsStore" : "osxkeychain" in it)

jzhou77 commented Sep 21, 2021

documentation build

  1. ninja docpreview starts a web server at a local port: e.g., Serving HTTP on 0.0.0.0 port 14244 (http://0.0.0.0:14244/)
  2. Since this web server runs in Okteto, you need to do port forwarding:
$ ssh -L 14244:localhost:14244 jzhou-dev.okteto

Or

  $ kubectl get all
  $ kubectl port-forward replicaset.apps/jzhou-dev-6df7457774 14244 14244
  3. Navigate to http://localhost:14244/performance.html.

jzhou77 commented Nov 1, 2021

  1. Install gdbgui using pip install gdbgui
  2. Forward the port used by gdbserver or gdbgui in the Okteto environment to your local machine: kubectl port-forward pod/vishesh-dev-6d4f39f78-ngs8f 5000:5000

jzhou77 commented May 7, 2022

	// The commit of the TLog's persistent data must complete within TLOG_MAX_CREATE_DURATION;
	// otherwise ioTimeoutError turns it into an io_timeout error (see the trace below).
	TraceEvent("TLogInitCommit", logData->logId).log();
	wait(ioTimeoutError(self->persistentData->commit(), SERVER_KNOBS->TLOG_MAX_CREATE_DURATION));

217.908010 Role ID=f70414b30d93c824 As=TLog Transition=Begin Origination=Recovered OnWorker=3135ceefe717ee33 SharedTLog=853fab151d4a9085 GP,LR,MS,SS,TL
217.908010 TLogRejoining ID=f70414b30d93c824 ClusterController=50cd94d391dd1f0b DbInfoMasterLifeTime=50cd94d391dd1f0b#4 LastMasterLifeTime=0000000000000000#0 GP,LR,MS,SS,TL
217.908010 TLogStart ID=f70414b30d93c824 RecoveryCount=28 GP,LR,MS,SS,TL
217.908010 TLogInitCommit ID=f70414b30d93c824 GP,LR,MS,SS,TL

227.908010 IoTimeoutError ID=0000000000000000 Error=io_timeout ErrorDescription=A disk IO operation failed to complete in a timely manner ErrorCode=1521 Duration=10 BT=addr2line -e fdbserver.debug -p -C -f -i 0x1d9102b 0x249cfbb 0x249c324 0x2472a61 0x2479a69 0x24c2cac 0x2631811 0x261bb4d 0x13b1f2c 0x360e31b 0x360df73 0x11c6198 0x36dfda6 0x36dfc18 0x1b9d8c5 0x7f719aff6555 Backtrace=addr2line -e fdbserver.debug -p -C -f -i 0x3823549 0x3823801 0x381eb04 0x1d710d3 0x1d70d57 0x387bfb8 0x387bdde 0x11c6198 0x36dfda6 0x36dfc18 0x1b9d8c5 0x7f719aff6555 GP,LR,MS,SS,TL

227.908010 TLogError ID=853fab151d4a9085 Error=io_timeout ErrorDescription=A disk IO operation failed to complete in a timely manner ErrorCode=1521 GP,LR,MS,SS,TL
227.908010 Role ID=f70414b30d93c824 Transition=End As=TLog Reason=Error GP,LR,MS,SS,TL
227.908010 RoleRemove ID=0000000000000000 Address=2.0.1.2:1 Role=TLog NumRoles=7 Value=1 Result=Decremented Role GP,LR,MS,SS,TL
227.908010 Role ID=7fb6adb922290f7e Transition=End As=TLog Reason=Error GP,LR,MS,SS,TL
227.908010 RoleRemove ID=0000000000000000 Address=2.0.1.2:1 Role=TLog NumRoles=6 Value=0 Result=Removed Role GP,LR,MS,SS,TL
