FDB Random Notes

jzhou77 commented May 9, 2020

-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ParallelRestoreNewBackupCorrectnessMultiCycles.txt -s 517382541 -b on

Install docker daemon

yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install docker-ce docker-ce-cli containerd.io

Start docker daemon

sudo systemctl enable docker
sudo systemctl start docker
sudo chmod 777 /var/run/docker.sock

ssh joshua4 "sudo systemctl enable docker; sudo systemctl start docker; sudo chmod 777 /var/run/docker.sock"

Copy ssh key over and build docker images

for (( i = 1; i < 5; i++ )) ; do ssh joshua$i sudo yum install -y git; done
for (( i = 1; i < 5; i++ )) ; do scp ~/.ssh/joshua joshua$i:.ssh/; done
for (( i = 1; i < 5; i++ )) ; do ssh joshua$i "echo StrictHostKeyChecking no >> .ssh/config && chmod 600 .ssh/config" ; done

for (( i = 1; i < 5; i++ )) ; do scp contrib/build_docker.sh joshua$i:.; done
for (( i = 1; i < 5; i++ )) ; do ssh joshua$i ./build_docker.sh joshua logs; done

Copy cluster file and start agents

for (( i = 1; i < 5; i++ )) ; do scp fdb_aws_mixshareddev.cluster joshua$i:fdb.cluster; done
for (( i = 1; i < 5; i++ )) ; do ssh joshua$i "nohup docker run --rm -v /home/centos:/opt/joshua foundationdb/joshua-agent:latest > start_agents.log 2>&1 &"; done


jzhou77 commented May 13, 2020

fdb> status

Using cluster file `fdb.cluster'.

Could not communicate with all of the coordination servers.
  The database will remain operational as long as we
  can connect to a quorum of servers, however the fault
  tolerance of the system is reduced as long as the
  servers remain disconnected.

  10.0.1.73:4500  (reachable)
  10.0.1.137:4500  (reachable)
  10.0.1.141:4500  (unreachable)

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3

python3 -m venv venv
source venv/bin/activate
pip install joshua/ # remove childsubreaper from setup.py
pip install python-dateutil
export FDB_VERSION=$(curl -L https://www.foundationdb.org/downloads/version.txt)
curl -L https://www.foundationdb.org/downloads/${FDB_VERSION}/linux/libfdb_c_${FDB_VERSION}.so -o venv/lib64/libfdb_c.so
export LD_LIBRARY_PATH=/home/jingyu_zhou/venv/lib


jzhou77 commented May 16, 2020

Bug: file size assertion failure in BackupContainer caused by

  • not filtering out mutations before the true-up version
  • truing up progress by looking back multiple epochs

-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ParallelRestoreNewBackupCorrectnessMultiCycles.txt -s 194031900 -b on

fileName = "plogs/0000/0000/log,156590915,264226555,53feec9b33d3ea5a72b12ffc69d05909,0-of-3,1048576",
fileSize = 1233830,

fileName = "plogs/0000/0000/log,257838691,264168498,04ba01906f379a38892bbed802fdf1d4,0-of-3,1048576",
fileSize = 1343735,

77.116473" Type="BackupWorkerStart" Machine="2.0.1.0:3" ID="cd3c6a17d3f598e9" Tag="-2:0" TotalTags="3" StartVersion="156590915" EndVersion="-1" LogEpoch="6" BackupEpoch="6" Roles="BK,RK,SS"
92.371963" Type="BackupWorkerDisplaced" Machine="2.0.1.0:3" ID="cd3c6a17d3f598e9" RecoveryCount="6" SavedVersion="156590914" BackupWorkers="anti: 0 replication: 2 loc
95.633401" Type="CloseMutationFile" Machine="2.0.1.0:3" ID="cd3c6a17d3f598e9" FileSize="1343735" TagId="0" File="plogs/0000/0000/log,257838691,264168498,04ba01906f379a38892bbed802fdf1d4,0-of-3,1048576" Roles="BK,RK,SS"
97.116473" Type="BackupWorkerMetrics" Machine="2.0.1.0:3" ID="cd3c6a17d3f598e9" Elapsed="5" SavedVersion="156590914" MinKnownCommittedVersion="264179213" MsgQ="25250" BufferedBytes="8905656" Roles="BK,RK,SS" TrackLatestType="Original"
111.687895" Type="BackupWorkerSavedProgress" Machine="2.0.1.0:3" ID="cd3c6a17d3f598e9" Tag="-2:0" Version="264168497" MsgQ="684" Roles="BK,LR,SS,TL"
111.687895" Type="BackupWorkerTerminated" Machine="2.0.1.0:3" ID="cd3c6a17d3f598e9" Error="worker_removed" ErrorDescription="Normal worker shut down" ErrorCode="1202"

109.341644" Type="BackupWorkerStart" Machine="2.0.2.0:2" ID="29c6f2f74f13216f" Tag="-2:0" TotalTags="3" StartVersion="156590915" EndVersion="264226554" LogEpoch="10" BackupEpoch="6" Roles="BK,DD,RK,SS,TL"

111.254429" Type="OpenMutationFile" Machine="2.0.2.0:2" ID="29c6f2f74f13216f" BackupID="4d6aff6a48d23c44" TagId="0" File="plogs/0000/0000/log,156590915,264226555,53feec9b33d3ea5a72b12ffc69d05909,0-of-3,1048576" Roles="BK,DD,RK,SS,TL"


jzhou77 commented May 21, 2020

Fixed bug: mutations for an old epoch can be cleared when stopped pulling

Seed: -r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ParallelRestoreNewBackupCorrectnessAtomicOp.txt -s 626279072 -b on
Commit: 3bf38c1ac

cmr | grep -E 'Backup(Version|ContainerDescribe)|BackupWorker(Save|FinishPull|Start|Done|Dis|Ter|Wait|Noop|Pop|Log|Set|Metric|True|Memory)|BackupRecruitment|NewEpochStartVersion|MutationFile|TargetVersion|backup_lock_bytes|ConsistencyCheck|BARW|BAFRW|Debug' | s | less

  cmr | grep -E 'ProxyCommit|"BackupWorkerDebug"' | s | awk '{ if ($1 < 177) { print $0; } } ' | grep ProxyCommit | grep log0000001f | sed -e 's/.*Mutation="\(.*\)" Version="\([0-9]*\)".*/\2 \1/' > proxy.txt
  cmr | grep -E 'ProxyCommit|"BackupWorkerDebug"' | s | awk '{ if ($1 < 177) { print $0; } } ' | grep Debug | grep log0000001f | sed -e 's/.*Version="\([0-9\.]*\)" Mutation="\(.*\)" KCV.*/\1 \2/' > bw.txt

176.599304" Type="BAFRW_Restore" Machine="3.4.3.5:1" ID="4bc75b1fda06e16d" LastBackupContainer="file://simfdb/backups/backup-1969-12-31-16-01-31.785083" MinRestorableVersion="275882439" MaxRestorableVersion="407817252" ContiguousLogEnd="407817253" TargetVersion="-1" Roles="TS"
176.643945" Type="FastRestoreSubmitRestoreRequest" Machine="3.4.3.5:1" ID="0000000000000000" BackupDesc="URL: file://simfdb/backups/backup-1969-12-31-16-01-31.785083\x0aRestorable: true\x0aPartitioned logs: true\x0aSnapshot: startVersion=275799911 (1969/12/31.16:01:38-0800) endVersion=275882439 (1969/12/31.16:01:38-0800) totalBytes=6570746 restorable=true expiredPct=0.00\x0aSnapshotBytes: 6570746\x0aMinLogBeginVersion: 167875689 (1969/12/31.16:01:22-0800)\x0aContiguousLogEndVersion: 407817253 (1969/12/31.16:02:18-0800)\x0aMaxLogEndVersion: 407874120 (1969/12/31.16:02:18-0800)..." TargetVersion="407817252" Roles="TS"

104.396907 ProxyCommitTo ID=0ffa5ff8d33b3577 To="0:0,0:1,1:1" Mutation="code: SetValue param1: log0000001f0000000300002766 param2: 6\xdes\x00\x00\x00\x00\x00" Version="281812362"
104.396907 ProxyCommitTo ID=0ffa5ff8d33b3577 To="0:1,1:1" Mutation="code: ByteMax param1: ops0000001f000000ea param2: 6\xdes\x00\x00\x00\x00\x00" Version="281812362"
107.027467 BackupWorkerDebug ID=ea8733526f22904f Version="281812362.13" Mutation="code: SetValue param1: debug0000001f0000000300002766 param2: ops0000001f000000ea" KCV="283600028" SavedVersion="167868147"
107.060972 BackupWorkerDebug ID=99ac0e2b463f6215 Version="281812362.15" Mutation="code: ByteMax param1: ops0000001f000000ea param2: 6\xdes\x00\x00\x00\x00\x00" KCV="283600028" SavedVersion="167868147"

Missing mutation: 281812362.14 Tag -2:5
104.396907 ProxyCommitTo To="0:0,0:1,1:1" Mutation="code: SetValue param1: log0000001f0000000300002766 param2: 6\xdes\x00\x00\x00\x00\x00" Version="281812362"

5-of-6 281517599 283628854 87fbc2c6477899c1df6c1c77dd76c2b9 0

121.211703 BackupWorkerStart ID=02e39df54ff302f9 Tag="-2:5" TotalTags="6" StartVersion="167868148" EndVersion="283628853" LogEpoch="9" BackupEpoch="7"


jzhou77 commented May 28, 2020

Joshua binding tests on Ruby

docker run -it -v /home/centos/bindingtester/:/opt/binding foundationdb/joshua-agent:0.0.6 scl enable rh-python36 rh-ruby24 -- bash
for ((i = 0; i < 10; i++ )) ; do ./joshua_test ; done

Ruby 2.6 has an error related to tuple encoding, failing many binding tests.


jzhou77 commented Jun 25, 2020

Bootstrapping in FDB

FoundationDB has no external dependencies on other services. As a result, bootstrapping the system is performed in several steps.

  • Coordinators start and elect the cluster controller. If the database already exists, the coordinators store the configuration of the previous transaction system.

  • The cluster controller recruits the master, which reads the configuration of the old transaction system, spawns the new transaction system, and waits until the new transaction system finishes recovery and becomes ready. The master then writes the configuration of the new transaction system to the coordinators.

  • The system is started. The transaction state store (metadata about where each piece of transaction data is written) is kept in Proxy memory, persisted on TLogs, and read during recovery.

Coordinator states -> transaction state -> whole database
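
A rough sketch of this dependency order (toy Python, not FDB's actual recovery code; all names here are illustrative):

# Toy sketch of the bootstrap order: each step depends only on the previous one.
class Coordinators:
    def __init__(self):
        self.stored_config = None  # previous transaction system config, if any

class ClusterController:
    def __init__(self, coordinators):
        self.coordinators = coordinators

    def recruit_master(self):
        return Master(self.coordinators)

class Master:
    def __init__(self, coordinators):
        self.coordinators = coordinators

    def recover(self):
        old = self.coordinators.stored_config or {"generation": 0}
        # ... recruit new tlogs/proxies/resolvers, replay the transaction
        # state store from the old generation's TLogs, wait until ready ...
        new = {"generation": old["generation"] + 1}
        self.coordinators.stored_config = new  # persist to coordinators
        return new

coords = Coordinators()
cc = ClusterController(coords)        # step 1: coordinators elect the CC
print(cc.recruit_master().recover())  # steps 2-3 -> {'generation': 1}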


jzhou77 commented Jul 4, 2020

Proxy1 asks commit version from Master
Master returns V1

Proxy2 asks commit version from Master
Master returns V2 (V2 > V1)

Proxy1 commits V1
Proxy1 tells Master V1 committed

Proxy2 commits V2

Client asks any Proxy for a read version, which is forwarded to the Master
Master replies V1 (the highest committed version the Master knows)

Proxy2 tells Master V2 committed
Client reads V1 data, which is stale, since V2 is already committed.
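
A minimal Python sketch of this interleaving, with toy names standing in for the real roles (real proxies batch GRV requests; this only models the version bookkeeping):

# The master hands out increasing commit versions and remembers the
# highest commit version that has been *reported* back to it.
class Master:
    def __init__(self):
        self.next_version = 0
        self.highest_committed = 0

    def get_commit_version(self):
        self.next_version += 1
        return self.next_version

    def report_committed(self, v):
        self.highest_committed = max(self.highest_committed, v)

    def get_read_version(self):
        return self.highest_committed

master = Master()
v1 = master.get_commit_version()   # Proxy1 gets V1
v2 = master.get_commit_version()   # Proxy2 gets V2 > V1

master.report_committed(v1)        # Proxy1 commits and reports V1
# Proxy2 has committed V2 here but has not reported it yet ...
read_version = master.get_read_version()
assert read_version == v1          # client reads at V1: stale, V2 is durable
master.report_committed(v2)        # Proxy2's report arrives too late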


jzhou77 commented Jul 13, 2020

Unseed: 2620

e78cc9ee4 good
d61206e good

-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ParallelRestoreOldBackupCorrectnessMultiCycles.txt -s 18633417 -b on (DD crash)
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/slow/ParallelRestoreNewBackupCorrectnessAtomicOp.txt -s 5709316 -b on
226.664578 BackupContainerDescribe2 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="309544630" LogEndVersion="655437004" LogType="0"
(gdb) p (int64_t) desc.minRestorableVersion.value
$4 = 421331220
(gdb) p (int64_t) desc.contiguousLogEnd.value
$5 = 655437004
ScanBegin="655437004" ScanEnd="9223372036854775807" Plogs="0" Logs="0" MetaLogTypePresent="1"
216.345288 BackupContainerDescribe1 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1"
216.345288 BackupContainerDescribe2 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1" LogType="-1"
216.345288 BackupContainerMetadataInvalid ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="-1" LogEndVersion="-1"
216.345288 BackupContainer ID=0000000000000000 ScanBegin="0" ScanEnd="9223372036854775807" Plogs="15" Logs="0" MetaLogTypePresent="0"
220.225274 BackupContainerDescribe1 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1"
220.231089 BackupContainerDescribe2 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="309544630" LogEndVersion="655437004" LogType="-1"
220.231089 BackupContainer ID=0000000000000000 ScanBegin="655437004" ScanEnd="9223372036854775807" Plogs="0" Logs="0" MetaLogTypePresent="0"
220.617254 BARW_LastBackupContainer ID=30ba527ca5055c63 BackupTag="default" LastBackupContainer="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LastBackupUID="46f29ad1a20fb0b4" WaitStatus="3" Restorable="1"
220.617254 BARW_DoBackupAbortBackup2 ID=30ba527ca5055c63 Tag="default" WaitStatus="3" LastBackupContainer="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" Restorable="1"
222.392848 BARW_LastBackupContainer ID=2aa6eb7a8ed62c67 BackupTag="default" LastBackupContainer="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LastBackupUID="46f29ad1a20fb0b4" WaitStatus="3" Restorable="1"
226.657826 BAFRW_Restore ID=59e0a5b618240b7a LastBackupContainer="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" RestoreAfter="60" BackupTag="default"
226.657826 BackupContainerDescribe1 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1"
226.664578 BackupContainerDescribe2 ID=0000000000000000 URL="file://simfdb/backups/backup-1969-12-31-16-02-23.400482" LogStartVersionOverride="-1" ExpiredEndVersion="-1" UnreliableEndVersion="-1" LogBeginVersion="309544630" LogEndVersion="655437004" LogType="0"
226.664578 BackupContainer ID=0000000000000000 ScanBegin="655437004" ScanEnd="9223372036854775807" Plogs="0" Logs="0" MetaLogTypePresent="1"

describeBackup race: assertion failure self->usePartitionedLogs == desc.partitioned

There are two describeBackup calls in the BackupAndParallelRestoreCorrectness workload, one per doBackup. The two describeBackup calls happen in this sequence:

  1. The first describeBackup updates the "log_begin_version" and "log_end_version" meta files.
  2. The second describeBackup starts, reads those two meta files, and finds no log files after the end version. Since there is no log type meta file yet, it sets the log type to non-partitioned.
  3. The first describeBackup updates the "mutation_log_type" meta file, which somehow wasn't successful.
  4. The restore process calls describeBackup, gets the non-partitioned log type, and fails the assertion.
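
A toy replay of that interleaving over a shared meta-file store (hypothetical structure, just to make the four steps concrete):

meta = {}  # backup container metadata files (toy model)

# 1. The first describeBackup writes the begin/end version meta files.
meta["log_begin_version"] = 309544630
meta["log_end_version"] = 655437004

# 2. The second describeBackup reads them, scans past the end version,
#    finds no new log files, and sees no "mutation_log_type" yet, so it
#    concludes the logs are non-partitioned.
second_answer = meta.get("mutation_log_type", "non-partitioned")

# 3. The first describeBackup only now records the real log type (in the
#    failing run this write somehow never landed).
meta["mutation_log_type"] = "partitioned"

# 4. Restore uses the stale answer from step 2 and trips
#    ASSERT(self->usePartitionedLogs == desc.partitioned).
print(second_answer)  # "non-partitioned" -> assertion failure in restore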

c94f46440 (bad) 20200712-172903-jingyu_zhou-0a009367c3b5c866 2/1636 Mismatch error
3b09308 bad


jzhou77 commented Jul 21, 2020

Delete Joshua offending key:

fdb> getrange \x15\x34\x01\x32\x30\x32\x30\x30\x37\x31\x36\x2d\x31\x35\x33 \x15\x34\x01\x32\x30\x32\x30\x30\x37\x31\x36\x2d\x31\x35\x34 5

Range limited to 5 keys
`\x154\x0120200716-153805-xiaoge_su-450c1bc147ddf978\x00\x01count\x00\x01ended\x00' is `\x03\x00\x00\x00\x00\x00\x00\x00'
fdb> clear \x154\x0120200716-153805-xiaoge_su-450c1bc147ddf978\x00\x01count\x00\x01ended\x00
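
The same cleanup can also be done from the Python bindings; a minimal sketch, assuming the default cluster file and API version 630 (the key is the one from the fdbcli session above):

import fdb

fdb.api_version(630)
db = fdb.open()  # uses the default cluster file

# The offending key from the fdbcli session above.
key = b'\x154\x0120200716-153805-xiaoge_su-450c1bc147ddf978\x00\x01count\x00\x01ended\x00'
del db[key]  # runs as a single transaction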


jzhou77 commented Sep 30, 2020

New backup: perf test

grep CodeCoverage trace.0.0.0.0.0.1600114388.Cbuqzt.0.1.xml | sed -e 's/Event.CodeCoverage/CodeCoverage/' -e 's|" Condition=.|"/>|' -e 's/" Machine.File=/ File=/' | sed -e 's/(CodeCoverage )(.) (Covered="0").*/\1\3 \2/>/' > rerun.coverage


jzhou77 commented Sep 30, 2020

BinaryReader::arenaRead() allocates memory in the arena, which has the same lifetime as the BinaryReader. So deserializing via BinaryReader::fromStringRef is potentially dangerous?


jzhou77 commented Sep 30, 2020

Create a new tarball by deleting a file from the old one:
tar cvfz new.tar.gz --exclude='.*IncrementalBackup.toml' @packages/correctness.tar.gz


jzhou77 commented Oct 6, 2020

To kill all the stopped jobs

kill -9 `jobs -ps`


jzhou77 commented Oct 9, 2020

Joshua Debugging Tips

$ LD_LIBRARY_PATH=/share/devcache/joshua/libs PYTHONPATH=/share/devcache/joshua/modules/fdb-joshua:/share/devcache/joshua/modules/site-packages python3 -i -m joshua.joshua --cluster-file /share/devcache/joshua/config/fdb.cluster
...
>>> list_active_ensembles(stopped=False)
>>> list_active_ensembles(stopped=True)

>>> joshua_model.db
<fdb.impl.Database object at 0x7f3eb2b9de48>
>>> joshua_model.dir_top
DirectorySubspace(path=('joshua',), prefix=b'\x154')
>>> joshua_model.dir_ensembles
DirectorySubspace(path=('joshua', 'ensembles'), prefix=b'\x15%')

>>> joshua_model.dir_all_ensembles['20201008-135157-XinDong-9848dd79e3aac22d']['properties']
Subspace(rawPrefix=b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00')
>>> joshua_model.dir_all_ensembles['20201008-135157-XinDong-9848dd79e3aac22d']['properties']['submitted']
Subspace(rawPrefix=b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02submitted\x00')
>>> tr = joshua_model.db.create_transaction()
>>> tr.get(b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02submitted\x00')
b'\x0220201008-135157\x00'

>>> tr.reset()
>>> kvs = tr.get_range(b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00', b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x01')
>>> for k, v in kvs:
...     print(k, v)
... 
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02compressed\x00' b"'"
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02data_size\x00' b'\x18\x01Bx\\'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02fail_fast\x00' b'\x15\n'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02max_runs\x00' b'\x17\x01\x86\xa0'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02priority\x00' b'\x15d'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02runtime\x00' b'\x022:57:10\x00'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02sanity\x00' b'&'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02stopped\x00' b'\x0220201008-164907\x00'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02submitted\x00' b'\x0220201008-135157\x00'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02timeout\x00' b'\x16\x15\x18'
b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\x02properties\x00\x02username\x00' b'\x02XinDong\x00'

>>> joshua_model.dir_ensemble_results_fail
DirectorySubspace(path=('joshua', 'ensembles', 'results', 'fail'), prefix=b'\x15\x08')
>>> joshua_model.dir_ensemble_results_fail[b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00']
Subspace(rawPrefix=b'\x15\x08\x01\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\xff\x00')

tr.reset()
kvs = tr.get_range(b'\x15\x08\x0220201007-155332-mengxurelease63-ff905403cf4fe1f3\x00', b'\x15\x08\x0220201007-155332-mengxurelease63-ff905403cf4fe1f3\x01', 5)
for k, v in kvs:
    print(k, v)

>>> joshua_model.dir_ensemble_results_pass[b'\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00']
Subspace(rawPrefix=b'\x15\x18\x01\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\xff\x00')

tr.reset()
kvs = tr.get_range(b'\x15\x18\x01\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\xff\x00', b'\x15\x18\x01\x157\x0220201008-135157-XinDong-9848dd79e3aac22d\x00\xff\x01', 5)
for k, v in kvs:
    print(k, v)

>>> time_start = int(time.time() - 60*60*24*7)
>>> time_end = timestamp_of(None)
>>> failures = joshua_model.get_agent_failures(time_start, time_end)
>>> failures[0]
(('2020-Oct-03 (Sat) 11:25:01 AM', 'fdb-awsshared9'), b'Traceback (most recent call last):\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/joshua/joshua_agent.py", line 515, in agent\n    retcode = run_ensemble(chosen_ensemble, save_on, work_dir=work_dir)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/joshua/joshua_agent.py", line 316, in run_ensemble\n    output, _ = process.communicate(timeout=1)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 863, in communicate\n    stdout, stderr = self._communicate(input, endtime, timeout)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/subprocess.py", line 1525, in _communicate\n    selector.register(self.stdout, selectors.EVENT_READ)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/selectors.py", line 351, in register\n    key = super().register(fileobj, events, data)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/selectors.py", line 237, in register\n    key = SelectorKey(fileobj, self._fileobj_lookup(fileobj), events, data)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/selectors.py", line 224, in _fileobj_lookup\n    return _fileobj_to_fd(fileobj)\n  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/selectors.py", line 39, in _fileobj_to_fd\n    "{!r}".format(fileobj)) from None\nValueError: Invalid file object: <_io.BufferedReader name=22>\n')


jzhou77 commented Oct 13, 2020

debug split txn session (fixed)

175.947105 MutationTracking ID=0000000000000000 At="ApiCorrectnessSet" Version="460019398" MutationType="DebugKey" Key="0000000003xf" Value="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..."
178.502753 MutationTracking ID=0000000000000000 At="ApiCorrectnessClear" Version="462578090" MutationType="DebugKeyRange" KeyBegin="0000000003lckgkcynyfpqcjnhvrsnhebzqiwihfuopshegxuomzdmbyvdpzjfmhlgtpuqlnlteksvytosnygevkqiqufgyitoaskbafvnmzhtvxzftvhyxhvtstlnhmhi" KeyEnd="0000000003zefixtdqmemndhcegipkinuofnz"
179.197182 MutationTracking ID=0000000000000000 At="ApiCorrectnessSet" Version="463268853" MutationType="DebugKey" Key="0000000003xf" Value="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..."
179.644120 ApiCorrectness_CompareValueMismatch ID=0000000000000000 ReadVer="463644615" ResultSize="100" DifferAt="71" DBResult="0000000003evpgxxgqrahuzizodkbruoarunjolpxbefrkghdklemubejbgcmvmxzombvzpbcehpsow:1548" StoreResult="0000000003xf:773" Backtrace="addr2line -e fdbserver.debug -p -C -f -i 0x2ab880d 0x2ab89f5 0x2ab8f03 0x1dfaf69 0x1dffbb4 0x1dffac8 0x1dfefb2 0x2660ff8 0x265de35 0x265afc8 0x265acfe 0x2659008 0x265105c 0x2658778 0x26583aa 0x2655858 0x265516a 0x1f532c8 0xf69aa8 0xf69886 0x2917b11 0x29178d1 0xfff878 0x298a097 0x2989ef6 0x2989b91 0x2989f42 0x298a5cc 0xfff878 0x2a6a546 0x2a5f2ed 0x297bb94 0x143f298 0x7ffff7105555"

179.563105 ProxyPush ID=b0aadb712ec2b453 PrevVersion="463621208" Version="463640486" TransactionsSubmitted="1" TransactionsCommitted="1" TxsPopTo="458596165" MutationsInFirstTxn="530"
179.563105 MutationTracking ID=0000000000000000 At="ProxyCommit" Version="463640486" MutationType="SetValue" Key="0000000002aa"

179.563864 ProxyPush ID=476437716c7fec27 PrevVersion="463621208" Version="463640486" TransactionsSubmitted="1" TransactionsCommitted="1" TxsPopTo="1" MutationsInFirstTxn="522"
179.564233 ProxyPush ID=743c411820e0db95 PrevVersion="463621208" Version="463640486" TransactionsSubmitted="1" TransactionsCommitted="1" TxsPopTo="1" MutationsInFirstTxn="527"

179.564916 ProxyPush ID=b0aadb712ec2b453 PrevVersion="463640486" Version="463644615" TransactionsSubmitted="1" TransactionsCommitted="1" TxsPopTo="458596165" MutationsInFirstTxn="1"

179.572839 MutationTracking ID=0000000000000000 At="ApiCorrectnessSet" Version="463640486" MutationType="DebugKey" Key="0000000002aa" ...

185.178093 MutationTracking ID=0000000000000000 At="makeVersionDurable" Version="463644615" MutationType="SetValue" Key="0000000002aa" This is wrong version!

==========
179.566418 TLogCommit ID=7930199bf939f469 Version="463640486"
179.566418 MutationTracking ID=0000000000000000 At="TLogCommitMessages" Version="463640486" MutationType="SetValue" Key="0000000002aa" Value="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." MessageTags="-2:1 0:0" UID="f4669a654561e330" LogId="7930199bf939f469"
179.566418 TLogCommit ID=7930199bf939f469 Version="463644615"
179.566418 MutationTracking ID=0000000000000000 At="TLogCommitMessages" Version="463644615" MutationType="SetValue" Key="0000000002a" Value="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." MessageTags="-2:3 0:0" UID="f4669a654561e330" LogId="7930199bf939f469"
179.566418 MutationTracking ID=0000000000000000 At="TLogCommitMessages" Version="463644615" MutationType="SetValue" Key="0000000002aa" Value="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..." MessageTags="-2:1 0:0" UID="f4669a654561e330" LogId="7930199bf939f469"


jzhou77 commented Oct 13, 2020

split txn failures

xt ../correctness/20201012-102741-jingyu_zhou-e82f65ce2a1c6f07.xml
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/LargeApiCorrectness.toml -s 106241225 -b off
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/SwizzledLargeApiCorrectness.toml -s 361779533 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/LargeApiCorrectness.toml -s 286910263 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/fast/WriteDuringReadClean.toml -s 815765234 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/SwizzledLargeApiCorrectness.toml -s 96958445 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/LargeApiCorrectnessStatus.toml -s 676131845 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/SwizzledLargeApiCorrectness.toml -s 442127362 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/SwizzledLargeApiCorrectness.toml -s 174058913 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/rare/SwizzledLargeApiCorrectness.toml -s 463491179 -b off
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_5.1.7/DrUpgradeRestart-1.txt -s 239187238 -b on
-r simulation --crash --logsize 1024MB -f ./foundationdb/tests/restarting/from_5.1.7/DrUpgradeRestart-2.txt -s 239187239 -b on


jzhou77 commented Oct 16, 2020

Parallel hashmap test

$ ./build/fdb/foundationdb/linux/bin/fdbserver -r skiplisttest
Skip list test
miniConflictSetTest complete
Test data generated: 500 batches, 5000/batch
Running
New conflict set: 3.143 sec
0.398 Mtransactions/sec
1.591 Mkeys/sec
Detect only: 2.848 sec
0.439 Mtransactions/sec
1.756 Mkeys/sec
Skiplist only: 2.097 sec
0.596 Mtransactions/sec
2.385 Mkeys/sec
Performance counters:
Build: 0.102
Add: 0.169
Detect: 2.85
D.Sort: 0.572
D.Combine: 0.0201
D.CheckRead: 1.14
D.CheckIntraBatch: 0.0188
D.MergeWrite: 0.955
D.RemoveBefore: 0.135
429334 entries in version history

====
Baseline

$ ./build/fdb/foundationdb/linux/bin/fdbserver -r skiplisttest
Skip list test
miniConflictSetTest complete
Test data generated: 500 batches, 5000/batch
Running
New conflict set: 3.136 sec
0.399 Mtransactions/sec
1.594 Mkeys/sec
Detect only: 2.842 sec
0.440 Mtransactions/sec
1.759 Mkeys/sec
Skiplist only: 2.083 sec
0.600 Mtransactions/sec
2.400 Mkeys/sec
Performance counters:
Build: 0.102
Add: 0.167
Detect: 2.84
D.Sort: 0.581
D.Combine: 0.0213
D.CheckRead: 1.13
D.CheckIntraBatch: 0.0179
D.MergeWrite: 0.955
D.RemoveBefore: 0.135
429967 entries in version history


jzhou77 commented Oct 18, 2020

Race condition between startBackup and abortBackup in PR #3922

Source DB: StartFullBackupTaskFunc
1st Txn

srcTr->set(versionKey, beginVersionKey);
task->params[BackupAgentBase::destUid] = destUidValue;

The 2nd Txn writes the dest DB, conflicts with the abort's 1st transaction, and doesn't succeed. So destUid is not set.

tr->set(config.pack(BackupAgentBase::destUid), task->params[BackupAgentBase::destUid]);

Abort Src DB

The 1st Tr reads the dest DB for destUid, clears the range, and prevents future tasks from executing, including the 2nd transaction of StartFullBackupTaskFunc.

state Future<UID> destUidFuture = backupAgent->getDestUid(tr, logUid);
...
tr->clear(backupAgent->config.get(logUidValue).range());

So destUid was never set by the 2nd transaction of StartFullBackupTaskFunc, and this txn got null for it. As a result, the 3rd Txn reads the source DB for latestVersionKey, which depends on destUid, and can't find it.


jzhou77 commented Oct 24, 2020

DR/Backup Data Structures

A tag maps to a UID, and the UID maps to configuration kv pairs.

"abort tag X" ==> "find whatever UID tag X points to, and then abort it (tag X with uid Z)."
To get status for a tag, you look up its UID, and from that you get whether or not it's running, its dest UID, and so on.
When we abort, we don't clear the UIDs. So even after being aborted, tag X will still map to the same UID until a new DR is started with tag X.
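
A toy model of that two-level layout (hypothetical keys and fields; the real data lives in the backup/DR config subspaces):

# Toy model of the DR/backup metadata: tag -> UID -> config kv pairs.
tag_to_uid = {"tag_x": "uid_z"}
uid_to_config = {
    "uid_z": {"running": True, "dest_uid": "dest_123"},
}

def abort(tag):
    uid = tag_to_uid[tag]                  # tag keeps pointing at this UID
    uid_to_config[uid]["running"] = False  # abort does NOT clear the mapping

def start_dr(tag):
    uid = "uid_new"                        # a fresh UID replaces the old one
    tag_to_uid[tag] = uid
    uid_to_config[uid] = {"running": True, "dest_uid": "dest_123"}

abort("tag_x")
print(tag_to_uid["tag_x"])  # still uid_z until start_dr("tag_x") runs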


jzhou77 commented Nov 4, 2020

TagPartitionedLogSystem notes

Special tags are always indexed everywhere, i.e., on primary, satellite, and remote tlogs.

  • Spill by value: write the tag's data into the SQLite B-tree; optimized for reads, but the same mutation is written multiple times, once per tag. Good for StorageServers.
  • Spill by reference: write a pointer for each tag, while the version's data is stored once in the DiskQueue. When reading back, the reader needs to find its tag in that version's data. Good for tlogs.

For tlogs, it's better to make writes cheap.
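
A toy contrast of the two strategies (illustrative Python structures, not the actual tlog code):

# Spill by value: every tag gets its own copy of the data (read-optimized,
# but a batch tagged for N consumers is written N times).
# Spill by reference: each tag stores only a pointer into the shared
# DiskQueue entry; the reader scans that entry for its tag at peek time.
btree_by_value = {}  # tag -> list of data copies
btree_by_ref = {}    # tag -> list of indices into disk_queue
disk_queue = []      # list of (version, batch containing all tags' data)

def spill(version, batch, tags):
    for t in tags:                                      # by value: N copies
        btree_by_value.setdefault(t, []).append(batch)
    disk_queue.append((version, batch))                 # by reference: 1 copy
    idx = len(disk_queue) - 1
    for t in tags:
        btree_by_ref.setdefault(t, []).append(idx)      # N small pointers

spill(100, b"mutations...", tags=["ss1", "ss2", "logrouter"])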

exclude failed allows a storage server to be marked as failed, so the tlog can pop that tag, unblocking the popping of old mutations in the disk queue.

The SQLite file stores the index into the disk queue.

If the DC lag is very long, we have to drop the remote region to avoid the tlog disks filling up.

Log Router

Recruited for each generation.
Keeps 5s of mutations in memory, to keep remote tlogs within 5s of each other.
Constructs tags for remote storage servers with locality -3, i.e., tagLocalityRemoteLog.

Peeks aggressively from the primary region, so it could include mutations that may later be rolled back.

Primary tlogs see every version; that is not the case for remote tlogs/log routers. What to do if an LR didn't see any version in 5s? Recovery has a 90s gap. How to tell the difference between these two?

waitForVersion: sees some version in the previous epoch, then a version bump of 100M, then the current epoch. WaitForVersionMS is the time the LR waits for the remote tlog to pop data.

Even if no messages are found, peekCursor helps us advance the version.


jzhou77 commented Nov 8, 2020

Load Balance

  • SS load balance: sending multiple read requests to different SSes is allowed. Latency is accurate.

    • Keep track of in-flight/outstanding requests to different SSes. The idea is to keep an equal number of outstanding requests; allow up to 5% backup requests. The implementation is in the QueueModel. If we know an SS is lagging/failed (failedUntil), skip that SS for a while. (See the sketch after this list.)

    • penalty is the SS's signal of stress, telling the client to avoid it. secondDelay is the waiting time before sending out the backup request.

    • alwaysFresh means SS.

    • After a cluster bounce, the client can find all SSes failed. The client then needs to ask the proxies for the SS interfaces; this is triggered by throwing the all_alternatives_failed() error. There is a backoff delay in case all SSes have really failed.

    • secondDelay could seemingly be calculated when triggered -- a possible improvement!

  • QueueModel

    • clean means a real answer
  • Proxy load balance: can only send a request to one proxy. Proxy latency is not accurate, because of batching effects.

    • 6.1 clients connect to all proxies, causing too many client connections per proxy, which is especially bad for TLS connections.
    • 6.2 clients connect to 5 proxies, but load is not balanced. Fixed in 6.3.
    • ModelInterface: updateRecent() for GRV. Considers both CPU and GRV budgets, marshaled into an integer's high and low bits.
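
A minimal sketch of the QueueModel idea described above (illustrative names, not the real client code):

import time

# Send each read to the replica with the fewest outstanding requests,
# skipping replicas marked failed until a deadline (failedUntil).
class ReplicaState:
    def __init__(self):
        self.outstanding = 0     # in-flight requests to this SS
        self.failed_until = 0.0  # skip this SS until this epoch time

def pick_replica(replicas):
    now = time.time()
    alive = {n: s for n, s in replicas.items() if s.failed_until <= now}
    if not alive:
        # Everything looks down, e.g. after a cluster bounce: refresh the
        # SS interfaces from the proxies (all_alternatives_failed).
        raise RuntimeError("all_alternatives_failed")
    return min(alive, key=lambda n: alive[n].outstanding)

replicas = {"ss1": ReplicaState(), "ss2": ReplicaState(), "ss3": ReplicaState()}
replicas["ss3"].failed_until = time.time() + 1.0  # lagging: failedUntil
name = pick_replica(replicas)
replicas[name].outstanding += 1  # decremented when the reply arrives
print(name)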

5s MVCC Window

  • Memory limit: resolver and storage server (SS)
    • SS must keep the 5s window. Durability to disk is lower priority than serving reads, so an SS can hold more than 5s of data in memory.
    • SS may need to roll back versions, and can't roll back past its durable version. That's why the LogRouter keeps 5s of data in memory: to keep remote SSes within 5s of each other, so in a region failover remote SSes won't roll back more than 5s.
    • Proxy's view and SS's view should be consistent. Proxy knows an SS can roll back 5M versions.
    • The recovery transaction tells SSes: the master picks the recovery version and sets lastEpochEndKey, which lets all SSes know the rollback version. This could be the place to change the MVCC window.


jzhou77 commented Nov 11, 2020

DD does a 2-phase transfer of ownership from an old team to a new team, e.g., due to an SS failure:

  1. Write mutations to both the old and new teams. The new SSes fetch data from the old SSes (range reads). After they have all the data, they signal DD complete.
  2. DD reassigns ownership to the new team and tells the tlogs to remove the tag of the failed SS.

If the failed SS comes back hours after this process starts, it is way behind. It's hard for that SS to catch up, because its data is on disk, compared to the other servers' in-memory data. So it's better to keep this SS marked as failed.
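
A toy model of the two-phase handoff (illustrative names only, not the real MoveKeys code):

class Shard:
    def __init__(self, owner):
        self.owner = owner    # team currently owning the shard
        self.writers = owner  # teams receiving new mutations

def move_shard(shard, new_team, remove_tlog_tag):
    old_team = shard.owner
    # Phase 1: mutations go to both teams while the new team range-reads
    # the existing data from the old team; DD waits for "fetch complete".
    shard.writers = old_team + new_team
    # ... new servers fetch, then signal DD ...
    # Phase 2: DD flips ownership and asks the tlogs to drop the failed
    # SS's tag, unblocking pops of old mutations in the disk queue.
    shard.owner = new_team
    shard.writers = new_team
    for ss in set(old_team) - set(new_team):
        remove_tlog_tag(ss)

shard = Shard(owner=["ss1", "ss2", "ss_failed"])
move_shard(shard, ["ss1", "ss2", "ss3"], lambda ss: print("pop tag of", ss))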


jzhou77 commented Nov 11, 2020

HA

Proxy: adds primary SS tags, remote SS tags, and an LR tag.
Primary tlogs: only index primary SS tags and an LR tag.
Satellite tlogs: only index the LR tag.

LR adds an ephemeral tag. An LR pops only when all remote tlogs have popped the ephemeral tags.

Remote tlogs: pull data from all LRs and merge them. This means all LRs must be synced; otherwise, remote SSes have to wait for the slowest LR.

If FDB can't recruit enough processes in the primary, it automatically fails over to the remote region. The CC needs to talk to a majority of Coordinators.

Client reads in the remote region can get future_version errors if the remote region lags behind; the load balancing algorithm can direct reads to remote SSes. Reducing the replication degree can lower storage cost, but can worsen hot spots because there are fewer SS replicas.

  • Failures
    • Remote DC failure: TLogs spill for 24 hours; ~16 hours into spilling, we should drop the remote DC.
      • Remote SS assignments remain the same, but primary sharding can change due to hot spots.
    • A remote SS becomes slow: its TLog queues more data -> slows down the LRs -> primary/satellite tlogs queue more data
    • Exercise failover: demo -> production

WAN Latency

  • Commit slow

  • Remote falling behind: a DD problem; because the remote lags, moving data in the primary impacts the primary's ability to move data and load balance

  • Failure monitoring

  • Asymmetric networking

  • Scalability

    • CC tracks all processes in two regions
    • Master tracks remote tlogs: a remote tlog failure potentially doesn't cause a primary recovery; we can just replace the remote tlogs

Region configuration: add regions; add processes; change replication -> DD issues moves -> use one SS to pull from the primary, and the other two copy from the first SS; configure automatic failover

Q: how to do distributed txn?


jzhou77 commented Nov 12, 2020

Data Distribution

  • Background actors
    • BgDDMountainChopper: moves data away from the most utilized server
    • BgDDValleyFiller: moves data to the least utilized server
    • Directly moving data from the most to the least utilized server could cause a skewed data distribution (see the sketch after this list)
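
A toy sketch of the two balancers (illustrative; the real actors work on shard-level stats and team utilization):

import random

# MountainChopper moves load off the most utilized server to a random
# destination; ValleyFiller moves load from a random source onto the
# least utilized server. Pairing most->least directly would funnel
# everything between the same two servers and skew the distribution.
def mountain_chopper(load):
    src = max(load, key=load.get)
    dst = random.choice([s for s in load if s != src])
    load[src] -= 1
    load[dst] += 1

def valley_filler(load):
    dst = min(load, key=load.get)
    src = random.choice([s for s in load if s != dst])
    load[src] -= 1
    load[dst] += 1

load = {"ss1": 10, "ss2": 5, "ss3": 1}
for _ in range(100):
    mountain_chopper(load)
    valley_filler(load)
print(load)  # loads drift toward each other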

Tracker: tracks failures(?), splits shards

Queue: on source servers

  • dataDistributionRelocator: processes the queue head
    • Before a move finishes, the SS's in-flight move stats are added.
    • teamCollection: one per region
    • Across the WAN, pick a random SS to copy the data, which then seeds the other SSes in the team.
  • When an SS finishes fetching, it signals transferComplete so that DD can start the next move, even though the SS still needs time to persist the data.

An SS has a queue for fetching keys. It keeps logical-bytes stats from samples; DD tries to balance logical bytes among SSes.

  • DataDistribution.actor.cpp

  • MoveKeys: mechanism for reassigning shards, implemented via transactions.

    • Move data from src to dst teams

    • checkFetchingState: polls dst SSes about fetching via waitForShardReady.

    • finishMoveKeys: polls dst servers until the move has finished, so it can remove the src. If an SS is in both src and dst, ... Changes the key range map at the end.

    • moveKeyLock: makes sure only one DD is active

    • krmGetRanges: krm means keyRangeMap, a data structure from key range to its owners

    • krmSetRangeCoalescing: applyMetadataMutations sees changes in keyServersPrefix and serverKeysPrefix. When an SS sees the privatized mutation in applyPrivateData, the SS knows its ownership of a key range. AddingShard buffers mutations during the move. fetchComplete. After the fetch completes, the SS needs to wait the 5s MVCC window for the data to become durable.

SS: a fetch uses its own version for fetching, which could be too old to catch up. If fetching at a new version, it needs to wait for the SS to catch up to that version. FetchInjectionInfo. The update loop's fk.send() runs the rest of the fetch loop.


jzhou77 commented Dec 11, 2020

NIC and kernel TCP tuning

  • Increase Intel NIC ring buffer size to absorb the burst traffic patterns - default is 512.
    * "ethtool -G eth0 rx 4096 tx 4096"
  • Disable the flow control for Intel NIC
    * "ethtool -A eth0 rx off tx off"
  • Increase the socket buffer default size from 212992 (i.e., 208K) to 1M
    * "sysctl -w net.core.rmem_default=1048576"
    * "sysctl -w net.core.rmem_max=1048576"
    * "sysctl -w net.core.wmem_default=1048576"
    * "sysctl -w net.core.wmem_max=1048576"


jzhou77 commented May 26, 2021

Anti-quorum is not used: if you let a tlog fall behind the other tlogs, all of the storage servers that read from that tlog will also fall behind, and then clients trying to read from those storage servers will get future_version errors.


jzhou77 commented Sep 1, 2021

Docker on Mac without docker-desktop

brew install docker docker-machine docker-credential-helper docker-compose virtualbox

(make sure ~/.docker/config.json has "credsStore" : "osxkeychain" in it)


jzhou77 commented Sep 21, 2021

documentation build

  1. ninja docpreview starts a web server at a local port: e.g., Serving HTTP on 0.0.0.0 port 14244 (http://0.0.0.0:14244/)
  2. Since this web server is in okteto, we need to do port forwarding:
$ ssh -L 14244:localhost:14244 jzhou-dev.okteto

Or

  $ kubectl get all
  $ kubectl port-forward replicaset.apps/jzhou-dev-6df7457774 14244 14244
  3. Navigate to http://localhost:14244/performance.html.


jzhou77 commented Nov 1, 2021

(1) Install gdbgui using pip install gdbgui.
(2) Forward the port used by gdbserver or gdbgui in the Okteto environment to your local machine: kubectl port-forward pod/vishesh-dev-6d4f39f78-ngs8f 5000:5000


jzhou77 commented May 7, 2022

	TraceEvent("TLogInitCommit", logData->logId).log();
	wait(ioTimeoutError(self->persistentData->commit(), SERVER_KNOBS->TLOG_MAX_CREATE_DURATION));

217.908010 Role ID=f70414b30d93c824 As=TLog Transition=Begin Origination=Recovered OnWorker=3135ceefe717ee33 SharedTLog=853fab151d4a9085 GP,LR,MS,SS,TL
217.908010 TLogRejoining ID=f70414b30d93c824 ClusterController=50cd94d391dd1f0b DbInfoMasterLifeTime=50cd94d391dd1f0b#4 LastMasterLifeTime=0000000000000000#0 GP,LR,MS,SS,TL
217.908010 TLogStart ID=f70414b30d93c824 RecoveryCount=28 GP,LR,MS,SS,TL
217.908010 TLogInitCommit ID=f70414b30d93c824 GP,LR,MS,SS,TL

227.908010 IoTimeoutError ID=0000000000000000 Error=io_timeout ErrorDescription=A disk IO operation failed to complete in a timely manner ErrorCode=1521 Duration=10 BT=addr2line -e fdbserver.debug -p -C -f -i 0x1d9102b 0x249cfbb 0x249c324 0x2472a61 0x2479a69 0x24c2cac 0x2631811 0x261bb4d 0x13b1f2c 0x360e31b 0x360df73 0x11c6198 0x36dfda6 0x36dfc18 0x1b9d8c5 0x7f719aff6555 Backtrace=addr2line -e fdbserver.debug -p -C -f -i 0x3823549 0x3823801 0x381eb04 0x1d710d3 0x1d70d57 0x387bfb8 0x387bdde 0x11c6198 0x36dfda6 0x36dfc18 0x1b9d8c5 0x7f719aff6555 GP,LR,MS,SS,TL

227.908010 TLogError ID=853fab151d4a9085 Error=io_timeout ErrorDescription=A disk IO operation failed to complete in a timely manner ErrorCode=1521 GP,LR,MS,SS,TL
227.908010 Role ID=f70414b30d93c824 Transition=End As=TLog Reason=Error GP,LR,MS,SS,TL
227.908010 RoleRemove ID=0000000000000000 Address=2.0.1.2:1 Role=TLog NumRoles=7 Value=1 Result=Decremented Role GP,LR,MS,SS,TL
227.908010 Role ID=7fb6adb922290f7e Transition=End As=TLog Reason=Error GP,LR,MS,SS,TL
227.908010 RoleRemove ID=0000000000000000 Address=2.0.1.2:1 Role=TLog NumRoles=6 Value=0 Result=Removed Role GP,LR,MS,SS,TL
