@anj-s
anj-s / gist:cc3d65e168e51f2affec813a909de5c0
Created September 23, 2021 00:42
SsdParameter - SsdTensorHandle is a property
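from __future__ import annotations  # lets the SsdParameter annotations below resolve lazily

from typing import Tuple

import torch


class SsdParameter(torch.nn.Parameter):
    @staticmethod
    def __new__(
        cls: SsdParameter, data: torch.Tensor, shape: Tuple[int, ...], dtype: torch.dtype, requires_grad: bool = False
    ) -> SsdParameter:
        # No data supplied: wrap an empty tensor.
        if data is None:
            data = torch.tensor([])
            return torch.Tensor._make_subclass(cls, data, requires_grad)
        # Plain torch.Tensor: wrap it directly.
        if type(data).__name__ == 'Tensor':
            return torch.Tensor._make_subclass(cls, data, requires_grad)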
anj-s / gist:3e615541c34dc45a714e4a3aa8ada098
Created August 3, 2021 14:04
SIGSEGV error when running on multiple nodes
SIGSEGV(11), PID: 1831084, Thread 1831084:
frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x12a (0x7f11eaf85a6a in /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x153c0 (0x7f12694e63c0 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #2: <unknown function> + 0x133f4 (0x7f12694e43f4 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #3: <unknown function> + 0x134e8 (0x7f12694e44e8 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: PyThread_acquire_lock_timed + 0xd9 (0x558dde9eaa69 in /private/home/anj/.conda/envs/test_clone/bin/python)
frame #5: <unknown function> + 0x1af68a (0x558ddea5768a in /private/home/anj/.conda/envs/test_clone/bin/python)
frame #6: <unknown function> + 0x1a51c7 (0x558ddea4d1c7 in /private/home/anj/.conda/envs/test_clone/bin/python)
frame #7: <unknown function> + 0x10075e (0x558dde9a875e in /private/home/anj/.conda/envs/test_clone/bin/python)
frame #8: _PyEval_EvalCodeWithName + 0x2d2 (0x558ddea32
Package                       Version                  Location
----------------------------- ------------------------ --------------------------------------
alabaster                     0.7.12
antlr4-python3-runtime        4.8
appdirs                       1.4.4
attrs                         20.3.0
Babel                         2.9.0
black                         19.10b0
bleach                        3.3.0
certifi                       2020.12.5
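To see at a glance which binaries the SIGSEGV frames above fall in, the trace lines can be parsed with a regular expression. This is a minimal sketch, not part of the original gist: the names `TRACE`, `FRAME`, and `parse_frames` are hypothetical, and the pattern assumes every frame follows the `frame #N: symbol + offset (address in path)` layout shown in the preview.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Three frames copied from the trace preview above.
TRACE = """\
frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x12a (0x7f11eaf85a6a in /private/home/anj/.conda/envs/test_clone/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x153c0 (0x7f12694e63c0 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: PyThread_acquire_lock_timed + 0xd9 (0x558dde9eaa69 in /private/home/anj/.conda/envs/test_clone/bin/python)
"""

# Assumed layout: frame #<n>: <symbol> + <offset> (<address> in <path>)
FRAME = re.compile(
    r"^frame #(\d+): (.+?) \+ (0x[0-9a-f]+) \((0x[0-9a-f]+) in (.+)\)$",
    re.MULTILINE,
)


def parse_frames(text):
    """Return (frame index, symbol, binary file name) for each matching line."""
    return [
        (int(index), symbol, PurePosixPath(path).name)
        for index, symbol, _offset, _address, path in FRAME.findall(text)
    ]


frames = parse_frames(TRACE)
per_binary = Counter(binary for _, _, binary in frames)
for index, symbol, binary in frames:
    print(index, symbol, binary)
```

Lines that do not match the assumed layout, such as the truncated final frame, are skipped rather than reported.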
anj-s / gist:6c808731287e9a504cb63c6f8013fad0
Created April 30, 2021 04:21
Stack trace: node 0: worker 0, node 1: worker 1, server, scheduler
BytePS launching worker
BytePS launching worker
BytePS launching server
BytePS launching scheduler
[2021-04-29 20:03:00.669667: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:00.669697: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:00.669699: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-04-29 20:03:00.669754: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-04-29 20:03:00.670890: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
anj-s / repro_rpc_profiler_callback.py
Created April 28, 2021 18:03
Repro failing to print when using a profiler within a callback.
# Example repro for failing to profile a callback.
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
import time
import argparse
RPC_PORT = 25001
anj-s / repro_bucket_rtts.txt
Created April 28, 2021 16:57
Output of `python repro_bucket_rtts.py --bucket_size=10 --use_cuda_tensors --num_buckets=20`
run_worker 1 with world size 2
---Warm Up-----
Callback triggered in 7664.990643 ms
Callback triggered in 7664.933709 ms
Callback triggered in 7664.819333 ms
Callback triggered in 7665.318029 ms
Callback triggered in 7668.967457 ms
Callback triggered in 7673.087738 ms
Callback triggered in 7677.450334 ms
Callback triggered in 7684.611007 ms
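One way to check the trend in the warm-up timings above is to parse the `Callback triggered in ... ms` lines and look at successive differences. This is a minimal sketch using only the eight values shown in the preview; `parse_rtts`, `deltas`, and `is_non_decreasing` are hypothetical helper names and are not taken from `repro_bucket_rtts.py`.

```python
import re

# The eight warm-up lines copied from the output preview above.
LOG = """\
Callback triggered in 7664.990643 ms
Callback triggered in 7664.933709 ms
Callback triggered in 7664.819333 ms
Callback triggered in 7665.318029 ms
Callback triggered in 7668.967457 ms
Callback triggered in 7673.087738 ms
Callback triggered in 7677.450334 ms
Callback triggered in 7684.611007 ms
"""

RTT_PATTERN = re.compile(r"Callback triggered in ([0-9.]+) ms")


def parse_rtts(text):
    """Extract callback times in milliseconds, in log order."""
    return [float(match.group(1)) for match in RTT_PATTERN.finditer(text)]


def deltas(values):
    """Difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def is_non_decreasing(values):
    """True if no value is smaller than its predecessor."""
    return all(delta >= 0 for delta in deltas(values))


rtts = parse_rtts(LOG)
print(len(rtts), "callbacks parsed")
print("non-decreasing from the first sample:", is_non_decreasing(rtts))
print("non-decreasing from the third sample:", is_non_decreasing(rtts[2:]))
```

On these eight samples the first three values dip slightly, and every later value is larger than the one before it.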
anj-s / repro_bucket_rtts.py
Created April 28, 2021 16:53
Monotonically increasing bucket RTTs in parameter servers.
# Repro increasing bucket RTTs.
import argparse
import os
import socket
import threading
import subprocess
import time
import torch
anj-s / repro_rpc_torch_script.py
Created April 28, 2021 15:25
Example demonstrating torch.jit.script + rpc_async/rpc_sync + RRefs
# Example demonstrating torch.jit.script + rpc_async/rpc_sync + RRefs.
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
import time
import argparse
RPC_PORT = 25001
anj-s / repro_seg_fault_rpc_sync.py
Last active April 28, 2021 15:02
Repro rpc_sync segmentation fault
# Example repro for an rpc_sync segmentation fault.
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
import os
import argparse
import subprocess
Wed Mar 31 20:15:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro GP100        On   | 00000000:AF:00.0 Off |                    0 |
| 26%   36C    P0    30W / 235W |   4638MiB / 16278MiB |      0%      Default |