Debugging memory leak in LTC

pytorch/pytorch#80942

Was able to repro a leak in C++ without Python by modifying a unit test to run in a loop.

Then tried using valgrind (compiled from source following the valgrind instructions; it didn't require any deps and worked on the first try).

Modify test_lazy_ops.py to apply the patch below, which basically turns TestLinear into a loop.
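The patch itself isn't captured in this excerpt. As a rough illustration of the idea only (not the actual diff), a loop-ified lazy linear test might look like the sketch below; the torch._lazy / TS-backend calls are assumptions about the setup.

import torch
import torch._lazy as lazy
import torch._lazy.ts_backend

# Hypothetical sketch, not the gist's patch: run the same lazy-tensor op in a
# tight loop so a per-iteration leak shows up as steadily growing memory under
# valgrind or simple RSS monitoring.
torch._lazy.ts_backend.init()

def test_linear_loop(iterations=10000):
    linear = torch.nn.Linear(64, 64).to("lazy")
    x = torch.randn(8, 64, device="lazy")
    for _ in range(iterations):
        y = linear(x)
        lazy.mark_step()  # force the captured graph to execute each iteration

test_linear_loop()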

(/scratch/whc/work/py38) whc@a100-st-p4d24xlarge-17:/scratch/whc/work/torchdynamo$ python benchmarks/torchbench.py --dynamic_shapes --training --nvfuser --accuracy-aot-ts-mincut --devices cuda --repeat 1 -k hf_bert
cuda train hf_Bert ERROR FROM offset=178 filename /scratch/whc/work/py38/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py 1377 AssertionError
========== TorchDynamo Stack Trace ==========
Traceback (most recent call last):
  File "/scratch/whc/work/torchdynamo/torchdynamo/convert_frame.py", line 304, in _convert_frame_assert
    code = transform_code_object(frame.f_code, transform)
  File "/scratch/whc/work/torchdynamo/torchdynamo/bytecode_transformation.py", line 338, in transform_code_object
    transformations(instructions, code_options)
@wconstab
wconstab / output_rgcn
Last active July 14, 2022 22:58
Infra for profiling dynamic shapes and modeling them as a distribution
ShapeModel(x, [DynamicDim[('cauchy', {'loc': 5396.350680541128, 'scale': 23.678595214301602})], StaticDim[64]])
ShapeModel(edge_index, [StaticDim[2], DynamicDim[('cauchy', {'loc': 6886.825555614419, 'scale': 26.45052768271832})]])
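The rest of the gist isn't shown here. As a hedged sketch of the underlying idea (collect a dynamic dimension's observed sizes and fit a distribution to them), assuming scipy is available and that 'cauchy' maps to scipy.stats.cauchy:

import numpy as np
from scipy.stats import cauchy

# Hypothetical example: the observed sizes below are made up, not taken from the gist.
observed_sizes = np.array([5361, 5402, 5390, 5417, 5377, 5399])
loc, scale = cauchy.fit(observed_sizes)
print(f"DynamicDim[('cauchy', {{'loc': {loc}, 'scale': {scale}}})]")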
@wconstab
wconstab / coldstart.py
Last active June 8, 2022 16:48
Benchmarking torchdynamo cold start time
import argparse
import torch
import torch.nn as nn
import torchdynamo
parser = argparse.ArgumentParser()
parser.add_argument("--dynamo", action="store_true")
parser.add_argument("--size", type=int, default=1)
parser.add_argument("--child", action="store_true", help="inside child process")
parser.add_argument("--repeat", type=int, default=2, help="how many repeats (without warmup) to time. 2 covers profiling executor behavior.")
parser.add_argument("--device", default="cuda")
args = parser.parse_args()
torchdynamo.config.cache_size_limit = 1
def main():
    class SuperLinear(nn.Linear):
        def __init__(self, size_i, size_o):
            super().__init__(size_i, size_o)

        def forward(self, x):
            with torchdynamo.optimize("eager"):
                x = super().forward(x)
            return x
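The preview cuts off here. As a hedged sketch (not the gist's remaining code) of how the parent process could time a cold start by launching the script again in a fresh interpreter, assuming the --child flag above marks the worker invocation:

import subprocess
import sys
import time

def time_cold_start(script="coldstart.py", extra=("--child", "--dynamo")):
    # A fresh process is what makes it a cold start: no warm caches or JIT state.
    start = time.time()
    subprocess.run([sys.executable, script, *extra], check=True)
    return time.time() - start

if __name__ == "__main__":
    print(f"cold start: {time_cold_start():.2f}s")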
[ RUN ] LazyOpsTest.TestNllLoss
/var/lib/jenkins/workspace/aten/src/ATen/native/LossNLL.cpp:266:16: runtime error: division by zero
#0 0x7f9abe2167ce in void at::native::(anonymous namespace)::nll_loss_out_frame<float, long>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long) (/opt/conda/lib/python3.7/site-packages/torch/bin/libtorch_cpu.so+0xbbaa7ce)
#1 0x7f9abe204c3c in at::native::(anonymous namespace)::nll_loss_forward_out_cpu_template(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long) (/opt/conda/lib/python3.7/site-packages/torch/bin/libtorch_cpu.so+0xbb98c3c)
#2 0x7f9abe20443c in at::native::structured_nll_loss_forward_out_cpu::impl(at::Tensor const&, at::Tensor const&, at::OptionalTensorRef, long, long, at::Tensor const&, at::Tensor const&) (/opt/conda/lib/python3.7/site-packages/torch/bin/libtorch_cpu.so+0xbb9843c)
#3 0x7f9abfd205c0 in at::(anonymous namespace)::wrapper_n
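For context, one way nll_loss's mean reduction can divide by zero in plain eager mode is when every target is ignored, so the total weight is zero; this is a guess at the failure mode, not necessarily the exact case TestNllLoss hits.

import torch
import torch.nn.functional as F

log_probs = torch.log_softmax(torch.randn(4, 3), dim=1)
target = torch.full((4,), 2, dtype=torch.long)
# All targets equal ignore_index, so total_weight is 0 and the mean
# reduction divides by zero (NaN in eager; a sanitizer build with
# float-divide-by-zero checks flags it).
loss = F.nll_loss(log_probs, target, ignore_index=2, reduction="mean")
print(loss)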
@wconstab
wconstab / bert_ir_baseline_with_lr_hack
Created December 8, 2021 18:24
bert_ir for wrapped scalars
[ScheduleSyncTensorsGraph]
TensorsGraphInfo:
to_device (/home/whc/pytorch/lazy_tensor_core/lazy_bench.py:294)
check_results (/home/whc/pytorch/lazy_tensor_core/lazy_bench.py:361)
<module> (/home/whc/pytorch/lazy_tensor_core/lazy_bench.py:440)
Hashes: (51812b4e6a763b887a0ee6c07ddc9e86)
## BEGIN_GRAPH
IR {
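The IR dump is truncated here. As a hedged illustration of the "wrapped scalar" idea the title refers to (my reading, not the gist's code): passing the learning rate as a 0-dim tensor makes it a graph input rather than a constant baked into the trace, so changing lr does not change the graph hash and force a retrace.

import torch

param = torch.randn(10)
grad = torch.randn(10)

lr_as_constant = 0.01             # traced as a literal; a new value means a new graph
lr_wrapped = torch.tensor(0.01)   # traced as an input; a new value reuses the graph

with torch.no_grad():
    param -= lr_wrapped * grad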
@wconstab
wconstab / README.md
Created October 21, 2021 16:31
follow-up repro using Kevin's script:

Notes:

same runner.py as https://gist.github.com/wconstab/9802986a1353ee8eb14e12d1e6a23b79

run with rm mytest; LTC_SAVE_TENSORS_FILE=mytest PYTORCH_JIT_LOG_LEVEL=">>>graph_fuser" LTC_TS_CUDA=1 python bias_dropout_add_layernorm.py > console.log 2>&1

Observed this warning for some reason: [W manager.cpp:305] Warning: FALLBACK path has been taken. This is an indication that codegen failed for some reason. To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE_FALLBACK=1 (function runCudaFusionGroup)

But I don't see the fragmentation of backward that you mentioned. I'm also not sure whether backward is complete in this case; it only includes native_layer_norm and sum.

@wconstab
wconstab / README.md
Created October 21, 2021 16:26
Repro for bias_dropout_add_layernorm

Steps to repro

  1. sync to the wconstab/dropout branch of pytorch, which I just rebased on lazy_tensor_staging (10/21); I see no change in behavior
  2. run rm mytest; LTC_SAVE_TENSORS_FILE=mytest PYTORCH_JIT_LOG_LEVEL=">>>graph_fuser" LTC_TS_CUDA=1 python bias_dropout_add_layernorm.py
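bias_dropout_add_layernorm.py itself isn't included in this excerpt. Below is a hypothetical stand-in for the pattern it presumably exercises (bias + dropout + residual add + layer_norm, forward and backward), written against the current in-core torch._lazy API rather than the 2021 lazy_tensor_staging branch the steps target.

import torch
import torch.nn.functional as F
import torch._lazy as lazy
import torch._lazy.ts_backend

torch._lazy.ts_backend.init()

hidden = 1024
x = torch.randn(8, 128, hidden, device="lazy", requires_grad=True)
bias = torch.randn(hidden, device="lazy", requires_grad=True)
residual = torch.randn(8, 128, hidden, device="lazy")
ln = torch.nn.LayerNorm(hidden).to("lazy")

# Not the actual script: just the op sequence the fuser would see.
out = ln(F.dropout(x + bias, p=0.1, training=True) + residual)
out.sum().backward()
lazy.mark_step()  # flush the traced forward/backward graph to the TS backend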
@wconstab
wconstab / LazyNativeFunctions.cpp
Created October 14, 2021 16:02
A view of the generated LazyNativeFunctions.cpp
// @generated by tools/codegen/gen.py from DispatchKeyNativeFunctions.cpp
#include "ATen/MetaFunctions.h"
#include "lazy_tensor_core/csrc/aten_ltc_bridge.h"
#include "lazy_tensor_core/csrc/helpers.h"
#include "lazy_tensor_core/csrc/tensor.h"
#include "lazy_tensor_core/csrc/tensor_util.h"
#include "/home/whc/pytorch/lazy_tensor_core/scripts/../lazy_tensor_core/csrc/ts_backend/LazyNativeFunctions.h"
#include "/home/whc/pytorch/lazy_tensor_core/scripts/../lazy_tensor_core/csrc/ts_backend/LazyLazyIr.h"
#include "/home/whc/pytorch/lazy_tensor_core/scripts/../lazy_tensor_core/csrc/ts_backend/LazyShapeDtype.h"