Debugging memory leak in LTC

pytorch/pytorch#80942

Was able to repro a leak in C++ without Python by modifying a unit test to run in a loop.

Then tried using valgrind (compiled from source following the valgrind instructions; it didn't require any deps and worked on the first try).

Modify test_lazy_ops.py to apply the patch below, which basically turns TestLinear into a loop.
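The patch itself isn't captured in this excerpt. As a rough illustration of the idea only (not the actual diff), a loop-ified lazy linear test might look like the sketch below; the torch._lazy / TS-backend calls are assumptions about the setup.

import torch
import torch._lazy as lazy
import torch._lazy.ts_backend

# Hypothetical sketch, not the gist's patch: run the same lazy-tensor op in a
# tight loop so a per-iteration leak shows up as steadily growing memory under
# valgrind or simple RSS monitoring.
torch._lazy.ts_backend.init()

def test_linear_loop(iterations=10000):
    linear = torch.nn.Linear(64, 64).to("lazy")
    x = torch.randn(8, 64, device="lazy")
    for _ in range(iterations):
        y = linear(x)
        lazy.mark_step()  # force the captured graph to execute each iteration

test_linear_loop()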

(/scratch/whc/work/py38) whc@a100-st-p4d24xlarge-17:/scratch/whc/work/torchdynamo$ python benchmarks/torchbench.py --dynamic_shapes --training --nvfuser --accuracy-aot-ts-mincut --devices cuda --repeat 1 -k hf_bert
cuda train hf_Bert ERROR FROM offset=178 filename /scratch/whc/work/py38/lib/python3.8/site-packages/transformers/models/bert/modeling_bert.py 1377 AssertionError
========== TorchDynamo Stack Trace ==========
Traceback (most recent call last):
  File "/scratch/whc/work/torchdynamo/torchdynamo/convert_frame.py", line 304, in _convert_frame_assert
    code = transform_code_object(frame.f_code, transform)
  File "/scratch/whc/work/torchdynamo/torchdynamo/bytecode_transformation.py", line 338, in transform_code_object
    transformations(instructions, code_options)
@wconstab
wconstab / output_rgcn
Last active July 14, 2022 22:58
Infra for profiling dynamic shapes and modeling them as a distribution
ShapeModel(x, [DynamicDim[('cauchy', {'loc': 5396.350680541128, 'scale': 23.678595214301602})], StaticDim[64]])
ShapeModel(edge_index, [StaticDim[2], DynamicDim[('cauchy', {'loc': 6886.825555614419, 'scale': 26.45052768271832})]])
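The rest of the gist isn't shown here. As a hedged sketch of the underlying idea (collect a dynamic dimension's observed sizes and fit a distribution to them), assuming scipy is available and that 'cauchy' maps to scipy.stats.cauchy:

import numpy as np
from scipy.stats import cauchy

# Hypothetical example: the observed sizes below are made up, not taken from the gist.
observed_sizes = np.array([5361, 5402, 5390, 5417, 5377, 5399])
loc, scale = cauchy.fit(observed_sizes)
print(f"DynamicDim[('cauchy', {{'loc': {loc}, 'scale': {scale}}})]")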
@wconstab
wconstab / coldstart.py
Last active June 8, 2022 16:48
Benchmarking torchdynamo cold start time
import argparse
import torch
import torch.nn as nn
import torchdynamo
parser = argparse.ArgumentParser()
parser.add_argument("--dynamo", action="store_true")
parser.add_argument("--size", type=int, default=1)
parser.add_argument("--child", action="store_true", help="inside child process")
parser.add_argument("--repeat", type=int, default=2, help="how many repeats (without warmup) to time. 2 covers profiling executor behavior.")
parser.add_argument("--device", default="cuda")
args = parser.parse_args()
torchdynamo.config.cache_size_limit = 1
def main():
    class SuperLinear(nn.Linear):
        def __init__(self, size_i, size_o):
            super().__init__(size_i, size_o)

        def forward(self, x):
            with torchdynamo.optimize("eager"):
                x = super().forward(x)
            return x
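The preview cuts off here. As a hedged sketch (not the gist's remaining code) of how the parent process could time a cold start by launching the script again in a fresh interpreter, assuming the --child flag above marks the worker invocation:

import subprocess
import sys
import time

def time_cold_start(script="coldstart.py", extra=("--child", "--dynamo")):
    # A fresh process is what makes it a cold start: no warm caches or JIT state.
    start = time.time()
    subprocess.run([sys.executable, script, *extra], check=True)
    return time.time() - start

if __name__ == "__main__":
    print(f"cold start: {time_cold_start():.2f}s")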
[ RUN ] LazyOpsTest.TestNllLoss
/var/lib/jenkins/workspace/aten/src/ATen/native/LossNLL.cpp:266:16: runtime error: division by zero
#0 0x7f9abe2167ce in void at::native::(anonymous namespace)::nll_loss_out_frame<float, long>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long) (/opt/conda/lib/python3.7/site-packages/torch/bin/libtorch_cpu.so+0xbbaa7ce)
#1 0x7f9abe204c3c in at::native::(anonymous namespace)::nll_loss_forward_out_cpu_template(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long) (/opt/conda/lib/python3.7/site-packages/torch/bin/libtorch_cpu.so+0xbb98c3c)
#2 0x7f9abe20443c in at::native::structured_nll_loss_forward_out_cpu::impl(at::Tensor const&, at::Tensor const&, at::OptionalTensorRef, long, long, at::Tensor const&, at::Tensor const&) (/opt/conda/lib/python3.7/site-packages/torch/bin/libtorch_cpu.so+0xbb9843c)
#3 0x7f9abfd205c0 in at::(anonymous namespace)::wrapper_n
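For context, one way nll_loss's mean reduction can divide by zero in plain eager mode is when every target is ignored, so the total weight is zero; this is a guess at the failure mode, not necessarily the exact case TestNllLoss hits.

import torch
import torch.nn.functional as F

log_probs = torch.log_softmax(torch.randn(4, 3), dim=1)
target = torch.full((4,), 2, dtype=torch.long)
# All targets equal ignore_index, so total_weight is 0 and the mean
# reduction divides by zero (NaN in eager; a sanitizer build with
# float-divide-by-zero checks flags it).
loss = F.nll_loss(log_probs, target, ignore_index=2, reduction="mean")
print(loss)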
@wconstab
wconstab / bert_ir_baseline_with_lr_hack
Created December 8, 2021 18:24
bert_ir for wrapped scalars
[ScheduleSyncTensorsGraph]
TensorsGraphInfo:
to_device (/home/whc/pytorch/lazy_tensor_core/lazy_bench.py:294)
check_results (/home/whc/pytorch/lazy_tensor_core/lazy_bench.py:361)
<module> (/home/whc/pytorch/lazy_tensor_core/lazy_bench.py:440)
Hashes: (51812b4e6a763b887a0ee6c07ddc9e86)
## BEGIN_GRAPH
IR {
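The IR dump is truncated here. As a hedged illustration of the "wrapped scalar" idea the title refers to (my reading, not the gist's code): passing the learning rate as a 0-dim tensor makes it a graph input rather than a constant baked into the trace, so changing lr does not change the graph hash and force a retrace.

import torch

param = torch.randn(10)
grad = torch.randn(10)

lr_as_constant = 0.01             # traced as a literal; a new value means a new graph
lr_wrapped = torch.tensor(0.01)   # traced as an input; a new value reuses the graph

with torch.no_grad():
    param -= lr_wrapped * grad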
@wconstab
wconstab / README.md
Created October 21, 2021 16:31
follow-up repro using Kevin's script:

Notes:

same runner.py as https://gist.github.com/wconstab/9802986a1353ee8eb14e12d1e6a23b79

run with rm mytest; LTC_SAVE_TENSORS_FILE=mytest PYTORCH_JIT_LOG_LEVEL=">>>graph_fuser" LTC_TS_CUDA=1 python bias_dropout_add_layernorm.py > console.log 2>&1

Observed this warning for some reason: [W manager.cpp:305] Warning: FALLBACK path has been taken. This is an indication that codegen failed for some reason. To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE_FALLBACK=1 (function runCudaFusionGroup)

But I don't see the fragmentation of backward that you mentioned. I'm also not sure whether backward is complete in this case; it only includes native_layer_norm and sum.

@wconstab
wconstab / README.md
Created October 21, 2021 16:26
Repro for bias_dropout_add_layernorm

Steps to repro

  1. sync to the wconstab/dropout branch of pytorch, which I just rebased on lazy_tensor_staging (10/21); I see no change in behavior
  2. run rm mytest; LTC_SAVE_TENSORS_FILE=mytest PYTORCH_JIT_LOG_LEVEL=">>>graph_fuser" LTC_TS_CUDA=1 python bias_dropout_add_layernorm.py
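bias_dropout_add_layernorm.py itself isn't included in this excerpt. Below is a hypothetical stand-in for the pattern it presumably exercises (bias + dropout + residual add + layer_norm, forward and backward), written against the current in-core torch._lazy API rather than the 2021 lazy_tensor_staging branch the steps target.

import torch
import torch.nn.functional as F
import torch._lazy as lazy
import torch._lazy.ts_backend

torch._lazy.ts_backend.init()

hidden = 1024
x = torch.randn(8, 128, hidden, device="lazy", requires_grad=True)
bias = torch.randn(hidden, device="lazy", requires_grad=True)
residual = torch.randn(8, 128, hidden, device="lazy")
ln = torch.nn.LayerNorm(hidden).to("lazy")

# Not the actual script: just the op sequence the fuser would see.
out = ln(F.dropout(x + bias, p=0.1, training=True) + residual)
out.sum().backward()
lazy.mark_step()  # flush the traced forward/backward graph to the TS backend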
@wconstab
wconstab / LazyNativeFunctions.cpp
Created October 14, 2021 16:02
A view of the generated LazyNativeFunctions.cpp
// @generated by tools/codegen/gen.py from DispatchKeyNativeFunctions.cpp
#include "ATen/MetaFunctions.h"
#include "lazy_tensor_core/csrc/aten_ltc_bridge.h"
#include "lazy_tensor_core/csrc/helpers.h"
#include "lazy_tensor_core/csrc/tensor.h"
#include "lazy_tensor_core/csrc/tensor_util.h"
#include "/home/whc/pytorch/lazy_tensor_core/scripts/../lazy_tensor_core/csrc/ts_backend/LazyNativeFunctions.h"
#include "/home/whc/pytorch/lazy_tensor_core/scripts/../lazy_tensor_core/csrc/ts_backend/LazyLazyIr.h"
#include "/home/whc/pytorch/lazy_tensor_core/scripts/../lazy_tensor_core/csrc/ts_backend/LazyShapeDtype.h"