vfdev vfdev-5

@vfdev-5
vfdev-5 / notes.md
Created Jul 10, 2020
PyTorch tests notes
View notes.md

How to run tests of PyTorch

Distributed module

  • Run all distributed tests
python test/run_test.py -pt -vi distributed/test_distributed
  • Run a single test
@vfdev-5
vfdev-5 / tech-share.md
Created Jul 6, 2020
PyTorch Tech Share (July 06 2020) - Simple PyTorch distributed computation functionality testing with `pytest-xdist`.
View tech-share.md

PyTorch Tech Share (July 06 2020) - Simple PyTorch distributed computation functionality testing with pytest-xdist.

This note covers one of several possible approaches to testing custom distributed computation functionality by emulating multiple processes.

What is a "distributed setting" in PyTorch?

  • Communication between the application's N processes
    • send/receive tensors
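
As a minimal sketch of how emulated processes might be told apart: pytest-xdist names its workers gw0, gw1, ... and exposes that name through the worker_id fixture and the PYTEST_XDIST_WORKER_COUNT environment variable. The helper names below (rank_from_worker_id, world_size_from_env) are hypothetical and not taken from the tech share, and mapping a worker to a rank this way is an assumption.

```python
import os
import re


def rank_from_worker_id(worker_id: str) -> int:
    """Map a pytest-xdist worker id ("gw0", "gw1", ...) to an integer rank.

    Outside of xdist the id is "master"; treating that (and any other
    unexpected value) as rank 0 is an assumption made for this sketch.
    """
    match = re.fullmatch(r"gw(\d+)", worker_id)
    return int(match.group(1)) if match else 0


def world_size_from_env(env=None) -> int:
    """Read the number of xdist workers, defaulting to 1 when unset."""
    env = os.environ if env is None else env
    return int(env.get("PYTEST_XDIST_WORKER_COUNT", "1"))
```

In a test run started with pytest -n N, the resulting rank and world size could then be handed to torch.distributed.init_process_group, so that each worker acts as one process of the group.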
@vfdev-5
vfdev-5 / readme.md
Last active May 19, 2020
Remote debugging python code with VSCode
View readme.md

My two cents on debugging: if you have ssh access to some of your nodes, especially the one where an experiment (xp) failed, you can also debug the code remotely. This may seem a bit tricky, but it can help to check the pipeline. I tested it with VSCode. To debug, you need to:

  1. ssh to the node
  2. find the docker image of the failed xp
  3. run a docker container from that image with the input data mounted inside the container (-v somepath:/data), plus additional options for debugging: --security-opt seccomp:unconfined and --network=host.
  4. Install ptvsd on both the local and remote machines: pip install --upgrade ptvsd
  5. Follow the guide on how to setup debugging interface from VSCode: https://code.visualstudio.com/docs/python/debugging#_remote-debugging
{
  "name": "Python: Attach",
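
Step 3 above can be sketched as a small helper that assembles the docker run command. This is only an illustration: the function name, the image tag, the input path, and the -it, --rm and /bin/bash parts are assumptions, while the -v, --security-opt and --network options are the ones listed in the steps.

```python
import shlex


def docker_debug_command(image: str, host_data_path: str) -> list:
    """Build a `docker run` argument list for a remote debugging session."""
    return [
        "docker", "run", "-it", "--rm",          # interactive, throwaway container (assumption)
        "-v", f"{host_data_path}:/data",         # mount the input data at /data
        "--security-opt", "seccomp:unconfined",  # debugging option from step 3
        "--network=host",                        # share the host network with the container
        image,
        "/bin/bash",                             # shell to start the script from (assumption)
    ]


# Hypothetical image name and data path, for illustration only.
cmd = docker_debug_command("failed-xp-image:latest", "/path/to/input")
print(shlex.join(cmd))
```

The printed line can be pasted into the ssh session on the node; shlex.join requires Python 3.8 or newer.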
@vfdev-5
vfdev-5 / Dockerfile
Created Apr 28, 2020
Torch XLA Image, CPU
View Dockerfile
FROM python:3.6-buster
# - Install gsutil to copy wheels from gcr
# - Install openblas
# - Download torch & xla
# - Setup Python 3.6 env for Torch XLA wheels
# - Install torch & xla
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
apt-get install -y apt-transport-https ca-certificates gnupg curl && \
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
@vfdev-5
vfdev-5 / check_torch_cuda_amp_CycleGAN_on_random.py
Created Apr 12, 2020
Reproduce loss=NaN in CycleGAN with torch.cuda.amp
View check_torch_cuda_amp_CycleGAN_on_random.py
#!/usr/bin/env python
# coding: utf-8
import torch
print(torch.__version__)
import ignite
print(ignite.__file__)
print(ignite.__version__)
@vfdev-5
vfdev-5 / benchmark_engine.sh
Last active Jan 13, 2020
Helper scripts to benchmark and check ignite's engine on MNIST, CIFAR10 tasks
View benchmark_engine.sh
#!/bin/bash
# Tests configuration:
if [ -z "$version" ]; then
export version="v0.2.1"
# export version="master"
# export version="engine_refactor"
echo "Setup version: $version"
fi
View check_amp_CycleGAN_on_random.py
#!/usr/bin/env python
# coding: utf-8
opt_level = "O1"
import torch
print(torch.__version__)
import ignite
print(ignite.__file__)
@vfdev-5
vfdev-5 / README.md
Last active Aug 11, 2020
ROS development on MacOSX using docker
View README.md

ROS development on MacOSX using docker

We need to use docker-machine to access USB ports from inside the docker container.

Docker Machine (0.16.1)

@vfdev-5
vfdev-5 / reproduce_error.py
Last active Mar 6, 2019
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
View reproduce_error.py
from __future__ import print_function
import argparse
import random
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
@vfdev-5
vfdev-5 / notes_pytorch_distributed.md
Created Feb 26, 2019
Notes on PyTorch distributed
View notes_pytorch_distributed.md

Some notes on launching distributed computations with PyTorch

  • Inside a docker container
  • Using NCCL and TCP or Shared file-system
  • PyTorch version: 1.0.1.post2
  • 2 Nodes / 3 GPUs

Docker container

We need to run the container with the --network=host option.
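
With the host network shared, the processes on the nodes can rendezvous over TCP or through a shared file, the two options named in the list above. The sketch below only builds the init_method strings in the tcp:// and file:// forms that torch.distributed accepts; the helper names, address, port and path are hypothetical.

```python
import os


def tcp_init_method(master_addr: str, master_port: int) -> str:
    """URL for TCP initialization, pointing at the rank 0 process."""
    return f"tcp://{master_addr}:{master_port}"


def file_init_method(shared_path: str) -> str:
    """URL for shared file-system initialization.

    The path must be absolute and visible from every node.
    """
    if not os.path.isabs(shared_path):
        raise ValueError("shared_path must be an absolute path")
    return f"file://{shared_path}"


# Hypothetical address, port and path, for illustration only.
print(tcp_init_method("10.1.1.20", 23456))
print(file_init_method("/mnt/nfs/sharedfile"))
```

Either string would be passed as init_method to torch.distributed.init_process_group, together with backend="nccl" and the rank and world_size of the calling process.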
