fighterhit

@superbrothers
superbrothers / go.mod
Last active May 4, 2023 07:52
A workaround for "Failed to initialize NVML: Unknown Error" after calling systemctl daemon-reload https://github.com/NVIDIA/nvidia-docker/issues/1650
module github.com/pfnet-research/nvidia-create-symlinks
go 1.19
require (
github.com/NVIDIA/nvidia-container-toolkit v1.12.0-rc.2.0.20230127101129-9fc2c5912242 // indirect
github.com/cpuguy83/go-md2man/v2 v2.0.1 // indirect
github.com/fsnotify/fsnotify v1.5.4 // indirect
github.com/russross/blackfriday/v2 v2.1.0 // indirect
github.com/sirupsen/logrus v1.9.0 // indirect
@gengwg
gengwg / nvml_cgroupv2_fix.md
Last active July 10, 2024 07:10
Fix for jobs that initially see the GPUs fine, then suddenly lose NVML after a few hours

NOTE: This seems to have fixed our cluster, BUT I still see some people reporting the same issue on cgroup2, for example here. So YMMV.

DISCLAIMER: This seems to work in our env; it may not work in others. I'm still not sure what the real root cause(s) are. I'm not even 100% sure it fully fixes the issue in our env, though it's been good for 2 weeks. But if it reappears (for example, under certain use cases, high load or something), I'll be doomed.

TLDR

Switching to cgroup v2 seems to have fixed the issue of NVML suddenly going away in pods.
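
To tell which cgroup mode a node is running, here is a minimal sketch (not part of the original gist). It assumes the usual mount point `/sys/fs/cgroup` and relies on the fact that a unified cgroup v2 mount exposes a `cgroup.controllers` file at its root; the function name `cgroup_version` is hypothetical.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Guess the cgroup mode from the layout of the cgroup mount point.

    Assumption: `root` is where the cgroup filesystem is mounted. On a
    unified (v2) hierarchy the root directory contains a file named
    cgroup.controllers; on v1 or hybrid setups it does not.
    """
    if (Path(root) / "cgroup.controllers").is_file():
        return "v2"
    return "v1 or hybrid"


if __name__ == "__main__":
    print(cgroup_version())
```

Run it on the node itself rather than inside a pod, since a container may see a different view of `/sys/fs/cgroup`.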

Problem

@Dounm
Dounm / monitor_ib_traffic.py
Last active May 22, 2024 02:31
Monitor InfiniBand traffic and calculate bandwidth
# Inspired by https://github.com/vpenso/ganglia-sensors/blob/master/lib/python_modules/infiniband.py#/
import logging
import re
import sys
import json
import time
import subprocess
@hellojukay
hellojukay / proxy_server.go
Last active May 12, 2022 08:39
An HTTP(S) proxy server
package main
import (
"bytes"
"fmt"
"io"
"log"
"net"
"regexp"
"strings"
@cirocosta
cirocosta / containerd-prune
Created January 30, 2020 13:09
prune containerd stuff
#!/bin/bash
set -o errexit
set -o xtrace
main() {
local namespaces=$(list_namespaces)
for namespace in $namespaces; do
local tasks=$(list_tasks $namespace)
@Einstrasse
Einstrasse / bits-stdc++.h
Created December 3, 2019 14:52
bits/stdc++.h header file
// C++ includes used for precompiling -*- C++ -*-
// Copyright (C) 2003-2015 Free Software Foundation, Inc.
//
// This file is part of the GNU ISO C++ Library. This library is free
// software; you can redistribute it and/or modify it under the
// terms of the GNU General Public License as published by the
// Free Software Foundation; either version 3, or (at your option)
// any later version.
@theojulienne
theojulienne / traceicmpsoftirq.py
Last active January 11, 2024 12:38
ICMP packet tracer using BCC
#!/usr/bin/python
bpf_text = """
#include <linux/ptrace.h>
#include <linux/sched.h> /* For TASK_COMM_LEN */
#include <linux/icmp.h>
#include <linux/netdevice.h>
struct probe_icmp_data_t
{
@asaphe
asaphe / k8s_kubectl.md
Last active June 20, 2023 13:05
Kubernetes Commands - Kubectl

Kubectl

Imperative == refers to CLI commands

Declarative == using YAML files

--export
--save-config
--record
@OhBonsai
OhBonsai / producer.go
Last active November 15, 2023 12:10
RabbitMQ producer with support for reconnecting and resending
package main
import (
"log"
"github.com/streadway/amqp"
"time"
"os"
"errors"
)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC