
Jianbin Chang shjwudp

#!/bin/bash
DP_STRATEGY=${DP_STRATEGY:-"fsdp"}
TP=${TP:-"1"}
PP=${PP:-"1"}
GBS=${GBS:-"128"}
MBS=${MBS:-"2"}
FP8=${FP8:-"0"}
NUM_LAYERS=${NUM_LAYERS:-"32"}
USE_MEGATRON_FSDP=${USE_MEGATRON_FSDP:-"0"}
@shjwudp
shjwudp / test_fsdp_llama2-7b.sh
Created September 11, 2024 04:47
test_fsdp_llama2-7b.sh
#!/bin/bash
# Docker Image: nvcr.io/nvidia/nemo:24.03.01.framework
DP_STRATEGY=${DP_STRATEGY:-"fsdp"}
TP=${TP:-"1"}
PP=${PP:-"1"}
GBS=${GBS:-"128"}
MBS=${MBS:-"2"}
FP8=${FP8:-"0"}
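The `${VAR:-default}` pattern used throughout the script lets every knob be overridden from the command line, e.g. `TP=2 GBS=256 bash test_fsdp_llama2-7b.sh`. A minimal sketch of how the fallback behaves:

```shell
# Unset variables fall back to the quoted default...
unset TP
TP=${TP:-"1"}
echo "TP=$TP"

# ...while values supplied by the caller take precedence
GBS=256
GBS=${GBS:-"128"}
echo "GBS=$GBS"
```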
#!/bin/bash
# Parameters
#SBATCH --account=coreai_devtech_all
#SBATCH --dependency=singleton
#SBATCH --exclusive
#SBATCH --job-name=megatron-fsdp_gpt-20b_h100_bf16_1node
#SBATCH --mem=0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

To extract the position of each frame in an MP4 file using Python, you can use the OpenCV library, which provides a convenient way to handle video files. Note that OpenCV reports each frame's position as a timestamp (cv2.CAP_PROP_POS_MSEC) and a frame index (cv2.CAP_PROP_POS_FRAMES), not as a byte offset within the file. Below is a Python program that uses OpenCV to read an MP4 file and print the position of each frame:

import cv2

# Read a video and print each frame's position (index and timestamp in ms)
def frame_capture(video_path):
    # Create a VideoCapture object
    cap = cv2.VideoCapture(video_path)
    while cap.read()[0]:
        # After read(), POS_FRAMES holds the index of the next frame to decode
        idx = int(cap.get(cv2.CAP_PROP_POS_FRAMES)) - 1
        print(f"frame {idx}: {cap.get(cv2.CAP_PROP_POS_MSEC):.1f} ms")
    cap.release()
@shjwudp
shjwudp / test.py
Created November 14, 2023 06:17
NCCL kernels and GEMM kernel parallel test
import os
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
m = 48000
n = 960
k = 7680
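For reference, the arithmetic cost of the single GEMM defined by the m, n, k above is 2·m·n·k floating-point operations (one multiply and one add per output element per reduction step). A small sketch (the helper name is illustrative, not from the gist):

```python
# Hypothetical helper: FLOPs of one (m x k) @ (k x n) GEMM, 2mnk multiply-adds
def gemm_flops(m, n, k):
    return 2 * m * n * k

# Dimensions from the snippet above: ~0.71 TFLOPs per GEMM call
print(gemm_flops(48000, 960, 7680) / 1e12)
```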
@shjwudp
shjwudp / moe-etp-perf_w_o_etp.sh
Created November 1, 2023 09:56
moe-etp-perf baseline
#! /bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Export so that launched worker processes inherit the setting
export NCCL_DEBUG=INFO
DIR=$(pwd)
GPUS_PER_NODE=1
# Change for multinode config
@shjwudp
shjwudp / moe-etp-perf.sh
Created November 1, 2023 09:56
moe-etp-perf
#! /bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Export so that launched worker processes inherit the setting
export NCCL_DEBUG=INFO
DIR=$(pwd)
GPUS_PER_NODE=8
# Change for multinode config
@shjwudp
shjwudp / GPT_model_FLOPs.md
Created August 3, 2023 08:10
GPT model FLOPs

GPU one step FLOPs calculation

For FLOPs calculations, we follow the derivation from Narayanan et al. [13] and only consider the matrix multiplications (GEMMs), which are the main contributors to the number of floating-point operations. For the attention block, the main contributors are: key, query, and value transformation ($6Bsh^2$ operations), attention matrix computation ($2Bs^2h$ operations), attention over values ($2Bs^2h$ operations), and the post-attention linear projection ($2Bsh^2$ operations), where $B$ is the microbatch size, $s$ is the sequence length, and $h$ is the hidden size.

For the feed-forward network, which increases the hidden size to $4h$ and then reduces it back to $h$, we have $16Bsh^2$ floating-point operations. Summing these together, each transformer layer results in $24Bsh^2 + 4Bs^2h$ FLOPs for the forward pass. The other main contributor to the number of floating-point operations is the logits layer in the language model head, which transforms features of dimension $h$ to the vocabulary dimension $v$. The required FLOPs for this operation are $2Bshv$ in the forward pass.
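The terms above can be sketched directly in code. A minimal, illustrative implementation of the forward-pass FLOPs formula (function names are mine, not from the gist):

```python
# Forward-pass FLOPs of one transformer layer: 24*B*s*h^2 + 4*B*s^2*h
def layer_forward_flops(B, s, h):
    # QKV transform + attention scores + attention over values + output proj
    attention = 6 * B * s * h**2 + 2 * B * s**2 * h + 2 * B * s**2 * h + 2 * B * s * h**2
    # Feed-forward network: h -> 4h -> h
    ffn = 16 * B * s * h**2
    return attention + ffn

# Logits layer: project hidden size h to vocabulary size v
def logits_forward_flops(B, s, h, v):
    return 2 * B * s * h * v
```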

@shjwudp
shjwudp / ipython_magic_command_%env.md
Created July 29, 2023 16:15
Adjusting Jupyter Notebook/IPython's System Shell with %env Magic Command

IPython's system shell has a unique subshell mechanism that sometimes behaves differently from what we expect. However, we can use the %env built-in magic command to change the subshell configuration and adjust the execution environment to meet our needs.

For example, if we're working in an unfamiliar environment where the default shell is zsh, but we want to work under our favorite bash, we can type %env SHELL bash at the beginning of a Jupyter Notebook/IPython session to change the system shell. The setting then stays in effect for the rest of the session.
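Under the hood, %env VAR value sets an environment variable on the kernel process, equivalent to assigning through os.environ. A plain-Python sketch of the same effect:

```python
import os

# Equivalent of the IPython magic `%env SHELL bash`:
os.environ["SHELL"] = "bash"
print(os.environ["SHELL"])
```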

@shjwudp
shjwudp / find_nearest_neighbor.cpp
Created June 1, 2023 08:10
find_nearest_neighbor
#include <vector>

struct Node {
    int id;
};
struct Edge {
    int a;
    int b;
};
struct Graph {
    // Gist preview truncated here; plausible members for a graph:
    std::vector<Node> nodes;
    std::vector<Edge> edges;
};