
Jianbin Chang shjwudp

#!/bin/bash
DP_STRATEGY=${DP_STRATEGY:-"fsdp"}
TP=${TP:-"1"}
PP=${PP:-"1"}
GBS=${GBS:-"128"}
MBS=${MBS:-"2"}
FP8=${FP8:-"0"}
NUM_LAYERS=${NUM_LAYERS:-"32"}
USE_MEGATRON_FSDP=${USE_MEGATRON_FSDP:-"0"}
@shjwudp
shjwudp / test_fsdp_llama2-7b.sh
Created September 11, 2024 04:47
test_fsdp_llama2-7b.sh
#!/bin/bash
# Docker Image: nvcr.io/nvidia/nemo:24.03.01.framework
DP_STRATEGY=${DP_STRATEGY:-"fsdp"}
TP=${TP:-"1"}
PP=${PP:-"1"}
GBS=${GBS:-"128"}
MBS=${MBS:-"2"}
FP8=${FP8:-"0"}
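The `${VAR:-default}` pattern used throughout the script lets every knob be overridden from the command line, e.g. `TP=2 GBS=256 bash test_fsdp_llama2-7b.sh`. A minimal sketch of how the fallback behaves:

```shell
# Unset variables fall back to the quoted default...
unset TP
TP=${TP:-"1"}
echo "TP=$TP"

# ...while values supplied by the caller take precedence
GBS=256
GBS=${GBS:-"128"}
echo "GBS=$GBS"
```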
#!/bin/bash
# Parameters
#SBATCH --account=coreai_devtech_all
#SBATCH --dependency=singleton
#SBATCH --exclusive
#SBATCH --job-name=megatron-fsdp_gpt-20b_h100_bf16_1node
#SBATCH --mem=0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

To extract the position of each frame in an MP4 file using Python, you can use the OpenCV library, which provides a convenient way to handle video files. Note that OpenCV reports each frame's position as a timestamp (cv2.CAP_PROP_POS_MSEC) and a frame index (cv2.CAP_PROP_POS_FRAMES), not as a byte offset within the file. Below is a Python program that uses OpenCV to read an MP4 file and print the position of each frame:

import cv2

# Read a video and print each frame's position (index and timestamp in ms)
def frame_capture(video_path):
    # Create a VideoCapture object
    cap = cv2.VideoCapture(video_path)
    while cap.read()[0]:
        # After read(), POS_FRAMES holds the index of the next frame to decode
        idx = int(cap.get(cv2.CAP_PROP_POS_FRAMES)) - 1
        print(f"frame {idx}: {cap.get(cv2.CAP_PROP_POS_MSEC):.1f} ms")
    cap.release()
@shjwudp
shjwudp / test.py
Created November 14, 2023 06:17
NCCL kernels and GEMM kernel parallel test
import os
import torch
import torch.multiprocessing as mp
import torch.distributed as dist
m = 48000
n = 960
k = 7680
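For reference, the arithmetic cost of the single GEMM defined by the m, n, k above is 2·m·n·k floating-point operations (one multiply and one add per output element per reduction step). A small sketch (the helper name is illustrative, not from the gist):

```python
# Hypothetical helper: FLOPs of one (m x k) @ (k x n) GEMM, 2mnk multiply-adds
def gemm_flops(m, n, k):
    return 2 * m * n * k

# Dimensions from the snippet above: ~0.71 TFLOPs per GEMM call
print(gemm_flops(48000, 960, 7680) / 1e12)
```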
@shjwudp
shjwudp / moe-etp-perf_w_o_etp.sh
Created November 1, 2023 09:56
moe-etp-perf baseline
#! /bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Export so that launched worker processes inherit the setting
export NCCL_DEBUG=INFO
DIR=$(pwd)
GPUS_PER_NODE=1
# Change for multinode config
@shjwudp
shjwudp / moe-etp-perf.sh
Created November 1, 2023 09:56
moe-etp-perf
#! /bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
# Export so that launched worker processes inherit the setting
export NCCL_DEBUG=INFO
DIR=$(pwd)
GPUS_PER_NODE=8
# Change for multinode config
@shjwudp
shjwudp / GPT_model_FLOPs.md
Created August 3, 2023 08:10
GPT model FLOPs

GPU one step FLOPs calculation

For FLOPs calculations, we follow the derivation from Narayanan et al. [13] and only consider the matrix multiplications (GEMMs), which are the main contributors to the number of floating-point operations. For the attention block, the main contributors are: key, query, and value transformation ($6Bsh^2$ operations), attention matrix computation ($2Bs^2h$ operations), attention over values ($2Bs^2h$ operations), and the post-attention linear projection ($2Bsh^2$ operations), where $B$ is the microbatch size, $s$ is the sequence length, and $h$ is the hidden size.

For the feed-forward network, which increases the hidden size to $4h$ and then reduces it back to $h$, we have $16Bsh^2$ floating-point operations. Summing these together, each transformer layer results in $24Bsh^2 + 4Bs^2h$ FLOPs for the forward pass. The other main contributor to the number of floating-point operations is the logits layer in the language model head, which transforms features of dimension $h$ to the vocabulary dimension $v$. The required FLOPs for this operation are $2Bshv$ in the forward pass.
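The terms above can be sketched directly in code. A minimal, illustrative implementation of the forward-pass FLOPs formula (function names are mine, not from the gist):

```python
# Forward-pass FLOPs of one transformer layer: 24*B*s*h^2 + 4*B*s^2*h
def layer_forward_flops(B, s, h):
    # QKV transform + attention scores + attention over values + output proj
    attention = 6 * B * s * h**2 + 2 * B * s**2 * h + 2 * B * s**2 * h + 2 * B * s * h**2
    # Feed-forward network: h -> 4h -> h
    ffn = 16 * B * s * h**2
    return attention + ffn

# Logits layer: project hidden size h to vocabulary size v
def logits_forward_flops(B, s, h, v):
    return 2 * B * s * h * v
```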

@shjwudp
shjwudp / ipython_magic_command_%env.md
Created July 29, 2023 16:15
Adjusting Jupyter Notebook/IPython's System Shell with %env Magic Command

IPython's system shell has a unique subshell mechanism that sometimes behaves differently from what we expect. However, we can use the %env built-in magic command to change the subshell configuration and adjust the execution environment to meet our needs.

For example, if we're working in an unfamiliar environment where the default shell is zsh, but we want to work under our favorite bash, we can type %env SHELL bash at the beginning of a Jupyter Notebook/IPython session to change the system shell. The setting then stays in effect for the rest of the session.
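Under the hood, %env VAR value sets an environment variable on the kernel process, equivalent to assigning through os.environ. A plain-Python sketch of the same effect:

```python
import os

# Equivalent of the IPython magic `%env SHELL bash`:
os.environ["SHELL"] = "bash"
print(os.environ["SHELL"])
```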

@shjwudp
shjwudp / find_nearest_neighbor.cpp
Created June 1, 2023 08:10
find_nearest_neighbor
#include <vector>

struct Node {
    int id;
};
struct Edge {
    int a;
    int b;
};
struct Graph {
    // Gist preview truncated here; plausible members for a graph:
    std::vector<Node> nodes;
    std::vector<Edge> edges;
};