Yifu Wang (yifuwang)

Latency numbers every programmer should know

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
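
These are order-of-magnitude reference points rather than exact measurements. As a rough sanity check of the last line, here is a minimal sketch (my own, not part of the gist) that times a 1 MB sequential copy with PyTorch:

import time
import torch

# Minimal sketch (not from the gist): approximate the "read 1 MB sequentially"
# figure by timing a 1 MB buffer copy. Interpreter and allocator overheads mean
# this is only a rough check of the table above.
src = torch.empty(1024 * 1024, dtype=torch.uint8)
dst = torch.empty_like(src)
for _ in range(10):                      # warm up so pages/caches are primed
    dst.copy_(src)
iters = 1000
start = time.perf_counter()
for _ in range(iters):
    dst.copy_(src)
elapsed = time.perf_counter() - start
print(f"~{elapsed / iters * 1e6:.1f} us per 1 MB sequential copy")

A copy touches the buffer twice (read plus write), so expect a number in the same ballpark as the table entry rather than an exact match.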

tensor([0.0002]) tensor([0], dtype=torch.int32)
tensor([0.0002]) tensor([-1], dtype=torch.int32)
WARNING:root:Could not determine LOCAL_WORLD_SIZE from environment, falling back to WORLD_SIZE.
WARNING:root:Could not determine LOCAL_WORLD_SIZE from environment, falling back to WORLD_SIZE.
torch.float32
torch.float32
old weight tensor([[-0.0285, 0.0301, 0.0173, ..., -0.0305, -0.0288, -0.0027],
[-0.0224, -0.0263, 0.0212, ..., -0.0249, 0.0071, -0.0202],
[ 0.0125, 0.0225, 0.0154, ..., -0.0155, -0.0169, 0.0253],
...,
num_params=150 world_size=8 mixed=True Param size: 0.059 GB Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066)
num_params=54 world_size=8 mixed=True Param size: 1.453 GB Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417)
num_params=54 world_size=8 mixed=True Param size: 0.512 GB Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419)
num_params=50 world_size=8 mixed=True Param size: 0.200 GB Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410)
num_params=3 world_size=8 mixed=True Param size: 0.983 GB Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098)
num_params=9 world_size=8 mixed=True Param size: 0.802 GB Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134)
num_params=3 world_size=8 mixed=True Param size: 1.573 GB Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 0.099)
num_params=9 world_size=8 mix
num_params=150 world_size=8 mixed=True Param size: 0.059 GB Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460)
num_params=54 world_size=8 mixed=True Param size: 1.453 GB Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572)
num_params=54 world_size=8 mixed=True Param size: 0.512 GB Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587)
num_params=50 world_size=8 mixed=True Param size: 0.200 GB Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534)
num_params=3 world_size=8 mixed=True Param size: 0.983 GB Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084)
num_params=9 world_size=8 mixed=True Param size: 0.802 GB Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154)
num_params=3 world_size=8 mixed=True Param size: 1.573 GB Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087)
num_params=9 world_size=8 mix
from typing import Callable
import functools
import torch
SIZES = [
    torch.Size([256, 280]),
    torch.Size([256]),
    torch.Size([280, 256]),
    [torch.Size([32000, 8192]), torch.Size([8192, 8192]), torch.Size([1024, 8192]), torch.Size([1024, 8192]), torch.Size([8192, 8192]), torch.Size([28672, 8192]), torch.Size([8192, 28672]), torch.Size([28672, 8192]), torch.Size([8192]), torch.Size([8192]), torch.Size([8192, 8192]), torch.Size([1024, 8192]), torch.Size([1024, 8192]), torch.Size([8192, 8192]), torch.Size([28672, 8192]), torch.Size([8192, 28672]), torch.Size([28672, 8192]), torch.Size([8192]), torch.Size([8192]), torch.Size([8192, 8192]), torch.Size([1024, 8192]), torch.Size([1024, 8192]), torch.Size([8192, 8192]), torch.Size([28672, 8192]), torch.Size([8192, 28672]), torch.Size([28672, 8192]), torch.Size([8192]), torch.Size([8192]), torch.Size([8192, 8192]), torch.Size([1024, 8192]), torch.Size([1024, 8192]), torch.Size([8192,
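
The preview above is cut off before the timing logic. Purely as an illustration of how the Param size / Copy bandwidth / gpu ms/iter / cpu ms/iter numbers in the logs could be produced, here is a minimal, hypothetical sketch; bench_copy, the flat-buffer copy, and the dtype are assumptions, not necessarily what the original script does.

import time
import torch

def bench_copy(shapes, dtype=torch.bfloat16, iters=20):
    # Hypothetical sketch, not the original gist's code: copy a list of
    # parameter-sized tensors into one flat GPU buffer and report bandwidth
    # in the same style as the log lines above.
    params = [torch.randn(s, dtype=dtype, device="cuda") for s in shapes]
    numel = sum(p.numel() for p in params)
    flat = torch.empty(numel, dtype=dtype, device="cuda")

    def run_once():
        offset = 0
        for p in params:
            n = p.numel()
            flat[offset:offset + n].copy_(p.reshape(-1))
            offset += n

    run_once()                          # warm-up
    torch.cuda.synchronize()

    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    cpu_start = time.perf_counter()
    start_evt.record()
    for _ in range(iters):
        run_once()
    end_evt.record()
    cpu_ms = (time.perf_counter() - cpu_start) / iters * 1e3   # launch-side time
    torch.cuda.synchronize()
    gpu_ms = start_evt.elapsed_time(end_evt) / iters            # device time

    size_gb = numel * flat.element_size() / 1e9
    print(f"num_params={len(params)} Param size: {size_gb:.3f} GB "
          f"Copy bandwidth: {size_gb / (gpu_ms / 1e3):.3f} GB/s "
          f"(gpu ms/iter: {gpu_ms:.3f}, cpu ms/iter {cpu_ms:.3f})")

Called with a flat list of shapes, e.g. bench_copy([torch.Size([8192, 8192])] * 9), this prints one line per configuration in roughly the format of the logs; GPU time comes from CUDA events, while the CPU figure reflects launch-side time measured before synchronization.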