@LiutongZhou
Last active February 10, 2023 15:55
Large-Scale Distributed Data and Model Parallel Training

Data Streaming

FastFile Mode

sagemaker.inputs.TrainingInput(S3_INPUT_FOLDER, input_mode='FastFile') 
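On the training-script side, FastFile mode exposes the S3 objects under the channel directory as ordinary files and streams their bytes when they are read, so reading in chunks avoids holding whole objects in memory. A minimal sketch, assuming the channel is named `training`; `iter_file_chunks`, `total_bytes`, and the 8 MiB chunk size are hypothetical choices made for illustration:

```python
import os
from pathlib import Path
from typing import Iterator

# Assumption: the channel is named "training", so SageMaker sets
# SM_CHANNEL_TRAINING in the container. The fallback is its usual path.
CHANNEL_DIR = Path(
    os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")
)


def iter_file_chunks(path: Path, chunk_size: int = 8 * 1024 * 1024) -> Iterator[bytes]:
    """Yield the file's bytes in chunks of at most chunk_size (hypothetical helper)."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def total_bytes(directory: Path) -> int:
    """Read every file under directory sequentially and return the byte count."""
    total = 0
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        for chunk in iter_file_chunks(path):
            total += len(chunk)
    return total
```

Sequential reads of this kind tend to suit streaming input better than random access, which is worth keeping in mind when choosing a data loader.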

Recommended Setting For FSx for Lustre

  1. Use FSx for Lustre with Scratch 2 storage, which provides a baseline throughput of 200 MB/s and bursts of up to 1300 MB/s per TB of provisioned storage. (https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf)
  2. Give SageMaker training jobs access to resources in S3.
  3. Give SageMaker training jobs access to the internet: create a NAT gateway and add a route with Destination 0.0.0.0/0 and Target nat-gateway-id.
  4. Use FileSystemInput:
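from sagemaker.inputs import FileSystemInput

# directory_path is the Lustre mount name followed by the path
# relative to the S3 bucket name.
FileSystemInput(
    file_system_id="fs-0a8cb526dc288fada",
    file_system_type="FSxLustre",
    directory_path="/mount_name/path relative to s3 bucket name",
)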

To preload files from S3 into the file system, run:

nohup find local/directory -type f -print0 | xargs -0 -n 1 sudo lfs hsm_restore &

Price: creating a 1.2 TB SSD-backed Scratch 2 file system with data compression disabled costs $168 per month ($140/TB/month).
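The throughput and price figures quoted in this section scale linearly with provisioned capacity, so a quick estimate for a given file system size can be sketched as follows. The constants are the numbers stated above; `fsx_scratch2_estimate` is a hypothetical helper, and actual pricing varies by region and over time:

```python
# Figures quoted in this section, per TB of provisioned Scratch 2 storage.
BASELINE_MBPS_PER_TB = 200
BURST_MBPS_PER_TB = 1300
PRICE_USD_PER_TB_MONTH = 140


def fsx_scratch2_estimate(capacity_tb: float) -> dict:
    """Return rough throughput and monthly cost for a given capacity in TB."""
    return {
        "baseline_mbps": round(capacity_tb * BASELINE_MBPS_PER_TB, 2),
        "burst_mbps": round(capacity_tb * BURST_MBPS_PER_TB, 2),
        "monthly_usd": round(capacity_tb * PRICE_USD_PER_TB_MONTH, 2),
    }


# The 1.2 TB file system discussed above.
estimate = fsx_scratch2_estimate(1.2)
print(estimate)
```

For the 1.2 TB case this gives a 240 MB/s baseline, a 1560 MB/s burst ceiling, and the $168 monthly cost stated above.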

Training Instance

Recommended Setting

| Instance Count | Instance Type | vCPU | GPU Type | No. of GPUs | Memory | Hourly Price |
| --- | --- | --- | --- | --- | --- | --- |
| 2 to 3 | ml.p4d.24xlarge | 96 | A100 | 8 per instance | 1152 GB | $37.688 |
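To budget a run on this configuration, the total GPU count and on-demand cost follow directly from the instance count. A minimal sketch using the figures in the table; `cluster_summary` is a hypothetical helper, and the hourly price is the one quoted here rather than a live value:

```python
# Figures from the recommended setting above (ml.p4d.24xlarge).
GPUS_PER_INSTANCE = 8
HOURLY_PRICE_USD = 37.688


def cluster_summary(instance_count: int, hours: float = 1.0) -> dict:
    """Return the total GPU count and estimated cost for a training run."""
    return {
        "gpus": instance_count * GPUS_PER_INSTANCE,
        "cost_usd": round(instance_count * HOURLY_PRICE_USD * hours, 3),
    }


for count in (2, 3):
    print(count, cluster_summary(count))
```

Two instances give 16 GPUs and three give 24; the hourly cost scales by the same factor.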

SageMaker Environment Variables and Paths

# SageMaker Training fit input 
SM_CHANNEL_TRAINING -> "/opt/ml/input/data/training"
SM_CHANNEL_MODEL -> "/opt/ml/input/data/model"

# source_dir  
"/opt/ml/input/code"

# SageMaker Processing
"/opt/ml/processing/input"          
"/opt/ml/processing/output"       

# output
SM_MODEL_DIR -> "/opt/ml/model" 

# Checkpoints
"/opt/ml/checkpoints/"    # output will be streamed to S3
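A training script can resolve these locations from the environment and fall back to the default paths when run outside SageMaker. The sketch below assumes a channel named `training`; `resolve_paths`, `save_checkpoint`, and `latest_checkpoint` are hypothetical helpers, and JSON stands in for whatever checkpoint format the training framework actually uses:

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_paths(env: Optional[Mapping[str, str]] = None) -> dict:
    """Map the SageMaker variables listed above to paths, with default fallbacks."""
    env = os.environ if env is None else env
    return {
        "train": Path(env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")),
        "model": Path(env.get("SM_MODEL_DIR", "/opt/ml/model")),
        # Checkpoints have no variable in the list above, so the path is fixed here.
        "checkpoints": Path("/opt/ml/checkpoints"),
    }


def save_checkpoint(state: dict, step: int, checkpoint_dir: Path) -> Path:
    """Write a checkpoint file named by zero-padded step so names sort by step."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f"step_{step:08d}.json"
    path.write_text(json.dumps(state))
    return path


def latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step, or None if there is none."""
    candidates = sorted(checkpoint_dir.glob("step_*.json"))
    return candidates[-1] if candidates else None
```

Checking `latest_checkpoint` at startup lets a restarted job resume from the most recent file instead of starting over.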