FastFile Mode
sagemaker.inputs.TrainingInput(S3_INPUT_FOLDER, input_mode='FastFile')
Recommended Setting For FSx for Lustre
- Use FSx for Lustre with Scratch 2 storage, which provides a baseline throughput of 200 MB/s and bursts of up to 1300 MB/s per TB of provisioned storage. (https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf)
- Give SageMaker Training Jobs Access to Resources in S3
- Give SageMaker Training Jobs Access to the Internet: create a NAT gateway and route public traffic through it with this route table entry:
  Destination: 0.0.0.0/0
  Target: nat-gateway-id
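- Use FileSystemInput as the training input:
from sagemaker.inputs import FileSystemInput

FileSystemInput(
    file_system_id="fs-0a8cb526dc288fada",
    file_system_type="FSxLustre",
    # "/<mount name>/<path relative to the S3 bucket name>"
    directory_path="/mount_name/path relative to s3 bucket name"
)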
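The directory_path value can be derived from the S3 location of the data. A minimal sketch, assuming the file system is linked to the bucket root so that paths below the mount name mirror the S3 object keys. `fsx_directory_path` and `make_fsx_input` are hypothetical helper names, and the mount name and bucket in the example call are made up.

```python
from urllib.parse import urlparse


def fsx_directory_path(mount_name: str, s3_uri: str) -> str:
    """Map an S3 URI to the matching path on the Lustre mount.

    Assumption: the file system is linked to the bucket root, so the
    object key doubles as the path below the mount name.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    key = parsed.path.strip("/")
    mount = mount_name.strip("/")
    return f"/{mount}/{key}" if key else f"/{mount}"


def make_fsx_input(file_system_id: str, mount_name: str, s3_uri: str):
    """Build a read-only FileSystemInput (needs the SageMaker Python SDK v2)."""
    # Imported lazily so the path helper also works without the SDK installed.
    from sagemaker.inputs import FileSystemInput

    return FileSystemInput(
        file_system_id=file_system_id,
        file_system_type="FSxLustre",
        directory_path=fsx_directory_path(mount_name, s3_uri),
        file_system_access_mode="ro",
    )


# Hypothetical mount name and bucket, for illustration only.
print(fsx_directory_path("abcdefgh", "s3://my-bucket/datasets/train/"))
```

The resulting object is passed to estimator.fit() like any other channel. The estimator also needs subnets and security_group_ids that can reach the file system.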
To preload files from S3 into the file system, run:
nohup find local/directory -type f -print0 | xargs -0 -n 1 sudo lfs hsm_restore &
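The same preload can be driven from Python, for example inside a setup script. A rough sketch, not a replacement for the command above: it batches several files per `lfs hsm_restore` call instead of one file per call, which is a variation of mine. `restore_commands` and `preload` are hypothetical names, and dry_run defaults to True so nothing is executed by accident.

```python
import os
import subprocess
from typing import Iterator, List


def restore_commands(root: str, batch_size: int = 100) -> Iterator[List[str]]:
    """Yield `lfs hsm_restore` command lines covering every file under root."""
    batch: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            batch.append(os.path.join(dirpath, name))
            if len(batch) == batch_size:
                yield ["sudo", "lfs", "hsm_restore", *batch]
                batch = []
    if batch:  # remaining files that did not fill a whole batch
        yield ["sudo", "lfs", "hsm_restore", *batch]


def preload(root: str, batch_size: int = 100, dry_run: bool = True) -> int:
    """Run (or, with dry_run, only print) the restore commands; return their count."""
    count = 0
    for cmd in restore_commands(root, batch_size):
        count += 1
        if dry_run:
            print(" ".join(cmd[:3]), f"<{len(cmd) - 3} files>")
        else:
            subprocess.run(cmd, check=True)
    return count
```

Call preload("local/directory", dry_run=False) on a machine that has the file system mounted and the Lustre client installed.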
Price: a 1.2 TB SSD-backed Scratch 2 file system with data compression disabled costs $168 per month ($140/TB/month).
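To size other capacities, the throughput and price figures in these notes scale linearly with provisioned storage. A small sketch of that arithmetic, assuming the per-TB numbers quoted here still apply; `estimate_scratch2` is a hypothetical helper, and actual pricing varies by region.

```python
from dataclasses import dataclass

BASELINE_MBPS_PER_TB = 200    # Scratch 2 baseline throughput
BURST_MBPS_PER_TB = 1300      # Scratch 2 burst throughput
PRICE_USD_PER_TB_MONTH = 140  # SSD-backed, data compression disabled


@dataclass
class Scratch2Estimate:
    baseline_mbps: float
    burst_mbps: float
    monthly_usd: float


def estimate_scratch2(capacity_tb: float) -> Scratch2Estimate:
    """Scale the per-TB figures to a given provisioned capacity."""
    if capacity_tb <= 0:
        raise ValueError("capacity_tb must be positive")
    return Scratch2Estimate(
        baseline_mbps=round(capacity_tb * BASELINE_MBPS_PER_TB, 2),
        burst_mbps=round(capacity_tb * BURST_MBPS_PER_TB, 2),
        monthly_usd=round(capacity_tb * PRICE_USD_PER_TB_MONTH, 2),
    )


print(estimate_scratch2(1.2))
```

For 1.2 TB this works out to 240 MB/s baseline, 1560 MB/s burst, and the $168 per month quoted above.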
Recommended Setting
Instance Count | Instance Type | vCPU | GPU Type | No. of GPU | Memory | Hourly Price |
---|---|---|---|---|---|---|
2~3 | ml.p4d.24xlarge | 96 | A100 | 8/instance | 1152 GB | $37.688 |
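A quick way to budget a run from the table. This sketch assumes the hourly price is per instance and that cost scales linearly with instance count and wall-clock hours; `cluster_summary` is a hypothetical helper.

```python
HOURLY_PRICE_USD = 37.688  # ml.p4d.24xlarge, taken from the table
GPUS_PER_INSTANCE = 8      # A100 GPUs per instance


def cluster_summary(instance_count: int, hours: float) -> dict:
    """Total GPUs, cluster hourly rate, and total cost for one training job."""
    if instance_count < 1 or hours < 0:
        raise ValueError("need at least one instance and non-negative hours")
    hourly = instance_count * HOURLY_PRICE_USD
    return {
        "gpus": instance_count * GPUS_PER_INSTANCE,
        "hourly_usd": round(hourly, 3),
        "total_usd": round(hourly * hours, 2),
    }


for n in (2, 3):
    print(n, cluster_summary(n, hours=10))
```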
# SageMaker Training fit input
SM_CHANNEL_TRAINING -> "/opt/ml/input/data/training"
SM_CHANNEL_MODEL -> "/opt/ml/input/data/model"
# source_dir (extracted here by the SageMaker training toolkit)
"/opt/ml/code"
# SageMaker Processing
"/opt/ml/processing/input"
"/opt/ml/processing/output"
# output
SM_MODEL_DIR -> "/opt/ml/model"
# Checkpoints
"/opt/ml/checkpoints/" # synced to S3 during training when checkpoint_s3_uri is set
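Inside a training script these locations are usually read from the environment rather than hard-coded, so the same script also runs locally. A minimal sketch, assuming the channels are named "training" and "model" as above; `resolve_paths` is a hypothetical helper, and the fallbacks are the default container paths listed in these notes.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def resolve_paths(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Read SageMaker paths from env vars, falling back to the container defaults."""
    env = os.environ if environ is None else environ
    return {
        "training": Path(env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training")),
        "model_channel": Path(env.get("SM_CHANNEL_MODEL", "/opt/ml/input/data/model")),
        "model_dir": Path(env.get("SM_MODEL_DIR", "/opt/ml/model")),
        # No SM_* variable is listed for checkpoints in these notes, so the path is fixed.
        "checkpoints": Path("/opt/ml/checkpoints"),
    }


if __name__ == "__main__":
    for name, path in resolve_paths().items():
        print(f"{name}: {path}")
```

For a local dry run, export SM_CHANNEL_TRAINING and SM_MODEL_DIR to point at local folders before starting the script.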