This Gist sets up automated GPU and CPU monitoring on AWS EC2 instances using a combination of Terraform configurations, Python scripts, and PowerShell scripts. It handles the creation of IAM roles, Lambda functions, and CloudWatch alarms, with a specific focus on Windows systems.
- Automated creation of IAM roles, Lambda functions, and CloudWatch alarms using Terraform
- Dedicated Lambda function to create CloudWatch Alarms for automatic instance shutdown based on CPU or GPU usage metrics
- Monitoring setup tailored for Windows systems
- Utilization of AWS Systems Manager (SSM) for script execution on EC2 instances
- Flexibility to host Python (as zipped files) and PowerShell scripts on an S3 bucket accessible to EC2 instances and SSM
- AWS account with necessary permissions to create and manage EC2 instances, IAM roles, Lambda functions, and CloudWatch alarms
- Terraform installed on your local machine
- Python scripts (zipped) and PowerShell scripts hosted on an S3 bucket accessible to EC2 instances and SSM
- Update the Terraform configuration files with your desired settings, such as instance types, monitoring thresholds, and S3 bucket details.
- Provide the S3 bucket URL containing your Python (zipped) and PowerShell scripts as a variable to the Terraform script.
- Run `terraform init` to initialize the Terraform working directory.
- Run `terraform apply` to create the necessary AWS resources, including IAM roles, Lambda functions, and CloudWatch alarms.
- Launch your EC2 instances with the appropriate IAM role and tags for monitoring.
- (Optional) Running `enable_gpu_monitoring.py` externally can ultimately fail because the instance takes some time to start. If you prefer to avoid that, incorporate the contents of `GPUStats.ps1` directly into the instance's user data and remove the `gpu_monitoring_lambda` sections from Terraform.
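
For the launch step, a minimal sketch of how an instance could be started with the IAM role and the `GPU_Monitoring` tag through boto3. The helper names `build_run_instances_kwargs` and `launch_instance`, and the AMI, instance type, and instance profile values in the usage note, are hypothetical placeholders and not part of this Gist.

```python
def build_run_instances_kwargs(ami_id, instance_type, instance_profile_name,
                               gpu_monitoring=True):
    """Build keyword arguments for boto3's EC2 run_instances call.

    The instance profile name is an assumption: use whatever profile
    your Terraform configuration actually creates.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Attach the IAM role (via its instance profile) so SSM and
        # CloudWatch calls made from the instance are authorized.
        "IamInstanceProfile": {"Name": instance_profile_name},
        # Tag the instance at launch so the Lambda function can see it.
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {
                        "Key": "GPU_Monitoring",
                        "Value": "true" if gpu_monitoring else "false",
                    }
                ],
            }
        ],
    }


def launch_instance(ami_id, instance_type, instance_profile_name):
    """Launch one instance and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(
        **build_run_instances_kwargs(ami_id, instance_type, instance_profile_name)
    )
    return response["Instances"][0]["InstanceId"]
```

Calling `launch_instance("ami-0123456789abcdef0", "g4dn.xlarge", "gpu-monitoring-profile")` would launch a single tagged instance. Substitute your own AMI ID and the profile name from your Terraform output.
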
- When an EC2 instance is launched with the appropriate IAM role and tags, the Lambda function is triggered.
- The Lambda function checks for the presence of the `GPU_Monitoring` tag on the instance.
- If the tag is set to `true`, the Lambda function sends commands via SSM to download and execute the GPU monitoring script on the instance.
- The GPU monitoring script collects GPU usage metrics and sends them to CloudWatch.
- The dedicated Lambda function creates CloudWatch Alarms based on the specified CPU or GPU usage thresholds.
- If the CPU or GPU usage exceeds the defined thresholds, the CloudWatch Alarm triggers an automatic shutdown of the instance.
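
The tag check and SSM step described above could look roughly like the following. This is a sketch under several assumptions and not the Gist's actual Lambda code. It assumes the function is triggered by an EventBridge "EC2 Instance State-change Notification" event. It also assumes the bucket and key arrive through hypothetical `SCRIPT_BUCKET` and `SCRIPT_KEY` environment variables, and that the AWS Tools for PowerShell (which provide `Read-S3Object`) are available on the instance.

```python
import os


def wants_gpu_monitoring(tags):
    """Return True when the tag list has GPU_Monitoring set to 'true'.

    The comparison is case-insensitive here, which is a choice made
    for this sketch rather than documented behaviour.
    """
    return any(
        t.get("Key") == "GPU_Monitoring" and t.get("Value", "").lower() == "true"
        for t in tags
    )


def build_ssm_command(instance_id, bucket, key,
                      local_path=r"C:\Windows\Temp\GPUStats.ps1"):
    """Build send_command arguments that download and run the script."""
    commands = [
        # Download the script from S3 (assumes AWS Tools for PowerShell).
        f"Read-S3Object -BucketName {bucket} -Key {key} -File {local_path}",
        # Run the downloaded script.
        f"powershell.exe -ExecutionPolicy Bypass -File {local_path}",
    ]
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunPowerShellScript",
        "Parameters": {"commands": commands},
    }


def lambda_handler(event, context):
    """Entry point; assumes an EC2 state-change event from EventBridge."""
    import boto3  # available in the Lambda Python runtime

    instance_id = event["detail"]["instance-id"]
    tags = boto3.client("ec2").describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    if not wants_gpu_monitoring(tags):
        return {"instance": instance_id, "monitoring": False}

    boto3.client("ssm").send_command(
        **build_ssm_command(
            instance_id, os.environ["SCRIPT_BUCKET"], os.environ["SCRIPT_KEY"]
        )
    )
    return {"instance": instance_id, "monitoring": True}
```

Keeping the tag check and the command construction in plain functions makes them easy to test without AWS access. Only `lambda_handler` talks to EC2 and SSM.
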
- Modify the Terraform configuration files to adjust instance types, monitoring thresholds, and other settings according to your requirements.
- Update the Python and PowerShell scripts to collect additional metrics or perform specific actions based on your monitoring needs.
- Customize the CloudWatch Alarm thresholds and actions to align with your resource management policies.
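
As an illustration of which alarm settings can be customized, here is a hedged sketch of the arguments a Lambda function might pass to CloudWatch's `put_metric_alarm`. The helper names `build_stop_alarm` and `create_stop_alarm` and all default values (threshold, period, evaluation periods) are assumptions for this example and not the values used in this Gist. For a custom GPU metric, the sketch also assumes the monitoring script publishes it with an `InstanceId` dimension.

```python
def build_stop_alarm(instance_id, region, metric_name="CPUUtilization",
                     namespace="AWS/EC2", threshold=90.0, period_seconds=300,
                     evaluation_periods=3, comparison="GreaterThanThreshold"):
    """Build put_metric_alarm arguments that stop the instance on alarm."""
    return {
        "AlarmName": f"{instance_id}-{metric_name}-stop",
        "Namespace": namespace,
        "MetricName": metric_name,
        # The alarm watches this one instance's metric.
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        # Built-in EC2 alarm action that stops the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:stop"],
    }


def create_stop_alarm(instance_id, region, **overrides):
    """Create the alarm in CloudWatch (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(
        **build_stop_alarm(instance_id, region, **overrides)
    )
```

For example, `build_stop_alarm("i-0abc", "us-east-1", threshold=80.0)` lowers the threshold to 80 while keeping the other defaults. Passing a different `comparison` or replacing the `AlarmActions` entry changes when and how the alarm acts.
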