@lxbraga
Last active April 26, 2024 17:45
AWS EC2 GPU Monitoring

Automated GPU and CPU Monitoring on AWS EC2 Instances

This Gist provides a comprehensive solution for setting up automated GPU and CPU monitoring on AWS EC2 instances using a combination of Terraform configurations, Python scripts, and PowerShell scripts. The setup is designed to handle the creation of IAM roles, Lambda functions, and CloudWatch alarms, with a specific focus on Windows systems.

Key Features

  • Automated creation of IAM roles, Lambda functions, and CloudWatch alarms using Terraform
  • Dedicated Lambda function to create CloudWatch Alarms for automatic instance shutdown based on CPU or GPU usage metrics
  • Monitoring setup tailored for Windows systems
  • Utilization of AWS Systems Manager (SSM) for script execution on EC2 instances
  • Flexibility to host Python (as zipped files) and PowerShell scripts on an S3 bucket accessible to EC2 instances and SSM

Prerequisites

  • AWS account with necessary permissions to create and manage EC2 instances, IAM roles, Lambda functions, and CloudWatch alarms
  • Terraform installed on your local machine
  • Python scripts (zipped) and PowerShell scripts hosted on an S3 bucket accessible to EC2 instances and SSM

Setup Instructions

  1. Update the Terraform configuration files with your desired settings, such as instance types, monitoring thresholds, and S3 bucket details.

  2. Provide the name of the S3 bucket containing your zipped Python and PowerShell scripts as the bucket_name variable to the Terraform configuration.

  3. Run terraform init to initialize the Terraform working directory.

  4. Run terraform apply to create the necessary AWS resources, including IAM roles, Lambda functions, and CloudWatch alarms.

  5. Launch your EC2 instances with the appropriate IAM role and tags for monitoring.

  6. (Optional) Because a newly launched instance can take some time to become reachable via SSM, causing enable_gpu_monitoring.py to fail, you can instead embed the contents of GPUStats.ps1 directly in the instance's user data and remove the gpu_monitoring_lambda resources from the Terraform configuration.

Monitoring Workflow

  1. When an EC2 instance is launched with the appropriate IAM role and tags, the Lambda function is triggered.

  2. The Lambda function checks for the presence of the GPU_Monitoring tag on the instance.

  3. If the tag is set to true, the Lambda function sends commands via SSM to download and execute the GPU monitoring script on the instance.

  4. The GPU monitoring script collects GPU usage metrics and sends them to CloudWatch.

  5. The dedicated Lambda function creates CloudWatch Alarms based on the specified CPU or GPU usage thresholds.

  6. If the CPU or GPU usage exceeds the defined thresholds, the CloudWatch Alarm triggers an automatic shutdown of the instance.
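The tag and instance-type checks in steps 2-3, and the fallback to CPU monitoring, can be sketched as a pure function. This is an illustration only; choose_alarm is a hypothetical helper, not part of the gist:

```python
def choose_alarm(instance_type, tags):
    """Decide which idle-shutdown alarm to create, mirroring the workflow above.

    tags is a list of {'Key': ..., 'Value': ...} dicts, as boto3 returns them.
    """
    tag_map = {t['Key']: t['Value'] for t in (tags or [])}
    if instance_type.startswith('g'):  # GPU instance families (g4dn, g5, ...)
        if tag_map.get('GPU_Monitoring') == 'True':
            return 'GPU'
    elif tag_map.get('CPU_Monitoring') == 'True':
        return 'CPU'
    return None  # no monitoring tag set; no alarm is created
```

For example, choose_alarm('g4dn.xlarge', [{'Key': 'GPU_Monitoring', 'Value': 'True'}]) returns 'GPU', while an untagged instance yields None.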

Customization

  • Modify the Terraform configuration files to adjust instance types, monitoring thresholds, and other settings according to your requirements.

  • Update the Python and PowerShell scripts to collect additional metrics or perform specific actions based on your monitoring needs.

  • Customize the CloudWatch Alarm thresholds and actions to align with your resource management policies.
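As a starting point for threshold customization, the alarm parameters used by the autoshutdown Lambda can be gathered into a single helper. This is a sketch; build_alarm_kwargs and its defaults are assumptions, and the GPU metric name is taken from what GPUStats.ps1 publishes:

```python
def build_alarm_kwargs(instance_id, kind='CPU', threshold=5, period=600, eval_periods=3):
    """Assemble put_metric_alarm keyword arguments for an idle-shutdown alarm."""
    metric = 'CPUUtilization' if kind == 'CPU' else 'AggregatedGPUUtilization'
    namespace = 'AWS/EC2' if kind == 'CPU' else 'GPUStats'
    return {
        'AlarmName': f'{kind}_ALARM_{instance_id}',
        'AlarmDescription': f'Stop the instance when {kind} utilization stays at or below {threshold}%',
        'AlarmActions': ['arn:aws:automate:us-west-2:ec2:stop'],
        'MetricName': metric,
        'Namespace': namespace,
        'Statistic': 'Average',
        'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
        'Period': period,                 # seconds per evaluation window
        'EvaluationPeriods': eval_periods,
        'Threshold': threshold,           # percent utilization
        'ComparisonOperator': 'LessThanOrEqualToThreshold',
        'TreatMissingData': 'notBreaching',
    }
```

Centralizing the kwargs this way lets you change thresholds or periods in one place and pass the result straight to boto3's cloudwatch.put_metric_alarm(**kwargs).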

# autoshutdown.py
import boto3
import logging

# Lambda's root logger defaults to WARNING; raise it so the info logs below appear
logging.getLogger().setLevel(logging.INFO)


def put_cpu_alarm(instance_id):
    ec2 = boto3.resource('ec2')
    instance = ec2.Instance(instance_id)
    tag_workstation = False
    logging.info("Entering put_cpu_alarm function")
    for tag in instance.tags or []:
        logging.info(f"Checking tag: {tag}")
        if tag['Key'] == 'CPU_Monitoring' and tag['Value'] == 'True':
            tag_workstation = True
            break
    if tag_workstation:
        logging.info("Matched CPU_Monitoring tag, attempting to create alarm")
        cloudwatch = boto3.client('cloudwatch')
        cloudwatch.put_metric_alarm(
            AlarmName=f'CPU_ALARM_{instance_id}',
            AlarmDescription='Alarm when server CPU does not exceed 5%',
            AlarmActions=['arn:aws:automate:us-west-2:ec2:stop'],
            MetricName='CPUUtilization',
            Namespace='AWS/EC2',
            Statistic='Average',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            Period=600,
            EvaluationPeriods=3,
            Threshold=5,
            ComparisonOperator='LessThanOrEqualToThreshold',
            TreatMissingData='notBreaching'
        )
        logging.info(f"CPU Alarm created for instance: {instance_id}")
    else:
        logging.info(f"No matching tag found for CPU_Monitoring in instance: {instance_id}")
def put_gpu_alarm(instance_id):
    ec2 = boto3.resource('ec2')
    instance = ec2.Instance(instance_id)
    tag_workstation = False
    logging.info("Entering put_gpu_alarm function")
    for tag in instance.tags or []:
        logging.info(f"Checking tag: {tag}")
        if tag['Key'] == 'GPU_Monitoring' and tag['Value'] == 'True':
            tag_workstation = True
            break
    if tag_workstation:
        logging.info("Matched GPU_Monitoring tag, attempting to create alarm")
        cloudwatch = boto3.client('cloudwatch')
        cloudwatch.put_metric_alarm(
            AlarmName=f'GPU_ALARM_{instance_id}',
            AlarmDescription='Alarm when server GPU does not exceed 10%',
            AlarmActions=['arn:aws:automate:us-west-2:ec2:stop'],
            # Must match the metric name published by GPUStats.ps1
            MetricName='AggregatedGPUUtilization',
            Namespace='GPUStats',
            Statistic='Average',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            Period=1800,
            EvaluationPeriods=1,
            Threshold=10,
            ComparisonOperator='LessThanThreshold',
            TreatMissingData='notBreaching'
        )
        logging.info(f"GPU Alarm created for instance: {instance_id}")
    else:
        logging.info(f"No matching tag found for GPU_Monitoring in instance: {instance_id}")
def lambda_handler(event, context):
    instance_id = event['detail']['instance-id']
    ec2 = boto3.resource('ec2')
    instance = ec2.Instance(instance_id)
    logging.info(f"Lambda handler invoked for instance: {instance_id}")
    if instance.instance_type.startswith('g'):
        logging.info("Instance type starts with 'g', invoking put_gpu_alarm")
        put_gpu_alarm(instance_id)
    else:
        logging.info("Instance type does not start with 'g', invoking put_cpu_alarm")
        put_cpu_alarm(instance_id)
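Both Lambda functions in this Gist read the same fields from the EventBridge EC2 state-change event. A minimal sample of the shape they expect (all values are illustrative):

```python
# Sample EC2 Instance State-change Notification, as delivered by the event rule
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {
        "instance-id": "i-0123456789abcdef0",  # read by both lambda_handlers
        "state": "running",                    # the rule only forwards 'running'
    },
}

# This is the only extraction each handler performs on the event itself
instance_id = sample_event["detail"]["instance-id"]
```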
# enable_gpu_monitoring.py
import time

import boto3


def lambda_handler(event, context):
    ec2 = boto3.resource('ec2')
    ssm = boto3.client('ssm')
    # Extract instance ID from the CloudWatch event
    instance_id = event['detail']['instance-id']
    # Get instance details
    instance = ec2.Instance(instance_id)
    instance.load()
    # Check for the GPU_Monitoring tag
    gpu_monitoring = next((tag['Value'] for tag in instance.tags or [] if tag['Key'].lower() == 'gpu_monitoring'), None)
    if gpu_monitoring and gpu_monitoring.lower() == 'true':
        # Commands to download and execute the GPU stats script
        commands = [
            "New-Item -ItemType Directory -Force -Path 'C:\\Scripts' | Out-Null",
            "if (-not (Test-Path 'C:\\Scripts\\GPUStats.ps1')) { Invoke-WebRequest -Uri 'https://s3.amazonaws.com/trackit-cpu-gpu-monitoring/GPUStats.ps1' -OutFile 'C:\\Scripts\\GPUStats.ps1' }",
            "if (Test-Path 'C:\\Scripts\\GPUStats.ps1') { C:\\Scripts\\GPUStats.ps1 }"
        ]
        # Retry sending the command with a delay
        max_retries = 5
        retry_delay = 30  # seconds
        for attempt in range(max_retries):
            try:
                # Send command via SSM
                response = ssm.send_command(
                    InstanceIds=[instance_id],
                    DocumentName='AWS-RunPowerShellScript',
                    Parameters={'commands': commands},
                    Comment='Executing GPU stats collection setup'
                )
                # Wait for the command to finish executing
                command_id = response['Command']['CommandId']
                waiter = ssm.get_waiter('command_executed')
                waiter.wait(
                    CommandId=command_id,
                    InstanceId=instance_id,
                    WaiterConfig={
                        'Delay': 5,
                        'MaxAttempts': 6
                    }
                )
                # Convert datetime objects to strings so the response is JSON-serializable
                response['Command']['RequestedDateTime'] = response['Command']['RequestedDateTime'].isoformat()
                response['Command']['ExpiresAfter'] = response['Command']['ExpiresAfter'].isoformat()
                return response
            except ssm.exceptions.InvalidInstanceId:
                # Instance not registered with SSM yet; wait and retry
                if attempt < max_retries - 1:
                    time.sleep(retry_delay)
                else:
                    raise
    return 'No action needed - tag not set or set to false'
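The loop above is a fixed-delay retry on a single exception type. Extracted as a standalone sketch for reference (retry and flaky are hypothetical helpers, not part of the gist, with the delay shortened for illustration):

```python
import time

def retry(fn, retryable, max_retries=5, delay=30):
    """Call fn(), retrying on the given exception type with a fixed delay."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Demo: a function that fails twice before succeeding, like an instance
# that is not yet registered with SSM
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ValueError('instance not ready yet')
    return 'ok'
```

With delay=0 for illustration, retry(flaky, ValueError, delay=0) absorbs the first two failures and returns 'ok' on the third call.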
# Create a scheduled task that runs GPUStats.ps1 every 5 minutes
$taskName = 'Collect GPU Stats'
$description = 'Collect GPU Stats and Pass to Cloudwatch Custom Metrics'
$taskAction = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-File C:\Scripts\GPUStats.ps1'
$principal = New-ScheduledTaskPrincipal -UserID "System" -LogonType ServiceAccount -RunLevel Highest
$taskTrigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5)
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 2)
# mkdir fails if the directory already exists; New-Item -Force does not
New-Item -ItemType Directory -Force -Path 'C:\Scripts' | Out-Null
Set-Content -Path 'C:\Scripts\GPUStats.ps1' -Value @'
try {
    Import-Module -Name AWSPowerShell
    # Get stats from nvidia-smi
    $STATS = nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits
    $object = ConvertFrom-Csv -InputObject $STATS -Delimiter ','
    # Get the EC2 instance ID from the instance metadata service
    $instanceID = Invoke-RestMethod -Uri http://169.254.169.254/latest/meta-data/instance-id
    # Create dimension object
    $dimension = New-Object -TypeName Amazon.CloudWatch.Model.Dimension
    $dimension.Name = "InstanceId"
    $dimension.Value = $instanceID
    $totalGPUUtilization = 0
    $gpuCount = 0
    # Accumulate GPU utilization values across all GPUs
    foreach ($item in $object) {
        $gpuUtilization = [decimal]$item.'utilization.gpu [%]'
        $totalGPUUtilization += $gpuUtilization
        $gpuCount++
    }
    # Calculate average GPU utilization
    $averageGPUUtilization = $totalGPUUtilization / $gpuCount
    # Create MetricDatum for average GPU utilization
    $averageGpuUtilMetric = New-Object -TypeName Amazon.CloudWatch.Model.MetricDatum
    $averageGpuUtilMetric.MetricName = "AggregatedGPUUtilization"
    $averageGpuUtilMetric.Unit = "Percent"
    $averageGpuUtilMetric.Value = $averageGPUUtilization
    $averageGpuUtilMetric.TimestampUtc = (Get-Date).ToUniversalTime()
    $averageGpuUtilMetric.Dimensions.Add($dimension)
    # Publish the aggregate GPU utilization to the GPUStats namespace
    Write-CWMetricData -Namespace 'GPUStats' -MetricData $averageGpuUtilMetric
} catch {
    $ErrorMessage = $_.Exception.Message
    $FailedItem = $_.Exception.ItemName
    Add-Content -Path 'C:\Scripts\ErrorLog.txt' -Value "Error: $ErrorMessage; Item: $FailedItem"
}
Write-Output $object
exit
'@
Register-ScheduledTask -TaskName $taskName -Action $taskAction -Trigger $taskTrigger -Description $description -Settings $settings -Principal $principal
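The averaging inside GPUStats.ps1 is easy to verify against sample nvidia-smi output. The same computation in Python, using illustrative CSV data in the format nvidia-smi emits:

```python
import csv
import io

# Sample output of: nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits
# (two GPUs, illustrative values)
sample = "utilization.gpu [%]\n35\n45\n"

# Parse the CSV, keyed by the same header GPUStats.ps1 indexes into
rows = list(csv.DictReader(io.StringIO(sample)))
values = [float(r['utilization.gpu [%]']) for r in rows]

# Average across GPUs, published as the AggregatedGPUUtilization metric
average = sum(values) / len(values)
```

For the sample above, average is 40.0, which is the single value the script publishes per run.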
provider "aws" {
  region = "us-west-2"
}

data "aws_caller_identity" "current" {}

variable "bucket_name" {}

# Roles
resource "aws_iam_role" "gpu_monitoring_lambda_execution_role" {
  name = "gpu_monitoring_lambda_execution_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "lambda.amazonaws.com" }
      },
    ]
  })
}

resource "aws_iam_role" "usage_autoshutdown_lambda_execution_role" {
  name = "usage_autoshutdown_lambda_execution_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "lambda.amazonaws.com" }
      },
    ]
  })
}
# Role Policies
resource "aws_iam_role_policy" "usage_autoshutdown_lambda_policy" {
  name = "usage_autoshutdown_lambda_policy"
  role = aws_iam_role.usage_autoshutdown_lambda_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = ["ec2:DescribeInstances", "cloudwatch:PutMetricAlarm"],
        Resource = "*"
      },
    ]
  })
}
resource "aws_iam_role_policy" "gpu_monitoring_lambda_policy" {
  name = "gpu_monitoring_lambda_policy"
  role = aws_iam_role.gpu_monitoring_lambda_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Sid      = "VisualEditor0"
        Effect   = "Allow"
        Action   = "ec2:DescribeInstances"
        Resource = "*"
      },
      {
        Sid      = "VisualEditor1"
        Effect   = "Allow"
        Action   = "s3:GetObject"
        Resource = "arn:aws:s3:::${var.bucket_name}/*"
      },
      {
        Sid    = "VisualEditor2"
        Effect = "Allow"
        Action = "ssm:SendCommand"
        # ssm:SendCommand is scoped to the target instances and the document being run
        Resource = [
          "arn:aws:ec2:*:*:instance/*",
          "arn:aws:ssm:*:*:managed-instance/*",
          "arn:aws:ssm:*:*:document/AWS-RunPowerShellScript"
        ]
      },
      {
        Effect = "Allow",
        Action = [
          "ssm:GetCommandInvocation"
        ],
        Resource = "*"
      }
    ]
  })
}
resource "aws_iam_role_policy" "usage_autoshutdown_service_linked_role_policy" {
  name = "usage_autoshutdown_service_linked_role_policy"
  role = aws_iam_role.usage_autoshutdown_lambda_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = "iam:CreateServiceLinkedRole",
        Resource = "*",
        Condition = {
          StringEquals = {
            "iam:AWSServiceName" = "events.amazonaws.com"
          }
        }
      }
    ]
  })
}
resource "aws_iam_role_policy_attachment" "gpu_monitoring_basic_execution" {
  role       = aws_iam_role.gpu_monitoring_lambda_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy_attachment" "usage_autoshutdown_basic_execution" {
  role       = aws_iam_role.usage_autoshutdown_lambda_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
# Lambda Functions
resource "aws_lambda_function" "gpu_monitoring" {
  function_name = "ec2_enable_gpu_monitoring"
  handler       = "enable_gpu_monitoring.lambda_handler"
  runtime       = "python3.11"
  s3_bucket     = var.bucket_name # bucket hosting the zipped Lambda sources
  s3_key        = "enable_gpu_monitoring.zip"
  role          = aws_iam_role.gpu_monitoring_lambda_execution_role.arn
  timeout       = 120
}

resource "aws_lambda_function" "usage_autoshutdown" {
  function_name = "ec2_usage_autoshutdown"
  handler       = "autoshutdown.lambda_handler"
  runtime       = "python3.11"
  s3_bucket     = var.bucket_name
  s3_key        = "autoshutdown.zip"
  role          = aws_iam_role.usage_autoshutdown_lambda_execution_role.arn
  timeout       = 60
}
# CloudWatch Event Rule
resource "aws_cloudwatch_event_rule" "ec2_state_change" {
  name        = "ec2-running-state-change"
  description = "Triggers when EC2 instances move to the running state."

  event_pattern = jsonencode({
    source        = ["aws.ec2"],
    "detail-type" = ["EC2 Instance State-change Notification"],
    detail = {
      state = ["running"]
    }
  })
}

# CloudWatch Event Targets
resource "aws_cloudwatch_event_target" "invoke_gpu_monitoring" {
  rule      = aws_cloudwatch_event_rule.ec2_state_change.name
  target_id = "InvokeGPUMonitoringLambdaFunction"
  arn       = aws_lambda_function.gpu_monitoring.arn
}

resource "aws_cloudwatch_event_target" "invoke_usage_autoshutdown" {
  rule      = aws_cloudwatch_event_rule.ec2_state_change.name
  target_id = "InvokeUsageAutoShutdownLambdaFunction"
  arn       = aws_lambda_function.usage_autoshutdown.arn
}

# Lambda Permissions
resource "aws_lambda_permission" "gpu_monitoring_allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.gpu_monitoring.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ec2_state_change.arn
}

resource "aws_lambda_permission" "usage_autoshutdown_allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.usage_autoshutdown.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ec2_state_change.arn
}
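The event rule above forwards only running-state notifications to both Lambda functions. Its effect can be modeled as a simple predicate (a simplified sketch of EventBridge pattern matching against literal values, not the full matching specification):

```python
def matches_rule(event):
    """Return True if the event would match the ec2-running-state-change pattern."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Instance State-change Notification"
        # EventBridge patterns list allowed values; here only 'running' is listed
        and event.get("detail", {}).get("state") in ["running"]
    )
```

An instance moving to "stopped" or "pending" therefore never triggers either Lambda; only the transition to "running" does.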