@lxbraga
Last active April 26, 2024 17:45
AWS EC2 GPU Monitoring

Automated GPU and CPU Monitoring on AWS EC2 Instances

This Gist provides a comprehensive solution for setting up automated GPU and CPU monitoring on AWS EC2 instances using a combination of Terraform configurations, Python scripts, and PowerShell scripts. The setup is designed to handle the creation of IAM roles, Lambda functions, and CloudWatch alarms, with a specific focus on Windows systems.

Key Features

  • Automated creation of IAM roles, Lambda functions, and CloudWatch alarms using Terraform
  • Dedicated Lambda function to create CloudWatch Alarms for automatic instance shutdown based on CPU or GPU usage metrics
  • Monitoring setup tailored for Windows systems
  • Utilization of AWS Systems Manager (SSM) for script execution on EC2 instances
  • Flexibility to host Python (as zipped files) and PowerShell scripts on an S3 bucket accessible to EC2 instances and SSM

Prerequisites

  • AWS account with necessary permissions to create and manage EC2 instances, IAM roles, Lambda functions, and CloudWatch alarms
  • Terraform installed on your local machine
  • Python scripts (zipped) and PowerShell scripts hosted on an S3 bucket accessible to EC2 instances and SSM

Setup Instructions

  1. Update the Terraform configuration files with your desired settings, such as instance types, monitoring thresholds, and S3 bucket details.

  2. Provide the name of the S3 bucket containing your zipped Python and PowerShell scripts as the bucket_name variable to the Terraform configuration.

  3. Run terraform init to initialize the Terraform working directory.

  4. Run terraform apply to create the necessary AWS resources, including IAM roles, Lambda functions, and CloudWatch alarms.

  5. Launch your EC2 instances with the appropriate IAM role and tags for monitoring.

  6. (Optional) Because a newly launched instance can take some time to become reachable via SSM, causing enable_gpu_monitoring.py to fail, you can instead embed the contents of GPUStats.ps1 directly in the instance's user data and remove the gpu_monitoring_lambda resources from the Terraform configuration.

Monitoring Workflow

  1. When an EC2 instance is launched with the appropriate IAM role and tags, the Lambda function is triggered.

  2. The Lambda function checks for the presence of the GPU_Monitoring tag on the instance.

  3. If the tag is set to true, the Lambda function sends commands via SSM to download and execute the GPU monitoring script on the instance.

  4. The GPU monitoring script collects GPU usage metrics and sends them to CloudWatch.

  5. The dedicated Lambda function creates CloudWatch Alarms based on the specified CPU or GPU usage thresholds.

  6. If the CPU or GPU usage exceeds the defined thresholds, the CloudWatch Alarm triggers an automatic shutdown of the instance.
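The tag and instance-type checks in steps 2-3, and the fallback to CPU monitoring, can be sketched as a pure function. This is an illustration only; choose_alarm is a hypothetical helper, not part of the gist:

```python
def choose_alarm(instance_type, tags):
    """Decide which idle-shutdown alarm to create, mirroring the workflow above.

    tags is a list of {'Key': ..., 'Value': ...} dicts, as boto3 returns them.
    """
    tag_map = {t['Key']: t['Value'] for t in (tags or [])}
    if instance_type.startswith('g'):  # GPU instance families (g4dn, g5, ...)
        if tag_map.get('GPU_Monitoring') == 'True':
            return 'GPU'
    elif tag_map.get('CPU_Monitoring') == 'True':
        return 'CPU'
    return None  # no monitoring tag set; no alarm is created
```

For example, choose_alarm('g4dn.xlarge', [{'Key': 'GPU_Monitoring', 'Value': 'True'}]) returns 'GPU', while an untagged instance yields None.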

Customization

  • Modify the Terraform configuration files to adjust instance types, monitoring thresholds, and other settings according to your requirements.

  • Update the Python and PowerShell scripts to collect additional metrics or perform specific actions based on your monitoring needs.

  • Customize the CloudWatch Alarm thresholds and actions to align with your resource management policies.
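As a starting point for threshold customization, the alarm parameters used by the autoshutdown Lambda can be gathered into a single helper. This is a sketch; build_alarm_kwargs and its defaults are assumptions, and the GPU metric name is taken from what GPUStats.ps1 publishes:

```python
def build_alarm_kwargs(instance_id, kind='CPU', threshold=5, period=600, eval_periods=3):
    """Assemble put_metric_alarm keyword arguments for an idle-shutdown alarm."""
    metric = 'CPUUtilization' if kind == 'CPU' else 'AggregatedGPUUtilization'
    namespace = 'AWS/EC2' if kind == 'CPU' else 'GPUStats'
    return {
        'AlarmName': f'{kind}_ALARM_{instance_id}',
        'AlarmDescription': f'Stop the instance when {kind} utilization stays at or below {threshold}%',
        'AlarmActions': ['arn:aws:automate:us-west-2:ec2:stop'],
        'MetricName': metric,
        'Namespace': namespace,
        'Statistic': 'Average',
        'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
        'Period': period,                 # seconds per evaluation window
        'EvaluationPeriods': eval_periods,
        'Threshold': threshold,           # percent utilization
        'ComparisonOperator': 'LessThanOrEqualToThreshold',
        'TreatMissingData': 'notBreaching',
    }
```

Centralizing the kwargs this way lets you change thresholds or periods in one place and pass the result straight to boto3's cloudwatch.put_metric_alarm(**kwargs).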

# autoshutdown.py
import boto3
import logging

# Lambda's root logger defaults to WARNING; raise it so the info logs below appear
logging.getLogger().setLevel(logging.INFO)


def put_cpu_alarm(instance_id):
    ec2 = boto3.resource('ec2')
    instance = ec2.Instance(instance_id)
    tag_workstation = False
    logging.info("Entering put_cpu_alarm function")
    for tag in instance.tags or []:
        logging.info(f"Checking tag: {tag}")
        if tag['Key'] == 'CPU_Monitoring' and tag['Value'] == 'True':
            tag_workstation = True
            break
    if tag_workstation:
        logging.info("Matched CPU_Monitoring tag, attempting to create alarm")
        cloudwatch = boto3.client('cloudwatch')
        cloudwatch.put_metric_alarm(
            AlarmName=f'CPU_ALARM_{instance_id}',
            AlarmDescription='Alarm when server CPU does not exceed 5%',
            AlarmActions=['arn:aws:automate:us-west-2:ec2:stop'],
            MetricName='CPUUtilization',
            Namespace='AWS/EC2',
            Statistic='Average',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            Period=600,
            EvaluationPeriods=3,
            Threshold=5,
            ComparisonOperator='LessThanOrEqualToThreshold',
            TreatMissingData='notBreaching'
        )
        logging.info(f"CPU Alarm created for instance: {instance_id}")
    else:
        logging.info(f"No matching tag found for CPU_Monitoring in instance: {instance_id}")
def put_gpu_alarm(instance_id):
    ec2 = boto3.resource('ec2')
    instance = ec2.Instance(instance_id)
    tag_workstation = False
    logging.info("Entering put_gpu_alarm function")
    for tag in instance.tags or []:
        logging.info(f"Checking tag: {tag}")
        if tag['Key'] == 'GPU_Monitoring' and tag['Value'] == 'True':
            tag_workstation = True
            break
    if tag_workstation:
        logging.info("Matched GPU_Monitoring tag, attempting to create alarm")
        cloudwatch = boto3.client('cloudwatch')
        cloudwatch.put_metric_alarm(
            AlarmName=f'GPU_ALARM_{instance_id}',
            AlarmDescription='Alarm when server GPU does not exceed 10%',
            AlarmActions=['arn:aws:automate:us-west-2:ec2:stop'],
            # Must match the metric name published by GPUStats.ps1
            MetricName='AggregatedGPUUtilization',
            Namespace='GPUStats',
            Statistic='Average',
            Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
            Period=1800,
            EvaluationPeriods=1,
            Threshold=10,
            ComparisonOperator='LessThanThreshold',
            TreatMissingData='notBreaching'
        )
        logging.info(f"GPU Alarm created for instance: {instance_id}")
    else:
        logging.info(f"No matching tag found for GPU_Monitoring in instance: {instance_id}")
def lambda_handler(event, context):
    instance_id = event['detail']['instance-id']
    ec2 = boto3.resource('ec2')
    instance = ec2.Instance(instance_id)
    logging.info(f"Lambda handler invoked for instance: {instance_id}")
    if instance.instance_type.startswith('g'):
        logging.info("Instance type starts with 'g', invoking put_gpu_alarm")
        put_gpu_alarm(instance_id)
    else:
        logging.info("Instance type does not start with 'g', invoking put_cpu_alarm")
        put_cpu_alarm(instance_id)
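Both Lambda functions in this Gist read the same fields from the EventBridge EC2 state-change event. A minimal sample of the shape they expect (all values are illustrative):

```python
# Sample EC2 Instance State-change Notification, as delivered by the event rule
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {
        "instance-id": "i-0123456789abcdef0",  # read by both lambda_handlers
        "state": "running",                    # the rule only forwards 'running'
    },
}

# This is the only extraction each handler performs on the event itself
instance_id = sample_event["detail"]["instance-id"]
```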
# enable_gpu_monitoring.py
import time

import boto3


def lambda_handler(event, context):
    ec2 = boto3.resource('ec2')
    ssm = boto3.client('ssm')
    # Extract instance ID from the CloudWatch event
    instance_id = event['detail']['instance-id']
    # Get instance details
    instance = ec2.Instance(instance_id)
    instance.load()
    # Check for the GPU_Monitoring tag
    gpu_monitoring = next((tag['Value'] for tag in instance.tags or [] if tag['Key'].lower() == 'gpu_monitoring'), None)
    if gpu_monitoring and gpu_monitoring.lower() == 'true':
        # Commands to download and execute the GPU stats script
        commands = [
            "New-Item -ItemType Directory -Force -Path 'C:\\Scripts' | Out-Null",
            "if (-not (Test-Path 'C:\\Scripts\\GPUStats.ps1')) { Invoke-WebRequest -Uri 'https://s3.amazonaws.com/trackit-cpu-gpu-monitoring/GPUStats.ps1' -OutFile 'C:\\Scripts\\GPUStats.ps1' }",
            "if (Test-Path 'C:\\Scripts\\GPUStats.ps1') { C:\\Scripts\\GPUStats.ps1 }"
        ]
        # Retry sending the command with a delay
        max_retries = 5
        retry_delay = 30  # seconds
        for attempt in range(max_retries):
            try:
                # Send command via SSM
                response = ssm.send_command(
                    InstanceIds=[instance_id],
                    DocumentName='AWS-RunPowerShellScript',
                    Parameters={'commands': commands},
                    Comment='Executing GPU stats collection setup'
                )
                # Wait for the command to finish executing
                command_id = response['Command']['CommandId']
                waiter = ssm.get_waiter('command_executed')
                waiter.wait(
                    CommandId=command_id,
                    InstanceId=instance_id,
                    WaiterConfig={
                        'Delay': 5,
                        'MaxAttempts': 6
                    }
                )
                # Convert datetime objects to strings so the response is JSON-serializable
                response['Command']['RequestedDateTime'] = response['Command']['RequestedDateTime'].isoformat()
                response['Command']['ExpiresAfter'] = response['Command']['ExpiresAfter'].isoformat()
                return response
            except ssm.exceptions.InvalidInstanceId:
                # Instance not registered with SSM yet; wait and retry
                if attempt < max_retries - 1:
                    time.sleep(retry_delay)
                else:
                    raise
    return 'No action needed - tag not set or set to false'
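The loop above is a fixed-delay retry on a single exception type. Extracted as a standalone sketch for reference (retry and flaky are hypothetical helpers, not part of the gist, with the delay shortened for illustration):

```python
import time

def retry(fn, retryable, max_retries=5, delay=30):
    """Call fn(), retrying on the given exception type with a fixed delay."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay)

# Demo: a function that fails twice before succeeding, like an instance
# that is not yet registered with SSM
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ValueError('instance not ready yet')
    return 'ok'
```

With delay=0 for illustration, retry(flaky, ValueError, delay=0) absorbs the first two failures and returns 'ok' on the third call.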
# Create a scheduled task that runs GPUStats.ps1 every 5 minutes
$taskName = 'Collect GPU Stats'
$description = 'Collect GPU Stats and Pass to Cloudwatch Custom Metrics'
$taskAction = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument '-File C:\Scripts\GPUStats.ps1'
$principal = New-ScheduledTaskPrincipal -UserID "System" -LogonType ServiceAccount -RunLevel Highest
$taskTrigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Minutes 5)
$settings = New-ScheduledTaskSettingsSet -ExecutionTimeLimit (New-TimeSpan -Minutes 2)
# mkdir fails if the directory already exists; New-Item -Force does not
New-Item -ItemType Directory -Force -Path 'C:\Scripts' | Out-Null
Set-Content -Path 'C:\Scripts\GPUStats.ps1' -Value @'
try {
    Import-Module -Name AWSPowerShell
    # Get stats from nvidia-smi
    $STATS = nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits
    $object = ConvertFrom-Csv -InputObject $STATS -Delimiter ','
    # Get the EC2 instance ID from the instance metadata service
    $instanceID = Invoke-RestMethod -Uri http://169.254.169.254/latest/meta-data/instance-id
    # Create dimension object
    $dimension = New-Object -TypeName Amazon.CloudWatch.Model.Dimension
    $dimension.Name = "InstanceId"
    $dimension.Value = $instanceID
    $totalGPUUtilization = 0
    $gpuCount = 0
    # Accumulate GPU utilization values across all GPUs
    foreach ($item in $object) {
        $gpuUtilization = [decimal]$item.'utilization.gpu [%]'
        $totalGPUUtilization += $gpuUtilization
        $gpuCount++
    }
    # Calculate average GPU utilization
    $averageGPUUtilization = $totalGPUUtilization / $gpuCount
    # Create MetricDatum for average GPU utilization
    $averageGpuUtilMetric = New-Object -TypeName Amazon.CloudWatch.Model.MetricDatum
    $averageGpuUtilMetric.MetricName = "AggregatedGPUUtilization"
    $averageGpuUtilMetric.Unit = "Percent"
    $averageGpuUtilMetric.Value = $averageGPUUtilization
    $averageGpuUtilMetric.TimestampUtc = (Get-Date).ToUniversalTime()
    $averageGpuUtilMetric.Dimensions.Add($dimension)
    # Publish the aggregate GPU utilization to the GPUStats namespace
    Write-CWMetricData -Namespace 'GPUStats' -MetricData $averageGpuUtilMetric
} catch {
    $ErrorMessage = $_.Exception.Message
    $FailedItem = $_.Exception.ItemName
    Add-Content -Path 'C:\Scripts\ErrorLog.txt' -Value "Error: $ErrorMessage; Item: $FailedItem"
}
Write-Output $object
exit
'@
Register-ScheduledTask -TaskName $taskName -Action $taskAction -Trigger $taskTrigger -Description $description -Settings $settings -Principal $principal
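The averaging inside GPUStats.ps1 is easy to verify against sample nvidia-smi output. The same computation in Python, using illustrative CSV data in the format nvidia-smi emits:

```python
import csv
import io

# Sample output of: nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits
# (two GPUs, illustrative values)
sample = "utilization.gpu [%]\n35\n45\n"

# Parse the CSV, keyed by the same header GPUStats.ps1 indexes into
rows = list(csv.DictReader(io.StringIO(sample)))
values = [float(r['utilization.gpu [%]']) for r in rows]

# Average across GPUs, published as the AggregatedGPUUtilization metric
average = sum(values) / len(values)
```

For the sample above, average is 40.0, which is the single value the script publishes per run.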
provider "aws" {
  region = "us-west-2"
}

data "aws_caller_identity" "current" {}

variable "bucket_name" {}

# Roles
resource "aws_iam_role" "gpu_monitoring_lambda_execution_role" {
  name = "gpu_monitoring_lambda_execution_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "lambda.amazonaws.com" }
      },
    ]
  })
}

resource "aws_iam_role" "usage_autoshutdown_lambda_execution_role" {
  name = "usage_autoshutdown_lambda_execution_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "lambda.amazonaws.com" }
      },
    ]
  })
}
# Role Policies
resource "aws_iam_role_policy" "usage_autoshutdown_lambda_policy" {
  name = "usage_autoshutdown_lambda_policy"
  role = aws_iam_role.usage_autoshutdown_lambda_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = ["ec2:DescribeInstances", "cloudwatch:PutMetricAlarm"],
        Resource = "*"
      },
    ]
  })
}
resource "aws_iam_role_policy" "gpu_monitoring_lambda_policy" {
  name = "gpu_monitoring_lambda_policy"
  role = aws_iam_role.gpu_monitoring_lambda_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Sid      = "VisualEditor0"
        Effect   = "Allow"
        Action   = "ec2:DescribeInstances"
        Resource = "*"
      },
      {
        Sid      = "VisualEditor1"
        Effect   = "Allow"
        Action   = "s3:GetObject"
        Resource = "arn:aws:s3:::${var.bucket_name}/*"
      },
      {
        Sid    = "VisualEditor2"
        Effect = "Allow"
        Action = "ssm:SendCommand"
        # ssm:SendCommand is scoped to the target instances and the document being run
        Resource = [
          "arn:aws:ec2:*:*:instance/*",
          "arn:aws:ssm:*:*:managed-instance/*",
          "arn:aws:ssm:*:*:document/AWS-RunPowerShellScript"
        ]
      },
      {
        Effect = "Allow",
        Action = [
          "ssm:GetCommandInvocation"
        ],
        Resource = "*"
      }
    ]
  })
}
resource "aws_iam_role_policy" "usage_autoshutdown_service_linked_role_policy" {
  name = "usage_autoshutdown_service_linked_role_policy"
  role = aws_iam_role.usage_autoshutdown_lambda_execution_role.id

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect   = "Allow",
        Action   = "iam:CreateServiceLinkedRole",
        Resource = "*",
        Condition = {
          StringEquals = {
            "iam:AWSServiceName" = "events.amazonaws.com"
          }
        }
      }
    ]
  })
}
resource "aws_iam_role_policy_attachment" "gpu_monitoring_basic_execution" {
  role       = aws_iam_role.gpu_monitoring_lambda_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy_attachment" "usage_autoshutdown_basic_execution" {
  role       = aws_iam_role.usage_autoshutdown_lambda_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
# Lambda Functions
resource "aws_lambda_function" "gpu_monitoring" {
  function_name = "ec2_enable_gpu_monitoring"
  handler       = "enable_gpu_monitoring.lambda_handler"
  runtime       = "python3.11"
  s3_bucket     = var.bucket_name # bucket hosting the zipped Lambda sources
  s3_key        = "enable_gpu_monitoring.zip"
  role          = aws_iam_role.gpu_monitoring_lambda_execution_role.arn
  timeout       = 120
}

resource "aws_lambda_function" "usage_autoshutdown" {
  function_name = "ec2_usage_autoshutdown"
  handler       = "autoshutdown.lambda_handler"
  runtime       = "python3.11"
  s3_bucket     = var.bucket_name
  s3_key        = "autoshutdown.zip"
  role          = aws_iam_role.usage_autoshutdown_lambda_execution_role.arn
  timeout       = 60
}
# CloudWatch Event Rule
resource "aws_cloudwatch_event_rule" "ec2_state_change" {
  name        = "ec2-running-state-change"
  description = "Triggers when EC2 instances move to the running state."

  event_pattern = jsonencode({
    source        = ["aws.ec2"],
    "detail-type" = ["EC2 Instance State-change Notification"],
    detail = {
      state = ["running"]
    }
  })
}

# CloudWatch Event Targets
resource "aws_cloudwatch_event_target" "invoke_gpu_monitoring" {
  rule      = aws_cloudwatch_event_rule.ec2_state_change.name
  target_id = "InvokeGPUMonitoringLambdaFunction"
  arn       = aws_lambda_function.gpu_monitoring.arn
}

resource "aws_cloudwatch_event_target" "invoke_usage_autoshutdown" {
  rule      = aws_cloudwatch_event_rule.ec2_state_change.name
  target_id = "InvokeUsageAutoShutdownLambdaFunction"
  arn       = aws_lambda_function.usage_autoshutdown.arn
}

# Lambda Permissions
resource "aws_lambda_permission" "gpu_monitoring_allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.gpu_monitoring.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ec2_state_change.arn
}

resource "aws_lambda_permission" "usage_autoshutdown_allow_eventbridge" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.usage_autoshutdown.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ec2_state_change.arn
}
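The event rule above forwards only running-state notifications to both Lambda functions. Its effect can be modeled as a simple predicate (a simplified sketch of EventBridge pattern matching against literal values, not the full matching specification):

```python
def matches_rule(event):
    """Return True if the event would match the ec2-running-state-change pattern."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Instance State-change Notification"
        # EventBridge patterns list allowed values; here only 'running' is listed
        and event.get("detail", {}).get("state") in ["running"]
    )
```

An instance moving to "stopped" or "pending" therefore never triggers either Lambda; only the transition to "running" does.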