Messing around with Glue

Motivation

I wanted to find out how much "power" Python Shell scripts are afforded in AWS Glue.

For context, an AWS Glue Job can generally run as either a Spark type or a Python Shell type.

Pricing differs between the two because of the minimum billed duration and the minimum DPUs that can be allocated.

In my case, we are trying to run simple Python scripts (with no need for PySpark), hence this investigation into Python Shell.

Setup

I declared a Python Shell (Glue 1.0, Python 3) Glue Job whose script outputs the available disk space and the number of CPU cores.

See test_environment.py below.
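
For reference, here is a minimal sketch of how such a job could be declared with boto3; the job name, IAM role ARN and S3 script location are placeholders, not from my actual setup:

import boto3

glue = boto3.client("glue")

# Declare a Python Shell job on Glue 1.0 with Python 3.
glue.create_job(
    Name="test-environment",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",  # hypothetical IAM role
    Command={
        "Name": "pythonshell",  # Python Shell job type (vs. "glueetl" for Spark)
        "ScriptLocation": "s3://my-bucket/scripts/test_environment.py",  # hypothetical S3 path
        "PythonVersion": "3",
    },
    GlueVersion="1.0",
    MaxCapacity=0.0625,  # default DPU capacity for Python Shell jobs; 1 is the other option
)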

Findings

This was tested with the standard worker type.

When running with Maximum capacity: 0.0625 (default DPU capacity),

Total: 19 GiB
Used: 5 GiB
Free: 14 GiB
no. of CPUs: 1

When running with Maximum capacity: 1,

Total: 19 GiB
Used: 5 GiB
Free: 14 GiB
no. of CPUs: 4

As noted in the AWS documentation, the disk space is the same in both cases.

The number of available CPUs is, as expected, different. With multiple CPUs, we can use multiprocessing, so this is certainly powerful 💪.
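
As an illustration only (not part of the original script), a minimal multiprocessing sketch that sizes a worker pool to the CPUs the Glue worker exposes:

import os
from multiprocessing import Pool

def square(n):
    # Trivial CPU-bound work, just for illustration.
    return n * n

if __name__ == "__main__":
    # Size the pool to the number of CPUs reported by the worker.
    with Pool(processes=os.cpu_count()) as pool:
        print(pool.map(square, range(10)))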

Some good resources I found on Python's multiprocessing module:

Notes

This was tested as of Dec 11, 2020.

# test_environment.py
import os
import shutil

# Report the disk space available at the root of the worker's filesystem.
total, used, free = shutil.disk_usage("/")
print("Total: %d GiB" % (total // (2**30)))
print("Used: %d GiB" % (used // (2**30)))
print("Free: %d GiB" % (free // (2**30)))

# Report the number of CPUs the worker exposes.
print(f"no. of CPUs: {os.cpu_count()}")