@sirbr
Created August 12, 2020 04:58
AWSTemplateFormatVersion: 2010-09-09
Parameters:
  BucketName:
    Type: String
    Description: Name of the bucket holding the raw JSON log data
  DataBaseName:
    Type: String
    Description: Glue database name
    Default: shoppingfeedeventsdb
  ScriptLocation:
    Type: String
    Description: Name of the bucket holding the Glue job script
  ParquetDataBucketName:
    Type: String
    Description: Name of the bucket receiving the Parquet output
Resources:
  GlueWorkflow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: Workflow to orchestrate the job and crawlers
      Name: GlueWorkflow
  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      DatabaseInput:
        Name: !Ref DataBaseName
        Description: AWS Glue container to hold metadata tables
      CatalogId: !Ref AWS::AccountId
  DataClassifier:
    Type: AWS::Glue::Classifier
    Properties:
      JsonClassifier:
        JsonPath: $[*]
        Name: GlueClassifier
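The classifier's custom JsonPath `$[*]` makes the crawler treat each element of a top-level JSON array as one record, rather than the whole array as a single row. A rough stand-alone illustration in plain Python (the sample documents are made up):

```python
import json

# A top-level JSON array, as the log bucket objects are expected to contain.
doc = json.loads('[{"sku": "A1", "price": 10}, {"sku": "B2", "price": 12}]')

# "$[*]" selects every element of the top-level array, so each element
# becomes one record in the crawled table.
records = list(doc)
print(len(records), records[0]["sku"])
```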
  JsonCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Classifiers:
        - !Ref DataClassifier
      Role: !GetAtt AWSGlueRole.Arn
      Description: AWS Glue crawler to crawl JSON data
      DatabaseName: !Ref DataBaseName
      Targets:
        S3Targets:
          - Path: !Sub "s3://${BucketName}/"
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
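The escaped `Configuration` string is easy to get wrong by hand. If the template is ever generated or patched by a script, one way to keep it readable is to author the configuration as a plain dict and serialize it; this sketch just mirrors the string above:

```python
import json

# Crawler configuration authored as a dict instead of a hand-escaped string.
crawler_config = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
}

# separators=(",", ":") reproduces the compact form used in the template.
configuration = json.dumps(crawler_config, separators=(",", ":"))
print(configuration)
```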
  AWSGlueRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "glue.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: "root"
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - s3:GetObject
                  - s3:PutObject
                  - s3:ReplicateObject
                  - s3:ListAllMyBuckets
                  - s3:RestoreObject
                  - s3:CreateBucket
                  - s3:ListBucket
                  - s3:DeleteObject
                  - s3:DeleteBucket
                  - glue:UpdateDatabase
                  - glue:CreateTable
                  - glue:CreateDatabase
                  - glue:UpdateTable
                  - glue:CreatePartition
                  - glue:UpdatePartition
                  - glue:GetTables
                  - glue:GetTable
                  - glue:GetDatabase
                  - glue:GetDatabases
                Resource: "*"
      ManagedPolicyArns:
        - "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
  GlueJob:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref AWSGlueRole
      Command:
        Name: glueetl
        ScriptLocation: !Sub "s3://${ScriptLocation}/jobscripts/JobScript.py"
      AllocatedCapacity: 5
      ExecutionProperty:
        MaxConcurrentRuns: 1
      DefaultArguments:
        "--db_name": !Ref DataBaseName
        "--s3_destination": !Sub "s3://${ParquetDataBucketName}/"
        "--job-language": "python"
      GlueVersion: "1.0"
      MaxRetries: 0
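Glue hands `DefaultArguments` to the job script as `--key value` pairs on its command line; inside a real job they are read with `awsglue.utils.getResolvedOptions`. Since `awsglue` only exists on a Glue cluster, this is a minimal argparse stand-in for local testing (the bucket name in the simulated argv is made up):

```python
import argparse

def resolve_options(argv, option_names):
    """Minimal stand-in for awsglue.utils.getResolvedOptions: pull the
    named --key value pairs out of an argv-style list, ignoring the rest."""
    parser = argparse.ArgumentParser()
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    args, _unknown = parser.parse_known_args(argv)
    return vars(args)

# Simulated argv, built from the DefaultArguments above.
argv = ["--db_name", "shoppingfeedeventsdb",
        "--s3_destination", "s3://my-parquet-bucket/",
        "--job-language", "python"]
opts = resolve_options(argv, ["db_name", "s3_destination"])
print(opts["db_name"], opts["s3_destination"])
```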
  ParquetCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Role: !GetAtt AWSGlueRole.Arn
      Description: AWS Glue crawler to crawl Parquet data
      DatabaseName: !Ref DataBaseName
      Targets:
        S3Targets:
          - Path: !Sub "s3://${ParquetDataBucketName}/"
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
  JsonDataCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: JsonCrawlerTrigger
      Type: SCHEDULED
      Description: Trigger for starting the workflow
      Schedule: cron(00 6 * * ? *)
      StartOnCreation: true
      Actions:
        - CrawlerName: !Ref JsonCrawler
      WorkflowName: !Ref GlueWorkflow
  JobTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: JobTrigger
      Type: CONDITIONAL
      StartOnCreation: true
      Description: Trigger to start the Glue job
      Actions:
        - JobName: !Ref GlueJob
      Predicate:
        Conditions:
          - LogicalOperator: EQUALS
            CrawlerName: !Ref JsonCrawler
            CrawlState: SUCCEEDED
      WorkflowName: !Ref GlueWorkflow
  ParquetDataCrawlerTrigger:
    Type: AWS::Glue::Trigger
    Properties:
      Name: ParquetCrawlerTrigger
      Type: CONDITIONAL
      StartOnCreation: true
      Description: Trigger to start the Parquet crawler
      Actions:
        - CrawlerName: !Ref ParquetCrawler
      Predicate:
        Conditions:
          - LogicalOperator: EQUALS
            JobName: !Ref GlueJob
            State: SUCCEEDED
      WorkflowName: !Ref GlueWorkflow