Andrew Otto (ottomata), GitHub Gists
cd ~/
mkdir flink-sql-libs
cd flink-sql-libs/
wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka/1.17.2/flink-connector-kafka-1.17.2.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.4.0/kafka-clients-3.4.0.jar
# Only need this if querying WMF event streams.
# https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Processing/Flink_Catalog#Creating_Tables
# wget https://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities-flink/1.3.3/eventutilities-flink-1.3.3-jar-with-dependencies.jar
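Once the jars are downloaded, they can be handed to Flink's SQL client with `--jar` flags. A minimal sketch, assuming a Flink 1.17.2 distribution unpacked at `$FLINK_HOME` and the jars in `~/flink-sql-libs` (both paths are assumptions, not from the gist); the leading `echo` makes it a dry run so the assembled command can be inspected before running it for real:

```shell
FLINK_HOME="${FLINK_HOME:-$HOME/flink-1.17.2}"  # assumed install location
LIBS="$HOME/flink-sql-libs"

# Each downloaded jar is passed to the SQL client with its own --jar flag.
# Drop the leading "echo" to actually launch the client.
echo "$FLINK_HOME/bin/sql-client.sh" \
  --jar "$LIBS/flink-connector-kafka-1.17.2.jar" \
  --jar "$LIBS/kafka-clients-3.4.0.jar"
```

If querying WMF event streams, add a third `--jar` for the eventutilities-flink jar commented out above.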
@ottomata
ottomata / output.txt
Created January 2, 2024 21:55
EvolveHiveTable with comments output
spark3-submit --class org.wikimedia.analytics.refinery.job.refine.tool.EvolveHiveTable ./refinery-job/target/refinery-job-0.2.28-SNAPSHOT-shaded.jar --table=event.mediawiki_page_change_v1 --schema_uri=/mediawiki/page/change/latest --dry_run=true
24/01/02 21:49:53 INFO DataFrameToHive: Found difference in schemas for Hive table otto.mw_page_change0
Table schema:
root
 |-- _schema: string (nullable = true)
 |-- changelog_kind: string (nullable = true)
 |-- comment: string (nullable = true)
 |-- created_redirect_page: struct (nullable = true)
 |    |-- is_redirect: boolean (nullable = true)
@ottomata
ottomata / pyflink_output_tag_reuse_fail_word_count_example.py
Last active February 14, 2023 19:50
Example showing failure when re-using an OutputTag in multiple ProcessFunctions
@ottomata
ottomata / pyflink_sideout_fail_word_count_example.py
Last active February 10, 2023 14:05
Example showing failure when not calling get_side_output on a pyflink DataStream immediately after the ProcessFunction that emits to a side output
@ottomata
ottomata / Dockerfile
Last active December 8, 2022 18:48
flink-kubernetes-operator python example
FROM docker-registry.wikimedia.org/flink:1.16.0-37
# add python script
USER root
RUN mkdir -p /srv/flink_app && ls
ADD python_demo.py /srv/flink_app/python_demo.py
USER flink
@ottomata
ottomata / schema_classes.py
Created November 17, 2022 18:23
datahub schema classes
# flake8: noqa
# This file is autogenerated by /metadata-ingestion/scripts/avro_codegen.py
# Do not modify manually!
# pylint: skip-file
# fmt: off
# The SchemaFromJSONData method only exists in avro-python3, but is called make_avsc_object in avro.
# We can use this fact to detect conflicts between the two packages. Pip won't detect those conflicts
@ottomata
ottomata / 0_flink_sql_enrich_demo.sh
Created November 7, 2022 20:38
Mediawiki Content Flink Streaming SQL Content Enrich
@ottomata
ottomata / enrich.sql
Created November 3, 2022 21:50
flink SQL enrichment UDF experiments 2
CREATE TEMPORARY TABLE mediawiki_page_change (
`wiki_id` STRING,
`meta` ROW<domain STRING>,
`page_change_kind` STRING,
`page` ROW<page_id BIGINT, page_title STRING>,
`revision` ROW<rev_id BIGINT, content_slots MAP<STRING, ROW<slot_role STRING, content_format STRING, content_body STRING>>>
) WITH (
'connector' = 'kafka',
// Quick and hacky script that will use dyff to show the diff between
// Any modified materialized schema version and its previous version.
//
// Defaults to using https://github.com/homeport/dyff, so install that first.
// This could be cleaned up and incorporated into jsonschema-tools itself, and
// then shown in CI.
//
const jsonschema_tools = require('@wikimedia/jsonschema-tools');
const _ = require('lodash');
@ottomata
ottomata / 0_mediawiki_page_create_stream_table.py
Last active September 29, 2022 20:38
pyflink + event platform experiments
# Pyflink Streaming Table Env + Event Utilities Event Platform integration.
# Download the Wikimedia Event Utilities Flink jar, e.g.
# https://archiva.wikimedia.org/#artifact-details-download-content/org.wikimedia/eventutilities-flink/1.2.0
# wget https://archiva.wikimedia.org/repository/releases/org/wikimedia/eventutilities-flink/1.2.0/eventutilities-flink-1.2.0-jar-with-dependencies.jar
# Also download other dependencies we need:
# wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-kafka/1.15.2/flink-connector-kafka-1.15.2.jar
# wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/3.2.3/kafka-clients-3.2.3.jar
# If you like, use ipython for your REPL; it's nicer!
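Before starting the REPL, it is worth sanity-checking that all three jars listed in the comments above actually landed. A small hypothetical helper (not part of the gist) that reports which jars are present in the current directory:

```shell
# Check that each jar named in the download comments is present here.
missing=0
for jar in \
  eventutilities-flink-1.2.0-jar-with-dependencies.jar \
  flink-connector-kafka-1.15.2.jar \
  kafka-clients-3.2.3.jar
do
  if [ -f "$jar" ]; then
    echo "ok      $jar"
  else
    echo "missing $jar"
    missing=$((missing + 1))
  fi
done
echo "$missing jar(s) missing"
```

When all three are present, the jars can be referenced from PyFlink with `file://` URLs, e.g. via the `pipeline.jars` table environment configuration.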