Naresh Erukulla nerukulla0719

nerukulla0719 / dedup_sql_queries.sql
Created January 24, 2025 00:57
Suggested methods to remove duplicates from Google Cloud BigQuery using BigQuery SQL
-- Use the SQL queries below to periodically deduplicate data in BigQuery tables.
CREATE OR REPLACE TABLE Transactions AS
SELECT DISTINCT *
FROM raw_transactions;
-- Or use the incremental steps below to drop the necessary partitions and re-insert the deduplicated data into the original table.
-- Step 1: Insert distinct records from the original table based on the max timestamp available
nerukulla0719 / dataflow_customBQ_python_example.py
Last active August 8, 2024 13:54
dataflow_customBQ_python_example
# This is a Dataflow pipeline written in Python.
# The pipeline reads Pub/Sub messages in JSON format and uses the "Category" key to identify which target table each record should be written to.
# You can modify this for your use case to write to multiple destination tables in BigQuery.
# The pipeline also uses a custom BigQuery function to process each element
# and writes error messages to a dead-letter table together with the raw message received from Pub/Sub. Refer to the "WriteToBQ" class for more information.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, GoogleCloudOptions
from apache_beam.io.gcp.pubsub import ReadFromPubSub
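
The routing rule described in the header comments (choose a target table from the "Category" key, and send anything that fails to a dead-letter table with the raw message) can be sketched in plain Python. This is a minimal illustration and not the gist's "WriteToBQ" class: the table names, the `CATEGORY_TABLES` mapping, and the `route_message` helper are all hypothetical, and no Beam or BigQuery calls are made.

```python
import json

# Hypothetical mapping from "Category" values to BigQuery table names.
CATEGORY_TABLES = {
    "orders": "my_dataset.orders",
    "payments": "my_dataset.payments",
}
# Hypothetical dead-letter table for messages that cannot be routed.
DEAD_LETTER_TABLE = "my_dataset.dead_letter"


def route_message(raw_message: bytes):
    """Return a (table, row) pair for one Pub/Sub payload.

    Valid JSON objects with a known "Category" go to the mapped table.
    Everything else goes to the dead-letter table, keeping the raw
    message and a short error description.
    """
    try:
        record = json.loads(raw_message.decode("utf-8"))
        table = CATEGORY_TABLES[record["Category"]]
        return table, record
    except (ValueError, KeyError, TypeError) as err:
        # ValueError covers bad UTF-8 and malformed JSON; KeyError covers a
        # missing or unknown category; TypeError covers non-object payloads.
        return DEAD_LETTER_TABLE, {
            "raw_message": raw_message.decode("utf-8", errors="replace"),
            "error": f"{type(err).__name__}: {err}",
        }
```

In a Beam pipeline, a function like this could be applied to each element with `beam.Map` after `ReadFromPubSub`, with the returned table name deciding where the row is written; how the gist itself wires this up is not shown in the preview above.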