This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| -- Use below SQL queries to periodically deduplicate data in BigQuery tables. | |
| CREATE OR REPLACE TABLE Transactions AS | |
| SELECT DISTINCT * | |
| FROM raw_transactions; | |
| --OR use below incremental steps to drop the necessary partitions and re-insert the deduped data into the original table | |
| -- Step 1: Insert distinct records from the original table based on the max timestamp available |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # This is the dataflow pipeline created using Python. | |
| # This data pipeline reads PubSub message in JSON format and identifies which target table the data should write to based on "Category" key Identifier. | |
| # You can modify this based on use case to write to multiple destination tables in Bigquery. | |
| # Also, this datapipeline uses custom BigQuery function to process each element | |
| # and write Error messages to a Dead letter table with the Raw Message received from PubSub. Refer to "WriteToBQ" Class for more information | |
| import apache_beam as beam | |
| from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, GoogleCloudOptions | |
| from apache_beam.io.gcp.pubsub import ReadFromPubSub |