@hui-zheng
hui-zheng / cat_with_filenames.sh
Last active May 27, 2020 17:01
bash script recipes for all
# cat multiple files and show filenames
# (`grep ^` matches every line; listing /dev/null first guarantees grep
#  sees more than one file, so it prefixes each line with its filename)
grep ^ /dev/null "$@"
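The one-liner above can be exercised on a couple of throwaway files (the file names here are just for the demo):

```shell
#!/usr/bin/env bash
# Create two sample files (hypothetical names, demo only).
printf 'alpha\nbeta\n' > a.txt
printf 'gamma\n' > b.txt

# With /dev/null in the argument list, grep always has multiple files,
# so every matched line is prefixed with the file it came from.
grep ^ /dev/null a.txt b.txt
# prints:
# a.txt:alpha
# a.txt:beta
# b.txt:gamma
```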
@hui-zheng
hui-zheng / find_and_display_duplicates
Created March 14, 2020 19:49
[BigQuery Advanced SQL] a flexible script to detect and display duplicate records, and to remove the duplicates (dedup)
-- Collect all rows per surrogate_key within a time window,
-- then show the keys that occur more than once.
WITH rows_by_key AS (
  SELECT
    surrogate_key,
    ARRAY_AGG(base_table) AS _rows,
    COUNT(*) AS _count
  FROM `gcp_project.data_set.original_table` AS base_table
  WHERE stamp BETWEEN "2020-03-12T00:00:00" AND "2020-03-14T00:00:00"
  GROUP BY surrogate_key
)
SELECT * FROM rows_by_key WHERE _count > 1
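The same grouping idea — count rows per key and surface keys that appear more than once — can be sketched in plain shell over a delimited file (the file name and column layout below are assumptions, for the demo only):

```shell
#!/usr/bin/env bash
# rows.csv: surrogate_key,payload (made-up sample data).
cat > rows.csv <<'EOF'
k1,first
k2,second
k1,third
EOF

# Count occurrences of each key and keep only keys seen more than once,
# mirroring the `GROUP BY surrogate_key` plus `_count > 1` step in the SQL.
cut -d, -f1 rows.csv | sort | uniq -c | awk '$1 > 1 {print $2}'
# prints:
# k1
```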
@hui-zheng
hui-zheng / clean_kubernetes_jobs.sh
Last active March 5, 2020 16:39
Kubernetes DevOps operations script: clean up completed jobs
# Defaults for the cleanup filters
BY_LIST_FILE="NONE"
COMPLETED="1/."   # regex matched against the COMPLETIONS column (e.g. "1/1")
AGE="3"
NAME_PATTERN=".*" # regex matched against the job name
while [[ $# -gt 0 ]]; do
  key="$1"
  case $key in
    *)  # option parsing continues in the full gist
      shift
      ;;
  esac
done
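A minimal sketch of how those defaults could drive the filtering step, using a canned `kubectl get jobs` listing instead of a live cluster (the sample rows are made up):

```shell
#!/usr/bin/env bash
COMPLETED="1/."   # regex for the COMPLETIONS column, e.g. "1/1"
NAME_PATTERN=".*" # regex for the job name

# Canned output standing in for `kubectl get jobs` (hypothetical jobs).
jobs_listing='NAME           COMPLETIONS   DURATION   AGE
nightly-etl    1/1           5m         4d
stuck-import   0/1           90m        7d'

# Keep only completed jobs whose name matches the pattern;
# the real script would pass the surviving names to `kubectl delete job`.
kept="$(printf '%s\n' "$jobs_listing" \
  | awk -v c="$COMPLETED" -v n="$NAME_PATTERN" \
        'NR > 1 && $2 ~ c && $1 ~ n {print $1}')"
echo "$kept"
# prints:
# nightly-etl
```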
@hui-zheng
hui-zheng / SQL non-null greatest for multiple columns
Last active February 7, 2024 04:01
[BigQuery Advanced SQL] find greatest/largest/max non-null values among multiple columns
-- Below is a fancier version of a non-null GREATEST() for multiple columns.
-- It extends naturally to more than two columns.
WITH base AS (
  SELECT
    -- Pack the columns into an array, dropping NULLs.
    -- (ARRAY is a reserved keyword in BigQuery, so the alias is renamed.)
    (SELECT ARRAY_AGG(x IGNORE NULLS) FROM UNNEST([col_1, col_2, col_3, col_4]) AS x)
      AS non_null_values
  FROM source_table
)
SELECT
  -- MAX over the remaining values is the non-null greatest.
  (SELECT MAX(y) FROM UNNEST(non_null_values) AS y) AS non_null_greatest
FROM base
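The row-wise "greatest non-null" trick can be mimicked outside BigQuery as well; here is a shell sketch over a CSV in which empty fields play the role of NULLs (the data is made up):

```shell
#!/usr/bin/env bash
# Each row has four columns; empty cells stand in for NULL.
cat > cols.csv <<'EOF'
3,,7,1
,,,
5,2,,
EOF

# For each row, take the max over the non-empty fields,
# mirroring ARRAY_AGG(x IGNORE NULLS) followed by MAX.
awk -F, '{
  max = ""
  for (i = 1; i <= NF; i++)
    if ($i != "" && (max == "" || $i + 0 > max + 0)) max = $i
  print (max == "" ? "NULL" : max)
}' cols.csv
# prints:
# 7
# NULL
# 5
```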
@hui-zheng
hui-zheng / BQ_partition_dedup.sql
Last active May 26, 2023 14:14
BigQuery SQL templates that remove duplicates from a large timestamp-partitioned table (using a MERGE statement) and from a small or non-partitioned table (using a CREATE OR REPLACE TABLE statement)
-- WARNING: back up the table before this operation.
-- FOR a large timestamp-partitioned table
-- -------------------------------------------
-- -- De-duplicate rows within a given range of a partitioned table,
-- -- using surrogate_key as the unique id.
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end   DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
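Before running the MERGE on real data, the keep-one-row-per-key behaviour it aims for can be sanity-checked on a toy file: sort by key and timestamp, then keep the latest row per key (sample data and column order below are assumptions):

```shell
#!/usr/bin/env bash
# key,stamp,value — k1 is duplicated with different stamps (made-up rows).
cat > events.csv <<'EOF'
k1,2019-09-17,old
k2,2019-09-18,only
k1,2019-09-19,new
EOF

# Sort by key then stamp and keep the last (latest) row per key, which is
# what ROW_NUMBER() OVER (PARTITION BY key ORDER BY stamp DESC) = 1
# typically selects in a SQL dedup.
sort -t, -k1,1 -k2,2 events.csv | awk -F, '
  { last[$1] = $0 }
  END { for (k in last) print last[k] }
' | sort
# prints:
# k1,2019-09-19,new
# k2,2019-09-18,only
```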