@yifeihuang
Created September 13, 2020 20:37
[ER] sample and review potential matches
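from pyspark.sql import functions as f
# Assumes an active SparkSession named `spark` (e.g. in a notebook)
distance_df = spark.read.parquet("YOUR_STORAGE_PATH/amazon_google_distance.parquet")
display_cols = ['name', 'description', 'manufacturer', 'price']
# Keep pairs with 0 < overall_sim < 1, show the src and dst values of each
# display column joined by "VS", then draw a reproducible 2% sample for review
sample_df = (
    distance_df.filter((f.col('overall_sim') > 0) & (f.col('overall_sim') < 1))
    .select('edge.src', 'edge.dst', *[f.concat_ws('\nVS\n', 'src.' + c, 'dst.' + c).alias(c) for c in display_cols], 'overall_sim')
    .sample(withReplacement=False, fraction=0.02, seed=42)
)
sample_df.write.mode('overwrite').csv("YOUR_STORAGE_PATH/candidate_pair_sample.csv")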