@vinothchandar
Last active August 31, 2021 17:11
hyperspace - demo

TL;DR :

  • Was exploring whether Hyperspace can be used as an alternative to our record/bloom indexes
  • For the needle-in-a-haystack search, i.e. a single id out of all the records, Hyperspace also does not seem very effective at the moment (might not be surprising, given the covering-index recommendations so far).
  • Our old workhorse BLOOM_INDEX still significantly outperforms it. But we should really step on the gas on RFC-15-like efforts/RFC-08 to make this much faster
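
For context on why a bloom-filter-based index suits this kind of lookup: each file carries a small bit set summarizing the keys it holds, so a lookup can skip every file whose filter says the key is definitely absent. Below is a minimal sketch of that idea in plain Scala. It is not Hudi's implementation; `SimpleBloom`, `candidateFiles`, the sizing, and the hashing scheme are hypothetical choices made only for illustration.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // Hypothetical, simplified bloom filter: numBits bits, numHashes seeded hashes.
  final class SimpleBloom(numBits: Int, numHashes: Int) {
    private val bits = new mutable.BitSet(numBits)

    // One bit position per seed, always inside [0, numBits).
    private def positions(key: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        Math.floorMod(MurmurHash3.stringHash(key, seed), numBits)
      }

    def add(key: String): Unit = positions(key).foreach(bits += _)

    // false => key is definitely absent; true => key may be present.
    def mightContain(key: String): Boolean = positions(key).forall(bits.contains)
  }

  // Names of the files whose filter cannot rule the key out.
  def candidateFiles(filters: Map[String, SimpleBloom], key: String): Set[String] =
    filters.collect { case (file, bloom) if bloom.mightContain(key) => file }.toSet
}
```

Only the files returned by `candidateFiles` would need to be opened to confirm the key, which is what keeps a single-id lookup cheap when most filters answer "absent".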

https://microsoft.github.io/hyperspace/docs/ug-quick-start-guide/

~/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-shell   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --driver-memory 8g --packages com.microsoft.hyperspace:hyperspace-core_2.12:0.1.0

val part100Path = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-100-parts"
val df100 = spark.read.parquet(part100Path)
df100.createOrReplaceTempView("amazon_reviews_100_parts")  // registerTempTable is deprecated since Spark 2.0

import com.microsoft.hyperspace._
val hs = new Hyperspace(spark)
import com.microsoft.hyperspace.index._


+--------------+
|     review_id|
+--------------+
|R38YR2K3RQVUT6|
|R1UE9PRDNPVWJN|
|R2T5TIOI92JDOA|
| RY7UKOQOZ1NA9|
| R1LJ65G8LY6L6|
| ROQTM343YUPY5|
|R160R9P9BRK8J6|
| R30ZKF6EPTV76|
|R2Q93ZF9K7BERL|
|R2UG8JB73C003W|
|R1NX7L8FAZFL6T|
| R3RJQHNPYINS1|
| R5Z19IT94F27U|
|R1C1X93D1TPIVY|
|R2AZ4P431BHSXD|
|R1G30L7BW96HH9|
|R2Q05M51VX6P14|
| RL9AZUSVJC16M|
|R119E7G9JQDDO5|
|R36I5SKSR7V0WK|
+--------------+

vinothchandar commented Jul 20, 2020

Query without index

df100.filter("review_id = 'ROQTM343YUPY5'").show
+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|marketplace|customer_id|    review_id|product_id|product_parent|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|year|product_category|
+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|         US|   29124476|ROQTM343YUPY5|0374528373|     569503661|The Brothers Kara...|          5|            0|          1|   N|                N|Despite all his f...|This is indeed on...| 2015-06-25|2015|           Books|
+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
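
To put numbers on the with/without-index comparison, a small wall-clock timer is enough for a rough check. This is a sketch; `Timing.timed` is a hypothetical helper, not something used in the original runs. In a Spark shell, `spark.time { ... }` gives a similar one-off measurement.

```scala
object Timing {
  // Run `block` once and return its result together with the elapsed wall-clock milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

For example, `val (_, ms) = Timing.timed(df100.filter("review_id = 'ROQTM343YUPY5'").show)` would time the lookup above. A single run includes JVM warm-up and file-system caching effects, so repeated runs give a fairer picture.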

vinothchandar commented Jul 20, 2020

Query with hyperspace

Run on Spark 2.4.6

hs.createIndex(df100, IndexConfig("index", indexedColumns = Seq("review_id"), includedColumns = Seq("customer_id", "marketplace", "product_id")))
val indexes = hs.indexes
indexes.show


// Exiting paste mode, now interpreting.

+-----+--------------+--------------------+----------+--------------------+--------------------+--------------------+------+
| name|indexedColumns|     includedColumns|numBuckets|              schema|       indexLocation|           queryPlan| state|
+-----+--------------+--------------------+----------+--------------------+--------------------+--------------------+------+
|index|   [review_id]|[customer_id, mar...|       200|{"type":"struct",...|file:/Volumes/HUD...|Relation[marketpl...|ACTIVE|
+-----+--------------+--------------------+----------+--------------------+--------------------+--------------------+------+

indexes: org.apache.spark.sql.DataFrame = [name: string, indexedColumns: array<string> ... 6 more fields]

scala>
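
The `numBuckets` value of 200 means the index data is hash-bucketed on the indexed column, so any given `review_id` lives in exactly one bucket. A rough sketch of that mapping follows. `bucketOf` and `assign` are hypothetical names, and the hash used here is only a stand-in; Spark's actual bucketing hash differs in detail, so the bucket numbers will not match the real files.

```scala
import scala.util.hashing.MurmurHash3

object BucketSketch {
  // Stand-in for a bucketing hash: map a key to a bucket id in [0, numBuckets).
  def bucketOf(key: String, numBuckets: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key), numBuckets)

  // Group keys by the bucket they would land in.
  def assign(keys: Seq[String], numBuckets: Int): Map[Int, Seq[String]] =
    keys.groupBy(bucketOf(_, numBuckets))
}
```

In principle, a point lookup that is served from the index only needs to read the one bucket its key hashes to, instead of all 200.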

Index build step

image

spark.enableHyperspace
val query = df100.filter("review_id = 'ROQTM343YUPY5'")
hs.explain(query, verbose = true)


=============================================================
Plan with indexes:
=============================================================
Project [marketplace#0, customer_id#1, review_id#2, product_id#3, product_parent#4, product_title#5, star_rating#6, helpful_votes#7, total_votes#8, vine#9, verified_purchase#10, review_headline#11, review_body#12, review_date#13, year#14, product_category#15]
+- Filter (isnotnull(review_id#2) && (review_id#2 = ROQTM343YUPY5))
   +- FileScan parquet [marketplace#0,customer_id#1,review_id#2,product_id#3,product_parent#4,product_title#5,star_rating#6,helpful_votes#7,total_votes#8,vine#9,verified_purchase#10,review_headline#11,review_body#12,review_date#13,year#14,product_category#15] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/HUDIDATA/input-data/amazon-reviews-100-parts], PartitionCount: 43, PartitionFilters: [], PushedFilters: [IsNotNull(review_id), EqualTo(review_id,ROQTM343YUPY5)], ReadSchema: struct<marketplace:string,customer_id:string,review_id:string,product_id:string,product_parent:st...

=============================================================
Plan without indexes:
=============================================================
Project [marketplace#0, customer_id#1, review_id#2, product_id#3, product_parent#4, product_title#5, star_rating#6, helpful_votes#7, total_votes#8, vine#9, verified_purchase#10, review_headline#11, review_body#12, review_date#13, year#14, product_category#15]
+- Filter (isnotnull(review_id#2) && (review_id#2 = ROQTM343YUPY5))
   +- FileScan parquet [marketplace#0,customer_id#1,review_id#2,product_id#3,product_parent#4,product_title#5,star_rating#6,helpful_votes#7,total_votes#8,vine#9,verified_purchase#10,review_headline#11,review_body#12,review_date#13,year#14,product_category#15] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Volumes/HUDIDATA/input-data/amazon-reviews-100-parts], PartitionCount: 43, PartitionFilters: [], PushedFilters: [IsNotNull(review_id), EqualTo(review_id,ROQTM343YUPY5)], ReadSchema: struct<marketplace:string,customer_id:string,review_id:string,product_id:string,product_parent:st...

=============================================================
Indexes used:
=============================================================

=============================================================
Physical operator stats:
=============================================================
+-----------------+-------------------+------------------+----------+
|Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+-----------------+-------------------+------------------+----------+
|           Filter|                  1|                 1|         0|
|          Project|                  1|                 1|         0|
|     Scan parquet|                  1|                 1|         0|
|WholeStageCodegen|                  1|                 1|         0|
+-----------------+-------------------+------------------+----------+


scala>
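
Note that "Indexes used" is empty and both plans are identical. One likely reason, as far as I understand Hyperspace's filter rule: the query returns all 16 columns, while the index stores only `review_id` plus the three included columns, so the index does not cover the query. A minimal sketch of that coverage check in plain Scala; `CoveringIndex` and `covers` are hypothetical names for illustration, not Hyperspace's API.

```scala
object CoverageSketch {
  // Hypothetical model of a covering index: key columns plus extra stored columns.
  final case class CoveringIndex(indexedColumns: Seq[String], includedColumns: Seq[String]) {
    def allColumns: Set[String] = (indexedColumns ++ includedColumns).toSet
  }

  // A filter query can be answered from the index alone only if
  // (1) the filter references the first indexed column, and
  // (2) every column the query reads (filter + output) is stored in the index.
  def covers(index: CoveringIndex, filterColumns: Set[String], outputColumns: Set[String]): Boolean =
    index.indexedColumns.headOption.exists(filterColumns.contains) &&
      (filterColumns ++ outputColumns).subsetOf(index.allColumns)
}
```

Under that assumption, the query to try would be one that projects only covered columns, e.g. `df100.filter("review_id = 'ROQTM343YUPY5'").select("review_id", "customer_id", "marketplace", "product_id")`. Whether the index is then picked up was not verified here.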

Index on disk

total 512
drwxrwxrwx  1 vs  staff  262144 Jul 19 19:04 index

./spark-warehouse/indexes//index:
total 1024
drwxrwxrwx  1 vs  staff  262144 Jul 19 19:04 _hyperspace_log
drwxrwxrwx  1 vs  staff  262144 Jul 19 19:04 v__=0

./spark-warehouse/indexes//index/_hyperspace_log:
total 4608
-rwxrwxrwx  1 vs  staff  687474 Jul 19 19:04 0
-rwxrwxrwx  1 vs  staff  687472 Jul 19 19:14 1
-rwxrwxrwx  1 vs  staff  687472 Jul 19 19:14 latestStable

./spark-warehouse/indexes//index/v__=0:
total 9113600
-rwxrwxrwx  1 vs  staff         0 Jul 19 19:14 _SUCCESS
-rwxrwxrwx  1 vs  staff  23213334 Jul 19 19:10 part-00000-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00000.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235439 Jul 19 19:10 part-00001-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00001.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232819 Jul 19 19:10 part-00002-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00002.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23226684 Jul 19 19:10 part-00003-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00003.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23257929 Jul 19 19:10 part-00004-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00004.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23292434 Jul 19 19:10 part-00005-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00005.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23241055 Jul 19 19:10 part-00006-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00006.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23203631 Jul 19 19:10 part-00007-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00007.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23249414 Jul 19 19:10 part-00008-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00008.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23222184 Jul 19 19:10 part-00009-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00009.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23245083 Jul 19 19:10 part-00010-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00010.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232753 Jul 19 19:10 part-00011-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00011.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23253105 Jul 19 19:10 part-00012-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00012.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23198594 Jul 19 19:10 part-00013-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00013.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23236601 Jul 19 19:10 part-00014-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00014.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23207379 Jul 19 19:10 part-00015-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00015.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23258874 Jul 19 19:10 part-00016-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00016.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23240327 Jul 19 19:10 part-00017-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00017.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23226363 Jul 19 19:10 part-00018-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00018.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23217928 Jul 19 19:10 part-00019-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00019.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23259160 Jul 19 19:10 part-00020-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00020.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235324 Jul 19 19:10 part-00021-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00021.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23241632 Jul 19 19:10 part-00022-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00022.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23228028 Jul 19 19:10 part-00023-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00023.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23217968 Jul 19 19:10 part-00024-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00024.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235276 Jul 19 19:10 part-00025-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00025.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23262897 Jul 19 19:10 part-00026-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00026.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235759 Jul 19 19:10 part-00027-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00027.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23214629 Jul 19 19:10 part-00028-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00028.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23234460 Jul 19 19:10 part-00029-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00029.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23234568 Jul 19 19:10 part-00030-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00030.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23226401 Jul 19 19:10 part-00031-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00031.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23216329 Jul 19 19:11 part-00032-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00032.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23185441 Jul 19 19:11 part-00033-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00033.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232155 Jul 19 19:11 part-00034-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00034.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23256716 Jul 19 19:11 part-00035-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00035.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23281925 Jul 19 19:11 part-00036-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00036.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23283882 Jul 19 19:11 part-00037-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00037.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23257309 Jul 19 19:11 part-00038-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00038.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23245085 Jul 19 19:11 part-00039-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00039.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23245296 Jul 19 19:11 part-00040-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00040.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23220991 Jul 19 19:11 part-00041-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00041.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23206863 Jul 19 19:11 part-00042-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00042.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23236477 Jul 19 19:11 part-00043-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00043.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23183709 Jul 19 19:11 part-00044-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00044.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23200692 Jul 19 19:11 part-00045-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00045.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23283870 Jul 19 19:11 part-00046-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00046.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23222838 Jul 19 19:11 part-00047-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00047.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23254176 Jul 19 19:11 part-00048-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00048.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23274202 Jul 19 19:11 part-00049-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00049.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23236212 Jul 19 19:11 part-00050-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00050.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23201207 Jul 19 19:11 part-00051-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00051.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23198549 Jul 19 19:11 part-00052-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00052.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23261941 Jul 19 19:11 part-00053-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00053.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23256594 Jul 19 19:11 part-00054-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00054.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23202945 Jul 19 19:11 part-00055-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00055.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23227772 Jul 19 19:11 part-00056-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00056.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23209128 Jul 19 19:11 part-00057-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00057.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23200224 Jul 19 19:11 part-00058-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00058.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23270253 Jul 19 19:11 part-00059-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00059.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23218671 Jul 19 19:11 part-00060-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00060.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23212553 Jul 19 19:11 part-00061-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00061.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23221617 Jul 19 19:11 part-00062-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00062.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23237442 Jul 19 19:11 part-00063-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00063.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23231506 Jul 19 19:11 part-00064-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00064.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23246464 Jul 19 19:11 part-00065-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00065.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23194407 Jul 19 19:11 part-00066-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00066.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23201390 Jul 19 19:11 part-00067-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00067.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23243252 Jul 19 19:11 part-00068-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00068.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23253795 Jul 19 19:11 part-00069-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00069.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23226824 Jul 19 19:11 part-00070-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00070.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23249275 Jul 19 19:11 part-00071-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00071.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23226333 Jul 19 19:11 part-00072-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00072.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23231319 Jul 19 19:11 part-00073-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00073.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23250028 Jul 19 19:11 part-00074-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00074.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23182901 Jul 19 19:11 part-00075-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00075.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23208606 Jul 19 19:11 part-00076-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00076.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23271094 Jul 19 19:11 part-00077-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00077.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23205015 Jul 19 19:11 part-00078-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00078.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23263700 Jul 19 19:11 part-00079-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00079.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23259281 Jul 19 19:11 part-00080-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00080.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23192456 Jul 19 19:11 part-00081-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00081.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23295860 Jul 19 19:11 part-00082-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00082.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23194965 Jul 19 19:11 part-00083-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00083.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23243223 Jul 19 19:11 part-00084-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00084.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23210853 Jul 19 19:12 part-00085-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00085.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23224148 Jul 19 19:12 part-00086-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00086.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23258535 Jul 19 19:12 part-00087-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00087.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235617 Jul 19 19:12 part-00088-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00088.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23282369 Jul 19 19:12 part-00089-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00089.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23254581 Jul 19 19:12 part-00090-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00090.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23261329 Jul 19 19:12 part-00091-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00091.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23199170 Jul 19 19:12 part-00092-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00092.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23203346 Jul 19 19:12 part-00093-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00093.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23211599 Jul 19 19:12 part-00094-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00094.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23218826 Jul 19 19:12 part-00095-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00095.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23211228 Jul 19 19:12 part-00096-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00096.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23234336 Jul 19 19:12 part-00097-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00097.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23194735 Jul 19 19:12 part-00098-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00098.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23238800 Jul 19 19:12 part-00099-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00099.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23206681 Jul 19 19:12 part-00100-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00100.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23186676 Jul 19 19:12 part-00101-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00101.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23260763 Jul 19 19:12 part-00102-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00102.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23269858 Jul 19 19:12 part-00103-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00103.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23250596 Jul 19 19:12 part-00104-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00104.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23211815 Jul 19 19:12 part-00105-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00105.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232375 Jul 19 19:12 part-00106-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00106.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23288308 Jul 19 19:12 part-00107-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00107.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23245471 Jul 19 19:12 part-00108-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00108.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23166602 Jul 19 19:12 part-00109-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00109.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23229732 Jul 19 19:12 part-00110-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00110.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23219100 Jul 19 19:12 part-00111-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00111.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23269078 Jul 19 19:12 part-00112-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00112.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23205149 Jul 19 19:12 part-00113-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00113.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23260239 Jul 19 19:12 part-00114-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00114.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23243198 Jul 19 19:12 part-00115-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00115.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23229884 Jul 19 19:12 part-00116-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00116.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23238198 Jul 19 19:12 part-00117-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00117.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23269664 Jul 19 19:12 part-00118-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00118.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23211535 Jul 19 19:12 part-00119-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00119.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23262656 Jul 19 19:12 part-00120-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00120.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235823 Jul 19 19:12 part-00121-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00121.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23239877 Jul 19 19:12 part-00122-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00122.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23207157 Jul 19 19:12 part-00123-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00123.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23279832 Jul 19 19:12 part-00124-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00124.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23252268 Jul 19 19:12 part-00125-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00125.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23262972 Jul 19 19:12 part-00126-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00126.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23283286 Jul 19 19:12 part-00127-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00127.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23245192 Jul 19 19:12 part-00128-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00128.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23208152 Jul 19 19:12 part-00129-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00129.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23215002 Jul 19 19:12 part-00130-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00130.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23208108 Jul 19 19:12 part-00131-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00131.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23211288 Jul 19 19:12 part-00132-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00132.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23239698 Jul 19 19:12 part-00133-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00133.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23239502 Jul 19 19:12 part-00134-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00134.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23251565 Jul 19 19:12 part-00135-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00135.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23208020 Jul 19 19:13 part-00136-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00136.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23199716 Jul 19 19:13 part-00137-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00137.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23209264 Jul 19 19:13 part-00138-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00138.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23208870 Jul 19 19:13 part-00139-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00139.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23214484 Jul 19 19:13 part-00140-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00140.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23197107 Jul 19 19:13 part-00141-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00141.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23281205 Jul 19 19:13 part-00142-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00142.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23213987 Jul 19 19:13 part-00143-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00143.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23242044 Jul 19 19:13 part-00144-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00144.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23208287 Jul 19 19:13 part-00145-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00145.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232389 Jul 19 19:13 part-00146-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00146.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23200231 Jul 19 19:13 part-00147-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00147.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23226198 Jul 19 19:13 part-00148-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00148.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232251 Jul 19 19:13 part-00149-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00149.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23260741 Jul 19 19:13 part-00150-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00150.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23222743 Jul 19 19:13 part-00151-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00151.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23214971 Jul 19 19:13 part-00152-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00152.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23227365 Jul 19 19:13 part-00153-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00153.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23274317 Jul 19 19:13 part-00154-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00154.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23253414 Jul 19 19:13 part-00155-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00155.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23250165 Jul 19 19:13 part-00156-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00156.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23192104 Jul 19 19:13 part-00157-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00157.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23280045 Jul 19 19:13 part-00158-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00158.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23239272 Jul 19 19:13 part-00159-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00159.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23278769 Jul 19 19:13 part-00160-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00160.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23272290 Jul 19 19:13 part-00161-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00161.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23189123 Jul 19 19:13 part-00162-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00162.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23254500 Jul 19 19:13 part-00163-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00163.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23189382 Jul 19 19:13 part-00164-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00164.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23206638 Jul 19 19:13 part-00165-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00165.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23220450 Jul 19 19:13 part-00166-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00166.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23263662 Jul 19 19:13 part-00167-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00167.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23242594 Jul 19 19:13 part-00168-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00168.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23251899 Jul 19 19:13 part-00169-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00169.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23248752 Jul 19 19:13 part-00170-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00170.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23203881 Jul 19 19:13 part-00171-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00171.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23225324 Jul 19 19:13 part-00172-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00172.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23214338 Jul 19 19:13 part-00173-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00173.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23218410 Jul 19 19:13 part-00174-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00174.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23261025 Jul 19 19:13 part-00175-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00175.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23233996 Jul 19 19:13 part-00176-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00176.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23197958 Jul 19 19:13 part-00177-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00177.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23221439 Jul 19 19:13 part-00178-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00178.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23222238 Jul 19 19:13 part-00179-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00179.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23217078 Jul 19 19:13 part-00180-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00180.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23194385 Jul 19 19:13 part-00181-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00181.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23216861 Jul 19 19:13 part-00182-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00182.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23240298 Jul 19 19:13 part-00183-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00183.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23204610 Jul 19 19:14 part-00184-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00184.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23256855 Jul 19 19:14 part-00185-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00185.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23241107 Jul 19 19:14 part-00186-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00186.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23279011 Jul 19 19:14 part-00187-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00187.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23232202 Jul 19 19:14 part-00188-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00188.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23264828 Jul 19 19:14 part-00189-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00189.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23223551 Jul 19 19:14 part-00190-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00190.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23229862 Jul 19 19:14 part-00191-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00191.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23264385 Jul 19 19:14 part-00192-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00192.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23211548 Jul 19 19:14 part-00193-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00193.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23259055 Jul 19 19:14 part-00194-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00194.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23268412 Jul 19 19:14 part-00195-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00195.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23257628 Jul 19 19:14 part-00196-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00196.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23230806 Jul 19 19:14 part-00197-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00197.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23235034 Jul 19 19:14 part-00198-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00198.c000.snappy.parquet
-rwxrwxrwx  1 vs  staff  23265105 Jul 19 19:14 part-00199-0c3b7c7d-23a3-4a0e-b275-839a2c456f9f_00199.c000.snappy.parquet

Actual run

scala> query.show
+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|marketplace|customer_id|    review_id|product_id|product_parent|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|year|product_category|
+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|         US|   29124476|ROQTM343YUPY5|0374528373|     569503661|The Brothers Kara...|          5|            0|          1|   N|                N|Despite all his f...|This is indeed on...| 2015-06-25|2015|           Books|
+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+


scala>


@vinothchandar
Author

vinothchandar commented Jul 20, 2020

Hudi Bloom Index

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "amazon_reviews_hudi"
val basePath = "file:///Volumes/HUDIDATA/input-data/amazon-reviews-hudi"

df100.write.format("hudi").
  option(PRECOMBINE_FIELD_OPT_KEY, "review_id").
  option(RECORDKEY_FIELD_OPT_KEY, "review_id").
  option(PARTITIONPATH_FIELD_OPT_KEY, "product_category").
  option(TABLE_NAME, "amazon_reviews_hudi").
  option(OPERATION_OPT_KEY,"bulk_insert").
  option("hoodie.bloom.index.filter.type", "DYNAMIC_V0").
  option("hoodie.bulkinsert.shuffle.parallelism", 100).
  option("hoodie.parquet.compression.codec", "snappy").
  mode(Overwrite).
  save(basePath)
val jsc = new org.apache.spark.api.java.JavaSparkContext(spark.sparkContext)
import org.apache.hudi.config._
import org.apache.hudi.common.model._

val cfg = HoodieWriteConfig.newBuilder().
  withPath("file:///Volumes/HUDIDATA/input-data/amazon-reviews-hudi").
  withIndexConfig(HoodieIndexConfig.newBuilder().
    withIndexType(org.apache.hudi.index.HoodieIndex.IndexType.GLOBAL_BLOOM).build()).
  build()
val readClient = new org.apache.hudi.client.HoodieReadClient(jsc, cfg)
readClient.checkExists(jsc.parallelize(java.util.Arrays.asList(new HoodieKey("ROQTM343YUPY5", null)), 1)).collect()

For comparison, a plain full scan over the same files, read as parquet, takes about 63 seconds

scala> val sparkDF = spark.read.format("parquet").load("file:///Volumes/HUDIDATA/input-data/amazon-reviews-hudi/*/*")
sparkDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 19 more fields]

scala> sparkDF.filter("review_id = 'ROQTM343YUPY5'").show
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|marketplace|customer_id|    review_id|product_id|product_parent|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|year|product_category|
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|     20200719210452|20200719210452_22...|     ROQTM343YUPY5|                 Books|f05cf503-6f55-4d5...|         US|   29124476|ROQTM343YUPY5|0374528373|     569503661|The Brothers Kara...|          5|            0|          1|   N|                N|Despite all his f...|This is indeed on...| 2015-06-25|2015|           Books|
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+


scala> new java.util.Date()
res33: java.util.Date = Mon Jul 20 00:31:23 PDT 2020

scala> hudiDF.filter("review_id = 'ROQTM343YUPY5'").show
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|marketplace|customer_id|    review_id|product_id|product_parent|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|year|product_category|
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|     20200719210452|20200719210452_22...|     ROQTM343YUPY5|                 Books|f05cf503-6f55-4d5...|         US|   29124476|ROQTM343YUPY5|0374528373|     569503661|The Brothers Kara...|          5|            0|          1|   N|                N|Despite all his f...|This is indeed on...| 2015-06-25|2015|           Books|
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+-----------+-------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+


scala> new java.util.Date()
res35: java.util.Date = Mon Jul 20 00:32:42 PDT 2020
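
The two `new java.util.Date()` calls above bracket the query by hand: 00:31:23 to 00:32:42 is 79 seconds of wall clock, which is only an upper bound because it includes the time spent typing. A small helper gives a tighter number. This is a minimal sketch in plain Scala; `timed` is a hypothetical name and not part of Spark or Hudi (Spark's own `spark.time { ... }` prints a similar measurement).

```scala
// Hypothetical helper: runs a block and returns its result
// together with the elapsed wall-clock time in milliseconds.
def timed[T](block: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1000000L
  (result, elapsedMs)
}

// Stand-in workload so the sketch runs without Spark.
val (sum, ms) = timed { (1 to 1000).sum }
println(s"sum=$sum took $ms ms")
```

In the shell the same helper would wrap the query itself, e.g. `timed { hudiDF.filter("review_id = 'ROQTM343YUPY5'").show }`.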


Lookup using BLOOM_INDEX, supplying the Books partitionPath

scala> readClient.checkExists(jsc.parallelize(java.util.Arrays.asList(new HoodieKey("ROQTM343YUPY5", "Books")), 1)).collect()
res9: java.util.List[(org.apache.hudi.common.model.HoodieKey, org.apache.hudi.common.util.Option[String])] = [(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Books},Option{val=(Books,f05cf503-6f55-4d54-ad7d-e16cfe41aa2c-0)})]

scala>


Searching across all the partitions

val allProductCategories = hudiDF.select("product_category").distinct().map(r => r.getString(0)).collect
val keys = allProductCategories.map(c => new HoodieKey("ROQTM343YUPY5", c)).toList
readClient.checkExists(jsc.parallelize(keys, 1)).collect()
val locations = readClient.checkExists(jsc.parallelize(keys, 1)).collect()
scala> locations.foreach(println)
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Lawn_and_Garden},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Digital_Music_Purchase},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Grocery},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Mobile_Apps},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Baby},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Musical_Instruments},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Watches},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Video_Games},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Books},Option{val=(Books,f05cf503-6f55-4d54-ad7d-e16cfe41aa2c-0)})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Digital_Ebook_Purchase},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Outdoors},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Shoes},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Health_&_Personal_Care},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Gift_Card},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Jewelry},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=PC},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Furniture},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Beauty},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Wireless},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Luggage},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Toys},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Home_Improvement},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Major_Appliances},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Kitchen},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Digital_Software},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Apparel},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Sports},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Tools},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Home},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Mobile_Electronics},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Digital_Video_Games},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Home_Entertainment},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Pet_Products},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Automotive},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Digital_Video_Download},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Electronics},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Personal_Care_Appliances},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Office_Products},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Music},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Software},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Camera},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Video_DVD},Option{val=null})
(HoodieKey { recordKey=ROQTM343YUPY5 partitionPath=Video},Option{val=null})

It found the one key in approx 26 seconds!


About 20 seconds of that went to reading all the bloom filters and key ranges (this is where RFC-15 is going to rock), and the remainder to checking the one matched file.
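
That split matches the general shape of a bloom-filter index lookup: prune files by their min/max key range, test the bloom filters of the survivors, then open only the candidate files to confirm. The sketch below is a toy in plain Scala and not Hudi's implementation; `ToyBloom`, `FileMeta`, `mkFile` and `lookup` are hypothetical names, and the in-memory `keys` set stands in for the parquet file contents.

```scala
import scala.util.hashing.MurmurHash3

// Toy bloom filter: numHashes probes into a fixed-size bit set. Not Hudi's filter.
final class ToyBloom(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)
  private def positions(key: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(key, seed)
      ((h % numBits) + numBits) % numBits // keep the index non-negative
    }
  def add(key: String): Unit = positions(key).foreach(p => bits.set(p))
  def mightContain(key: String): Boolean = positions(key).forall(p => bits.get(p))
}

// Hypothetical per-file metadata: key range and bloom filter, plus the
// actual keys standing in for the file contents.
final case class FileMeta(fileId: String, minKey: String, maxKey: String,
                          bloom: ToyBloom, keys: Set[String])

def mkFile(fileId: String, keys: Set[String]): FileMeta = {
  val bloom = new ToyBloom(1024, 3)
  keys.foreach(k => bloom.add(k))
  FileMeta(fileId, keys.min, keys.max, bloom, keys)
}

def lookup(key: String, files: Seq[FileMeta]): Option[String] = {
  // Step 1: range pruning, cheap once the ranges are loaded.
  val inRange = files.filter(f => f.minKey <= key && key <= f.maxKey)
  // Step 2: bloom check, may yield false positives but never false negatives.
  val candidates = inRange.filter(_.bloom.mightContain(key))
  // Step 3: confirm against the candidate file contents (the expensive read).
  candidates.find(_.keys.contains(key)).map(_.fileId)
}

val files = Seq(
  mkFile("file-a", Set("R119E7G9JQDDO5", "R160R9P9BRK8J6")),
  mkFile("file-b", Set("RL9AZUSVJC16M", "ROQTM343YUPY5", "RY7UKOQOZ1NA9"))
)
println(lookup("ROQTM343YUPY5", files))
println(lookup("RZZZ", files))
```

In this toy, steps 1 and 2 touch metadata for every file while step 3 touches only the candidates, which is why loading the filters and ranges dominates when there are many files and a single key.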
