@ibzib
Created July 29, 2021 18:19
BEAM-12675
55 results - 12 files
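The listing below appears to be the output of a recursive search for the Hugo `{{< highlight java >}}` shortcode across the website sources. A similar listing can be reproduced with plain `grep`; the snippet below is a self-contained sketch against a throwaway directory, not the exact command used to generate this gist:

```shell
# Hypothetical reproduction of the search: create a sample markdown file
# containing the Hugo shortcode, then search for it recursively.
# -r = recurse into directories, -n = print line numbers,
# -F = treat the pattern as a fixed string (no regex escaping needed).
mkdir -p /tmp/beam-demo/website
cat > /tmp/beam-demo/website/example.md <<'EOF'
For example:
{{< highlight java >}}
Configuration myHadoopConfiguration = new Configuration(false);
{{< /highlight >}}
EOF
grep -rnF '{{< highlight java >}}' /tmp/beam-demo/website
# prints: /tmp/beam-demo/website/example.md:2:{{< highlight java >}}
```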
website/CONTRIBUTE.md:
192
193: ```
194: {{< highlight java >}}
195 // This is java
website/www/site/content/en/documentation/io/built-in/hadoop.md:
35
36: For example:
37: {{< highlight java >}}
38 Configuration myHadoopConfiguration = new Configuration(false);
51
52: For example:
53: {{< highlight java >}}
54 SimpleFunction<InputFormatKeyClass, MyKeyClass> myOutputKeyType =
292
293: A table snapshot can be taken using the HBase shell or programmatically:
294: {{< highlight java >}}
295 try (
365
366: For example:
367: {{< highlight java >}}
368 Configuration myHadoopConfiguration = new Configuration(false);
website/www/site/content/en/documentation/io/built-in/hcatalog.md:
48
49: For example:
50: {{< highlight java >}}
51 Map<String, String> configProperties = new HashMap<String, String>();
website/www/site/content/en/documentation/io/built-in/snowflake.md:
31
32: Passing credentials is done via Pipeline options used to instantiate `SnowflakeIO.DataSourceConfiguration`:
33: {{< highlight java >}}
34 SnowflakePipelineOptions options = PipelineOptionsFactory
68 ### General usage
69: Create the DataSource configuration:
70: {{< highlight java >}}
71 SnowflakeIO.DataSourceConfiguration
330 ### Batch write (from a bounded source)
331: The basic .`write()` operation usage is as follows:
332: {{< highlight java >}}
333 data.apply(
384 ### Streaming write (from unbounded source)
385: It is required to create a [SnowPipe](https://docs.snowflake.com/en/user-guide/data-load-snowpipe.html) in the Snowflake console. SnowPipe should use the same integration and the same bucket as specified by .withStagingBucketName and .withStorageIntegrationName methods. The write operation might look as follows:
386: {{< highlight java >}}
387 data.apply(
497 ### UserDataMapper function
498: The UserDataMapper function is required to map data from a PCollection to an array of String values before the `write()` operation saves the data to temporary .csv files. For example:
499: {{< highlight java >}}
500 public static SnowflakeIO.UserDataMapper<Long> getCsvMapper() {
507
508: Usage:
509: {{< highlight java >}}
510 String query = "SELECT t.$1 from YOUR_TABLE;";
530
531: Example of usage:
532: {{< highlight java >}}
533 data.apply(
550
551: Usage:
552: {{< highlight java >}}
553 data.apply(
567
568: Usage:
569: {{< highlight java >}}
570 SFTableSchema tableSchema =
590
591: The basic `.read()` operation usage:
592: {{< highlight java >}}
593 PCollection<USER_DATA_TYPE> items = pipeline.apply(
648
649: Example implementation of CsvMapper for GenericRecord:
650: {{< highlight java >}}
651 static SnowflakeIO.CsvMapper<GenericRecord> getCsvMapper() {
website/www/site/content/en/documentation/runners/dataflow.md:
50
51: <span class="language-java">When using Java, you must specify your dependency on the Cloud Dataflow Runner in your `pom.xml`.</span>
52: {{< highlight java >}}
53 <dependency>
website/www/site/content/en/documentation/runners/direct.md:
45
46: <span class="language-java">When using Java, you must specify your dependency on the Direct Runner in your `pom.xml`.</span>
47: {{< highlight java >}}
48 <dependency>
website/www/site/content/en/documentation/runners/samza.md:
39
40: <span class="language-java">You can specify your dependency on the Samza Runner by adding the following to your `pom.xml`:</span>
41: {{< highlight java >}}
42 <dependency>
website/www/site/content/en/documentation/sdks/java/euphoria.md:
42 ## WordCount Example
43: Lets start with the small example.
44: {{< highlight java >}}
45 PipelineOptions options = PipelineOptionsFactory.create();
146
147: When prototyping you may decide not to care much about coders, then create `KryoCoderProvider` without any class registrations to [Kryo](https://github.com/EsotericSoftware/kryo).
148: {{< highlight java >}}
149 //Register `KryoCoderProvider` which attempt to use `KryoCoder` to every non-primitive type
151 {{< /highlight >}}
152: Such a `KryoCoderProvider` will return `KryoCoder` for every non-primitive element type. That of course degrades performance, since Kryo is not able to serialize instance of unknown types effectively. But it boost speed of pipeline development. This behavior is enabled by default and can be disabled when creating `Pipeline` through `KryoOptions`.
153: {{< highlight java >}}
154 PipelineOptions options = PipelineOptionsFactory.create();
157
158: Second more performance friendly way is to register all the types which will Kryo serialize. Sometimes it is also a good idea to register Kryo serializers of its own too. Euphoria allows you to do that by implementing your own `KryoRegistrar` and using it when creating `KryoCoderProvider`.
159: {{< highlight java >}}
160 //Do not allow `KryoCoderProvider` to return `KryoCoder` for unregistered types
168 {{< /highlight >}}
169: Beam resolves coders using types of elements. Type information is not available at runtime when element type is described by lambda implementation. It is due to type erasure and dynamic nature of lambda expressions. So there is an optional way of supplying `TypeDescriptor` every time new type is introduced during Operator construction.
170: {{< highlight java >}}
171 PCollection<Integer> input = ...
181 ### Metrics and Accumulators
182: Statistics about job's internals are very helpful during development of distributed jobs. Euphoria calls them accumulators. They are accessible through environment `Context`, which can be obtained from `Collector`, whenever working with it. It is usually present when zero-to-many output elements are expected from operator. For example in case of `FlatMap`.
183: {{< highlight java >}}
184 Pipeline pipeline = ...
197 {{< /highlight >}}
198: `MapElements` also allows for `Context` to be accessed by supplying implementations of `UnaryFunctionEnv` (add second context argument) instead of `UnaryFunctor`.
199: {{< highlight java >}}
200 Pipeline pipeline = ...
217 ### Windowing
218: Euphoria follows the same [windowing principles](/documentation/programming-guide/#windowing) as Beam Java SDK. Every shuffle operator (operator which needs to shuffle data over the network) allows you to set it. The same parameters as in Beam are required. `WindowFn`, `Trigger`, `WindowingStrategy` and other. Users are guided to either set all mandatory and several optional parameters or none when building an operator. Windowing is propagated down through the `Pipeline`.
219: {{< highlight java >}}
220 PCollection<KV<Integer, Long>> countedElements =
242 ### `CountByKey`
243: Counting elements with the same key. Requires input dataset to be mapped by given key extractor (`UnaryFunction`) to keys which are then counted. Output is emitted as `KV<K, Long>` (`K` is key type) where each `KV` contains key and number of element in input dataset for the key.
244: {{< highlight java >}}
245 // suppose input: [1, 2, 4, 1, 1, 3]
261 {{< /highlight >}}
262: `Distinct` with mapper.
263: {{< highlight java >}}
264 // suppose keyValueInput: [KV(1, 100L), KV(3, 100_000L), KV(42, 10L), KV(1, 0L), KV(3, 0L)]
272 ### `Join`
273: Represents inner join of two (left and right) datasets on given key producing a new dataset. Key is extracted from both datasets by separate extractors so elements in left and right can have different types denoted as `LeftT` and `RightT`. The join itself is performed by user-supplied `BinaryFunctor` which consumes elements from both dataset sharing the same key. And outputs result of the join (`OutputT`). The operator emits output dataset of `KV<K, OutputT>` type.
274: {{< highlight java >}}
275 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
287 ### `LeftJoin`
288: Represents left join of two (left and right) datasets on given key producing single new dataset. Key is extracted from both datasets by separate extractors so elements in left and right can have different types denoted as `LeftT` and `RightT`. The join itself is performed by user-supplied `BinaryFunctor` which consumes one element from both dataset, where right is present optionally, sharing the same key. And outputs result of the join (`OutputT`). The operator emits output dataset of `KV<K, OutputT>` type.
289: {{< highlight java >}}
290 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
306 ### `RightJoin`
307: Represents right join of two (left and right) datasets on given key producing single new dataset. Key is extracted from both datasets by separate extractors so elements in left and right can have different types denoted as `LeftT` and `RightT`. The join itself is performed by user-supplied `BinaryFunctor` which consumes one element from both dataset, where left is present optionally, sharing the same key. And outputs result of the join (`OutputT`). The operator emits output dataset of `KV<K, OutputT>` type.
308: {{< highlight java >}}
309 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
325 ### `FullJoin`
326: Represents full outer join of two (left and right) datasets on given key producing single new dataset. Key is extracted from both datasets by separate extractors so elements in left and right can have different types denoted as `LeftT` and `RightT`. The join itself is performed by user-supplied `BinaryFunctor` which consumes one element from both dataset, where both are present only optionally, sharing the same key. And outputs result of the join (`OutputT`). The operator emits output dataset of `KV<K, OutputT>` type.
327: {{< highlight java >}}
328 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
343 ### `MapElements`
344: Transforms one input element of input type `InputT` to one output element of another (potentially the same) `OutputT` type. Transformation is done through user specified `UnaryFunction`.
345: {{< highlight java >}}
346 // suppose inputs contains: [ 0, 1, 2, 3, 4, 5]
355 ### `FlatMap`
356: Transforms one input element of input type `InputT` to zero or more output elements of another (potentially the same) `OutputT` type. Transformation is done through user specified `UnaryFunctor`, where `Collector<OutputT>` is utilized to emit output elements. Notice similarity with `MapElements` which can always emit only one element.
357: {{< highlight java >}}
358 // suppose words contain: ["Brown", "fox", ".", ""]
371 {{< /highlight >}}
372: `FlatMap` may be used to determine time-stamp of elements. It is done by supplying implementation of `ExtractEventTime` time extractor when building it. There is specialized `AssignEventTime` operator to assign time-stamp to elements. Consider using it, you code may be more readable.
373: {{< highlight java >}}
374 // suppose events contain events of SomeEventObject, its 'getEventTimeInMillis()' methods returns time-stamp
384 ### `Filter`
385: `Filter` throws away all the elements which do not pass given condition. The condition is supplied by the user as implementation of `UnaryPredicate`. Input and output elements are of the same type.
386: {{< highlight java >}}
387 // suppose nums contains: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
397
398: Following example shows basic usage of `ReduceByKey` operator including value extraction.
399: {{< highlight java >}}
400 //suppose animals contains : [ "mouse", "rat", "elephant", "cat", "X", "duck"]
411
412: Now suppose that we want to track our `ReduceByKey` internals using counter.
413: {{< highlight java >}}
414 //suppose animals contains : [ "mouse", "rat", "elephant", "cat", "X", "duck"]
429
430: Again the same example with optimized combinable output.
431: {{< highlight java >}}
432 //suppose animals contains : [ "mouse", "rat", "elephant", "cat", "X", "duck"]
444
445: Euphoria aims to make code easy to write and read. Therefore some support to write combinable reduce functions in form of `Fold` or folding function is already there. It allows user to supply only the reduction logic (`BinaryFunction`) and creates `CombinableReduceFunction` out of it. Supplied `BinaryFunction` still have to be associative.
446: {{< highlight java >}}
447 //suppose animals contains : [ "mouse", "rat", "elephant", "cat", "X", "duck"]
459 ### `ReduceWindow`
460: Reduces all elements in a [window](#windowing). The operator corresponds to `ReduceByKey` with the same key for all elements, so the actual key is defined only by window.
461: {{< highlight java >}}
462 //suppose input contains [ 1, 2, 3, 4, 5, 6, 7, 8 ]
476 ### `SumByKey`
477: Summing elements with same key. Requires input dataset to be mapped by given key extractor (`UnaryFunction`) to keys. By value extractor, also `UnaryFunction` which outputs to `Long`, to values. Those values are then grouped by key and summed. Output is emitted as `KV<K, Long>` (`K` is key type) where each `KV` contains key and number of element in input dataset for the key.
478: {{< highlight java >}}
479 //suppose input contains: [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
489 ### `Union`
490: Merge of at least two datasets of the same type without any guarantee about elements ordering.
491: {{< highlight java >}}
492 //suppose cats contains: [ "cheetah", "cat", "lynx", "jaguar" ]
501 ### `TopPerKey`
502: Emits one top-rated element per key. Key of type `K` is extracted by given `UnaryFunction`. Another `UnaryFunction` extractor allows for conversion input elements to values of type `V`. Selection of top element is based on _score_, which is obtained from each element by user supplied `UnaryFunction` called score calculator. Score type is denoted as `ScoreT` and it is required to extend `Comparable<ScoreT>` so scores of two elements can be compared directly. Output dataset elements are of type `Triple<K, V, ScoreT>`.
503: {{< highlight java >}}
504 // suppose 'animals contain: [ "mouse", "elk", "rat", "mule", "elephant", "dinosaur", "cat", "duck", "caterpillar" ]
516 ### `AssignEventTime`
517: Euphoria needs to know how to extract time-stamp from elements when [windowing](#windowing) is applied. `AssignEventTime` tells Euphoria how to do that through given implementation of `ExtractEventTime` function.
518: {{< highlight java >}}
519 // suppose events contain events of SomeEventObject, its 'getEventTimeInMillis()' methods returns time-stamp
website/www/site/content/en/documentation/transforms/java/aggregation/hllcount.md:
39 ## Examples
40: **Example 1**: creates a long-type sketch for a `PCollection<Long>` with a custom precision:
41: {{< highlight java >}}
42 PCollection<Long> input = ...;
46
47: **Example 2**: creates a bytes-type sketch for a `PCollection<KV<String, byte[]>>`:
48: {{< highlight java >}}
49 PCollection<KV<String, byte[]>> input = ...;
53 **Example 3**: merges existing sketches in a `PCollection<byte[]>` into a new sketch,
54: which summarizes the union of the inputs that were aggregated in the merged sketches:
55: {{< highlight java >}}
56 PCollection<byte[]> sketches = ...;
59
60: **Example 4**: estimates the count of distinct elements in a `PCollection<String>`:
61: {{< highlight java >}}
62 PCollection<String> input = ...;
66
67: **Example 5**: extracts the count distinct estimate from an existing sketch:
68: {{< highlight java >}}
69 PCollection<byte[]> sketch = ...;
website/www/site/content/en/documentation/transforms/java/aggregation/latest.md:
37 ## Examples
38: **Example**: compute the latest value for each session
39: {{< highlight java >}}
40 PCollection input = ...;
website/www/site/content/en/documentation/transforms/java/elementwise/withkeys.md:
37 ## Examples
38: **Example**
39: {{< highlight java >}}
40 PCollection<String> words = Create.of("Hello", "World", "Beam", "is", "fun");
website/www/site/content/en/documentation/transforms/java/other/passert.md:
32 ## Examples
33: For a given `PCollection`, you can use `PAssert` to verify the contents as follows:
34: {{< highlight java >}}
35 PCollection<String> output = ...;