Created July 29, 2021 18:19
BEAM-12675
55 results - 12 files
website/CONTRIBUTE.md:
192
193: ```
194: {{< highlight java >}}
195 // This is java
website/www/site/content/en/documentation/io/built-in/hadoop.md:
35
36: For example:
37: {{< highlight java >}}
38 Configuration myHadoopConfiguration = new Configuration(false);
51
52: For example:
53: {{< highlight java >}}
54 SimpleFunction<InputFormatKeyClass, MyKeyClass> myOutputKeyType =
292
293: A table snapshot can be taken using the HBase shell or programmatically:
294: {{< highlight java >}}
295 try (
365
366: For example:
367: {{< highlight java >}}
368 Configuration myHadoopConfiguration = new Configuration(false);
website/www/site/content/en/documentation/io/built-in/hcatalog.md:
48
49: For example:
50: {{< highlight java >}}
51 Map<String, String> configProperties = new HashMap<String, String>();
website/www/site/content/en/documentation/io/built-in/snowflake.md:
31
32: Passing credentials is done via Pipeline options used to instantiate `SnowflakeIO.DataSourceConfiguration`:
33: {{< highlight java >}}
34 SnowflakePipelineOptions options = PipelineOptionsFactory
68 ### General usage
69: Create the DataSource configuration:
70: {{< highlight java >}}
71 SnowflakeIO.DataSourceConfiguration
330 ### Batch write (from a bounded source)
331: The basic `.write()` operation usage is as follows:
332: {{< highlight java >}}
333 data.apply(
384 ### Streaming write (from an unbounded source)
385: It is required to create a [SnowPipe](https://docs.snowflake.com/en/user-guide/data-load-snowpipe.html) in the Snowflake console. The SnowPipe should use the same integration and the same bucket as specified by the `.withStagingBucketName` and `.withStorageIntegrationName` methods. The write operation might look as follows:
386: {{< highlight java >}}
387 data.apply(
497 ### UserDataMapper function
498: The `UserDataMapper` function is required to map data from a `PCollection` to an array of `String` values before the `write()` operation saves the data to temporary .csv files. For example:
499: {{< highlight java >}}
500 public static SnowflakeIO.UserDataMapper<Long> getCsvMapper() {
507
508: Usage:
509: {{< highlight java >}}
510 String query = "SELECT t.$1 from YOUR_TABLE;";
530
531: Example of usage:
532: {{< highlight java >}}
533 data.apply(
550
551: Usage:
552: {{< highlight java >}}
553 data.apply(
567
568: Usage:
569: {{< highlight java >}}
570 SFTableSchema tableSchema =
590
591: The basic `.read()` operation usage:
592: {{< highlight java >}}
593 PCollection<USER_DATA_TYPE> items = pipeline.apply(
648
649: Example implementation of `CsvMapper` for `GenericRecord`:
650: {{< highlight java >}}
651 static SnowflakeIO.CsvMapper<GenericRecord> getCsvMapper() {
website/www/site/content/en/documentation/runners/dataflow.md:
50
51: <span class="language-java">When using Java, you must specify your dependency on the Cloud Dataflow Runner in your `pom.xml`.</span>
52: {{< highlight java >}}
53 <dependency>
website/www/site/content/en/documentation/runners/direct.md:
45
46: <span class="language-java">When using Java, you must specify your dependency on the Direct Runner in your `pom.xml`.</span>
47: {{< highlight java >}}
48 <dependency>
website/www/site/content/en/documentation/runners/samza.md:
39
40: <span class="language-java">You can specify your dependency on the Samza Runner by adding the following to your `pom.xml`:</span>
41: {{< highlight java >}}
42 <dependency>
website/www/site/content/en/documentation/sdks/java/euphoria.md:
42 ## WordCount Example
43: Let's start with a small example.
44: {{< highlight java >}}
45 PipelineOptions options = PipelineOptionsFactory.create();
146
147: When prototyping, you may decide not to care much about coders; in that case, create a `KryoCoderProvider` without any class registrations in [Kryo](https://github.com/EsotericSoftware/kryo).
148: {{< highlight java >}}
149 // Register `KryoCoderProvider`, which attempts to use `KryoCoder` for every non-primitive type
151 {{< /highlight >}}
152: Such a `KryoCoderProvider` will return a `KryoCoder` for every non-primitive element type. That of course degrades performance, since Kryo is not able to serialize instances of unknown types effectively, but it boosts the speed of pipeline development. This behavior is enabled by default and can be disabled through `KryoOptions` when creating the `Pipeline`.
153: {{< highlight java >}}
154 PipelineOptions options = PipelineOptionsFactory.create();
157
158: A second, more performance-friendly way is to register all the types which Kryo will serialize. Sometimes it is also a good idea to register your own Kryo serializers too. Euphoria allows you to do that by implementing your own `KryoRegistrar` and using it when creating the `KryoCoderProvider`.
159: {{< highlight java >}}
160 // Do not allow `KryoCoderProvider` to return `KryoCoder` for unregistered types
168 {{< /highlight >}}
169: Beam resolves coders using the types of elements. Type information is not available at runtime when an element type is described by a lambda implementation, due to type erasure and the dynamic nature of lambda expressions. Therefore, there is an optional way of supplying a `TypeDescriptor` every time a new type is introduced during operator construction.
170: {{< highlight java >}}
171 PCollection<Integer> input = ...
181 ### Metrics and Accumulators
182: Statistics about a job's internals are very helpful during the development of distributed jobs. Euphoria calls them accumulators. They are accessible through the environment `Context`, which can be obtained from the `Collector` whenever you work with one. A `Collector` is usually present when zero-to-many output elements are expected from an operator, for example in the case of `FlatMap`.
183: {{< highlight java >}}
184 Pipeline pipeline = ...
197 {{< /highlight >}}
198: `MapElements` also allows the `Context` to be accessed by supplying an implementation of `UnaryFunctionEnv` (which adds a second context argument) instead of `UnaryFunctor`.
199: {{< highlight java >}}
200 Pipeline pipeline = ...
217 ### Windowing
218: Euphoria follows the same [windowing principles](/documentation/programming-guide/#windowing) as the Beam Java SDK. Every shuffle operator (an operator which needs to shuffle data over the network) allows you to set windowing. The same parameters as in Beam are required: `WindowFn`, `Trigger`, `WindowingStrategy`, and others. When building an operator, users are guided to either set all mandatory parameters (plus several optional ones) or none. Windowing is propagated down through the `Pipeline`.
219: {{< highlight java >}}
220 PCollection<KV<Integer, Long>> countedElements =
242 ### `CountByKey`
243: Counts elements with the same key. Requires the input dataset to be mapped by a given key extractor (`UnaryFunction`) to keys, which are then counted. Output is emitted as `KV<K, Long>` (`K` is the key type), where each `KV` contains a key and the number of elements in the input dataset with that key.
244: {{< highlight java >}}
245 // suppose input: [1, 2, 4, 1, 1, 3]
261 {{< /highlight >}}
262: `Distinct` with a mapper.
263: {{< highlight java >}}
264 // suppose keyValueInput: [KV(1, 100L), KV(3, 100_000L), KV(42, 10L), KV(1, 0L), KV(3, 0L)]
272 ### `Join`
273: Represents an inner join of two (left and right) datasets on a given key, producing a new dataset. The key is extracted from both datasets by separate extractors, so elements in the left and right datasets can have different types, denoted `LeftT` and `RightT`. The join itself is performed by a user-supplied `BinaryFunctor`, which consumes elements from both datasets sharing the same key and outputs the result of the join (`OutputT`). The operator emits an output dataset of type `KV<K, OutputT>`.
274: {{< highlight java >}}
275 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
287 ### `LeftJoin`
288: Represents a left join of two (left and right) datasets on a given key, producing a single new dataset. The key is extracted from both datasets by separate extractors, so elements in the left and right datasets can have different types, denoted `LeftT` and `RightT`. The join itself is performed by a user-supplied `BinaryFunctor`, which consumes one element from each dataset sharing the same key, with the right element present only optionally, and outputs the result of the join (`OutputT`). The operator emits an output dataset of type `KV<K, OutputT>`.
289: {{< highlight java >}}
290 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
306 ### `RightJoin`
307: Represents a right join of two (left and right) datasets on a given key, producing a single new dataset. The key is extracted from both datasets by separate extractors, so elements in the left and right datasets can have different types, denoted `LeftT` and `RightT`. The join itself is performed by a user-supplied `BinaryFunctor`, which consumes one element from each dataset sharing the same key, with the left element present only optionally, and outputs the result of the join (`OutputT`). The operator emits an output dataset of type `KV<K, OutputT>`.
308: {{< highlight java >}}
309 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
325 ### `FullJoin`
326: Represents a full outer join of two (left and right) datasets on a given key, producing a single new dataset. The key is extracted from both datasets by separate extractors, so elements in the left and right datasets can have different types, denoted `LeftT` and `RightT`. The join itself is performed by a user-supplied `BinaryFunctor`, which consumes one element from each dataset sharing the same key, with both present only optionally, and outputs the result of the join (`OutputT`). The operator emits an output dataset of type `KV<K, OutputT>`.
327: {{< highlight java >}}
328 // suppose that left contains: [1, 2, 3, 0, 4, 3, 1]
343 ### `MapElements`
344: Transforms one input element of type `InputT` into one output element of another (potentially the same) type `OutputT`. The transformation is done through a user-specified `UnaryFunction`.
345: {{< highlight java >}}
346 // suppose inputs contains: [ 0, 1, 2, 3, 4, 5]
355 ### `FlatMap`
356: Transforms one input element of type `InputT` into zero or more output elements of another (potentially the same) type `OutputT`. The transformation is done through a user-specified `UnaryFunctor`, where a `Collector<OutputT>` is used to emit output elements. Notice the similarity with `MapElements`, which always emits exactly one element.
357: {{< highlight java >}}
358 // suppose words contain: ["Brown", "fox", ".", ""]
371 {{< /highlight >}}
372: `FlatMap` may be used to assign timestamps to elements by supplying an implementation of the `ExtractEventTime` time extractor when building it. There is a specialized `AssignEventTime` operator for assigning timestamps to elements; consider using it, as your code may be more readable.
373: {{< highlight java >}}
374 // suppose events contain events of SomeEventObject, whose 'getEventTimeInMillis()' method returns a timestamp
384 ### `Filter`
385: `Filter` discards all elements which do not pass a given condition. The condition is supplied by the user as an implementation of `UnaryPredicate`. Input and output elements are of the same type.
386: {{< highlight java >}}
387 // suppose nums contains: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
397
398: The following example shows basic usage of the `ReduceByKey` operator, including value extraction.
399: {{< highlight java >}}
400 // suppose animals contains: [ "mouse", "rat", "elephant", "cat", "X", "duck"]
411
412: Now suppose that we want to track our `ReduceByKey` internals using a counter.
413: {{< highlight java >}}
414 // suppose animals contains: [ "mouse", "rat", "elephant", "cat", "X", "duck"]
429
430: Again, the same example with optimized combinable output.
431: {{< highlight java >}}
432 // suppose animals contains: [ "mouse", "rat", "elephant", "cat", "X", "duck"]
444
445: Euphoria aims to make code easy to write and read. Therefore, there is built-in support for writing combinable reduce functions in the form of `Fold`, or a folding function. It allows the user to supply only the reduction logic (a `BinaryFunction`) and creates a `CombinableReduceFunction` out of it. The supplied `BinaryFunction` still has to be associative.
446: {{< highlight java >}}
447 // suppose animals contains: [ "mouse", "rat", "elephant", "cat", "X", "duck"]
459 ### `ReduceWindow`
460: Reduces all elements in a [window](#windowing). The operator corresponds to `ReduceByKey` with the same key for all elements, so the actual key is defined only by the window.
461: {{< highlight java >}}
462 // suppose input contains [ 1, 2, 3, 4, 5, 6, 7, 8 ]
476 ### `SumByKey`
477: Sums elements with the same key. Requires the input dataset to be mapped by a given key extractor (`UnaryFunction`) to keys, and by a value extractor (also a `UnaryFunction`, which outputs `Long` values) to values. The values are then grouped by key and summed. Output is emitted as `KV<K, Long>` (`K` is the key type), where each `KV` contains a key and the sum of the values for that key in the input dataset.
478: {{< highlight java >}}
479 // suppose input contains: [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
489 ### `Union`
490: Merges at least two datasets of the same type with no guarantee about element ordering.
491: {{< highlight java >}}
492 // suppose cats contains: [ "cheetah", "cat", "lynx", "jaguar" ]
501 ### `TopPerKey`
502: Emits one top-rated element per key. A key of type `K` is extracted by a given `UnaryFunction`. Another `UnaryFunction` extractor converts input elements to values of type `V`. Selection of the top element is based on a _score_, obtained from each element by a user-supplied `UnaryFunction` called the score calculator. The score type is denoted `ScoreT` and is required to extend `Comparable<ScoreT>` so the scores of two elements can be compared directly. Output dataset elements are of type `Triple<K, V, ScoreT>`.
503: {{< highlight java >}}
504 // suppose 'animals' contains: [ "mouse", "elk", "rat", "mule", "elephant", "dinosaur", "cat", "duck", "caterpillar" ]
516 ### `AssignEventTime`
517: Euphoria needs to know how to extract timestamps from elements when [windowing](#windowing) is applied. `AssignEventTime` tells Euphoria how to do that through a given implementation of the `ExtractEventTime` function.
518: {{< highlight java >}}
519 // suppose events contain events of SomeEventObject, whose 'getEventTimeInMillis()' method returns a timestamp
website/www/site/content/en/documentation/transforms/java/aggregation/hllcount.md:
39 ## Examples
40: **Example 1**: creates a long-type sketch for a `PCollection<Long>` with a custom precision:
41: {{< highlight java >}}
42 PCollection<Long> input = ...;
46
47: **Example 2**: creates a bytes-type sketch for a `PCollection<KV<String, byte[]>>`:
48: {{< highlight java >}}
49 PCollection<KV<String, byte[]>> input = ...;
53 **Example 3**: merges existing sketches in a `PCollection<byte[]>` into a new sketch,
54: which summarizes the union of the inputs that were aggregated in the merged sketches:
55: {{< highlight java >}}
56 PCollection<byte[]> sketches = ...;
59
60: **Example 4**: estimates the count of distinct elements in a `PCollection<String>`:
61: {{< highlight java >}}
62 PCollection<String> input = ...;
66
67: **Example 5**: extracts the count distinct estimate from an existing sketch:
68: {{< highlight java >}}
69 PCollection<byte[]> sketch = ...;
website/www/site/content/en/documentation/transforms/java/aggregation/latest.md:
37 ## Examples
38: **Example**: compute the latest value for each session
39: {{< highlight java >}}
40 PCollection input = ...;
website/www/site/content/en/documentation/transforms/java/elementwise/withkeys.md:
37 ## Examples
38: **Example**
39: {{< highlight java >}}
40 PCollection<String> words = Create.of("Hello", "World", "Beam", "is", "fun");
website/www/site/content/en/documentation/transforms/java/other/passert.md:
32 ## Examples
33: For a given `PCollection`, you can use `PAssert` to verify the contents as follows:
34: {{< highlight java >}}
35 PCollection<String> output = ...;
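The listing above is ripgrep-style search output (file, line number, matching line). A similar inventory of `{{< highlight java >}}` shortcode occurrences could be reproduced with plain `grep` run from the repository root; the sketch below demonstrates the pattern on a small hypothetical sample file (the `/tmp/beam-demo` path and its contents are stand-ins, not part of the Beam tree):

```shell
# Build a tiny sample tree standing in for the Beam website sources
# (path and file contents are hypothetical).
mkdir -p /tmp/beam-demo
cat > /tmp/beam-demo/sample.md <<'EOF'
For example:
{{< highlight java >}}
Configuration myHadoopConfiguration = new Configuration(false);
{{< /highlight >}}
EOF

# -r: recurse, -n: print line numbers (as in the listing above),
# -F: treat the shortcode as a fixed string, not a regex.
grep -rnF '{{< highlight java >}}' /tmp/beam-demo
```

Against the real tree, pointing the same command at `website/` should yield one line per match, matching the per-file counts listed above.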