APIs and Libraries
The DataSet API and the DataStream API are the core APIs in Flink. The DataSet API is used for batch applications, whereas the DataStream API is used for streaming applications.
These APIs offer common data transformation functionality such as joins, aggregations, windows, and state management. The DataSet API also provides additional primitives on bounded data sets, such as loops/iterations.
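To make the batch/streaming split concrete, here is a minimal sketch of the two entry points. It assumes a Flink release that still ships the Scala APIs (`org.apache.flink.api.scala` and `org.apache.flink.streaming.api.scala`) on the classpath; the object names, the `timesTen` helper, and the job name are hypothetical and only for illustration.

```scala
// Shared user logic, reused by both sketches below (hypothetical helper).
object Logic {
  def timesTen(x: Int): Int = x * 10
}

// DataSet API: batch processing over a bounded data set.
object BatchSketch {
  import org.apache.flink.api.scala._

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val numbers: DataSet[Int] = env.fromElements(1, 2, 3, 4)
    // For a DataSet, print() itself triggers execution of the job.
    numbers.map(x => Logic.timesTen(x)).print()
  }
}

// DataStream API: stream processing.
object StreamingSketch {
  import org.apache.flink.streaming.api.scala._

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val numbers: DataStream[Int] = env.fromElements(1, 2, 3, 4)
    numbers.map(x => Logic.timesTen(x)).print()
    // A DataStream job only runs once execute() is called.
    env.execute("datastream-sketch")
  }
}
```

The imports are kept inside each object because both packages provide their own implicit type information helpers, and importing both wildcards into one scope can make them ambiguous.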
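Some common transformation operations supported by the DataStream and DataSet APIs (the samples use the DataStream API; keyBy() is specific to DataStream, and the DataSet API uses groupBy() instead):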
Function | Description | Sample (Scala Code) |
---|---|---|
map() | Takes one input element and produces one output element. | dataStream1.map{x => x*10} |
flatMap() | Takes one input element and produces zero, one, or more output elements. | dataStream1.flatMap { strr => strr.split(" ") } |
filter() | Evaluates a boolean function for each element and retains those for which the function returns true. | dataStream1.filter { _ < 10 } |
keyBy() | Logically partitions a stream into disjoint partitions based on the key and returns a KeyedStream. | dataStream1.keyBy(_.keyColumn) |
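Aggregations (min, max, sum, minBy, maxBy) | Rolling aggregations on a keyed stream. min() returns the minimum value, whereas minBy() returns the element that has the minimum value in the given field. | keyedStream.min(0), keyedStream.minBy("keyColumn") |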
union() | Combines two or more data streams into a new stream containing all elements from all streams. | dataStream1.union(stream1, stream2) |
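The operations in the table can be chained into a single job. The following is a sketch of a small word count, assuming the Scala DataStream API (`org.apache.flink.streaming.api.scala`) is available; the `WordCount` case class, the `tokenize` helper, the sample sentences, and the job name are hypothetical and not part of Flink.

```scala
import org.apache.flink.streaming.api.scala._

object TransformationsSketch {
  // Hypothetical record type used only for this illustration.
  case class WordCount(word: String, count: Int)

  // Hypothetical helper: splits a line on single spaces.
  def tokenize(line: String): Seq[String] = line.split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val lines1: DataStream[String] = env.fromElements("to be or not to be")
    val lines2: DataStream[String] = env.fromElements("that is the question")

    val counts: DataStream[WordCount] = lines1
      .union(lines2)                      // union: merge both streams
      .flatMap(line => tokenize(line))    // flatMap: one line, many words
      .filter(word => word.nonEmpty)      // filter: drop empty tokens
      .map(word => WordCount(word, 1))    // map: one word, one record
      .keyBy(_.word)                      // keyBy: partition by word
      .sum("count")                       // rolling sum per key

    counts.print()
    env.execute("transformations-sketch")
  }
}
```

Because sum() is a rolling aggregation, the job emits an updated count each time a word arrives rather than one final total per word, and the order of printed records can vary with parallelism.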