Sunday, March 30, 2025

Apache Flink APIs and Libraries

APIs and Libraries

DataSet API and DataStream API are the core APIs in Flink. The DataSet API is used for batch applications, whereas the DataStream API is used for streaming applications.

These APIs offer common data transformation functionalities like joins, aggregations, windows, and state management. The DataSet API also provides additional primitives on bounded data sets, such as loops/iterations.

Some common transformation operations supported by DataStream & DataSet APIs:

Function Description Sample (Scala Code)
map() Takes one input element and produces one output element. dataStream1.map{x => x*10}
flatMap() Takes one input element and produces zero, one, or more output elements. dataStream1.flatMap { strr => strr.split(" ") }
filter() Evaluates a boolean function for each element and retains those for which the function returns true. dataStream1.filter { _ < 10 }
keyBy() Returns a KeyedStream - Logically partitions a stream into disjoint partitions based on the key. dataStream1.keyBy(_.keyColumn)
Aggregations (min, max, sum, minBy, maxBy, sumBy)
  • min() returns the minimum value.
  • minBy() returns the element that has the minimum value.
keyedStream.min(0)
keyedStream.minBy("keyColumn")
union Combines two or more data streams into a new stream containing all elements from all streams. dataStream1.union(stream1, stream2)

Check out this quiz on Apache Flink: Quiz on Introduction to Apache Flink

0 comments:

If you have any doubts,please let me know