Sunday, May 18, 2025

Apache Flink Word Count from File using Java

This post guides you through building a simple Apache Flink batch application that reads a text file, splits lines into words, and counts the number of occurrences of each word.

🧰 Prerequisites

  • Java 8 or 11 installed
  • Apache Flink installed on Ubuntu (View installation guide)
  • Maven or Gradle for Java project setup

📦 Step 1: Maven Dependency

<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.17.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>1.17.1</version>
  </dependency>
</dependencies>
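If you are using Gradle instead of Maven (both are listed in the prerequisites), the equivalent dependency declarations might look like this, assuming the same version as the Maven example above:

```groovy
dependencies {
    // Flink batch/DataSet API
    implementation 'org.apache.flink:flink-java:1.17.1'
    // Required to execute Flink programs locally or submit them to a cluster
    implementation 'org.apache.flink:flink-clients:1.17.1'
}
```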

📝 Step 2: Java Program for Word Count from File

File: FileWordCount.java


import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FileWordCount {

    public static void main(String[] args) throws Exception {
        // Set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Specify the input file path
        String inputPath = "src/main/resources/sample.txt";

        // Read the file
        DataSet<String> text = env.readTextFile(inputPath);

        // Perform word count: tokenize, group by word, sum the counts
        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new Tokenizer())
            .groupBy(0)
            .sum(1);

        // Print the result
        counts.print();
    }

    // Tokenizer class that splits lines into words
    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.length() > 0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
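To see what the pipeline above computes, the same tokenize, group, and sum logic can be sketched in plain Java using java.util.stream. This is only an illustration of the semantics, not Flink code; the class and method names here are made up for the example:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class PlainWordCount {
    // Mirrors the Tokenizer + groupBy(0).sum(1) pipeline with standard Java streams
    public static Map<String, Long> count(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())   // same check as word.length() > 0
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count("Hello Flink Apache Flink is powerful Flink");
        System.out.println(counts.get("flink")); // 3
    }
}
```

The Flink version does the same thing, but distributes the tokenization and the per-key aggregation across the cluster.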

📂 Step 3: Add Sample Input File

Save a file named sample.txt in src/main/resources with the following content:

Hello Flink Apache Flink is powerful Flink processes streaming and batch data

🚀 Step 4: Run the Program

You can run this Java application directly from your IDE, or package it with Maven and submit it to a running Flink cluster using:

./bin/flink run -c FileWordCount path-to-your-jar.jar
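For reference, a typical build-and-submit sequence looks like the following. The jar path is a placeholder; the actual file name depends on your project's artifactId and version:

```shell
# Build the application jar with Maven (run from the project root)
mvn clean package

# Submit the job to a running Flink cluster; adjust the jar path to your build output
./bin/flink run -c FileWordCount target/file-word-count-1.0.jar
```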

✅ Sample Output

(flink,3)
(apache,1)
(is,1)
(powerful,1)
(hello,1)
(processes,1)
(streaming,1)
(and,1)
(batch,1)
(data,1)

(The order of the tuples may vary between runs.)

🎯 Summary

This Flink batch job demonstrates how to process text files and perform basic transformations and aggregations using the DataSet API. You can further enhance it with filtering, sorting, and writing to files or databases.


📚 Also Read: Quiz on Apache Flink Basics

Sunday, April 20, 2025

How to Install Apache Flink on Ubuntu

Apache Flink is a powerful open-source framework for distributed stream and batch data processing. This guide walks you through the steps to install Flink on an Ubuntu system.

🛠️ Prerequisites

  • Ubuntu system (20.04 or later recommended)
  • Java 8 or Java 11 installed
  • Internet connection and basic terminal knowledge

🔧 Step 1: Install Java

Apache Flink requires Java to run. You can install OpenJDK 11 using the following command:

    sudo apt update
    sudo apt install openjdk-11-jdk -y

Verify Java installation:

    java -version

📦 Step 2: Download Apache Flink

Visit the official Flink download page to get the latest version. You can also use the command line:

    wget https://downloads.apache.org/flink/flink-1.17.1/flink-1.17.1-bin-scala_2.12.tgz

📁 Step 3: Extract Flink

    tar -xvzf flink-1.17.1-bin-scala_2.12.tgz
    cd flink-1.17.1

🚀 Step 4: Start Flink

Flink comes with a built-in local cluster. You can start it using the following script:

    ./bin/start-cluster.sh

This will start both the JobManager and TaskManager on your local machine.

🌐 Step 5: Access Flink Web UI

After starting the cluster, you can access the Flink dashboard at:

    http://localhost:8081
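You can also check the cluster status from the terminal through Flink's REST API, which is served on the same port as the Web UI:

```shell
# Returns a JSON summary of the cluster (TaskManagers, available slots, running jobs)
curl http://localhost:8081/overview
```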

🛑 Stopping Flink

    ./bin/stop-cluster.sh

✅ Conclusion

You're now ready to run and develop Flink jobs on your Ubuntu machine! Stay tuned for more tutorials on writing and submitting Flink jobs.


📚 Also Read: How to Set Up Apache Flink on WSL (Windows Subsystem for Linux)

📚 Also Read: Quiz on Introduction to Apache Flink

Monday, April 7, 2025

What is Apache Flink? Stream Processing, Features & Cluster Deployment

Introduction to Apache Flink

Apache Flink is a powerful open-source framework designed for real-time stream processing applications. Initially released in 2011, Flink has continuously evolved, with a stable version launched in July 2020.

It is a distributed processing engine optimized for handling stateful computations over both bounded and unbounded data streams. In the digital landscape, data is generated as a continuous flow of events, which can be categorized into these two types:

  • Bounded Streams: These have a defined start and end point. The entire dataset is ingested into the system before computation begins.
  • Unbounded Streams: These have a start but no defined end, meaning data is continuously received and processed in real-time as it is generated.

Common examples of streaming data include credit card transactions, server logs, website user interactions, IoT sensor data, and weather station observations.
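As a loose analogy in plain Java (not the Flink API), a bounded stream resembles a finite collection that can be consumed and aggregated completely, while an unbounded stream resembles an infinite generator of which only a window or prefix can ever be materialized:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamKinds {
    public static void main(String[] args) {
        // Bounded: a fixed dataset with a known end -- it can be aggregated in full
        List<Integer> bounded = List.of(3, 1, 4, 1, 5);
        int total = bounded.stream().mapToInt(Integer::intValue).sum();
        System.out.println(total); // 14

        // Unbounded: a conceptually infinite source -- only a finite prefix can be
        // processed at a time, mirroring continuous event-by-event processing
        List<Integer> window = Stream.iterate(0, n -> n + 1)
                .limit(5)
                .collect(Collectors.toList());
        System.out.println(window); // [0, 1, 2, 3, 4]
    }
}
```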

Flink is highly versatile, supporting both bounded and unbounded streams. It is designed to run on various cluster management systems such as Hadoop YARN, Apache Mesos, and Kubernetes. The framework is known for its in-memory computations and high scalability.

Key Features of Apache Flink for Real-Time Data Processing

  • Supports both real-time stream processing and batch data processing.
  • Enables stateful event-driven applications with advanced state management.
  • Deployable on cluster managers like Hadoop YARN, Mesos, Kubernetes, or standalone setups.
  • Capable of scaling to thousands of nodes and managing terabytes of application state.
  • Offers low-latency (fast event processing) and high-throughput (efficient data handling).

Apache Flink Deployment Modes: Session, Per-Job, and Application Mode Explained

The execution of Flink applications is determined by its deployment modes. These modes define resource allocation strategies and specify where the application's main() method is executed.

  • Session Mode: Uses an existing cluster to run applications. The main() method executes on the client side.
  • Per-Job Mode: Creates a new cluster for each job based on the available cluster manager, ensuring better resource allocation. The main() method runs on the client side.
  • Application Mode: Launches a dedicated cluster for each application. The main() method executes on the master node.

In the next post, we will explore the architecture of an Apache Flink cluster in detail.

Sunday, March 30, 2025

Apache Flink APIs and Libraries

DataSet API and DataStream API are the core APIs in Flink. The DataSet API is used for batch applications, whereas the DataStream API is used for streaming applications. (Note that in recent Flink releases the DataSet API is deprecated in favor of running batch workloads on the unified DataStream and Table APIs.)

These APIs offer common data transformation functionalities like joins, aggregations, windows, and state management. The DataSet API also provides additional primitives on bounded data sets, such as loops/iterations.

Some common transformation operations supported by DataStream & DataSet APIs:

  • map(): Takes one input element and produces one output element.
    Sample (Scala): dataStream1.map { x => x * 10 }
  • flatMap(): Takes one input element and produces zero, one, or more output elements.
    Sample (Scala): dataStream1.flatMap { str => str.split(" ") }
  • filter(): Evaluates a boolean function for each element and retains those for which the function returns true.
    Sample (Scala): dataStream1.filter { _ < 10 }
  • keyBy(): Returns a KeyedStream; logically partitions a stream into disjoint partitions based on the key.
    Sample (Scala): dataStream1.keyBy(_.keyColumn)
  • Aggregations (min, max, sum, minBy, maxBy): min() returns the minimum value, while minBy() returns the element that has the minimum value.
    Samples (Scala): keyedStream.min(0), keyedStream.minBy("keyColumn")
  • union(): Combines two or more data streams into a new stream containing all elements from all streams.
    Sample (Scala): dataStream1.union(stream1, stream2)
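The semantics of map, filter, and flatMap have direct analogues in standard java.util.stream code, which can help clarify what each transformation does. This is plain Java, not the Flink API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TransformationAnalogies {
    public static void main(String[] args) {
        List<Integer> nums = List.of(1, 12, 3);

        // map: one input element -> exactly one output element
        List<Integer> mapped = nums.stream()
                .map(x -> x * 10)
                .collect(Collectors.toList());
        System.out.println(mapped); // [10, 120, 30]

        // filter: keep only elements for which the predicate returns true
        List<Integer> small = nums.stream()
                .filter(x -> x < 10)
                .collect(Collectors.toList());
        System.out.println(small); // [1, 3]

        // flatMap: one input element -> zero, one, or more output elements
        List<String> words = List.of("hello flink", "apache").stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .collect(Collectors.toList());
        System.out.println(words); // [hello, flink, apache]
    }
}
```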

Check out this quiz on Apache Flink: Quiz on Introduction to Apache Flink

Quiz on Introduction To Apache Flink

1. Which of the following performs stream processing in micro-batches on top of its batch processing engine?
2. Which of the following layers is called the Kernel in Flink architecture?
3. Which of the following is considered a limitation of Spark?
4. Which of the following provide a unified model for Big Data processing? (Choose two)

Apache Flink Ecosystem

The ecosystem of Flink consists of various tools, services, APIs, and libraries that play a crucial role in analytical processes. It can be summarized as shown in the diagram below:



Storage / Streaming

Unlike Hadoop, which includes HDFS as its storage system, Flink does not have a built-in storage component. Instead, Flink programs implement transformations on distributed collections. These collections are created from external sources, and the results are written to sinks, which can store data in distributed files or other storage systems.

Deployment Layer

Flink programs can be deployed in multiple ways, depending on the execution environment. It supports:

  • Local Mode: Runs on a single machine within a single JVM.
  • Cluster Mode: Supports multi-node distributed execution across multiple JVM instances.
  • Cloud Mode: Can be deployed on cloud platforms like AWS (Amazon Web Services) and GCP (Google Cloud Platform).

Kernel / Core Layer

Often referred to as the Kernel, this core layer powers Flink’s runtime execution engine. It provides key functionalities, including:

  • Reliable and fault-tolerant distributed processing
  • Native iterative processing capabilities
  • High scalability and real-time data streaming

With this robust ecosystem, Apache Flink stands as a powerful framework for stream processing, ensuring efficiency and scalability. 

Apache Flink Architecture

Execution of streaming applications requires efficient allocation and management of resources. In this section, we explore Flink’s architecture and how its components interact to execute applications.

Flink operates with two core processes:

  • JobManager: The master process responsible for managing job execution.
  • TaskManager: The worker process executing tasks assigned by the JobManager.

The client is not part of the runtime; it prepares the dataflow and submits it to the JobManager. TaskManagers then connect to the JobManager to announce their availability and are assigned work accordingly.

Flink Execution Architecture


JobManager

The JobManager serves as the master process and has three key components:

  • ResourceManager:
    • Handles resource allocation and provisioning.
    • Manages task slots, which are the unit of resource scheduling in a Flink cluster.
  • Dispatcher:
    • Provides an interface for submitting Flink applications.
    • Initiates a new JobMaster for each submitted job.
    • Includes a Web UI for job monitoring.
  • JobMaster:
    • Supervises task execution within a single job.
    • Coordinates execution within the Flink application.
    • Ensures fault tolerance with High Availability (HA) support.

There is always at least one JobManager. To prevent a single point of failure, JobManager High Availability (HA) is supported.

TaskManager

TaskManagers are responsible for executing tasks and managing dataflow assignments.

  • Also referred to as worker nodes.
  • Each TaskManager is a JVM process and can run multiple subtasks.
  • At least one TaskManager is required.
  • Task Slots:
    • The smallest unit of resource scheduling in a TaskManager.
    • Determine the number of concurrent tasks a TaskManager can process.

Each task slot uses a fixed portion of a TaskManager’s resources, ensuring efficient parallel execution.

Next Steps

In the next section, we will explore the components of the Flink ecosystem in detail.