Last modified: July 11, 2017
Kafka is a messaging system used for big data streaming and processing. It's fast, scalable, and redundant, which has made it a streaming platform of choice for companies like PayPal and Uber. This tutorial covers the basics of getting started with Kafka: the architecture behind it and how to publish and consume basic messages.
At its core, Kafka safely moves data from system A to system B. It runs on a shared cluster of servers, making it a highly available and fault-tolerant platform for data streaming.
Producers push data to Kafka topics. Consumers read data from topics. Topics are partitioned across multiple brokers or servers in the cluster. Each partition within a topic is replicated across the available brokers to prevent data loss. Consumers read from individual partitions, allowing multiple consumers to run in parallel. Below is a more detailed description of key Kafka components:
Cluster: The collective group of servers that Kafka runs on.
Topic: Kafka uses topics to categorize data. When you write to Kafka, you write to a particular topic. When you read from Kafka, you read from a particular topic.
Partition: Topics distribute their data over partitions. Each partition stores its records in order from oldest to newest, and the partitions collectively make up a topic.
Broker: Brokers are simply the different servers or nodes within the cluster. The partitions for a given topic are balanced across the available brokers in a cluster.
Producer: Producers write data to Kafka topics.
Consumer: Consumers read data from Kafka topics via the brokers.
Replica: Partitions are replicated across available brokers to avoid data loss and to keep partitions available when a broker fails. These replicated partitions are referred to as replicas.
Leader: After a partition is replicated, one replica is designated as the leader. The leader handles all reads and writes for that partition, and its followers copy from it.
Follower: Replicas that aren't designated leaders are followers. A follower takes over if its leader fails; otherwise it simply mirrors the leader.
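To make the topic and partition relationship concrete, here is a minimal in-memory sketch. It is not Kafka code: the `Partition`, `Topic`, and `make_topic` names are hypothetical, and the model assumes only what is described above, namely that a topic is a set of partitions and each partition keeps records ordered from oldest to newest. The position of a record within its partition is what Kafka calls an offset.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical stand-in for a partition: an append-only, ordered log."""
    records: list = field(default_factory=list)

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the record just written

    def read_from(self, offset):
        # Records come back oldest to newest, starting at the given offset.
        return self.records[offset:]


@dataclass
class Topic:
    """Hypothetical stand-in for a topic: a name plus its partitions."""
    name: str
    partitions: list


def make_topic(name, num_partitions):
    return Topic(name, [Partition() for _ in range(num_partitions)])


topic = make_topic("sample", 2)
first = topic.partitions[0].append("Hello Kafka!")
second = topic.partitions[0].append("second message")
print(first, second)                     # 0 1
print(topic.partitions[0].read_from(0))  # ['Hello Kafka!', 'second message']
print(topic.partitions[1].read_from(0))  # []
```

Ordering here is per partition, not across the whole topic: each partition keeps its own independent sequence of offsets.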
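When a producer writes messages to a Kafka topic, the messages are spread across the topic's partitions. By default, messages without a key are distributed in a round-robin fashion, while messages with a key are routed by a hash of the key, so the same key always lands on the same partition. For example, if 3 keyless messages arrive for a topic that has 3 partitions, one message is stored on each partition.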
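The balancing described above can be sketched in a few lines. This is an illustration under stated assumptions, not Kafka's partitioner: `SketchPartitioner` is a hypothetical class, keyless messages are rotated starting at partition 0, and `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner applies to keys.

```python
import itertools
import zlib


class SketchPartitioner:
    """Hypothetical partitioner: round-robin without a key, hash with a key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = itertools.count()

    def partition_for(self, key=None):
        if key is None:
            # No key: rotate through the partitions in turn.
            return next(self._counter) % self.num_partitions
        # Key present: a stable hash keeps equal keys on the same partition.
        # CRC32 is an assumption for illustration; Kafka itself uses murmur2.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = SketchPartitioner(num_partitions=3)

keyless = [partitioner.partition_for() for _ in range(3)]
print(keyless)  # [0, 1, 2]

same_key = {partitioner.partition_for("user-42") for _ in range(5)}
print(len(same_key))  # 1
```

Three keyless messages end up on three different partitions, while repeated sends with the same key always map to a single partition.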
Each of these partitions is replicated across the cluster. For every partition, one replica is designated as the leader and the remaining replicas are followers. Producers write each message to the leader of its target partition, which spreads the write load across the brokers. While a broker can house multiple leaders, Kafka works to distribute the leaders across the cluster to maximize efficiency.
If a broker goes offline, a follower replica on another broker becomes the new leader for each partition that broker was leading, preventing data loss.
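The following sketch walks through leader placement and failover in miniature. It rests on simplifying assumptions that are not Kafka's actual algorithm: replicas are placed on consecutive brokers, the first replica in each list is treated as the leader, and on failure the next surviving replica takes over. The function names and broker names are hypothetical. In a real cluster, a new leader is chosen from the replicas that are fully caught up.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Return {partition: [broker, ...]}; the first broker listed is the leader.

    Assumes replication_factor <= len(brokers).
    """
    placement = {}
    for p in range(num_partitions):
        placement[p] = [
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        ]
    return placement


def fail_broker(placement, dead):
    """Remove a dead broker from every replica list; the next replica leads."""
    return {
        p: [b for b in replicas if b != dead]
        for p, replicas in placement.items()
    }


def leaders(placement):
    # Raises IndexError if a partition has no surviving replica at all.
    return {p: replicas[0] for p, replicas in placement.items()}


brokers = ["broker-0", "broker-1", "broker-2"]
placement = place_replicas(num_partitions=3, brokers=brokers, replication_factor=2)
print(leaders(placement))
# {0: 'broker-0', 1: 'broker-1', 2: 'broker-2'}

after = fail_broker(placement, "broker-0")
print(leaders(after))
# {0: 'broker-1', 1: 'broker-1', 2: 'broker-2'}
```

With a replication factor of 2, partition 0 survives the loss of broker-0 because its follower on broker-1 takes over. With a replication factor of 1 there would be no follower to promote.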
When partitions are distributed across different servers, work on a topic can be parallelized. Consumers can be parallelized too, since multiple consumers can read from multiple partitions at once. Within a consumer group, each partition is assigned to exactly one consumer, so the group as a whole consumes the entire topic without reading any message twice.
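As a rough illustration of how a consumer group divides a topic, the sketch below deals partitions out to consumers in turn. `assign_partitions` is a hypothetical helper using a simplified round-robin scheme. It is not one of Kafka's built-in assignors, which also handle multiple topics and rebalancing.

```python
def assign_partitions(partitions, consumers):
    """Deal partitions out to the consumers of one group, one at a time."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = [0, 1, 2, 3]

two_consumers = assign_partitions(partitions, ["consumer-a", "consumer-b"])
print(two_consumers)
# {'consumer-a': [0, 2], 'consumer-b': [1, 3]}

five_consumers = assign_partitions(partitions, ["a", "b", "c", "d", "e"])
print(five_consumers)
# {'a': [0], 'b': [1], 'c': [2], 'd': [3], 'e': []}
```

Every partition has exactly one owner within the group, and a group with more consumers than partitions leaves the extra consumers idle. This is why the partition count caps the useful parallelism of a single consumer group.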
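This tutorial assumes you already have Java (a JRE) installed. Download the Kafka 0.11.0.0 release archive (kafka_2.11-0.11.0.0.tgz, built for Scala 2.11) from the Apache Kafka downloads page, then extract it and change into the new directory:
tar -xzf kafka_2.11-0.11.0.0.tgz
cd kafka_2.11-0.11.0.0
The archive bundles everything you need to get started with Kafka, including Apache ZooKeeper.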
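Next, you need to start ZooKeeper. ZooKeeper is a centralized service for maintaining configuration information, and Kafka depends on it for cluster coordination. From the Kafka directory, run the bundled start script with its default configuration:
bin/zookeeper-server-start.sh config/zookeeper.properties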
This will start a local ZooKeeper server instance.
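With the ZooKeeper server still running, start a Kafka broker in a separate terminal using the bundled default configuration:
bin/kafka-server-start.sh config/server.properties
Then, with both ZooKeeper and the broker running, create a topic from another terminal: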
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic sample
This will create a new topic named sample. Notice how you can pass in arguments for number of partitions, replication-factor, etc.
Kafka has a producer script which takes standard input and sends messages. Every line is a new message. Run the following:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic sample
This starts a new producer from the command line. Notice how the sample topic is specified as an argument. Type Hello Kafka! and press Enter to send the message.
Now that a producer is sending messages, it's time to consume the data. Run the following to start a Kafka consumer from the command line:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sample --from-beginning
This starts a Kafka consumer and prints the producer's message to the console. If you go back to the producer terminal and type more messages, you will see them printed in the consumer terminal.
Kafka leverages a shared cluster of servers to distribute incoming messages across multiple server nodes. This allows Kafka to store records in a fault-tolerant manner while also maximizing throughput. By following this exercise, you've successfully produced and consumed messages using Kafka!