What Is Apache Kafka®?

What are message brokers and what are they used for?

LinkedIn developed Kafka as a message broker and released it as open source in 2011. Since then, it has grown into a fault-tolerant, open-source distributed streaming platform that lets you store, process, and deliver huge amounts of data in real time.

Apache Kafka®

Distributed systems generally consist of many disparate services: some generate events (metrics, logs, monitoring events, service events), while others collect that data. Kafka® is a horizontally scalable hybrid of a distributed database and a message broker. It collects application data, writes it to distributed storage, groups it by topic, and hands it out to application components on a subscription basis. Messages are also replicated across different broker nodes to ensure high availability and fault tolerance.

Topics are a way to group message streams in storage by category. Services publish messages of a certain category to a topic, and consumers subscribe to the topic to read them. For each topic, Apache Kafka® maintains a message log that can be split into partitions. Each partition holds a subset of the topic’s messages, sequenced in the order they were received.
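To make the idea of a topic log split into sections (Kafka® calls them partitions) more concrete, here is a minimal in-memory sketch. It is a toy model, not a Kafka® client: the `Topic` class, the `publish` method, and the CRC32-based key routing are assumptions made for illustration, and real clients use their own partitioning logic.

```python
import zlib


class Topic:
    """Toy in-memory model of a topic split into partitions (not a Kafka client)."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        # Each partition is its own ordered list of (key, value) records.
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Assumption for this sketch: route by CRC32 of the key, so records
        # with the same key always land in the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset


clicks = Topic("clicks", num_partitions=3)
first = clicks.publish("user-1", "open_page")
second = clicks.publish("user-1", "add_to_cart")
# Both records share a partition, and the second offset follows the first.
```

Routing by key keeps all records for one key in a single partition, which is why ordering can be relied on per key within a partition rather than across the whole topic.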

Messages are stored in a log, a persistent, ordered data structure. Log entries can only be appended, never changed or deleted in place, and consumers read them from the oldest entry to the newest so that everything arrives in the correct order.
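The append-only behavior described above can be sketched in a few lines. This is an illustrative in-memory structure under simplifying assumptions (a single list, no persistence); the `AppendOnlyLog` name and its methods are hypothetical and not part of any Kafka® API.

```python
class AppendOnlyLog:
    """Toy append-only log: records get sequential offsets and are never modified."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The only write operation: add to the end and return the new offset.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset=0, max_records=None):
        # Read forward from an offset, oldest to newest.
        end = None if max_records is None else offset + max_records
        return list(enumerate(self._records[offset:end], start=offset))


log = AppendOnlyLog()
for event in ("login", "click", "logout"):
    log.append(event)

print(log.read(offset=1))  # [(1, 'click'), (2, 'logout')]
```

Because there is no update or delete method, a reader that remembers its last offset can always resume from exactly where it stopped.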

Apache Kafka® isn’t a pure DBMS, even though it can offer atomicity, consistency, isolation, and durability guarantees for stored data, along with selective access using KSQL (now ksqlDB), an SQL engine based on the Kafka® Streams API. In practice, the platform is used as a commit log and an integration hub for a variety of external databases and storage systems.

Kafka® vs RabbitMQ

Kafka® is often compared to another popular message broker and queue management system: RabbitMQ. Both are used to exchange information between applications, follow a publisher/subscriber model, and provide message replication. But their message delivery models are fundamentally different: Kafka® is pull-based (consumers fetch messages from the topic themselves), while RabbitMQ is push-based (the broker sends messages to consumers).
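The difference between the two delivery models can be illustrated with a deliberately simplified sketch. Neither class below is a real Kafka® or RabbitMQ API; `PullTopic` and `PushQueue` are hypothetical names that only show who initiates delivery.

```python
class PullTopic:
    """Pull style: the topic only stores messages; each consumer asks for more."""

    def __init__(self):
        self.messages = []

    def publish(self, message):
        self.messages.append(message)

    def poll(self, offset, max_messages=10):
        # The consumer chooses where to read from and how much to take.
        return self.messages[offset:offset + max_messages]


class PushQueue:
    """Push style: the broker calls every subscriber when a message arrives."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


topic = PullTopic()
for message in ("a", "b", "c"):
    topic.publish(message)

offset = 0
batch = topic.poll(offset, max_messages=2)  # consumer controls the pace
offset += len(batch)

received = []
queue = PushQueue()
queue.subscribe(received.append)
queue.publish("a")  # delivered to the subscriber immediately
```

In the pull model the consumer sets its own pace and tracks its own position, while in the push model the broker decides when delivery happens.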

RabbitMQ also deletes messages once they have been delivered and acknowledged. Kafka® keeps them until its retention policy (based on age or size) triggers the next log cleanup. That means Apache Kafka® preserves the current system state along with previous states for the length of the retention period, making it a reliable source of historical data. Since multiple consumers can read the same data independently, the pattern works particularly well for event-driven systems.
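A rough sketch of how retention and independent readers fit together is shown below. It is a toy model under stated assumptions: timestamps are passed in explicitly, cleanup works record by record, and the `RetainedLog` class is hypothetical. Real Kafka® retention is configured per broker or topic and operates on log segments rather than single records.

```python
class RetainedLog:
    """Toy log with time-based retention; reading never removes records."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0  # offset of the oldest record still retained
        self.records = []     # list of (timestamp, value)

    def append(self, value, now):
        self.records.append((now, value))

    def cleanup(self, now):
        # Drop records older than the retention window; the offsets of the
        # surviving records stay the same.
        while self.records and now - self.records[0][0] > self.retention_seconds:
            self.records.pop(0)
            self.base_offset += 1

    def read(self, offset):
        # Offsets below base_offset are gone, so start at the oldest survivor.
        start = max(offset, self.base_offset) - self.base_offset
        return [value for _, value in self.records[start:]]


log = RetainedLog(retention_seconds=60)
log.append("a", now=0)
log.append("b", now=30)
log.append("c", now=90)

# Two consumers read the same data from their own positions.
analytics_view = log.read(0)
billing_view = log.read(2)

log.cleanup(now=100)          # "a" and "b" are older than 60 seconds
after_cleanup = log.read(0)
```

Reading leaves the log untouched, so any number of consumers can replay the same records until cleanup removes them.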

RabbitMQ features flexible message queue management (routing, delivery patterns, receipt monitoring), though that can mean lagging performance under heavier loads. Apache Kafka® is the best choice for collecting and aggregating events from a variety of sources, metrics, and logs, while RabbitMQ works for fast messaging between multiple services.

How Apache Kafka® Is Used

Kafka®’s main functions are centralized collection, processing, secure storage, and transmission of large numbers of messages from separate services. As a distributed, horizontally scalable platform, it’s usually used for large amounts of unstructured data:

  • Large-scale IoT/IIoT systems with a multitude of sensors, controllers, and other end devices.

  • Analytics systems. For example, IBM and DataSift use Kafka® as a collector for monitoring events and as a real-time tracker of user data stream consumption.

  • Financial systems. Bank of America, Nordea, Robinhood, and ING Bank all make use of it.

  • Social media. Kafka® is part of Twitter’s stream-processing infrastructure, and LinkedIn uses it to stream activity data and operational statistics for its apps.

  • Geo-positioning systems. Foursquare uses it to transmit messages between online and offline systems and integrate monitoring tools into its big data infrastructure built on Hadoop.

  • Telecom operators. Verizon, T-Mobile, Deutsche Telekom, and more.

  • Online games. For instance, Demonware, a division of Activision Blizzard, processes user logs with it.

The simplest use case for Apache Kafka® is collecting client session logs in streaming mode, or logs from files on physical servers, and then loading them into a store such as the Apache Hadoop HDFS file system or ClickHouse®. The service also lets you build a data pipeline that extracts business-critical information from raw data using machine learning algorithms.
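The log-collection flow described above can be sketched end to end without any external services. Everything here is an assumption made for illustration: the log line format, the function names, and the plain Python list standing in for a topic. A real pipeline would publish to a Kafka® topic and write batches to a store such as HDFS or ClickHouse®.

```python
import io
import json


def parse_session_log(line):
    """Parse a hypothetical '<timestamp> <session_id> <action>' log line."""
    timestamp, session_id, action = line.split(maxsplit=2)
    return {"ts": timestamp, "session": session_id, "action": action}


def collect(lines, topic):
    """Producer side: turn raw log lines into JSON messages on the topic."""
    for line in lines:
        line = line.strip()
        if line:
            topic.append(json.dumps(parse_session_log(line)))


def sink_batches(topic, batch_size):
    """Consumer side: yield decoded messages in batches for bulk loading."""
    for start in range(0, len(topic), batch_size):
        yield [json.loads(message) for message in topic[start:start + batch_size]]


log_file = io.StringIO(
    "2024-01-01T10:00:00Z s1 login\n"
    "2024-01-01T10:00:05Z s1 view_cart\n"
    "2024-01-01T10:00:09Z s2 login\n"
)

topic = []  # stand-in for a topic
collect(log_file, topic)
batches = list(sink_batches(topic, batch_size=2))
```

Batching on the consumer side reflects a common design choice: analytical stores generally handle a few large inserts better than many single-row writes.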

What’s Next For Apache Kafka®?

As systems distributed throughout the cloud become more common, managing those scattered assets will become harder and harder.

The DoubleCloud managed service helps deploy and maintain Apache Kafka® clusters in a variety of cloud infrastructures and regions. You get the benefits of a distributed streaming platform without buying or configuring hardware, handling maintenance, or worrying about updates.

DoubleCloud’s Managed Service also makes your work far more secure, even when your cluster hosts are spread across different availability zones.


  1. Apache® and Apache Kafka® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.