Kafka: a Distributed Messaging System for Log Processing
Kafka: a Distributed Messaging System for Log Processing
ABSTRACT
Log processing has become a critical component of the data pipeline for consumer internet companies. We introduce Kafka, a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption. We made quite a few unconventional yet practical design choices in Kafka to make our system efficient and scalable. Our experimental results show that Kafka has superior performance when compared to two popular messaging systems. We have been using Kafka in production for some time and it is processing hundreds of gigabytes of new data each day.
One. Introduction
One. Introduction
There is a large amount of "log" data generated at any sizable internet company. This data typically includes (one) user activity events corresponding to logins, pageviews, clicks, "likes", sharing, comments, and search queries; (two) operational metrics such as service call stack, call latency, errors, and system metrics such as CPU, memory, network, or disk utilization on each machine. Log data has long been a component of analytics used to track user engagement, system utilization, and other metrics. However recent trends in internet applications have made activity data a part of the production data pipeline used directly in site features. These uses include (one) search relevance, (two) recommendations which may be driven by item popularity or co-occurrence in the activity stream, (three) ad targeting and reporting, and (four) security applications that protect against abusive behaviors such as spam or unauthorized data scraping, and (five) newsfeed features that aggregate user status updates or actions for their "friends" or "connections" to read.
This production, real-time usage of log data creates new challenges for data systems because its volume is orders of magnitude larger than the "real" data. For example, search, recommendations, and advertising often require computing granular click-through rates, which generate log records not only for every user click, but also for dozens of items on each page that are not clicked. Every day, China Mobile collects five to eight terabytes of phone call records and Facebook gathers almost six terabytes of various user activity events.
Many early systems for processing this kind of data relied on physically scraping log files off production servers for analysis. In recent years, several specialized distributed log aggregators have been built, including Facebook's Scribe, Yahoo's Data Highway, and Cloudera's Flume. Those systems are primarily designed for collecting and loading the log data into a data warehouse or Hadoop for offline consumption. At LinkedIn (a social network site), we found that in addition to traditional offline analytics, we needed to support most of the real-time applications mentioned above with delays of no more than a few seconds.
We have built a novel messaging system for log processing called Kafka that combines the benefits of traditional log aggregators and messaging systems. On the one hand, Kafka is distributed and scalable, and offers high throughput. On the other hand, Kafka provides an API similar to a messaging system and allows applications to consume log events in real time. Kafka has been open sourced and used successfully in production at LinkedIn for more than six months. It greatly simplifies our infrastructure, since we can exploit a single piece of software for both online and offline consumption of the log data of all types. The rest of the paper is organized as follows. We revisit traditional messaging systems and log aggregators in Section Two. In Section Three, we describe the architecture of Kafka and its key design principles. We describe our deployment of Kafka at LinkedIn in Section Four and the performance results of Kafka in Section Five. We discuss future work and conclude in Section Six.