Skip to content

Motivation

Andrew Choi edited this page Dec 3, 2019 · 2 revisions

Kafka has become a standard messaging system for large scale, streaming data. In companies like LinkedIn it is used as backbone for various data pipelines and is relied on by a variety of services. This makes Kafka a critical component of a company’s infrastructure that should be extremely robust, i.e. bug-free and fault-tolerant.

Kafka has relied on unit tests and system tests in virtual machines to detect bugs before it is deployed in a real cluster. Yet we still see occasional bugs that go undetected until Kafka has been deployed in a real cluster for days or even weeks. These bugs have caused a lot of operational overhead or even service disruption -- SREs need to rollback Kafka to an earlier version and developers need to reproduce and investigate the bug. In fact, many of these bugs could have been detected earlier if we had run Kafka’s system tests for a long time with production traffic. In LinkedIn we have relied on developers to manually run a variety of Kafka admin commands and trigger system failure in order to validate Kafka’s operation before its release, which is inconvenient. Kafka Monitor is designed to provide a framework under which system tests can be created and continuously run in real cluster. This allows us to use it as a vehicle for release validation by letting it run against a test cluster over a prolonged duration of time.

It is important for users to be able to monitor the availability and performance of its service. We can monitor Kafka server’s operation by reading its JMX metrics or tracking CPU, memory, and network usage on the hosts. But currently there is no easy way to monitor Kafka from user’s perspective, e.g. end-to-end latency or service availability. Doing so requires modification of client application to do extra work, which may have undesirable performance overhead, and is usually inconvenient for existing users of Kafka. Kafka Monitor addresses this need by monitoring Kafka using an end-to-end pipeline without requiring any change to existing deployment. Users should be able to simply run Kafka Monitor against their existing deployment to obtain some very useful metrics, e.g. end-to-end latency, service availability and message loss rate.

Clone this wiki locally