What is Apache Kafka?
Apache Kafka is an open-source stream-processing platform written in Scala and Java. It was originally developed at LinkedIn and later donated to the Apache Software Foundation. The Kafka architecture is made up of topics, producers, consumers, consumer groups, clusters, brokers, partitions, replicas, leaders, and followers. A Kafka cluster consists of one or more brokers. Producers are processes that push records into Kafka topics on the brokers. A consumer pulls records off a Kafka topic. Topics are divided into partitions, and these partitions are replicated across brokers. Each partition has one leader replica and zero or more follower replicas.
ZooKeeper manages the brokers in the cluster. A cluster can run several ZooKeeper nodes at a time, for example, three to five.
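To make the relationship between partitions, leaders, and followers concrete, here is a minimal sketch in Python. It spreads replicas over brokers with a simple round-robin rule. This rule is an illustrative assumption, not Kafka's actual placement algorithm, and the names `PartitionAssignment` and `assign_replicas` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionAssignment:
    topic: str
    partition: int
    leader: int            # broker id hosting the leader replica
    followers: List[int]   # broker ids hosting the follower replicas


def assign_replicas(topic: str, num_partitions: int, broker_ids: List[int],
                    replication_factor: int) -> List[PartitionAssignment]:
    """Round-robin placement; illustrative only, not Kafka's real algorithm."""
    if not 1 <= replication_factor <= len(broker_ids):
        raise ValueError("replication factor must be between 1 and the broker count")
    assignments = []
    for p in range(num_partitions):
        # Take `replication_factor` consecutive brokers from a rotating start.
        replicas = [broker_ids[(p + i) % len(broker_ids)]
                    for i in range(replication_factor)]
        # The first replica acts as leader; the rest (possibly none) follow.
        assignments.append(PartitionAssignment(topic, p, replicas[0], replicas[1:]))
    return assignments


if __name__ == "__main__":
    for a in assign_replicas("orders", num_partitions=3, broker_ids=[1, 2, 3],
                             replication_factor=2):
        print(a)
```

With a replication factor of 1, every partition has a leader and no followers, which matches the "zero or more follower replicas" case.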
Apache Kafka Core APIs
Apache Kafka has five core APIs:
- The producer API allows applications to send streams of data to topics in the Kafka cluster.
- The consumer API allows applications to read streams of data from topics in the Kafka cluster.
- The streams API allows applications to transform streams of data from input topics to output topics.
- The connect API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.
- The AdminClient API allows managing and inspecting topics, brokers, and other Kafka objects.
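The roles of the producer, consumer, and streams APIs can be illustrated with a small in-memory toy model. It is not the Kafka client API: `ToyCluster` and its methods are hypothetical names, each topic is a single append-only list with no partitions or network, and the sketch only shows how the three roles relate.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class ToyCluster:
    """In-memory stand-in for a cluster; each topic is one append-only list."""

    def __init__(self) -> None:
        self.topics: Dict[str, List[str]] = defaultdict(list)

    def produce(self, topic: str, record: str) -> int:
        """Producer role: append a record and return its offset."""
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1

    def consume(self, topic: str, offset: int) -> List[str]:
        """Consumer role: pull every record at or after `offset`."""
        return self.topics[topic][offset:]

    def transform(self, source: str, sink: str, fn: Callable[[str], str]) -> None:
        """Streams role: map records from an input topic to an output topic."""
        for record in self.consume(source, 0):
            self.produce(sink, fn(record))


if __name__ == "__main__":
    cluster = ToyCluster()
    cluster.produce("clicks", "home")
    cluster.produce("clicks", "cart")
    cluster.transform("clicks", "clicks-upper", str.upper)
    print(cluster.consume("clicks-upper", 0))  # prints ['HOME', 'CART']
```

A real application would use a Kafka client library against a running cluster; the toy model leaves out delivery guarantees, consumer groups, and persistence.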
Key Benefits of Apache Kafka
- Performance: Works with a huge volume of real-time data streams. Handles high throughput for both publishing and subscribing.
- Scalability: Scales as a distributed system with no downtime along four dimensions: producers, processors, consumers, and connectors.
- Fault Tolerance: Handles failures of master nodes and databases with zero downtime and zero data loss.
- Data Transformation: Offers provisions for deriving new data streams using the data streams from producers.
- Durability: Uses a distributed commit log so that messages persist on disk.
- Replication: Replicates messages across the cluster to support multiple subscribers.
Challenges Operating Apache Kafka
- Difficult to set up, configure, and operate
- Tricky to scale
- Difficult to integrate with AWS services
- Hard to achieve high availability
- No console and no visible metrics
- Requires operational experience
Amazon Managed Streaming for Apache Kafka
Amazon Managed Streaming for Apache Kafka (MSK) has the following components:
- Broker Nodes: When creating a cluster, the user specifies how many broker nodes MSK creates in each Availability Zone (AZ); each broker runs in a VPC subnet.
- ZooKeeper Nodes: MSK also creates the Apache ZooKeeper nodes for distributed coordination.
- Producers, Consumers, and Topic Creators: Use Apache Kafka data-plane operations to create topics and to produce and consume data.
- Cluster Operations: Use the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the APIs in the SDK to perform control-plane operations.
Key Benefits of AWS MSK
Key benefits of AWS Managed Streaming for Apache Kafka (MSK):
- Fully Managed: Create a fully managed Apache Kafka cluster with default settings or with the user's own custom configuration. MSK automatically provisions, configures, and manages the operations of the user's Apache Kafka cluster and Apache ZooKeeper nodes.
- Highly Available: Automatic recovery, automatic patching, and data replication.
- Highly Secure: Runs in an AWS VPC; encrypts data at rest using AWS KMS, with an AWS-managed customer master key (CMK) by default or the user's own CMK; encrypts data in transit via TLS between brokers and between clients and brokers; and supports SASL/SCRAM authentication secured by AWS Secrets Manager, as well as ACLs.
- Scalable: Broker and storage scaling.
- Integration: AWS KMS, AWS Certificate Manager, AWS VPC, AWS IAM, and AWS Glue Schema Registry.
- Elastic Stream Processing: Apache Flink is a powerful, open-source stream processing framework that is useful for stateful computations of streaming data. The user can run fully managed Apache Flink applications written in SQL, Java, or Scala that elastically scale to process data streams within Amazon MSK.
- Fully compatible: Amazon MSK runs and manages Apache Kafka for users. MSK makes it easy for users to migrate and run their existing Apache Kafka applications on AWS without changing the application code. Using Amazon MSK, the user can maintain open-source compatibility and continue to use familiar custom and community-built tools such as MirrorMaker, Apache Flink, and Prometheus.
- Fully Managed: AWS MSK lets its users focus on creating their streaming applications without worrying about the operational overhead of managing the Apache Kafka environment. Amazon MSK also manages the provisioning, configuration, and maintenance of Apache Kafka clusters and Apache ZooKeeper nodes for users. Amazon MSK shows key Apache Kafka performance metrics in the AWS console.
- Highly available: Amazon MSK creates an Apache Kafka cluster and offers multi-AZ replication within an AWS Region. Amazon MSK continuously monitors cluster health, and if a component fails, Amazon MSK will automatically replace it.
- Highly secure: Amazon MSK provides multiple levels of security for users' Apache Kafka clusters, including VPC network isolation, AWS IAM for control-plane API authorization, encryption at rest, TLS encryption in transit, TLS-based certificate authentication, and SASL/SCRAM authentication secured by AWS Secrets Manager. It also supports Apache Kafka Access Control Lists (ACLs) for data-plane authorization.
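As an illustration of the SASL/SCRAM over TLS option mentioned above, the sketch below renders a Kafka client properties file. The property keys are standard Kafka client settings. The choice of `SCRAM-SHA-512` as the mechanism is an assumption, the helper name `scram_client_properties` is hypothetical, and the credentials are placeholders that would in practice come from AWS Secrets Manager.

```python
def scram_client_properties(username: str, password: str) -> str:
    """Render Kafka client properties for SASL/SCRAM authentication over TLS."""
    jaas = (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )
    lines = [
        "security.protocol=SASL_SSL",    # TLS encryption plus SASL authentication
        "sasl.mechanism=SCRAM-SHA-512",  # assumed mechanism; adjust to the cluster
        f"sasl.jaas.config={jaas}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(scram_client_properties("example-user", "example-password"))
```

The resulting text can be saved as a client properties file and passed to Kafka command-line tools or a Java client; avoid hard-coding real credentials.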
AWS MSK Deployment with Kubernetes
Workloads that use AWS MSK can be deployed and scaled in any Kubernetes environment, such as AWS EKS, or on users' existing Kafka Connect cluster.
Pros and Cons of AWS MSK
Pros:
- AWS MSK provides easy development and deployment.
- It is suitable for quickly deploying event-based architectures with low-to-medium traffic.
- It is battle-tested with AWS Lambda.
- There are no resources to manage, so the user can focus entirely on the logic.
- Pay-as-you-go cost, which depends on the Lambda invocation cost.
- Logging through Amazon CloudWatch.
- It benefits from the popularity of the Serverless Framework and its supporting plugins.
- It supports simple Python code.
Cons:
- Hard (if not impossible) to test locally, since AWS MSK is deployed in a secured VPC.
- Only two starting positions are available for the consumer group: the beginning of the topic (TRIM_HORIZON) and LATEST.
- Not suitable for high-traffic topics.
- Deployment and removal sometimes take quite a long time.
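Since the points above revolve around consuming MSK topics from AWS Lambda with simple Python code, here is a minimal handler sketch. It assumes the Lambda event groups records under `"topic-partition"` keys inside a `records` field and that each record's `value` is base64-encoded. The sample event at the bottom is hypothetical and exists only so the decoding logic can be exercised without a cluster.

```python
import base64
from typing import Any, Dict, List


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Decode the value of every Kafka record in an MSK-triggered invocation."""
    decoded = []
    # Records are assumed to arrive grouped under "topic-partition" keys.
    for records in event.get("records", {}).values():
        for record in records:
            # Record values are assumed to be base64-encoded in the payload.
            decoded.append(base64.b64decode(record["value"]).decode("utf-8"))
    return decoded


if __name__ == "__main__":
    # Hypothetical sample event for a local dry run.
    sample = {
        "eventSource": "aws:kafka",
        "records": {
            "orders-0": [
                {"topic": "orders", "partition": 0, "offset": 15,
                 "value": base64.b64encode(b"hello").decode("ascii")}
            ]
        },
    }
    print(handler(sample, None))  # prints ['hello']
```

Feeding the handler a hand-built event like this does not replace testing against a real cluster, but it lets the business logic be checked locally.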
Pricing Model Comparison
- Running Apache Kafka directly on a public cloud provider's compute instances is not a recommended approach.
- It requires a lot of human effort for installation, setup (which can take weeks), configuration, and cluster management.
- Apache Kafka itself is open source and free of charge; fees apply only to commercial offerings such as Confluent, which is subscription-based.
- If a user needs a Kubernetes service like EKS, they pay for the nodes and for the service itself (Kubernetes masters).
AWS MSK Pricing
- Pay for the time broker instances run, the storage they use each month, and standard data transfer fees for data in and out of users' clusters. MSK offers on-demand, hourly pricing for brokers and storage, prorated to the second:
- kafka.m5.large: $0.21/hour
- Storage: $0.10 per GB-month
- There is no charge for the number of topics, replication traffic, or ZooKeeper.
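The rates listed above can be turned into a rough monthly estimate. The sketch below is only an approximation: it assumes 730 hours in an average month, treats `storage_gb` as the total storage provisioned across all brokers, and ignores data transfer fees. The function name `monthly_cost` is hypothetical.

```python
BROKER_HOURLY_USD = 0.21      # kafka.m5.large rate quoted above
STORAGE_GB_MONTH_USD = 0.10   # storage rate quoted above, per GB-month
HOURS_PER_MONTH = 730         # assumption: average number of hours in a month


def monthly_cost(brokers: int, storage_gb: float,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Estimate monthly broker plus storage cost in USD, excluding data transfer."""
    broker_cost = brokers * BROKER_HOURLY_USD * hours
    storage_cost = storage_gb * STORAGE_GB_MONTH_USD
    return round(broker_cost + storage_cost, 2)


if __name__ == "__main__":
    # Three brokers running all month with 1000 GB of storage in total.
    print(monthly_cost(brokers=3, storage_gb=1000))  # prints 559.9
```

For that example, the brokers account for 3 × $0.21 × 730 = $459.90 and storage for 1000 × $0.10 = $100.00.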
Author: SVCIT Editorial
Copyright Silicon Valley Cloud IT, LLC.