Like Facebook's Scribe, Kafka can be used to process large volumes of streaming data. It can handle all kinds of activity stream data and processing on a consumer-scale website. This activity, which includes page views, searches, and other user actions, is a key ingredient of any social website.
Here are some key features of Apache Kafka:
· Persistent messaging with O(1) disk structures that provide constant-time performance even with many terabytes of stored messages.
· High throughput: even on very modest hardware, Kafka can support hundreds of thousands of messages per second.
· Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
· Support for parallel data load into Hadoop.
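The per-partition ordering guarantee mentioned above rests on a simple idea: a message's key deterministically selects its partition, so all messages sharing a key land in one partition and are consumed in the order they were produced. The sketch below is a toy illustration of that routing idea, not Kafka's actual partitioner code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of key-based partition routing (not Kafka's real implementation).
public class PartitionDemo {

    // Route a message key to one of numPartitions partitions.
    // The same key always maps to the same partition, which is what
    // keeps all messages for that key in a single ordered sequence.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }

        // Four events from one hypothetical user, produced in order.
        String[] events = {"login", "search", "click", "logout"};
        for (String event : events) {
            int p = partitionFor("user-42", numPartitions);
            partitions.get(p).add(event);
        }

        // Because the key is fixed, every event landed in the same
        // partition, preserving the original order.
        System.out.println(partitions.get(partitionFor("user-42", numPartitions)));
    }
}
```

A consumer cluster can then divide partitions among its machines: each partition is read by one consumer at a time, so ordering holds within a partition even though consumption is parallel across partitions.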
What's New in This Release:
· Avoid duplicated messages during consumer rebalance
· Support configurable send / receive socket buffer size in server
· javaapi ZookeeperConsumerConnectorTest duplicates many tests in the Scala version
· Make the number of consumer rebalance retries configurable
· SyncProducer should log host and port if it can't connect
· Reduce duplicate messages served by the kafka consumer for uncompressed topics
· Avoid logging stack traces directly
· Make backoff time during consumer rebalance configurable
· Improve log4j appender to use kafka.producer.Producer, and support zk.connect|broker.list options
· Separate out Kafka mirroring into a stand-alone app
· Hadoop producer should use a software load balancer