Let’s start with the basics of communication:
We have the source of the communication (a producer), the communication itself (a message) and someone to receive and process the communication (a consumer):
This setup is functional and performs the task of communication. Excellent. What if we scale up and increase the number of messages to be communicated? Well, the producer would then create the messages and send each message one after the other to the consumer. The consumer on its end would receive and process each message.
One important note at this point: we cannot assume that the order in which the producer sends the messages is the same order in which the consumer receives them (look up asynchronous processes). To counter this, we could number each message before sending it, and the consumer could read the messages in the order they arrive but process them in order of the numbering.
RULE: PRODUCER CREATES MESSAGES
RULE: CONSUMER CONSUMES MESSAGES
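The numbering idea described above can be sketched as a small reorder buffer on the consumer side. This is only an illustration under stated assumptions: the function name `process_in_order` is hypothetical, and the producer is assumed to number its messages 0, 1, 2, ... without gaps.

```python
from typing import Dict, Iterable, Iterator, Tuple


def process_in_order(arrivals: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Yield payloads in sequence order, whatever order they arrive in.

    Assumes the producer numbers messages 0, 1, 2, ... with no gaps.
    """
    pending: Dict[int, str] = {}  # messages that arrived too early
    next_seq = 0                  # the sequence number we are waiting for
    for seq, payload in arrivals:
        pending[seq] = payload
        # Release every message that is now contiguous with what was processed.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


# Messages arrive out of order but are processed in the producer's order.
arrived = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
ordered = list(process_in_order(arrived))
print(ordered)  # ['a', 'b', 'c', 'd']
```

The consumer reads each message as it arrives but holds back anything that is ahead of the next expected number.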
Let’s continue to scale up and add more producers. Here, we run into the issue that the consumer might not know which producer made which message and how to process it, so we tag each message with a number (as mentioned above) as well as a source address.
RULE: EACH MESSAGE CONTAINS A MESSAGE ORDER BASED ON PRODUCER
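As a rough sketch of this rule, a message could carry its source address and its position in that producer's stream. The `Message` fields and the `group_by_source` helper below are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    source: str   # hypothetical producer address
    seq: int      # position in that producer's stream
    payload: str


def group_by_source(messages: List[Message]) -> Dict[str, List[str]]:
    """Split a mixed stream per producer and restore each producer's order."""
    streams: Dict[str, List[Message]] = defaultdict(list)
    for msg in messages:
        streams[msg.source].append(msg)
    return {
        source: [m.payload for m in sorted(msgs, key=lambda m: m.seq)]
        for source, msgs in streams.items()
    }


mixed = [
    Message("producer-a", 1, "a1"),
    Message("producer-b", 0, "b0"),
    Message("producer-a", 0, "a0"),
    Message("producer-b", 1, "b1"),
]
per_producer = group_by_source(mixed)
print(per_producer)
```

Note that the order is only meaningful per producer: nothing here says whether `a0` happened before or after `b0`.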
Scaling continues. In this case the consumer starts to slow down since, no matter how big it is, it can take in only a fixed amount of input. Further, if anything happens to this single consumer, all the producers have nothing to talk to. To counter this, let’s introduce multiple consumers, but this introduces an important change of perspective: do all consumers need all the information? Or do particular consumers want specialized message sets (called topics) meant just for them? Logically, the idea of using specialized message sets makes more sense. (After all, if we wanted the same messages sent to all consumers, we could have all consumers read the same specialized message set containing all the messages.) Let’s put a pin in this idea of topics for a moment, as we’ll return to it later.
RULE: CONSUMERS SHOULD INTERACT WITH A SPECIFIED SET OF DATA (TOPIC) RATHER THAN ALL DATA
The question then becomes how to distribute the data. Well, we can dedicate a consumer to a producer (like in Fig 2), which would technically work. Practically, however, we face the following issues:
- If a high volume is produced some of the time and nothing the rest of the time, the consumer will waste a lot of time.
- A situation could arise where some consumers do a lot and others do very little. For example, a producer could create so many messages that its consumer cannot keep up with processing them, and we’re back to the original issue of slow consumers.
Let’s try one producer to many consumers.
In this case, the producer needs to keep track of its consumers, and maintaining order means sending every message to all consumers, which defeats the whole purpose. Also, failures mean a message reaches some consumers and not others, and the producer needs to keep track of these issues, which is too much computation on the producer side. So, this idea doesn’t quite work.
In order to maintain load balancing, let’s use a middle man (a broker) that distributes each message to a particular consumer. This broker is independent of the code on both sides and sits as a go-between. The broker decides where each message goes. For example, all messages from one producer might go to a single consumer, but if that consumer isn’t getting a lot of data, more messages could be sent to it. Also, if multiple consumers need to read from the same producer, the broker ensures that they all get the same information in the same order.
RULE: WE NEED A MESSAGE DISTRIBUTOR (BROKER) BETWEEN THE CONSUMERS AND PRODUCERS THAT IS FUNCTIONALLY INDEPENDENT FROM THE CONSUMERS AND PRODUCERS BUT OFFERS THEM AN API TO INTERACT WITH.
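To make the go-between idea concrete, here is a minimal in-memory sketch. The `PushBroker` class and its "shortest backlog wins" policy are assumptions made for illustration, not how any particular broker actually balances load.

```python
from typing import Dict, List


class PushBroker:
    """Hypothetical go-between that hands each message to the least busy consumer."""

    def __init__(self, consumer_names: List[str]) -> None:
        # One inbox (backlog) per consumer.
        self.inboxes: Dict[str, List[str]] = {name: [] for name in consumer_names}

    def send(self, message: str) -> str:
        """Route a message and return the name of the consumer that got it."""
        # Pick the consumer with the shortest backlog; ties go to the first name.
        target = min(self.inboxes, key=lambda name: len(self.inboxes[name]))
        self.inboxes[target].append(message)
        return target


broker = PushBroker(["consumer-1", "consumer-2"])
routed = [broker.send(f"m{i}") for i in range(4)]
print(routed)  # ['consumer-1', 'consumer-2', 'consumer-1', 'consumer-2']
```

The producer only ever calls `send`, so it needs no knowledge of how many consumers exist or how busy they are.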
The setup we’re reaching indicates the following: there needs to be a middle-man distributor that is independent of the producer and consumer. On this note, let’s return to the idea of topics we mentioned earlier and our change of perspective. Say that we reach a situation in which we need to add a new consumer. If we want the consumer to read all the information, we’ll need to modify our broker to include this particular consumer.
However, if we want our broker to be independent of our consumer and producer, it makes more sense for consumers to reach out to the broker (rather than the broker reaching out to the consumers). So, we set up a way for our consumer to subscribe to our brokers. Actually, let’s include the idea of the consumer needing a specific set of the data (the topic), not the entirety of all data. So, our consumer will subscribe to a topic on a broker. Further, it makes sense that our producers would not want to publish to an entire broker but rather only to the topics they need to. However, for the sake of straightforwardness, let’s impose the following rules:
RULE: A CONSUMER NEEDS TO SUBSCRIBE AND LISTEN TO A TOPIC. A TOPIC IS NOT RESPONSIBLE FOR PUBLISHING TO A CONSUMER.
RULE: A PRODUCER CAN PUBLISH ONLY TO A SINGLE TOPIC.
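The two rules above can be sketched with a broker that only stores messages per topic and a consumer that asks for them. `TopicBroker`, `Consumer`, and the `offset` bookkeeping are hypothetical names for this illustration; the point is that the consumer pulls and the topic never pushes.

```python
from collections import defaultdict
from typing import Dict, List


class TopicBroker:
    """Hypothetical broker that stores messages per topic and never pushes."""

    def __init__(self) -> None:
        self.topics: Dict[str, List[str]] = defaultdict(list)

    def publish(self, topic: str, message: str) -> None:
        self.topics[topic].append(message)

    def read(self, topic: str, offset: int) -> List[str]:
        """Return every message in the topic from `offset` onwards."""
        return self.topics[topic][offset:]


class Consumer:
    """Subscribes to one topic and remembers how far it has read."""

    def __init__(self, broker: TopicBroker, topic: str) -> None:
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self) -> List[str]:
        batch = self.broker.read(self.topic, self.offset)
        self.offset += len(batch)
        return batch


broker = TopicBroker()
orders = Consumer(broker, "orders")
broker.publish("orders", "o1")
broker.publish("payments", "p1")  # ignored by the "orders" consumer
first = orders.poll()
broker.publish("orders", "o2")
second = orders.poll()
late_joiner = Consumer(broker, "orders").poll()  # a new consumer needs no broker change
print(first, second, late_joiner)
```

Adding a consumer here requires no change to the broker: the new consumer simply starts reading the topic from its own offset.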
Excellent, let’s scale up and communicate so many messages that the brokers cannot keep up. In this case, we simply introduce multiple brokers and send information to the broker cluster. Each broker can then maintain a certain set of producer messages.
However, there are two issues with this:
- If a consumer wants data from multiple topics it has to find which broker has the topic information it wants.
- If a broker goes down, the entire topic’s message transfer is lost, and considering the volumes at hand, this is a possibility.
Let’s address both of these issues in one step. One issue we were seeing is that a particular broker cannot handle messages from all producers, so each broker needs to handle one topic’s data. But there’s no reason it can’t hold copies of data from other brokers (without being responsible for their distribution). It’ll be the “lead” of a particular topic and copy the rest of the data from its sibling brokers, which are leading their own topics (while at the same time providing the data it is the lead of). Also, in case a particular broker goes down, its siblings will have a copy of its data, so one of them can take over its processing.
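A toy model of this lead-and-copies arrangement might look like the following. Everything here is an assumption for illustration: the `ReplicatedCluster` class is hypothetical, copies are made instantly on every write, and the replacement lead is simply the first surviving broker.

```python
from typing import Dict, List


class ReplicatedCluster:
    """Hypothetical cluster: one lead broker per topic, copies on all brokers."""

    def __init__(self, brokers: List[str]) -> None:
        # broker name -> (topic -> copy of that topic's messages)
        self.logs: Dict[str, Dict[str, List[str]]] = {b: {} for b in brokers}
        self.lead: Dict[str, str] = {}  # topic -> lead broker

    def create_topic(self, topic: str, lead: str) -> None:
        self.lead[topic] = lead
        for log in self.logs.values():
            log[topic] = []

    def publish(self, topic: str, message: str) -> None:
        # Simplification: the write lands on every broker at once, standing in
        # for the lead accepting it and the siblings copying it afterwards.
        for log in self.logs.values():
            log[topic].append(message)

    def fail(self, broker: str) -> None:
        """Remove a broker and hand its topics to a surviving sibling."""
        del self.logs[broker]
        survivor = next(iter(self.logs))
        for topic, lead in self.lead.items():
            if lead == broker:
                self.lead[topic] = survivor

    def read(self, topic: str) -> List[str]:
        """Read a topic from whichever broker currently leads it."""
        return list(self.logs[self.lead[topic]][topic])


cluster = ReplicatedCluster(["b1", "b2", "b3"])
cluster.create_topic("orders", lead="b1")
cluster.publish("orders", "o1")
cluster.publish("orders", "o2")
cluster.fail("b1")
print(cluster.lead["orders"], cluster.read("orders"))  # b2 ['o1', 'o2']
```

Because a sibling already holds a copy, losing the lead broker changes who serves the topic but not what can be read from it.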
At this point the only feature left to provide growth for is our topic. We’ve been looking at topics as specific sets of data, but let’s look at their functionality. Each topic is currently viewed as a queue of ordered messages relevant to a particular piece of information.
So, let’s partition each queue, which causes issues with the ordering of messages. This can be quickly addressed, though, if we provide some topic-wide ordering of messages. As long as the consumer reads from all partitions of a topic, it’ll have all the messages and will be able to apply ordering to the data after it’s been received. Now, if we notice that a topic is getting too many messages, we simply add more partitions to increase parallel message consumption.
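Here is a minimal sketch of partitioning with a topic-wide order marker, assuming each record carries a sequence number that is unique across the topic and that records reach the partitioner in ascending order. The function names are hypothetical; `heapq.merge` is the standard-library routine for merging already-sorted inputs.

```python
import heapq
from typing import List, Tuple

Record = Tuple[int, str]  # (topic-wide sequence number, payload)


def partition_round_robin(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Spread records over partitions; each partition stays sorted by sequence."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for seq, payload in records:
        partitions[seq % num_partitions].append((seq, payload))
    return partitions


def read_topic_in_order(partitions: List[List[Record]]) -> List[str]:
    """Merge all partitions back into topic order using the sequence numbers."""
    return [payload for _, payload in heapq.merge(*partitions)]


records = [(i, f"m{i}") for i in range(6)]
parts = partition_round_robin(records, 3)
print(parts)
print(read_topic_in_order(parts))  # ['m0', 'm1', 'm2', 'm3', 'm4', 'm5']
```

Each partition can be consumed in parallel, and the full order is recovered only by a consumer that reads every partition.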
Finally, in order to add more granular safety, we set the “lead” broker not by topic but by partition. This way, each broker is the lead for a particular partition of a topic. This also allows each broker to be the lead for a partition in each of multiple topics. The additional safety gained is that a topic’s data no longer depends on a single broker. If a broker goes down, only a single partition of a topic (the partition the broker is leading) will be affected.
RULE: ALL TOPICS ARE SPLIT INTO PARTITIONS. FOR FAIL-PROOFING, EACH PARTITION IS REPLICATED ACROSS BROKERS. EACH BROKER IS “LEAD” FOR A SINGLE PARTITION IN A TOPIC. A BROKER CAN THUS BE “LEAD” FOR MULTIPLE PARTITIONS (EACH IN DIFFERENT TOPICS)
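The leadership rule can be illustrated with a simple assignment table. This sketch assumes a round-robin assignment and as many partitions per topic as there are brokers; both `assign_leaders` and `affected_by_failure` are hypothetical helpers, not part of any real broker API.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int]  # (topic, partition index)


def assign_leaders(topics: Dict[str, int], brokers: List[str]) -> Dict[Partition, str]:
    """Give each partition a lead broker, rotating through the brokers per topic."""
    leaders: Dict[Partition, str] = {}
    for topic, num_partitions in topics.items():
        for index in range(num_partitions):
            leaders[(topic, index)] = brokers[index % len(brokers)]
    return leaders


def affected_by_failure(leaders: Dict[Partition, str], broker: str) -> List[Partition]:
    """List the partitions that need a new lead if `broker` goes down."""
    return [partition for partition, lead in leaders.items() if lead == broker]


leaders = assign_leaders({"orders": 3, "payments": 3}, ["b1", "b2", "b3"])
print(affected_by_failure(leaders, "b2"))  # [('orders', 1), ('payments', 1)]
```

With this layout, losing one broker touches one partition per topic rather than an entire topic.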
Finally, let’s return to the idea of ordering. Above, a workaround we mentioned for imposing ordering on a series of partitions is to have some kind of order marking, which is applied after message distribution. However, there are many cases where ordering isn’t as important.
For example, at Jet, we would care about the order in which a particular SKU has been changed (a vendor provides some products, a buyer consumes some, and we get some more after this). However, between two SKUs, the ordering isn’t as important. Hence, if we look at a topic as a source of SKU changes, then we can assign each partition to a collection of SKUs, worry about ordering only within the partition, and disregard ordering between partitions.
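One way to picture the SKU example is to pick the partition from the SKU itself, so that every change to a given SKU lands in the same partition in the order it was sent. The sketch below assumes a CRC32 hash purely for illustration (it is not claimed to be the hash Kafka uses), and the SKU names and change values are made up.

```python
import zlib
from typing import Dict, List, Tuple

Change = Tuple[str, str]  # (SKU, change description)


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(changes: List[Change], num_partitions: int) -> Dict[int, List[Change]]:
    """Append each change to the partition chosen by its SKU."""
    partitions: Dict[int, List[Change]] = {p: [] for p in range(num_partitions)}
    for sku, change in changes:
        partitions[partition_for(sku, num_partitions)].append((sku, change))
    return partitions


changes = [
    ("sku-1", "+10"),
    ("sku-2", "+5"),
    ("sku-1", "-3"),
    ("sku-2", "-1"),
    ("sku-1", "+7"),
]
routed = route(changes, 4)
for partition, items in routed.items():
    print(partition, items)
```

Within a partition, the changes for any one SKU keep their original order, while nothing is guaranteed about the relative order of SKUs that land in different partitions.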
Let’s compile the rules we’ve listed above.
- The producer creates messages.
- The consumer consumes messages.
- Each message contains a message order based on the producer.
- Consumers should interact with a specified set of data (topic) rather than all data.
- We need a message distributor (broker) between the consumers and producers that is functionally independent from the consumers and producers but that allows them an API containing topics with which they can interact.
- A consumer needs to subscribe and listen to a topic, which is not responsible for publishing to a consumer.
- A producer can publish only to a single topic.
- All topics are split into partitions. Each broker is “lead” for a single partition in a topic. All partitions are replicated across brokers for fail proofing. A broker can thus be “lead” for multiple partitions (each in different topics).
This setup is Kafka. Now, some of the details of the functionality have been glossed over, but hopefully, from the above you understand how it works and how it addresses concerns including fail-safes, ordering, and scaling.