A Detailed Guide to Apache Storm Fundamentals

This article was published as a part of the Data Science Blogathon.

Introduction

Continuous data streams are ubiquitous and are becoming even more so as the number of IoT devices in use grows. Of course, this data is stored, processed, and analyzed to provide predictive and actionable results. But analyzing petabytes takes a long time, even with Hadoop (as good as MapReduce may be) or Spark (a remedy for MapReduce's limitations).

[Figure source: storm.apache.org]

Second, we often don’t need insights derived over long periods. Of all the petabytes of incoming data collected over months, at any given moment we may not need to take all of it into account, just a real-time snapshot. We may not need to know the longest-trending hashtag of the past five years, just the one trending right now.

That’s what Storm is built for: to take in tons of data coming in extremely fast, possibly from multiple sources, analyze it, and publish real-time updates to a UI or elsewhere without storing any of the data itself.

This article is not a complete, end-to-end guide to Storm, nor is it meant to be. Storm is big, and just one long read probably couldn’t do it justice anyway. Any feedback, additions, or constructive criticism would be greatly appreciated.

What is Apache Storm?

Apache Storm is a distributed system for processing big data in real time. Storm is designed to handle massive amounts of data in a fault-tolerant, horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates. It is simple to use, and it lets you run all kinds of real-time data manipulations in parallel.

Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and operate, and it guarantees that every message will be processed through the topology at least once.

How Does Apache Storm Work?

Storm’s architecture can be compared to a network of roads connecting a set of checkpoints. Traffic begins at a certain checkpoint (called a spout) and passes through other checkpoints (called bolts).

The traffic is, of course, the stream of data that is retrieved by the spout (from a data source, such as a public API) and routed to various bolts where the data is filtered, sanitized, aggregated, analyzed, and sent to a UI for people to view, or to any other target.

A network of spouts and bolts is called a topology, and the data flows in the form of tuples (lists of values that may have different types).
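
To make these terms concrete, here is a minimal, illustrative sketch in Java of a spout that emits random digits and a bolt that passes on only the even ones. This is a reconstruction under assumptions (a recent Storm 2.x API; the class names mirror the example discussed later), not this article's exact code:

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// A spout that emits a random digit (0-9) as a one-field tuple.
public class RandomDigitSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100); // throttle emission a little
        collector.emit(new Values(random.nextInt(10)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("digit"));
    }
}

// A bolt that forwards only the even digits it receives.
class EvenDigitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        int digit = tuple.getIntegerByField("digit");
        if (digit % 2 == 0) {
            collector.emit(new Values(digit));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("even-digit"));
    }
}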

[Figure: Storm physical view. Source: storm.apache.org]

One important thing to talk about is the direction of the data traffic. Conventionally, we would have one or more spouts reading data from an API, a Kafka topic, or some other queuing system. The data would then flow unidirectionally to one or more bolts, which may pass it on to other bolts.

Bolts may publish the analyzed data to a UI or to another bolt. But the traffic is almost always unidirectional, like in a directed acyclic graph (DAG). Although it is certainly possible to create loops, it is unlikely we would need such a convoluted topology.

Installing a Storm release involves a series of steps that you can perform on your machine. But later on, I will use Docker containers to deploy the Storm cluster, and the images will take care of setting up everything we need.

Parallelism in Storm Topologies

Fully understanding parallelism in Storm can be daunting, at least in my experience. A topology requires at least one process to run on (obviously). Within this process, we can parallelize the execution of our spouts and bolts using threads.
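
As a hedged sketch of how that wiring might look in code, reusing the RandomDigitSpout and EvenDigitBolt from earlier plus a MultiplyByTenBolt assumed to be defined along the same lines (the parallelism hint passed as the last argument is the number of threads):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class RandomDigitTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // 1 executor (thread) for the spout, the default when no hint is given.
        builder.setSpout("random-digit-spout", new RandomDigitSpout());

        // 2 executors for the even-digit bolt, fed by the spout.
        builder.setBolt("even-digit-bolt", new EvenDigitBolt(), 2)
               .shuffleGrouping("random-digit-spout");

        // 4 executors for the multiply-by-ten bolt, fed by the even-digit bolt.
        builder.setBolt("multiply-by-ten-bolt", new MultiplyByTenBolt(), 4)
               .shuffleGrouping("even-digit-bolt");

        Config conf = new Config();
        conf.setNumWorkers(2); // split the 7 executors across 2 worker processes
        StormSubmitter.submitTopology("random-digit-topology", conf, builder.createTopology());
    }
}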

In our example, RandomDigitSpout will launch just one thread, and the data spewed from that thread will be distributed between the two threads of EvenDigitBolt.

But the way this distribution happens, referred to as the stream grouping, can be important. For example, you might have a stream of temperature records from two cities, where the tuples emitted by the spout look like this:

[Figure: example temperature tuples. Source: storm.apache.org]

Suppose we attach only one bolt whose job is to calculate the rolling average temperature of each city. If we can reasonably expect to get roughly the same number of tuples from both cities in any given time interval, it would make sense to dedicate two threads to our bolt and send the data for Atlanta to one of them and New York to the other.

For our purpose, a fields grouping would divide the data between the threads according to the value of the field specified in the grouping. There are other types of groupings too. In most cases, though, the grouping probably won’t matter much, and you can just shuffle the data and throw it among the bolt threads randomly (shuffle grouping).
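
For the temperature example, a fields grouping on the city field might look like the following sketch (the component names, the "city" field, and RollingAverageBolt are hypothetical):

// All tuples with the same "city" value go to the same one of the 2 bolt threads.
// Requires: import org.apache.storm.tuple.Fields;
builder.setBolt("rolling-average-bolt", new RollingAverageBolt(), 2)
       .fieldsGrouping("temperature-spout", new Fields("city"));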

Another important knob is the number of worker processes our topology will run on. The total number of threads we specified will then be split as equally as possible between the worker processes. In our random-digit topology example, we had 1 spout thread, 2 even-digit bolt threads, and 4 multiply-by-ten bolt threads (7 in total). Each of the 2 worker processes would be responsible for running 2 of the multiply-by-ten bolt threads and 1 even-digit bolt thread, and one of the two processes would also run the 1 spout thread.

Of course, the 2 worker processes will have their main threads in addition to the spout and bolt threads, so we will have 9 threads in total. The spout and bolt threads are collectively called executors.

It is important to note that if you set a spout’s parallelism hint > 1 (i.e., multiple executors), you may end up emitting the same data multiple times. Say a spout reads from Twitter’s public streaming API and uses two executors; the bolts receiving data from the spout would then receive each tweet twice. Data parallelism comes into play only after the spout emits its tuples, i.e., when the tuples are split among the bolts according to the specified stream grouping. Also, running multiple workers on a single node would be fairly pointless. Later, however, we will use a proper, distributed, multi-node cluster and see how workers are spread across different nodes.

What Does a Storm Cluster Look Like?

Any production-grade topology would be submitted to a cluster of machines to take full advantage of Storm’s scalability and fault tolerance. The Storm distribution is installed on the master node (Nimbus) and on all the slave nodes (Supervisors).

The slave nodes run the Storm Supervisor daemons. A Zookeeper daemon on a separate node is used for coordination between the master node and the slave nodes. By the way, Zookeeper is only used for cluster management and never for any kind of message passing.

[Figure: Storm cluster. Source: storm.apache.org]

It’s not like spouts and bolts send data to each other through Zookeeper or anything like that. The Nimbus daemon uses Zookeeper to find the available Supervisors, with which the Supervisor daemons register themselves. It also handles other managerial tasks, some of which will become clear shortly.

Storm UI is the web interface used to manage the state of our cluster. We’ll get to that soon.

Our topology is submitted to the Nimbus daemon on the master node and then distributed among the worker processes running on the slave/supervisor nodes. Thanks to Zookeeper, it doesn’t matter how many slave/supervisor nodes you run initially, because you can always add more seamlessly, and Storm will automatically integrate them into the cluster.

Whenever we start a Supervisor, it allocates a certain number of worker processes (which we can configure), which the submitted topology can then use. So there are a total of 5 allocated workers in the picture above.

conf.setNumWorkers(5)

This code means that the topology will try to use a total of 5 workers. Since our two Supervisor nodes have a total of 5 allocated workers, each of the 5 allocated worker processes will run one instance of the topology. If instead we wrote:

conf.setNumWorkers(4)

Then one worker process would remain idle/unused. If the number of workers specified was 6 and the total number of allocated workers was 5, then only 5 actual topology workers would run because of that limit.

Before we set this all up with Docker, there are a few important things to keep in mind regarding fault tolerance:

  • If a worker on any slave node dies, the Supervisor daemon will restart it. If restarting repeatedly fails, the worker will be reassigned to another machine.
  • If an entire slave node dies, its share of the work will be assigned to another slave node.
  • If Nimbus goes down, the workers remain unaffected. However, until Nimbus is restored, workers will not be reassigned to other slave nodes if, say, their own node crashes.
  • Nimbus and the Supervisors are themselves stateless, but some state information is stored in Zookeeper, so things can pick up where they left off if a node fails or a daemon dies unexpectedly.
  • The Nimbus, Supervisor, and Zookeeper daemons are all fail-fast. This means they are not very tolerant of unexpected errors and will shut themselves down if they encounter one. Because of this, they must be run under supervision by a watchdog program that monitors them constantly and restarts them automatically if they ever crash. Supervisord is probably the most popular option for this (not to be confused with the Storm Supervisor daemon).

Note: In most Storm clusters, Nimbus itself is never deployed as a single instance but as a cluster. If this fault tolerance is not incorporated and our single Nimbus fails, we lose the ability to submit new topologies, gracefully kill running topologies, reassign work to other nodes if one crashes, and so on. For simplicity, our illustrative cluster will use a single Nimbus instance. Similarly, Zookeeper is very often deployed as a cluster, but we will use just one.

Use Cases of Apache Storm

Apache Storm is well known for processing large data streams in real time. For this reason, many companies use Storm as an integral part of their systems.

Twitter − Twitter uses Apache Storm for its range of “Publisher Analytics” products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter’s infrastructure.

NaviSite − NaviSite uses Storm for event-log monitoring and auditing. Every log generated in the system passes through Storm, which checks the message against a configured set of regular expressions; if there is a match, that particular message is saved to the database.

Wego − Wego is a travel metasearch engine based in Singapore. Travel-related data arrives from many sources around the world with different timings. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.

Advantages of Apache Storm

Here is a list of benefits that Apache Storm offers −

  • Storm is open source, robust, and user-friendly. It can be used both in small companies and in large corporations.
  • Storm is reliable, flexible, fault-tolerant, and can support many programming languages.
  • Enables real-time stream processing.
  • Storm is incredibly fast, with tremendous data-processing power.
  • Storm can maintain performance even under increasing load by linearly adding resources. It is highly scalable.
  • Storm has operational intelligence.
  • Storm provides guaranteed data processing even if one of the connected nodes in the cluster is lost or messages are lost.

What are Tasks in Apache Storm?

Tasks are another concept in Storm’s parallelism. But don’t worry: a task is just an instance of a spout or bolt used by an executor; it is what actually does the processing. By default, each executor runs one task, but in rare cases, each executor may need to instantiate multiple tasks.
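
For illustration, the task count can be set explicitly when declaring a component. Here is a hedged sketch reusing the even-digit bolt (2 executors sharing 4 task instances; the task count stays fixed for the topology’s lifetime, while the executor count can be changed later, which is why the two can differ):

// 2 executors (threads), each running 2 of the bolt's 4 task instances.
builder.setBolt("even-digit-bolt", new EvenDigitBolt(), 2)
       .setNumTasks(4)
       .shuffleGrouping("random-digit-spout");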

Admittedly, I can’t think of a good use case where we would need multiple tasks per executor. Perhaps if we added some parallelism ourselves, such as spawning a new thread within the bolt to handle a long-running task, the main executor thread would not block and would be able to continue processing with the other task.

However, this can make the flow of our topology difficult to understand. Please comment if anyone knows of scenarios where the performance gain from multiple tasks outweighs the added complexity.

Anyway, back from this slight detour, let’s look at an overview of our topology. Click on the topology’s name under Topology Summary and scroll down to Worker Resources:

We can see the distribution of our executors (threads) among the 3 workers. And, of course, all 3 workers are running on the same, single Supervisor node.

Now, let’s try scaling it!

Parting Shots

Designing a Storm topology or cluster is always about tweaking the various knobs we have available and settling where the result seems optimal. A few things will help you in this process, like using a config file to read parallelism hints, the number of workers, and so on, so that you don’t have to edit and recompile your code repeatedly.
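
As a minimal sketch of that idea (the file name and property keys are made up for illustration):

import java.io.FileInputStream;
import java.util.Properties;

// Load tuning knobs from a properties file so changing them
// doesn't require editing and recompiling the topology code.
Properties props = new Properties();
try (FileInputStream in = new FileInputStream("topology.properties")) {
    props.load(in);
}
int boltThreads = Integer.parseInt(props.getProperty("even.bolt.parallelism", "2"));
int numWorkers  = Integer.parseInt(props.getProperty("workers", "2"));

builder.setBolt("even-digit-bolt", new EvenDigitBolt(), boltThreads)
       .shuffleGrouping("random-digit-spout");
conf.setNumWorkers(numWorkers);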

Define your bolts logically, one per indivisible task, and keep them light and efficient. Similarly, your spouts’ nextTuple() methods should be optimized.

Use the Storm UI effectively. By default, it doesn’t show us the full picture, only 5% of the total tuples emitted. Use config.setStatsSampleRate(1.0d) to monitor all of them. Keep an eye on the Acks and Latency values for individual bolts and topologies through the UI; that’s what you want to watch while turning the knobs.

Conclusion

Storm’s architecture can be compared to a network of roads connecting a set of checkpoints. Traffic begins at a certain checkpoint (called a spout) and passes through other checkpoints (called bolts). The traffic is, of course, the stream of data that is retrieved by the spout (from a data source, such as a public API) and routed to various bolts where the data is filtered, sanitized, aggregated, analyzed, and sent to a UI or any other target.

  • Apache Storm is well known for processing large data streams in real time, and many companies use it as an integral part of their systems. Some notable examples are Twitter, NaviSite, and Wego.
  • Nimbus and the Supervisors are themselves stateless, but some state information is stored in Zookeeper, so things can pick up where they left off if a node fails or a daemon dies unexpectedly.
  • One important thing to keep in mind is the direction of the data traffic. Conventionally, one or more spouts read data from an API, a Kafka topic, or some other queuing system; the data then flows unidirectionally to one or more bolts, which may pass it on to other bolts.
  • Tuples are split between the bolts according to the specified stream grouping. Running multiple workers on a single node is fairly pointless; the benefits show up on a proper, distributed, multi-node cluster, where the workers are spread across different nodes.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
