Introduction to data streaming

In economies where the role of big data is ever-increasing, companies turn to business intelligence and reporting tools to have their data analyzed and presented in a precise and readable format. One of the solutions for handling large data sets is a technique called data streaming.

Data streaming is the continuous processing of real-time data with the lowest possible latency, ideally in a way that directly benefits the end user.

Use cases of data streaming:

  • Consumers and retailers benefit from the availability of real-time data and insights.
  • Credit card fraud can be detected in real time, and retailers can create a real-time single view of a customer that enables a real-time recommendation engine.
  • Data from sensors can be used to detect fraud in real time and to notify the end user.
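
To make the fraud-detection use case more concrete, here is a minimal sketch in plain Python. It is an illustration only, not a production rule: the `Transaction` fields, the 60-second window, and the limit of three payments per window are all assumptions made up for this example. The point it shows is that each event is checked the moment it arrives, instead of waiting for a batch job.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, NamedTuple


class Transaction(NamedTuple):
    # Hypothetical event shape, chosen only for this example.
    card_id: str
    timestamp: float  # seconds since an arbitrary start
    amount: float


def flag_bursts(
    events: Iterable[Transaction],
    window_seconds: float = 60.0,
    max_per_window: int = 3,
) -> Iterator[Transaction]:
    """Yield every transaction that exceeds the per-card rate limit.

    Events are handled one at a time, so an alert can be raised
    immediately after the suspicious payment is seen.
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    for event in events:
        times = recent[event.card_id]
        times.append(event.timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and event.timestamp - times[0] > window_seconds:
            times.popleft()
        if len(times) > max_per_window:
            yield event


stream = [
    Transaction("card-A", 0.0, 20.0),
    Transaction("card-B", 5.0, 15.0),
    Transaction("card-A", 10.0, 35.0),
    Transaction("card-A", 20.0, 50.0),
    Transaction("card-A", 30.0, 500.0),  # fourth payment within 60 seconds
    Transaction("card-A", 200.0, 12.0),  # the window has moved on by now
]
alerts = list(flag_bursts(stream))
for alert in alerts:
    print(f"possible fraud: {alert.card_id} at t={alert.timestamp}")
```

In a real deployment the `stream` list would be replaced by an unbounded source of events, and the alert would trigger a notification to the end user.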

In this article, we introduce some of the tools from the Apache Hadoop ecosystem that were developed for these purposes: Kafka, NiFi, Druid, and Superset.

A quick introduction to our tool-set:

Apache Kafka

Apache Kafka is a distributed streaming platform capable of handling trillions of events a day.
Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log.
Since being created and open-sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue
to a full-fledged streaming platform.
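
The idea of a commit log can be sketched in a few lines of Python. This is a toy, in-memory model and not Kafka's actual implementation or API; the `CommitLog` class and the consumer names are hypothetical. It shows the property that separates a log from a classic queue: records are only appended and never removed on read, and every consumer keeps its own position (offset) in the log.

```python
from typing import Dict, List, Tuple


class CommitLog:
    """Toy append-only log; an illustration, not Kafka's implementation."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}  # next offset to read, per consumer

    def append(self, record: str) -> int:
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[Tuple[int, str]]:
        """Return unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return list(enumerate(batch, start))


log = CommitLog()
for event in ["page_view", "add_to_cart", "checkout"]:
    log.append(event)

# Two independent consumers read the same records at their own pace.
dashboard_batch = log.poll("dashboard")
first_audit_batch = log.poll("audit", max_records=2)
second_audit_batch = log.poll("audit", max_records=2)
print(dashboard_batch, first_audit_batch, second_audit_batch)
```

Because reading does not delete anything, a new consumer can be added later and still process the events from the beginning.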

Apache NiFi

Apache NiFi was built to automate and manage the flow of data between systems and to address enterprise-wide
dataflow issues. It provides an end-to-end platform that can collect, curate, analyze and act on data in real time,
on-premises or in the cloud, through a drag-and-drop visual interface.

Apache Druid

Druid is an open-source data store designed for sub-second queries on real-time and historical data.
Druid is most commonly used to power user-facing analytic applications.

Apache Superset

Superset is a data exploration platform designed to be visual, intuitive and interactive.
Superset’s main goal is to make it easy to slice, dice and visualize data.
Its developers claim that Superset can perform analytics at the speed of thought.

Our Solution Architecture:

As the architecture picture shows, all of the tools mentioned above are connected to form a real-time data streaming architecture with a visualization tool at the end.

The biggest challenge is to connect all of these tools and make them communicate with each other. This is usually achieved through their APIs or their own connectors. This article is not a technical deep dive; it is here to show you what real data streaming looks like and which tools might be used for it.
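
Purely as an illustration of what "making the tools communicate" means, the sketch below chains four stand-in stages in plain Python. None of it uses the real NiFi, Kafka, Druid, or Superset APIs. The stage names, the JSON message format, and the ordering (collect, transport, aggregate, render) are assumptions made for this example. What it shows is that every stage only has to agree with its neighbour on a data format.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def collect(raw_lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Collection stage: parse raw input and skip malformed records."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def to_messages(records: Iterable[Dict[str, str]]) -> Iterator[bytes]:
    """Transport stage: serialize records into the agreed wire format."""
    for record in records:
        yield json.dumps(record).encode("utf-8")


def aggregate(messages: Iterable[bytes]) -> Counter:
    """Storage and query stage: decode messages and count events per type."""
    counts: Counter = Counter()
    for message in messages:
        counts[json.loads(message.decode("utf-8"))["type"]] += 1
    return counts


def render(counts: Counter) -> List[str]:
    """Visualization stage: turn the counts into a simple text bar chart."""
    return [f"{name:<8} {'#' * n}" for name, n in counts.most_common()]


raw = ['{"type": "click"}', "not json", '{"type": "view"}', '{"type": "click"}']
chart = render(aggregate(to_messages(collect(raw))))
print("\n".join(chart))
```

In a real architecture each arrow between two functions would be a network connection handled by an API call or a ready-made connector, which is where most of the integration effort goes.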

Conclusion:

Data streaming is very powerful and is used by more and more companies. Implementing a streaming solution is not easy: it requires technical expertise and cannot be done overnight.
If you are thinking of using data streaming, or are already using it, you know that choosing the right toolset is really important.

Source:
http://hadoop.apache.org/
https://hortonworks.com/ecosystems/


About the author

Robert Lepen
