Tag: Spark streaming

Connecting reactive applications with fast data using reactive streams.

Talk by Luc Bourlier. As in the previous posts these are my quick notes.
Who doesn’t know what a reactive application is? Responsive, elastic, resilient and message driven – this is what reactive apps are.
Big data means that there are too much data to be handled by traditional means on a single machine.
Fast Data are big data that comes in big volume and you want up to the second information with continuous process.
Spark streaming is the technology by light bend that does the trick.
Spark is an evolution of map reduce model. A driver program (spark context) talks to the cluster manager to get worker nodes to do the job.
Spark can be used on streams by using mini-batching. A mini batch is the work executed on data received in a unit of time.
Spark streaming deals with all kind of failures (hardware, software and network). It also handles recovery for continuous processing and deals with excess of data volume.
A demo is presented with a raspberry pi cluster. (On raspberry pi you don’t need to push the system to the limit, because you are already at the limit).
Demo ran fine, but it broke, that makes me wonder how stable is this technology. The demo model seemed quite simple.

Back pressure is the mechanism implemented by akka streaming to slow the data producer if the consumer is not able to consume data fast enough.
Congestion in spark was handled by static limit on the input rate. In spark 1.5 the limit has been changed into dynamic rate limit. There is a rate limit estimator based on PID that sets the rate limit.
There are some limitations to this method based on the assumptions used in the design – all records require about the same time to process, the process is linear (a 3rd assumption was there, but it got lost my my note taking)