
~~NOTOC~~

<html><font color=#990000 size="+2"><br><b>What is Reactive Data?</b></font></html>

{{gallery> wiki:general_moc:rd-platform-1.1.png?509x400&859X667crop&lightbox}}


So what is Reactive Data and why do we need a Reactive Data Platform? \% Reactive Data is a new approach to data processing and analysis.  It lets developers easily build scalable and distributed data-driven applications by making use of the [[wp> Actor model]], wherein computations are triggered in response to event streams or changes in data state.  

Modern architectures are distributed by design, and their critical data often has two states: it is simultaneously at-rest and [[ http://en.wikipedia.org/wiki/Data_at_Rest#mediaviewer/File:3_states_of_data.jpg | in-motion]].  Data-driven applications are designed to process data changes, turning them into actions that, in turn, result in more data changes.  This typically requires a lot of infrastructure software and the use of specialized programming languages.

The big idea with Reactive Data is that data management systems and query engines themselves should be smart enough to observe and react to data changes. This allows application developers, data analysts and scientists to seamlessly blend traditional data analysis with new techniques like [[wp>Stream processing | Stream Processing]] and [[wp>Reactive programming | Reactive Programming]], without the need for writing and maintaining complex code.  Reactive data processing means that changes in data state can drive application logic and not the other way around, as was the case with legacy systems.  

Why the paradigm shift? \% As enterprise architectures evolved, the critical data used to make decisions became voluminous and noisy((Noisy data often includes duplicates or non-relevant information)): constantly on the move and scattered across multiple systems.  Increasingly, subsets of new information were shipped between data centers, cloud and web applications to enable real-time((Real-time systems are often referred to as Low-Latency)) decision making.  The resulting in-flight data was often more accurate and relevant than the stale contents of an organization's system of record. 

As data became fragmented across applications and storage systems, the volume and variety of data in-motion increased.  Enterprise architectures evolved to include messaging, routing and data transformation capabilities, at the cost of increased complexity.  Sharing data between applications and making changes to business logic became difficult and error-prone, resulting in brittle systems that rely on expensive hardware to scale.  This invalidated many prior assumptions about system design, driving the need for new data processing techniques.  Application developers needed a simpler way to work with data at-rest and in-motion: something that allows them to see, shape and query information as it moves through an organization.

A reactive data platform makes it easy for developers and analysts to implement new data processing techniques by making them part of the data processing language. Instead of relying on complex frameworks and specialized programming languages the capabilities are available as part of a general-purpose query engine.  This lets developers use familiar tools for building something new, allowing reactive programming techniques to be used for [[wp>microservice]] development, back-office automation, [[wp>Internet_of_things| IoT]] or [[wp>big data]] analytics applications.  

==== The Actor Model ====

Implementing a reactive data processing system is not without its challenges.  Writing reactive programs is not easy and often requires a number of specialized skill sets.  To get things done, StreamScape's data engine makes use of the [[wp>Actor model | actor model]] for asynchronous data processing.  This addresses issues of complexity, scale and cost by eliminating dependency on a central server and providing a common data language that abstracts both real-time and batch processing.  Actors allow the user to develop functionality incrementally, as a series of smaller components that handle specific tasks.  Making changes to business logic becomes simple, without sacrificing performance.   

Unlike general purpose programming languages a reactive data system knows how to query and work with data structures. Declarative dependencies between actors and data make it possible for one application to observe and react to data changes or events produced by another.  Actors communicate asynchronously by passing sequences of data((Also called event streams)) and allow processing to occur in parallel on a massive scale. This frees up applications to perform other tasks while results are computed; allowing them to run on low-cost, commodity hardware and take advantage of modern, [[wp>Multi-core processor|multi-core]] computer architectures. More importantly, actors allow StreamScape users to develop, test, debug, and operate batch and real-time systems on top of a single data processing framework. 
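As a rough illustration of the actor style described above, the following Python sketch shows actors that communicate asynchronously by passing messages through mailboxes and process them on their own threads. The `Doubler` and `Collector` actors are hypothetical examples for this page only, not part of any StreamScape API.

```python
import queue
import threading

class Actor:
    """Base actor: owns a mailbox and reacts to messages on its own thread."""
    def __init__(self):
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        # Asynchronous: the sender never waits for the message to be processed.
        self.mailbox.put(message)

    def stop(self):
        # Poison pill: lets the actor drain remaining messages, then shut down.
        self.mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is None:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError

class Doubler(Actor):
    """Reacts to each number by forwarding its double downstream."""
    def __init__(self, downstream):
        self.downstream = downstream
        super().__init__()

    def receive(self, message):
        self.downstream.send(message * 2)

class Collector(Actor):
    """Buffers results, playing the role of a downstream consumer."""
    def __init__(self):
        self.results = []
        super().__init__()

    def receive(self, message):
        self.results.append(message)

sink = Collector()
doubler = Doubler(sink)
for n in [1, 2, 3]:
    doubler.send(n)    # the caller is free to do other work meanwhile
doubler.stop()         # wait for the doubler to drain its mailbox
sink.stop()            # then wait for the collector to drain its own
```

Because each actor owns its state and communicates only through messages, the two stages run in parallel without locks, which is the property that lets actor systems scale out.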

Reactive system design is becoming the architecture of choice, with diverse implementations across the application stack, ranging from UI frameworks like [[http://elm-lang.org/ | Elm]] and [[https://baconjs.github.io/ | Bacon.js]] to full featured reactive programming languages like [[http://www.scala-lang.org/ | Scala]] and [[http://www.erlang.org/ | Erlang]].  Regardless of application, reactive programming concepts are radically transforming the industry.  Large companies such as [[http://techblog.netflix.com/2013/02/rxjava-netflix-api.html | Netflix]], [[https://msdn.microsoft.com/en-us/data/gg577609.aspx|Microsoft]] and [[http://facebook.github.io/react/index.html|Facebook]] as well as start-ups like [[http://www.lightbend.com/ | Lightbend (formerly Typesafe)]] are betting big on going reactive.  As of this writing more than 9,000 individuals from various organizations have signed the [[http://www.reactivemanifesto.org/|Reactive Manifesto]], validating this important new paradigm.  

   
==== The Theory of Reactivity ====

**<color brown>A=R/V<sup>2</sup></color>** 

Although we're having some fun here, the above equation illustrates that <color brown>//Actionable Information = Data Relevance / Data Velocity <sup>2</sup>//</color>. \% It is a reminder of the relationship between good (actionable) data, the ability to filter out the noise, and the rate at which new data are being created.  Increasing data velocity typically lowers the quality of actionable information.  Improving data quality requires more processing to find relevant information.  Data analysis leads to an unavoidable growth in data velocity and volume. 
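Taken at face value, the inverse-square relationship can be sketched numerically; the figures below are made up purely to illustrate the shape of the curve:

```python
# Toy illustration of A = R / V^2: holding relevance constant,
# doubling data velocity cuts actionable information to one quarter.
def actionable_information(relevance, velocity):
    return relevance / velocity ** 2

baseline = actionable_information(100.0, 10.0)          # -> 1.0
doubled_velocity = actionable_information(100.0, 20.0)  # -> 0.25
```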

The formula also reflects the dual nature of critical information that is being continuously converted from high velocity data in-motion to high value data at-rest. Effective analysis requires a series of steps that constantly collect (buffer) and distribute (stream) the data. This is based on the observation that all Business Intelligence (BI) applications typically deal with [[http://sqlmag.com/business-intelligence/bitemporal-design-time-after-time | bi-temporal data]] that represents both current (real-time) and historical information. 

{{ :wiki:general_moc:buffer-stream.png?nolink |}}

Consider [[wp> Map Reduce]], a popular data processing pattern that requires programmers to read chunks of data from databases or file storage and send them to consuming applications (services) for processing.  The services in turn send computed results to other consuming applications that aggregate the bits and present them to users for reporting or additional analysis.  Data are collected (buffered) during the mapping phase, then passed to the reducers (streamed), and the results are again collected (buffered) and prepared for further processing.  This architecture is at the heart of every decision support system and illustrates how data state constantly changes from at-rest collections that can be queried to in-motion streams that can be filtered and routed.  Critical data has two states: it is simultaneously at-rest and in-motion.
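The buffer/stream cycle above can be traced in a toy word count, a minimal sketch of the map-shuffle-reduce pattern rather than any particular MapReduce implementation:

```python
from collections import defaultdict

def map_phase(documents):
    """Stream (word, 1) pairs out of the buffered input."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Buffer the in-motion pairs back into at-rest groups, keyed by word."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Aggregate each buffered group into a final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["data in motion", "data at rest"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Each boundary between phases is exactly the state change the paragraph describes: buffered input is streamed to the mappers, re-buffered by key, then reduced into a queryable result.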

As [[wp>big data]] systems evolve, the goals of enterprise architecture are shifting towards a single framework that can handle massive quantities of data by combining batch and stream-processing methods. One popular solution presented by [[http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html | Nathan Marz]] is the so-called [[wp>lambda architecture | Lambda Architecture]].  It takes a data-first approach by allowing streams of time-ordered data to be ingested into specialized storage systems that prepare information for batch or real-time processing.

Lambda offers a repeatable, albeit complex, solution to the dual-state data problem, based on open-source architectures developed at Google, Netflix, Facebook, Twitter and other big data pioneers.  More recently, an improved version called the [[https://www.oreilly.com/ideas/questioning-the-lambda-architecture | Kappa architecture]] was proposed by folks at LinkedIn.  Kappa architecture treats everything as a stream, and data prep is performed by reading data from an event queue (although the authors insist on calling it a log).  Both solutions depend heavily on a broad stack of Apache open source frameworks like [[wp>Apache Storm | Storm]], [[wp>Apache Spark | Spark]], [[wp>Apache Kafka | Kafka]], [[wp>Apache Hadoop | Hadoop]] and [[wp>Apache Samza | Samza]].  While the architectural theories of Lambda and Kappa are sound, their implementation is not for the faint of heart.  It requires a deep knowledge of open source frameworks and an even deeper understanding of the divergent programming paradigms that these frameworks use.  The resulting operational complexities of either approach are well known in the industry.  Nevertheless, there is much value in understanding the issues these technologies are attempting to address. 
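The core idea behind Kappa, that any table is just a replay of an immutable event log, can be sketched in a few lines; the account/balance example below is purely illustrative:

```python
# Everything is a stream: changes are appended to an immutable log,
# and current state is materialized by replaying that log from the start.
log = []

def append(event):
    log.append(event)        # events are never updated in place

def materialize():
    """Rebuild current account balances by replaying the whole log."""
    balances = {}
    for account, delta in log:
        balances[account] = balances.get(account, 0) + delta
    return balances

append(("alice", 100))
append(("bob", 50))
append(("alice", -30))
balances = materialize()
```

Because the log is the system of record, a bug fix or a new view is deployed simply by replaying the same events through new code, which is what makes the single-framework goal plausible.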

==== Reactive Data ====

Reactive data systems are built from the ground up to support bi-temporal data. They inherently understand structured and unstructured data formats and are capable of performing computations asynchronously in response to changing data states. Changing data can raise events or send messages, and events can directly alter data. This is accomplished through new computing constructs like Event Triggers, Actors, Query Futures or Promises, depending on what condition the application needs to respond to. Reactive data systems allow architects to design data processing networks that work with ordered streams of structured, immutable data. This capability is essential to data integration, scalability, fault tolerance and Business Intelligence system design. 
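As a hedged sketch of the Event Trigger idea, the collection below notifies registered callbacks whenever its data state changes; the class and method names are illustrative only, not StreamScape's actual API:

```python
class ReactiveCollection:
    """A toy collection whose data changes drive application logic."""
    def __init__(self):
        self._rows = []
        self._triggers = []

    def on_insert(self, callback):
        # Register a trigger to fire on every future insert.
        self._triggers.append(callback)

    def insert(self, row):
        self._rows.append(row)
        for trigger in self._triggers:  # the data change raises the event
            trigger(row)

alerts = []
orders = ReactiveCollection()
# Trigger: capture any order whose amount exceeds 100.
orders.on_insert(lambda row: alerts.append(row) if row["amount"] > 100 else None)
orders.insert({"id": 1, "amount": 50})
orders.insert({"id": 2, "amount": 250})
```

Note the inversion described above: the application never polls for large orders; the change in data state invokes the logic.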

StreamScape's Reactive Data Platform™ is built on the following key architectural principles: 

\\

<grid>
<col sm="4">

<popover html="true" placement="top" trigger="focus" title="{{:icons_large:ds.png?28}} \% **Data Aware**" content="Unlike message brokers or NoSQL databases the system allows users to describe, store and query arbitrary data structures using SQL-like syntax. Data dictionaries and validation rules may be created for any kind of data and then used for stream processing, data query, aggregation or to apply schema-on-read to external files, text or web content. Materialized views and Joins on data collections are fully supported."><btn type="danger" >{{fa>database?48&color=white}}</btn></popover>
\\
\\
<color brown>**Data Aware**</color>

</col>

<col sm="4">

<popover html="true" placement="top" trigger="focus"  title="{{:icons_large:cep.png?32}} \% **Declarative Dependencies**" content="
Relationships between data elements, data collections and event flows are expressed through configuration or script.  Managing dependencies, such as when triggers or actors fire or how data is synchronized, is done by the system rather than programmatically. As a result, most infrastructure and scaffolding code is eliminated, allowing developers to focus on business logic and system design. Dependencies may be queried and visualized, making change management and root cause analysis a simple task."><btn type="danger" >{{fa>puzzle-piece?48&color=white}}</btn></popover>
\\
\\
<color brown>**Declarative**</color>


</col>

<col sm="4">

<popover html="true" placement="top" trigger="focus" title="{{:icons_large:evstream.png?32}} \% **Native Stream Processing**" content="Actors and Triggers work on streams of events generated by observable components much like Complex Event Processing engines and Messaging subscribers do. Events can be correlated and turned into data modification actions or vice versa. Parallel processing and asynchronous application logic implementation is made simple, resulting in elastic scale and hardware synchrony that allows for efficient use of multi-core architectures."><btn type="danger" >{{fa>forward?48&color=white}}</btn></popover>
\\
\\
<color brown>**Event-Driven**</color>


</col>

</grid> 

\\

So why is data reactivity important?  While there are several solutions to the dual-state problem, their architectures are fairly complex and unstable. Results are typically achieved by duct-taping together multiple technologies that by themselves solve only part of the problem. What's needed is an off-the-shelf tool set that radically simplifies implementation and operation of dual-state systems. Reactive data technologies provide exactly that.

Application frameworks like Apache Spark, Tez or Kafka allow programmers to develop highly customized solutions that provide optimal efficiency and resource trade-offs. Reactive data systems codify these techniques and turn them into a new data processing tool, allowing the technology to be applied to a much broader set of problems without the need for complex programming. The ability to crunch massive amounts of data in-flight, without loading it into intermediate storage systems, is becoming extremely valuable across many business domains. Building custom data processing infrastructures to achieve this is costly and ultimately offers no competitive advantage.  However, deploying new analytics systems quickly in order to test a hypothesis provides the data agility that a business needs in order to stay ahead of the game. 
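Processing data in-flight, without intermediate storage, can be approximated even in plain Python with lazy generators; the sensor-feed values below are made up for illustration:

```python
# Each stage consumes records as they stream past; nothing is written
# to disk or to an intermediate store between the stages.
def parse(lines):
    for line in lines:
        yield float(line)

def above(threshold, stream):
    for value in stream:
        if value > threshold:
            yield value

raw = ["12.5\n", "3.1\n", "47.0\n"]    # stands in for a live feed
hot = list(above(10.0, parse(raw)))    # values flow straight through
```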


==== The Network is the Database ====

As organizations become increasingly data-driven, the way they query and use information is also changing. Users are no longer limited by clunky request-response mechanisms of legacy applications. Reactive data processing lets users declare //future// queries and //promise// that computation results are delivered to targeted applications or devices.  Actors and triggers make it easy for back-end systems to correlate cause and effect, respond to events and work with sets of streaming data. New query techniques allow users to work with tabular data, objects or documents and ask complex questions of the data without restrictions on size or structure.
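The future/promise style of querying can be sketched with Python's standard `concurrent.futures`; the `run_query` helper and the dataset below are invented for illustration, not a real query engine:

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(dataset, predicate):
    """Stand-in for a query engine evaluating a declared query."""
    return [row for row in dataset if predicate(row)]

dataset = [{"city": "Oslo", "temp": -4}, {"city": "Lima", "temp": 22}]

with ThreadPoolExecutor(max_workers=1) as executor:
    # Declare the query now; the result is promised for later delivery.
    future = executor.submit(run_query, dataset, lambda row: row["temp"] > 0)
    # ... the caller is free to do other work while the query runs ...
    warm = future.result()    # the promise resolves here
```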

Whether custom-built or implemented with off-the-shelf tools, reactive data systems are becoming a reality and shaping the next generation of analytics tools. The resulting data processing networks allow organizations to 'democratize data' by shaping and moving critical information closer to the analytical apps used by knowledge workers.  Rather than attempt to draw insights from all the stuff locked in departmental silos, decision makers can query a network of information that spans many disparate sources, combining the data in real-time based on their needs, making centralized data warehousing unnecessary. In this new computing paradigm, as critical information spreads across disparate systems, the network becomes the database.

\\

[[:start |{{:icons_large:bowlerhat-transp.png?48}}]] [[:start|Back]]