Data Management

To be analyzed, data needs certain qualities according to its intended purpose, namely validity, consistency and completeness. Data management creates and maintains these qualities.

Choosing the right infrastructure to extract insights from big data is a key part of data management. The variety of data (structured, unstructured and semi-structured) makes it difficult to select and employ the most suitable tool. The Variety, Velocity, Veracity and Volume (the 4 Vs) of big data have to be taken into account when deciding on a data analysis framework.

We present the current challenges in this area and the most important processing systems, as well as relevant institutions and platforms. Our collection of related sources helps you delve further into the topic.


Open Source Tools for Distributed Systems

Data processing frameworks such as Apache Flink, Spark and Hadoop aim at visualizing large amounts of data and detecting deviations or errors in them. There are two ways to analyze big data: in batches or as a stream. Batch data analysis means that all data is processed at once. Stream processing analyzes data in real time, allowing immediate decisions to be made.

Batch processing means that large, historic amounts of data are analyzed in sets or “batches”. The information is collected over a period of time and then analyzed all at once. Processing therefore takes longer (normally seconds or more) than stream processing, which usually runs in real time. Batch processing can also be done offline. This approach is useful, for instance, for generating payrolls or bank statements.
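
As a minimal sketch of this idea, the following PySpark job (the file names and column names are hypothetical) reads transaction records collected over a month and processes them all at once, e.g. for a statement run:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("monthly-statement-batch").getOrCreate()

# Historical records collected over a period of time (hypothetical file and schema).
transactions = spark.read.csv("transactions_2024_01.csv", header=True, inferSchema=True)

# The whole batch is processed at once: one aggregate per account.
statements = (
    transactions
    .groupBy("account_id")
    .agg(F.sum("amount").alias("total"),
         F.count("*").alias("num_transactions"))
)

statements.write.mode("overwrite").parquet("statements_2024_01")
spark.stop()
```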

  • Apache Spark – open source framework for both batch and streaming data, running on Hadoop, Kubernetes, Apache Mesos and in the cloud. The framework can access various data sources and makes it possible to efficiently execute iterative algorithms and define complex transformation pipelines. It is mainly used to process large datasets. Applications can be written in Python, Scala, Java, R and SQL.
  • Apache Flink – open source distributed system for stateful computations that processes bounded and unbounded datasets. Like Spark, it allows iterative algorithms to be executed efficiently and complex transformation pipelines to be defined. It runs on Hadoop YARN, Kubernetes and Apache Mesos, and can also run standalone. It performs calculations at any scale.
  • Apache SystemML – open source machine learning system for big data. It mainly runs on Apache Spark, scales the data automatically and decides whether the code should run on a Spark or Hadoop cluster or locally on the driver. Algorithms are written in declarative languages with R-like and Python-like syntax.
  • Apache Hadoop – open source distributed system for processing large datasets using simple programming models. It scales up to networks of computers, each providing additional computation and storage capacity. The framework consists of the modules Hadoop Common, the Hadoop Distributed File System (HDFS), YARN and MapReduce; a minimal MapReduce sketch follows this list.
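
To illustrate the MapReduce model named above, here is a minimal word-count sketch for Hadoop Streaming, which lets the map and reduce steps be written as ordinary scripts that read from stdin and write to stdout (the script names and HDFS paths below are assumptions):

```python
#!/usr/bin/env python3
# mapper.py - reads text lines from stdin and emits "word<TAB>1" per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sums the counts per word; Hadoop sorts by key, so equal words arrive adjacently
import sys

current_word, running_total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{running_total}")
        current_word, running_total = word, 0
    running_total += int(count)

if current_word is not None:
    print(f"{current_word}\t{running_total}")
```

Such a job could then be submitted roughly as `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`; the exact jar location and paths depend on the installation.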

Challenges

  • End-to-End Management of Machine Learning Pipelines – Running in a shared environment, such pipelines are difficult to implement, mainly because production pipelines consist of a series of interdependent processing stages. When any of these stages is updated, it is hard to identify which datasets have to be recomputed and which can be reused. In addition, the development and training of machine learning models is largely an efficient search over hyperparameters, yet many computations are currently still performed several times.
  • Declarative specification of analysis algorithms offers the possibility to abstract away system-specific details (e.g. caching of RDDs, avoidance of GroupBys in Spark) and to optimize across the entire data flow. To achieve this, the program is decomposed into parallel data flow fragments (see the sketch below).
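
A minimal sketch of this contrast in PySpark (the data is made up): with the low-level RDD API the user hand-picks physical details such as caching and the aggregation operator, while the declarative DataFrame API only states what should be computed and leaves the physical plan to the optimizer.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("declarative-vs-manual").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Low-level RDD API: the user chooses the physical operations, e.g. caching the
# RDD explicitly and preferring reduceByKey over groupByKey to limit shuffle cost.
pairs.cache()
manual_sums = pairs.reduceByKey(lambda x, y: x + y).collect()

# Declarative DataFrame API: the user only states what should be computed;
# the optimizer picks the execution plan (partial aggregation, shuffle strategy, etc.).
df = spark.createDataFrame(pairs, ["key", "value"])
declarative_sums = df.groupBy("key").agg(F.sum("value").alias("sum")).collect()

print(manual_sums, declarative_sums)
spark.stop()
```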

Stream processing systems help the user extract useful insights from data in real time. For this to work, data has to flow directly into the analysis system from the moment it is generated. Stream processing is used efficiently, for instance, in fraud detection, log monitoring or customer behavior analysis. Unlike batch processing, it works only online.
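
As a minimal sketch of this model, the following Spark Structured Streaming job (host and port are assumptions) counts words as lines arrive on a socket and updates the result continuously:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()

# Unbounded input: lines arrive on a local socket as they are generated.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same declarative operations as in batch mode, applied to a live stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Results are updated continuously and printed to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```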

  • Apache Flink – open source distributed system for stateful computations that processes bounded and unbounded datasets. It runs on Hadoop YARN, Kubernetes and Apache Mesos and can also run independently. It performs calculations at any scale.
  • Apache Spark Streaming – open source framework for building scalable, fault-tolerant streaming applications. Because it runs on Spark, the user can reuse the same code for batch processing and merge streams with historical data. Applications can be written in Python, Scala and Java.
  • Apache Kafka – open source distributed streaming platform that processes data in real time. It collects and stores streams of records in a fault-tolerant, durable way, each record containing a timestamp, a key and a value. Kafka runs as a cluster on one or several servers; a minimal usage sketch follows this list.
  • Apache Storm – open source distributed computation system that operates in real time. The framework is mainly designed for processing unbounded streams of data in a scalable, fault-tolerant manner. It is useful for machine learning, real-time analytics and continuous computation. Storm can be used with any programming language.
  • Apache Beam – open source system that uses a unified programming model for both streaming and batch data processing. It can execute pipelines on a large number of execution environments. Applications can be written in Python, Java and Go.
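
As referenced in the Kafka entry above, the following sketch produces and consumes a few records using the third-party kafka-python client; the broker address and the topic name are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer  # third-party kafka-python client

# Produce a few records to a (hypothetical) "clickstream" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clickstream", key=str(i).encode(), value=f"event-{i}".encode())
producer.flush()

# Consume the stream from the beginning and process records as they arrive.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new records arrive
)
for record in consumer:
    print(record.key, record.value, record.timestamp)
```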

Challenges

  • Lower-level / higher-level APIs – Because of their complicated APIs, streaming systems are more complex to use than batch systems. The APIs often require the user to specify the physical operations of an application by hand.
  • Integration in End-to-End Streaming – Most streaming workloads are executed as part of a larger application. Integrating them requires substantial and time-consuming engineering work.

Selected Institutions & Platforms

  • Berlin Big Data Centre – Competence center for the development of methods and technologies for data science based on machine learning and data management. (BMBF)
  • Competence Centre for Scalable Data Services and Solutions – ScaDS combines the methodological competence of the universities in Dresden and Leipzig in a virtual organization and brings together internationally leading experts in the field of Big Data. (BMBF)