Data Mesh, Edge Computing, DevOps, Data Lakes and Data Wrangling
In digitization and technology, the focus is increasingly shifting toward decentralization. The following aspects play important roles here:
(1) Security: decentralized systems can be more secure than centralized ones because they do not present a single target for attacks. Distributing data and resources across many nodes in the network makes it more difficult to compromise the entire system.
(2) Scalability: decentralized systems can scale more easily than centralized ones, because the load can be distributed across many nodes in the network rather than borne by a single central entity.
(3) Transparency: decentralized systems can be more transparent because all stakeholders have access to the same information, which can increase trust and collaboration.
(4) Independence: decentralized systems are not controlled by a single central authority. Decisions can be made by a broader decision-making base, which can lead to better coverage of functionality and security.
(5) Innovation: decentralized systems can foster innovation because they are more adaptable, so new ideas can be implemented more quickly.
Examples of these developments include microservice architectures, blockchain, and decentralized cloud systems.
In addition to decentralization, there are also many efforts to perform tasks where they can be solved most efficiently. Edge computing is a technical example of this, DevOps an organizational one. In the former, data is processed and meaningfully filtered where it arises; in the latter, teams are formed that bundle development, and with it knowledge about the domain, with operations, and with it knowledge about the requirements of software users. This prevents friction caused by missing or faulty interfaces between the respective processes, as well as unnecessary overhead.
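To illustrate the edge-computing idea, the following is a minimal sketch: raw sensor readings are filtered and aggregated on the device where they arise, and only a compact summary is forwarded. The plausibility limit and the field names are hypothetical choices for this example, not part of any particular product.

```python
import statistics

MAX_PLAUSIBLE = 150.0  # hypothetical upper bound for a valid reading

def filter_and_aggregate(readings):
    """Drop implausible values and reduce a window of raw readings
    to a single summary record before it leaves the edge device."""
    plausible = [r for r in readings if 0.0 <= r <= MAX_PLAUSIBLE]
    if not plausible:
        return None  # nothing worth transmitting
    return {
        "count": len(plausible),
        "mean": statistics.mean(plausible),
        "max": max(plausible),
    }

# Example: only the aggregate leaves the device, not the raw stream.
raw = [21.3, 21.4, 999.0, 21.6, 20.9]  # 999.0 is a sensor glitch
print(filter_and_aggregate(raw))
```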
It is only in the analysis and evaluation of data that other approaches have become established over the last decade. Data is collected from sensors, applications, social media, log files and many other sources and systems, and is stored in large data sinks such as data lakes. A data lake is a data architecture in which large volumes of raw data from various sources are stored. In contrast to traditional databases or data warehouses, which usually require a structured data organization, a data lake stores unstructured or semi-structured data. A data lake therefore allows companies to store a wide range of data without having to decide in advance how it must be structured, as would be necessary before loading it into a data warehouse or database. This is intended to enable data analysts and data scientists to use the data to gain insights into business processes and customer behavior: information and knowledge are to be generated from the unstructured raw data. In order to analyze the data in the data lake, however, it usually has to be structured first. This process is known as data wrangling and includes steps such as data preparation, data cleansing and the creation of data models.
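The following is a minimal data-wrangling sketch in Python, assuming the pandas library; the column names and the quality problems shown (duplicates, mixed types, invalid values) are hypothetical but typical of raw data-lake extracts. It walks through the steps named above: preparation, cleansing and the creation of a simple data model.

```python
import pandas as pd

# Hypothetical raw extract from a data lake: semi-structured order
# records with the typical quality problems data wrangling addresses.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount":   ["10.5", "20.0", "20.0", None, "thirty"],
    "country":  ["de", "DE", "DE", "us", "US"],
})

# Data cleansing: drop duplicates, coerce types, discard unusable rows.
clean = (
    raw.drop_duplicates(subset="order_id")
       .assign(
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
           country=lambda df: df["country"].str.upper(),
       )
       .dropna(subset=["amount"])
)

# A first, simple "data model": revenue aggregated per country.
model = clean.groupby("country", as_index=False)["amount"].sum()
print(model)
```

Note that every decision in this sketch (which duplicates to drop, how to normalize country codes, what counts as an invalid amount) already requires exactly the domain knowledge discussed next.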
If the data scientists work in central data teams set up specifically for this purpose, they often face several challenges with the raw data. First, the central data team must ensure that data is accessible to all relevant departments and teams in the company. This often requires close collaboration with other departments to ensure that data processing and storage meet the requirements of the various business areas. The quality of the data is also an important prerequisite for successful data analysis: the central data team must ensure that the data is correct and consistent, which again requires close cooperation with the departments involved. Last but not least, such a central data team consists of various roles, such as data scientists, data engineers, business analysts and IT specialists. What is often missing are people with the domain knowledge needed to structure the various raw data and to add semantics to them. Data without a framework of meaning, without a description, and without information about the processes and conditions under which it was collected is difficult to transform into meaningful contexts. The battle with the data becomes a battle with the specialist departments and the conditions there, such as standards, rules and technical terms. A bottleneck arises between the people or systems that acquire the data and those that want to analyze it. In addition, a separate process must be created and maintained across team boundaries to provide feedback to the subject matter experts on the results of the analyses.
If we now apply the concepts of decentralization and local responsibility described above to this problem, the result is a decentralized microservice architecture and the following approach. Analogous to functionality, data and its semantic models are offered as APIs by domain teams, i.e., by subject matter experts. In the resulting data mesh architectures, only the respective data interfaces and semantic models need to be maintained and exchanged. Analyses can be performed wherever the necessary data is available, and additional data can easily be retrieved via further interfaces and enriched with further models. Data and its semantic models become a product that is collected, processed and kept consistent by the specialist departments. The result is a democratization of data: the raw data itself remains where it was collected, is linked to domain knowledge, and is offered in a standardized way. Data Mesh is thus a modern architecture developed to facilitate the scaling of data in the enterprise.
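To make the data-as-a-product idea concrete, here is a minimal sketch of a domain team's data product interface, using only the Python standard library. The sales domain, the /data and /model paths and all field descriptions are hypothetical; a real data mesh would standardize such endpoints and add access control, versioning and discovery on top.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical data product of a "sales" domain team: the data itself
# plus the semantic model describing what the fields actually mean.
SEMANTIC_MODEL = {
    "entity": "order",
    "fields": {
        "order_id": "unique order number assigned by the shop system",
        "amount":   "gross order value in EUR",
        "country":  "ISO 3166-1 alpha-2 code of the delivery address",
    },
}
DATA = [
    {"order_id": 1, "amount": 10.5, "country": "DE"},
    {"order_id": 2, "amount": 20.0, "country": "DE"},
]

class DataProductHandler(BaseHTTPRequestHandler):
    """Serves the data under /data and its semantic model under /model."""
    def do_GET(self):
        payload = {"/data": DATA, "/model": SEMANTIC_MODEL}.get(self.path)
        if payload is None:
            self.send_error(404)
            return
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), DataProductHandler).serve_forever()
```

A consuming team would first fetch /model to understand the semantics of the fields, then retrieve /data, without routing the request through a central data team.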
In summary, the following benefits result from a Data Mesh:
(1) Data Mesh promotes ownership of data by distributing responsibility for data to individual teams. This decentralizes data processes and decisions, which can lead to better collaboration and faster decision-making.
(2) Data Mesh enables seamless scaling of data and analytics processes. The decentralized nature of Data Mesh allows organizations to tailor data processing to business unit needs while ensuring that data processing remains effective and efficient.
(3) Data Mesh provides more flexibility in the selection of technologies and tools by allowing teams within the organization to make their own decisions about which technologies to use. This allows organizations to select the best tools and technologies to meet their specific needs.
(4) Data Mesh promotes accountability and transparency around data quality, as each data owner has responsibility for the quality of their data. This can lead to higher data quality as each data owner strives to ensure that their data is clean and consistent.
(5) Data Mesh encourages innovation by enabling teams to respond quickly to new data sources and analytics needs. This can lead to faster innovation and new insights that drive the business forward.
(6) Data Mesh promotes collaboration and knowledge sharing among teams because it encourages a decentralized, team-based structure. This allows teams to collaborate more effectively and share their knowledge and skills to achieve better results.