by Dennis Hemker, Stefan Kreutter and Harald Mathis (Fraunhofer FIT)
Data is the key factor for solving problems with modern approaches such as AI, but it is only one part of the whole. To reduce complexity in the AI life cycle, we propose a modularised tooling approach and describe its data-related parts.
Recent developments like ChatGPT [L1] or Stable Diffusion [L2] have shown the high potential of data-driven approaches to tackle a wide range of problems across different domains.
When Artificial Intelligence (AI) is involved, a typical data-focused process before training a model includes filtering, labelling, transforming and splitting. Orthogonal to these development steps of the AI itself, data needs to be acquired, stored, backed up and made accessible. Moreover, there is no guarantee that only people with deep technical backgrounds take part in such processes.
Current solutions for managing AI-related workflows and their data tend to be heavyweight, trying to offer all-in-one software. This often comes at the cost of complexity, whether in usage (lock-in to one ecosystem), setup and maintenance (need for clusters like Kubernetes), pricing (high rates per member per month) or privacy (cloud only instead of on-premise). Within the project "progressivKI" (funded by the BMWK) [L3], a modular AI platform is being developed. This platform addresses different parts of a machine learning workflow, such as GPU-based training, containerisation and model performance comparison, while decoupling the data handling aspects.
As described in [1], AI code itself makes up only a small part of the whole project, raising the need for supportive infrastructure tooling. The main idea behind our approach is to develop a platform following the principles of the Unix philosophy [L4]: solve problems with small, effective tools while retaining full control over all parts of the process.
Architecture
The main assumption within the developed platform is to treat all data as files. Each step in the pipeline (which can be a small script or executable in an arbitrary programming language) consumes files as inputs and produces files as outputs that can be consumed by later steps. We refer to these steps as "stages" in a pipeline.
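As an illustration, a stage could be as small as the following Python sketch, in which the input and output file names are hypothetical placeholders; it consumes one file and produces another that later stages can pick up.

```python
# filter_stage.py - minimal sketch of a pipeline stage (file names are placeholders).
# It reads one file and writes one file, so DVC can track both and
# re-run the stage only when its input changes.
import csv
from pathlib import Path

IN_PATH = Path("inbox/raw.csv")           # assumed input pulled from the inbox
OUT_PATH = Path("prepared/filtered.csv")  # output consumed by later stages

def main() -> None:
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    with IN_PATH.open(newline="") as src, OUT_PATH.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only complete rows; the real filter criterion is project-specific.
            if all(value.strip() for value in row.values()):
                writer.writerow(row)

if __name__ == "__main__":
    main()
```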
The aforementioned stages form a directed acyclic graph (DAG). To handle their execution in the correct order and to avoid redundant calculations on unchanged stages, we employ Data Version Control (DVC) [L5]. In a Git-like fashion, DVC handles data sources as "remotes" and supports different storage locations such as Amazon S3, Google Drive, SSH, HTTP, local file systems and more. This allows easy integration, updating, replacing and versioning of data sources. Additionally, there is no lock-in to any specific framework or (cloud) infrastructure. AI code and DVC configuration files are versioned with Git.
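For example, data that has been versioned and pushed to such a remote can be read back directly from Python through DVC's API; the repository URL, file path and revision tag below are assumptions for illustration only.

```python
# Sketch: reading a versioned artefact via DVC's Python API.
# Repository URL, tracked path and revision tag are hypothetical placeholders.
import dvc.api

with dvc.api.open(
    "data/measurements.csv",                         # path as tracked in the repo
    repo="https://git.example.org/ai-pipeline.git",  # Git repo holding the DVC files
    rev="v1.0",                                      # any Git revision: tag, branch or commit
) as f:
    print(f.readline())
```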
In the current implementation we employ an S3-compatible storage server. It uses its own replication and backup mechanisms and is managed separately. The project-specific bucket is split into three parts:
1. Inbox. This part can be considered a data heap. New data of any kind can be dropped here by external producers whenever it becomes available.
2. Cache. A DVC-specific cache storing intermediate file artefacts, e.g. metrics, models and processed files. Fully managed by DVC.
3. Outbox. The place where trained models or other file artefacts, such as transformed data, are published. It can be accessed by other consumers.
This architecture allows fine-grained configuration and rights management for each project and user. Data producers can be equipped with write-only access keys to the inbox, while consumers of outbox artefacts can be granted read-only credentials. Developers of AI pipelines can access all three parts of the bucket. Furthermore, the division can also be spread across multiple buckets or even storage locations. Because DVC treats storage as a single, exchangeable component, it can be designed as needed: either quickly available at short notice, or long-term, redundant and fail-safe.
An easy-to-use tool written in the Python programming language enables collaborators in such projects to simply upload data to specific inboxes. By avoiding complicated interfaces or processes, it lowers the barrier for data producers with less technical backgrounds, allowing fast, secure and reliable exchange between them and developers. From the respective inboxes, the data can be pulled effortlessly to local workspaces or remote servers for further processing or model training. The server hardware can be located anywhere: on-premise, in the cloud, or in any combination of the two.
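A minimal sketch of such an upload, assuming the boto3 library and a hypothetical endpoint, bucket name and inbox prefix, could look as follows; the write-only credentials are the ones handed out to the data producer.

```python
# upload_to_inbox.py - hedged sketch of a producer-side upload to a project inbox.
# Endpoint, bucket name, prefix and credentials are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.org",  # S3-compatible storage server
    aws_access_key_id="WRITE_ONLY_KEY",
    aws_secret_access_key="WRITE_ONLY_SECRET",
)

def upload(local_path: str, project: str) -> None:
    # Producers may only write below the project's inbox/ prefix.
    file_name = local_path.rsplit("/", 1)[-1]
    s3.upload_file(local_path, Bucket="ai-project-data", Key=f"{project}/inbox/{file_name}")

if __name__ == "__main__":
    upload("measurements_2023_01.csv", project="demo-project")
```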
Figure 1: Example data preparation pipeline in an AI workflow.
Figure 1 depicts the process of a typical AI data preparation pipeline. The stages to the right of "Inbox" reside in the cache part of the bucket. Assets from either the inbox or the cache can be packaged and moved to the outbox for further consumption. As everything is treated as files, it is easy to use one's favourite tooling to visualise or analyse them. Because no additional interfaces are imposed, the structure remains flexible and customisable for attaching to other software ecosystems.
Data validation and quality checks [2] can be integrated as stages, while data reduction algorithms [3] can be deployed as transparent services located in front of an inbox. This allows pipelines to be tailored to specific needs, across frameworks and hardware.
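As a hedged illustration, a validation stage could be a small script that fails the pipeline when expected columns are missing; the schema and file path below are assumptions.

```python
# validate_stage.py - sketch of a data validation stage (schema and path are assumed).
# A non-zero exit code marks the stage, and therefore the pipeline run, as failed.
import csv
import sys
from pathlib import Path

EXPECTED_COLUMNS = {"sensor_id", "timestamp", "value"}  # assumed schema
IN_PATH = Path("prepared/filtered.csv")

def main() -> int:
    with IN_PATH.open(newline="") as src:
        header = set(next(csv.reader(src)))
    missing = EXPECTED_COLUMNS - header
    if missing:
        print(f"Validation failed, missing columns: {sorted(missing)}", file=sys.stderr)
        return 1
    print("Validation passed.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```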
Future work includes adding more sophisticated user and rights management, as access key generation and distribution are currently done manually. We also plan to strengthen data privacy by applying encryption techniques on the server side. Additionally, we want to investigate further deployment tools such as MLEM [L6] and benchmark the data throughput of the storage server.
Links:
[L1] https://openai.com/blog/chatgpt/
[L2] https://stability.ai/blog/stable-diffusion-public-release
[L3] https://www.fit.fraunhofer.de/de/geschaeftsfelder/digitale-gesundheit/fraunhofer-anwendungszentrum-symila/projekt-highlights/progressiv-ki.html
[L4] http://www.catb.org/~esr/writings/taoup/html/ch01s06.html
[L5] https://dvc.org/
[L6] https://mlem.ai/
References:
[1] D. Sculley, et al., “Hidden technical debt in machine learning systems”, in NeurIPS, 2015.
[2] E. Breck, et al., “Data infrastructure for machine learning”, SysML conf., 2018.
[3] G. Karya, et al., “Basic knowledge construction technique to reduce the volume of low-dimensional big data”, ICIC, 2020.
Please contact:
Dennis Hemker
Fraunhofer Institute for Applied Information Technology FIT, Germany