D3.2 : Data Preprocessing Toolbox
Abstract
This deliverable introduces the “Toolboxes” module of the EXA4MIND Platform, which extends
and encompasses the concept of a Data Preprocessing Toolbox developed within Work
Package 3 (WP3, Extreme Data Analytics and Processing) of the EXA4MIND project.
The Preprocessing Toolbox contains a set of generic and application-specific preprocessing
tools that enable data cleaning, transformation, fusion, and harmonisation
across heterogeneous data sources. Processing tools that focus on validation have
been moved into a dedicated Validation Toolbox submodule, and an Analytics & Artificial
Intelligence (AI) Toolbox is being compiled as well. With a uniform approach to command-line
interfaces and curated code, our Toolboxes provide examples that users
can use or adapt for their own Extreme Data applications. Because such applications are
too individual to be covered by a reasonably limited set of tools, the idea of our
Toolboxes (Preprocessing, Validation and Analytics & AI) is to enable users to
construct their own processing steps within advanced data-driven workflows. Preprocessing
steps in particular ensure consistency and quality of input data, and thus
lay the foundation for effective querying and analytics services across the Extreme
Data platform.
The Preprocessing Toolbox has been designed in close alignment with WP2 (Data
Spaces Management), enabling seamless integration between distributed data spaces
and advanced analytics capabilities. In particular, the Toolboxes and individual tools
can readily be applied to data from different sources to construct data-driven
workflows within an instance of the EXA4MIND Advanced Query and Indexing System
(AQIS, from WP3). By including a fairness-check tool and validation mechanisms, the
Toolboxes support the overall project objective of enabling trustworthy, green, and
fair AI. This document complements the open-source code repositories, which form
the main part of the deliverable. It briefly presents the motivation and design principles
and gives pointers to the actual implementations and usage instructions.
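
To illustrate what a preprocessing tool with a uniform command-line interface might look like, the following is a minimal sketch only. It is not taken from the EXA4MIND repositories: the function names (`clean_rows`, `build_parser`, `main`), the `--input` and `--output` flags, and the cleaning rule itself are assumptions chosen for illustration.

```python
import argparse
import csv
import sys


def clean_rows(rows):
    """Strip whitespace from every field and drop rows with any empty field."""
    cleaned = []
    for row in rows:
        stripped = [field.strip() for field in row]
        # Keep only non-empty rows in which every field has a value.
        if stripped and all(stripped):
            cleaned.append(stripped)
    return cleaned


def build_parser():
    # Hypothetical shared interface: each tool accepts --input and --output,
    # with "-" meaning standard input or standard output.
    parser = argparse.ArgumentParser(description="Drop incomplete CSV rows.")
    parser.add_argument("--input", default="-", help="input CSV path, '-' for stdin")
    parser.add_argument("--output", default="-", help="output CSV path, '-' for stdout")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    src = sys.stdin if args.input == "-" else open(args.input, newline="")
    dst = sys.stdout if args.output == "-" else open(args.output, "w", newline="")
    try:
        rows = list(csv.reader(src))
        csv.writer(dst).writerows(clean_rows(rows))
    finally:
        # Close only the files this function opened itself.
        if src is not sys.stdin:
            src.close()
        if dst is not sys.stdout:
            dst.close()
    return 0
```

Under these assumptions, a script entry point would call `main()`, and tools sharing the same two flags could be chained through standard input and output as steps of a larger workflow.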
Subject(s)
Data Preprocessing, Data Validation, Data Analytics, Artificial Intelligence, EXA4MIND Toolboxes, Extreme Data