Paralelní zpracování dat a možnosti datové analytiky v rámci Big Data (Parallel Data Processing and Data Analytics Options in Big Data)
Authors
Derján, Lukáš
Publisher
Vysoká škola báňská - Technická univerzita Ostrava
Abstract
The diploma thesis analyses how high-volume unstructured datasets, known as Big Data, are handled and processed. The reader will learn about the architecture of Big Data-oriented solutions and how it compares with the traditional architecture of Business Intelligence (BI) solutions. Traditional Business Intelligence tools and solutions are still not technologically ready to process Big Data. This has led to the emergence of new approaches to parallel data processing and of new Big Data-oriented technologies. Data analytics plays an important role in Big Data. With relevant analyses, organizations can learn more about their customers, uncover hidden relationships in data, and increase their profits and customer loyalty.
One platform that is technologically ready for processing and analysing Big Data is Apache Hadoop. The platform is described in the theoretical part, which explains the terms Big Data and parallel data processing, and in the practical part of the thesis, where it is used for analytical processing of a pre-selected data file. The thesis therefore explains the basic features of the MapReduce programming framework and the HDFS distributed file system, which together form the Hadoop implementation.
In terms of applicability, the real outcome is the implementation of analytical tasks according to customer requirements. In social terms, the growing number of analytical platforms deployed on top of existing BI solutions in organizations, together with the ever-increasing volume of publicly available data, is a potentially problematic area that will sooner or later run up against the limits of personal privacy.
The practical part of the thesis is based on project requirements from a client company. The project examines whether the Hadoop Big Data platform is suitable for running analytical tasks over relatively small datasets. To verify this, n-gram analysis was performed on the selected data file. The MapReduce framework and the in-memory solutions Spark and TEZ were used as engines within the Hadoop platform. The conclusions of the thesis served as input for further decisions about building a Big Data architecture within the organization and for evaluating the necessary transformation of the existing BI solution for the Hadoop platform.
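The n-gram analysis mentioned above can be illustrated with a minimal sketch of the map, shuffle and reduce phases in plain Python. This is not the code used in the thesis: the function names (map_ngrams, shuffle, reduce_counts, count_ngrams), the whitespace tokenisation and the sample sentences are assumptions made for illustration only, and the sketch runs in a single process rather than on a Hadoop cluster.

```python
from collections import defaultdict
from itertools import chain


def map_ngrams(line, n=2):
    """Map phase: emit (n-gram, 1) for every run of n consecutive words."""
    # Assumption: lower-cased, whitespace-separated tokens.
    words = line.lower().split()
    return [(" ".join(words[i:i + n]), 1) for i in range(len(words) - n + 1)]


def shuffle(pairs):
    """Shuffle phase: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each n-gram."""
    return {key: sum(values) for key, values in groups.items()}


def count_ngrams(lines, n=2):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_ngrams(line, n) for line in lines)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, not the data file analysed in the thesis.
    sample = ["big data needs big data tools", "big data tools scale out"]
    ranked = sorted(count_ngrams(sample).items(), key=lambda kv: (-kv[1], kv[0]))
    for ngram, count in ranked:
        print(ngram, count)
```

On a real cluster the framework distributes the map and reduce calls across nodes and performs the grouping itself; the explicit shuffle function here only stands in for that step.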
Description
Import 22/07/2015
Subject(s)
Big Data, Apache Hadoop, Data analysis, Parallel data processing, Business Intelligence, n-gram analysis, in-memory solutions, Hive, MapReduce, Spark, TEZ