Title Model nadzora i upravljanja inkrementalnim ažuriranjem skladišta podataka
Title (english) Model for Supervision and Management of Incremental Updating in Data Warehouse
Author Ljiljana Brkić
Mentor Mirta Baranović (mentor)
Committee member Damir Kalpić (committee chair)
Committee member Zoran Skočir (committee member)
Committee member Mladen Varga (committee member)
Granter University of Zagreb Faculty of Electrical Engineering and Computing (Department of Applied Computing) Zagreb
Defense date and country 2011-07-12, Croatia
Scientific / art field, discipline and subdiscipline TECHNICAL SCIENCES Computing
Universal decimal classification (UDC) 004 - Computer science and technology. Computing. Data processing
Abstract To achieve a certain level of confidence in the quality of data in a data warehouse, a number of checks must be performed. There are numerous components (and aspects) of a data warehouse that can be tested. This thesis focuses on testing the ETL (Extract-Transform-Load) process. It proposes a generic model and algorithms for integration testing of certain aspects of ETL procedures. The proposed approach treats ETL procedures as black boxes; testing is performed by comparing input and output data sets from three locations: the data sources, the consolidated data staging area, and the data warehouse. The proposed model and algorithms can be applied to any data warehouse that uses a dimensional model and obtains its data from relational databases. They are made generic by metadata describing the data sets being compared and the comparison strategy. The results of the comparison procedures are used in subsequent comparisons to find differences faster. The thesis also proposes a model and methods for horizontal fragmentation of dimension and fact relations. The proposed methods are suitable for implementation in data warehouses in which a suitable fragmentation criterion can be determined, e.g. warehouses that consolidate data from different organizational structures; they are integrated into the load phase of the ETL process. The procedure improves the quality dimensions of completeness and timeliness. Like the proposed integration testing model and algorithms, this procedure is made generic by metadata. An ETL process with implemented horizontal fragmentation automates the finding and recording of errors. The information collected by the process is available to system administrators and forms the basis for sound supervision and management of the data warehouse updating process. It is also visible to end users, which minimizes the time spent searching for erroneous data and the causes of their occurrence.
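To make the black-box comparison idea concrete, here is a minimal sketch in Python, assuming hypothetical in-memory row sets for the three locations (source, staging area, warehouse). The thesis drives the actual comparison from metadata describing the data sets and the comparison strategy, so this is an illustration, not the dissertation's implementation:

```python
import hashlib

def row_fingerprint(row):
    """Hash the comparable attribute values of a row so rows from
    different locations can be matched without inspecting the ETL
    procedures themselves (they remain black boxes)."""
    joined = "|".join("" if v is None else str(v) for v in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def compare_locations(source_rows, staging_rows, warehouse_rows):
    """Compare the data sets from the three locations and report
    where rows were lost or unexpectedly introduced."""
    src = {row_fingerprint(r) for r in source_rows}
    stg = {row_fingerprint(r) for r in staging_rows}
    dwh = {row_fingerprint(r) for r in warehouse_rows}
    return {
        "missing_in_staging": src - stg,       # extracted but never staged
        "missing_in_warehouse": stg - dwh,     # staged but never loaded
        "unexpected_in_warehouse": dwh - stg,  # loaded with no staged origin
    }
```

Differences found in one run could be cached and rechecked first on the next run, mirroring the thesis's point that earlier comparison results are reused to find differences faster.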
Abstract (english) This thesis pertains to certain aspects of data quality problems found in data warehouse systems. It provides a basic testing foundation or augments an existing data warehouse system's testing capabilities. The horizontal fragmentation of dimension and fact relations, implemented in the load phase of the ETL process, automates the finding and recording of errors. The information collected is available to system administrators as well as to end users, providing the basis for quality control and management of the data warehouse updating process.

The introductory chapter gives a brief overview of the research areas: strategies for refreshing data in the data warehouse, data quality and the relevant quality dimensions, and testing of the ETL process and its role in data quality. The motivation, main goals and scientific contributions of the dissertation are explained. At the end of the chapter, the main elements of the relational and dimensional database models are described, as well as the concept of slowly changing dimensions.

The second chapter defines the problem of data integration and describes the data integration system. The roles and stages of the ETL process, as well as different variations of the traditional ETL architecture, are described. The central part of the chapter is dedicated to data warehouse refreshment strategies. Existing techniques for detecting and recording changes in data sources are described, and a comparison of the described CDC (change data capture) techniques is given. The chapter ends with a historical overview of incremental ETL process modelling.

The third chapter gives an overview of some of the most popular commercial and open source ETL tools. Among commercial solutions, Informatica PowerCenter and Microsoft SSIS are described; among open source tools, Talend Open Studio, Pentaho Data Integration and CloverETL are presented. At the end of the chapter the described ETL tools are compared.

The fourth chapter is dedicated to data quality and data quality dimensions. The main part of the chapter deals with the three quality dimensions found crucial for this dissertation: accuracy, completeness and timeliness. An overview of the definitions of these three dimensions encountered in the literature is presented. For each dimension, the expressions the author considers the most mature and complete for determining its numerical value are selected, at the level of attribute values, tuples, relations, and the entire database or data warehouse. For the accuracy dimension, the distinction between semantic and syntactic accuracy is made precise. At the end of the chapter, techniques for improving accuracy in the data warehouse are listed.

The fifth chapter describes the importance of software testing and the purpose and goals of individual types of tests, with emphasis on data warehouse testing. An overview of standards and of previous research in the area of data warehouse testing is given. The adopted methodological framework for data warehouse system testing is described, including the role of particular types of testing: conceptual and logical schema testing, ETL process testing, database testing, front-end testing, and regression and integration testing.

In the sixth chapter, a formal description and implementation details of incremental data warehouse updating are given. The architecture of the data staging area used in the implementation of incremental updates within this dissertation is presented, along with the CDC technique used and the procedure for determining the net change of data within a given timeframe. At the end of the chapter, experimental results are presented, obtained by conducting a series of experiments for three data warehouse refreshment scenarios: initial load, full reloading and incremental updating.

The seventh chapter describes the generic procedure for integration testing of certain aspects of the ETL process developed and implemented within this dissertation. Data structures enabling data lineage in dimension and transactional fact relations are proposed. A formal description is given of the segmentation procedure for the relations participating in a comparison and of the procedure for searching for and matching congruent segments, both for relations with equal schemas and for relations with different schemas. The metadata model that supports the proposed procedure and makes it generic is presented and explained. The role of the subsystem for tracing the records that cause differences between source and destination information is described, and the limitations of the proposed integration testing are stated. The chapter ends with experimental results obtained with the implemented procedure.

The eighth chapter defines and describes the process of horizontal fragmentation of fact and dimension relations in the data warehouse. The semantic integrity of data is defined and definitions of the most important semantic integrity rules are given. The ETL process flow with implemented horizontal fragmentation is described, together with the metadata model that supports it. The end of the chapter argues that applying the proposed procedure improves the completeness and timeliness dimensions. The conclusion gives an abridged overview of the dissertation.
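As an illustration of the load-phase horizontal fragmentation summarized above, the sketch below routes rows to fragments by a fragmentation criterion and records semantic-integrity violations instead of aborting the load. The criterion attribute (org_unit), the rule set and the function names are hypothetical stand-ins for what the dissertation derives from metadata:

```python
def load_with_fragmentation(rows, integrity_rules):
    """Split incoming rows into horizontal fragments and log rows
    that violate semantic integrity rules, so errors are recorded
    automatically during the load phase."""
    fragments, error_log = {}, []
    for row in rows:
        violated = [name for name, rule in integrity_rules.items()
                    if not rule(row)]
        if violated:
            # Record the failing row and its violated rules; the rest
            # of the load proceeds, which helps the completeness and
            # timeliness of the data that does arrive.
            error_log.append({"row": row, "violated": violated})
            continue
        # Route the valid row to the fragment selected by the
        # fragmentation criterion, e.g. the organizational unit.
        fragments.setdefault(row["org_unit"], []).append(row)
    return fragments, error_log

# Example: a single rule requiring a non-negative fact amount.
rules = {"non_negative_amount": lambda r: r["amount"] >= 0}
frags, errors = load_with_fragmentation(
    [{"org_unit": "A", "amount": 10}, {"org_unit": "B", "amount": -1}],
    rules,
)
```

An error log like `errors` corresponds to the information the thesis makes available to administrators and end users, shortening the search for erroneous data and its causes.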
Keywords
ETL proces
inkrementalno ažuriranje skladišta podataka
integracijsko testiranje skladišta podataka
horizontalna fragmentacija
potpunost i pravovremenost u skladištu podataka
Keywords (english)
ETL process
incremental updating in data warehouse
integration testing in data warehouse
horizontal fragmentation
timeliness and completeness in data warehouse
Language Croatian
URN:NBN urn:nbn:hr:168:296161
Project Number: 036-0361983-2012 Title: Semantička integracija heterogenih izvorišta podataka Leader: Mirta Baranović Jurisdiction: Croatia Funder: MZOS Funding stream: ZP
Project Number: 036-0361983-2020 Title: Baze podataka geoprostornih senzora i pokretnih objekata Leader: Zdravko Galić Jurisdiction: Croatia Funder: MZOS Funding stream: ZP
Study programme Title: Computer Science Study programme type: university Study level: postgraduate Academic / professional title: Doktor znanosti (Doctor of Science)
Catalog URL http://lib.fer.hr/cgi-bin/koha/opac-detail.pl?biblionumber=36744
Type of resource Text
Extent 162 pages; 30 cm
File origin Born digital
Access conditions Closed access
Terms of use
Created on 2019-06-28 08:18:35