TECHNICAL UNIVERSITY OF LODZ
Faculty of Electrical, Electronic, Computer and Control Engineering
Computer Engineering Department

mgr inż. Tomasz Marek Kowalski

Ph.D. Thesis
Transparent Indexing in Distributed Object-Oriented Databases

Supervisor: prof. dr hab. inż. Kazimierz Subieta

Łódź 2009

To my wife Kasia…

INDEX OF CONTENTS

SUMMARY ... 6
ROZSZERZONE STRESZCZENIE ... 8
CHAPTER 1 INTRODUCTION ... 13
1.1 Context ... 14
1.2 Short State of the Art of Indexing in Databases ... 14
1.3 Research Problem Formulation ... 15
1.4 Proposed Solution ... 16
1.5 Main Theses of the PhD Dissertation ... 17
1.6 Thesis Outline ... 18
CHAPTER 2 INDEXING IN DATABASES - STATE OF THE ART ... 20
2.1 Database Index Properties ... 21
  2.1.1 Transparency ... 21
  2.1.2 Indices Classification ... 22
2.2 Index Data Structures ... 23
  2.2.1 Linear Hashing ... 26
  2.2.2 Scalable Distributed Data Structure (SDDS) ... 28
2.3 Relational Systems ... 30
2.4 OODBMSs ... 32
  2.4.1 db4o Database ... 35
  2.4.2 Objectivity/DB ... 35
  2.4.3 ObjectStore ... 37
  2.4.4 Versant ... 38
  2.4.5 GemStone's Products ... 39
2.5 Advanced Solutions in Object-Relational Databases ... 40
  2.5.1 Oracle's Function-based Index Maintenance ... 42
2.6 Global Indexing Strategies in Parallel Systems ... 44
  2.6.1 Central Indexing ... 46
  2.6.2 Strategies Involving Decentralised Indexing ... 47
2.7 Distributed DBMSs ... 48
CHAPTER 3 THE STACK-BASED APPROACH ... 50
3.1 Abstract Data Store Models ... 50
  3.1.1 AS0 Model ... 50
  3.1.2 Abstract Store Models Supporting Inheritance ... 51
  3.1.3 Example Database Schema ... 52
  3.1.4 Example Store with Static Inheritance of Objects ... 53
3.2 Environment and Result Stacks ... 54
  3.2.1 Bind Operation ... 55
  3.2.2 Nested Function ... 56
3.3 SBQL Query Language ... 56
  3.3.1 Expressions Evaluation ... 57
  3.3.2 Imperative Statements Evaluation ... 61
3.4 Static Query Evaluation and Metabase ... 62
  3.4.1 Type Checking ... 64
3.5 Updateable Object-Oriented Views ... 65
CHAPTER 4 ORGANISATION OF INDEXING IN OODBMS ... 67
4.1 Implementation of a Linear Hashing Based Index ... 67
  4.1.1 Index Key Types ... 68
  4.1.2 Example Indices ... 69
4.2 Index Management ... 70
  4.2.1 Index Creating Rules and Assumed Limitations ... 73
4.3 Automatic Index Updating ... 75
  4.3.1 Index Update Triggers ... 75
  4.3.2 The Architectural View of the Index Update Process ... 78
  4.3.3 SBQL Interpreter and Binding Extension ... 80
  4.3.4 Examples of Update Scenarios ... 81
    4.3.4.1 Conceptual Example ... 81
    4.3.4.2 Path Modification ... 83
    4.3.4.3 Keys with Optional Attributes ... 85
    4.3.4.4 Polymorphic Keys ... 87
  4.3.5 Optimising Index Updating ... 90
  4.3.6 Properties of the Solution ... 93
  4.3.7 Comparison of Index Maintenance Approaches ... 94
4.4 Indexing Architecture for a Distributed Environment ... 96
  4.4.1 Global Indexing Management and Maintenance ... 97
  4.4.2 Example on a Distributed Homogeneous Data Schema ... 99
CHAPTER 5 QUERY OPTIMISATION AND INDEX OPTIMISER ... 101
5.1 Query Optimisation in the ODRA Prototype ... 102
5.2 Index Optimiser Overview ... 103
  5.2.1 General Algorithm ... 106
5.3 Selection Predicates Analysis ... 107
  5.3.1 Incommutable Predicates ... 109
  5.3.2 Matching Index Key Values Criteria ... 111
  5.3.3 Processing the Inclusion Operator ... 112
5.4 Role of a Cost Model ... 113
  5.4.1 Estimation of Selectivity ... 115
5.5 Query Transformation – Applying Indices ... 118
  5.5.1 Index Invocation Syntax ... 118
  5.5.2 Rewriting Routines ... 120
  5.5.3 Processing Disjunction of Predicates ... 122
  5.5.4 Optimising the Existential Quantifier ... 123
  5.5.5 Reuse of Indices through Inheritance ... 124
5.6 Secondary Methods ... 126
  5.6.1 Factoring Out Independent Subqueries ... 127
  5.6.2 Pushing Selection ... 128
  5.6.3 Methods Assisting Invoking Views ... 129
  5.6.4 Syntax Tree Normalisation ... 130
  5.6.5 Harmful Methods ... 131
5.7 Optimisations Involving a Distributed Index ... 132
  5.7.1 Rank Queries Optimisation ... 134
    5.7.1.1 Hoare's Algorithm in a Distributed Environment ... 136
    5.7.1.2 Modification of Hoare's Algorithm ... 138
5.8 Increasing Query Flexibility with Respect to Indices Management ... 140
CHAPTER 6 INDEXING OPTIMISATION RESULTS ... 142
6.1 Test Data Distribution ... 142
6.2 Sample Index Optimisation Test ... 144
6.3 Omitting Key in an Index Call Test – enum Key Types ... 146
6.4 Multiple Index Invocation Test ... 148
6.5 Complex Expression Based Index Test ... 150
6.6 Disjunction of Predicates Test ... 151
CHAPTER 7 INDEXING FOR OPTIMISING PROCESSING OF HETEROGENEOUS RESOURCES ... 153
7.1 Volatile Indexing ... 153
  7.1.1 Conditions for Volatile Indexing Optimisation ... 154
  7.1.2 Index Materialisation ... 154
  7.1.3 Solution Properties ... 155
  7.1.4 Proof of Concept Test ... 155
7.2 Optimising Queries Addressing Heterogeneous Resources ... 157
  7.2.1 Overview of a Wrapper to RDBMS ... 158
  7.2.2 Volatile Indexing Technique Test ... 160
CHAPTER 8 CONCLUSIONS ... 165
8.1 Future Work ... 167
INDEX OF FIGURES ... 168
INDEX OF TABLES ... 170
BIBLIOGRAPHY ... 171

SUMMARY

The Ph.D. thesis focuses on the development of a robust transparent indexing architecture for distributed object-oriented databases.
The solution comprises management facilities, an automatic index updating mechanism and an index optimiser.

From the conceptual point of view, transparency is the most essential property of a database index. It implies that programmers need not include explicit operations on indices in an application program. Usually a query optimiser automatically inserts references to indices into a query execution plan when necessary. The second aspect of transparency concerns the mechanism maintaining cohesion between existing indices and the indexed data. So-called automatic index updating detects data modifications and reflects them in indices accordingly.

The thesis has been developed in the context of the Stack-Based Architecture (SBA) [1, 117], a theoretical and methodological framework for developing object-oriented query and programming languages. The developed query optimisation methods are based on the corresponding Stack-Based Query Language (SBQL). The orthogonality of SBQL constructs makes it simple to define complex selection predicates accessing arbitrary data. The main goal of the work is to design an indexing architecture facilitating the processing of a possibly wide family of predicates. This requires a generic and complete approach to the problem of index transparency.

The solution presented in the thesis provides transparent indexing employing single- or multiple-key indices in a distributed homogeneous object-oriented environment. The selection of an index structure, either centralised or distributed, is not restricted. The work extensively describes optimisation methods facilitating processing in the context of the where operator, i.e. selection, considering the role of a cost model, conjunction and disjunction of predicates, and class inheritance. The author proposes a robust approach to automatic index updating capable of dealing with index keys based on arbitrary deterministic and side-effect-free expressions. Consequently, optimised selection predicates can be freely composed of various SBQL constructs, in particular algebraic and non-algebraic operators, path expressions, aggregate functions and class method invocations. The solution also takes inheritance and polymorphism into consideration.

A part of the thesis concerns optimisation methods devoted to distributed object-oriented databases, enabling efficient parallel processing of queries. In particular, one of the designed methods concerns the optimisation of rank queries. It enables taking advantage of distributed and scalable index structures.

A particularly difficult query optimisation domain concerns processing queries addressing heterogeneous resources. The volatile indexing technique proposed by the author is a significant step in this matter. This solution relies on the developed indexing architecture. Additionally, it can be applied to data virtually accessible through SBQL views. In contrast to regular indices, a volatile index is materialised only during query evaluation. Therefore, the efficacy of this technique shows when the index is invoked multiple times, which mainly concerns the processing of complex and laborious queries.

A key aspect of developing database query optimisation methods is the preservation of the original query semantics. Consequently, for the designed optimisation methods the author has determined rules in the context of the assumed object data model and the SBQL query language. With this knowledge a database programmer can be assisted and advised, e.g. by the compiler, on how to design safe and optimisable queries. Moreover, the conducted research can also assist database designers. Among other things, the potential influence of other optimisation methods on indexing has been verified.

A significant part of the algorithms and solutions developed in the thesis have been verified and confirmed in the prototype ODRA OODBMS implementation [58, 59].
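The intended transparency can be sketched with a hypothetical SBQL-style example. The schema (Dept objects with an employs sub-object pointing to Emp objects), the index name and the creation and invocation syntax below are illustrative assumptions only, not ODRA's actual notation; the point is that the key may be an arbitrary deterministic, side-effect-free expression, here an aggregate over a path:

```
// Hypothetical index over Dept objects; the key is a complex expression
// (average salary of a department's employees) — deterministic and
// side-effect free, so it qualifies as an index key:
//   create index deptAvgSal on Dept key avg(employs.Emp.sal)

// The programmer writes an ordinary SBQL selection, unaware of the index:
Dept where avg(employs.Emp.sal) > 2500

// The optimiser may transparently rewrite the query into an index
// invocation over the matching key range, schematically:
//   deptAvgSal(2500..*)
```

Automatic index updating then guarantees that any modification of a sal attribute is detected and reflected in the index, so both forms of the query keep returning the same result.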
Keywords: indexing, object-oriented database, distributed database, query optimisation, SBA, SBQL, ODRA

TECHNICAL UNIVERSITY OF LODZ
Faculty of Electrical Engineering, Electronics, Computer Science and Automation
Department of Applied Computer Science

Ph.D. thesis: Transparent Indexing in Distributed Object-Oriented Databases

ROZSZERZONE STRESZCZENIE (EXTENDED ABSTRACT)

Databases are the foundation of many large and, nowadays, often distributed computer systems. Object-oriented technologies facilitate managing systems of such size and complexity. Industry, however, leans towards relational solutions when performance is the key issue. This aspect is still neglected in databases based on object-oriented paradigms, due to the shortage of advanced optimisation procedures.

Indexing is the most important optimisation method in databases. The essential concept of indexing in object-oriented databases does not differ from indexing in relational systems [15, 20, 29, 54, 55, 65]. From the conceptual point of view, the most important property of a database index is transparency. It means that a programmer of database applications does not have to be aware of the existence of indices. Most often a query optimiser is responsible for the automatic utilisation of indices. The second important aspect of transparency is connected with maintaining cohesion between indices and the indexed data. This is the problem of so-called automatic index updating. Modifications in the database should be automatically detected and reflected in the appropriate indices.

In distributed databases the most advanced solutions are based on static index partitioning. They are implemented in the leading object-relational products. However, they only allow index keys to be defined with simple expressions referring to data located in a single table. Concerning global optimisation of queries addressing heterogeneous resources, the author has not found any formalised index-based methods in the scientific literature. The analysis of the state of the art clearly indicates the need to develop indexing methods and an indexing architecture for distributed object-oriented databases.

The orthogonality of the SBQL language allows exceptionally easy definition of complex selection predicates referring to arbitrary data. The main goal of the work is to develop an indexing architecture that supports the processing of a possibly wide family of predicates. This requires a generic and complete approach to the problem of transparency. Since the work concerns distributed object-oriented databases, another important goal is to develop optimisation methods enabling parallelisation of computations, in particular by employing distributed scalable index structures. A particularly difficult domain in the context of optimisation is the processing of queries addressing distributed heterogeneous resources. For this reason, as a further goal, the author set out to identify a transparent and efficient indexing strategy applicable at the level of a global database schema.

A key aspect of the work on all the optimisation methods is preserving the original semantics of queries. To this end the author has determined rules concerning the developed methods in the context of the assumed object data model and the SBQL query language. Knowledge of these rules can help programmers build queries whose form enables automatic optimisation. Additionally, the conducted research can also be helpful to database designers. Among other things, the potential influence of other optimisation methods on the work of the index-based optimiser has been determined.

The solutions proposed by the author in the dissertation are presented in the context of the Stack-Based Architecture (SBA) [1, 117] and the query language derived from it (SBQL, Stack-Based Query Language). The Stack-Based Architecture is a formal methodology concerning object-oriented query and programming languages for databases.

The theses of the dissertation are the following:
1. Processing of selection predicates based on arbitrary key expressions referring to data in a distributed object-oriented database can be optimised by centralised or distributed transparent indexing.
2. Execution of complex queries addressing distributed heterogeneous resources can be supported by techniques employing transparent index-based optimisation.

To substantiate the above theses, the index management system designed by the author and the query optimiser applying indices were used. An additional element, closely connected with the first thesis, is the author's approach to the problem of automatic index updating.

The presented solution provides transparent indexing employing indices based on one or many keys. The optimisation concerns the processing of selection predicates based on arbitrary deterministic and side-effect-free expressions, which may comprise, e.g., path expressions, aggregate functions and invocations of class methods (taking inheritance and polymorphism into account). The proposed indexing architecture can be applied to distributed homogeneous data sources. The choice of an index structure, centralised or distributed, is not restricted in any way. The author has also proposed a method of optimising rank queries which makes it possible to use both existing local indices and a distributed, scalable global index.

The solution proposed by the author to prove the second thesis of the work is the volatile indexing technique. It relies on the same indexing architecture, but additionally it can be applied to the processing of heterogeneous data virtually accessible through SBQL views. In contrast to regular indices, a volatile index is materialised only during query execution. The presented technique is effective in the processing of complex queries in which the index is invoked more than once.

The algorithms and solutions connected with the theses of the work have been to a significant extent verified and confirmed in a prototype implementation in the ODRA object-oriented database [58, 59].

The dissertation is divided into eight chapters, briefly described below:

Chapter 1 Introduction
The first chapter introduces the subject of the work, presents its context, a concise description of the state of the art in the field and the author's motivations. The goals of the work are formulated and the related problems are identified. In this context the theses of the dissertation are discussed in detail and the solutions developed by the author are outlined.

Chapter 2 Indexing in Databases - State of the Art
The state-of-the-art description presents the basic notions connected with indexing in databases. Representative examples of existing solutions in industry and in the scientific literature are cited. The chapter contains an overview of various index structures, with particular attention to linear hashing, which has been used in the author's solution. Additionally, centralised and distributed indexing strategies in various distributed systems are examined.

Chapter 3 The Stack-based Approach
The chapter concerns the theoretical foundations of the theses of the work, i.e. the Stack-Based Architecture (SBA) and the SBQL query language derived from it. Descriptions of the basic notions are given: the environment stack, the result stack, name binding, static query evaluation and updatable object-oriented views.

Chapter 4 Organisation of Indexing in OODBMS
This part of the work presents the indexing architecture designed and, to a significant extent, implemented in the ODRA object-oriented database. The basic properties of the applied index structure and of the index management module are described. The author's mechanism providing transparent, automatic updating of indices, based on the idea of index update triggers, is also presented. The presented concept is extended for the purposes of global indexing in the context of the distributed architecture developed in the ODRA project.

Chapter 5 Query Optimisation and Index Optimiser
The chapter focuses on the methods, developed by the author, of transparent utilisation of indices in query optimisation. Algorithms transforming the intermediate query tree and the rules connected with them are presented. Particular emphasis has been put on preserving the original semantics of a query in the optimisation process. The developed methods are supported with real examples of transformations in the SBQL language. The author also discusses the influence of other query optimisation methods on indexing. The chapter additionally covers optimisation methods dedicated to the processing of global queries in a distributed environment. In this scope the author's approach to the optimisation of rank queries in a distributed database architecture, based on a modified Hoare's algorithm, is presented.

Chapter 6 Indexing Optimisation Results
The chapter presents the results of tests of the implemented indexing system. The results confirm the effectiveness and efficiency of the developed methodology. The tests empirically confirm the correctness of the applied solutions described in Chapters 4 and 5. As a whole they constitute the proof of the first thesis of the dissertation.

Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
This part of the work proves the second thesis of the dissertation. The chapter presents the so-called volatile indexing technique and its application in the optimisation of queries addressing distributed heterogeneous data. The effectiveness of the proposed technique is confirmed by a test in which an SBQL query referring to the resources of an object-oriented database and to resources located in a relational database is optimised.

Chapter 8 Conclusions
The last chapter summarises the work on the architecture of the indexing system for a distributed object-oriented database. The developed solutions and the research results unequivocally confirming the theses of the doctoral dissertation are enumerated. Finally, directions of further research in this field are indicated.

Chapter 1 Introduction

Databases are a fundamental feature of many large computer applications. In many cases databases have to be geographically distributed. The size and complexity of such systems require developers to take advantage of modern software engineering methods, which as a rule are based on the object-oriented approach (cf. the UML notation). In contrast, the industry still widely uses relational databases. While their efficiency in the majority of applications cannot be questioned, many professionals point out their drawbacks. One of the major drawbacks is the so-called impedance mismatch. The mismatch concerns many incompatibilities between object-oriented design and relational implementation.
The mismatch concerns also incompatibilities between object-oriented programming (in languages such as C++, Java and C#) and SQL, the primary programming interface to relational databases. For this reason in the last two decades new and new object-oriented database management systems are proposed. Some of them are well recognized on the market (e.g. ObjectStore, Objectivity/DB, Versant, db4o, and others), however the scale of applications of them is at least the order of magnitude lower than applications of relational systems (some of them extended by object-oriented features). One of the reasons of relatively low acceptance of commercial object-oriented databases concerns their query languages that are considered very limited and treated as secondary in the development of applications. This is in sharp contrast to relational systems, where SQL is considered the primary factor stimulating their successes. In this research we focus on equipping object-oriented database systems with a powerful and efficient query language. The power of such a language should not be lower than the power of SQL. The performance efficiency of such a language requires powerful query optimization methods. Query optimisation in object-oriented database management systems has been deeply investigated over last two decades. Unfortunately, this research remains mostly not implemented in nowadays OODBMSs because of many reasons: limited query languages, non-implementable methods that were proposed, lack of interest of commercial companies, etc. In this thesis we investigate a well-known and the most important method of performance improvement known as indexing. The research addresses this subject in the Page 13 of 181 Chapter 1 Introduction context of the Stack-Based Architecture (SBA), which is a theoretical and methodological framework for developing object-oriented query and programming languages. 
The solutions that we have developed are implemented and tested in the ODRA OODBMS prototype [58, 59], which is based on SBA and its own query language SBQL (Stack-Based Query Language).

1.1 Context

The Stack-Based Architecture (SBA) is a formal methodology addressing object-oriented database query and programming languages [1, 117]. It assumes the object relativism principle, which claims no conceptual difference between objects of different kinds or objects stored on different object hierarchy levels. Everything (e.g. a Person object, a salary attribute, a procedure returning the age of a person and a view returning well-paid employees) is considered an object with its own unique identifier. SBA reconstructs query language concepts from the point of view of programming languages (PLs), introducing notions and methods developed in the programming language domain (e.g. environment stack, result stack, nesting and binding names). ODRA (Object Database for Rapid Application development) is a prototype object-oriented database management system based on the Stack-Based Architecture (SBA) [2, 119]. ODRA introduces its own query language SBQL, which is integrated with programming capabilities and abstractions, including database abstractions: updatable views, stored procedures and transactions. The main goal of the ODRA project is to develop new paradigms of database application development together with a distributed, database-oriented and object-oriented execution environment.

1.2 Short State of the Art of Indexing in Databases

The general idea of indices in object-oriented databases does not differ from indexing in relational databases [15, 20, 29, 54, 55, 65]. The most characteristic property of database indexing is transparency. A programmer of database applications does not need to be aware of the existence of indices, as they are utilised by the database engine automatically.
This is usually accomplished by a query optimiser that automatically inserts references to indices into a query execution plan when necessary. The second important aspect of transparency concerns maintaining cohesion between existing indices and the data that is indexed. Data modifications are automatically detected and the corresponding changes are reflected in indices. This process is called automatic index updating. Many indexing methods can be adopted from relational database systems, and their applicability can even be significantly extended. There are also situations where indexing methods from RDBMSs become outdated in object-oriented databases. In particular, join operations do not need to be supported, because in object databases the necessity for joins is much lower due to object identifiers and explicit pointer links in the database. In the object-oriented database domain the research into indexing has been mainly focused on path expression processing and inheritance hierarchies inside indexed collections [10, 11, 12, 21, 67, 77, 81, 111]. Some papers propose generic approaches to provide automatic index maintenance transparency [43, 46]. However, there is no information that these proposals have actually been incorporated in commercial or open source database products. Indexing is also an important subject in a distributed environment. Most of the research concerns the development of various distributed index structures and global indexing strategies. Many works are conducted in the context of data exchange in p2p networks. In databases, the most advanced solutions are based on static index partitioning. They are implemented in leading object-relational products. Nevertheless, an index key definition is limited to expressions accessing data from only one table.
The author has not found in the research literature any formalised global optimisation methods based on indexing for processing queries involving heterogeneous resources. The analysis of the state of the art unambiguously indicates that the development of indexing methods and architectures dedicated to distributed object-oriented databases is still a valid and challenging subject.

1.3 Research Problem Formulation

The orthogonality of SBQL language constructs allows defining selection predicates using complex and robust expressions accessing arbitrary data. The transparent indexing of objects to facilitate processing queries involving such predicates requires the development of a generic and complete solution. In particular, achieving automatic index updating transparency is simple only in the case of indices defined on simple keys, i.e. direct attributes and table columns. Inheritance, method polymorphism, data distribution, etc. make it difficult to identify the objects influencing the value of an index key. Data processing in a distributed environment enables parallel processing of queries and may take advantage of distributed and scalable index structures. This creates a demand for introducing an appropriate indexing architecture and specific optimisation methods. An even more complex task concerns the evaluation of queries addressing a heterogeneous distributed environment. From the point of view of performance it is vital to exploit local resource optimisation methods and to develop robust techniques improving query processing on the global schema level. Identifying effective transparent global indexing strategies is in this context a significant, but particularly challenging subject. Finally, each optimisation method improving query performance must ensure the preservation of query semantics. Therefore, in the context of a query language and an object model the appropriate rules for exploiting such methods must be determined.
With this knowledge a database programmer can be assisted and advised, e.g. by a compiler, concerning how to design properly optimisable queries.

1.4 Proposed Solution

In order to provide transparent indexing in distributed object-oriented databases, the author of this thesis proposes the following tenets:
• precisely defined index management facilities and a convenient syntax for an index call to be used in query optimisation,
• a set of algorithms, optimisation methods and rules composing the index optimiser, i.e. the module responsible for detecting parts of a query that can be substituted with an index call and for performing the appropriate query transformations,
• a generic automatic index maintenance solution based on index update definitions assigned to indices and the associated index update triggers assigned to objects participating in indexing,
• the volatile indexing technique, enabling taking advantage of the developed indexing architecture while omitting the troublesome issue of automatic index maintenance in the processing of a specific family of queries addressing heterogeneous resources.
The most important properties necessary to provide the desired index behaviour have been implemented in the ODRA OODBMS prototype and are operational [59].

1.5 Main Theses of the PhD Dissertation

The summarised theses are:
1. Processing of selection predicates based on arbitrary key expressions accessing data in a distributed object-oriented database can be optimised by centralised or distributed transparent indexing.
2. Evaluation of complex queries involving distributed heterogeneous resources can be facilitated by techniques taking advantage of transparent index optimisation.
The common basis for accomplishing the theses is the developed index management facilities and the index optimiser. The first thesis is additionally supported by the author's generic approach to automatic index maintenance.
The proposed approach provides transparent indexing using single- or multiple-key indices. It applies to selection predicates based on arbitrary, deterministic and side-effect-free expressions consisting of e.g. path expressions, aggregate functions and class method invocations (addressing inheritance and polymorphism). An extensive part of the work comprises optimisation methods facilitating processing in the context of a where operator (i.e. selection), considering the role of a cost model, conjunction and disjunction of predicates, and class inheritance. The proposed architecture can handle homogeneous data distribution and distributed index structures. The selection of an index structure, either centralised or distributed, is not restricted. The author also introduces an efficient method for the optimisation of rank queries taking advantage of indexing in a distributed environment. The solution proposed by the author addressing the second thesis is the volatile indexing technique. It relies on the same indexing architecture, but also addresses data virtually accessible through SBQL views. A volatile index differs from a regular index in that it is materialised only during query evaluation. Therefore, the efficacy of this technique manifests itself in the processing of laborious queries where the index is invoked more than once. A significant part of the theses has been verified and confirmed by a prototype implementation in the ODRA OODBMS. The only important aspect to be implemented and validated in the future concerns data and index distribution in the context of the first thesis. This element is planned to be finished together with the development of a distributed infrastructure in the ODRA prototype.
1.6 Thesis Outline

The thesis is organised as follows:

Chapter 1 Introduction
The chapter presents a general overview of the thesis subject, the context, the author's motivation, the formulation of the problem and objectives of the research, the theses and a description of the developed solutions.

Chapter 2 Indexing In Databases - State of the Art
The state of the art chapter introduces basic concepts concerning indexing in databases together with an overview of solutions existing in commercial products and in the research literature. Additionally, an inspection of the variety of index structures and indexing strategies applying to centralised and distributed environments is provided.

Chapter 3 The Stack-Based Approach
The theoretical foundation for the thesis is the Stack-Based Architecture (SBA) and the corresponding query language SBQL. The chapter introduces basic notions relevant to the work, including environment and result stacks, static query evaluation and updatable object-oriented views.

Chapter 4 Organisation of Indexing in OODBMS
The chapter presents the designed and implemented indexing architecture in the ODRA OODBMS. It focuses particularly on the basic properties of the employed index structure, the designed index management facilities and the module providing automatic index updating transparency (based on the author's index update triggers concept). Finally, extending the architecture to distributed databases is discussed.

Chapter 5 Query Optimisation and Index Optimiser
The algorithms and rules responsible for taking advantage of indices in the transparent optimisation of queries with respect to query semantics are presented and explained with examples. The chapter includes a description of indexing methods designed for a distributed environment and a discussion of the influence of secondary methods on indexing.
Chapter 6 Indexing Optimisation Results
The chapter presents the results of tests confirming the efficiency of the methods presented in the thesis.

Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
The chapter focuses on the volatile indexing technique and presents its application in the optimisation of queries addressing heterogeneous resources. The description is supported by an appropriate test proving the efficacy of this technique.

Chapter 8 Conclusions
The chapter gives conclusions concerning the achieved objectives and depicts the area of future work.

Chapter 2 Indexing In Databases - State of the Art

Indices are auxiliary (redundant) data structures stored at a server. A database administrator manages a pool of indices, generating a new one or removing an existing one depending on current needs. Just as the index at the end of a book is used for quickly finding pages, a database index makes it possible to quickly retrieve objects (or records) matching given criteria. Because indices are relatively small (compared to the whole database), the gain in performance fully justifies the extra storage space. Due to single-aspect search, which allows for a very efficient physical organisation, the gain in performance can reach even several orders of magnitude. In general, an index can be considered a two-column table where the first column consists of unique key values and the other one holds non-key values, usually references to objects or database table rows. Key values are used as the input for an index search procedure. As the result, the procedure returns the corresponding non-key values from the same table row. In query optimisation indices are usually used in the context of a where operator, when the left operand refers to a collection indexed by key values composing the right operand.
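The two-column view of an index described above can be illustrated with a minimal sketch (in Python, purely for illustration; the names are hypothetical and this is not the ODRA implementation):

```python
# A secondary index as a two-column mapping: key values -> non-key values
# (here, references to matching objects). The key function plays the role
# of the indexed attribute.
class SecondaryIndex:
    def __init__(self, key_fn):
        self.key_fn = key_fn        # deterministic key expression
        self.entries = {}           # key value -> list of object references

    def insert(self, ref, obj):
        self.entries.setdefault(self.key_fn(obj), []).append(ref)

    def lookup(self, key_value):
        # the index search procedure: returns non-key values for a key
        return self.entries.get(key_value, [])

# a predicate like "Emp where city = 'Lodz'" can be answered by the index
db = {1: {"name": "Kim", "city": "Lodz"},
      2: {"name": "Jan", "city": "Warsaw"}}
idx = SecondaryIndex(lambda o: o["city"])
for ref, obj in db.items():
    idx.insert(ref, obj)
assert idx.lookup("Lodz") == [1]    # instead of scanning the whole collection
```

The lookup replaces a full collection scan, which is the source of the performance gain discussed above.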
A database query is expressed in a high-level query language (e.g. SQL, OQL) [29, 33, 118]. Fig. 2.1 presents the general steps of processing a query: syntactic analysis and validation, generation of an intermediate query representation, query optimisation producing a query evaluation plan, query code generation, and execution by the runtime database processor.

Fig. 2.1 Typical stages of high-level language query optimisation [29]

First, the query has to be the subject of syntactic analysis (parsing). Next, it is validated for semantic correctness and accordance with the current database schema. The database uses an internal query representation, usually organised into a tree or graph structure. There might be many execution strategies that a DBMS can follow to obtain an answer to a query. In terms of query results all execution plans are equivalent, but the cost difference between alternative plans can be enormous. The cost is usually measured as the time needed to complete query execution. A database query optimiser should efficiently estimate the cost of a plan. The final steps of query processing consist of code generation according to the designed execution strategy and eventually its execution [29, 54, 118]. An important part of designing an execution plan is the analysis of database indices. The query optimiser should be capable of identifying the parts of the query whose evaluation can be assisted with indexing. Next, with the help of a database cost model, it has to decide which combination of indices would minimise the cost of query execution. An important task of database administrators is to manage the pool of indices, which is a part of the processes of physical design and tuning of a database. The obvious advantages of a database index follow from its physical and conceptual properties. However, when the design is improper, processing queries through indices may harm the global processing time.
The disadvantages are usually caused by frequent database updates, which may totally undermine the gain in query processing due to an index, because the cost of updating the index exceeds the gain from faster query processing.

2.1 Database Index Properties

Indices are an essential constituent of a database's architecture. Obviously, their central feature is a data structure that can be efficiently organised, searched and maintained. Nonetheless, their actual strength lies in the unique properties and versatile utilisation of a database index. A significant advantage, and a partial cause of the success of large database systems, is indexing transparency.

2.1.1 Transparency

In the common approach, the programmer should not involve explicit operations on indices in an application program. To make indexing transparent from the point of view of a database application programmer, two important functionalities ought to be ensured: index optimisation and automatic index updating. The first functionality means that indices are used automatically during query evaluation. Therefore, the administrator of a database can freely establish new indices and remove them without changing the code of applications. The responsibility for ensuring such transparency lies in query optimisation and particularly in the index optimiser. The second functionality, i.e. automatic index updating, is also referred to in the research literature as index maintenance or dynamic index adaptation. It is a response to changes in a database. Indices, like all redundant structures, can lose cohesion with the data if the database is updated. An automatic mechanism should alter, remove or rebuild an index in the case of database updates that affect the currency of its contents. Consequently, the gain in query performance coming from indexing compromises the speed of insertions, deletions and data modifications, since such operations require suitable updates to indices.
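The automatic index updating described above can be sketched as follows (an illustrative Python fragment with hypothetical names, not a fragment of any real DBMS): every modification of an indexed object is mirrored in the index, so the index never loses cohesion with the data.

```python
# Data and a simple index on the 'salary' attribute; an update removes the
# stale index entry and re-inserts the object under its new key value.
db = {}        # object reference -> object
index = {}     # key value -> set of references (the index on 'salary')

def insert(ref, obj):
    db[ref] = obj
    index.setdefault(obj["salary"], set()).add(ref)

def update(ref, new_obj):
    old_key = db[ref]["salary"]
    index[old_key].discard(ref)      # automatic index updating: drop old entry
    if not index[old_key]:
        del index[old_key]           # keep the index free of empty entries
    insert(ref, new_obj)             # re-insert under the new key value

insert(1, {"name": "Kim", "salary": 3000})
update(1, {"name": "Kim", "salary": 3500})
assert index == {3500: {1}}          # the index reflects the modification
```

The extra work performed by `update` is exactly the maintenance cost that may dominate in update-intensive systems.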
Thus, it is an administrator's responsibility to manage indices judiciously so as not to cause an overall deterioration of database performance, particularly in update-intensive systems. In general, databases provide the user with full transparency. Nevertheless, some approaches let administrators and application designers decide about the degree of index transparency and explicitly control the state of indices depending on the need. Occasionally, the transparency is supported only to a limited extent, burdening the database user.

2.1.2 Indices Classification

According to [29] there are three essential kinds of database index:
• primary index – physically ordering data on a disk or in memory according to some unique property field (each record must contain a unique value for such a field – the so-called primary key),
• clustering index – introducing a physical data order according to a non-unique property (i.e. when several data records can have an equal value of the ordering fields),
• secondary index – providing alternative access to data according to designated criteria without affecting their actual location (also called secondary access paths or methods).
Since only one physical ordering is possible, a data table or collection can have only one primary index or clustering index, not both. The limit on the number of secondary indices depends on a database. In reality, some departures from the above definitions often occur. For example, in some databases data and indices are stored separately, and even primary or clustering indices contain only references to the actual data, which are stored physically e.g. in a linked list. Indices can also be classified according to the relation between keys and indexed data. Usually the division is the following:
• dense index – contains an entry for each key value occurring in a database,
• sparse index – associates blocks of ordered indexed data only with a single key value (e.g.
the lowest one). Primary indices are usually sparse, since physically data are often divided into blocks. In addition to dense and sparse, a range index can be considered, since an index can be split into slots representing specified ranges of key values. Another obvious classification of indices concerns their data structure, e.g. a hash table or a B-tree. In databases many index kinds are sometimes combined into one so-called multilevel index. The next subchapter describes the most popular kinds of data structures employed in database indexing systems.

2.2 Index Data-Structures

The most popular data structures used for index organisation are various kinds of B-trees, proposed by Bayer and McCreight [8], and hash tables, invented by H. P. Luhn [14]. The improvement in the efficiency of selection or sorting queries varies with the choice of a proper data structure for indexing the given data. However, each index consumes some amount of database storage space and introduces additional overhead on the time of inserting, modifying or removing indexed data. The individual properties of different structures have been presented in thousands of papers and books devoted to databases and algorithms, e.g. [14, 23, 29]. In the context of this dissertation the kind of exploited physical index organisation is generally insignificant. Only some properties of the index structure are important, in particular:
• key order preservation, i.e. support for range queries,
• support for indexing using multiple keys,
• distribution of an index over multiple servers.
The same index interface from the point of view of the database can be used for a variety of index structures. Therefore, this work omits a detailed discussion of this subject, focusing mainly on the index structure used in the author's implementation, i.e. linear hashing. A hash table uses the hash coding method based on a hash function which maps key values into a limited set of integer values.
A calculated hash value points to an area of memory (called a bucket) holding the corresponding non-key values. This method allows indexed values to be looked up or updated in very short, constant time, particularly when the hash function distributes key values equally. A disadvantage of this technique is the necessity of specifying the size of the index table. However, dynamic hashing and linear hashing (described in the next section) deal with this issue. Another problem appears when two or more keys are mapped to the same location in the table. Similarly, it may happen that two or more objects have the same key value of an attribute. Resolving these so-called collisions leads to a deterioration of index performance. There are many techniques allowing such items to be put in a hash table and queried in a fairly fast way: a rehash function, a linked list approach (separate chaining), a linked list inside a table (coalesced hashing) and buckets. Methods involving linear or dynamic hashing use load control algorithms automatically forcing a hash table to expand in order to prevent performance loss. Another very popular indexing technique is based on B-trees. A B-tree is slightly worse than a hash table from the point of view of search time and frequent data updates, which often involve tree reorganisation. However, its advantages are the simplicity of the algorithm and economical memory consumption. B-trees store keys in non-descending order, so they can be very helpful in laborious queries involving sorting or ranking data. Many different kinds of tree structures are proposed in the literature and incorporated in commercial products, e.g.:
• B+ tree, B# tree, B* tree – varieties of the B-tree,
• AVL tree, splay tree – balanced binary search trees,
• radix tree – optimised to store a set of strings in lexicographical order,
• and many more [14].
Indexing techniques used in data warehousing applications are a bit different from the techniques used in on-line transaction processing.
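The separate-chaining collision technique mentioned earlier in this section can be sketched as follows (an illustrative Python fragment; the class and method names are hypothetical): colliding keys share a bucket holding a small chain of (key, value) pairs that is searched linearly.

```python
class ChainedHashTable:
    def __init__(self, n_buckets=4):
        # deliberately few buckets so that collisions actually occur
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # overwrite an existing key
                return
        chain.append((key, value))        # collision: extend the chain

    def get(self, key):
        for k, v in self._bucket(key):    # linear search inside the chain
            if k == key:
                return v
        return None

t = ChainedHashTable(2)
for k, v in [("name", 1), ("city", 2), ("salary", 3)]:
    t.put(k, v)
assert t.get("city") == 2 and t.get("missing") is None
```

The load-control algorithms mentioned above keep the chains short by expanding the table, which is exactly what linear hashing automates.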
Bitmap indices are stored as bitmaps (often compressed) [17, 66]. Consequently, the answers to most queries can be obtained by performing bitwise logical operations. They are the most effective on keys with a limited set of values (e.g. a gender field) and often use a combination of such keys (i.e. a multiple-key index). When these conditions are met, bitmap indices offer reduced storage requirements and greater efficiency than regular indices. On the other hand, the performance of index maintenance is their serious drawback. Bitmap indices are primarily intended for non-volatile systems, since the method is very sensitive to updates of the indexed data. Reflecting a change requires keeping locks on the segments storing a bitmap index, which is very time consuming. In typical cases bitmap indices are easier to destroy and re-create than to maintain. Other variants of indices for data warehousing have also been developed [84]:
• projection index – quite useful in cases where column values must be retrieved for all selected rows, because they would probably be found on the same index page,
• bit-sliced index – based on processing bitmaps; provides an efficient means of calculating aggregates.
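The bitwise evaluation described above can be sketched as follows (illustrative Python with made-up data): one bitmap per key value, with a set bit marking a matching row, so a conjunction of predicates reduces to a single bitwise AND.

```python
# Toy relation: (gender, city) per row.
rows = [("F", "Lodz"), ("M", "Lodz"), ("F", "Warsaw"), ("F", "Lodz")]

def bitmap(col, value):
    # one bitmap per (column, value): bit i is set when row i matches
    bits = 0
    for i, row in enumerate(rows):
        if row[col] == value:
            bits |= 1 << i
    return bits

# "gender = 'F' AND city = 'Lodz'" evaluated with a single bitwise AND
hits = bitmap(0, "F") & bitmap(1, "Lodz")
matching = [i for i in range(len(rows)) if hits >> i & 1]
assert matching == [0, 3]
```

Note that an update to any row forces the corresponding bit in every affected bitmap to be rewritten, which illustrates the maintenance drawback discussed above.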
Many other index structures have evolved to facilitate various index applications:
• inverted files, signature-based files – two principal indexing methods for text document databases [134] and for indexing according to set-valued attributes with low cardinality [41],
• multi-index, path index, access support relations, T-Index, path dictionary index – for path expression processing in OODBMSs [10, 11, 12, 67, 81],
• inherited multi-index, nested-inherited index, triple-node hierarchies, H-tree, CH-tree, hcC-tree (hierarchy class Chain tree), signature file hierarchies, signature graphs – oriented on facilitating the processing of collections organised in class hierarchies in OODBMSs [10, 11, 12, 21, 77, 111],
• R-Tree, UB-tree, kd-tree, X-Tree, Parametric R-Tree, TPR-Tree (Time Parameterized R-Tree), TPR*-Tree, grid file – for spatial (i.e. multidimensional) and spatio-temporal data, e.g. Geographic Information Systems [30, 32, 40, 125],
• etc.
Another group of index structures can be defined in the distributed environment domain. In general, together with the growth of an indexed dataset, an index can be split into small parts maintained on independent servers, hence utilising their storage (e.g. main memory or disks) and processing power. In contrast to local indices, such an index:
• enables exploiting parallel computing power (therefore, such indices are usually referred to as parallel or distributed indices),
• can be scalable, freely spreading its parts between network nodes without compromising its primary efficiency,
• provides a higher level of concurrency.
An overview of the properties of a distributed index structure based on the idea of linear hashing is given in section 2.2.2. Similarly to local indices, parallel indices have been developed in many variants for various applications and systems, e.g.:
• scalable distributed data structure variants (cf.
section 2.2.2),
• distributed hash table (DHT) [34] for data indexing in peer-to-peer (P2P) networks, e.g. Chord [113],
• scalable distributed B-tree [3],
• a multi-key distributed database index combining a bit vector, a graph structure and a grid file [45],
• psiX, a hierarchical distributed index for XML documents in p2p networks [105],
• DiST, PN-tree – structures for indexing multidimensional (spatial) datasets [4, 16].
All index structures mentioned in this subchapter are only a small fraction of the existing solutions, which are described in thousands of research papers. The next section concerns the linear hashing index, which is an important part of the author's prototype implementation verifying the theses.

2.2.1 Linear-Hashing

Linear hashing is a dynamic indexing structure invented by Witold Litwin [72]. Similarly to a regular hash table, it comprises buckets which store index entries according to some hash function. Linear hashing strives to keep a relation between the number of index entries and the number of buckets in order to ensure constant search, insertion and deletion efficiency and to minimise bucket overflows. Buckets are added (through splitting) and removed (through merging) one at a time, which is made possible by taking advantage of a family of dynamic hashing functions. At the start, a linear hashing structure consists of N0 empty buckets numbered from 0 to N0-1. Three important parameters describe an index state:
• n – the number of the bucket to be split next, if necessary (initially equal to 0),
• j – the current lowest bucket level of the index (initially equal to 0),
• N – the number of buckets, equal to N0·2^j + n (consequently, initially equal to N0).
The buckets from n to N0·2^j - 1 belong to level j, while the rest of the buckets, i.e. from 0 to n-1 and all from bucket N0·2^j to N-1, belong to level j+1.
Index entries are spread over the index according to hash functions h(j, key) depending on the level of the bucket. The target bucket T for a key is determined according to the following formula:

T(j, key) = h(j+1, key)   if h(j, key) ∈ [0, n),
T(j, key) = h(j, key)     if h(j, key) ∈ [n, N0·2^j),

where:
• h(j, key) := hash(key) mod (N0·2^j),
• hash(key) is the basic key hashing function,
• [a, b) stands for the range with the inclusive left limit a and the exclusive right limit b.

The most crucial operation, i.e. splitting, is triggered after an insertion when the index load becomes too high. A new bucket is appended to the buckets table and the elements from bucket n are divided between bucket n and the new bucket n + N0·2^j according to the h(j+1, key) function. It is worth noticing that:

h(j+1, key) ∈ {h(j, key), h(j, key) + N0·2^j}

Next, the parameters n and N are incremented by one. Eventually, when n reaches N0·2^j, indicating that the hash table has doubled its size to N0·2^(j+1), the n parameter is set to 0 and the index level j is incremented by one. Conversely to splitting, if during a deletion the index load falls below some fixed threshold, then merging of buckets is performed. An example bucket split procedure is presented in Fig. 2.2. Bucket entries are represented by the values of their hash(key) functions. The state before the split is presented in Fig. 2.2a. The parameters of the index were the following: n = 0, N0 = N = 100, j = 0, so all buckets are addressed using the h(0, key) function.

Fig. 2.2 Example of a bucket split operation [72]

The split is performed on bucket n, which has already overflowed. A new bucket at the end of the buckets table is allocated and filled with the entries moved from bucket 0 for which h(1, key), i.e. hash(key) mod 200, is equal to 100. Finally, n and N are incremented.
The index state after the split is shown in Fig. 2.2b. As shown, the dynamic expansion of a linear hashing table helps to minimise bucket overflows. An overview of SDDS based on the idea of linear hashing, an efficient structure for distributed indexing, is presented in the following section.

2.2.2 Scalable Distributed Data Structure (SDDS)

SDDS is a scalable distributed data structure introduced by W. Litwin [73, 74] which deals with storing index positions in a file distributed over a given network. Its properties make it a good candidate for indexing global data in a distributed infrastructure (e.g. a grid). SDDS uses LH*, which generalises the linear hashing method described in section 2.2.1 to distributed memory or disk files. In contrast to linear hashing, SDDS buckets can be located on different sites. The LH* structure does not require a central directory, and it grows gracefully, through splits of one bucket at a time, to virtually any number of servers. The SDDS strategies differ in their approach to bucket splitting, which can be managed by a coordinator site, triggered by a bucket overflow, or driven by controlling the index load factor. Applying SDDS significantly extends the features of linear hashing.
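The addressing rule of section 2.2.1 can be sketched in a few lines (illustrative Python; Python's built-in `hash` stands in for the basic hashing function): a bucket number below the split pointer n means the bucket has already been split, so the level j+1 function must be applied.

```python
N0 = 4                                    # initial number of buckets

def h(j, key):
    return hash(key) % (N0 * 2 ** j)      # h(j, key) := hash(key) mod N0·2^j

def target_bucket(j, n, key):
    b = h(j, key)
    return h(j + 1, key) if b < n else b  # re-address already-split buckets

# with j = 0 and split pointer n = 1, bucket 0 has been split into 0 and 4
assert target_bucket(0, 1, 4) == 4        # h(0, 4) = 0 < n, so use h(1, 4) = 4
assert target_bucket(0, 1, 5) == 1        # h(0, 5) = 1 >= n, level-j address
```

In LH*, the same computation is performed by a client against its (possibly outdated) image of j and n, and servers forward misaddressed requests, which is what makes the central directory unnecessary.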
The major advantages of an SDDS index with respect to distributed indexing are the following:
• avoiding a central address-calculation site,
• support for parallel and distributed query evaluation,
• concurrency transparency,
• scalability – it does not assume any constraints on size or capacity,
• an SDDS file expands over new servers when the current servers are overloaded,
• index updating does not demand a global refresh on servers or clients,
• over 65% of an SDDS file is used,
• in general, a small number of messages between servers (1 per random insert; 2 per key search),
• parallel operations on M SDDS buckets require at most 2·M+1 messages and between 1 and O(log(M)) rounds of messages.
The characteristics of SDDS outperform in efficiency the centralised index directory approach (described in detail in section 2.6.1) and any static data structures. Variants of the SDDS index include implementations:
• preserving key order and supporting range queries (e.g. the RP* family of SDDS structures [75]),
• providing high availability, i.e. tolerating unavailability of some server sites composing the SDDS (e.g. LH*RS [76]).

2.3 Relational Systems

System R, developed by IBM (International Business Machines Corporation) Research between 1972 and 1981, was the first database management system implementing the relational model [6]. The innovative solutions developed within the system included a query optimiser utilising indices [15, 65]. An overview of relational query optimisation, including the fundamentals of the approach to indexing, has been collected in [20, 54, 55]. Almost 40 years of research on relational systems has resulted in the development of various indexing aspects. Numerous indexing-based solutions are incorporated in the available commercial products. The major RDBMSs currently are SQL Server by Microsoft [109], DB2 by IBM [24], Informix by IBM [53] and Oracle by Oracle Corporation [91].
The most popular open-source relational systems are PostgreSQL by the PostgreSQL Global Development Group [103], MySQL by Sun Microsystems [83] and Firebird by the Firebird Foundation [31]. The well-known indexing solutions designed for RDBMSs are the following:
• primary index, clustering index, secondary access paths (cf. section 2.1.2),
• multi-key index – enables indexing using a combination of multiple fields,
• derived key index (iSystem DB2/400 by IBM) [51], function-based index (Oracle) [115], functional indexes (Informix) [49] – indices on expressions, built-in functions or user functions that exactly match selection predicates within an SQL where clause,
• computed-column indices (MS SQL Server) – a solution similar to the previous one but relying on an additional table column (a computed column), which can define an indexable expression using derived attributes and user functions; index maintenance relies on maintenance of the computed column [110],
• temporary index – a transient internal structure created automatically by the DB engine or defined manually (described below in this subchapter) [51, 110],
• development of diverse index structures (the topic of subchapter 2.2),
• other product-specific solutions.
In RDBMSs the keys used for defining an index on a table are usually simple values stored in columns. Developers of such an index can use various index structures and mechanisms for assuring index transparency. A query optimiser can easily identify where clauses addressing indexed selection predicates. Modifications to an indexed table are also easy for the DB engine to detect at run time, or even earlier through analysis of an intermediate form of DML (Data Manipulation Language) statements. Insertion or deletion of table rows transparently triggers addition or removal of the appropriate index entries.
Analogously, a modification of any value in a key column results in changes inside the index. Therefore, details of automatic index updating in RDBMSs are usually omitted from technical RDBMS specifications and treated as an implementation issue. Function-based indices and similar solutions, which enable defining keys using expressions addressing more than one table column together with internal or user-written functions, generally do not introduce conceptual difficulties. The functions supporting such indices can be written in a native database language (e.g. PL/SQL) or in an external programming language (C++, Java, etc.). Furthermore, they must be deterministic (i.e. depend only on the state of the database store) and free of side effects (i.e. they do not introduce any changes to data). The idea of function-based indices is derived from optimisation through method (or function) pre-computation or materialisation, widely discussed in the research literature [5, 9, 13, 27, 57, 80]. The optimisation gain relies on pre-calculating the result of a given function or derived attribute for all objects of a collection. The obtained results are used as keys to index objects and are stored inside the index. Thus, when queries are evaluated, the optimiser strives to use the results computed earlier in order to avoid laborious execution of functions and derived attributes. Automatic maintenance of function-based indices simply requires considering modifications to any value stored in the columns used in a key definition. Nevertheless, this aspect of indexing becomes complex when an object-oriented model and language extensions are considered; in extreme cases it may even lead to serious errors (see section 2.5.1). If appropriate indices do not exist, the optimiser can try to facilitate query processing using temporary indices instead.
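The pre-computation scheme described above can be illustrated with a small sketch (Python, purely illustrative; the row layout and the `total` function are invented for the example): the function result is computed once per row and stored as the index key, and an update removes the old entry and inserts the recomputed one.

```python
# Illustrative sketch of function-based index maintenance; not a real
# DBMS implementation. The index stores precomputed f(row) values so
# queries can avoid re-evaluating f.

def build_index(rows, f):
    """Map each precomputed key f(row) to the set of row ids."""
    index = {}
    for rid, row in rows.items():
        index.setdefault(f(row), set()).add(rid)
    return index

def on_update(index, rows, f, rid, new_row):
    """Keep the index consistent when a row changes: remove the old
    entry, recompute the key, insert the new entry."""
    old_key = f(rows[rid])
    index[old_key].discard(rid)
    if not index[old_key]:
        del index[old_key]
    rows[rid] = new_row
    index.setdefault(f(new_row), set()).add(rid)

# Example: index employees on total income (salary + bonus).
def total(r):
    return r["salary"] + r["bonus"]

rows = {1: {"salary": 1000, "bonus": 100}, 2: {"salary": 1500, "bonus": 0}}
idx = build_index(rows, total)
on_update(idx, rows, total, 1, {"salary": 1400, "bonus": 100})
```

The sketch also makes the maintenance trigger visible: only changes to columns read by `f` can alter the key, which is exactly the condition a relational engine checks.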
Applications of temporary indices are described in detail for iSystem DB2/400 by IBM [51]. A temporary index can be created solely for performing joins (e.g. a nested loop join), ordering, grouping, distinct processing and record selection. It is applied by the optimiser to satisfy a specific query request. Such an index can be built as part of a query plan; after query execution it is destroyed and, in effect, it is not reused or shared across jobs and queries. Sometimes a temporary index can be created for a longer period. Such a decision can be made by the DB engine based on analysis of query requests over time. In order to reuse and share such an index, it has to be altered whenever the underlying table changes. The advantage of a temporary index is its shorter access time, as it is stored only in main memory.

2.4 OODBMSs

Index organisation and optimisation in object-oriented database management systems have been deeply researched, see [7, 10, 11, 12, 21, 43, 46, 77, 79, 81, 93, 111]. Experimental database prototypes include: IRIS by Hewlett-Packard, ORION by MCC (Microelectronics and Computer Technology Corporation, Austin, Texas), OPENOODB by Texas Instruments and the ENCORE/ObServer project by Brown University. Some former commercial OODBMSs are: ONTOS by Ontos, ARDENT by ARDENT Software (formerly O2 by O2 Technology), ODE by AT&T Bell Labs and POET by POET Software [29]. OODBMSs are based on a hierarchical object-oriented data model. One of the important notions of the object-oriented model is a reference, i.e. a pointer link to an object. Pointer links express relationships (associations) between objects. As a result of attempts to standardise object-oriented database management systems, the ODMG (Object Data Management Group) [18] proposed OQL (Object Query Language) [29, 33], which to some extent influenced the development of object-oriented query languages. Differences in data models and query languages imply that some indexing techniques are specialised to the relational or the object-oriented approach only.
OQL involves path expressions, composed of object names separated by dots, in order to navigate easily via pointers to objects. Navigation to a pointed object in OODBMSs can be fast, as it is usually resolved at a low level with a direct link. In the relational model such relationships (i.e. primary-foreign key dependencies) require performing joins and, for efficient query evaluation, indices. Nevertheless, some object-oriented systems may implicitly rely on a flat, relational-like data model; in such a case, navigation along a pointer link still requires performing an implicit join among objects. The assumption limiting OQL path expressions is that the operand before a dot operator should not deliver a collection. Much work in OODBMS research has been dedicated to improving the efficiency of processing nested predicates, i.e. predicates based on derived attributes defined using path expressions. These works additionally extend path-expression indexing with consideration of inheritance issues. The most important proposed solutions are Multi-Index, Inherited Multi-Index, Nested-Inherited Index, Path Index, Access Support Relations [10, 11, 12], Triple-node hierarchies [77] and T-Index (focused on semi-structured data) [81]. The efficiency of these methods has been deeply studied, described through appropriate cost models and verified by prototype implementations. The solutions focus on various criteria, such as the cost of retrieval, the cost of update operations or the cost of storage. However, the transparency aspect of automatic index updating is not always precisely explained.
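To make the maintenance problem concrete, the following purely illustrative Python sketch models a path index over a pointer attribute (department → supervisor → name; all identifiers and data are hypothetical). Note that the maintenance routine has to be invoked on every name change, even for employees that supervise no department:

```python
# Sketch of a path index on Department.supervisedBy.name (hypothetical
# schema and data; not taken from any of the cited systems).
# `depts` maps a department id to its supervisor's employee id;
# `emp_names` maps an employee id to a name.

def build_path_index(depts, emp_names):
    """name -> set of department ids supervised by an employee of that name."""
    index = {}
    for did, eid in depts.items():
        index.setdefault(emp_names[eid], set()).add(did)
    return index

def on_name_change(index, depts, emp_names, eid, new_name):
    """Every name change must at least be checked; if `eid` supervises
    any departments, their entries move under the new key."""
    old_name = emp_names[eid]
    emp_names[eid] = new_name
    affected = {d for d, e in depts.items() if e == eid}
    if affected:
        index[old_name] -= affected
        if not index[old_name]:
            del index[old_name]
        index.setdefault(new_name, set()).update(affected)

depts = {10: 1, 20: 2}
emp_names = {1: "JOHN DOE", 2: "JANE ROE", 3: "NO DEPT"}
idx = build_path_index(depts, emp_names)
on_name_change(idx, depts, emp_names, 1, "JOHN SMITH")
```

The cost highlighted in the surrounding text is visible here: the check in `on_name_change` runs for every employee, including those (like employee 3) that never contribute an index entry.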
Generally, it is assumed that each modification of an attribute of a class instance, as well as creation or deletion of an instance, should cause appropriate index updating actions. However, instances of one of the classes accessed by an indexed path expression can be located in different collections. Moreover, these collections can contain an arbitrary number of objects not associated with the indexed objects. These circumstances can make automatic index updating routines inapplicable or can seriously affect the database's performance. Let us consider an example of an OQL query returning data concerning departments that are supervised by an employee named John Doe:

SELECT * FROM Departments d WHERE d.supervisedBy.name = "JOHN DOE"

A path-expression-based index supporting the evaluation of this query concerns only those employees who are department supervisors. Unfortunately, modifying the name of any employee is burdened by the index maintenance mechanisms. This inconvenience is, however, justified. In the approach to automatic index updating presented in [10, 11, 12], all instances of the classes associated with a path-expression-based index need to be taken into consideration to ensure index validity after data modifications. Hence, an index structure often preserves some additional information concerning objects currently not accessed by the given index but located in collections processed during path expression evaluation. An overview of the architecture of a system oriented on indexing based on path expressions is given in section 2.4.5. The distributed object management system H-PCTE (High-performance Portable Common Tool Environment), developed at the University of Siegen [47], proposes a different solution to automatic index maintenance, independent of the kind of index structure and of its contents. This work relies on P-OQL [42], an extended OQL variant designed to reflect the data model of H-PCTE.
The approach is based on so-called index update definitions, which consist of a description of the event causing the need for an index update, a reference to the affected index structure, a query determining the elements for which the respective index entries have to be updated, and a corresponding update operation. These index update definitions can be generated during index creation. The solution handles complex derived attributes, for instance employing regular path expressions and exploiting OQL aggregate functions. On the other hand, the authors outline some limitations of this approach concerning efficiency and the treatment of user methods, giving general suggestions on how these disadvantages could be overcome [43]. Another approach to index maintenance is discussed in detail for function-based indexing [46], developed in the context of Thor, a distributed object-oriented database system developed at the Massachusetts Institute of Technology [71]. It descends from work on optimisation for methods and functions in databases [9, 57]. Indices are maintained using a so-called object registration schema: registration concerns only those objects whose modification can affect an index, and an index update is triggered by a mechanism that checks registration information during object modification. Despite the theoretical genericity of this approach, it has not been fully implemented, since Thor provides object persistence for applications but without support for queries. In [123] the authors present an approach generalising index-based methods by stored queries. They propose to store a response, i.e. the result of a query for the current database state, together with the query. The universality of this solution enables taking advantage of indexing that exploits complex predicates, e.g. aggregate functions.
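Returning to the Thor-style registration schema, a minimal illustrative sketch (Python; all names are invented and do not reflect Thor's actual interfaces) of the idea that only registered objects pay index-maintenance cost might look as follows:

```python
# Minimal sketch of registration-based index maintenance (hypothetical
# names, not Thor's API). Only registered objects trigger index
# updates when modified; all other writes proceed without overhead.

registrations = {}        # object id -> callbacks that refresh indices

def register(oid, callback):
    registrations.setdefault(oid, []).append(callback)

def modify(objects, oid, field, value):
    """Mutate an object; re-run index callbacks only if it is registered."""
    objects[oid][field] = value
    for cb in registrations.get(oid, []):
        cb(oid)

# Example: an index over salaries that tracks only registered object 1.
objects = {1: {"salary": 1000}, 2: {"salary": 2000}}
salary_index = {}

def refresh(oid):
    salary_index[oid] = objects[oid]["salary"]

register(1, refresh)
modify(objects, 1, "salary", 1200)   # registered: index refreshed
modify(objects, 2, "salary", 2500)   # unregistered: no index work
```

The registration table plays the role of the "registration information" checked during object modification; in a real system it would be consulted by a write barrier rather than an explicit `modify` wrapper.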
However, in the context of the traditional approach to database indexing, the stored-query work [123] is close to optimisation by query caching. To the best of the author's knowledge, only a few indexing techniques proposed in the scientific literature have been incorporated into commercial OODBMS products and major prototypes. A careful inspection of the applied indexing facilities is possible through analysis of the major object-oriented DBMSs. The prototypes mentioned above and the commercial products presented in the next sections represent only a part of the existing object-oriented database management systems landscape. Nevertheless, they provide a sufficiently complete overview of the state of the art of indexing in OODBMSs.

2.4.1 db4o Database

The db4o database system by db4objects [25] is designed as a tool providing transparent persistence for objects of an object-oriented language. Native Queries in db4o supply an advanced query interface using the semantics of the Java programming language [22]; another query interface is SODA. Transparent indexing in the db4o OODBMS is provided only for attributes of the classes defining indexed collections [26]. This means that db4o handles index maintenance and query optimisation automatically. The documentation does not present details about index properties, only about their usage. SODA query optimisation allows db4o to use indices, and Native Queries are converted to SODA where possible. Otherwise, Native Queries are executed by instantiating all objects.

2.4.2 Objectivity/DB

The Objectivity/DB by Objectivity [85, 86, 87] approach to object persistence in programming languages is similar to db4o; however, it is considered more an alternative to the traditional understanding of a query language. Objectivity/DB, besides C++, Java and Smalltalk support, provides Objectivity/SQL++, which complies with ANSI-standard SQL-92 and extends it with some object-oriented extensions.
Storage objects, in terms of Objectivity/DB, are used to group other objects and their indices in order to meet space utilisation, performance and concurrency requirements. There are three kinds of storage objects, corresponding to three levels of grouping in the Objectivity/DB storage hierarchy: the federated database, the database and the container. An index structure maintains references to persistent objects of a particular class (the so-called indexed class) and its derived classes within a particular storage object. Objectivity/DB supports indexing on a single class field or a concatenated index on several attributes (key fields). The indexed class is specified when creating an index. An index can be created on any persistence-capable class, i.e. a class whose instances can be stored persistently in Objectivity/DB. Indices can be referred to as sorted collections of references to indexed objects. The order of key values in an index is very relevant to the proper operation of predicate scans. By default, indexed objects are stored in ascending order of their key field values; this can be specified when creating the index. Let us consider index usage in Objectivity/DB. The main goal of an index is to optimise predicate scans. The predicate used in a scan can be one of the following:
• a single optimised condition (=, ==, >, <, >=, <=, =~ – string match) that tests the first key field of the index,
• a conjunction (&&) of conditions in which the first conjunct is an optimised condition that tests the first key field of the index (there is no disjunction – OR – support).
Objectivity provides a way to determine the uniqueness property of an index for a combination of values in the key fields of indexed objects. This can be specified when creating an index; however, the DB does not automatically ensure the property.
Objectivity simply indexes only objects with a unique combination of key field values; a subsequent object with the same combination of key field values is not considered for indexing. Modifications concerning objects of an indexed class in the relevant storage object automatically cause appropriate changes in the index. Additionally, the session's index mode can be used to control updates. It enables determining the time of an index update relative to when indexed objects are modified. The index modes are as follows:
• INSENSITIVE – an update is applied when the transaction commits,
• SENSITIVE – an update is applied when the next predicate scan is performed in the transaction or, if no scans are performed, when the transaction commits,
• EXPLICIT_UPDATE – suppresses automatic updating of indices; an update-intensive application that works in this index mode can update indices explicitly after every relevant change.

2.4.3 ObjectStore

Similarly to db4o and Objectivity/DB, one of the goals of ObjectStore by Progress [88, 89] is to make access to a database transparent for a programming language. In ObjectStore, a collection is an object that groups together other objects. When adding an index to a collection, the order and uniqueness of the index can be specified (by default it is unordered and allows duplicates). In ObjectStore the place of index storage can also be chosen at creation time: a designated database segment or a specified database. By default, the index is stored in the same database, segment and cluster as the collection to which it was added. ObjectStore introduces a so-called multistep index, which can be created using complex path expressions, accessing multiple public data members and methods, as a key.
Additionally, for the purpose of optimising queries involving types that have many subtypes, the idea of the superindex was implemented. By default, adding an index on a type results in recursively adding indices to all its subtypes; yet, for queries over a large and intricate hierarchy of subtypes, such regular indexing can seriously deteriorate processing. Adding a superindex to a type with many subtypes differs from a default index in one essential feature: there is only one superindex. It eliminates the recursion; consequently, only one parent query operation occurs, in contrast to multiple queries when using a regular index. A superindex is updated automatically, just as a default index is. However, there are some flaws regarding the superindex:
• applying a superindex to a small number of subtypes will not bring significant gain,
• starting a query for a subtype gains nothing from the supertype's superindex,
• a superindex cannot be applied to types with subtypes located in different segments of the same database or in a different database.
The last limitation can be used to prevent adding new subtypes located in different databases to a superindexed type. The ObjectStore ODBMS automatically optimises a query applied to a collection. If an index is added to a collection, the database first evaluates the indexed fields and establishes a preliminary result set; next, it applies the non-indexed fields and methods to the elements of this preliminary result set. In ObjectStore optimisation can be done explicitly (by preparing a query) or automatically (otherwise). The latter means that a query is optimised to use exactly those indices that are available on the collection being queried. The automatic optimisation is convenient and effective.
Nevertheless, when a query is to be run many times against multiple collections with potentially different indices, it is recommended to take manual control over the optimisation strategy. ObjectStore supports multistep indices, but provides only partial index maintenance transparency: it automatically updates an index when elements are removed from or added to an indexed collection, but updating an index entry after a data modification must be explicitly arranged by the programmer. Besides all the indexing capabilities mentioned above, ObjectStore can create a primary index for an unordered collection that does not allow duplicates. It is an index used for queries and for looking up objects in such a collection; therefore, the primary index must contain no duplicate keys and must contain all elements of the collection. Thanks to this solution, in some cases look-ups and insertions into or removals from the collection are faster.

2.4.4 Versant

The Versant Object Database by Versant [126] requires explicit use of query language statements in programming language code. The statements appear in the code as strings and are processed at run time by a special utility in order to find and manipulate objects in a database. Versant exploits its native query language, VQL (Versant Query Language), which is similar to SQL with some object-oriented extensions. In Versant, indices are set on a single attribute of a class and affect all instances of the class. Versant uses two kinds of index structures: B-trees and hash tables. Both maintain a separate storage area containing attribute values, but with a different organisation. An attribute can be associated with two indices, one of each kind. A B-tree index is useful for value-range comparisons, while a hash index is better for exact-match comparisons of values. No index inheritance is supported by Versant.
An index can be created on an attribute of only one class; no class inheriting from the one with the index inherits the index. To index subclass attributes, indices need to be set specifically on each subclass, which makes an administrator responsible for index consistency. Advanced transparent indexing in Versant is achieved with virtual attributes, a technique similar to indexing on a computed column in Microsoft SQL Server. This approach enables indexing of derived virtual attributes built using one or more normal attributes of a class [127]. Indices in Versant do not have names, and they are maintained automatically while adding or removing objects or updating the value of an attribute. An extra constraint that a Versant index can enforce is uniqueness. A unique index ensures that each instance of a class has a value of the indexed attribute that is unique with respect to the attribute values in all other instances of the class. In other words, once an attribute receives a unique index, no duplicate value for this attribute can be committed; the database server process first checks the uniqueness constraint. Such uniqueness must be assured by the index administrator and can only be changed by removing the index.

2.4.5 GemStone's Products

The last commercial OODBMSs evaluated by the author are the GemStone products [37]: GemFire, the Enterprise Data Fabric [35], which supports a subset of OQL, and Facets [36], which provides transparent persistence for the Java programming language using an SQL-92-based language with object extensions. These are the only tested databases which support transparent indexing employing path expressions. Both databases originate from the GemStone database, whose approach to indexing has been discussed in [79]. In the context of GemStone, indices address path expressions. The variable name appearing at the beginning of a path is called the path prefix.
Then a path contains a sequence of links and a path suffix, e.g. Employee.worksIn.manager. For each link (for instance, a variable of an object) in the path suffix one index is available, thus forming a sequence of index components. GemStone supports five basic storage formats for objects, one of which is the non-sequenceable collection (NSC). When an object in this format grows large, its representation switches from a contiguous one to a B-tree which maintains the members by their OOPs (GemStone uses unique surrogates called object-oriented pointers, OOPs, to refer to objects, and an object table to map an OOP to a physical location). Every NSC object has an instance variable named NSCDict that is not accessible to the user. If there are no indices on an NSC, the value of NSCDict is nil; otherwise, it is the OOP of an index dictionary. An index dictionary contains the OOPs of one or more dictionary entries. A dictionary entry contains information about the kind of index (either equality or identity), the length of the path suffix and two arrays: the first representing the offset representation of the path suffix and the second holding the OOP of the index component for each instance variable in the path suffix. The index components are implemented using B+-trees. An index component stores information about the ordering of keys in the component's B-tree. If the path suffixes of two or more indices into an NSC have a common prefix, then the indices share the index components for the common prefix. In GemStone, identity indices directly support exact-match lookups, whereas equality indices and identity indices on booleans, characters and integers directly support =, >, >=, <, <= and range lookups. Objects in GemStone may be tagged with a dependency list. For every index component in whose B-tree an object is a value, the object's dependency list contains a pair consisting of the OOP of the index component and an offset.
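The dependency-list tagging just described can be illustrated with a simplified sketch (Python; OOPs are replaced by plain object references and an index component by a dictionary, so this is only an approximation of GemStone's B-tree components):

```python
# Simplified sketch of GemStone-style dependency lists. Each object
# carries (component, offset) pairs; updating the slot at that offset
# propagates the change to the dependent index component.

class Obj:
    def __init__(self, slots):
        self.slots = list(slots)
        self.deps = []            # (index component, offset) pairs

    def update(self, offset, value):
        """Write a slot; propagate to components depending on it."""
        old = self.slots[offset]
        self.slots[offset] = value
        for component, dep_offset in self.deps:
            if dep_offset == offset:      # this slot feeds the component
                component.pop(old, None)
                component[value] = self

name_component = {}               # an equality index component on slot 0
o = Obj(["JOHN DOE", 350])
o.deps.append((name_component, 0))
name_component["JOHN DOE"] = o

o.update(0, "JOHN SMITH")         # propagated to the index component
o.update(1, 400)                  # no dependency on slot 1: no index work
```

The second update shows the selectivity of the mechanism: slots that feed no index component are modified without any maintenance cost.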
The pair indicates that if the value at the specified offset is updated, then an update must be made to the corresponding index component. Consequently, an index component is automatically dependent on the value of the object at the given offset.

2.5 Advanced Solutions in Object-Relational Databases

A very promising feature of relational systems extended with object-oriented capabilities is indexing using keys defined on expressions consisting of derived attributes and internal or user methods (described in subchapter 2.3: derived index, functional indexes, function-based index and computed-column indices). Plain relational systems impose limits on index key definitions:
• an index key can only be calculated using data in the current tuple, since SQL does not enable defining an index using data from other tables associated through a primary-foreign key relationship,
• SQL aggregate functions are forbidden in index definitions, since a simple SQL expression used in a selection predicate for a table returns a single value.
Without advanced object-oriented extensions there is no support for methods associated with tuples, polymorphism or path expressions. Such limitations also apply to the majority of indexing techniques in ORDBMSs. The author has not found any object-relational DBMS supporting indexing using aggregate functions or path expressions. Relatively complex indices involving method invocations and polymorphism in the object-relational environment can be created using the Oracle function-based indices feature [108, 115]. The Oracle documentation does not provide extensive information concerning the automatic maintenance of such indices; to identify the properties of Oracle's approach, the author performed the tests described in the next section.
Besides regular indexing facilities, some products introduce robust extensions for advanced indexing purposes. As examples, let us consider two solutions provided by IBM, i.e. the Virtual-Index in Informix [50] and Index Extensions in DB2 [114]. These tools are dedicated to experienced database programmers who require indexing mechanisms going beyond standard database capabilities, e.g.:
• creating secondary access methods (i.e. indexing) that provide SQL access to non-relational and other data that does not conform to built-in access methods (e.g. a user-defined access method retrieving data from an external location),
• creating specialised index support to take the semantics of structured types into account,
• introducing various index structures.
Nevertheless, to take advantage of such extensions the user often needs to define specialised routines, in particular ones responsible for index maintenance (a key generator) and for performing index scans (a range producer). Therefore, the solutions presented above do not fulfil the indexing transparency property. DB2 additionally introduces a transparent indexing technique for semi-structured XML (Extensible Markup Language) data [48]. The so-called pureXML feature allows storing well-formed XML documents in their native hierarchical form in table columns that have the XML data type. XQuery (XML Query Language), SQL, or a combination of both can be used to query and update XML data. An index over XML data indexes a part of a column, according to a definition limited to an XPath (XML Path Language) expression. Hence, an index key can be the value of an atomic-type element nested in an XML structure stored in a column. XML data are stored entirely in table columns, so modifications done to XML data can easily be reflected in the index. The author did not encounter any transparent indexing solutions that would enable indexing using keys more advanced than the solutions presented above.
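The essence of such an XML index, i.e. extracting the value at a fixed path from each document and using it as a key, can be sketched as follows (illustrative Python, not DB2's implementation; the schema and path are hypothetical):

```python
# Illustrative sketch of an index over an XML-typed column: each
# document is indexed by the text of the element at a fixed path,
# here "dept" inside a hypothetical <emp> document.
import xml.etree.ElementTree as ET

def index_xml_column(column, path):
    """Map the text of the element at `path` to the set of row ids."""
    index = {}
    for rid, doc in column.items():
        node = ET.fromstring(doc).find(path)
        if node is not None:      # documents lacking the path are skipped
            index.setdefault(node.text, set()).add(rid)
    return index

column = {
    1: "<emp><name>Kuc</name><dept>HR</dept></emp>",
    2: "<emp><name>Smith</name><dept>IT</dept></emp>",
}
idx = index_xml_column(column, "dept")   # path relative to the root element
```

Because the whole document lives in one column, any modification to it can be detected there and the affected index entry recomputed, which is why such maintenance stays simple.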
The next subsection focuses on an evaluation of the function-based index technique, which is one of the most advanced and the most relevant to the author's work.

2.5.1 Oracle's Function-based Index Maintenance

In order to verify the properties of function-based index maintenance, we introduce the following example of a database schema (Fig. 2.3):

Fig. 2.3 Example object-relational schemata

The method getTotalIncomes of EmpType returns the value of the salary attribute of a tuple. It is overridden in empStudentType in order to also consider the value of scholarship. The emp table consists of tuples of both types. Creating an index on such a method associated with a table is straightforward:

CREATE INDEX emp_gettotalincomes_idx ON emp e (e.getTotalIncomes());

Such an index is used automatically by the query optimiser. The efficiency of the selection process is improved not only by reducing the number of processed rows but also by avoiding method invocation, since the calculated results can be taken from the index. The index efficacy has been tested on a series of simple queries. Modifications to a salary or a scholarship attribute, e.g.

UPDATE emp e SET e.salary = 1500 WHERE e.name = 'KUC';

trigger appropriate changes in the index. As anticipated, the time of such a data alteration deteriorates after adding the emp_gettotalincomes_idx index: processing an update takes more than three times longer, because automatic index maintenance needs to alter the corresponding index entries. As far as could be verified, tests have shown that the created index works correctly. Unfortunately, defining an index on method calls in Oracle exhibits some unexpected disadvantages: the index update operations are also triggered during the modification of any name attribute in the emp collection.
Hence, after creating the emp_gettotalincomes_idx index, alteration of any attribute of an emp tuple is similarly more than three times slower. This is caused by unnecessary index updating routines: the Oracle approach to index updating for method-based indices consists in triggering index update routines upon modifications to any data in a tuple with associated index entries. The disadvantage mentioned above grows into a serious problem when the method used to define an index key accesses data outside the indexed tuples. For example, the method getYearCost of DeptType has the following definition:

CREATE OR REPLACE TYPE BODY dept_type IS
  MEMBER FUNCTION getyearcost RETURN NUMBER DETERMINISTIC IS
  BEGIN
    DECLARE
      counter NUMBER;
    BEGIN
      SELECT sum(salary) INTO counter
      FROM emp e
      WHERE e.dept.name = self.name;
      RETURN counter * 12;
    END;
  END;
END;

It accesses not only the data of the given DeptType tuple but also reaches the emp collection. Oracle also enables indexing the dept collection according to the getYearCost method:

CREATE INDEX dept_getyearcost_idx ON dept d (d.getYearCost());

Similarly to the case of emp_gettotalincomes_idx, a command altering dept tuples triggers updating of the index. However, modifications done to emp tuples, e.g.

INSERT INTO EMP
  SELECT emp_type('John Smith', 350, REF(d))
  FROM DEPT d WHERE d.name = 'HR';

are not taken into consideration, and the dept_getyearcost_idx index loses consistency with the data. Unfortunately, queries which use the index, e.g.:

SELECT d.name, d.getyearcost() FROM DEPT d WHERE d.getyearcost() < 24500;

can return incorrect answers, since the selection process and the final results depend on the index contents. Hence, the applied index updating solution is not adequate for handling indices with keys based on "too complex" methods. In practice, the function-based indices feature in Oracle can lead to erroneous behaviour of database queries and applications.
The reference dept in EmpType associates employee tuples with departments. It can be used to formulate selection predicates employing path expressions, e.g.:

SELECT e.name FROM emp e WHERE e.dept.name = 'HR';

Nevertheless, using such path expressions to define an index is forbidden:

CREATE INDEX emp_deptname_idx ON emp e (e.dept.name);
ORA-22808: REF dereferencing not allowed

because it would require accessing a tuple from another table, which obviously would make the index maintenance impossible.

2.6 Global Indexing Strategies in Parallel Systems

Various indexing approaches have been developed in distributed systems over the last two decades. Most of the interesting solutions have been implemented in the domain of p2p networks [104, 107]. Work [124] introduces a detailed taxonomy of indexing strategies (described as index partitioning schemes) for distributed DBMSs. It analyses index maintenance strategies and storage requirements in the context of data partitioning in relational systems. It assumes that an index is partitioned over the same nodes as the data. The factors considered as the foundations of the given taxonomy are:
• the degree of index replication between system nodes (non-, partial-, full-),
• index partitioning in the context of data partitioning, i.e. the method determining how index entries are distributed among system nodes.

Generally, the local indexing strategy implies that indices are built locally on the local data. Distributed indexing occurs when the partitioning of the index differs from the partitioning of the data. The taxonomy, however, omits the centralised indexing strategy, which is very important.

Local data indexing is the most common optimisation method used in database systems. Moreover, it is also applicable to indexing the data of a single peer in a distributed environment. There are several advantages of the local indexing strategy in a distributed database environment.
The knowledge of indices existing in local stores need not be available at the level of a global schema. A query addressing the global schema can in many cases be decomposed during optimisation into subqueries addressing particular servers. Such a subquery concerns data stored locally on a target site. Before evaluation, it can be optimised by a local optimiser in order to take advantage of existing local indices. Global query optimisation is thus divided between servers, and the global optimiser need not take local optimisations into account. Consequently, local indexing is transparent for global applications. Since data and indices are located on the same machine and in the same repository, the implementation of all indexing mechanisms, including index management and maintenance, is standard; in contrast to distributed-environment indexing techniques, it is not as complex. However, local indexing is not always sufficient with regard to the computational power of a distributed database.

Global indices can be kept by a global store. This approach has significant potential for the optimisation of global queries. The idle time of a global store can be used for indexing and cataloguing the data held by local servers.

From the users' point of view, distributed technology should satisfy the following general requirements: transparency, security, interoperability, efficiency and pragmatic universality. Distributed or federated databases and data-intensive grid technologies, which can be perceived as their successors, aim at providing transparency in many forms: location, concurrency, implementation, scaling, fragmentation, replication, failure transparency, etc. [39]. Transparency is the most important feature for reducing the complexity of a design and for supporting the programming and maintenance of applications addressing distributed data and services.
It greatly reduces the complexity of a global application. One of the forms of transparency concerns indexing. As in centralised databases, programmers should not involve indices explicitly in the code of applications. Any performance enhancements should be on the side of database tuning, which is the job of database administration. There are several important aspects connected with transparent indexing in distributed databases:
• location and access transparency – the geographical location of indices should not affect the users' work,
• scaling and migration transparency – indices should be maintained in such a way that server data may be migrated, added or removed without any impact on the consistency of applications,
• failure transparency – indices should be updated or migrated if some of the nodes fail,
• implementation and fragmentation transparency – the user need not know how indices are implemented or partitioned,
• concurrency transparency – users can access indexed resources simultaneously and need not know that other users exist.

The next sections discuss the basic properties of centralised and distributed approaches to indexing.

2.6.1 Central Indexing

The most common practice for indexing distributed resources is dedicating one server to an index repository. This strategy is called central indexing and has certainly proved its value in many internet applications. It played a particularly important role in the development of p2p networks. For example, Napster, an application allowing for sharing music files, used a directory server to locate desired resources [104, 107]. The features of this approach include:
• a small amount of necessary communication,
• efficiency for selective queries,
• architectural simplicity.

However, there are also some disadvantages resulting from central indexing. The indexing server becomes a single point of failure.
Moreover, the query evaluation performance deteriorates if a server is overloaded (i.e. too many clients use an index simultaneously) or fails. Also, this approach does not take advantage of parallel computations. 2.6.2 Strategies Involving Decentralised Indexing In the Gnutella [38] p2p network each participating node is responsible for answering and forwarding search requests (a so-called flooded request model). It is an example implementation of the local indexing strategy. However, features of the Napster solution have proved to be superior and resulted in better performance than Gnutella. An efficient possibility of decentralised indexing is the use of global distributed and parallel indices, e.g. SDDS (see section 2.2.2). These kinds of indices assume that a searched key-value points to another server where it can be further forwarded or desired non-key values can be found. A simple example of such technique could be indexing employees by their profession. One server can store references to all employees whose profession starts with a letter A, another server starting with a letter B, etc. The performance comparison of local indexing strategies (described as partialindexes) and distributed indexing (referred as partitioned global indexes) in query processing of horizontally fragmented data is a topic of [70]. The evaluation was in favour of the strategy utilising a distributed index. Similar investigation has been performed in the context of an inverted index for parallel text retrieval systems [19]. The conducted research indicated that the local index strategy should be preferred in case when queries exploiting indices are infrequent. The advantages of the distributed indexing strategy in contrast to the centralised one are the following: • it uses the computing potential of a grid (enabling a parallel query evaluation), • it is insensitive to overloading, • it decentralises necessary communication. 
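The letter-range example above can be made concrete with a short sketch. This is an illustrative model only (each "server" is a plain dictionary, and all names are hypothetical); it shows the essential routing property: an exact-match lookup touches exactly one index server, while different key ranges can be served in parallel.

```python
# Sketch of a distributed global index partitioned by key range
# (first letter of the key); "servers" are simplified to dicts.
import string

class DistributedIndex:
    """Routes each key to one of n index servers by its first letter."""
    def __init__(self, n_servers):
        self.servers = [{} for _ in range(n_servers)]

    def _server_for(self, key):
        # letter-range partitioning: 'A'..'Z' spread over the servers
        pos = string.ascii_uppercase.index(key[0].upper())
        return self.servers[pos * len(self.servers) // 26]

    def add(self, key, ref):
        self._server_for(key).setdefault(key, []).append(ref)

    def lookup(self, key):
        # a single key touches exactly one server; range queries could be
        # evaluated on several servers in parallel
        return self._server_for(key).get(key, [])

idx = DistributedIndex(n_servers=3)
idx.add("analyst", "emp:1")
idx.add("builder", "emp:2")
idx.add("analyst", "emp:3")
print(idx.lookup("analyst"))   # ['emp:1', 'emp:3']
```

Note that index entries for a given key may live on a different server than the data they reference, which is precisely the maintenance difficulty discussed below.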
The organisation and architecture of such an index are more complex than in the case of a central index. Sites can dynamically join or leave a community, forcing the reorganisation of a part of or the whole index. Achieving scalability of an index distribution requires the use of advanced algorithms and data structures whose complexity can have a disadvantageous impact on index performance. It is common to all global indexing techniques that index entries stored on server X are not associated with data stored on server X, so maintaining convergence between the data and the index is more difficult and has to be done at the global level. Some works, e.g. [7], consider indexing schemes for a distributed page-server OODB, recognising local caching of a centralised index as a distributed indexing strategy. Nevertheless, this technique does not introduce a significant performance improvement for parallel query processing.

2.7 Distributed DBMSs

Despite the relatively large number of distributed relational and object-oriented DBMSs, only a small fraction of them has global indexing capabilities. The most advanced solutions are based on index partitioning, e.g. SQL Server and Oracle.

In databases, partitioning usually refers to tables or indices. The common model of table partitioning in distributed databases relies on the static division of data into independent datasets [92]. Data are partitioned horizontally by the "declustering" of relations based on a function (usually a hash function or a range index). This kind of partitioning is static, since the rules assigning datasets to designated partitions do not change without an administrator's interference. With a hash function the data can be partitioned according to one attribute or a combination of several attributes. Such an approach enables efficient processing of exact-match queries, often independently within a single partition.
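The static hash-declustering scheme just described can be sketched as follows. The hash function and the row layout are illustrative assumptions, not taken from any particular DBMS; the point is that the partition of a row is fully determined by the declustering attribute, so an exact-match query is answered within a single partition.

```python
# Illustrative sketch of static horizontal partitioning by a hash function.
N_PARTITIONS = 4

def partition_of(key):
    # a stable toy hash; real systems use their own hash or range functions
    return sum(key.encode()) % N_PARTITIONS

partitions = [[] for _ in range(N_PARTITIONS)]

for name in ["KUC", "KOWALSKI", "SMITH"]:
    partitions[partition_of(name)].append({"name": name})

# exact-match query: only one partition is consulted
def find(name):
    return [r for r in partitions[partition_of(name)] if r["name"] == name]

print(find("KUC"))
```

The "static" character shows up directly: changing `N_PARTITIONS` or the hash invalidates every placement, which is why such rules only change with administrator interference.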
As a representative example, Oracle's approach to index partitioning is discussed in this subchapter. Its details are described in [82, 133]. Oracle enables creating a partitioned index for a partitioned or a non-partitioned table. On the other hand, a partitioned table can have both partitioned and non-partitioned indices. If the key partitioning a table is identical to the key partitioning a corresponding index, then the index is local. In the remaining cases, we deal with a global index. Nevertheless, the partitioning of tables and of indices uses the same mechanisms, so the number of index partitions in the case of global or local indices usually does not differ. Moreover, local indexing is superior to global indexing in the efficiency of index maintenance. Consequently, it is the most commonly used indexing strategy. Partitioned indices inherit the majority of regular index features, e.g. they can be defined using function-based expressions. In Oracle's approach, partitioning is not a means of data integration and partitions are not managed autonomously. Therefore, a global or local partitioned index can be created only on an entire table.

The research in [45] concerns the architecture of a multi-key distributed index. It proposes a distributed index composed of two types of index structures: a Global Index (GI) and Local Indices (LIs). The GI is the part managed at the distributed database's level and each LI is created and maintained by local database components. In such an architecture different indexing aspects are described, e.g. query optimisation, index implementation and maintenance (referred to as dynamic adaptation [44]), together with an evaluation of performance. Generally, the capabilities of the presented approach do not surpass Oracle's index partitioning solution described above.
The SDDS index structure was employed in the SD-SQL (Scalable Distributed SQL) Server database [106] in order to distribute data dynamically and transparently between separate database instances. Table rows are moved between sites according to primary key values by SDDS algorithms. This approach overcomes the limitations of static table partitioning, improving data load balancing. SD-SQL Server automatically manages and accordingly queries database instances. The solution is built on top of SQL Server using database stored procedures.

Chapter 3 The Stack-based Approach

The Stack-based Architecture (SBA) [1, 117, 118] is a formal methodology concerning the construction and semantics of database query languages, especially object-oriented ones. SBA is a coherent theory that enables creating a powerful query language for practically any known data model. The basic assumption behind SBA is that query languages are variants of programming languages. Consequently, notions, concepts and methods developed in the domain of programming languages should also be applied to query languages. In particular, the main semantic and implementation notion in the majority of programming languages is the environment stack. It is an elementary structure used for defining name spaces, binding names, calling procedures [120] (including recursive calls), passing parameters and supporting object-oriented notions such as encapsulation, inheritance and polymorphism. The Stack-based Approach to query languages exploits the environment stack mechanism in order to define and implement operators specific to query languages, such as selection, projection, navigation, join and quantifiers. Taking advantage of the semantics based on the environment stack, SBA makes it possible to achieve full orthogonality and compositionality of the operators.
Moreover, SBA enables the seamless integration of a query language with imperative constructs and other programming abstractions, including procedures, types and classes. This chapter contains a brief description of the basic SBA notions and presents the model query language SBQL (Stack-Based Query Language) developed according to SBA. More details on SBA and SBQL can be found in [117, 118].

3.1 Abstract Data Store Models

SBA deals with several universal models of object stores. Depending on their complexity, they are referred to as abstract store models AS0, AS1, AS2 and AS3 (previously M0, M1, M2 and M3 were used, correspondingly). Each successive model extends the previous one with some new features. The mentioned models do not exhaust all possibilities; however, they cover most of the currently known ones.

3.1.1 AS0 Model

The AS0 model is built according to the relativity and internal object identification principles. It is a very simple data store model that is capable of representing semistructured data [121]. In AS0 each object comprises an internal identifier (implicit for the programmer), an external identifier (an object name available to the programmer) and a value. There are three kinds of objects: atomic, reference and complex. Assuming that I denotes the set of all acceptable internal identifiers, N the set of acceptable external names of objects, V the set of simple values like numbers, strings, etc., and O denotes any set of AS0 objects, we can define objects as the following triples (where i1, i2 ∈ I, n ∈ N and v ∈ V):
• Atomic objects <i1, n, v> – the simplest kind of objects. They are identified by the internal identifier i1, have the name n and hold an atomic value v.
• Reference objects <i1, n, i2> – they model relations between objects. As in the previous case, they are identified by an internal identifier i1 and have a name n. Their value is an identifier i2 referring to some object.
• Complex objects <i1, n, O> – used to model object nesting. The object with internal identifier i1 and name n consists of the objects belonging to O. The elements of O are considered subobjects of the object having i1 as its identifier.

3.1.2 Abstract Store Models Supporting Inheritance

The AS1 store model extends AS0 with classes and static inheritance. A class is a plain complex object containing subobjects which represent the invariants of a certain group of objects. Additionally, an inheritance relation between class objects can be defined. Apart from the inheritance relation, there is a relation defining an object's membership in a corresponding class.

The AS2 model introduces the notion of an object's dynamic role. Each object can be associated with one or more such roles. If an object is the owner of a role, its situation is similar to being a class instance. However, whereas inheritance has a static character, at run time an object can take on new roles and lose old ones, and inheritance between roles is dynamic [56, 98].

The AS3 model extends the AS1 (AS3.1) or AS2 (AS3.2) model with the encapsulation mechanism. It is assumed that each class can be equipped with an export list, which is a set of class field names that are explicitly visible outside the implemented class instances. Other fields are not visible and are treated as private.

3.1.3 Example Database Schema

The example schema in Fig. 3.1 is introduced as a basis for presenting conceptual examples in this and in the following chapters. The abstraction level of the schema relates to the AS1 store model described in the previous section. Therefore, the schema consists of hierarchical objects, pointer links between objects, classes, static inheritance and multiple inheritance. These are the most relevant elements from the point of view of object-oriented modelling.
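Returning for a moment to the AS0 triples of section 3.1.1, they can be rendered in a few lines of Python. The store contents and identifiers below are illustrative only; the sketch shows how atomic, reference and complex objects fit one uniform representation.

```python
# A minimal rendering of AS0 store triples: atomic objects <i, n, v>,
# reference objects <i, n, i2> and complex objects <i, n, O>, kept in one
# dictionary mapping internal identifiers to (name, value) pairs.
store = {
    "i1": ("Emp", {"i2", "i3", "i4"}),   # complex object: value is a set of ids
    "i2": ("name", "Kowalski"),          # atomic object
    "i3": ("salary", 1200),              # atomic object
    "i4": ("worksIn", "i5"),             # reference object: value is another id
    "i5": ("Dept", {"i6"}),
    "i6": ("name", "CNC"),
}

def subobjects(i):
    """Names of the subobjects of a complex object i."""
    _, value = store[i]
    return sorted(store[sub][0] for sub in value)

def deref(i):
    """Follow a reference object to the object it points at."""
    _, target = store[i]
    return store[target]

print(subobjects("i1"))        # ['name', 'salary', 'worksIn']
print(deref("i4"))             # ('Dept', {'i6'})
```

The relativity principle is visible here: a Dept is just another complex object, and nesting is expressed only through identifier sets.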
The indexing solution for the ODRA database management system supports many features that result from adapting the AS1 store model.

Fig. 3.1 Example of an object-oriented database schema for a company (classes PersonClass, StudentClass, EmpClass and EmpStudentClass, structure types DeptType and AddressType, and the worksIn/employs associations)

The example schema illustrates the personnel records of a company. It introduces several classes, PersonClass, StudentClass, EmpClass and EmpStudentClass, and two structure types, DeptType and AddressType. Persistent instances of the classes mentioned above can be accessed using their instance names Person, Student, Emp and, finally, EmpStudent. Objects called Dept have the DeptType structure with a primary attribute name and represent the departments of the company. Each Person object stands for a person connected in some way with the company. Its attributes provide basic information: name, age and marital status. Additionally, each Dept and Person object includes an address subobject which specifies a city, a street name and optionally a zip code, according to the AddressType structure. Instances of the EmpClass represent the current employees of the company and extend the Person object attributes with the salary attribute. Emp and Dept objects are associated by references. The worksIn reference of an Emp object leads to a department. Dept objects contain employs references to the department's employees. Another class which extends the PersonClass is the StudentClass.
Its objects refer to students who are granted a scholarship by the company. For that reason, this class introduces the scholarship attribute. The last class presented in the schema is called EmpStudentClass and, as its name suggests, it inherits from both EmpClass and StudentClass. It is introduced to represent students who are simultaneously employees of the company. In SBQL, using the name Person returns all instances of the PersonClass class and its subclasses. Similarly, via the name Emp the programmer refers to both EmpClass and EmpStudentClass instances. Besides attributes, classes are composed of methods. Taking advantage of polymorphism, some methods are overridden in derived subclasses. E.g. the getTotalIncomes() method of EmpClass returns the value of the salary attribute, but for instances of the EmpStudentClass it returns the sum of the salary and scholarship attributes.

3.1.4 Example Store with Static Inheritance of Objects

Referring to the data schema in Fig. 3.1, we introduce the example store shown in Fig. 3.2, consistent with the AS1 model (cf. section 3.1.2), presenting classes and objects, their values, identifiers and the most important relations between them. An identifier is a property of every database entity. This sample store consists of two objects of the DeptType type and two instances of EmpClass (one of them also being an EmpStudentClass instance). One Emp object describes Marek Kowalski, a person who works in the CNC department. The EmpStudent object depicts Piotr Kuc, a student who is employed by the HR department. The classes PersonClass and StudentClass are omitted, but according to the schema in Fig. 3.1 they are present in the database.

Fig. 3.2 Sample store with classes and objects (the EmpClass and EmpStudentClass class objects with their method subobjects, the Emp and EmpStudent objects with their attribute, address and worksIn subobjects, and two Dept objects named „CNC" and „HR" with employs references)

3.2 Environment and Result Stacks

The semantics of a query language in the Stack-based Approach is explained using two stacks: the environment stack (which was mentioned earlier) and the result stack. The environment stack (ENVS) controls the name binding space. This stack consists of sections and each section holds binders. A binder is a construct used to bind a name to an appropriate run-time entity. It is assumed that binders are written as n(r), where n ∈ N, r ∈ R. The R set denotes all possible query results. This brings us to the second stack, the result stack (QRES). It is used for storing temporary and final query results. The following elements r belong to the R set of query results:
• Value results (a number, a character string, a logical value, a date, etc.) – these are results of literal expressions or arise through dereference (the process of acquiring a value) of atomic database objects.
• Reference results (identifiers of internal objects) – these are plain results of expressions referring through names to database objects. They are usually results of name binding; however, they can also appear through dereference of reference objects.
• Binder results (the pairs n(r) mentioned earlier, where n ∈ N, r ∈ R) – these are created when the operators introducing an auxiliary name are used (as, groupas) or as a result of the nested(iX) operation (described further), where iX is an identifier of a reference or complex object.
• Structure results (struct{r1, r2, …, rn}, where r1, r2, …, rn ∈ SR and SR ⊂ R is the set of query results which are not collections) – such results are a sequence of single results (SR set elements). Structures are usually created using the comma expression, join, or as a result of the dereference of a complex object.
• Collections of single results (bag{r1, r2, …, rn}, sequence{r1, r2, …, rn}, where r1, r2, …, rn ∈ SR) – they consist of any elements of the R set except other collections. Result collections can be nested in other collections only if they are values of binders. Collections are typically created as a result of binding names or using set operators (e.g. union). There are two main collection types: order-preserving sequences and unordered bags. Nevertheless, other collections can be introduced if necessary, e.g. an array.

The following sections describe the bind and nested operations, which are defined using the stacks presented above. All these SBA elements are essential from the point of view of the author's work.

3.2.1 Bind Operation

Each name occurring in a query is bound to an appropriate run-time entity according to the name binding space. Name binding is performed using the so-called bind operation. This operation works on the environment stack in order to find appropriate binders in its sections. At the beginning of a query evaluation the ENVS comprises one section (the base section), which holds binders to all database root objects. During a query evaluation new sections, empty or holding several binders, are pushed onto or popped off the environment stack, but the base section remains untouched.
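The stack machinery introduced so far can be sketched compactly. The store layout and identifiers below are illustrative assumptions in the AS0/AS1 style, and the `nested` function shown here anticipates its definition in section 3.2.2; the sketch only illustrates how bind searches ENVS sections and how entering an object pushes a new section of binders.

```python
# A compact sketch of bind and nested over the environment stack (ENVS);
# the store and identifiers are illustrative only.
store = {
    "i61": ("Emp", {"i62", "i71"}),
    "i62": ("name", "Kowalski"),
    "i71": ("salary", 1200),
}

# ENVS: a list of sections, each a list of (name, result) binders;
# the bottom (base) section binds the database root objects.
envs = [[("Emp", "i61")]]

def bind(name):
    """Search ENVS top-down for the first section holding binders for name."""
    for section in reversed(envs):
        found = [value for n, value in section if n == name]
        if found:
            return found          # possibly several binders in one section
    return []                     # no binder found: empty collection

def nested(result):
    """Binders pushed when entering the interior of a result."""
    if result in store:
        name, value = store[result]
        if isinstance(value, set):             # reference to a complex object
            return [(store[i][0], i) for i in sorted(value)]
        if value in store:                     # reference to a pointer object
            return [(store[value][0], value)]
    return []                                  # other cases: the empty set

# evaluating Emp.salary: bind Emp, push a nested section, bind salary
emp_ref = bind("Emp")[0]
envs.append(nested(emp_ref))
print(bind("salary"))        # ['i71']
envs.pop()
```

Popping the section afterwards restores the base environment, mirroring the rule that the base section remains untouched throughout evaluation.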
Generally, binding a name n consists in searching the ENVS from top to bottom for the first section which holds at least one binder with the name n. Since binder names can repeat inside one section, the result of a binding operation may be a collection of all found binder values. In particular, if no section holds binders with the name n, an empty collection is returned.

3.2.2 Nested Function

The nested function formalises all cases that require pushing new sections onto the ENVS, particularly the concept of pushing the interior of an object. This function takes any query result as a parameter and returns a set of binders. The following results of the nested operation are defined, depending on the kind of the parameter:
• Reference to a complex object – the result is a set consisting of binders created from the subobjects of the given complex object. For each subobject, the created binder carries the subobject's name and its value is the subobject's internal identifier.
• Reference to a pointer object – the result is a set holding a binder whose name is that of the object pointed at by the pointer and whose value is the internal identifier of the pointed object.
• Binder – the result is a set holding the identical binder.
• Structure – the result is the union of the results of the nested function applied to all elements of the structure.
• In other cases, the result is the empty set.

3.3 SBQL Query Language

Queries in the Stack-based Approach are treated in the same way as traditional programming languages treat expressions. Therefore, in this thesis the terms expression and query are used interchangeably. Even though SBA is independent of syntax, in order to explain some semantic constructs the abstract syntax called SBQL (Stack-Based Query Language) is used.
Stack-Based Query Language is a formalised object-oriented query language in the SQL Page 56 of 181 Chapter 3 The Stack-based Approach or OQL style, however its syntactic has been significantly reduced, particularly to avoid large syntactic constructs like select…from…where known from SQL. 3.3.1 Expressions Evaluation SBQL expressions follow the compositionality principle, which means that the semantics of a query is a function of the semantics of its components, recursively. Similarly to programming languages, the simplest queries are names and literals. The most complex queries are created by free connecting several subqueries by operators (providing typing constraints are preserved). There are no constrains concerning nesting queries. SBA uses operational semantic in order to define operators. The most important SBQL operators and their semantic are described in the tables below. Tab. 3-1 Evaluation of traditional arithmetic operators Unary operators: + Evaluation steps: 1. 2. 3. 4. 5. 6. Evaluation steps: 1. 2. 3. 4. 5. 6. Execute the subquery. Take the result from QRES. Verify is it a single result (if not run-time exception is raised). For the reference result dereference is performed. Execute appropriate operation on the value. Push the final result on QRES. Binary operators: + - * / = != < <= > >= or and Execute both subqueries in sequence. Take both results from QRES. Verify they are single results (if not run-time exception is raised) For each reference result dereference is performed. Execute appropriate operation on the values. Push the final result on QRES. Tab. 3-2 Evaluation of operators working on collections Structure constructor operator: , (comma) Evaluation steps: 1. 2. 3. 4. Initialise an empty bag (eres). Execute both subexpressions in sequence. Take both results from QRES (first e2res and next e1res). For each element (e1) of the e1res result do: 4.1. For each element (e2) of the e2res result do: 4.1.1. Create structure {e1, e2}. 
If e1 and/or e2 is structure then its fields are used. 4.1.2. Add structure to eres. 5. Push eres on QRES. Page 57 of 181 Chapter 3 The Stack-based Approach Bag and sequence constructors: bag sequence Evaluation steps: 1. 2. 3. 4. 5. Initialise an empty bag (eres). Execute subquery. Take result from QRES. Result is treated as structure and each structure field is added to eres. Push eres on QRES. Existence operator: exists Evaluation steps: 1. Execute subquery. 2. Take result from QRES. 3. Push false on QRES if result is the empty collection, otherwise true. Removing duplicates: unique, uniqueref Evaluation steps: 1. 2. 3. 4. Initialise an empty bag (eres). Execute subquery. Take a result collection from QRES (colres). For each element (el) of the colres result do: 4.1. If there is no element in eres equal to el then add el to eres. 5. Push eres on QRES. In order to evaluate unique operators elements from colres are subjected to dereference operation if necessary. Sum of sets: expr1 union expr2 Evaluation steps: 1. Initialise an empty bag (eres). 2. Execute both subexpressions in sequence. 3. Take results from QRES. 4. Insert all elements from both results into eres. 5. Push eres on QRES. Traditional set operators: expr1 minus expr2, expr1 intersect expr2 Evaluation steps: 1. 2. 3. 4. Initialise an empty bag (eres). Execute both subexpressions in sequence. Take both results from QRES (first e2res and next e1res). For each element (e1) of the e1res result do: 4.1. In case of minus: if e2res does not contain element equal to e1 push e1 on QRES. In case of intersect: if e2res contains element equal to e1 push e1 on QRES. In order to compare elements from e1 and e2 operator performs necessary dereference operations. Inclusion operator: expr1 in expr2 Evaluation steps: 1. Execute both subexpressions in sequence. 2. Take both results from QRES (first e2res and next e1res). 3. For each element (e1) of the e1res result do: 3.1. 
If e2res does not contain element equal to e1 then the false logical literal is pushed on QRES and evaluation of operator is stopped. 4. Push true logical literal on QRES. Traditional aggregate operators: sum, min, avg, max, count Evaluation steps: 1. Execute subquery. 2. Take a result collection from QRES (colres). Page 58 of 181 Chapter 3 The Stack-based Approach 3. The final result is initialised (0 in case of sum and count operators, value of the first colres collection element). 4. For each element (el) of the colres result do: 4.1. Suitably to the given operator the final result is updated considering el element or its value. 5. Push the final result on QRES. In order to evaluate sum, min, max operators elements from colres are subjected to dereference operation if necessary. The evaluation of avg operator consists of sum and count operators evaluation. Tab. 3-3 Evaluation of non-algebraic SBQL operators Projection/navigation: leftquery . (dot) rightquery Evaluation steps: 1. 2. 3. 4. Initialise an empty bag (eres) Execute the left subquery. Take a result collection from QRES (colres). For each element (el) of the colres result do: 4.1. Open new section on ENVS. 4.2. Execute function nested(el). 4.3. Execute the right subquery. 4.4. Take its result from QRES (elres). 4.5. Insert elres result into eres. 5. Push eres on QRES. Selection: leftquery where rightquery Evaluation steps: Similarly link in case of dot operator except for: Evaluation steps: Similarly link in case of dot operator except for: 4.5. Verify whether elres is a single result (if not run-time exception is raised). 4.6. If elres is equal to true add el to eres. Dependent/navigational join: leftquery join rightquery 4.5. Perform Cartesian Product operation on el and elres. 4.6. Insert obtained structure into eres. Universal quantifier: leftquery forall rightquery Evaluation steps: 1. Execute the left subquery. 2. Take a result collection from QRES (colres). 3. 
For each element (el) of colres do:
   3.1. Open a new section on ENVS.
   3.2. Execute the function nested(el).
   3.3. Execute the right subquery.
   3.4. Take its result from QRES (elres).
   3.5. Verify that elres is a single result (if not, a run-time exception is raised).
   3.6. If elres is equal to false, the false logical literal is pushed on QRES and the evaluation of the operator stops.
4. Push the true literal on QRES.

Existential quantifier: exists leftquery such that rightquery
Evaluation steps: Similar to the forall operator, except for:
   3.6. If elres is equal to true, the true logical literal is pushed on QRES and the evaluation of the operator stops.
4. Push the false literal on QRES.

Sorting: leftquery orderby rightquery
Evaluation steps:
1. Execute the join operation.
2. Take the result from QRES.
3. Sort the obtained structures according to the second structure field, then the third, fourth, etc.
4. Create a new collection using the first structure fields.
5. Push the final collection on QRES.

Tab. 3-4 Evaluation of auxiliary names defining operators

Assigning auxiliary names to collection elements: subquery as name
Evaluation steps:
1. Execute the subquery.
2. Take its result from QRES.
3. Replace each element of the obtained collection with a binder whose name is the operator parameter and whose value is the given element.
4. Push the final collection on QRES.

Assigning an auxiliary name to the whole collection: subquery groupas name
Evaluation steps:
1. Execute the subquery.
2. Take its result from QRES.
3. Create a binder whose name is the operator parameter and whose value is the obtained result.
4. Push the binder on QRES.

Tab. 3-5 Evaluation of sequence ranking operators

Assigning auxiliary ranking binders to a sequence: seqquery rangeas name
Evaluation steps:
1. Execute the seqquery returning a sequence (all sequences are indexed starting from 1).
2. Take its result from QRES (seqres).
3.
Replace each element of the obtained sequence (seqres) with a structure consisting of:
   • the given element and
   • a binder whose name is the operator parameter and whose value is the index of the element in the sequence.
4. Push the final sequence, converted to a bag, on QRES.

Extracting elements from a sequence: seqquery[subquery]
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute the seqquery returning a sequence (all sequences are indexed starting from 1).
3. Take its result from QRES (seqres).
4. Execute the subquery returning a collection of integers.
5. Take the result collection from QRES (intres).
6. For each element (el) of intres do:
   6.1. Insert the seqres element with index el into eres.
7. Push eres on QRES.

In this section the most important SBQL operators have been presented. SBA also enables introducing more sophisticated operators, e.g. transitive closures and fixed point equations. Nonetheless, the operators essential for the author’s work are presented above.

3.3.2 Imperative Statements
The following operators used to modify the state of data are also a part of the SBQL language; however, they cannot construct expressions that can be used by other operators to form complex queries.

Tab. 3-6 Evaluation of imperative operators

Assigning a value to an object: :=
Evaluation steps:
1. Execute the right subexpression.
2. Take its result from QRES.
3. Verify that it is a single result (if not, a run-time exception is raised).
4. Perform dereference on the result, if necessary, to obtain a value.
5. Execute the left subexpression.
6. Take its result from QRES (it is assumed that the result of the left subquery is a reference to a suitable object).
7. Verify that it is a single reference (if not, a run-time exception is raised).
8. Assign the value of the right subquery result to the object pointed to by the reference.
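The stack discipline of the assignment steps above can be sketched in a few lines. This is a minimal illustration only; the Ref class, the store dictionary and all names are hypothetical stand-ins, not ODRA's actual interfaces.

```python
store = {"emp1.salary": 1000}   # toy object store: reference id -> value
qres = []                       # the query result stack (QRES)

class Ref:
    """A stand-in for an object reference (an identifier into the store)."""
    def __init__(self, oid):
        self.oid = oid

def single(result):
    # Verify the subquery returned a single result; raise otherwise.
    if isinstance(result, list):
        if len(result) != 1:
            raise RuntimeError("single result expected")
        return result[0]
    return result

def deref(result):
    # Dereference: read the value behind a reference, pass values through.
    return store[result.oid] if isinstance(result, Ref) else result

def eval_assign():
    # := assumes the left result (a reference) lies on top of the right one.
    lres = single(qres.pop())           # reference to the updated object
    rval = deref(single(qres.pop()))    # dereferenced right-operand value
    if not isinstance(lres, Ref):
        raise RuntimeError("reference expected on the left-hand side")
    store[lres.oid] = rval
```

Pushing `[1200]` (the right result) and then `[Ref("emp1.salary")]` (the left result) before calling `eval_assign()` updates the stored salary, mirroring steps 1–8 above.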
Creating an object inside an existing object: :<<
Evaluation steps:
1. Execute the right subexpression.
2. Take its result from QRES (it is assumed that the right subquery returns a binder).
3. Verify that it is a single result (if not, a run-time exception is raised).
4. Execute the left subexpression.
5. Take its result from QRES (it is assumed that the result of the left subquery is a reference to a suitable complex object).
6. Verify that it is a single result (if not, a run-time exception is raised).
7. Create a database object according to the binder name and its value. If the binder has an atomic value inside, the new object is atomic. If the binder contains another binder or a structure, a complex object is created (for nested binders appropriate new subobjects are created).
8. Nest the new object inside the object referenced by the left subquery result.

Removing an object: delete
Evaluation steps:
1. Execute the subquery.
2. Take the result collection from QRES (colres). It is assumed that this collection holds references to existing objects.
3. For each element (ref) of colres do:
   3.1. Remove the object pointed to by ref from the database, together with its subobjects and the objects referencing it.

3.4 Static Query Evaluation and Metabase
During compilation SBQL queries are subjected to static analysis. This process is indispensable in order to perform static type control and most of the optimisations. Static analysis uses mechanisms similar to the evaluation of a query. The task of such an evaluation is to simulate as many as possible of the situations that may occur at run-time, however using data appropriate for compile-time. Hence, the static analysis does not refer to real data. Instead, it uses a metabase, i.e. a graph of the database schema constructed from the declarations of program entities. A database schema graph is a structure similar to a database graph.
It is also modelled using simple, complex and reference objects. The significant differences in contrast to a database graph are the following:
• instead of particular occurrences of objects, a metabase stores only the information about the minimal and maximal numbers of objects, i.e. the cardinality of a collection,
• instead of specific values, the metabase stores information on data types and the relationships (e.g. static inheritance) between them,
• the metabase additionally contains information which can be used during cost-based optimisations, e.g. data-related statistics.
For example, the following source code fragment:

i : integer [0..*];
setvar : record { txt : string; note : string [1..5] };

would result in the following metabase written according to the AS0 model:

<i0, entry,
  <i1, i,
    <i2, meta_object_kind, META_VARIABLE>
    <i3, type_kind, PRIMITIVE>
    <i4, type, INTEGER>
    <i5, minimal_cardinality, 0>
    <i6, maximal_cardinality, +∞>
  >
  <i7, setvar,
    <i8, meta_object_kind, META_VARIABLE>
    <i9, type_kind, COMPLEX>
    <i10, type, i13>
    <i11, minimal_cardinality, 1>
    <i12, maximal_cardinality, 1>
  >
  <i13, $x_struct_type,
    <i14, meta_object_kind, META_STRUCTURE>
    <i15, fields,
      <i16, txt,
        <i17, meta_object_kind, META_VARIABLE>
        <i18, type_kind, PRIMITIVE>
        <i19, type, STRING>
        <i20, minimal_cardinality, 1>
        <i21, maximal_cardinality, 1>
      >
      <i22, note,
        <i23, meta_object_kind, META_VARIABLE>
        <i24, type_kind, PRIMITIVE>
        <i25, type, STRING>
        <i26, minimal_cardinality, 1>
        <i27, maximal_cardinality, 5>
      >
    >
  >
>

The compile-time equivalent of a query result is an operation signature.
The following kinds of signatures can be distinguished:
• a static reference, that is, a reference to a metabase object,
• a static binder, which contains a name and an associated signature as a value,
• a variant, which contains several possible signatures (used when static analysis cannot determine an unambiguous signature),
• a value type representation, which contains the identifier of the primitive type it represents (it usually concerns literals and static references to atomic objects when dereference is applied),
• a static structure, which contains a set of signatures representing the fields of that structure.
Each of these signatures carries additional information, e.g. concerning the possible cardinality of the run-time result returned by the query represented by the signature. Besides signatures, the static query analyser is equipped with static counterparts of the environment and query result stacks. In contrast to the run-time stacks, these structures work with signatures and the database schema graph rather than with query results.

3.4.1 Type Checking
The compile-time static query analysis allows for performing static type control [112]. According to type determining rules, which are specified for every operator, the compiler can determine the type of a value returned by a complex query through analysis of its individual parts. The following example is a single rule concerning the union operator:

bag[a..b](type) union bag[c..d](type) => bag[a+c..b+d](type)

This rule describes the set-theoretic sum of a bag comprising from at least a to at most b elements, represented by the left union operand signature, with another bag, whose cardinality is from c to d elements, represented by the right union operand signature. It additionally assumes that the types of elements in these collections must be identical.
Consequently, this rule indicates that the final collection preserves the type of the input collections and comprises from at least a + c to at most b + d elements. A set of similar rules is an integral part of almost every programming language. Yet SBQL is also a query language and thus operation signatures are enhanced with additional information concerning collections, such as cardinality and order. The rules created for arithmetic operators are usually more restrictive. For example, for the + operator the following rule can be designed:

value[1..1](integer) + value[1..1](real) => value[1..1](real)

This rule assumes that the addition of integer and real values can be executed only if both operands are single values (i.e. the cardinality of both arguments is [1..1]); otherwise a typing error occurs. Nevertheless, one can wonder whether such an assumption concerning the cardinality is too restrictive and, consequently, whether some part of the type checking should be moved to run-time. The rule above can be rewritten in an alternative form:

value[0..*](integer) + value[0..*](real) => value[1..1](real)

In this case the + operator admits situations where the actual cardinality of the arguments is unknown at compile-time. The suitable control ensues at run-time: if the left or right subquery does not return a single value, the interpreter reports a run-time error. Such a solution leads to the so-called semi-strong type system. Let us consider the example query:

(Person where surname = “Kuc”).age + 1

It illustrates why a semi-strong type system is more comfortable for a programmer. In this example it is assumed that only one person with the given surname exists. This assumption is controlled dynamically, i.e. if it is not fulfilled, a run-time error indicates a typing error. In case of a more restrictive type system, the compiler would reject the above construction.
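The cardinality arithmetic behind these rules can be illustrated with a small sketch. The tuple encoding of signatures used here is an assumption made only for this example, not the thesis' internal representation.

```python
# A signature is modelled as (element_type, (min_card, max_card));
# INF stands for the unbounded '*' cardinality.
INF = float("inf")

def union_rule(left, right):
    # bag[a..b](t) union bag[c..d](t) => bag[a+c..b+d](t)
    lt, (a, b) = left
    rt, (c, d) = right
    if lt != rt:
        raise TypeError("union operands must have identical element types")
    return (lt, (a + c, b + d))

def plus_rule(left, right, semi_strong=False):
    # Strict rule: value[1..1](integer) + value[1..1](real) => value[1..1](real).
    # Semi-strong variant: the cardinality check is postponed to run-time.
    for t, card in (left, right):
        if t not in ("integer", "real"):
            raise TypeError("numeric operand expected")
        if not semi_strong and card != (1, 1):
            raise TypeError("single value expected")
    result_type = "real" if "real" in (left[0], right[0]) else "integer"
    return (result_type, (1, 1))
```

Under the strict rule, `plus_rule(("integer", (0, INF)), ("real", (1, 1)))` raises a typing error at "compile-time", while the semi-strong variant accepts it and leaves the cardinality check to the interpreter.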
3.5 Updateable Object-Oriented Views
A database view is a collection of virtual objects that are arbitrarily mapped from stored objects. In the context of distributed applications (e.g. web applications) views can be used to resolve incompatibilities between heterogeneous data sources, enabling their integration [60, 61]. The idea of updateable object views relies on augmenting the definition of a view with information on the users’ intents with respect to updating operations. Only the view definer is able to express the semantics of view updating. To achieve it, a view definition is subdivided into two parts. The first part is a functional procedure which maps stored objects into virtual objects (similarly to SQL). It returns entities called seeds that unambiguously identify virtual objects (in particular, seeds are OIDs of stored objects). The second part contains redefinitions of generic operations on virtual objects. These procedures express the view definer’s intentions with respect to the update, delete, insert and retrieve operations performed on virtual objects. Seeds are (implicitly) passed as the procedures’ parameters. A view definition usually contains definitions of subviews, which are defined by the same rule, according to the relativism principle. Because a view definition is a regular complex object, it may also contain other elements, such as procedures, functions, state objects, etc. The above assumptions and the SBA semantics allow achieving the following properties: (1) full transparency of views – after defining a view, its user uses the virtual objects in the same way as stored objects; (2) views can be recursive and (as procedures) may have parameters.

Chapter 4 Organisation of Indexing in OODBMS
This chapter concerns primarily the architecture and the rules applying to index management and maintenance.
The actual optimisation of query processing is the topic of the next chapter. Nonetheless, improving performance depends on the diversity of the exploited index structures and on the flexibility in defining an index. The properties of SBQL, in particular orthogonality and compositionality, enable easy formulation of complex selection predicates, including the usage of complex expressions with polymorphic methods and aggregate operators. The proposed organisation of indexing provides all the necessary mechanisms so that the database administrator is unconstrained in creating local or global indices with keys based on such expressions. The implementation exploits the linear hashing index structure (see section 2.2.1). Nevertheless, the solution does not limit the possibility to apply different indexing techniques, e.g. B-Trees. Details of this aspect of database indexing are omitted since it is generally orthogonal to, and independent of, index management, maintenance and query optimisation.

4.1 Implementation of a Linear Hashing Based Index
The primary reason for implementing a linear hashing index is the possibility to extend this structure to its distributed SDDS version (see section 2.2.2) in order to optimally utilise distributed database resources. Moreover, the author wants to provide extensive query optimisation support by enabling:
• dense indexing for integer, real, string, date and reference key values (the dense index key type),
• support for optimising range queries on integer, real, string and date key values (the range and enum key types),
• indexing using multiple keys,
• the enum key type – special support facilitating indexing of integer, real, string, date, reference and boolean keys with a countable, limited set of distinct values (low key value cardinality).
The enum key type provides additional flexibility when applied to multiple key indices, since such keys can be skipped in an index invocation, i.e. can be considered optional.
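As a rough illustration of the underlying structure, the following toy linear hashing table splits one bucket at a time as it grows. This is a sketch in the spirit of section 2.2.1, not the author's implementation, and it ignores disk pages and overflow chains.

```python
class LinearHash:
    """Toy linear hashing table: the address space grows one bucket at a time."""

    def __init__(self, capacity=2, load_factor=2.0):
        self.n0 = capacity            # initial number of buckets
        self.level = 0                # current hashing level
        self.next_split = 0           # pointer to the next bucket to split
        self.load_factor = load_factor
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _addr(self, key):
        # Hash with h_level; buckets already split use the finer h_(level+1).
        h = hash(key)
        addr = h % (self.n0 * 2 ** self.level)
        if addr < self.next_split:
            addr = h % (self.n0 * 2 ** (self.level + 1))
        return addr

    def insert(self, key, value):
        self.buckets[self._addr(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.load_factor:
            self._split()

    def _split(self):
        # Split exactly one bucket and redistribute its entries.
        s = self.next_split
        entries, self.buckets[s] = self.buckets[s], []
        self.buckets.append([])
        self.next_split += 1
        if self.next_split == self.n0 * 2 ** self.level:
            self.level += 1           # a whole round of splits is complete
            self.next_split = 0
        for key, value in entries:    # re-address with the finer hash
            self.buckets[self._addr(key)].append((key, value))

    def lookup(self, key):
        return [v for k, v in self.buckets[self._addr(key)] if k == key]
```

No global rehashing ever happens: each insertion triggers at most one bucket split, which is what makes the structure attractive as a basis for the distributed SDDS variant.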
4.1.1 Index Key Types
The mentioned properties of indexing are achieved through different designs of the hash function. The dense key type implies that the optimisation of selection queries which use the given key as a condition will be applied only to selection predicates based on the = or in operators. Therefore, the hash function can distribute objects in the index randomly, ignoring the order of key values. Such an index does not support optimising range queries; however, it is faster in processing index invocations with exact match selection criteria. The range key type additionally supports optimisation concerning selection predicates based on the range operators: >, ≥, < and ≤. This is achieved through a range partitioning [62, 75] variant implemented by the author. Within the index, the hash function groups object references in individual buckets (see the bucket definition in section 2.2.1) according to key value ranges. The ranges are dynamically split as the index grows, increasing its selectivity. The last key type – enum – is introduced in order to take advantage of keys with a countable, limited set of distinct values, i.e. keys with low value cardinality. The performance of an index can be strongly deteriorated if key values have low cardinality, e.g. a person’s eye colour, marital status (a boolean value) or year of birth. To prevent this, the index internally stores all possible key values (or key value range limits in case of integer values) and uses this information to facilitate index hashing. The enum key type can deal with optimising selection predicates exactly as the range key type, i.e. for the =, in, >, ≥, < and ≤ operators. Multiple key indexing is introduced by defining the overall hash function as a composition of the individual keys’ hash functions. An enum type key hash function assigns key values to consecutive hash codes. As a result, enum type keys are particularly effective in multiple key indices.
First, they can be omitted in the index calls generated during query optimisation, which improves the flexibility of the index optimiser (details are described in section 5.5.2). Furthermore, index invocation evaluation is very efficient if all index keys are enum and the number of indexed objects is large enough. In such conditions each key value combination points to a separate bucket of object references, which eliminates the necessity to verify the search criteria for the retrieved objects.

4.1.2 Example Indices
The example schema given in Fig. 3.1 opens the possibility to present a wide variety of indices supported by the OODBMS indexing engine implemented by the author. The prefix idx is used to distinguish the names of indices from other database entities. First, let us discuss simple single key indices created on objects’ attributes:
• idxPerAge – returns Person objects according to the value of the age attribute. It is assumed that this index is capable of processing range queries.
• idxEmpAge – identical as above, except only for Emp objects.
• idxEmpSalary – returns Emp objects queried by their salary attribute. This index can similarly use a salary range as a selection criterion.
• idxPerSurname – a dense index returning Person objects according to the string type surname attribute.
• idxDeptName – a dense index returning Dept objects according to the name attribute.
• idxPerZip – a range index which returns Person objects queried by the zip attribute of their address subobject. It is important to note that the zip attribute is optional and therefore this index stores only Person objects containing this attribute.
• idxEmpCity – returns instances of the Emp class according to the address.city complex attribute. It is assumed that this index is dense.
• idxAddrStreet – a dense index which returns address subobjects of Person objects according to the street attribute.
Unlike in the case of the other indices, its non-key objects are defined by a path expression, i.e. Person.address. The following indices use derived and complex attributes as keys:
• idxEmpDeptName – a dense index which uses the derived attribute worksIn.Dept.name to retrieve Emp objects.
• idxEmpWorkCity – an index using the derived attribute worksIn.Dept.address.city for Emp objects. Additionally, in order to take advantage of the fact that company departments are located in a limited number of cities (low key value cardinality), the key type is enum.
• idxDeptYearCost – the most complex of the indices, for Dept objects. The key is based on the expression sum(employs.Emp.salary) * 12, which returns the approximate total cost of the salaries of a given department over a year. It is assumed that this is a range index.
• idxEmpTotalIncomes – a range index which uses the Emp class method getTotalIncomes() as a key for selecting Emp objects. This method is overridden for instances of the EmpStudent class.
Another powerful feature of the proposed indexing solution is multiple key indexing. Using such an index can strengthen the selectivity property (cf. section 5.4.1), in particular when individual keys return only few distinct values.
• idxEmpAge&WorkCity – an index for instances of Emp objects. It consists of two dense keys. The first key is set on the age attribute and the second on the derived attribute worksIn.Dept.address.city. It is assumed that it is necessary to specify both attributes to take advantage of this index.
• idxPerAge&Surname – the last index, for indexing Person objects, which also uses two keys. The first key is set on the age attribute and supports range queries; low cardinality of values is assumed (the enum key type). The second, dense key is set on the person’s surname attribute. This index offers greater flexibility since the age key can be omitted in an index call.
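One way to realise the composition of individual key hash functions described in section 4.1.1 is a mixed-radix combination of enum key codes. The following sketch assumes this particular scheme and illustrative key sets; the thesis does not spell out the exact composition used in ODRA.

```python
def make_enum_hash(values):
    # An enum key hash assigns admissible key values consecutive codes 0..n-1.
    codes = {v: i for i, v in enumerate(sorted(values))}
    return (lambda v: codes[v]), len(codes)

def compose(key_hashes):
    # key_hashes: one (hash_fn, cardinality) pair per index key.
    def combined(key_tuple):
        code = 0
        for (fn, card), v in zip(key_hashes, key_tuple):
            code = code * card + fn(v)   # mixed-radix combination of key codes
        return code
    return combined

# Illustrative keys for an idxEmpAge&WorkCity-like index (data is made up):
age_key = make_enum_hash(range(18, 68))                 # 50 distinct ages
city_key = make_enum_hash(["Lodz", "Warsaw", "Cracow"]) # 3 department cities
age_and_city = compose([age_key, city_key])
```

Because every key value combination maps to a distinct code, each combination can point to its own bucket, which matches the observation above that all-enum multiple key indices avoid verifying search criteria for the retrieved objects.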
In order to take advantage of indexing, the administrator must only create proper indices. The rest of the optimisation issues are completely transparent.

4.2 Index Management
All indices existing in the database are registered and managed by the index manager. Besides the list of meta-references to objects describing indices, it also holds auxiliary redundant information needed by the index optimiser and the static evaluator, i.e. a list of structures (called Nonkey Structures) maintained for each indexed collection of objects, containing information about:
• the query defining the given collection,
• the reference to the metabase object representing objects belonging to this collection,
• the indices set on the given collection, along with their meta-references,
• the list of keys used to index the given collection of objects, holding precise information about each key:
  o an expression defining the key,
  o a list of indices using the given key.
Efficient access to the elements of the lists mentioned above is provided by auxiliary indices. The structure of the index manager is presented in Fig. 4.1.

Fig. 4.1 Index manager structure

The index manager assists in the index optimisation process by making well-organised information about existing indices available (details can be found in section 5.2.1 and subchapter 5.3). For instance, if the index optimiser processes a where clause which selects objects from the whole Emp collection, then the Nonkey Structures Index returns the necessary information about indices set on EmpClass instances. Such information, in case of the indices introduced in section 4.1.2, is presented in Fig. 4.2.
Fig. 4.2 Example Nonkey Structure for the Emp collection

Taking advantage of the Nonkey Structure presented above, the index optimiser can efficiently match selection predicates of a where clause with the associated indices’ keys.

4.2.1 Index Creating Rules and Assumed Limitations
Each index has to be unique in its namespace. Because of the wide range of topics discussed in this work, the concept of modules, and the namespaces connected with them, developed in the ODRA prototype [68, 119], is omitted. The administrator issues the add index command to create a new index in the database. The syntax of this command is the following:

add index <indexname> ( <typeind_1> [ | <typeind_2> ... ] )
on <nonkeyexpr> ( <keyexpr_1> [ , <keyexpr_2> ... ] )

where:
• indexname – stands for a unique name of the index,
• typeind_i – is the type indicator of the i-th index key, specified by one of the following values: dense, range and enum (described in section 4.1.1),
• nonkeyexpr – a path expression defining the indexed objects,
• keyexpr_i – a query defining the i-th key used to retrieve indexed objects.
The number of type indicators corresponds to the number of keys forming an index. Indexed objects are defined by the nonkeyexpr expression, which must be bound in the lowest database section (the database root) of the environment stack.
For simplification, it is assumed that this definition should be built using a path expression (name expressions connected using the dot non-algebraic operator). Moreover, this path expression should return a collection of distinct objects to be indexed. This is important because some of the optimisation methods are currently not designed to deal with collections containing duplicates. Using reference objects in defining nonkeyexpr can result in the possibility of indexing duplicates; e.g. usually more than several employees work in a single company department, and hence the following path expression: Emp.worksIn.Dept would probably return the same Dept objects many times, because worksIn references associate different employees with the same department. Nevertheless, the mentioned limitation concerning the definition of indexed objects is normal in the context of typical indexing solutions for databases. Additionally, such an index can be used to enforce the constraint that a collection should comprise distinct objects. Enabling support for more complex definitions of non-key objects is possible; however, it would result in some limitations concerning the applicability of optimisations and in increased implementation complexity of automatic index updating, which is presented in subchapter 4.3. Each key value expression keyexpr_i should be defined in the context of the objects defined by the nonkeyexpr expression. Consequently, the query:

nonkeyexpr join (keyexpr_1 [ , keyexpr_2 ... ] )

returns the non-key objects together with the corresponding key values. Each keyexpr_i should depend on the join operator. Index keys should return values of the following types: integer, real, string, date, reference or boolean. Moreover, each key expression has to be deterministic, i.e. for a given non-key object it must return exactly the same result provided that the data used to calculate it have not changed (this excludes, for example, the usage of a random method).
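The effect of evaluating `nonkeyexpr join (keyexpr_i)` can be mimicked with a small sketch that pairs each non-key object with its key value and collects the pairs into index entries. The Emp data and all names here are illustrative only.

```python
# Illustrative data: a flat stand-in for the Emp collection.
emps = [
    {"id": "e1", "surname": "Nowak", "salary": 1200},
    {"id": "e2", "surname": "Kuc",   "salary": 3000},
]

def nonkeyexpr():
    # Stand-in for the path expression defining indexed objects, e.g. Emp.
    return emps

def keyexpr(emp):
    # Stand-in for a key expression evaluated in the context of one
    # non-key object, e.g. the salary attribute.
    return emp["salary"]

def build_index():
    # nonkeyexpr join (keyexpr): each non-key object paired with its key
    # value, grouped into key -> [object references] entries.
    index = {}
    for obj in nonkeyexpr():
        index.setdefault(keyexpr(obj), []).append(obj["id"])
    return index
```

Because `keyexpr` is deterministic, rebuilding the index over unchanged data always yields the same entries, which is exactly the property the determinism requirement above guarantees.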
An important property of a created index is the cardinality of its keys. For each key it indicates the possible number of returned values. Usually keys return a single value, so their cardinality is [1..1] and a key value exists for each non-key object. As a result, the whole non-key collection is indexed. When the minimal key cardinality is zero, e.g. the address.zip key of the idxPerZip index, some objects can be omitted in indexing, since their key value may not exist. This situation does not disable indexing; however, it introduces several requirements for the database programmer in order not to thwart index optimisation (this problem is explained in detail in subchapter 5.3 and section 5.5.2). Currently the author has not provided support for indexing when a key’s maximal cardinality is greater than one, because of the ambiguity in generating key values for an object, i.e. more than one combination of key values can be generated for a single object. Supporting such a scenario would require introducing minor changes in generating the index structure and extending the index optimisation methods to properly deal with selection predicates working with collections. If all the conditions described above are met, the index manager initialises an index structure and creates an index-related meta-object. Next, it proceeds to organise the information required for optimisation. First, the Nonkey Structure corresponding to the given nonkeyexpr expression is located, or a new one is created. Then, the structure is updated with the information depicted in Fig. 4.1 concerning the index and all its keys. Each keyexpr_i expression is marked with the index being created in the proper Key Structure. This completes the creation of a new index. However, it is crucial that the index manager enables the index updating mechanism, which fills the index structure with the appropriate objects.
This topic, together with related issues, is discussed in subchapter 4.3.

4.3 Automatic Index Updating
Indices, like all redundant structures, can lose consistency if the data stored in the database are altered. Rebuilding an index should be transparent for application programmers and should ensure the validity of the maintained indices. For these reasons automatic index updating has been designed and implemented. Furthermore, the additional time required for an index update in response to a data modification should be minimised. This is critical from the point of view of the efficiency of large databases. A change to data must not cause long-lasting verification of existing indices or rebuilding of a whole index from scratch. To achieve this, a database system should efficiently find the indices which became outdated because of a performed data modification. Next, the appropriate index entries should be corrected so that all index invocations provide valid answers. Such index updating routines should not influence the performance of retrieving information from the database, and the overhead introduced to writing data should be minimal (particularly when no index is affected by changes to the database). However, finding a general and optimal solution for index updating is not possible because of the complexity of DBMSs. Such a task requires the analysis of many different real-life situations occurring in the database environment in order to minimise the deterioration of performance.

4.3.1 Index Update Triggers
Each modification performed on objects (creation, update and deletion) is executed through the ODRA object store CRUD interface (CRUD is an acronym for Create, Read, Update and Delete), which is generally responsible for access to persistent data and other database entities. The proposed approach to automatic index updating concentrates on this element of the system, as it is the easiest and most reliable way to trace data modifications.
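Tracing modifications through the CRUD layer can be sketched as follows. This is a hypothetical API, not ODRA's actual CRUD interface: only objects carrying triggers cause the index updating mechanism to run, so writes to unindexed data pay almost no overhead.

```python
class Crud:
    """Toy CRUD layer that notifies index updating before a modification."""

    def __init__(self):
        self.store = {}      # oid -> value
        self.triggers = {}   # oid -> list of callbacks (standing in for IUTs)
        self.updates = []    # log of index-update activations

    def add_trigger(self, oid, callback):
        # Associate an index update trigger with one database object.
        self.triggers.setdefault(oid, []).append(callback)

    def update(self, oid, value):
        # Notify the index updating mechanism before the modification occurs;
        # objects without triggers are modified with no index work at all.
        for iut in self.triggers.get(oid, []):
            iut(oid, self.store.get(oid), value)
        self.store[oid] = value

crud = Crud()
crud.store["p1.age"] = 30
crud.add_trigger("p1.age",
                 lambda oid, old, new: crud.updates.append((oid, old, new)))
crud.update("p1.age", 31)      # a trigger fires: index updating is activated
crud.update("p1.name", "Jan")  # no trigger attached: no index work
```

Note that the trigger is invoked before the store is written, mirroring the requirement below that the old state must still be available when the index update is prepared.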
The possible modifications that can be performed on an object are the following:
• updating a value of an integer, double, string, boolean, date or object reference,
• deleting,
• adding a child object (in case of a complex object),
• other database implementation dependent modifications, e.g. adding a child to an aggregate object (the role of this kind of objects in index updating is described in section 4.3.5).
The author has introduced a group of special auxiliary structures called Index Update Triggers (IUTs), together with Triggers Definitions (TDs). These elements are essential to perform index updating. Each IUT associates one database object with an appropriate index through a TD. Existing IUTs automatically initialise the index updating mechanism when a modification concerning the given object is about to occur. More than one IUT can be connected with a single object. TDs provide the means to find objects which should be equipped with IUTs. Additionally, a TD specifies the type of an IUT. An object is associated with IUTs when it participates in accessing non-key objects or in calculating key values for indices. Therefore, a modification of objects not linked with any index does not trigger unnecessary index updating. Altering objects equipped with IUTs is likely to influence the validity of indices and IUTs. Four basic types of IUTs (each referring to a different TD type) are proposed:
1. Root Index Update Trigger (R-IUT) – by default associated with the root database entry, which is a direct or indirect parent of all indexed database objects. When a new object is created in the database root, the trigger can cause the generation of a NonkeyPath or Nonkey Index Update Trigger (described below) for the new child object. This trigger is also used to initialise or terminate all triggers associated with an index.
2.
NonkeyPath Index Update Trigger (NP-IUT) – a type of a trigger associated with objects which are potential direct or indirect parent objects for new indexed objects. This type of a trigger is generated when an index non-key object is defined by a path expression (e.g. idxAddrStreet index), i.e. when non-key objects are not direct children of the databases root. Similarly to a R-IUT, this Page 76 of 181 Chapter 4 Organisation of Indexing in OODBMS trigger can cause generation of a NonkeyPath- or Nonkey- Index Update Trigger for the new child object. 3. Nonkey Index Update Trigger (NK-IUT) – a trigger that is assigned to indexed (non-key) objects. It is generated by direct parent object’s update triggers (R-IUTs or NP-IUTs). The process of creating a NK-IUT consists of the following steps: • first a NK-IUT is assigned to the given indexed object, • the key value is calculated, • the corresponding index entry is created (if a valid key value is found), • Key Index Update Triggers (described below) are generated and parameterised with the indexed object identifier. Creating a child object inside a non-key object initialises routines identical to a Key Index Update Trigger. 4. Key Index Update Trigger (K-IUT) – associated with objects used to evaluate a key value for a specific non-key object (identifier passed together with TD as an additional parameter). Each modification to such objects can potentially modify the process of evaluating a key and hence its value. Therefore, a K-IUT is responsible for updating a corresponding index entry and maintaining appropriate K-IUTs corresponding to the given non-key object. Basing on the sample store depicted in Fig. 3.2 for indices idxPerAge, idxEmpWorkCity and idxAttrStreet introduced in section 4.1.2 example IUTs shown in Fig. 4.3, Fig. 4.4 and Fig. 4.5 would be generated. Let us assume that i0 is the identifier of the databases root. Non-key objects associated with K-IUTs are stated in parentheses. Fig. 
4.3 Example Index Update Triggers generated for idxPerAge index Page 77 of 181 Chapter 4 Organisation of Indexing in OODBMS Fig. 4.4 Example Index Update Triggers generated for idxEmpWorkCity index Fig. 4.5 Example Index Update Triggers generated for idxAddrStreet index 4.3.2 The Architectural View of the Index Update Process The overview of the index update process that has been proposed and implemented by the author is presented in Fig. 4.6. Fig. 4.6 Automatic index updating architecture Page 78 of 181 Chapter 4 Organisation of Indexing in OODBMS When the administrator adds an index, TDs are created before IUTs (this step is shown using the green coloured arrows numbered 1a and 1b): • Index manager initialises a new index and issues the triggers manager a message to build TDs. • Next, the triggers manager activates the index updating mechanism which basing on the knowledge about indices and TDs proceeds to add IUTs: o This process is initialised by introducing a R-IUT for the databases root entry. o The R-IUT trigger propagates remaining triggers to database objects. o When a NK-IUT is added to an indexed non-key object then a key value is evaluated and an adequate entry is added to the index. Removing an index causes the removal of IUTs and TDs. Together with NK-IUTs corresponding index entries are deleted. The mediator managing the addition and removal of IUTs is a special extension of the CRUD interface. The second case the index updating mechanism is activated occurs when the databases store CRUD interface receives a message to modify the object which is marked with one or more IUTs (shown in Fig. 4.6. using the blue coloured arrow with number 2). CRUD notifies the index updating mechanism about forthcoming modifications and all necessary preparation before database’s alteration are performed. This step is particularly important in case of changes which can affect a key value for the given non-key object. 
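The trigger structures described in section 4.3.1 can be modelled roughly as follows. The Python shapes below are the author's concepts recast as an illustrative sketch, not ODRA's implementation; the attached identifiers follow the running idxPerAge example (i0 root, i31 EmpStudent, i34 its age attribute):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Illustrative sketch of Trigger Definitions (TD) and Index Update
# Triggers (IUT); the Python shapes are assumptions, not ODRA's code.

class TriggerType(Enum):
    R_IUT = "root"           # on the database root entry
    NP_IUT = "nonkey_path"   # on parents along the path to non-key objects
    NK_IUT = "nonkey"        # on the indexed (non-key) objects themselves
    K_IUT = "key"            # on objects read while computing a key value

@dataclass(frozen=True)
class TriggerDefinition:
    index_name: str
    trigger_type: TriggerType

@dataclass(frozen=True)
class IndexUpdateTrigger:
    definition: TriggerDefinition
    nonkey_oid: Optional[str] = None   # K-IUTs carry the non-key object id

# oid -> set of IUTs; more than one IUT may mark a single object
triggers_on = {}

def attach(oid, iut):
    triggers_on.setdefault(oid, set()).add(iut)

# Triggers for idxPerAge in the spirit of Fig. 4.3:
attach("i0", IndexUpdateTrigger(TriggerDefinition("idxPerAge", TriggerType.R_IUT)))
attach("i31", IndexUpdateTrigger(TriggerDefinition("idxPerAge", TriggerType.NK_IUT)))
attach("i34", IndexUpdateTrigger(TriggerDefinition("idxPerAge", TriggerType.K_IUT), "i31"))
```

On a write to an object, the mechanism only has to look up the object's identifier in such a map; objects not linked with any index carry no triggers and cost nothing extra.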
This preparation consists of:
• locating the index entry which corresponds to the non-key object (a key value is necessary),
• identifying the objects that are accessed in order to calculate the key value (they are equipped with an identical K-IUT).
After gathering the required information, CRUD performs the requested modification and the index updating mechanism proceeds to:
• update the index entries for the given non-key object by:
o moving the entry corresponding to the non-key object according to the new key value,
o removing the outdated entry if there is no proper new key value,
o inserting a new entry into the index if a proper key value could be calculated only after the database alteration,
• update the existing IUTs by generating new ones and removing outdated ones.
This finishes servicing the trigger caused by the alteration of the database.

4.3.3 SBQL Interpreter and Binding Extension

A significant element used by the index updating mechanism is the query execution engine, i.e. the SBQL interpreter (also shown in Fig. 4.6), extended with the ability to:
1. Log database objects that occur during the evaluation of an index key expression. Logging takes place while binding object names on the ENVS (other database entities, like procedures, views, etc., and literals are discarded). This feature is used to locate all objects which are, or should be, equipped with K-IUTs.
2. Limit the first performed binding to only one specified object. This feature significantly accelerates and facilitates verifying whether a new child subobject added to an object with an R-IUT or NP-IUT should be equipped with an NP-IUT or NK-IUT, i.e. checking whether the new child is a non-key object or a potential direct or indirect parent of a non-key object.
The only module of the SBQL interpreter which required the author's modifications is the run-time binding manager.
The proposed extension is introduced using Java static inheritance; therefore, applying different binding mechanisms to the SBQL interpreter is straightforward. The interpreter is used by the index updating mechanism in order to:
• traverse from the database's root, or from objects equipped with an NP-IUT, to non-key objects,
• generate a key value for a given non-key object.
Let us consider the following example of adding IUTs, starting with an R-IUT, during the creation of the idxAddrStreet index. The SBQL interpreter is first used to evaluate the query Person, which returns identifiers of instances of PersonClass and its subclasses EmpClass, StudentClass and EmpStudentClass, i.e. in this case the i31 EmpStudent and i61 Emp objects. Consequently, NP-IUTs are added to the objects i31 and i61. In order to propagate triggers to non-key objects, the index updating mechanism first performs the nested operation on the i31 object to prepare a suitable context for the SBQL interpreter by pushing the necessary binders onto the environment stack. Then the query address is evaluated and returns the i36 object. Similar actions are taken for the i61 Emp object, and the identifier i66 is returned. Therefore, NK-IUTs are added to the address objects i36 and i66. Next, in the context of both non-key objects, the SBQL interpreter evaluates the key expression street. Accordingly, the street objects i34 and i64 containing key values are returned. This procedure allows inserting the two non-key objects into the idxAddrStreet index and building the R-IUT, NP-IUT and NK-IUT triggers. Nevertheless, it is insufficient for finding the objects which should be equipped with a K-IUT, because a key expression can be arbitrarily complex and is not limited to a path expression. However, the enhancement to the run-time binding manager enables finding those objects during the calculation of a key value.
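The binding-manager extension can be sketched as a subclass that behaves identically but records every identifier it resolves. The sketch below flattens SBQL's stack-based binding into a single name-to-identifier map, which is a deliberate simplification; the class names are illustrative:

```python
# Sketch of the binding-manager extension: a subclass records every
# object identifier resolved during name binding, yielding the set of
# objects that should carry K-IUTs. Real SBQL binding works against the
# ENVS stack; a flat dictionary stands in for it here.

class BindingManager:
    def __init__(self, environment):
        self.environment = environment   # name -> object identifier

    def bind(self, name):
        return self.environment[name]

class LoggingBindingManager(BindingManager):
    """Same behaviour, plus a log of identifiers bound along the way."""

    def __init__(self, environment):
        super().__init__(environment)
        self.bound = []

    def bind(self, name):
        oid = super().bind(name)
        self.bound.append(oid)   # candidate for a K-IUT
        return oid

# Evaluating the key path worksIn.Dept.address.city for the i61 Emp
# (identifiers as in Fig. 4.8) logs the whole access path:
env = {"worksIn": "i70", "Dept": "i131", "address": "i133", "city": "i134"}
manager = LoggingBindingManager(env)
for name in ["worksIn", "Dept", "address", "city"]:
    manager.bind(name)
```

Because the subclass overrides only `bind`, the rest of the interpreter is untouched, which corresponds to the claim that only the run-time binding manager required modification.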
All objects accessed during the evaluation of a key expression by the SBQL interpreter occur in binding operations. Moreover, this enhancement allows finding aggregate objects which implicitly facilitate binding; such objects can also be useful in improving the performance of index updating (cf. section 4.3.5). To conclude the example: as a result, IUTs have been generated according to Fig. 4.5. The next section discusses more complex examples concerning K-IUTs in order to present the versatility of the proposed approach.

4.3.4 Examples of Update Scenarios

In order to trace example scenarios of index updating, let us refer to the sample store in Fig. 3.2. In the examples presented below, object and method identifiers are the most important. The classes PersonClass and StudentClass occur during the nesting operation; however, in the presented examples they do not affect the binding operation. We assume that all examples are correct, so no run-time errors occur during evaluation. In particular, the left operand of an assignment expression always returns precisely one object to be modified.

4.3.4.1 Conceptual Example

The given statement updates the age attribute of the Person object whose surname is equal to "Kuc":

(Person where surname = "Kuc").age := 31

According to the store state depicted in Fig. 3.2, the left operand of the assignment returns the age attribute with the identifier i34. The interpreter sends a message to the ODRA database CRUD mechanism to update the value of the i34 integer attribute to 31. Before the update operation, the CRUD mechanism checks for IUT triggers connected with the attribute being modified. Let us assume that, according to Fig. 4.3, there is a K-IUT described by the following properties: < index: idxPerAge, non-key object: i31 > associated with the object i34. Consequently, the index updating mechanism is triggered to calculate the key value used to access the i31 object in the idxPerAge index.
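The overall shape of servicing such a write, computing the key before and after the CRUD change, then moving the index entry and revising K-IUTs, can be condensed into a sketch. The index is simplified to a dictionary from key value to a set of non-key identifiers, and the key-computation callback also reports the identifiers it touched (mimicking the logged binding); all names are illustrative:

```python
# Sketch of servicing one modification: compute the key (and the set of
# objects read) before and after the CRUD change, fix the index entry,
# and report which K-IUTs to remove and to add. Simplified model.

def service_modification(index, nonkey_oid, compute_key, apply_change):
    old_key, old_deps = compute_key()   # phase 1: before the alteration
    apply_change()                      # the actual CRUD modification
    new_key, new_deps = compute_key()   # phase 2: after the alteration

    # move, remove or insert the entry for the non-key object
    if old_key is not None:
        index[old_key].discard(nonkey_oid)
        if not index[old_key]:
            del index[old_key]          # outdated entry removed
    if new_key is not None:
        index.setdefault(new_key, set()).add(nonkey_oid)

    # K-IUT revision: objects leaving / entering the key computation
    return old_deps - new_deps, new_deps - old_deps

# The age := 31 scenario over illustrative store contents:
idx_per_age = {30: {"i31"}}
store = {"i34": 30}

def compute_key():
    if "i34" in store:
        return store["i34"], {"i34"}
    return None, set()

def apply_change():
    store["i34"] = 31

removed, added = service_modification(idx_per_age, "i31",
                                      compute_key, apply_change)
```

Here the same single object influences the key before and after, so both revision sets come back empty, matching the behaviour described for this scenario.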
It is important that, additionally, the objects affecting the key value are identified during this step. To obtain the key value, the updating routines initialise a new SBQL interpreter instance with an empty ENVS and perform the following operations:
1. A reference to the i31 object is put onto the QRES and the nested operation is performed.
2. New frames are created on the ENVS. The lowest stack section contains the components of classes according to the inheritance hierarchy: first PersonClass, followed by EmpClass and StudentClass (their order is not predictable because of multiple inheritance) and, above them, EmpStudentClass. The top ENVS frame is filled with the subobjects of the i31 EmpStudent object.

Fig. 4.7 Calculating the idxPerAge index key value for the i31 object

Next, the interpreter proceeds to evaluate the idxPerAge index key expression age. The evaluation steps are shown in Fig. 4.7. The bind operation with the age name parameter is performed and the i34 attribute is put onto the QRES. During binding, the i34 identifier is stored by the index updating mechanism, as it has influenced the value of the key. The key value is obtained by dereferencing the i34 attribute. The index updating mechanism uses the non-key object i31 and the calculated key value to locate the idxPerAge index entry corresponding to i31. This is necessary for modifying the index after updating the key value for the given EmpStudent object. Now the CRUD mechanism can alter the i34 attribute and assign it the new value 31. After the age update, the process of calculating the key value is repeated; in this case it does not differ from the preceding one presented in Fig. 4.7. The index updating mechanism uses all the gathered information to:
• Update the idxPerAge index: the entry for the non-key object i31 with the key value 30 is adjusted to the new key value 31.
• Revise the IUTs for the idxPerAge index and the non-key object i31: both before and after modifying the i34 attribute, the index updating mechanism identified i34 as the only object influencing the key value. Since the K-IUT < index: idxPerAge, non-key object: i31 > associated with this object is still valid, no changes are made to the existing IUTs.
This finishes the index updating routines for the example presented above.

4.3.4.2 Path Modification

The next, more complex example reassigns the Emp object whose surname is equal to "Kowalski" to the HR department by updating the worksIn reference:

(Emp where surname = "Kowalski").worksIn := ref Dept where name = "HR"

This operation assigns the i141 Dept object reference to the i70 worksIn attribute. If the idxEmpWorkCity index exists, CRUD finds a K-IUT described by the following properties: < index: idxEmpWorkCity, non-key object: i61 > associated with the object i70. Therefore, before CRUD proceeds to modify the value of the worksIn attribute, the routines presented in Fig. 4.8 are performed in order to calculate the corresponding key value, i.e. the value of the i134 object – "Opole", and to identify the objects which compose the key, i.e. the identifiers occurring during binding: i70 worksIn, i131 Dept, i133 address and i134 city (written in green in Fig. 4.8).

Fig. 4.8 Calculating the idxEmpWorkCity index key value before the update

The CRUD mechanism updates the value of the i70 worksIn attribute.

Fig. 4.9 Calculating the idxEmpWorkCity index key value after the update

From the point of view of the index updating mechanism, this modification introduces significant changes in the evaluation of the key expression. As can be seen in Fig. 4.9, not only has the key value changed to "Kraków", i.e. the value of the i144 city attribute, but the set of identifiers of the objects which affect the key value is also different: i70 worksIn, i141 Dept, i143 address and i144 city. The index updating mechanism uses all the gathered information to:
• Update the idxEmpWorkCity index: the entry for the non-key object i61 with the key value "Opole" is adjusted to the new key value "Kraków".
• Revise the IUTs for the idxEmpWorkCity index and the non-key object i61: both before and after modifying the i70 attribute, the index updating mechanism identified that the i70 object influences the key value. However, the objects i131, i133 and i134 no longer affect the key value, and therefore the K-IUTs < index: idxEmpWorkCity, non-key object: i61 > associated with them are removed. On the other hand, the objects i141, i143 and i144 now assist in computing the key value, so K-IUTs < index: idxEmpWorkCity, non-key object: i61 > are assigned to them.
It is important to note that modifying the worksIn attribute of the i61 Emp should be followed by suitable changes in Dept objects to ensure consistency between the worksIn and employs references. However, this is not an issue of automatic index updating but rather of a particular database application.

4.3.4.3 Keys with Optional Attributes

The given example consists of two statements removing and adding a zip attribute of an address subobject of the Emp object whose surname is equal to "Kowalski". It shows the main idea of how automatic index updating deals with the deletion and creation of objects. The following statement causes the database CRUD mechanism to remove the i69 object:

delete((Emp where surname = "Kowalski").address.zip)

Let us assume that the idxPerZip index is created; hence, before the deletion, the index updating mechanism finds a K-IUT described by the following properties: < index: idxPerZip, non-key object: i61 > associated with the object i69.
The index updating mechanism calculates the key value corresponding to the non-key object (the states of the SBQL stacks during the evaluation are presented in Fig. 4.10), i.e. the value of the i69 object – 99999. Additionally, the objects which influence the key value are identified, i.e. the identifiers occurring during binding: the i66 address and the i69 zip. Because of the removal, the latter identifier will not be further considered by the index updating mechanism during the revision of update triggers.

Fig. 4.10 Calculating the idxPerZip index key value before removing the zip attribute

The CRUD mechanism deletes the i69 zip attribute together with all associated IUTs. Consequently, a successful evaluation of the key value is no longer possible (as depicted in Fig. 4.11). Despite the lack of a key value, the index updating mechanism finds the identifiers of the objects which are used during the key value calculation, i.e. the i66 address.

Fig. 4.11 Calculating the idxPerZip index key value without the zip attribute

The index updating mechanism uses the gathered information to:
• Update the idxPerZip index: the entry for the non-key object i61 is removed.
• Revise the IUTs for the idxPerZip index and the non-key object i61: both before and after the modification, the i66 attribute influences the key value. The i69 zip attribute no longer affects the key value; however, the K-IUT < index: idxPerZip, non-key object: i61 > associated with it was already removed during the deletion. As a result, no other changes are made to the existing IUTs.
Let us now analyse how automatic index updating deals with inserting a new zip attribute into the address object:

(Emp where surname = "Kowalski").address :<< zip(99726)

The index updating mechanism finds a K-IUT described by the following properties: < index: idxPerZip, non-key object: i61 > associated with the object i66.
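Under the simplified dictionary model used earlier, the net effect of the two statements of this subsection can be replayed as follows; the point is that a non-computable key value simply means the non-key object has no entry in the index:

```python
# Optional zip attribute replayed over a simplified model: when the key
# expression yields no value, the non-key object has no index entry.

idx_per_zip = {99999: {"i61"}}
store = {"i69": 99999}            # the optional zip attribute

# delete(...address.zip): the attribute and the index entry disappear
old_key = store.pop("i69")
idx_per_zip[old_key].discard("i61")
if not idx_per_zip[old_key]:
    del idx_per_zip[old_key]      # idx_per_zip is now empty

# address :<< zip(99726): a key value is computable again, so an entry
# for the non-key object i61 is (re)inserted
store["i104"] = 99726             # new zip attribute, identifier i104
idx_per_zip.setdefault(store["i104"], set()).add("i61")
```

The symmetric handling of "no key before" and "no key after" is what lets one mechanism cover both deletion and creation of optional key attributes.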
Before the insertion, the key value corresponding to the non-key object is calculated. The state of the key value has not changed, so, identically as in Fig. 4.11, no key value is found. During the key value computation the i66 address object is used. The CRUD mechanism creates the new zip attribute with the value 99726 and the identifier i104, and inserts it into the i66 address object. The index updating mechanism proceeds to the evaluation of the key expression according to the steps presented in Fig. 4.12. The objects which influence the key value are identified, i.e. the i66 address object and the new i104 zip object.

Fig. 4.12 Calculating the idxPerZip index key value after inserting the zip attribute

Consequently, the index updating mechanism:
• Updates the idxPerZip index: an entry for the non-key object i61 with the key value 99726 is added.
• Revises the IUTs for the idxPerZip index and the non-key object i61: both before and after inserting the new zip attribute, the index updating mechanism identified that i66 influences the key value. Additionally, the i104 object is assigned the K-IUT < index: idxPerZip, non-key object: i61 > because it participates in computing the key value.

4.3.4.4 Polymorphic Keys

The following example concerns the idxEmpTotalIncomes index, whose key is based on the getTotalIncomes() method, which is polymorphic depending on the class of the non-key object. For EmpClass instances, getTotalIncomes() returns the value of the salary attribute:

return deref(salary)

whereas for EmpStudentClass it also takes the scholarship attribute into consideration:

return deref(salary) + deref(scholarship)

The statement below updates the salary attribute of the Emp object whose surname is equal to "Kowalski":

(Emp where surname = "Kowalski").salary := 2000

As a result of the evaluation, the SBQL interpreter sends a message to the CRUD mechanism to modify the i71 salary attribute of the i61 Emp object.
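The polymorphic key can be mimicked in ordinary object-oriented code: the overridden method reads a different set of attributes per class, which is exactly why the set of K-IUT-carrying objects differs per class. A sketch under assumed attribute values (the salary of 1000 is chosen so that the EmpStudent key matches the 2500 of the example below):

```python
# Sketch of the polymorphic key: get_total_incomes() is overridden in
# the subclass, so the objects read while computing the key differ.

class Emp:
    def __init__(self, salary):
        self.salary = salary

    def get_total_incomes(self):
        return self.salary            # reads only the salary attribute

class EmpStudent(Emp):
    def __init__(self, salary, scholarship):
        super().__init__(salary)
        self.scholarship = scholarship

    def get_total_incomes(self):
        # also reads the scholarship attribute, as in EmpStudentClass
        return super().get_total_incomes() + self.scholarship

kowalski = Emp(salary=1200)       # key value before the update: 1200
kowalski.salary = 2000            # after salary := 2000, the key is 2000

student = EmpStudent(salary=1000, scholarship=1500)   # key: 2500
```

Because the mechanism logs what was actually bound at run time, it needs no static knowledge of which override will be dispatched.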
Before the modification is executed, the index updating mechanism finds the K-IUT < index: idxEmpTotalIncomes, non-key object: i61 > associated with the i71 salary attribute. The key value for the non-key object is computed and amounts to 1200. The binding operations performed during the evaluation, presented in Fig. 4.13, indicate that the objects which influence the key value are i14 getTotalIncomes, i.e. the EmpClass procedure object, and the i71 salary object. The procedure identifier (written in red) can be discarded by the index updating mechanism during the revision of update triggers.

Fig. 4.13 Calculating the idxEmpTotalIncomes index key value for the i61 object before the update

After the salary attribute update, the second calculation of the key value is similar to the one presented above in Fig. 4.13; only the final value changes to 2000. Considering this information, the index updating mechanism:
• Updates the idxEmpTotalIncomes index: the entry for the non-key object i61 is adjusted to the new salary key value 2000.
• Revises the IUTs for the idxEmpTotalIncomes index and the non-key object i61: the same IUTs were identified before and after the update, hence no changes are made.
Let us consider how automatic index updating deals with an EmpStudentClass instance. The given statement updates the scholarship attribute of the EmpStudent objects whose age is equal to 30:

(EmpStudent where age = 30).setScholarship(1500)

According to the sample schema in Fig. 3.2, the invocation of the setter method setScholarship() updates the i39 scholarship attribute of the i31 EmpStudent object. Again, before the modification, the key value for the idxEmpTotalIncomes index is calculated. The interpreter routines depicted in Fig. 4.14 show that the objects i24 getTotalIncomes (the EmpStudentClass procedure object), the i41 salary attribute and the i39 scholarship attribute influenced the key value. The procedure identifier can be discarded during the revision of update triggers.

Fig. 4.14 Calculating the idxEmpTotalIncomes index key value for the i31 object before the update

After the scholarship attribute update, the second calculation of the key value is similar to the one presented above in Fig. 4.14; only the final steps differ (see Fig. 4.15). In order to conclude the CRUD operations, the index updating mechanism:
• Updates the idxEmpTotalIncomes index: the entry for the non-key object i31 is adjusted to the new key value 2500.
• Revises the IUTs for the idxEmpTotalIncomes index and the non-key object i31: the same IUT triggers were identified before and after the update, hence no changes are made.

Fig. 4.15 Last steps of computing the idxEmpTotalIncomes index key value for i31 after the update

Due to its generality, the proposed approach presented in the examples above is capable of dealing with the updating of indices with even more complex keys. Extending this solution to support AS2 and the subsequent abstract store types (which introduce dynamic inheritance and encapsulation, as depicted in section 3.1.2) does not require significant changes.

4.3.5 Optimising Index Updating

The presented solution to index updating is universal and versatile; however, without optimisations it can cause unnecessary performance deterioration, particularly in simple updating cases. In the most common scenario, a key value is defined by a path expression (e.g. the indices idxPerAge, idxPerZip, idxEmpDeptName, idxEmpWorkCity). Often, alterations concerning an indexed object's key update the very object which holds the key value. Such an object could be equipped with a different type of trigger, the Index Key Value Update Trigger (KV-IUT), instead of the K-IUT.
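The benefit of a KV-IUT is that the written value is itself the new key, so the entry can be moved directly, with no interpreter run and no trigger revision. A sketch (function and variable names are illustrative; the old value "Kraków" is taken from the path-modification example, where the HR department's city was "Kraków"):

```python
# Sketch of the KV-IUT fast path: the written value *is* the new key
# value, so the entry is moved directly; no key re-evaluation and no
# K-IUT revision are needed. Names are illustrative.

def on_kv_iut_write(index, nonkey_oid, old_value, new_value):
    index[old_value].discard(nonkey_oid)
    if not index[old_value]:
        del index[old_value]
    index.setdefault(new_value, set()).add(nonkey_oid)

# Changing the HR department's city, with the i31 non-key object indexed
# under the old city value:
idx_emp_work_city = {"Kraków": {"i31"}}
on_kv_iut_write(idx_emp_work_city, "i31", "Kraków", "Warszawa")
```

Contrast this with the general K-IUT path, which would re-run the query worksIn.Dept.address.city twice and recompute the trigger sets.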
Modifying the value of an object equipped with this trigger does not require using the SBQL interpreter to recalculate the key value; moreover, revising IUTs is also unnecessary. This significantly simplifies the index updating mechanism. For example, given the database state presented in Fig. 3.2 and the idxEmpWorkCity index, the following statement changing the city of the HR department to Warszawa:

(Dept where name = "HR").address.city := "Warszawa"

would execute the KV-IUT associated with the i144 city object and the i31 EmpStudent non-key object. However, in order to obtain the key value, instead of executing the query worksIn.Dept.address.city in the context of the non-key object, the index updating mechanism directly dereferences the city object; moreover, revising K-IUTs is skipped.
The next optimisation takes advantage of aggregate objects, which are used to model a collection of objects with the same name and type. Aggregate objects are a physical optimisation for searching subobjects, used when the cardinality of a subobject is not singular. Instead of multiple subobjects with the same name, the parent object contains one aggregate subobject. Calling all subobjects by their common name is achieved through the mediation of the aggregate, although the aggregate is their direct parent. If similar IUTs refer to such a collection of objects, then their aggregate parent can be equipped with an identical IUT; consequently, it can be automatically propagated to newly created children of the aggregate object. For example, let us consider the idxPerAge index and adding a new Person object to the database. The new Person object is not added directly to the database's root, but to a Person aggregate object. Therefore, the correct NK-IUT is simply propagated from the aggregate and does not need to be generated by the R-IUT, which is more complex since it requires an additional verification procedure.
In the current implementation, the index updating mechanism works within the range of an atomic database CRUD operation. Still, even a single statement can often cause several changes to the database. In many cases it would be optimal to gather the necessary information during the execution of a series of atomic operations and to delay index updating and index update trigger revision to the very end of the complex operation. This would, however, require cooperation with the database transaction mechanism, which is still under development in the implemented prototype. The following example statement:

(Dept where name = "HR").address := ("Warszawa" as city, "Koszykowa" as street)

consists of four atomic CRUD operations: first the deletion of the i144 city and i145 street objects, and next the creation of new city and street objects. In the case of the idxEmpWorkCity index, this results in running index updating at least three times for the i31 EmpStudent non-key object (only the deletion of the i145 street object is not connected with K-IUTs). However, it would be more efficient (approximately three times faster) to execute the index updating mechanism only twice: before the first atomic deletion, to gather the necessary information, and after completing the creation of the objects. The depicted lazy index updating strategy would be optimal if most of the index maintenance routines occurred in the database's idle time, i.e. after the execution of a data modifying statement but before the next index invocation.
The last approach to the optimisation of index updating concerns not only efficiency but also decreasing the database's load by removing unnecessary IUTs. If identical K-IUTs refer to a collection of identical subobjects, then only their aggregate parent can be equipped with the K-IUT instead. This solution would reduce the space occupied by IUTs in the case of more complex indices (e.g. in the idxDeptYearCost index, the employees of a department are accessed using references inside the employs aggregate object); nevertheless, the index updating mechanism then has to additionally check for update triggers of the parent aggregate objects. After introducing the transaction mechanism, together with aggregate objects, maintaining some of the IUTs would become unnecessary. This method must precisely consider the architecture of the database's store and the properties of the object-oriented query language. In the current implementation, aggregate objects are automatically created for a complex object containing subobjects with a cardinality different from singular. Therefore, a statement cannot create a new direct child of an existing complex object unless the appropriate subobject was deleted earlier within the processing of the given statement. A K-IUT connected with such complex objects would not be necessary, because a similar trigger responsible for the preparation of the index update mechanism would already have been started during the deletion of the subobjects of the complex object. For instance, in the previous example concerning the idxEmpWorkCity index and modifying the address of the HR department, deleting the i144 city subobject should initialise the index updating mechanism. Therefore, the K-IUT associated with the i143 address object would not be necessary. Similarly, for the given i31 EmpStudent non-key object, the K-IUT for the i141 Dept object can also be omitted.
The majority of the optimisations sketched above and proposed by the author are implemented in the ODRA database prototype. Modifications which take advantage of the transaction mechanism are planned to be implemented together with the development of transactions in ODRA. Another source of potential optimisations concerning index maintenance for indices based on path expressions is the research literature, e.g. [10, 11, 12].
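The lazy strategy described above amounts to making each atomic operation cheap (just record what was touched) and running the expensive key recomputation once per complex operation. A hypothetical sketch:

```python
# Sketch of the lazy strategy: atomic CRUD operations only record which
# (index, non-key object) pairs were touched; the expensive key
# recomputation runs once, when the batch is flushed. Illustrative code.

class LazyIndexUpdater:
    def __init__(self, recompute):
        self.recompute = recompute    # callback: (index_name, oid) -> None
        self.dirty = set()

    def mark(self, index_name, nonkey_oid):
        self.dirty.add((index_name, nonkey_oid))   # cheap, per atomic op

    def flush(self):
        # run once at the end of the complex operation (or in idle time)
        for index_name, nonkey_oid in sorted(self.dirty):
            self.recompute(index_name, nonkey_oid)
        count = len(self.dirty)
        self.dirty.clear()
        return count

updater = LazyIndexUpdater(lambda idx, oid: None)
# address := (...): delete city, delete street, create city, create street;
# each atomic operation marks the same (index, non-key object) pair
for _ in range(4):
    updater.mark("idxEmpWorkCity", "i31")
```

The four atomic operations collapse into a single recomputation at flush time, which is the source of the speed-up estimated above; a real implementation would hook `flush` into the transaction commit.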
4.3.6 Properties of the Solution The proposed index updating mechanism meets guidelines depicted in the introduction in subchapter 4.3 and proves several supplementary advantages, i.e.: • each modification to the indexed data is automatically reflected in the appropriate indices contents, • index updating routines do not influence the performance of retrieving information from the database, • index updates are triggered only in case of modifications concerning objects used to access the indexed objects or to determine key values, • modification to a single key value introduces an additional time overhead, which is comparable to the time of calculating the given key value two times and performing modification to index records, • automatic index updating performance can be improved by many optimisations described in subchapter 4.3.5, • basic solution is independent from: o the query language and execution environment (does not require additional routines during compile-time or run-time), o index structure, • generic support for a variety of index definitions (including usage of complex expressions with polymorphic methods and aggregate operators). On the other hand, the proposed solution to index updating issue introduces many additional database structures. Unfortunately, almost every object used to access indexed objects or calculate a key value must be equipped with appropriate IUTs (some exceptions are depicted in subchapter 4.3.5). It is caused by the properties of the SBQL query language that make in many situations difficult or even impossible to predict basing only on an index definition which objects should trigger an index update. Nevertheless, the author does not exclude possibility to develop more optimisation methods for this aspect of index updating process. In particular, such space-preserving optimisations can be easily introduced for very simple indices, e.g. on objects attributes. 
4.3.7 Comparison of Index Maintenance Approaches

The ODRA OODBMS is a proof-of-concept prototype, as is the implemented index updating mechanism. There is a great number of available relational, object-relational and object-oriented databases based on different paradigms and exploiting different index structures for diverse applications. Those systems are generally devoid of detailed efficiency comparisons with other existing solutions in the aspect of maintaining index cohesion. A fair comparison of approaches can be conducted by considering the general properties of index maintenance and its influence on the capabilities of database indexing. Thus, a comparison of efficiency between the proposed solution and the solutions applied in other systems is omitted. An overview of the indexing features of many existing products and some prototype approaches was given earlier in subchapters 2.3, 2.4 and 2.5.
The routines responsible for index maintenance in relational and object-relational databases are straightforward and therefore simple. An undoubted advantage of the index updating approach in the majority of relational databases is the economical usage of the data store. The information necessary for the mechanisms maintaining cohesion between data and indices is associated with table columns, as it is identical for each row. The quantity of such information is therefore independent of the quantity of data stored in the tables (i.e. the number of table rows). Similarly, object-oriented databases associate automatic index updating mechanisms with a whole collection or a class rather than with an object. In contrast, in the implemented solution for the ODRA object-oriented database, IUT triggers are in many cases written together with complex objects and with atomic objects containing values. Fortunately, the majority of the database's store space is occupied by the data rather than by the redundant information.
This situation is acceptable considering that nowadays databases administer a very large amount of memory (or disk) space.

The proposed implementation of fully transparent indexing in the Stack-Based Approach enables the creation and automatic maintenance of indices with keys defined using arbitrary deterministic expressions, including method invocations (also polymorphic methods and aggregate functions), e.g.:
• idxEmpDeptName – key based on the worksIn.Dept.name path expression,
• idxDeptYearCost – key based on the sum(employs.Emp.salary) * 12 expression,
• idxEmpTotalIncomes – key based on the Emp class method getTotalIncomes(), which is overridden for instances of the EmpStudent class.

As stated in subchapter 2.5, properties of a query language (e.g. SQL), a lack of appropriate object-oriented extensions, or a primitive approach to index maintenance limit the definition of advanced indices. For these reasons, most of the advanced object-relational transparent indexing approaches, including SQL Server (computed columns) and Informix (functional indexes), do not provide sufficient support for introducing indices of complexity similar to the ones presented above. Similarly, the IBM DB2 Universal Database, in spite of offering Index Extensions, which are a very powerful indexing tool, also does not provide a sufficiently transparent solution. Among OODBMSs, only the GemStone products enable indices based on a path expression like the idxEmpDeptName index. The Oracle function-based index feature, despite lacking support for path-expression-based indices, provides facilities for creating an index similar to the idxEmpTotalIncomes index. The test conducted in section 2.5.1 used the schema in Fig. 2.3, partially corresponding to the object store in Fig. 3.1. The created emp_gettotalincomes_idx Oracle index is based on an analogous polymorphic method.
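The idxEmpTotalIncomes example above hinges on indexing over a polymorphic method. A minimal Python analogue, modelled on the thesis's Emp/EmpStudent classes (attribute values and the scholarship field are invented for illustration), shows how two objects of different dynamic types contribute key values through their own overrides:

```python
class Emp:
    def __init__(self, surname, salary):
        self.surname, self.salary = surname, salary

    def getTotalIncomes(self):            # yearly incomes of a regular Emp
        return self.salary * 12


class EmpStudent(Emp):
    def __init__(self, surname, salary, scholarship):
        super().__init__(surname, salary)
        self.scholarship = scholarship

    def getTotalIncomes(self):            # override: scholarship included
        return super().getTotalIncomes() + self.scholarship


# an idxEmpTotalIncomes-like index keyed on the polymorphic method result
emps = [Emp("Nowak", 2000), EmpStudent("Kuc", 1000, 3000)]
idx_emp_total_incomes = {}
for e in emps:
    idx_emp_total_incomes.setdefault(e.getTotalIncomes(), []).append(e.surname)
```

The key expression is deterministic for a given database state, which is what makes automatic maintenance of such an index feasible.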
The disadvantage of Oracle's equivalent of the discussed index concerns its influence on database performance. Index updates occur on any modification of the indexed table, not only those concerning columns used to determine key values. The attempt to introduce in Oracle an index dept_getyearcost_idx corresponding to the idxDeptYearCost index was unsuccessful. Modifications to a table with data about employees, which were used to calculate the index key, caused dept_getyearcost_idx to lose cohesion with the data. No similar errors concerning the maintenance of the idxDeptYearCost index occur in the ODRA implementation.

Advanced approaches to indices based on path expressions are described in many research documents, e.g. [10, 43]. The index maintenance issue is usually solved by preserving additional information inside the index structure, which enables efficient and correct index updating. To the best of the author's knowledge, the implemented solutions concerning transparent index maintenance presented in the research literature or incorporated in commercial products apply to a restricted family of index definitions and cannot be considered generic. An unimplemented but generic solution for the maintenance of function-based indexes is defined in [46]. Similarly to the ODRA implementation, its index updating information is connected with objects associated with indices. In contrast to all the solutions to the automatic index updating issue presented above, the author's approach based on Index Update Triggers implemented in ODRA provides transparent, complete and generic support for a wide variety of index definitions. Moreover, the additional data modification cost associated with index maintenance concerns exclusively objects used to access the indexed objects or to determine a key value. One can argue about the increased storage cost caused by IUTs.
Nevertheless, as shown in [10, 11, 12], the maintenance of indices defined using complex expressions requires introducing a lot of additional information into the index structure (not only entries mapping the values pointed to by path expressions to indexed objects). Another advantage of the author's IUTs set on objects used to determine a key value is that they include a direct reference to the indexed object, whereas other solutions [10, 43] are often forced to identify it indirectly (e.g. by reverse navigation methods, or by accessing the key value first and looking up the indexed object in the index).

4.4 Indexing Architecture for Distributed Environment

The different aspects of indexing presented in this chapter form a complete architecture for local index management and maintenance. In subchapter 2.6 the local indexing strategy is explained. It relies entirely on the local indexing architecture and on general optimisation methods for distributed query processing (i.e. global query decomposition). Therefore, analysis of this strategy is considered straightforward and is omitted in this subchapter.

The discussed global indexing architecture concerns homogeneous, horizontally fragmented data at the integration schema level. It is a currently developed approach (in the ODRA prototype) to the integration of distributed resources. The integration schema describes how data and services residing on local servers are to be integrated. It consists of individual schemas. The idea of a schema is a combination of an interface known from object-oriented programming languages and a typical database schema. A schema is an abstract description specifying objects with attributes and methods that must be provided by a group of servers contributing to the given schema. Nonetheless, local servers implementing the schema retain wide autonomy. Contributed objects can be either materialised or virtual, using SBA views. They can contain additional attributes and methods not included in the schema.
Moreover, contributing servers provide their own implementations of object methods and can transparently take advantage of inheritance and polymorphism. Generally, schemas enable type-safe querying of integrated horizontally fragmented data. A query addressing an integration schema is decomposed into parts referring to individual schemas, and the appropriate subqueries are sent to servers to be evaluated locally in parallel. The local evaluation differs depending on the local schema implementation.

According to the taxonomy presented in [124], the global indexing strategy proposed in this subchapter corresponds to a Non-Replicated Index with Index Partitioning Attribute indexing schema. This is the result of the following factors:
• the selected distributed index structure – i.e. the basic SDDS variant – does not replicate parts of the index on different servers,
• in the global indexing strategy, data partitioning and index partitioning are orthogonal.

The data integration approach does not imply any concrete data partitioning method; hence, a distributed index can spread over a greater number of servers than the data. In that context, the described indexing schema is not entirely compatible with the presented taxonomy. Similarly, the centralised indexing strategy is not taken into consideration in that taxonomy. More advanced and complex data integration, e.g. involving mixed fragmentation, data heterogeneity and replication, can be implemented on top of the presented integration schemas using updateable views (see subchapter 3.5). Such solutions are described in works such as [68] and in many research papers, e.g. [2, 39, 60, 61], including those contributed by the author [63, 64, 131]. The next section discusses the proposed approach to index management and maintenance in a distributed object-oriented database. To conclude this subchapter, an example of indexing in a global schema is presented.
4.4.1 Global Indexing Management and Maintenance

Let us consider creating a global index defined on a schema, addressing a horizontally fragmented collection stored on several servers (contributing sites). First, an appropriate index structure is created. Stored non-key values consist of an indexed object reference together with information on its origin, i.e. the contributing site identifier. A global index can be centralised, i.e. located on one server, or distributed between several indexing sites over the database. Regardless of the indexing strategy, such an index must be made available to many servers. Locally it can be represented by a proxy forwarding index calls. A centralised index communicates with the proxies on servers directly. In the case of a distributed indexing strategy, an individual proxy can forward index calls to an arbitrary indexing site hosting a part of the index. Optimally, a proxy may transparently become a part of the distributed index. Further processing of an index call and communication between indexing sites depend on the particular index implementation. For example, the linear hashing implementation discussed in subchapter 4.1 can be used for centralised indexing. It can be extended to an SDDS distributed index in order to preserve the indexing properties and enable parallel processing.

The next step in creating the global index is its registration. Subchapter 4.2 proposes an organisation of index management which can be applied in its entirety to indices at the integration schema level. The auxiliary information provided by the index manager, which is needed by the index optimiser and the static evaluator, is used in the same way as in the case of local indexing. The main difference lies in the fact that information about global indices must be replicated, together with the integration schema, on all servers which can utilise it.
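The proxy arrangement above can be sketched in a few lines of Python. This is only an illustration of the call-forwarding idea, under the stated simplifications: the classes `CentralIndex` and `IndexProxy` are hypothetical, the "remote" link is an in-process reference rather than a network connection, and object references are plain strings.

```python
class CentralIndex:
    """Centralised global index: key -> [(site_id, object_ref), ...].
    Non-key values carry the contributing site identifier."""
    def __init__(self):
        self.entries = {}

    def insert(self, key, site_id, ref):
        self.entries.setdefault(key, []).append((site_id, ref))

    def lookup(self, key):
        return self.entries.get(key, [])


class IndexProxy:
    """Local stand-in on each server; forwards index calls to the
    centralised index (or, in a distributed variant, to any indexing
    site hosting a part of the index)."""
    def __init__(self, remote):
        self.remote = remote              # a network stub in a real system

    def insert(self, key, site_id, ref):
        self.remote.insert(key, site_id, ref)

    def lookup(self, key):
        return self.remote.lookup(key)


central = CentralIndex()
site_a, site_b = IndexProxy(central), IndexProxy(central)
site_a.insert("Sales", "A", "obj#1")      # each site registers its objects
site_b.insert("Sales", "B", "obj#7")
hits = site_a.lookup("Sales")             # any site sees all origins
```

The same proxy interface would hide whether the backing structure is a single linear-hashing file or an SDDS spread over several indexing sites.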
Obviously, the indices referenced by the index manager can be local proxies enabling communication with the appropriate centralised or distributed index.

Next, the index manager initiates populating the index. According to the author's approach presented in subchapter 4.3, this is connected with the activation of automatic index updating. Again, this mechanism relies mainly on the local index maintenance architecture. It is essential that the currently considered data distribution model disallows storing references to remote objects; therefore, in the presented solution it is assumed that each key value can be calculated within the indexed object's site. As a result, the index manager delegates the activation of index maintenance to the contributing sites, where appropriate Trigger Definitions are created according to the index definition. Next, Index Update Triggers are generated locally and independently. During this operation objects are inserted into the global index. If the local index maintenance routines evaluating non-key or key expressions encounter elements, such as view invocations or links to remote databases, that make automatic index updating impossible, then an appropriate error message is sent to the global index manager and the creation of the global index is cancelled. In conclusion, populating the global index and the subsequent transparent index maintenance are provided mainly locally by the architecture presented in Fig. 4.6, where only the index manager and the database indices are global (in contrast to the case discussed in subchapter 4.3). The final element of the indexing architecture, i.e. the approach to index transparency from the point of view of query processing, is the topic of Chapter 5. The presented solution is general, as it applies equally to indexing on the local and global levels.
4.4.2 Example on Distributed Homogeneous Data Schema

Let us consider a schema describing horizontally fragmented data, presented in the figure below:

Fig. 4.16 Example database schema for data integration

It comprises three interfaces defining what attributes and methods the contributed Person, Emp and Dept collections of objects must contain. Contributing sites have to share data conforming to the given integration schema. The actual schemas of the contributing sites can be distinct. An example database schema that matches the one presented above was introduced in Fig. 3.1. Differences in a local schema, e.g. other collections, inheritance relations between collections, or extra attributes or methods, do not matter as long as the local schema contains the elements required by the integration schema.

Let us consider the creation and operation of an idxEmpDeptName global index using the derived attribute worksIn.Dept.name to retrieve Emp objects. First, an appropriate empty centralised or distributed index structure is initialised and made available among the distributed database servers. Next, it is registered by the index manager, and the information necessary for the index optimiser and static evaluator modules working on queries addressing the integration schema is generated. In the final step of the global index creation, the index manager initialises the automatic index updating mechanisms on the contributing sites. On each site this operation causes the following steps (described in detail in section 4.3.2):
• according to the index definition, Trigger Definitions are created,
• a Root Index Update Trigger is added to the database's root,
• Nonkey Index Update Triggers associated with objects belonging to an Emp collection are generated,
• for each non-key object the key value is calculated, the objects used to determine it are equipped with Key Index Update Triggers, and a corresponding index entry is added to the global index.
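The per-site steps above can be mimicked with a small Python sketch. This is a schematic illustration only: the function name, the dictionary-based object store, and the trigger markers are all invented, and the Key IUTs are attached to the Emp object itself rather than to each object along the key path, as the real mechanism would do.

```python
def activate_index_maintenance(site_db, key_path):
    """Per-site activation sketch: mark a root trigger, tag each
    non-key (Emp) object, navigate the key path to compute the key,
    and emit (key, site_id, ref) entries for the global index."""
    entries = []
    site_db["root_trigger"] = key_path              # Root IUT (placeholder)
    for ref, emp in site_db["Emp"].items():
        emp["nonkey_trigger"] = True                # Nonkey IUT on each Emp
        key = emp                                   # evaluate worksIn.Dept.name
        for step in key_path.split("."):
            key = key[step]
        emp["key_trigger"] = key_path               # Key IUT (simplified)
        entries.append((key, site_db["site_id"], ref))
    return entries


site = {"site_id": "S1",
        "Emp": {"e1": {"worksIn": {"Dept": {"name": "Sales"}}},
                "e2": {"worksIn": {"Dept": {"name": "HR"}}}}}
global_entries = activate_index_maintenance(site, "worksIn.Dept.name")
```

Because worksIn is restricted to a local Dept object, the whole key evaluation stays inside the contributing site, which is exactly the property the next paragraph relies on.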
It is significant that the evaluation of the key expression worksIn.Dept.name in the context of an Emp object can be performed entirely on a contributing site, since the integration model restricts a worksIn reference to point to a local Dept object. This makes the indexing architecture simple and effective. Consequently, changes affecting indexed data are detected locally within a contributing site, and an appropriate global index update command is issued independently by the local index maintenance mechanisms. Finally, the index optimiser does not distinguish between local and global queries when applying the indices available in the schema a query addresses.

Similarly, it is possible to create and utilise in the integration schema depicted in Fig. 4.16 almost all of the indices introduced in section 4.1.2 which apply to Fig. 3.1. The only such global index that cannot be created by the administrator is idxEmpSalary, because in the integration schema Emp objects are devoid of a salary attribute.

There exist several other aspects that an implementation of indexing in a distributed environment should consider. The main problems concern the dynamic joining and disconnecting of contributing or indexing sites and distributed transaction management. However, there exists a variety of solutions addressing those issues that can be applied, e.g. [29, 76].

Chapter 5 Query Optimisation and Index Optimiser

The research on optimisation of SBQL queries resulted in the work [93], which deeply investigates this issue, and in many papers, e.g. [94, 95, 96, 97, 98, 99, 100, 122]. The goal of the developed optimisation methods is similar to that of optimisation in RDBMSs [20, 29, 54, 55]: the original query is processed in order to improve its efficiency by modifying its default evaluation plan while preserving its semantics.
In the implemented approach, query optimisation is achieved through query transformations, mostly efficient, reliable and easy-to-implement query rewriting methods. In contrast to relational optimisers, no other intermediate query representations, e.g. an object-oriented algebra, are applied. The transformation processes are facilitated by static query analysis (sketched in subchapter 3.4). Query optimisation exploits information about the size of the environment stack during the evaluation of query parts in order to:
• equip each non-algebraic operator occurring in a query with the number of the ENVS section it opens,
• assign to each name, when it is bound, the current size of ENVS together with the number of the section where the binding is performed.

Static query analysis also facilitates locating query parts which raise a threat of run-time errors.

One of the most important methods exploiting information from the static analysis is factoring out independent subqueries [95, 97]. Frequently a database query contains a subquery all of whose names are bound in sections different from the one opened by the currently evaluated non-algebraic operator. Such a subquery can be evaluated before this operator puts its section onto ENVS. Consequently, the calculation of this subquery is planned earlier than would result from the original query syntax tree. This operation is vital in optimising the evaluation of non-algebraic operators because it prevents processing the subquery multiple times when its result is always the same. Let us consider the query which retrieves the surnames of the employees who earn as much as the employee with surname "Kuc":

(Emp where salary = (Emp where surname = "Kuc").salary).surname

The SBQL optimiser rewrites it to the following form:

((Emp where surname = "Kuc").salary groupas salaux).
(Emp where salary = salaux).surname

The independent subquery, which determines the salary of the given employee, is factored out and is therefore calculated only once, at the very beginning. Its result is stored inside the salaux binder and is repeatedly accessed by the where clause in order to compare the salaries of all employees.

Other SBQL optimisations, which are also implemented in the ODRA prototype, take advantage of the distributivity property of some SBQL operators (e.g. pushing selection before join), use redundant database structures (e.g. indices, caching) or perform other query transformations (e.g. removing auxiliary names, removing dead subqueries), etc. Some of these methods are discussed in the context of indexing in further sections.

5.1 Query Optimisation in the ODRA Prototype

ODRA (Object Database for Rapid Application development) [2, 119] is a research platform providing database application development tools. The essential features of the prototype are a functional run-time environment integrated with an OODBMS, the SBQL query language, an optimisation framework, etc. This subchapter depicts the internal architecture of the ODRA optimisation framework. Its schema is presented in Fig. 5.1; it contains data structures (dashed-line figures) and program modules (grey boxes). The architecture reflects only the most important components from the point of view of query optimisation and processing. Each ODRA instance can work both as a client and as a server; this subdivision is introduced to increase comprehensibility. A server can service many clients, and a client can communicate with many servers.

Fig. 5.1 also illustrates the general SBQL query processing flow. First, a query is parsed from its textual form into an equivalent query syntax tree. The processing flow concerning the subsequent transformations of the syntax tree proceeds according to the numbers in the schema:

1. Static evaluation adds necessary operators (e.g.
casts and dereferences) and equips the query syntax tree with signatures which facilitate the optimisers.

2. The query syntax tree is processed through the chain of optimisers in an appropriate order. Each optimiser rewrites the query and returns its syntax tree with a current set of signatures. The index optimiser is regarded as one of these optimisers; however, it additionally employs the index manager module.

3. The syntax tree of the optimised and type-checked query is sent for further compilation and evaluation to a suitable ODRA module.

Fig. 5.1 ODRA optimisation architecture [2]

5.2 Index Optimiser Overview

The index optimiser is the main mechanism responsible for reorganising queries in order to take advantage of available indices. It is one of the optimisers which can be used in the query optimisation process. The index optimiser is essential to ensure one of the most important indexing properties – index transparency. During the compilation of ad-hoc SBQL queries or of ODRA modules, which often contain unoptimised queries in procedures, updateable views, generic procedures and class methods, queries are processed by the index optimiser in order to improve their efficiency. Fig. 5.2 illustrates the index optimisation process and all the vital cooperating ODRA elements.

Fig. 5.2 Schema of the index optimiser

The index optimiser's input is a query which has already passed through static evaluation. Therefore, its syntax tree nodes are equipped with signatures containing typing information. The index optimiser adds index calls to the query and performs the necessary modifications. The most important issue concerning all optimisation methods is to preserve query semantics while rewriting, so that the optimisation does not affect the query evaluation result. The transformed query must also preserve typing constraints.
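The three-step flow above (static evaluation, a chain of rewriting optimisers each returning a re-typed syntax tree, then compilation) can be sketched abstractly. Every name in this Python sketch is hypothetical; the syntax tree is reduced to a dictionary and the optimisers to functions that record their rewrites.

```python
# hypothetical miniature of the Fig. 5.1 processing flow
def parse(text):
    return {"kind": "query", "text": text}          # stand-in syntax tree

def static_eval(tree):
    tree["signatures"] = True                       # attach type signatures
    return tree

def index_optimiser(tree):
    tree.setdefault("rewrites", []).append("index")
    return tree

def dead_subquery_remover(tree):
    tree.setdefault("rewrites", []).append("dead-subquery")
    return tree

def compile_query(text, optimisers):
    tree = static_eval(parse(text))                 # step 1
    for opt in optimisers:                          # step 2: optimiser chain
        tree = static_eval(opt(tree))               # signatures kept current
    return tree                                     # step 3: to compilation

tree = compile_query("Emp where salary > 1000",
                     [index_optimiser, dead_subquery_remover])
```

The point of the chain structure is that each optimiser can assume, and must preserve, a fully typed tree.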
The index optimiser communicates with the following ODRA modules:
• Index manager – provides information about the indices set on the database's objects. This information is internally ordered and enables the index optimiser to find indices according to their non-keys as well as their keys.
• Metabase – provides a detailed description of the database's schema. The index optimiser uses information about indices from the metabase to determine whether an index call can substitute for a fragment of the query.
• Cost model – holds statistical information about the properties of database objects' attributes. When choosing between alternative index combinations, the index optimiser uses the cost model to pick the best solution.
• Static evaluator – calculates signatures in a query syntax tree. Each time the index optimiser applies an index, the modified part of the syntax tree is filled with the description of types.

An example scenario of a query syntax tree transformation applied by the index optimiser is shown in Fig. 5.3.

Fig. 5.3 Example optimisation applied by the index optimiser

The given query retrieves persons with the surname "KOWALSKI" who are 28 years old:

Person where ((surname = "KOWALSKI") and (age = 28))

The index optimiser applies the idxPerAge index, which retrieves Person objects according to their age attribute, and rewrites the query to the following form:

$index_idxPerAge(28 groupas $equal) where surname = "KOWALSKI"

Fig. 5.3 shows that first the predicate age = 28 is selected and removed. The index optimiser replaces the where left operand (Person) with an index invocation exactly matching the removed predicate. This transformation preserves semantic equivalence.
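The effect of this rewrite can be reproduced with a small Python sketch: pick the conjunct whose attribute is indexed, fetch candidates through the index, and apply the remaining predicate as a residual filter. The helper `optimise` and the dictionary-based index are invented for illustration, not ODRA code.

```python
people = [{"surname": "KOWALSKI", "age": 28},
          {"surname": "NOWAK",    "age": 28},
          {"surname": "KOWALSKI", "age": 40}]

# an idxPerAge-like index: key (age) -> matching Person objects
idx_per_age = {}
for p in people:
    idx_per_age.setdefault(p["age"], []).append(p)


def optimise(predicates, indexed_keys):
    """Pick the first conjunct whose attribute is an index key;
    the remaining conjuncts stay as residual where-predicates."""
    for i, (attr, value) in enumerate(predicates):
        if attr in indexed_keys:
            return (attr, value), predicates[:i] + predicates[i + 1:]
    return None, predicates


index_pred, residual = optimise([("surname", "KOWALSKI"), ("age", 28)],
                                indexed_keys={"age"})
# i.e. $index_idxPerAge(28 groupas $equal) where surname = "KOWALSKI"
candidates = idx_per_age[index_pred[1]]
result = [p for p in candidates
          if all(p[attr] == value for attr, value in residual)]
```

The index call replaces the full Person scan, while the residual predicate preserves the original selection semantics.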
5.2.1 General Algorithm

The proposed and implemented solution works in the context of where operators (which in SBA are responsible for selection) when the left operand is indexed by key values occurring in the selection predicates of the right operand. However, it is possible to take advantage of indexing also when dealing with another non-algebraic SBQL operator, i.e. the forany quantifier. This case is explained in section 5.5.4.

The general index optimiser algorithm (shown in Fig. 5.4) attempts optimisation for each single where operator or group of nested where operators found in the query. Let us consider optimising the given where branch:

qOBJ where qP1 where qP2 where … where qPn

All subqueries – the left operand qOBJ generating objects for selection and all operands qP1, qP2, …, qPn defining selection predicates – may contain internal where clauses. In the case of the queries qP1, qP2, …, qPn some of the where clauses can be pushed outside the analysed branch using the method of factoring out independent subqueries (described in section 5.6.1), however not all of them. Some selection predicates can contain potentially index-applicable where clauses which partially depend on the main where operator, e.g.:

Emp where ((age as empage). (salary > avg((Emp where age = empage).salary)))

Such selection predicates could be a subject of standalone indexing optimisation. Therefore the operands qP1, qP2, …, qPn should be processed by the index optimiser separately. Where clauses can also be found inside the left operand qOBJ query tree. In this case, no regular index is applicable for the main selection clause (only simple path expressions can define regularly indexed objects, cf. section 4.2.1). However, before optimising where clauses inside qOBJ, the index optimiser should first try to optimise the given branch using other index-related optimisation techniques, e.g. volatile indexing (described in subchapter 7.1).
This order should be preserved so that the qOBJ query remains unchanged during the analysis of the main branch.

Fig. 5.4 Index optimiser algorithm

For each where branch the first object of the analysis is the left operand qOBJ. The query qOBJ has to be completely independent, so the optimiser checks the node signatures of the query to verify that it is bound in the lowest ENVS section of the database (numbered 1). The necessary information is provided by the static evaluator. If there is more than one base section, it is necessary to check whether the binding will be performed in a database section. Next, qOBJ is used as a key for the Nonkey Structures Index, which is maintained by the index manager (a detailed description is given in subchapter 4.2). As a result, the index optimiser gains access to the necessary information concerning the indices set on the left operand. If suitable indices are found, the algorithm proceeds to match selection predicates. If not, it skips to the next where branch.

5.3 Selection Predicates Analysis

The most important and complex index optimiser routines concern the analysis of selection predicates. The analysis directly precedes selecting the best index and rewriting the query. The central part of the algorithm focuses on clauses which consist of one or several nested where operators:

qOBJ where(1st) qP1 where(2nd) qP2 where(3rd) … where(n-th) qPn

The right operand of the first where operator, qP1, defines selection predicates which address objects returned by the query qOBJ. Consecutively, qP2, …, qPn concern the following where expressions. The most frequently used form of a where clause is defined by a single where expression:

qOBJ where qP1

An object for which all the queries qP1, qP2, …, qPn return true passes the selection. First, all objects are confronted with the query qP1.
Those which match the qP1 predicates are passed to the next where expression, and the query qP2 is evaluated. This process is repeated for all where expressions. From the point of view of a single object, the where operator behaves like the conjunction operator && performing short-circuit logical evaluation (known from many programming languages): when the qPi predicates return false, the subsequent predicates qPj (where j > i) are skipped. This property is often used to prevent run-time errors. For example, the following query can be executed without a run-time error:

Person where exists(address.zip) where address.zip = 99726

whereas the query:

Person where exists(address.zip) and address.zip = 99726

will cause a run-time error if at least one Person object does not contain the subattribute zip derived from the address attribute. The SBQL semantics of the and operator assumes that both the left and right operands are always evaluated.

The goals of the index optimiser rewriting are the following:
• Preserve semantic equivalence between the query rewritten by the index optimiser and the original input query, so that their evaluation is identical from the point of view of a database user and of the database's and program's state.
• Optimise selection by reducing the amount of data to be processed. The index optimiser takes advantage of indices by modifying qOBJ and thereby reducing the number of objects evaluated by where operators. Next, it adequately eliminates some selection predicates from the queries qP1, qP2, …, qPn.

Each query qP1, qP2, …, qPn can be represented as a conjunction of predicates, i.e. subpredicates joined with and operators; e.g. qP1 stands for m conjunct subpredicates:

p1,1 and p1,2 and ... and p1,m

In the simplest case, there may be only a single subpredicate.
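The contrast above between chained where operators (short-circuit) and SBQL's and operator (both operands always evaluated) can be mimicked in Python. Person objects are reduced to dictionaries; the zip value follows the thesis example. Note that `strict_and` models the SBQL rule only because Python evaluates both arguments before the call.

```python
people = [{"name": "Ana", "address": {"zip": 99726}},
          {"name": "Bo",  "address": {}}]          # no zip subattribute

# chained where: objects failing exists(address.zip) never reach
# the second predicate, so no error can occur
step1 = [p for p in people if "zip" in p["address"]]
chained = [p["name"] for p in step1 if p["address"]["zip"] == 99726]


def strict_and(left, right):
    """Both operands are evaluated before the conjunction is applied,
    mirroring the SBQL semantics of 'and'."""
    return left and right


failures = []
for p in people:
    try:
        strict_and("zip" in p["address"], p["address"]["zip"] == 99726)
    except KeyError:                   # accessing Bo's missing zip fails
        failures.append(p["name"])
```

This is why the chained form is the safe idiom for optional attributes, and why an optimiser must not reorder predicates in a way that changes which objects reach them.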
Each pi,j (where i ∈ {1, 2, …, n}, j ∈ {1, 2, …, mi} and mi is the number of conjunct subpredicates in the query qPi) is an expression that should return a single boolean literal, true or false. In particular it can be:
• a binary expression based on operators comparing a pair of values, e.g. =, <, >, ≥, ≤,
• a binary expression based on operators working with sets, e.g. in, contains,
• a binary expression based on non-algebraic quantifiers,
• another binary expression, e.g. instanceof,
• one of the unary expressions, e.g. exists, not,
• a disjunction, i.e. a binary expression with the or operator (which may combine many selection predicates),
• another expression not listed above.

The method by which the index optimiser deals with selection predicates based on the or operator is an important issue described in section 5.5.3.

5.3.1 Incommutable Predicates

The first step in identifying usable indices is to find which predicates are able to take part in the optimisation process. This is the most important stage from the point of view of query semantics. To apply a given index, the where clause must contain subpredicates which specify the key-value criteria of the indexing function. The query qOBJ is substituted by the index call and the mentioned subpredicates are removed. Therefore, the evaluation of these subpredicates is moved to the very beginning – the index invocation – and the number of objects evaluated by the where operators decreases. Such an operation is possible due to the commutativity of conjunction known from logic:

(p1 and p2) = (p2 and p1)

However, in the case of SBQL where clauses, not all subpredicates can be freely moved so as to be evaluated before the first where operator. Unfortunately, in some conditions this may lead to a discrepancy between the semantics of the original and the optimised query.
As a result of moving a subpredicate pi,j (which belongs to the i-th where expression), some objects normally evaluated by the i-th and preceding where expressions may be skipped (this is the goal of the optimisation). Usually it is desirable to decrease the amount of data processed by predicates and where operators. Still, there are cases where this must be forbidden, namely when the evaluation of a predicate:
• is not run-time safe,
• produces side effects, i.e. changes to the database or program state.

The first case occurs when an undesired or unpredicted (by a database programmer) state of the database causes a run-time error during the predicate evaluation. This situation is shown in the following example:

Person where address.zip = 99726

The zip attribute has cardinality [0..1]; therefore, if one of the evaluated Person objects does not contain an address.zip attribute, a run-time error will occur. Using an index call, which reduces the number of objects evaluated by such predicates, lessens the threat of a run-time error in the optimised query. Unfortunately, this is semantically incorrect.

The second case concerns predicates which contain side-effect-producing calls to user-defined SBQL procedures, views or class methods. E.g. the predicate calling the getScholarship method of Student objects in the query:

Student where getScholarship() = 1000

should be evaluated for all Student objects according to the query semantics. Nevertheless, if getScholarship() only returns the scholarship attribute and does not introduce any side effects (e.g. incrementing an internal Student object counter of accesses to the scholarship attribute), then the number of Student objects evaluated by this predicate can be decreased. Otherwise such an optimisation may lead to unexpected query behaviour. Both situations described above affect other database optimisers as well and therefore should be identified by common compile-time processes.
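The side-effect case can be demonstrated concretely. In this Python sketch (the `Student` class, the access counter and the scholarship figures are invented, following the thesis example) the index-based rewrite returns the same objects but leaves the database in a different state, so the rewrite is not semantics-preserving:

```python
class Student:
    def __init__(self, name, scholarship):
        self.name, self.scholarship = name, scholarship
        self.access_count = 0                 # internal access counter

    def getScholarship(self):
        self.access_count += 1                # side effect on object state
        return self.scholarship


# original semantics: the predicate method runs for every Student
run_a = [Student("Kuc", 1000), Student("Lis", 500)]
original = [s.name for s in run_a if s.getScholarship() == 1000]

# rewritten with a (hypothetical) scholarship index: the method is
# never called, so the counters never advance
run_b = [Student("Kuc", 1000), Student("Lis", 500)]
idx_scholarship = {1000: ["Kuc"], 500: ["Lis"]}
optimised = idx_scholarship[1000]
```

Both evaluations select the same student, yet the access counters differ between the two runs, which is exactly the discrepancy the optimiser must rule out before moving such a predicate.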
In conclusion, the index optimiser verifies the queries qP1, qP2, …, qPn for run-time unsafe or side-effect-producing predicates. If such a predicate is found in query qPk (the k-th where clause), then no sub-predicates located in queries qPk, …, qPn can be used by the index optimiser; otherwise, the k-th where operator would not always process all the objects that it processes in the original query. After this verification, the index optimiser focuses on the following part of the main where clause:

qOBJ where(1st) qP1 where(2nd) qP2 where(3rd) … where((k-1)-th) qPk-1

and ignores all sub-predicates in queries qPk, …, qPn.

5.3.2 Matching Index Key Values Criteria

After verifying predicates, the index optimiser proceeds to match indices with the selection criteria defined in queries qP1, …, qPk-1. Let us assume that there exist m indices ix1, ix2, …, ixm established on the objects defined by the qOBJ query, each consisting of one or several keys iki,j (the j-th key of the i-th index). Generally, the optimiser processes binary predicates based on the =, <, >, ≥, ≤ and in operators. Such a single predicate pi,j (where i ∈ {1, 2, …, k-1}) consists of two operands and a comparison operator:

left_operand operator right_operand

The index optimiser checks whether left_operand or right_operand defines any of the keys iki,j used for constructing the available indices. If one of the operands matches an index key (the key_operand), then the other is treated as the criterion value (the value_operand). The construction of an index key is described in section 4.2.1. The value_operand is any query processed within a where operator but, in contrast to the key_operand, it must be independent of this operator. For instance, the query:

Person where exists(salary) where salary > (age * 100)

disables applying the index idxEmpSalary set on employees' salaries, because age changes during evaluation for each employee, i.e. it depends on the nearest where operator.
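The verification step of section 5.3.1 above reduces to finding the prefix of where clauses that the optimiser may work with. A minimal sketch (names are illustrative, not the thesis code):

```python
def clauses_available_for_matching(where_clauses, is_safe):
    """Return the prefix qP1..qP(k-1) of where clauses whose sub-predicates
    the index optimiser may use.

    The scan stops at the first clause qPk containing a run-time unsafe
    or side-effect-producing sub-predicate; qPk..qPn are ignored.
    where_clauses is a list of lists of sub-predicates; is_safe is a
    callable classifying a single sub-predicate.
    """
    prefix = []
    for clause in where_clauses:
        if not all(is_safe(p) for p in clause):
            break  # qPk found: nothing from here on may be used
        prefix.append(clause)
    return prefix

safe = lambda p: p != 'unsafe'
usable = clauses_available_for_matching([['p11'], ['p21', 'unsafe'], ['p31']], safe)
```

Here only the first clause survives, even though the third clause is itself safe, because skipping the second where operator would change how many objects the third one processes.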
If a processed predicate meets the conditions described above, the index optimiser updates the information about suitable key values criteria iki,j. All keys support criteria based on the = and in operators. Range criteria (operators <, >, ≥, ≤) require keys defined as range or enum keys (cf. section 4.1.1).
Unary expressions which return a boolean value, such as the exists operator or boolean attributes, are processed as the following, semantically equivalent, simple binary expression:

unary_expression = true

Boolean keys are characterised by weak selectivity and thus are not suitable for constructing single-key indices. However, they are useful in multi-key indexing and therefore unary expression predicates are supported by the index optimiser.

5.3.3 Processing the Inclusion Operator

Generally, both comparison operands have singular cardinality, which is ensured by the verification of predicates described in subchapter 5.3. The only exception concerns the in operator, because its semantics does not constrain cardinality. Let us consider two important variants of the in operator operands, depending on the location of the key_operand.
The first variant is when the left operand defines an index key:

key_operand in value_operand

If the left_operand has the cardinality [0..1], then replacing this predicate with an index call would not be possible, because the in operator returns true if its left operand returns an empty bag. For example, the query:

Person where address.zip in 99726

returns Person objects whose zip code equals 99726 or who have no address.zip attribute at all; however, the index call:

idxPerZip(99726)

returns only objects with the address.zip attribute equal to 99726. Such a transformation would cause a semantic inconsistency. Since the key_operand must have maximal cardinality 1 (cf.
section 4.2.1), the index optimiser should, in the discussed case, accept only a singular cardinality of the left_operand. The cardinality of the right_operand is not relevant, because an index invocation can deal with a collection of alternative key values (section 5.5.1).
The second variant is when the left operand defines a key search value:

value_operand in key_operand

According to the inclusion operator semantics, it returns true when the left operand returns an empty bag. On the other hand, if the left operand returned a collection of several different values, the result would be false, because the key_operand can hold only one value. In both these cases the result of the inclusion does not depend on the key_operand; hence, when the cardinality of the value_operand is not singular, the index optimiser skips processing this predicate. In order to use an index, the value_operand must return a single value. It is worth noticing that in the second variant, when the cardinality of the key_operand is [0..1], there is no threat of a run-time error: indices can be applied and the inclusion operator can be used instead of the equality operator. For example, in the query:

Person where 99726 = address.zip

the evaluation of the predicate may cause a run-time error, whereas the following form of the predicate ensures safe evaluation:

Person where 99726 in address.zip

To conclude this section: the index optimiser can start matching key values criteria defined by the inclusion operator only if the left operand has singular cardinality.

5.4 Role of a Cost Model

Once all predicates are processed, the index optimiser possesses, for all available indices, information about the key values criteria found in the analysed where clause. If the criteria are sufficient for applying an index or several indices, the next step of the optimisation is executed, i.e. choosing the best index (or a combination of indices).
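Returning to the inclusion operator of section 5.3.3, its cardinality rules can be summarised in a small decision function (an illustrative sketch with a hypothetical cardinality encoding, not the thesis code):

```python
def in_predicate_applicable(key_is_left, key_card, value_card):
    """Decide whether an `in` predicate may drive an index call.

    Cardinalities are (min, max) pairs; max=None means unbounded.
    For `key_operand in value_operand` the key must be singular
    ([1..1]): a [0..1] key would make `in` succeed on an empty bag,
    which no index call can reproduce. For `value_operand in
    key_operand` the value must be singular, while a [0..1] key is
    harmless and even prevents run-time errors.
    """
    if key_is_left:                    # key_operand in value_operand
        return key_card == (1, 1)
    return value_card == (1, 1)        # value_operand in key_operand
```

For example, `Person where address.zip in 99726` is rejected (optional key on the left), while `Person where 99726 in address.zip` is accepted.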
The cost model is used to check whether applying an index improves efficiency and to select the most selective index. In some cases using two or more available indices in a single where clause is difficult or impossible. For example, in the query:

Person where surname = ”Nowak” and age = 30

let us assume that both indices idxPerSurname and idxPerAge exist and can be used. After applying one of them, e.g.:

idxPerSurname(“Nowak”) where age = 30

the second one has to be omitted, because the Person name was removed and the left operand of where no longer provides suitable objects for idxPerAge. In the analogous situation using idxPerAge:

idxPerAge(30) where surname = ”Nowak”

applying the idxPerSurname index is impossible. In such cases there exists the possibility of using set intersection, transforming the given query into the following form:

idxPerSurname(“Nowak”) ∩ idxPerAge(30)

However, it is not certain that this operation will reduce the cost of evaluation: the profit of using both indices can be seriously decreased by the necessity of computing the intersection of the partial results. Even in this simple example there are three possible ways to transform and evaluate the query, and it is difficult to decide which of them is optimal in the sense of evaluation time. Often the use of a single index, as in the previous examples, is optimal; therefore, the index optimiser considers only this possibility. The selection can be assisted by a proper model of the query evaluation cost, called a cost model.
Data collections differ in size, selectivity, cardinality, distribution and other logical and physical features; thus, building a complete theoretical model of costs is impossible. The cost model is therefore a heuristic-empirical model which can be approximated through many experiments in a real environment. It can take advantage of all measurable index properties and of the database meta-model.
The following elements of such a model can be taken into account:
• the size of the indexed object sets,
• the index selectivity, i.e. the average number of non-key values returned by a random index invocation,
• the data read time from disk (or another permanent storage),
• the execution time of operators used in the transformed query, e.g. the set intersection operator,
• the selectivity of a condition in a where clause, e.g. expressed as the percentage of selected objects,
• etc.
Many other factors could be considered as well. The publications concerning the optimisation of relational queries [90, 101, 102] give many patterns for building such a model, which can be creatively adapted to a new optimisation approach and a new database environment. A better cost model guarantees better optimisation and, in the case of the index optimisation process, a better selection of the indices to apply. The idea of calculating the index selectivity is used in the implemented solution and is therefore described in the next section.

5.4.1 Estimation of Selectivity

The selectivity is determined using the known concept of reduction factors [102]. For SBA indexing the theoretical basis is presented in [93].
The reduction factor of a selection predicate is the estimated ratio of the objects selected by the predicate to all objects to which the predicate was applied in the query:

qOBJ where pSELECTION_PREDICATE

For the most popular atomic predicates:

key_operand operator value_operand

example reduction factors can be defined as follows [102] (assuming that the values generated by the key_operand are uniformly distributed):
• for key_operand = value_operand:
  1 / valuesCardinalityOf(key_operand)
• for key_operand in value_operand:
  countValuesOf(value_operand) / valuesCardinalityOf(key_operand)
• for key_operand > value_operand and key_operand ≥ value_operand (where both operands are real numbers):
  (HighestValueOf(key_operand) − value_operand) / (HighestValueOf(key_operand) − LowestValueOf(key_operand))
• for key_operand > value_operand (where both operands are integer numbers):
  (HighestValueOf(key_operand) − value_operand) / (HighestValueOf(key_operand) − LowestValueOf(key_operand) + 1)
• for key_operand ≥ value_operand (where both operands are integer numbers):
  (HighestValueOf(key_operand) − value_operand + 1) / (HighestValueOf(key_operand) − LowestValueOf(key_operand) + 1)

where valuesCardinalityOf(key_operand) returns the number of distinct values returned by the query qOBJ.key_operand and countValuesOf(value_operand) returns the number of values returned by the value_operand query.
In practice, estimating a reduction factor is very difficult, particularly in the case of the range operators <, >, ≥, ≤, because the value_operand value can be unknown at compile-time if it is not a literal. Therefore, an average reduction factor of 0.5 has been assumed for atomic range predicates. Similarly, calculating the reduction factor for predicates based on the in operator requires knowledge of the number of elements returned by the value_operand.
Because of the simplified cost model, this number has been assumed to be the constant 5 in all cases.
The cost model must also deal with complex selection predicates, i.e. ones made up of two or more atomic conjunct predicates. The implemented solution generally assumes that the assembled sub-predicates are statistically independent. Consequently, the reduction factor of a complex predicate is calculated as the product of the reduction factors of the atomic sub-predicates it is formed of. For example, the query:

Person where surname = ”NOWAK” and age > 30

retrieves persons named Nowak who are older than thirty. The reduction factor s1 for the atomic sub-predicate surname = “NOWAK” depends on the number of different surnames in the database:

s1 = 1 / valuesCardinalityOf(surname) = 1 / 1000 = 0.001

and the selectivity of the other sub-predicate, age > 30, is assumed to be s2 = 0.5, which gives the reduction factor of the whole complex predicate:

sel = s1 * s2 = 0.001 * 0.5 = 0.0005

Let us assume that two indices could be applied, i.e.

idxPerSurname(“NOWAK”) where age > 30

and

idxPerAge( ]30, ∞] ) where surname = “NOWAK”

where, in the definition of a values range:
• ]minvalue – stands for the exclusive left limit of the defined values range,
• [minvalue – stands for the inclusive left limit of the defined values range,
• maxvalue[ – stands for the exclusive right limit of the defined values range,
• maxvalue] – stands for the inclusive right limit of the defined values range.
The index optimiser should select the first option, because the applied predicate has a smaller reduction factor and is thus more selective.
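The selectivity arithmetic above can be sketched as follows (a simplified model with illustrative function names; the uniform-distribution assumption is taken from the text):

```python
def rf_equal(key_cardinality):
    """key_operand = value_operand: 1 / valuesCardinalityOf(key)."""
    return 1.0 / key_cardinality

def rf_in(value_count, key_cardinality):
    """key_operand in value_operand; value_count is assumed 5 when
    unknown at compile-time, as in the simplified cost model."""
    return value_count / key_cardinality

def rf_range():
    """Atomic range predicate with a non-literal bound: assumed 0.5."""
    return 0.5

def conjunction_rf(factors):
    """Statistically independent conjuncts: multiply their factors."""
    sel = 1.0
    for f in factors:
        sel *= f
    return sel

def most_selective(candidates):
    """candidates: (index_name, reduction_factor) pairs;
    the smallest factor wins."""
    return min(candidates, key=lambda c: c[1])[0]

# The Nowak example: s1 = 1/1000, s2 = 0.5, sel = 0.0005
s1, s2 = rf_equal(1000), rf_range()
sel = conjunction_rf([s1, s2])
choice = most_selective([("idxPerSurname", s1), ("idxPerAge", s2)])
```

Running this reproduces the worked example: sel comes out as 0.0005 and idxPerSurname is preferred over idxPerAge.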
Nevertheless, the optimal solution would be to apply the multi-key index constructed on both keys, surname and age (as a range key):

idxPerAge&Surname( ]30, ∞]; “NOWAK”)

In the author's implementation, the rule that the assembled sub-predicates are statistically independent has one exception, namely when there are two opposing range predicates on the same key. Such a pair of predicates improves selectivity, and therefore the cost model additionally multiplies the obtained reduction factor by a constant, i.e. 0.25. The selection of this value must be heuristic and empirical, as the proper value depends on the size of the range occurring in a processed query. For example, the reduction factor for the predicates in the query:

Person where age >= 23 and age < 28

is calculated in the following way:

sel = s1 * s2 * 0.25 = 0.5 * 0.5 * 0.25 = 0.0625

and the following index can be applied:

idxPerAge( [23, 28[ )

Additionally, in the case of where clauses which consist of sub-predicates combined with the or operator (cf. section 5.5.3), the cost model must enable estimating the selectivity of predicates joined through a union of two or more where expressions. To calculate the reduction factor of such a union, the reduction factors of the predicates in the individual where clauses are summed. Let us analyse the following example. The query:

uniqueref((Person where surname = ”NOWAK”) union (Person where age > 30))

retrieves persons who either have the surname Nowak or are older than 30. The reduction factors of the individual atomic predicates were calculated in the previous examples:

s1 = 0.001 and s2 = 0.5

The selectivity of the union of where clauses with those predicates is:

sel = s1 + s2 = 0.001 + 0.5 = 0.501

In this case, the cost model omits the evaluation cost of the union and uniqueref expressions to simplify the index selection process.
The example query can be transformed to the following form:

uniqueref(idxPerSurname(“NOWAK”) union idxPerAge( ]30, ∞] ))

5.5 Query Transformation – Applying Indices

After successfully selecting an index for optimising the query evaluation, the index optimiser must rewrite the given where clause. A simple example of applying a dense, single-key index was shown earlier in Fig. 5.3. The general idea behind rewriting the query is not complex; however, a few cases and the proposed solutions should be discussed. Some elements presented in this subchapter, like the index invocation semantics, are purely implementation issues.
Generally, in the first step the algorithm generates a proper index call, which is then used to substitute qOBJ, i.e. the left operand of the main where clause. Finally, the unnecessary predicates are removed. Eventually the index optimiser can generate a completely new where clause containing the index invocation and replace the original where clause in the given query. Generating an index call is the most complex task; therefore, knowledge of the proposed syntax is necessary.

5.5.1 Index Invocation Syntax

From the SBQL syntax point of view an index invocation is simply a procedure invocation:

$index_<indexname> ( <key_param_1> [; <key_param_2> ...] )

The number of parameters is equal to the number of index keys. Each key parameter defines a desired value of a key. An index function call returns references to the objects matching the specified criteria. Names used to invoke indices contain the prefix $index_ for two reasons:
• to prevent database users from calling indices explicitly ($ is not accepted by the SBQL parser),
• to make it easier for optimisation developers and testers to identify index calls in optimised queries.
A key parameter expression can define a single value as a criterion.
In that case its evaluation should return an integer, double, string, reference or boolean result, or a reference to such a value. In the author's implementation, a dense key value is passed to an index call as a binder named $equal, created using the groupas operator. For example, in the following call the parameter is a binder containing the integer value 28:

$index_idxPerAge(28 groupas $equal)

Binders are used to increase readability and to make it easier to introduce new types of parameters for index calls. To specify a values range criterion as a key value, the parameter expression should return a structure consisting of four fields:

(<lower_limit>, <upper_limit>, <lower_closed>, <upper_closed>)

where <lower_limit> and <upper_limit> are key values specifying the range, <lower_closed> is a boolean value indicating whether <lower_limit> belongs to the criterion range, and <upper_closed> is a boolean value indicating whether <upper_limit> belongs to the criterion range. The following example invocation of the index idxPerAge&Surname returns references to persons whose age is in the range [23, 28[ and whose surname is “KOWALSKI”:

$index_idxPerAge&Surname( (23, 28, true, false) groupas $range; “KOWALSKI” groupas $equal)

Similarly to single-value key parameters, parameters specifying a range are passed as the value of a binder, named $range. A key parameter can also specify a collection of key values as a criterion. This is done when the key parameter returns a bag of key values:

$index_idxPerAge((25 union 30 union 35) groupas $in)

The binder named $in is used to pass a collection of key values. If the criterion parameter returns an empty bag, then the index call returns an empty bag too.

5.5.2 Rewriting Routines

The majority of the rewriting routines are straightforward and consist of generating an index call and removing the unnecessary predicates, as described earlier.
This section describes rewriting routines for complex combinations of predicates; however, it focuses on the general idea rather than on implementation details.
Firstly, let us discuss the application of an index with a key specifying a range. If the selection predicates of the optimised query specify only one limit of the range (lower or upper), then the second limit is generated automatically, i.e. the smallest or the biggest possible value for the given key. For example, the query:

((sum(Person.age) / count(Person)) groupas auxavg). Person where age > auxavg and surname = “KOWALSKI”

can be transformed by the index optimiser in order to use the idxPerAge&Surname index:

((sum(Person.age) / count(Person)) groupas auxavg). $index_idxPerAge&Surname( (auxavg, 2147483647, false, true) groupas $range; “KOWALSKI” groupas $equal)

If there is more than one predicate, or two opposite predicates, describing the range on the given key, then the min, max, union and comparison operators are used to obtain a correct key range parameter. E.g. the query:

((sum(Person.age) / count(Person)) groupas auxavg). Person where age > auxavg and 23 <= age and age < 28

can be rewritten using an index invocation with a complex key value parameter expression:

((sum(Person.age) / count(Person)) groupas auxavg). $index_idxPerAge( (max(auxavg union 23), 28, 23 > auxavg, false) groupas $range)

The value of the lower limit is the greater of the values auxavg and 23. When 23 is greater than auxavg, the lower limit should belong to the criterion range; otherwise it should not. This is ensured by the <lower_closed> parameter: 23 > auxavg.
In some cases the index optimiser uses an if-then expression to predict whether a given query returns no result, making the index invocation unnecessary, i.e. when the selection predicates are contradictory. This has to be checked, e.g.
when for a given key there exists more than one selection predicate and at least one of them is based on the = or in operator. If any selection predicate contradicts a predicate based on the = or in operator, then the query will return an empty bag. E.g. the query:

((sum(Person.age) / count(Person)) groupas auxavg). (Person where age >= auxavg and 30 = age)

can be transformed into the following form:

((sum(Person.age) / count(Person)) groupas auxavg). if (30 >= auxavg) then $index_idxPerAge(30 groupas $equal)

which guarantees that the index idxPerAge will not be invoked and an empty bag will be returned if the condition 30 >= auxavg is false.
In some cases multi-key indices allow a key to be omitted in an index call (cf. the enum type key described in section 4.1.1); hence, the index optimiser supports the scenario when there are no selection predicates referring to such a non-obligatory key. In this case both the lower and the upper bound are set to the smallest and the biggest key value. E.g.:

Person where surname in “NOWAK”

can be rewritten to use the multi-key index idxPerAge&Surname with an omitted age key:

$index_idxPerAge&Surname( (-2147483648, 2147483647, true, true) groupas $range; “NOWAK” groupas $in)

To omit a boolean key in an index call, the set key parameter criterion is used:

(false union true) groupas $in

Predicates based on the operators <, >, ≥, ≤, = need operands with singular cardinality. Because of the threat of a run-time error, selection predicates whose key operands have optional cardinality cannot be used to apply the suitable index. As was shown in subchapter 5.3, in order to prevent run-time errors the exists operator can be used on a given key. E.g.:

Person where exists(address.zip) where address.zip > 99720 and address.zip <= 99727 and age <= 28

can be rewritten to take advantage of the idxPerZip index.
Additionally, after applying the index, the index optimiser removes the unnecessary exists expression:

$index_idxPerZip((99720, 99727, false, true) groupas $range) where age <= 28

This solution was adopted in the author's implementation, because it enables range indices on keys with optional cardinality.
The presented rewriting rules concern the most common and important situations that the index optimiser has to deal with. Nevertheless, they do not cover all possible scenarios in which applying indexing is possible (e.g. rank queries, or the count operator used instead of exists). This issue needs further research focused on all SBQL operators and on an analysis of the queries occurring in the database. Finally, the index optimiser's rewriting routines concerning disjunctive predicates (based on the or operator) are also very significant and are therefore described separately in the next section. There are also other methods, auxiliary to the index optimiser, that increase the indexing potential in queries; they are the topic of subchapter 5.6.

5.5.3 Processing Disjunction of Predicates

The index optimiser is prepared to deal with queries whose selection predicates are joined by the or operator. This is possible due to the law of distributivity of conjunction over disjunction, which results from logic:

[(p1 or p2) and p3] = [(p1 and p3) or (p2 and p3)]

As or weakens a selection, it also makes optimisation more complex. Therefore, if applying an index is possible without considering the predicates joined by or, the index optimiser may skip the deeper analysis and use that index. In other cases, in order to check all possibilities for indexing, the index optimiser removes the or operator and splits the non-algebraic where operator expression into two partial selection expressions. The objects returned by these expressions can contain duplicates, so it is necessary to leave only distinct object references. This is achieved with the uniqueref expression.
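The split just described, i.e. distributing and over or and uniting the results, amounts to expanding the combinations of predicates. A minimal sketch of this expansion (an illustrative model, not the thesis implementation):

```python
from itertools import product

def predicate_combinations(conjuncts):
    """Expand a conjunction whose members may be or-alternatives.

    conjuncts: each element is either an atomic predicate (a string)
    or a list of or-alternatives. The returned combinations correspond
    to the where clauses whose uniqueref-union is equivalent to the
    original clause; nested or operators add further alternatives.
    """
    branches = [c if isinstance(c, list) else [c] for c in conjuncts]
    return [list(combo) for combo in product(*branches)]

# age = 28 and (cityPredicate or worksInPredicate)
combos = predicate_combinations(['age=28', ['cityPred', 'worksInPred']])
```

Each resulting combination is then matched against the available indices separately, which mirrors how the implementation works with several combinations of predicates instead of physically splitting the where clause.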
Indexing may reduce the amount of data processed by such a query only if it can be applied to both partial expressions. This procedure is recursive if there is more than one or operator. Let us consider the following example optimisation of the query:

Emp where age = 28 and (address.city = “Szczecin” or ”Szczecin” in worksIn.Dept.address.city)

When there is no single-key index set on the age attribute of the Emp objects, the query is split by the index optimiser into the following form:

uniqueref((Emp where age = 28 and address.city = “Szczecin”) union (Emp where age = 28 and “Szczecin” in worksIn.Dept.address.city))

Depending on the current cost model, both indices can be applied:

uniqueref( ($index_idxEmpCity( ”Szczecin” groupas $equal) where age = 28) union ($index_idxEmpAge&workCity(28 groupas $equal; “Szczecin” groupas $equal)) )

The implementation of the or operator support does not actually split the where clause into several where clauses. Instead, the index optimiser works with several different combinations of predicates (containing predicates from the child branches of or operators). After finding indices matching all combinations, the cost model selects the best indices for each combination. Next, the cost model is used to find the most selective set of combinations. A set containing the x combinations of predicates necessary to build a union of x where clauses equivalent to the original query is used to generate the optimised query.

5.5.4 Optimising the Existential Quantifier

In SBQL the existential quantifier is a non-algebraic operator with the following syntax:

exists q1 such that q2

The query q2 must return true or false for each object defined by q1. If q2 is true for at least one object returned by q1, then the expression returns true; otherwise it returns false.
The index optimiser routines presented in the previous sections of this chapter can be reused to apply indices relating to the predicates in the query q2. The existential quantifier first has to be rewritten into a form containing the selection operator:

exists(q1 where q2)

where exists is the SBQL algebraic unary operator. Both queries are semantically equivalent, but in the second form the index optimiser can process the where clause in order to reduce the amount of evaluated data. If the optimisation succeeds, the query can be transformed further. When all predicates from the query q2 have been consumed by the index, the final optimised form of the query is:

exists(indexCallExpression)

On the other hand, when some q2 predicates are not associated with the selected index (let q2’ stand for the expression describing these predicates):

exists(indexCallExpression where q2’)

it is better to finally transform the unary exists operator into an expression with the existential quantifier:

exists indexCallExpression such that q2’

The latter form is more efficient because, contrary to the where operator, the evaluation stops as soon as a processed element returned by the indexCallExpression matches the predicates defined in query q2’ (see Tab. 3-3).

5.5.5 Reuse of Indices through Inheritance

In the AS0 store model any two different path expressions defining an index non-key value always return different objects. This is not true in the case of static or dynamic inheritance (AS1, AS2, etc.). The name of a subclass instances collection usually denotes a subset of a bigger collection. For example, according to the schema shown in Fig. 3.1, the query EmpStudent returns the common subset of the results returned by the queries Emp and Student. Similarly, all objects belonging to the collection Emp can be found among the instances of the superclass PersonClass.
As a result, all indices addressing objects of a superclass contain the subclasses' instances, and therefore such indices should also be used in the optimisation of selection queries which concern a subclass subset of the indexed objects. An invocation of an index set on a collection of superclass instances often returns more objects than required. For example, the following query:

Emp where age = 28 and surname = “KUC”

cannot be directly optimised using any index mentioned in section 4.1.2. The selection predicates concern attributes of EmpClass's superclass, i.e. PersonClass, and for that reason the administrator has probably equipped the whole Person collection with suitable indices, e.g. idxPerAge, idxPerSurname, idxPerAge&Surname. Such indices return not only EmpClass instances, and therefore the index optimiser, when applying one of them, has to introduce a facility that removes non-EmpClass instances from the index invocation result. This can be done using the SBQL coerce operator. In the AS1 and AS2 models it can be used to convert an object into an object of a more specific or a more general class. Additionally, this conversion rejects objects that are not instances of the specified class. The syntax of the coerce operator follows the typical syntactic convention known as a cast from languages such as C, C++ and Java. Consequently, the example query above can be rewritten to one of the following forms:

(Emp) idxPerAge(28 groupas $equal) where surname = “KUC”
(Emp) idxPerSurname(”KUC” groupas $equal) where age = 28
(Emp) idxPerAge&Surname(28 groupas $equal; ”KUC” groupas $equal)

The method presented in this section requires extending the cost model, because the selectivity should take into consideration the unwanted objects returned by an index, as well as the additional cost of the coerce operation.
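The effect of the coerce operator on an index result can be illustrated with a simple class filter (a sketch; the SBQL (Emp) cast also handles references and dynamic object roles, which are omitted here):

```python
class Person:
    """Stand-in for PersonClass instances."""

class Emp(Person):
    """Stand-in for EmpClass instances; Emp inherits from Person."""

def coerce(objects, cls):
    """Keep only the instances of cls, as the SBQL cast (Emp) does
    with the references returned by a superclass index."""
    return [o for o in objects if isinstance(o, cls)]

# idxPerAge(28) set on the Person collection may return both plain
# Person and Emp instances; (Emp) idxPerAge(28 groupas $equal)
# keeps only the employees.
index_result = [Person(), Emp(), Person(), Emp()]
emps = coerce(index_result, Emp)
```

The filter discards the unwanted superclass instances, which is exactly the overhead the extended cost model has to account for.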
Finally, the optimisation is not possible using an index set on a subclass instances collection, since determining the objects that were not taken into account would require an inspection of the whole collection of objects. This would not decrease the amount of data processed within the query, which is the main idea of indexing. For example, the idxEmpCity index cannot be applied by the index optimiser in the following query:

Person where address.city = “Warszawa”

Concluding, when considering an attribute as an index key, the best rule for the administrator is to create indices covering all instances of the class which introduces the given attribute. Such indices are more versatile, as they can also be used for optimising selection queries addressing subclass collections.

5.6 Secondary Methods

The efficacy of indexing depends on several factors, e.g. a good selection of indices generated by the database administrator, a perspicuous and correct construction of selection queries, etc. This subchapter focuses on presenting how various query rewriting methods can facilitate the work of the index optimiser.

Fig. 5.5 Query optimisation with the index optimiser pre-processing

Fig. 5.5 presents how auxiliary methods assist in indexing. These methods can be divided into several types:
• optimisation methods, e.g. [95, 97]:
  o factoring out independent subqueries,
  o pushing selection,
• methods assisting optimisation of queries invoking views, e.g. [122]:
  o query modification,
  o removing unnecessary auxiliary names,
• and other methods, e.g. [93]:
  o query syntax tree normalisation.
The following sections show, through short examples, how secondary methods enable indexing.
Details concerning these methods, their algorithms or their role in query processing are not described here. The given examples do not cover all possible situations of facilitating the index optimiser; it is also not excluded that other, unlisted methods exist and could be useful. This work does not attempt to establish the proper order of applying these methods, because that would require a deeper analysis and testing of the query optimisation environment.
The process shown in Fig. 5.5 also includes undesired methods, which could be placed in an optimisation sequence after the index optimiser. They may negatively affect the application of indices. An example of such routines harmful to indices is shown in section 5.6.5.

5.6.1 Factoring Out Independent Subqueries

Factoring out is one of the most important optimisation methods. It has its roots in the optimisation of nested queries in relational DBMSs. In SBA it is used in the context of a non-algebraic operator which contains a subquery independent of this operator. As depicted at the beginning of Chapter 5, the general idea of factoring out consists in moving the subquery before the non-algebraic operator. Thus it is evaluated only once, before the non-algebraic operator's loop. Let us consider the following query, selecting persons who earn a salary equal to the lowest salary in the CNC department:

Emp where salary = min((Dept where name = “CNC”).employs.Emp.salary)

At compile-time the number of employees working in the CNC department is unknown, so in the case when the subquery (Dept where name = “CNC”).employs.Emp.salary returns an empty bag, the evaluation of the min operator is not possible and a run-time error will occur. According to the conditions described in section 5.3.1, a selection predicate containing such an operand disallows applying indexing. However, the situation improves after factoring out independent subqueries.
The subquery calculating the minimal salary in the CNC department is independent of the where operator and hence can be calculated before the selection:

min((Dept where name = "CNC").employs.Emp.salary) groupas $aux0 . Emp where salary = $aux0

The advantage of this transformation is that the selection predicate is free of the runtime error threat and the idxEmpSalary index can be safely applied:

min((Dept where name = "CNC").employs.Emp.salary) groupas $aux0 . $index_idxEmpSalary(($aux0) groupas $equal)

Factoring out independent subqueries is a very important secondary method for applying indices, because hazardous predicates in where clauses completely disallow indexing. Similar situations were identified and verified in [78].

5.6.2 Pushing Selection

Pushing selection uses the distributivity property of some non-algebraic operators. It is a generalised equivalent of pushing a selection before a join, known from relational DBMSs. The example query retrieving the age of persons whose surname is Nowak:

Person.(age where surname = "NOWAK")

is formed in an unfortunate manner that prevents using the idxPerSurname index. The selection predicate is independent of the where operator, because it does not refer to the age attribute; therefore the predicate can be pushed before the where clause and the selection can be applied to Person objects:

(Person where surname = "NOWAK").age

Although in this case the transformation itself would not bring a high efficiency gain, it enables using the above-mentioned index:

$index_idxPerSurname("NOWAK" groupas $equal).age

which can improve query performance even by orders of magnitude. This would not be possible without the pushing selection method. The second example involves pushing a selection before a join operator. The following query:

(Dept join (sum(employs.Emp.salary) * 12)) where name = "HR"

returns the "HR" department with the overall yearly cost of salaries of its employees.
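The join example can be illustrated with a minimal Python sketch on invented data (not ODRA code): filtering department objects first avoids computing the yearly cost for departments that the selection would discard anyway.

```python
# Hypothetical sketch: pushing a selection before a join, on plain Python lists.
depts = [{"name": "HR", "emp_salaries": [1000, 1200]},
         {"name": "CNC", "emp_salaries": [900]}]

def year_cost(d):
    # overall yearly cost of salaries in one department
    return sum(d["emp_salaries"]) * 12

# Unoptimised: the yearly cost is computed for every department, then filtered.
naive = [(d["name"], year_cost(d)) for d in depts]
naive = [row for row in naive if row[0] == "HR"]

# Pushed selection: filter Dept objects first, compute the cost only for them.
pushed = [(d["name"], year_cost(d)) for d in depts if d["name"] == "HR"]

assert naive == pushed == [("HR", 26400)]
```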
The sum of employees' salaries is calculated unnecessarily for all departments other than HR. Examining the binding levels of the selection predicate name = "HR" shows that name depends on where, because it is bound in the scope opened by that operator. However, this predicate relates only to Dept objects, and consequently it can be applied directly to the left operand of the join:

(Dept where name = "HR") join (sum(employs.Emp.salary) * 12)

After rewriting, the subquery sum(employs.Emp.salary) * 12 is evaluated for a significantly smaller number of Dept objects. Moreover, this form makes it possible to apply the idxDeptName index:

$index_idxDeptName("HR" groupas $equal) join (sum(employs.Emp.salary) * 12)

Both examples show the positive influence of pushing selection on the work of the index optimiser.

5.6.3 Methods Assisting Invoking Views

Views introduce a higher level of abstraction in designing applications. Unfortunately, this may result in serious performance deterioration, because optimisation methods have limited access to the bodies of invoked views. Let us consider the following example of a query operating on the UnderpaidEmp view, which refers to Emp objects with the salary attribute lower than 1000:

(UnderpaidEmp where City = "Szczecin").FullName

The subview City returns an employee's city of residence, and the subview FullName returns the concatenated name and surname of an employee. The invocation of the UnderpaidEmp view can only be optimised by utilising the idxEmpSalary index. For such a wide key range, applying an index would probably bring a small gain in query performance because of weak query selectivity.
Using the query modification technique (known also from relational DBMSs), whose idea lies in combining a query with the definitions of the views being invoked, the UnderpaidEmp view and its subviews City and FullName are replaced by their definitions:

(((Emp where salary < 1000) as up) where ((up.address.city) as upac).upac = "Szczecin"). ((up.name + " " + up.surname) as upn).upn

Directly after query modification, applying all possible indices is not available to the index optimiser because of the auxiliary names introduced by the view definitions. Hence the removing unnecessary auxiliary names method should be applied to the obtained query:

((Emp where salary < 1000) where address.city = "Szczecin"). (name + " " + surname)

This form of the transformed query is still semantically equivalent to the initial query, but it enables using the last predicate, concerning the derived attribute address.city of the Person object, to apply an index:

($index_idxEmpCity("Szczecin" groupas $equal) where salary < 1000).(name + " " + surname)

The idxEmpCity index has relatively good selectivity, hence using it is more profitable than applying the idxEmpSalary index.

5.6.4 Syntax Tree Normalisation

Syntax tree normalisation can be used to convert two semantically equivalent subqueries into identical expressions. This operation can be applied to any commutative binary algebraic operator:

left_operand operator right_operand

The method associates every SBQL expression and literal with a metric which makes subquery comparison possible. If the right_operand has a smaller metric than the left_operand, the operands of the binary operator are swapped:

right_operand operator left_operand

In the context of indexing, syntax tree normalisation addresses keys built on derived attributes involving complex expressions with commutative operators.
As a simple example of such a key, let us consider the overall yearly cost of salaries of employees calculated in the context of a department:

sum(employs.Emp.salary) * 12

Generating the idxDeptYearCost index on this key for Dept objects could greatly improve the evaluation performance of the following example query:

Dept where sum(employs.Emp.salary) * 12 > 1000000

because the index key exactly matches the complex predicate used for selecting departments. Although the query above is not very selective, rewriting it to the following form:

$index_idxDeptYearCost((1000000, 2147483647, false, true) groupas $range)

improves efficiency, since calculating the overall yearly cost of salaries in departments is omitted. The problem arises when the user forms the predicate in the following manner:

Dept where 12 * sum(employs.Emp.salary) < 1000000

Even though the left operand of the selection predicate, 12 * sum(employs.Emp.salary), is semantically equal to the one used in the previous example query, applying the idxDeptYearCost index is not possible, as the factors of the product are swapped. Assuming that an integer literal has a larger metric than the sum operator, syntax tree normalisation converts the query accordingly:

Dept where sum(employs.Emp.salary) * 12 < 1000000

enabling the use of the idxDeptYearCost index. Syntax tree normalisation itself does not improve the performance of query evaluation; on the contrary, it introduces a small delay during compile time. Its value can be appreciated only in the context of other optimisation methods like indexing or caching.
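The normalisation step can be sketched in Python on a toy expression representation (the `metric` function and the tuple encoding of subtrees are invented for illustration; the real SBQL metric is defined elsewhere in this work):

```python
# Hypothetical sketch: normalising operands of a commutative operator by a
# metric, so that semantically equal subtrees become syntactically identical.
def metric(node):
    # toy metric (assumption): integer literals rank higher than named
    # subexpressions, matching the assumption made in the text above
    return (1, str(node)) if isinstance(node, int) else (0, str(node))

def normalise(op, left, right):
    # swap operands of a commutative operator when the right one is "smaller"
    if op in ("*", "+", "and", "or") and metric(right) < metric(left):
        left, right = right, left
    return (left, op, right)

# "sum(...) * 12" and "12 * sum(...)" normalise to the same tree:
a = normalise("*", "sum(employs.Emp.salary)", 12)
b = normalise("*", 12, "sum(employs.Emp.salary)")
assert a == b
```

After such canonicalisation, a simple structural comparison suffices to match the predicate against the index key definition.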
5.6.5 Harmful Methods

As an example of a method that may make applying indexing impossible, let us consider factoring out common path-subexpressions [93] and the following query:

Emp where worksIn.Dept.address.city = "Warsaw" and worksIn.Dept.address.street = "Sienkiewicza"

which returns employees who work in a department in Warsaw at Sienkiewicza Street. In this query the derived attribute worksIn.Dept.address is accessed twice for each Emp object; worksIn.Dept.address is the common path-subexpression of the left operands of both selection predicates. Using the mentioned factoring out method, the obtained optimised query:

Emp where worksIn.Dept.address. (city = "Warsaw" and street = "Sienkiewicza")

computes the worksIn.Dept.address expression only once. Nonetheless, in the case of the discussed query it would be better to apply the idxEmpWorkCity index:

$index_idxEmpWorkCity("Warsaw" groupas $equal) where worksIn.Dept.address.street = "Sienkiewicza"

which is not possible because of the disadvantageous predicate transformation done by factoring out common path-subexpressions. The major optimisation methods significantly facilitate indexing; still, some methods, like the one presented in the example, should be considered for placement after the index optimiser. Ordering the optimisation methods is a process which involves heuristic analysis of various queries and the common sense of the optimiser designer. All presented examples outline the index optimiser in the context of the whole optimisation process.

5.7 Optimisations Involving a Distributed Index

As described in section 2.2.2, the main advantages of a distributed index are parallel access and increased capacity. Therefore, it greatly improves the efficiency of an index concurrently exploited by multiple queries.
Nevertheless, the performance of a distributed index for a single call is usually similar to that of a centralised index. Let us consider the following query issued by a client:

<nonkeyexpr> where min <= <keyexpr> and <keyexpr> <= max

It returns the part of the collection defined by <nonkeyexpr> for which the derived attribute <keyexpr> is within the range determined by min and max. Additionally, let us assume the following:
• data is equally distributed over qDATA servers,
• n is the number of elements in the processed collection,
• s is the number of elements selected by the query.
The where clause is evaluated in parallel on qDATA servers, each storing approximately n/qDATA elements; therefore, local evaluation on a server has the computational complexity O(n/qDATA), expressed in the big O notation. Overall performance could be improved by local indexing on the servers only if all servers provided an appropriate index; however, this cannot be guaranteed by the global schema administrator. A significant efficiency improvement can be obtained by creating a distributed range SDDS index appropriate for optimising the query evaluation. Such an index is spread over qIDX machines among the many available in a distributed database. Assuming that data are also equally distributed inside the index, each server contains n/qIDX elements. Therefore, selecting all elements from a fragment of the index on an individual server has the complexity O(n/qIDX) and does not depend on the actual data distribution. It is important to note that qIDX grows dynamically with the number of indexed elements, so it is usually much greater than qDATA. Regardless of the existing indices, the time complexity of merging partial results on the client is O(s), since it depends on query selectivity. The computational complexity of a centralised index based on linear hashing is comparable (see section 2.2.1).
Nevertheless, a distributed index can significantly improve performance in relation to a centralised index in the case of a count query issued by a client:

count(<nonkeyexpr> where min <= <keyexpr> and <keyexpr> <= max)

Without an index, the query running time is O(n); the evaluation is done in parallel on the qDATA sites containing the data, and the client only calculates the sum of the results obtained from the qDATA servers. When a centralised index is employed, the asymptotic complexity remains the same as for the where clause, although the execution can be faster. Regardless of the indexing strategy, such a count query is efficiently computed by the index itself; the actual data are not used, or their participation in evaluating the query is negligible. The properties of a distributed index enable computing a count query in parallel on the index sites. The time of calculating the sum on the client is negligible, especially as partial results are usually obtained from only one or several of the qIDX servers. Therefore, the distributed index should prove its efficacy for the discussed count queries. In the next section the author proposes an optimisation concerning another type of SBQL queries, which can take advantage of the efficiency of the distributed index in processing count queries.

Page 133 of 181 Chapter 5 Query Optimisation and Index Optimiser

5.7.1 Rank Queries Optimisation

Queries can return a sequence; consequently, there should be a possibility to select the k-th sequence element, the last sequence element, etc. This is achieved through rank queries, which nowadays are very widespread; in particular, Internet search engines provide answers according to internal or explicit rankings. The subject of rank queries is researched in the context of regular query optimisation, the construction of database systems devoted to ranking and the development of its theoretical foundations (e.g. relational ranking algebras), e.g. [52, 69]. The most frequently exploited ranking query is looking up the top-k elements.
Such a query can easily be facilitated with a suitable ordered index. However, the author would like to discuss the optimisation of ranking queries in a more general case. SBQL allows expressing rank queries using the square brackets operator and the rangeas operator (see Tab. 3-5). A query concerning objects defined by a <nonkeyexpr> expression and ordered using a <keyexpr> expression, whose rank is between the integers defined by <min> and <max>, can be formulated in at least three semantically equivalent forms:

Query 5.1 Ranking Queries in SBQL
1. Using square brackets and a bag of integers:
(<nonkeyexpr> orderby <keyexpr>)[bag(<min>, <min>+1, <min>+2, …, <max>)]
2. Using square brackets and a range-of-integers bag constructor (not yet supported in the ODRA database):
(<nonkeyexpr> orderby <keyexpr>)[<min>..<max>]
3. Using the rangeas operator and appropriate selection predicates:
((<nonkeyexpr> orderby <keyexpr>) as <name> rangeas <rank> where <rank> >= <min> and <rank> <= <max>).<name>

where <name> and <rank> are auxiliary names. The third solution was introduced in the Loqis system [116]. It is the most universal, as it can be freely combined with other features of SBQL. A query which returns a sequence {res1, res2, …, resn} can be further processed by the rangeas operator. It equips individual results with binders, e.g. rank, which store ordered natural numbers: bag{struct{res1, rank(1)}, struct{res2, rank(2)}, …, struct{resn, rank(n)}}. This solution enables the query language to form conditions on the rank binders freely. For example, the following rank query:

((Emp orderby salary) rangeas n) where n <= 10

returns a bag of the 10 lowest-earning employees with an additional binder called n:

bag{struct{Emp1, n(1)}, struct{Emp2, n(2)}, …, struct{Emp10, n(10)}}

In the case of rank queries, sorting is not the goal of the query, although the orderby operator is used.
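The semantics of the rangeas operator can be sketched naively in Python (illustrative data; this deliberately ranks by sorting, which is exactly the cost the optimisation discussed next tries to avoid):

```python
# Hypothetical sketch of the rangeas semantics: pair each element of an
# ordered sequence with its 1-based rank, then filter on the rank binder.
emps = [("Emp3", 900), ("Emp1", 1100), ("Emp2", 1500)]

ranked = [(e, rank)
          for rank, e in enumerate(sorted(emps, key=lambda x: x[1]), start=1)]

# the 2 lowest-earning employees, selected via a condition on the rank binder
top2 = [e for e, rank in ranked if rank <= 2]
assert [name for (name, _) in top2] == ["Emp3", "Emp1"]
```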
Creating a sorted sequence of data significantly reduces the optimisation potential. Sorting the data usually deteriorates the average performance of query evaluation to O(n·log2(n)) running time (where n is the number of elements to sort) [23]. When data is distributed between several servers, queries forcing data ordering cannot be completely decomposed onto particular sites in order to execute in parallel. If an order is forced, then each element is processed separately: a client has to request all the data required by a query from the servers and process it to obtain the result, i.e. the so-called total data shipping strategy. Therefore, the performance of this strategy is at least linear. Most of the methods based on rewriting and indexing assume that data are not ordered; therefore, queries involving sorting require dedicated optimisation methods. In order to avoid sorting the whole collection defined by <nonkeyexpr>, the author considered an approach based on looking up the <min>-th and <max>-th elements according to the given ranking key. Assuming that a method returning the k-th element of a given set is defined, e.g. $findKthElement(k:integer), the ranking query forms in Query 5.1 can be transformed into the semantically equivalent form:

Query 5.2: Evaluation of a Rank Query Without Sorting
($findKthElement(<min>).<keyexpr> as val_min
join $findKthElement(<max>).<keyexpr> as val_max
join ((<min> - count(<nonkeyexpr> where <keyexpr> < val_min)) as delta))
.((<nonkeyexpr> where val_min <= <keyexpr> and <keyexpr> <= val_max)
orderby <keyexpr>)[delta..<max>-<min>+delta]

The transformed query (parts I–IV):
I. finds the ranking key values of the <min>-th and <max>-th elements and stores them as the values of auxiliary binders named val_min and val_max,
II.
next, since there can be more elements with the same value as the <min>-th element, calculates which element with the value val_min, counted in a row, is the <min>-th one and stores this number in an auxiliary binder named delta,
III. extracts the elements with a ranking key value between val_min and val_max inclusively,
IV. finally, sorts the extracted elements and removes those before the <min>-th element and after the <max>-th element using a ranking operator.
Assuming that the number of elements extracted by a ranking query is usually small in practice (the s coefficient), the performance of the strategy presented above depends mostly on the $findKthElement method and on both where clauses. The evaluation complexity of parts II and III of Query 5.2 is O(n). Without indexing, the computations are done in parallel on the qDATA sites; facilitated by a global distributed index, the evaluation splits between the qIDX servers. The next subsections discuss variants of Hoare's algorithm resolving the k-th element problem and their influence on the overall performance of ranking queries.

5.7.1.1 Hoare's Algorithm in a Distributed Environment

Hoare's algorithm (also known as quickselect) is based on bisection and derives from the well-known quicksort algorithm [23]. During each iteration the algorithm splits the examined data into two parts (smaller than, and equal to or greater than, a randomly selected pivot element). In contrast to quicksort, after dividing the set Hoare's algorithm recurses only into the part of the set that contains the wanted k-th element, omitting the other part. This approach results in linear average evaluation complexity, so it is faster than the obvious algorithm based on sorting the given set. Nevertheless, similarly to sorting, it consumes additional resources to store a copy of the data, so that elements can be swapped freely according to the algorithm.
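A minimal single-machine quickselect can be sketched in Python as follows (an illustrative sketch, not ODRA code; the list-based partitioning makes the extra-memory cost mentioned above explicit):

```python
import random

# Minimal quickselect (Hoare's algorithm) sketch: find the k-th smallest
# element (1-based) by recursing into only one side of the partition.
def quickselect(items, k):
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    if k <= len(smaller):
        return quickselect(smaller, k)          # k-th lies below the pivot
    if k <= len(smaller) + len(equal):
        return pivot                            # k-th equals the pivot
    # k-th lies above the pivot; shift the rank accordingly
    return quickselect([x for x in items if x > pivot],
                       k - len(smaller) - len(equal))

assert quickselect([7, 2, 9, 4, 1, 5], 3) == 4
```

The expected running time is linear in the number of elements, since on average each recursion halves the examined part.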
Applying the classic Hoare's algorithm to centralised processing of a rank query would result in:
• total shipping of the distributed data to be processed on a main server (usually the client),
• linear evaluation complexity O(n),
• large memory consumption and a large number of write operations on the main server.
A straightforward parallelisation of Hoare's algorithm on the qDATA servers storing the processed data requires introducing:
• a controlling algorithm on the main server, deciding on the iteration parameters,
• peer algorithms on the qDATA servers, processing the actual data.
An iteration of the parallelised Hoare's algorithm consists of the following steps:
1. The controlling algorithm randomly selects a pivot value within the given range. The selection can be performed in various ways, e.g. using:
• a query for a random element within the given range, sent to a random data server,
• a median calculated from a collection of random elements within the range, gathered from all data servers,
• an average value based on the knowledge of the minimal and maximal limits of the range.
2. The controlling algorithm sends a message to the peer algorithms to divide the given range according to the selected pivot.
3. The peer algorithms divide the range and inform the main server about the cardinality of the obtained parts (additionally, random elements from both parts can be sent).
4. The controlling algorithm determines the part of the given range containing the k-th element, and this part becomes the new range for the next iteration.
The controlling algorithm stops after O(log2(n)) iterations, when the number of elements in the range is reasonably small. Those elements are sent to the main server and the k-th element is selected. Strictly speaking, the absolute k-th element is the (k − mincount)-th element of the final range, where mincount is the number of elements smaller than the final range.
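The steps above can be simulated on one machine with plain Python lists standing in for the peers (the function name, the pivot choice and the stop threshold are invented for the sketch; in the real protocol only the counts, the pivot and the final small range would cross the network):

```python
# Hypothetical simulation of the controlling algorithm for parallelised
# selection: peers keep their local fragments and report only cardinalities.
def parallel_kth(partitions, k, stop=3):
    local = [sorted(p) for p in partitions]       # one fragment per peer
    mincount = 0                                   # elements below current range
    while sum(map(len, local)) > stop:
        # crude pivot choice (assumption): median of one peer's fragment
        pivot = next(p[len(p) // 2] for p in local if p)
        less = sum(sum(1 for x in p if x < pivot) for p in local)
        equal = sum(p.count(pivot) for p in local)
        if k - mincount <= less:                   # k-th lies below the pivot
            local = [[x for x in p if x < pivot] for p in local]
        elif k - mincount <= less + equal:         # k-th equals the pivot value
            return pivot
        else:                                      # k-th lies above the pivot
            mincount += less + equal
            local = [[x for x in p if x > pivot] for p in local]
    final = sorted(x for p in local for x in p)    # ship the final small range
    return final[k - mincount - 1]

assert parallel_kth([[9, 1, 5], [7, 2, 8], [3, 6, 4]], 4) == 4
```

Each loop iteration corresponds to one round of communication between the controlling algorithm and the peers.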
To conclude, the parallel evaluation of Hoare's algorithm has the following properties:
• it avoids total data shipping,
• it requires O(log2(n)) rounds of communication,
• it entails large memory consumption and a large number of write operations on the servers storing the data,
• existing local or global indices cannot be used.
Using this algorithm to evaluate Query 5.2 guarantees a running time complexity of O(n) and the distribution of calculations over the qDATA servers. Further improvements require taking advantage of indexing.

5.7.1.2 Modification of Hoare's Algorithm

In the proposed modification of Hoare's algorithm, splitting the examined data into two parts is not done physically. Instead, during an iteration the number of elements smaller and greater than the pivot value is determined. This allows selecting the side of the range divided by the pivot that holds the k-th element. Such a simple bisection algorithm can be defined entirely on the client side as an SBQL program:

set_min := min(<nonkeyexpr>.<keyexpr>);
set_max := max(<nonkeyexpr>.<keyexpr>);
set_position := 0;
do {
  pivot := (set_min + set_max) / 2;
  less_count := count(<nonkeyexpr> where set_min <= <keyexpr> and <keyexpr> < pivot);
  if (set_position + less_count >= k)
    set_max := pivot;
  else {
    set_min := pivot;
    set_position += less_count;
  }
} while (less_count > stop_const);
return ((<nonkeyexpr> where set_min <= <keyexpr> and <keyexpr> <= set_max)
        orderby <keyexpr>)[k - set_position];

where:
• set_min and set_max are the boundary values of the examined set,
• pivot is the arithmetic centre of the set boundary values,
• less_count holds the number of set elements with values smaller than pivot,
• set_position indicates the number of elements with a value smaller than set_min,
• stop_const is a constant used for terminating the main loop, as the examined set size reduces together with the less_count value.
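A single-machine Python translation of this bisection idea is sketched below (the function name and the stop threshold are invented; the counting passes correspond to the count, min and max statements of the SBQL program, which in a distributed setting would be decomposed over the servers):

```python
# Hypothetical sketch: find the k-th smallest element (1-based) by bisecting
# the value range, using only read-only counting passes over the data.
def kth_by_value_bisection(values, k, stop=3):
    lo, hi = min(values), max(values)
    position = 0                      # elements with a value smaller than lo
    while True:
        in_range = sorted(v for v in values if lo <= v <= hi)
        if len(in_range) <= stop or lo == hi:
            return in_range[k - position - 1]
        pivot = (lo + hi) / 2         # arithmetic centre, as in the SBQL program
        less = sum(1 for v in values if lo <= v < pivot)
        if position + less >= k:      # the k-th element lies below the pivot
            hi = pivot
        else:
            lo = pivot
            position += less

assert kth_by_value_bisection([5, 1, 4, 2, 3, 9, 7], 4) == 4
```

Unlike the list-partitioning quickselect, this variant never copies or rewrites the data: each iteration is a pure counting query, which is precisely what lets existing indices serve it transparently.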
The performance of the proposed algorithm is O(n·log2(n)): the loop is executed O(log2(n)) times, and each counting pass is linear. Similarly to Hoare's algorithm, the overall evaluation requires O(log2(n)) rounds of communication. The regular evaluation of statements containing the min, max and count operators can be decomposed over the qDATA servers; if an appropriate SDDS index exists, the evaluation is split between the qIDX servers. The method for determining the pivot and the loop stop condition does not influence the running time complexity; still, it can be implemented differently to tune the algorithm's performance. The proposed algorithm has many advantages:
• implementation simplicity,
• it successfully avoids total data shipping,
• small memory usage and a small number of write operations (data is mainly read),
• a very small amount of data is sent through the network,
• data processing is transparently facilitated by existing local or global indices.
The features of this algorithm, along with its performance, apply also to the rank queries evaluation strategy shown in Query 5.2. The table below compares different approaches to the execution of rank queries.

Tab. 5-1 Features of Rank Queries Evaluation Strategies

Feature | Unoptimised Query 5.1 | Query 5.2 based on distributed Hoare's | Query 5.2 based on author's approach
Reduced network traffic | NO | YES | YES
Small memory usage | NO | NO | YES
Can utilise local indices | NO | NO | YES
Algorithm simplicity | YES | NO | YES
Computational complexity | O(n·log2(n)) | O(n) | O(n·log2(n))
Parallel evaluation on qIDX sites with SDDS support | NO | NO | YES

Despite the slightly better asymptotic performance of the rank queries evaluation strategy employing Hoare's algorithm, the author's approach possesses more advantages. Unfortunately, an efficiency verification of the proposed rank queries optimisation method has not been done yet because of the immature stage of the ODRA platform implementation.
Appropriate tests are planned for the future.

5.8 Increasing Query Flexibility with Respect to Indices Management

In the current ODRA prototype every change in the database data definition (usually triggered by DDL commands) forces the recompilation of the applications which may depend on the modified entity. Such situations obviously also occur during index management operations. After adding an index, compilation and optimisation are required to introduce index calls into queries inside existing applications (an example is shown in Fig. 5.3), whereas after removing an index the compiled form of the query syntax tree must be freed from all calls to the non-existing index. On the other hand, recompiling would additionally require terminating some running applications; therefore, in many cases it is not possible, or is troublesome. In order to solve this problem in the future, the author proposes a solution leading to a more flexible compiled form of the optimised query. The necessary changes can be introduced by the index optimiser at the stage of query syntax tree rewriting. Let us assume that, as in many RDBMS solutions, an index can be disabled or enabled by the administrator. First, it is crucial to provide a mechanism ensuring the validity of a query even if an index remove or index disable command is issued before or during the evaluation of the query. Therefore, the author proposes a new built-in SBQL method:

$request(index_name) : boolean

that can be introduced into the query syntax tree at compile time to facilitate query processing. Its argument is the name of a database entity, i.e. an index name in this case, intended to be called, and its return type is boolean. If the given index exists and is valid for usage, the $request method returns true and additionally prevents the database from disabling or removing this index before its successive call finishes.
The $request method returns false if the specified index is not accessible or valid. Consequently, assuming the idxPerAge index exists, the following example query:

Person where surname = "NOWAK" and age = 30

can be rewritten to a form preventing any problems with evaluation in the case of removal or disabling of the idxPerAge index:

if ($request(idxPerAge))
  $index_idxPerAge(30 groupas $equal) where surname = "NOWAK"
else
  Person where surname = "NOWAK" and age = 30

Such an approach allows even more flexible and independent exploitation of indices by user applications. The administrator, apart from adding the currently necessary indices, can register information about indices that are anticipated to be used in the future. This is predictable already at the stage of designing the data schema: the administrator can usually easily estimate which attributes are likely to be used in selection predicates. The suitable information can be introduced to the index manager by issuing a command adding an index in the disabled state. The index optimiser can consider using registered indices during query evaluation by applying an appropriate transformation to the query syntax tree. For example, assuming the administrator has registered the idxPerAge, idxPerSurname and idxPerAge&Surname indices, the query above can be rewritten to increase its flexibility:

if ($request(idxPerAge&Surname)) then
  $index_idxPerAge&Surname(30 groupas $equal; "NOWAK" groupas $equal)
else if ($request(idxPerSurname)) then
  $index_idxPerSurname("NOWAK" groupas $equal) where age = 30
else if ($request(idxPerAge)) then
  $index_idxPerAge(30 groupas $equal) where surname = "NOWAK"
else
  Person where surname = "NOWAK" and age = 30

Determining the precedence of the indices should be facilitated by the cost model, e.g. according to the selectivity property; therefore, the best available index is always used at run time.
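The run-time dispatch implied by such a rewriting can be sketched in Python (the class, the dictionary-backed "index" and all names are invented stand-ins for the ODRA index manager and the proposed $request method):

```python
# Hypothetical sketch of $request-guarded dispatch: the best available
# (enabled) index is used at run time, with a full scan as the fallback.
class IndexManager:
    """Toy stand-in for the index manager (names are illustrative)."""
    def __init__(self):
        self.indices = {}                 # name -> lookup function
    def add(self, name, lookup):
        self.indices[name] = lookup
    def disable(self, name):
        self.indices.pop(name, None)      # no application recompilation needed
    def request(self, name):              # counterpart of the proposed $request
        return self.indices.get(name)

def select(mgr, people, surname, age):
    # best registered index first, full scan as the final fallback branch
    if (idx := mgr.request("idxPerSurname")) is not None:
        return [p for p in idx(surname) if p["age"] == age]
    return [p for p in people if p["surname"] == surname and p["age"] == age]

people = [{"surname": "NOWAK", "age": 30}, {"surname": "NOWAK", "age": 40}]
mgr = IndexManager()
by_surname = {"NOWAK": people}            # a trivial hash "index"
mgr.add("idxPerSurname", lambda s: by_surname.get(s, []))
with_index = select(mgr, people, "NOWAK", 30)
mgr.disable("idxPerSurname")              # the query keeps working regardless
without_index = select(mgr, people, "NOWAK", 30)
assert with_index == without_index
```

Disabling the index changes only which branch executes; the compiled query and the result stay the same, which is the essence of the proposed flexibility.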
This solution permits the administrator to freely disable and enable the available indices without the necessity of compiling queries again. Consequently, partial recompilation of user applications would be essential only in the case of adding a different, not yet registered index in order to improve the performance of the queries that can exploit it. The most important benefit of the proposed solution is that applications can work continuously, independently of index management actions, and flexibly exploit the available indices.

Chapter 6 Indexing Optimisation Results

The test results are average values from 20 subsequent measurements performed on the example schema presented in Fig. 3.1, populated with random data. Tests were performed on the following single machine:

Tab. 6-1 Optimisation testbench configuration
Property | Value
Processor | Intel Mobile Core 2 Duo T2300, 1.66 GHz
RAM | 2.00 GB
HDD | 120 GB, 5400 rpm
OS | MS Windows Server 2003 R2 Service Pack 2, 32-bit
JVM | Sun JRE SE 1.6.0_03

The data store of the ODRA OODBMS prototype is entirely mapped into RAM using memory-mapped file access. The current implementation allows performing tests on 300000 objects representing people related to the company.

6.1 Test Data Distribution

The data distribution is presented in the following figures.

Fig. 6.1 Department's location distribution (probabilities for Warszawa, Łódź, Kraków, Wrocław, Poznań, Gdańsk, Szczecin)

Fig.
6.2 Employee's department distribution (probabilities for production, retail, wholesale, research, warehousing, CNC, customer service, logistics, security, payments, HR, employment, BHP)

Fig. 6.3 Employee's salary range distribution (ranges 300–600, 600–1000, 1000–1400, 1400–1800, 1800–2400, 2400–6000)

Fig. 6.4 Female person's first name distribution

Fig. 6.6 Male person's first name distribution

Fig.
6.5 Female person's surname distribution

Fig. 6.7 Male person's surname distribution

Instances of the classes PersonClass, StudentClass, EmpClass and EmpStudentClass are distributed equally. Regular and employed students' age is distributed randomly between 19 and 30 inclusive, employees' age between 18 and 65 inclusive, and the age of the remaining persons between 1 and 100 inclusive. The value of a student's scholarship is randomly 0, 200 or 500. Sex values are equally distributed. The distribution of the data gets closer to the assumed one as the number of employees increases.

6.2 Sample Index Optimisation Test

The main tests compare times between query executions with the index optimiser enabled and disabled. Additional elements of the execution taken into account are static evaluation (i.e. type checking) and optimisation (cf. subchapter 5.1). For each test the set of existing indices is specified. Each query is given a plot of the reference (ref. avg. time) and the index-optimised (opt. avg. time) execution times for 10, 100, 1000, 3000, 10000, 30000, 100000 and 300000 person objects, together with the optimisation gain (the ratio of the evaluation times). A time measurement can be disrupted by unexpected actions of the OS, hardware or applications running in the background. In order to eliminate the influence of such interferences, the test results are estimated as an average of 20 subsequent measurements.
Multiple measurements particularly increase the precision of tests with short query evaluation times. For longer-running tests a smaller number of measurements was performed, i.e. 5 measurements for tests longer than 10 minutes and 1 measurement for tests longer than 30 minutes. To improve readability, the results on the plots are presented using a logarithmic scale on the x-axis. All queries below are devoid of decorations introduced by the static evaluator (e.g. implicit dereferences and coercions) and of transformations done by standard ODRA optimisation methods other than indexing.
Query 6.1a: Retrieves persons 28 and less years old named KOWALSKI
reference
Person where surname = "KOWALSKI" and age <= 28
index optimised
idxPerAge&Surname((-2147483648, 28, true, true) groupas $range; "KOWALSKI" groupas $equal)
Fig. 6.8 Evaluation times and optimisation gain for Query 6.1
The optimisation gain for this simple query proves the effectiveness of ODRA indexing in case of the idxPerAge&Surname index. The data distribution indicates that the surname KOWALSKI occurs in two or three cases out of 100 people. The optimisation gain plot shows that for more than 30000 persons the amount of data processed by the query is reduced more than 50 times, i.e. out of 100 people only one or two persons are processed. This is the result of combining both selection predicates in a single index call. Creating similar indices for collections smaller than 100 objects is not beneficial. The second type of performed tests is designed to verify index properties. It is achieved by comparing the optimisation gain obtained with different indices.
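The composite equality-and-range lookup performed by the idxPerAge&Surname call above can be sketched in Python (a hypothetical illustration with invented names, not the ODRA implementation): entries are bucketed by the equality key and kept sorted by the range key, so both predicates are resolved inside one index call.

```python
import bisect
from collections import defaultdict

# Hypothetical composite index: equality key (surname) + range key (age).
# Entries per surname are kept sorted by age so a range predicate
# becomes a binary search instead of a full scan.
class SurnameAgeIndex:
    def __init__(self):
        self._buckets = defaultdict(list)  # surname -> sorted [(age, oid)]

    def insert(self, surname, age, oid):
        bisect.insort(self._buckets[surname], (age, oid))

    def lookup(self, surname, age_min, age_max):
        bucket = self._buckets.get(surname, [])
        lo = bisect.bisect_left(bucket, (age_min, -1))
        hi = bisect.bisect_right(bucket, (age_max, float("inf")))
        return [oid for _, oid in bucket[lo:hi]]

idx = SurnameAgeIndex()
for oid, (surname, age) in enumerate([("KOWALSKI", 25), ("KOWALSKI", 40),
                                      ("NOWAK", 28), ("KOWALSKI", 28)]):
    idx.insert(surname, age, oid)

# Person where surname = "KOWALSKI" and age <= 28
print(idx.lookup("KOWALSKI", -2**31, 28))  # -> [0, 3]
```

Only the matching bucket slice is touched, which mirrors why the index call processes one or two persons out of a hundred.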
Query 6.1b
idxPerAge&Surname optimisation
idxPerAge&Surname((-2147483648, 28, true, true) groupas $range; "KOWALSKI" groupas $equal)
idxPerAge optimisation
idxPerAge((-2147483648, 28, true, true) groupas $range) where surname = "KOWALSKI"
idxPerSurname optimisation
idxPerSurname("KOWALSKI" groupas $equal) where age <= 28
Fig. 6.9 Indices optimisation gain for Query 6.1
The plot confirms that index calls reach the desired optimisation gain as the number of non-key objects grows. The performance improvement using the idxPerAge index is small (a gain ratio slightly below 2) because the age predicate covers a large value range. For larger databases (much more than 10000 persons) the efficiency of the idxPerAge&Surname multiple-key index significantly exceeds that of the single-key indices.
6.3 Omitting Key in an Index Call Test – enum Key Types
In case of multiple-key indices, e.g. idxPerAge&Surname, the index optimiser can sometimes omit specifying a value of a key in an index call (the enum key type described in section 4.1.1). This feature makes an index more flexible and, consequently, the set of existing indices can be reduced.
Query 6.2a: Counts persons named KOWALSKI, KOWALSKA, NOWAK
reference
count(Person where surname in ("KOWALSKI" union "KOWALSKA" union "NOWAK"))
index optimised
count(idxPerAge&Surname((-2147483648, 2147483647, true, true) groupas $range; ("KOWALSKI" union "KOWALSKA" union "NOWAK") groupas $in))
Fig. 6.10 Evaluation times and optimisation gain for Query 6.2
The optimisation gain for a large number of objects is comparable to the reduction of the amount of data processed according to the selection predicate, i.e. approximately one out of ten people is named KOWALSKI, KOWALSKA or NOWAK. This result meets expectations; however, creating a suitable single-key index, i.e. idxPerSurname, can improve efficiency even further.
Query 6.2b
idxPerAge&Surname optimisation
count(idxPerAge&Surname((-2147483648, 2147483647, true, true) groupas $range; ("KOWALSKI" union "KOWALSKA" union "NOWAK") groupas $in))
idxPerSurname optimisation
count(idxPerSurname(("KOWALSKI" union "KOWALSKA" union "NOWAK") groupas $in))
Despite the fact that both index calls (see Query 6.2b) return the same collection of objects, the plot in Fig. 6.11 indicates that the optimisation gain for the idxPerSurname index is up to 30 times greater than for the idxPerAge&Surname index. An additional reason for such high performance is that the index-optimised queries do not process the selected objects with a where clause (as in the original query). Omitting an index key is a useful feature but, depending on the index structure, it has an impact on the index efficiency.
Fig. 6.11 Indices optimisation gain for Query 6.2
6.4 Multiple Index Invocation Test
If an index call is located on the right side of a non-algebraic operator, e.g. dot, it is likely to be evaluated more than once during the query execution. This is shown with the example Query 6.3 and the idxEmpTotalIncomes index. In Fig. 6.12 and Fig. 6.13 the logarithmic scale has been used also on the y-axis. In Fig. 6.12 the dependency between the optimisation gain and the number of persons is close to linear and grows to 457 for 300000 person objects.
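The effect measured in this test can be sketched in Python (hypothetical data and names, not ODRA code): when the right-hand subquery is evaluated once per outer element, replacing a full rescan with a hash-index lookup turns a roughly quadratic evaluation into a roughly linear one.

```python
from collections import Counter

# Hypothetical data: each employee's precomputed total incomes.
total_incomes = [1200, 3400, 1200, 5600, 3400, 1200]

# Reference evaluation: the inner 'where' rescans all employees
# for every outer employee -- quadratic in the collection size.
ref = [sum(1 for t in total_incomes if t == e) for e in total_incomes]

# Index-optimised evaluation: build the index once, then answer
# each of the n inner invocations with an O(1) lookup.
idx = Counter(total_incomes)            # total incomes -> employee count
opt = [idx[e] for e in total_incomes]

assert ref == opt                        # same result, far fewer comparisons
print(opt)  # -> [3, 2, 3, 1, 2, 3]
```

The more often the index is invoked, the better its one-off construction cost amortises, which is also the observation behind the volatile indexing technique of Chapter 7.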
Query 6.3a: For 61-year-old, married employees living in Łódź and working in Łódź or Wrocław, retrieves the name concatenated with the surname and the number of employees with an equal amount of total incomes.
reference
((Emp where address.city = "Łódź" and worksIn.Dept.address.city in ("Łódź" union "Wrocław") and married = true and age = 61) as e).(e.name + " " + e.surname, count(Emp where getTotalIncomes() = e.getTotalIncomes()))
index optimised
((Emp where address.city = "Łódź" and worksIn.Dept.address.city in ("Łódź" union "Wrocław") and married = true and age = 61) as e).(e.name + " " + e.surname, count(idxEmpTotalIncomes(e.getTotalIncomes())))
Fig. 6.12 Evaluation times and optimisation gain for Query 6.3
Additionally, introducing another index – idxEmpAge&WorkCity – in order to optimise the evaluation of the first part of the query can significantly influence the performance:
Query 6.3b
idxEmpTotalIncomes optimisation
((Emp where address.city = "Łódź" and worksIn.Dept.address.city in ("Łódź" union "Wrocław") and married = true and age = 61) as e).(e.name + " " + e.surname, count(idxEmpTotalIncomes(e.getTotalIncomes())))
idxEmpAge&WorkCity optimisation
((idxEmpAge&WorkCity(61 groupas $equal; ("Łódź" union "Wrocław") groupas $in) where address.city = "Łódź" and married = true) as e).(e.name + " " + e.surname, count(Emp where getTotalIncomes() = e.getTotalIncomes()))
both indices optimisation
((idxEmpAge&WorkCity(61 groupas $equal; ("Łódź" union "Wrocław") groupas $in) where address.city = "Łódź" and married = true) as e).
(e.name + " " + e.surname, count(idxEmpTotalIncomes(e.getTotalIncomes())))
For a database consisting of 300000 persons, combining both indices gives an optimisation gain approximately 40 times greater (see Fig. 6.13). Despite this difference, the most important index is the repeatedly invoked one, i.e. idxEmpTotalIncomes. Without this index the query performance does not improve noticeably.
Fig. 6.13 Indices optimisation gain for Query 6.3
6.5 Complex Expression Based Index Test
The test concerns processing complex selection predicates. As an example, optimisation involving the idxDeptYearCost index is shown.
Query 6.4: Gets names of departments whose employees earn more than 10000 in a year.
reference
(Dept where sum(employs.Emp.salary) * 12 > 10000).name
index optimised
idxDeptYearCost((10000, 2147483647, false, true) groupas $range).name
Fig. 6.14 Evaluation times and optimisation gain for Query 6.4
The query selects some of the 13 departments. The non-optimised execution time grows linearly with the increasing number of Person objects. In case of the idxDeptYearCost index a precise key value is pre-calculated and stored inside the index structure. Therefore, the index call execution is extremely fast and independent of the number of persons.
6.6 Disjunction of Predicates Test
The query optimisation presented below is the result of the process described in section 5.5.3. It is assumed that the administrator has created the idxEmpCity and idxEmpWorkCity indices.
A lack of either of them would make the optimisation impossible, because the execution of Query 6.5 would still require processing all data.
Query 6.5a: Retrieves employees aged at least 57 and less than 61 who live or work in Szczecin.
reference
count(Emp where age >= 57 and age < 61 and (address.city = "Szczecin" or worksIn.Dept.address.city in "Szczecin"))
index optimised
count(uniqueref((idxEmpCity("Szczecin" groupas $equal) where deref(age) >= 57 and deref(age) < 61) union (idxEmpWorkCity("Szczecin" groupas $equal) where deref(age) >= 57 and deref(age) < 61)))
Fig. 6.15 Evaluation times and optimisation gain for Query 6.5
The plot confirms that the implemented solution improves the query execution time even for a database of 1000 person objects. The gain decrease for 100000 persons results from random differences in the distribution of the objects specified by the index parameters (the number of objects matching the given criteria grows linearly only on average). In some situations it is possible to take advantage of an index which does not require rewriting the predicates in the disjunction. In this case the idxEmpAge index can be introduced by the administrator, and splitting the given query into two where clauses can be avoided. The plot in Fig. 6.16, however, indicates that in this particular case the latter optimisation generally produced a smaller optimisation gain. The proper approach to selecting the most efficient solution should rely on a fast and reasonably accurate cost model (see subchapter 5.4).
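The rewriting exercised by Query 6.5 — replacing a disjunctive predicate with a union of two index calls, deduplicated by object reference — can be sketched in Python (hypothetical records and index names, not ODRA code):

```python
# Hypothetical employee records: (oid, age, home_city, work_city).
emps = [(0, 58, "Szczecin", "Szczecin"),
        (1, 59, "Szczecin", "Łódź"),
        (2, 60, "Łódź", "Szczecin"),
        (3, 45, "Szczecin", "Szczecin")]

# Two single-key indices, as created by the administrator.
idx_city, idx_work_city = {}, {}
for oid, age, city, work in emps:
    idx_city.setdefault(city, []).append(oid)
    idx_work_city.setdefault(work, []).append(oid)

by_oid = {e[0]: e for e in emps}

def residual(oid):
    # Residual age-range predicate applied to each index result.
    return 57 <= by_oid[oid][1] < 61

# uniqueref(union of both filtered index calls): set union removes
# employees matched by both disjuncts (here, employee 0).
result = sorted({oid for oid in idx_city.get("Szczecin", []) if residual(oid)} |
                {oid for oid in idx_work_city.get("Szczecin", []) if residual(oid)})
print(result)  # -> [0, 1, 2]
```

Without the set-based deduplication, employees satisfying both disjuncts would be counted twice, which is exactly what the uniqueref operator prevents in the rewritten query.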
Query 6.5b
idxEmpCity and idxEmpWorkCity optimisation
count(uniqueref((idxEmpCity("Szczecin" groupas $equal) where deref(age) >= 57 and deref(age) < 61) union (idxEmpWorkCity("Szczecin" groupas $equal) where deref(age) >= 57 and deref(age) < 61)))
idxEmpAge optimisation
count(idxEmpAge((57, 61, true, false) groupas $range) where (address.city = "Szczecin" or worksIn.Dept.address.city in "Szczecin"))
Fig. 6.16 Indices optimisation gain for Query 6.5
Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
The updateable views described in subchapter 3.5 allow seamless integration of heterogeneous data sources. As a result, users can transparently process and modify data shared by the contributing resources. Because of its complex multilayer architecture, such an environment is not efficiency-oriented; therefore, it requires dedicated optimisation methods. In this context, the approach to index maintenance presented in subchapter 4.3 is inappropriate. External resource objects or table rows used to determine non-key or key values are not easy to identify. Additionally, a distributed database is unaware of local data modifications. A generic solution to this problem is out of the scope of the dissertation, since it is a wide research topic in itself. The author's approach, exploiting the indexing architecture presented in the thesis, is based on the observation that in many index-optimised queries an index is invoked multiple times. In such a scenario, query performance would benefit even if the index were created during the evaluation of the query. Consequently, index maintenance becomes unnecessary.
7.1 Volatile Indexing
The idea of a volatile index is similar to a temporary index used in RDBMSs (see subchapter 2.3).
A regular index, as a redundant structure, requires an automatic updating mechanism to stay consistent with the data. In case of volatile indices the database permanently stores only the index definition and materialises the index during the query evaluation. Therefore, automatic updating for volatile indices becomes superfluous. The main and obvious disadvantage of this approach is the necessity to perform the index materialisation during query evaluation. The index generation takes at least as long as a single evaluation of the where clause being optimised. Therefore, the query evaluation performance will not improve if such an index is invoked only once. The index optimiser should predict such situations to avoid unnecessarily generating a volatile index.
7.1.1 Conditions for Volatile Indexing Optimisation
When a volatile index is called multiple times during the query evaluation, the optimisation gain can be comparable to that of a regular index. Such a situation can occur when the optimised where clause is situated on the right side of a non-algebraic operator, as presented in the following figure.
Fig. 7.1 Query suitable for applying a volatile index (a non-algebraic expression with subexpressions leftexpr and rightexpr, where rightexpr is a where clause composed of refsexpr and predicates)
When the leftexpr expression returns a collection, the rightexpr containing a where clause is evaluated repeatedly, once for each element of the result (see Tab. 3-3). It is assumed that there exists an index defined on objects returned by the refsexpr expression. Consequently, refsexpr has to be independent of the non-algebraic operator.
Moreover, the predicates expression should contain selection predicates defining key values for the given index which depend on the given non-algebraic operator, so that the index key is context dependent (key values should not be constant for all iterations of the non-algebraic operator). Otherwise the where clause would be independent and should be evaluated only once, before the non-algebraic expression (see the factoring out independent subqueries method in section 5.6.1).
7.1.2 Index Materialisation
In ODRA the volatile index materialisation occurs directly before the first index invocation. It consists of the following steps:
1. The non-key and key values are calculated through execution of the query:
<nonkeyexpr> join (<keyexpr_1> [ , <keyexpr_2> ... ])
which is generated on the basis of the index definition. The nonkeyexpr expression is equal to the refsexpr expression from the optimised where clause (see Fig. 7.1) and the keyexpr_i expressions are derived from the predicates. The query returns a collection of structures consisting of a non-key value and the corresponding key values.
2. An index structure is initialised.
3. The cached query run-time result is made available to the index structure.
4. Non-key values are indexed according to key values.
In that way, cached run-time results are directly returned by index calls during a volatile index optimised query evaluation. A normal index constructs run-time results from the values stored in the database; therefore, an individual volatile index invocation might be faster. After the query execution the volatile index contents are removed and only the index definition remains. It is vital that the query determining the index non-key and key values be executed as efficiently as possible; hence, the participation of available optimisers is often indispensable.
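The materialisation steps listed above can be sketched in Python (a hypothetical illustration of lazy, per-query materialisation; all names are invented, not ODRA code):

```python
class VolatileIndex:
    """Materialised on first invocation, discarded after the query."""

    def __init__(self, nonkey_fn, key_fn, source):
        self._nonkey_fn = nonkey_fn   # refsexpr: non-key value per object
        self._key_fn = key_fn         # keyexpr: key value per object
        self._source = source         # collection addressed by refsexpr
        self._buckets = None          # index structure not yet initialised

    def _materialise(self):
        # Steps 1-4: evaluate <nonkeyexpr> join <keyexpr> once and
        # cache the run-time results grouped by key value.
        self._buckets = {}
        for obj in self._source:
            self._buckets.setdefault(self._key_fn(obj), []).append(
                self._nonkey_fn(obj))

    def __call__(self, key):
        if self._buckets is None:     # materialise before the first call only
            self._materialise()
        return self._buckets.get(key, [])

    def drop(self):
        # After query execution only the index definition remains.
        self._buckets = None

emps = [{"name": "A", "incomes": 100}, {"name": "B", "incomes": 200},
        {"name": "C", "incomes": 100}]
idx = VolatileIndex(lambda e: e["name"], lambda e: e["incomes"], emps)
print(len(idx(100)))  # -> 2 ; the index was built lazily by this call
idx.drop()
```

Because cached run-time results are returned directly, each invocation after the first is a plain dictionary lookup, while dropping the contents after the query removes any need for an updating mechanism.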
Efficient execution is particularly important when the query addresses a distributed and heterogeneous collection of objects.
7.1.3 Solution Properties
Many important properties of a regular index also apply to a volatile index:
• from the user's point of view it is used like a regular index, except that it is created using a different command, e.g. add vltlindex,
• index transparency is achieved using the standard index optimiser routines,
• a volatile index call in an SBQL syntax tree and in compiled ODRA intermediate byte-code is the same as a regular index call.
For that reason, the architecture of the volatile indexing technique relies on the indexing architecture described in Chapter 4 and Chapter 5. The next sections show, by example, the effectiveness of the volatile indexing technique and its application to indexing heterogeneous and distributed resources.
7.1.4 Proof of Concept Test
Let us consider the example query introduced in subchapter 6.4:
((Emp where address.city = "Łódź" and worksIn.Dept.address.city in ("Łódź" union "Wrocław") and married = true and age = 61) as e).(e.name + " " + e.surname, count(Emp where getTotalIncomes() = e.getTotalIncomes()))
Its syntax tree is depicted below.
Fig. 7.2 Syntax tree of the example query
The dot expression (at the root of the syntax tree) represents the root non-algebraic operator from Fig. 7.1. Its left subquery returns a collection of binders named e containing references to 61-year-old, married employees living in Łódź and working in Łódź or Wrocław. The right dot subquery is evaluated for each binder with an employee object. A count expression, marked with a dashed line, calculates the number of employees whose total incomes equal those of the processed employee.
In the example from subchapter 6.4 a where clause in this subexpression was substituted with an index call:
count(idxEmpTotalIncomes(e.getTotalIncomes()))
The index key is the total income of the currently processed employee; thus, it depends on the dot non-algebraic operator. The query meets all conditions, presented in section 7.1.1, that are necessary to take advantage of a volatile index vltlIdxEmpTotalIncomes defined similarly to the idxEmpTotalIncomes index. Consequently, the count subexpression can be transformed accordingly:
count(vltlIdxEmpTotalIncomes(e.getTotalIncomes()))
The plot in Fig. 7.3 shows the optimisation gain for the given query optimised with the use of the indices mentioned above. The gain for the query optimised using the volatile indexing technique is in general smaller than in case of a regular index. Nevertheless, for a database consisting of more than 30000 persons the query performance improvement is significant: after applying a volatile index the query execution is more than 39 times faster.
Fig. 7.3 Optimisation gain for volatile and regular indices
7.2 Optimising Queries Addressing Heterogeneous Resources
The most important feature distinguishing a volatile index from a normal index concerns limitations on non-key values. In the current solution, regular indices can index database objects defined using simple path expressions which return object references. This limitation is caused by the index updating mechanism; therefore, it does not concern the volatile indexing technique, where the non-key definition can be an arbitrary expression returning:
• remote object references,
• virtual object references (updateable views' seeds – see subchapter 3.5),
• binders and literals.
The basic assumption concerning a non-key object and key value definition for a volatile index is determinism, i.e. it must return exactly the same result provided that the data used to calculate it have not changed. The significant advantage of the volatile indexing technique is its practicability from the point of view of integration of distributed and heterogeneous resources. The next section gives an overall description of the wrapper enabling the transparent integration of RDBMS resources into the ODRA distributed database repository. The following section gives an example involving processing of a heterogeneous schema exploiting the wrapper. The test proves that the volatile indexing technique can significantly facilitate the evaluation of queries in such an environment.
7.2.1 Overview of a Wrapper to RDBMS
Wrapping relational resources into the ODRA prototype has been developed under the eGov-Bus project12 [28]. The ODRA database server is used as a virtual repository – the only component accessible to the top-level users and applications. A virtual repository presents a global schema. Virtually integrated data from the underlying heterogeneous resources are made available using SBQL views' definitions (described in subchapter 3.5). In the virtual repository concept neither data nor services are to be copied, replicated and maintained in the global schema, as they are supplied, stored, processed and maintained on their autonomous sites. The research devoted to object-oriented wrappers to relational databases supporting query optimisation has been described in work [128] and in many papers [63, 64, 129, 130, 131, 132]. The author has contributed to the virtual repository and wrapper development. An ODRA resource (an ODRA engine) denotes any data resource providing an interface capable of executing SBQL queries and returning SBQL result objects.
The nature of such a resource is irrelevant, as only the mentioned capability is important. In the simplest case, where a resource is an ODRA database, its interface has direct access to an ODRA database engine (DBMS). However, as the virtual repository aims to integrate existing business resources, whose models are mainly relational, an interface becomes much more complicated: there is no directly available data store, so SBQL result objects must be created dynamically based on the results of SQL queries evaluated directly in a local RDBMS. Such cases (the most common in real-life applications) force the introduction of additional middleware, an object-relational wrapper designed as a client-server solution. A standard ODRA database can be extended with as many wrappers as needed (e.g. for relational or semi-structured data stores) and plugged into any resource model without any loss of its primary performance. Furthermore, a wrapper server can be developed independently, providing a communication protocol to its client. Of course, an ODRA database with a wrapper's client can work on a separate machine. A query evaluation process exploiting the wrapper is depicted in Fig. 7.4. One of the global applications sends a query (arrow 1). This query is expressed in SBQL, as it refers to the business object-oriented model available to global (top-level) users. According to the global schema and its information on data fragmentation, replication and physical location (obtained from integration schemata), the query is sent to appropriate resources. In Fig. 7.4 this stage is realised with arrows 2, 2a and 2b.
12 Advanced eGovernment Information Service Bus supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme – contract number: FP6-IST-4-026727-STP
Fig. 7.4 Query evaluation through the wrapper [64]
The partial query aiming at a given relational resource is further processed by the resource's ODRA interface. First, the interface performs query optimisation. Apart from the efficient SBQL optimisation rules applied at any resource's interface, queries can be transformed so that the powerful native SQL optimisers can work and the amount of data retrieved from the RDBMS is acceptably small. Relational optimisation information (indices, cardinalities, primary-foreign key relationships, etc.) is provided by the wrapper server's resource model (arrow 3) and appropriate SBQL query syntax tree transformations are performed. Appropriate tree branches (responsible for such SQL queries) are substituted with calls to execute immediately procedures with optimisable SQL queries. Once the syntax tree transformations are finished, the interface starts regular SBQL query evaluation. Whenever it finds an execute immediately procedure, the SQL query is sent to the server via the client (arrows 4; the client passes SQL queries without any modification). The server executes SQL queries as a resource client (JDBC connection, arrow 5) and their results (arrow 6) are encapsulated and sent to the client (arrow 7).
Subsequently, the client creates SBQL result objects from the results returned by the server (this cannot be accomplished at the resource site, which is another crucial reason for the client-server architecture) and puts them on the regular SBQL stacks for further evaluation (arrow 8). In the preferable case (which is not always possible), results returned from the server are supplied with TIDs (tuple identifiers), which enables parametrising SQL queries within the SBQL syntax tree with intermediate results of SBQL subqueries. Having finished its evaluation, the interface sends a "partial result" upwards (arrow 9), where it is combined with results returned from other resources (arrows 9a and 9b) and the global query result is composed (depending on fragmentation types, redundancies and replication). This result is returned to the global application (arrow 10). The test presented in the next section takes advantage of the ODRA and wrapper features presented above.
7.2.2 Volatile Indexing Technique Test
For the following test, a local ODRA OODBMS data schema and an external data schema are combined. The local data represent one company, which will be referred to as company O. Its schema and distribution are the same as in the tests from this and the previous chapter. The external relational schema (Fig. 7.5) concerns another company, which will be referred to as company R. Its records are automatically wrapped to a simple internal object-oriented schema.
Fig. 7.5 Example relational schema of company R
Virtually, each table row corresponds to a complex object, which consists of primitive (atomic) subobjects according to the table columns. Finally, the schema is transformed by the designed updateable views (Fig. 7.6) and extended with a virtual pointer worksIn associating employees with departments.
Fig. 7.6 Views object-oriented schema of company R
The RDBMSEmp and RDBMSDept views allow ODRA users to transparently access, process and even modify data shared by the RDBMS. For the test purposes, the relational schema created on a PostgreSQL 8.2 RDBMS has been populated with data about 100 employees and departments according to the distribution depicted in subchapter 6.1. The example usage of the volatile indexing technique is tested on the following query:
Query 7.1a: For each company O employee return the name concatenated with the surname and the number of employees with a higher salary working in company R in the same department as the given employee.
original query
Emp as empaux.(empaux.name + " " + empaux.surname, (empaux.worksIn.Dept.name as deptnameaux).((empaux.salary as empsalaryaux).count(RDBMSEmp where worksIn.RDBMSDept.name = deptnameaux and salary > empsalaryaux)))
Its syntax tree is consistent with the pattern shown in Fig. 7.1. The left dot subquery addresses all employees of company O. The right dot subquery is evaluated for each binder containing an employee object. A count expression calculates the number of company R employees selected according to a where clause. The values of the selection predicates are specified by the currently processed company O employee. In order to optimise the query evaluation, the optimiser uses the methods mentioned in section 5.6.3, i.e. the query modification technique and removing unnecessary auxiliary names. In such a form the transformed query can be processed by the wrapper optimiser and the whole count expression can be substituted with an appropriate SQL query (see Query 7.1b – reference). This query also depends on the currently processed company O employee. Therefore, it is sent multiple times to the relational data resource (connected through the wrapper), where it can be evaluated with the assistance of native optimisations provided by the RDBMS (e.g.
indices, projections, joins).
Query 7.1b
reference
Emp as empaux.(empaux.name + " " + empaux.surname, (empaux.worksIn.Dept.name as deptnameaux).((empaux.salary as empsalaryaux).execsql("select COUNT(*) from employees, departments where departments.name = '" + deptnameaux + "' AND departments.id = employees.department_id AND employees.salary > '" + empsalaryaux + "'"), "<0 $employees | | e | none | binder 0>", "admin.rdbms")
index optimised
Emp as empaux.(empaux.name + " " + empaux.surname, (empaux.worksIn.Dept.name as deptnameaux).((empaux.salary as empsalaryaux).count($vltlIdxRDBMSEmp(deptnameaux groupas $equal; (empsalaryaux, 1.7976931348623157E308, false, true) groupas $range))))
Fig. 7.7 Evaluation times and optimisation gain for Query 7.1
The alternative way to improve the performance of the Query 7.1 evaluation is to take advantage of the volatile indexing technique. The administrator can create a volatile index on RDBMSEmp defined using multiple keys. The first key is the name of the department where an employee works (definition worksIn.RDBMSDept.name) and the other is the employee's salary (definition salary). The second key should enable optimisation of range queries. Let us assume that such an index exists and its name is vltlIdxRDBMSEmp. The index optimiser would transform the given query in order to exploit the volatile indexing technique (see Query 7.1b – index optimised). The test evaluation plot in Fig. 7.7 depends on the number of company O employees. The gain indicates that the second approach to query optimisation results in more than 40 times better performance. The significant advantage of the index-optimised query evaluation is that communication between ODRA and the RDBMS resource is reduced to a minimum.
The volatile index materialisation has a significant influence on the Query 7.1b evaluation. According to the description in section 7.1.2, the execution of the following query is necessary:

RDBMSEmp join (worksIn.RDBMSDept.name, salary)

A naïve evaluation would first return the whole contents of the employees table corresponding to the RDBMSEmp expression. Next, for each table row an appropriate SQL query would be issued to the RDBMS in order to determine the name of the employee's department. In order to improve the evaluation of the query determining non-key and key values, the query modification technique, the removal of unnecessary auxiliary names and the wrapper optimiser's optimisation methods have been used. Consequently, the query has been transformed into the following optimised form:

execsql("select employees.info, employees.department_id, employees.surname, employees.salary, employees.id, employees.sex, employees.name, employees.birth_date, departments.name from employees, departments where ((departments.id = employees.department_id) AND (departments.id = departments.id))",
        "<0 | | | none | struct <1 $employees | | e | none | binder 1> <1 $departments | $name | | string | value 1> <1 $employees | $salary | | real | value 1> 0>",
        "admin.rdbms")

As a result, sending only one SQL query to the wrapper is necessary to materialise the vltlIdxRDBMSEmp volatile index, i.e. to cache results containing seeds of virtual objects corresponding to company R employee records together with the required key values. The evaluation of this query on the given RDBMS is generally longer and retrieves a larger amount of data than a single SQL query invocation from the reference query. Nevertheless, the profit of using a volatile index is considerable. The gain indicates that a single invocation of a volatile index is more than 40 times faster than the execsql evaluation in the reference query.
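The caching behaviour that makes individual volatile index calls cheap can be sketched as follows (illustrative Python with hypothetical names, not the ODRA implementation; in the example above, the bulk fetch corresponds to the single execsql query):

```python
class VolatileIndex:
    """Illustrative volatile index: materialised lazily by a single bulk
    fetch, after which every index call is served from a local cache."""

    def __init__(self, bulk_fetch):
        # bulk_fetch issues the one bulk query and returns (key, seed) pairs
        self._bulk_fetch = bulk_fetch
        self._cache = None

    def __call__(self, key):
        if self._cache is None:  # materialise on the first invocation only
            self._cache = {}
            for k, seed in self._bulk_fetch():
                self._cache.setdefault(k, []).append(seed)
        return self._cache.get(key, [])
```

Wrapping the single SQL round trip in `bulk_fetch` makes every subsequent index call a dictionary lookup, which is why the overall gain grows with the number of times the index is invoked during one query evaluation.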
The test has been performed on a single machine, so the communication between the wrapper and the RDBMS storing company R data was realised through a local loopback interface. Nevertheless, the wrapper can access resources in a distributed environment without obstruction. Additional factors, e.g. network throughput and traffic delays, negatively influence the overall query evaluation performance. Particularly in the context of the current example, they would significantly deteriorate the execution of the reference query, because it requires processing multiple SQL queries. The only contraindication to introducing the volatile indexing technique would occur if index materialisation consumed too many local resources and deteriorated the performance. This, however, is not an issue in the given example and should be considered by the database administrator.

The volatile indexing solution is generic. It can be applied to any schema consisting of deterministic views, e.g.:
• views transforming one schema (a schema of actual data or a view schema) into another,
• views providing access to external, legacy resources (e.g. returning objects provided by the wrapper to an RDBMS) regardless of their location,
• views integrating data from several distributed and heterogeneous resources into a common object schema (see the integration schema description in subchapter 4.4).

In order to test the volatile indexing technique against the last type of views, extending the ODRA OODBMS with mechanisms supporting such schemas is required.

Chapter 8 Conclusions

The theses stated in the Ph.D. dissertation have been proved valid:

1. Processing of selection predicates based on arbitrary key expressions accessing data in a distributed object-oriented database can be optimised by centralised or distributed transparent indexing.

The designed indexing architecture provides the assumed level of transparency in the established distributed and homogeneous environments.
The architecture comprises the optimisation module, which is able to transparently employ indices in a query, automatic updating of indices in response to modifications of the corresponding data, and the administration module for organising and managing indices.

In order to enable the creation and maintenance of indices supporting keys defined using arbitrary, deterministic and side-effect-free expressions, the author has introduced a special kind of database triggers. Each individual database object is associated with an Index Update Trigger (IUT) if it belongs to an indexed collection, contains nested indexed objects or is used to determine a key value. Any modification submitted to such objects triggers a procedure necessary to update the corresponding index. Triggers associated with objects used in key value evaluation are called Key Index Update Triggers (KIUTs). Besides the information about an associated index, KIUTs hold references to the corresponding indexed object. Determining the objects participating in the calculation of the key value is made possible by an extension of the query execution engine enabling logging of the objects that occur during binding. As a result, together with a re-calculation of a key value for an indexed object, it is possible to validate and correct the existing KIUTs. This approach is generic regardless of the expression defining an index key.

In general, query optimisation consists in the analysis of a query syntax tree and replacing its parts with index calls using selection predicates as call parameters. The index optimiser is capable of processing both conjunctions and disjunctions of predicates. The solution considers the established object model and the properties of the SBQL query language, providing rules ensuring that the optimisations preserve query semantics. There are no restrictions concerning supported selection predicates or the level of the transformed query, i.e. whether it addresses a local or a global schema.
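The trigger mechanism summarised above can be illustrated with a minimal sketch (hypothetical Python names, not the ODRA implementation): every object touched while evaluating the key carries a trigger pointing back to the indexed object, and any update re-evaluates the key, moves the index entry and refreshes the trigger set using the logged bindings.

```python
class KIUT:
    """Key Index Update Trigger: links an object used in key evaluation
    to the index and the indexed object it contributes to."""
    def __init__(self, index, indexed_obj):
        self.index = index
        self.indexed_obj = indexed_obj

class Attr:
    """A database sub-object; updating it fires the attached triggers."""
    def __init__(self, value):
        self.value = value
        self.triggers = []

    def update(self, new_value):
        self.value = new_value
        for t in list(self.triggers):
            t.index.refresh(t.indexed_obj)

class Index:
    """key_expr(obj) returns (key, attrs touched during evaluation);
    the 'touched' list stands in for the engine's binding log."""
    def __init__(self, key_expr):
        self.key_expr = key_expr
        self._keys = {}    # indexed object -> current key
        self.table = {}    # key -> set of indexed objects

    def add(self, obj):
        self.refresh(obj)

    def refresh(self, obj):
        old = self._keys.get(obj)
        if old is not None:
            self.table[old].discard(obj)
        key, touched = self.key_expr(obj)   # re-evaluate key, log bindings
        self._keys[obj] = key
        self.table.setdefault(key, set()).add(obj)
        for attr in touched:                # validate / install KIUTs
            if not any(t.index is self and t.indexed_obj is obj
                       for t in attr.triggers):
                attr.triggers.append(KIUT(self, obj))
```

A salary update then automatically moves the employee between index buckets, because re-evaluating the key also re-validates which objects carry triggers.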
Finally, no restrictions concern the selection of an index structure. The employed indexing technique, i.e. linear hashing, is exemplary. However, it is essential that linear hashing can be substituted with its scalable distributed equivalent (the LH* SDDS) without any changes to the work of the other elements of the presented indexing architecture. Furthermore, this would enable the parallelisation of computations and increase the concurrency of the index. The author has also proposed a rank queries optimisation method that is efficient in a distributed environment, particularly when taking advantage of a distributed index is possible.

2. Evaluation of complex queries involving distributed heterogeneous resources can be facilitated by techniques taking advantage of transparent index optimisation.

Heterogeneity introduces a higher complexity level of the database architecture, which makes employing global indexing difficult. External schemas can be imported into an object-oriented database and their data can be processed transparently, i.e. indistinguishably from purely object-oriented data. This approach has been applied in the wrapper to a relational database developed for the OODBMS based on SBA and SBQL. Furthermore, updateable object-oriented views, which are defined in SBQL, can envelope and arbitrarily transform an external schema to build a top-level object-oriented schema.

The author's volatile indexing technique addresses the environment depicted above. It relies on a significant part of the developed indexing architecture, i.e. the index optimiser and the management facilities, omitting the mechanisms responsible for maintaining cohesion between indices and indexed data. In contrast to regular indices, volatile indices are materialised at the time of query evaluation. The indexed results are cached, making individual index calls very efficient. The improvement of query performance depends on the number of times a volatile index is invoked during the query evaluation.
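For completeness, the exemplary structure named above, linear hashing, can be sketched minimally in Python (a single-machine illustration of the split mechanics only, not the ODRA or LH* implementation; all names are hypothetical):

```python
class LinearHashing:
    """Minimal linear hashing sketch: buckets are split one at a time in
    round-robin order as the load factor grows, so the file expands
    gradually instead of doubling at once."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets     # buckets at level 0
        self.level = 0                # current doubling level
        self.split_ptr = 0            # next bucket to split
        self.max_load = max_load
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _addr(self, key):
        h = hash(key)
        a = h % (self.n0 * (1 << self.level))
        if a < self.split_ptr:        # bucket already split on this level
            a = h % (self.n0 * (1 << (self.level + 1)))
        return a

    def insert(self, key, value):
        self.buckets[self._addr(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        old = self.buckets[self.split_ptr]
        self.buckets[self.split_ptr] = []
        self.buckets.append([])
        self.split_ptr += 1
        if self.split_ptr == self.n0 * (1 << self.level):
            self.level += 1           # a full doubling round completed
            self.split_ptr = 0
        for key, value in old:        # redistribute with the finer hash
            self.buckets[self._addr(key)].append((key, value))

    def lookup(self, key):
        return [v for k, v in self.buckets[self._addr(key)] if k == key]
```

The point relevant to the thesis is architectural: a client of `insert`/`lookup` is unaffected if the bucket array is replaced by buckets spread over servers, which is exactly the LH* substitution argued for above.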
Therefore, this approach proves its efficacy in the optimisation of complex and laborious global queries.

The theses have been confirmed by the prototype implementation of indices for the ODRA prototype and by tests presenting the example optimisation gain for the majority of the proposed solutions.

8.1 Future Work

The thesis is a significant contribution to the topic of transparent indexing in distributed object-oriented databases. Nevertheless, there still exist many unexplored directions of research in this domain. The presented solutions can be considered a solid foundation for further work.

The current considerations lead to work on creating robust global indices for horizontally fragmented homogeneous data. Vertical and mixed fragmentations involve SBQL views constituting the global schema. In order to extend the capabilities of indexing to such a model, the semantics of views should be taken into consideration. The important research issues concern determining:
• a method of persisting inside an index the data made virtually available by views,
• rules which constrain the creation of such global indices.

A workaround for those problems, applicable to the specified family of queries, is the author's volatile indexing technique. Currently, control over this technique is given to the administrator. Automating the creation of volatile indices by the database engine would enable better adaptation of an index definition to a particular query. This would require constructing algorithms for finding the parts of queries that would gain from indexing and analysing selection predicates to determine the best combination of index keys. The author believes that research on this subject in the context of SBA and SBQL would result in original query optimisation methods.

Another challenging subject is designing transparent global non-volatile indices facilitating the processing of distributed and heterogeneous resources.
The first problem concerns identifying external data, e.g. relational tuples, within the index. Consequently, a mechanism enabling fast materialisation of individual objects wrapping external data should be provided. Finally, full transparency involves the development of an architecture ensuring automatic updating of indices in response to external data alterations. Solving those problems may require introducing special facilities for registering external resources, e.g. a global object register, and extending the wrapping mechanisms with additional functionality, e.g. update trigger support.

The efficacy of all future solutions should additionally be proved by tests. Therefore, extending the indexing implementation for the ODRA prototype is an essential issue. The closest works will involve support for distributed transactions.

Index of Figures

Fig. 2.1 Typical stages of high-level language query optimisation [29] ... 20
Fig. 2.2 Example of a bucket split operation [72] ... 28
Fig. 2.3 Example object-relational schemata ... 42
Fig. 3.1 Example of an object-oriented database schema for a company ... 52
Fig. 3.2 Sample store with classes and objects ... 54
Fig. 4.1 Index manager structure ... 71
Fig. 4.2 Example Nonkey structure for Emp collection ... 72
Fig. 4.3 Example Index Update Triggers generated for idxPerAge index ... 77
Fig. 4.4 Example Index Update Triggers generated for idxEmpWorkCity index ... 78
Fig. 4.5 Example Index Update Triggers generated for idxAddrStreet index ... 78
Fig.
4.6 Automatic index updating architecture ... 78
Fig. 4.7 Calculating the idxPerAge index key value for i31 object ... 82
Fig. 4.8 Calculating the idxEmpWorkCity index key value before update ... 84
Fig. 4.9 Calculating the idxEmpWorkCity index key value after update ... 84
Fig. 4.10 Calculating the idxPerZip index key value before removing zip attribute ... 86
Fig. 4.11 Calculating the idxPerZip index key value without zip attribute ... 86
Fig. 4.12 Calculating the idxPerZip index key value after inserting zip attribute ... 87
Fig. 4.13 Calculating the idxEmpTotalIncomes index key value for i61 object before update ... 88
Fig. 4.14 Calculating the idxEmpTotalIncomes index key value for i31 object before update ... 89
Fig. 4.15 Last steps of computing the idxEmpTotalIncomes index key value for i31 after update ... 90
Fig. 4.16 Example database schema for data integration ... 99
Fig. 5.1 ODRA optimisation architecture [2] ... 103
Fig. 5.2 Schema of the index optimiser ... 104
Fig. 5.3 Example optimisation applied by the index optimiser ... 105
Fig. 5.4 Index optimiser algorithm ... 107
Fig.
5.5 Query optimisation with the index optimiser pre-processing ... 126
Fig. 6.1 Department's location distribution ... 142
Fig. 6.2 Employee's department distribution ... 142
Fig. 6.3 Employee's salary range distribution ... 143
Fig. 6.4 Female person's first name distribution ... 143
Fig. 6.5 Female person's surname distribution ... 143
Fig. 6.6 Male person's first name distribution ... 143
Fig. 6.7 Male person's surname distribution ... 144
Fig. 6.8 Evaluation times and optimisation gain for Query 6.1 ... 145
Fig. 6.9 Indices optimisation gain for Query 6.1 ... 146
Fig. 6.10 Evaluation times and optimisation gain for Query 6.2 ... 147
Fig. 6.11 Indices optimisation gain for Query 6.2 ... 148
Fig. 6.12 Evaluation times and optimisation gain for Query 6.3 ... 149
Fig. 6.13 Indices optimisation gain for Query 6.3 ... 150
Fig. 6.14 Evaluation times and optimisation gain for Query 6.4 ... 150
Fig. 6.15 Evaluation times and optimisation gain for Query 6.5 ... 151
Fig. 6.16 Indices optimisation gain for Query 6.5 ... 152
Fig. 7.1 Query suitable for applying a volatile index ... 154
Fig.
7.2 Syntax tree of the example query ... 156
Fig. 7.3 Optimisation gain for volatile and regular indices ... 157
Fig. 7.4 Query evaluation through the wrapper [64] ... 159
Fig. 7.5 Example relational schema of company R ... 161
Fig. 7.6 Views object-oriented schema of company R ... 161
Fig. 7.7 Evaluation times and optimisation gain for Query 7.1 ... 162

Index of Tables

Tab. 3-1 Evaluation of traditional arithmetic operators ... 57
Tab. 3-2 Evaluation of operators working on collections ... 57
Tab. 3-3 Evaluation of non-algebraic SBQL operators ... 59
Tab. 3-4 Evaluation of auxiliary names defining operators ... 60
Tab. 3-5 Evaluation of sequences ranking operators ... 60
Tab. 3-6 Evaluation of imperative operators ... 61
Tab. 5-1 Features of Rank Queries Evaluation Strategies ... 139
Tab. 6-1 Optimisation testbench configuration ... 142

Bibliography

1. Adamus R., Habela P., Kaczmarski K., Lentner M., Stencel K., Subieta K.: Stack-Based Architecture and Stack-Based Query Language. ICOODB 2008, Berlin: http://www.odbms.org/download/030.02%20Subieta%20StackBased%20Architecture%20and%20StackBased%20Query%20Language%20March%202008.PDF
2. Adamus R., Kowalski T.M., Subieta K., et al.: Overview of the Project ODRA.
Proceedings of the First International Conference on Object Databases, ICOODB 2008, Berlin, ISBN 078-7399-412-9, pp. 179-197
3. Aguilera M. K., Golab W., Shah M. A.: A practical scalable distributed B-tree. Proceedings of the VLDB Endowment, 1(1), pp. 598-609, 2008
4. Ali M. H., Saad A. A., Ismail M. A.: The PN-Tree: A Parallel and Distributed Multidimensional Index. Distributed and Parallel Databases, 17(2), pp. 111-133, 2005
5. Andrzejewski W., Królikowski Z., Masewicz M., Wrembel R.: Hidden Markov Models as prediction mechanism for object oriented database systems with hierarchical materialisation. II Krajowa Konferencja Naukowa "Technologie Przetwarzania Danych", Poznań, September 2007 (in Polish)
6. Astrahan M. M.: System R: A relational approach to data management. ACM Transactions on Database Systems, 1(2), pp. 97-137, June 1976
7. Basu J., Keller A. M., Pöss M.: Centralized versus Distributed Index Schemes in OODBMS - A Performance Analysis. Proc. of ADBIS 1997, pp. 162-169
8. Bayer R., McCreight E.: Organization and maintenance of large ordered indexes. Acta Inf. 1, 1972, pp. 173-189
9. Bertino E.: Method precomputation in object-oriented databases. SIGOS Bulletin, 12(2, 3), 1991, pp. 199-212
10. Bertino E. et al.: Indexing Techniques for Advanced Database Systems. Kluwer Academic Publishers, Boston Dordrecht London, 1997
11. Bertino E., Catania B., Chiesa L.: Definition and Analysis of Index Organizations for Object-Oriented Database Systems. Information Systems, v.23 n.2, pp. 65-108, April 1, 1998
12. Bertino E., Foscoli P.: Index Organizations for Object-Oriented Database Systems. IEEE Transactions on Knowledge and Data Engineering, Volume 7, Issue 2 (April 1995), pp. 193-209
13. Bębel B., Wrembel R.: Method Materialization Using the Hierarchical Technique: Experimental Evaluation. Proc. of Joint Conference on Knowledge-Based Software Engineering (JCKBSE), Slovenia, 2002
14.
Black P.E.: Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology, 17 November 2008: http://www.nist.gov/dads
15. Blasgen M. W., Casey R. G., Eswaran K. P.: An Encoding Method for Multifield Sorting and Indexing. Communications of the ACM, Nov. 1977, p. 874
16. Nam B., Sussman A.: DiST: Fully Decentralized Indexing for Querying Distributed Multidimensional Datasets. Proceedings of the 20th IPDPS 2006, IEEE 2006
17. Burleson D.: Turbocharge SQL with advanced Oracle9i indexing. March 26, 2002: http://www.dba-oracle.com/art_9i_indexing.htm
18. Cattell R.G.G., Barry D.K. (Eds.): The Object Data Standard: ODMG 3.0. Morgan Kaufmann 2000
19. Cambazoglu B.B., Catal A., Aykanat C.: Effect of Inverted Index Partitioning Schemes on Performance of Query Processing in Parallel Text Retrieval Systems. ISCIS 2006, Istanbul, Turkey, pp. 717-725
20. Chaudhuri S.: An Overview of Query Optimization in Relational Systems. Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, Seattle, Washington, United States, pp. 34-43, 1998
21. Chen Y., Chen Y.: Signature file hierarchies and signature graphs: a new index method for object-oriented databases. Proc. of the SAC, pp. 724-728, 2004
22. Cook W.R., Rosenberger C.: Native Queries for Persistent Objects: A Design White Paper. 2006: http://www.db4o.com/about/productinformation/whitepapers/Native%20Queries%20Whitepaper.pdf
23. Cormen T. H. et al.: Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001, ISBN 0-262-03293-7, Chapter 9: Medians and Order Statistics, pp. 183-196
24. DB2: http://ibm.com/software/data/db2
25. db4o: http://www.db4o.com/
26. db4o Tutorial for Java. Production Release V6.3: http://www.db4o.com/about/productinformation/resources/db4o-6.3-tutorialjava.pdf
27. Eder J., Frank H., Liebhart W.: Optimization of Object-Oriented Queries by Inverse Methods.
Proceedings of the 2nd International East/West Database Workshop, September 1994, Klagenfurt, Austria, pp. 108-120
28. eGov-Bus: http://www.egov-bus.org/web/guest/home
29. Elmasri R., Navathe S. B.: Fundamentals of Database Systems. 4th Edition, Pearson Education, Inc., publishing as Addison-Wesley, 2004, ISBN 0-321-12226-7
30. Fenk R., Markl V., Bayer R.: Interval Processing with the UB-Tree. Proc. IDEAS Conf., IEEE Computer Society, 2002, pp. 12-22
31. Firebird: http://www.firebirdsql.org/
32. Gaede V., Günther O.: Multidimensional Access Methods. ACM Computing Surveys, 30(2), pp. 170-231, June 1998
33. Garcia-Molina H., Ullman J.D., Widom J.: Database Systems: The Complete Book. 1st edition, Pearson Education, Inc., publishing as Prentice Hall, 2002
34. Garcés-Erice L., et al.: Data Indexing in Peer-to-Peer DHT Networks. Proceedings of the ICDCS 2004, pp. 200-208
35. GemFire Enterprise Developer's Guide. Version 5.7, GemStone, September 2008: http://www.gemstone.com/docs/5.7.0/product/docs/html/Manuals/wwhelp/wwhimpl/js/html/wwhelp.htm
36. GemStone Facets™ Programming Guide. Version 4.0, GemStone, June 2006: http://www.facetsodb.com/downloads/facets/Programming.pdf
37. GemStone Systems, Inc.: http://www.gemstone.com/
38. Gnutella Protocol Development: http://rfc-gnutella.sourceforge.net/
39. Habela P., Kaczmarski K., Kozankiewicz H., Lentner M., Stencel K., Subieta K.: Data-Intensive Grid Computing Based on Updateable Views. ICS PAS Report 974, June 2004
40. Hadjieleftheriou M., Hoel E. G., Tsotras V. J.: SaIL: A Spatial Index Library for Efficient Application Integration. GeoInformatica, 9(4), pp. 367-389, 2005
41. Helmer S., Moerkotte G.: A performance study of four index structures for set-valued attributes of low cardinality. VLDB Journal, 12(3), pp. 244-261, October 2003
42. Henrich A.: P-OQL: an OQL-oriented query language for PCTE. In Proc. 7th Conf.
on Software Engineering Environments (SEE '95), pp. 48-60, Noordwijkerhout, The Netherlands, 1995, IEEE
43. Henrich A.: The Update of Index Structures in Object-Oriented DBMS. Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM'97), Las Vegas, Nevada, November 10-14, 1997, ACM 1997, ISBN 0-89791-970-X, pp. 136-143
44. Hosain M. S., Newton M. A. H., Rahman M. M.: Dynamic Adaptation of Multi-key Index for Distributed Database System. Proceedings of the 9th WSEAS International Conference on Computers, Athens, Greece, July 2005
45. Hosain M. S., Newton M. A. H.: Multi-Key Index for Distributed Database System. International Journal of Software Engineering and Knowledge Engineering, Vol. 15, No. 2, May 2005, pp. 433-438
46. Hwang D. J.: Function-based indexing for object-oriented databases. PhD thesis, Massachusetts Institute of Technology, February 1994
47. H-PCTE: http://pi.informatik.uni-siegen.de/pi/hpcte/hpcte.html
48. IBM® DB2 Information Center. Version 9.5, 6 August 2008: http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/
49. IBM® Informix® Dynamic Server Information Center. Version 11.50, 20 August 2008: http://publib.boulder.ibm.com/infocenter/idshelp/v115/
50. IBM® Informix® Virtual-Index Interface, Programmer's Manual. Version 11.50, SC23-9439-00, May 2008: http://publibfp.boulder.ibm.com/epubs/pdf/c2394390.pdf
51. IBM® System i™ and i5/OS® Information Center. Version 6, 1st edition, 2008: http://publib.boulder.ibm.com/infocenter/systems/scope/i5os/index.jsp
52. Ilyas I.F., Aref W.G. et al.: Adaptive rank-aware query optimization in relational databases. ACM Transactions on Database Systems (TODS), v.31 n.4, pp. 1257-1304, December 2006
53. Informix: http://ibm.com/informix
54. Ioannidis Y. E.: Query Optimization. ACM Computing Surveys, symposium issue on the 50th Anniversary of ACM, Vol. 28, No. 1, 1996, pp. 121-123
55. Jarke M., Koch J.: Query Optimization in Database Systems.
ACM Computing Surveys, 16(2), 1984, pp. 111-152
56. Jodłowski A.: Dynamic Object Roles in Conceptual Modelling and Databases. Ph.D. Thesis, The Institute of Computer Science, The Polish Academy of Sciences, 2002
57. Kemper A., Kilger C., Moerkotte G.: Function Materialization in Object Bases: Design, Realization and Evaluation. IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 4, August 1994, pp. 587-608
58. Kowalski T.M., Kuliberda K., Adamus R., Wislicki J., Murleski J.: Local and Global Indexing Strategies and Data-Structures in Distributed Object-Oriented Databases. SiS 2006 Proceedings, Łódź, Poland, 2006, pp. 153-156
59. Kowalski T.M., Wiślicki J., Kuliberda K., Adamus R., Subieta K.: Optimization by Indices in ODRA. Proceedings of the First International Conference on Object Databases, ICOODB 2008, Berlin, ISBN 078-7399-412-9, pp. 97-117
60. Kozankiewicz H., Leszczyłowski J., Subieta K.: Implementing Mediators through Virtual Updateable Views. Engineering Federated Information Systems, Proceedings of the 5th Workshop EFIS 2003, July 17-18 2003, UK, pp. 52-62
61. Kozankiewicz H., Stencel K., Subieta K.: Integration of Heterogeneous Resources through Updatable Views. Workshop on Emerging Technologies for Next Generation GRID (ETNGRID-2004), June 2004, Proc. published by IEEE
62. Kroll B., Widmayer P.: Distributing a search tree among a growing number of processors. In Proc. of ACM-SIGMOD, May 1994
63. Kuliberda K., Adamus R., Wiślicki J., Kaczmarski K., Kowalski T. M., Subieta K.: Autonomous Layer for Data Integration in a Virtual Repository. 3rd International Conference on Grid computing, high-performAnce and Distributed Applications (GADA'06), France, Springer 2006, LNCS 4276, pp. 1290-1304
64. Kuliberda K., Meina M., Wiślicki J., Kowalski T.M., Adamus R., Kaczmarski K., Subieta K.: On Distributed Data Processing in Data Grid Architecture for a Virtual Repository. SiS 2008 Proceedings, Łódź, Poland, 2008 (to appear)
65.
Kwan S. C., Strong H. R.: Index Path Length Evaluation for the Research Storage System of System R. IBM Research Report RJ2736, San Jose, CA, January 1980
66. Lane P. et al.: Oracle® Database Data Warehousing Guide. 11g Release 1 (11.1), Part Number B28313-02, September 2007: http://download.oracle.com/docs/cd/B28359_01/server.111/b28313/toc.htm
67. Lee W.-C., Lee D. L.: Path dictionary: a new approach to query processing in object-oriented databases. IEEE Transactions on Knowledge and Data Engineering, Volume 10, Issue 3 (May/June 1998), pp. 371-388
68. Lentner M.: Integration of data and applications using virtual repositories. PhD Thesis, PJIIT, Warszawa 2008
69. Li C., Chang K. C.-C., et al.: RankSQL: query algebra and optimization for relational top-k queries. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, June 14-16, 2005, Baltimore, Maryland
70. Liebeherr J., Omiecinski E., Akyildiz I. F.: The Effect of Index Partitioning Schemes on the Performance of Distributed Query Processing. IEEE Transactions on Knowledge and Data Engineering, Volume 5, Issue 3, 1993, pp. 510-522
71. Liskov B. et al.: Safe and Efficient Sharing of Persistent Objects in Thor. In Proc. of ACM SIGMOD International Conference on Management of Data, pp. 318-329, Montreal, Canada, June 1996
72. Litwin W.: Linear Hashing: a new tool for file and tables addressing. Reprinted from VLDB-80 in Readings in Databases, 2nd ed., Morgan Kaufmann Publishers, Inc., 1994, Stonebraker M. (Ed.)
73. Litwin W., Neimat M.-A., Schneider D. A.: LH*: linear hashing for distributed files. In Proc. of ACM-SIGMOD, May 1993
74. Litwin W., Neimat M.-A., Schneider D. A.: LH*: Scalable, Distributed Database System. ACM Trans. Database Syst., 21(4), pp. 480-525, 1996
75. Litwin W., Neimat M.-A., Schneider D. A.: RP*: A family of order-preserving scalable distributed data structures. In Proc. of VLDB, September 1994
76. Litwin W., Schwarz T. J.
E.: LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes. SIGMOD Conference 2000, pp. 237-248
77. Luk F. H.-W., Fu A. W.: Triple-node hierarchies for object-oriented database indexing. In Proceedings of the 7th International Conference on Information and Knowledge Management, ACM Press (1998), pp. 386-397
78. Łaski M.: Query optimisation in object-oriented databases on example of ODRA system implementation. MSc thesis, Computer Engineering Department, Technical University of Łódź, 2007 (in Polish)
79. Maier D., Stein J.: Indexing in an object-oriented DBMS. In Proceedings of the 1986 International Workshop on Object-Oriented Database Systems, IEEE Computer Society Press, pp. 171-182
80. Masewicz M., Wrembel R., Jezierski J.: Optimising Performance of Object-Oriented and Object-Relational Systems by Dynamic Method Materialisation. Proc. of ADBIS 2005, Tallinn, Estonia
81. Milo T., Suciu D.: Index structures for path expressions. In Proc. of the 7th Int. Conf. on Database Theory (ICDT'99), pp. 277-295, 1999
82. Morales T. et al.: Oracle® Database VLDB and Partitioning Guide. 11g Release 1 (11.1), Part Number B32024-01, July 2007: http://download.oracle.com/docs/cd/B28359_01/server.111/b32024/toc.htm
83. MySQL: http://www.mysql.com/
84. O'Neil P.E., Quass D.: Improved Query Performance with Variant Indexes. Proceedings of SIGMOD, pp. 38-49, 1997
85. Objectivity: http://www.objectivity.com/
86. Objectivity for Java Programmer's Guide. Part Number: 93-JAVAGD-0, Release 9.3, October 13, 2006
87. Objectivity/SQL++. Part Number: 93-SQLPP-0, Release 9.3, October 9, 2006
88. ObjectStore: http://www.progress.com/objectstore/
89. ObjectStore Java API User Guide, ObjectStore. Release 7.1 for all platforms, Progress Software Corporation, August 2008: http://www.psdn.com/library/servlet/KbServlet/download/5894-10229715/osjiug.pdf
90.
Olken F., Rotem D.: Simple Random Sampling for Relational Databases. Proceedings of VLDB, pp. 160-169, 1986
91. Oracle: http://www.oracle.com/
92. Özsu M. T., Valduriez P.: Distributed and Parallel Database Systems. ACM Computing Surveys, Volume 28(1), March 1996, pp. 125-128
93. Płodzień J.: Optimization Methods in Object Query Languages. PhD Thesis, IPI PAN, Warszawa 2000
94. Płodzień J., Kraken A.: Object Query Optimization in the Stack-Based Approach. Proc. ADBIS Conf., Springer LNCS 1691, 1999, pp. 303-316
95. Płodzień J., Kraken A.: Object Query Optimization through Detecting Independent Subqueries. Information Systems, Pergamon Press, 2000
96. Płodzień J., Subieta K.: Applying Low-Level Query Optimization Techniques by Rewriting. Proc. DEXA Conf., Springer LNCS 2113, 2001, pp. 867-876
97. Płodzień J., Subieta K.: Optimization of Object-Oriented Queries by Factoring Out Independent Subqueries. Institute of Computer Science, Polish Academy of Sciences, Report 889, 1999
98. Płodzień J., Subieta K.: Query Processing in an Object Data Model with Dynamic Roles. Proc. WSEAS Intl. Conf. on Automation and Information (ICAI), Puerto de la Cruz, Spain, CD-ROM, ISBN 960-8052-89-0, 2002
99. Płodzień J., Subieta K.: Query Optimization through Removing Dead Subqueries. Proc. ADBIS Conf., Springer LNCS 2151, 2001, pp. 27-40
100. Płodzień J., Subieta K.: Static Analysis of Queries as a Tool for Static Optimization. Proc. IDEAS Conf., IEEE Computer Society, 2001, pp. 117-122
101. Poosala V., Ioannidis Y.E.: Selectivity Estimation without the Attribute Value Independence Assumption. Proceedings of VLDB, pp. 486-495, 1997
102. Ramakrishnan R.: Database Management Systems. WCB/McGraw-Hill, 1998
103. PostgreSQL: http://www.postgresql.org/
104. Ranjan R., Harwood A., Buyya R.: Peer-to-Peer Based Resource Discovery in Global Grids: A Tutorial. IEEE Communications Surveys and Tutorials, Volume 10, Number 2, pp. 6-33, ISSN 1553-877X, USA, 2008
105.
Rao P., Moon B.: psiX: Hierarchical Distributed Index for Efficiently Locating XML Data in Peer-to-Peer Networks. Technical Report 05-10, University of Arizona, 2005
106. Sahri S., Litwin W., Schwartz T.: SD-SQL Server: a Scalable Distributed Database System. CERIA Research Report, December 2005
107. Schoder D., Fischbach K.: Core Concepts in Peer-to-Peer (P2P) Networking. In: Subramanian R., Goodman B. (eds.): P2P Computing: The Evolution of a Disruptive Technology, Idea Group Inc, Hershey, 2005
108. Shiela R. et al.: Advanced Application Developer's Guide. 11g Release 1 (11.1), Part Number B28424-03, August 2008: http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28424/toc.htm
109. SQL Server: http://www.microsoft.com/sqlserver/
110. SQL Server 2008 Books Online. 2008: http://msdn.microsoft.com/en-us/library/ms130214.aspx
111. Sreenath B., Seshadri S.: The hcC-tree: An Efficient Index Structure for Object Oriented Databases. Proc. 21st VLDB Conf., Santiago de Chile, pp. 203-213, 1994
112. Stencel K.: Semi-strong Type Checking in Database Programming Languages. PJIIT - Publishing House, Warszawa 2006, 207 pages (in Polish)
113. Stoica I. et al.: Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, Volume 11, Number 1, pp. 17-32, 2003
114. Stolze K., Steinbach T.: DB2 Index Extensions by example and in detail. 2003: http://www3.software.ibm.com/ibmdl/pub/software/dw/dm/db2/dm0312stolze/0312stolze.pdf
115. Strohm R. et al.: Oracle® Database Concepts. 11g Release 1 (11.1), Part Number B28318-05, October 2008: http://download.oracle.com/docs/cd/B28359_01/server.111/b28318/toc.htm
116. Subieta K.: LOQIS: The Object-Oriented Database Programming System. Proc. 1st Intl. East/West Database Workshop on Next Generation Information System Technology, Kiev, USSR 1990, Springer LNCS Vol. 504, pp. 403-421
117.
Subieta K.: Stack-Based Approach (SBA) and Stack-Based Query Language (SBQL). http://www.sbql.pl, 2008
118. Subieta K.: Theory and Construction of Object-Oriented Query Languages. PJIIT - Publishing House, ISBN 83-89244-28-4, 2004, 522 pages (in Polish)
119. Subieta K. et al.: ODRA Manual. August 2008: http://www.sbql.pl/various/ODRA/ODRA_manual.html
120. Subieta K., Kambayashi Y., Leszczyłowski J.: Procedures in Object-Oriented Query Languages. Proc. 21st VLDB Conf., Zurich, pp. 182-193, 1995
121. Subieta K., Leszczyłowski J., Ulidowski I.: Processing Semi-Structured Data in Object Bases. ICS PAS Report 852, February 1998
122. Subieta K., Płodzień J.: Object Views and Query Modification. In: Databases and Information Systems (eds. J. Barzdins, A. Caplinskas), Kluwer Academic Publishers, pp. 3-14, 2001
123. Subieta K., Rzeczkowski W.: Query Optimization by Stored Queries. Proceedings of VLDB, pp. 369-380, 1987
124. Taniar D., Rahayu J. W.: A Taxonomy of Indexing Schemes for Parallel Database Systems. Distributed and Parallel Databases, Volume 12, Number 1, Kluwer Academic Publishers, pp. 73-106, 2002
125. Tao Y., Papadias D., Sun J.: The TPR*-Tree: An Optimized Spatio-Temporal Access Method for Predictive Queries. Proceedings of VLDB, Berlin, Germany, pp. 790-801, 2003
126. VERSANT: http://www.versant.com/
127. VERSANT Database Fundamentals Manual. Release 7.0.1.0, July 2005: http://www.versant.com/developer/resources/objectdatabase/documentation/database_fund_man.pdf
128. Wiślicki J.: An object-oriented wrapper to relational databases with query optimisation. PhD Thesis, Technical University of Łódź, Łódź 2008
129. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R.: Integration of Relational Resources in an Object-Oriented Data Grid. SiS 2006 Proceedings, Łódź, Poland, 2006, pp. 277-280
130.
Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R.: Implementation of a Relational-to-Object Data Wrapper Back-end for a Data Grid. SiS 2006 Proceedings, Łódź, Poland, 2006, pp. 285-288
131. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R.: Integration of relational resources in an object-oriented data grid with an example. Journal of Applied Computer Science, Vol. 14, No. 2, Łódź, Poland, 2006, pp. 91-108
132. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R., Subieta K.: Implementation and Testing of SBQL Object-Relational Wrapper Supporting Query Optimisation. Proceedings of the First International Conference on Object Databases, ICOODB 2008, Berlin, ISBN 078-7399-412-9, pp. 39-56
133. Wrembel R., Bębel B.: Oracle: Designing of Distributed Databases. Wydawnictwo Helion, 2003 (in Polish)
134. Zobel J., Moffat A., Ramamohanarao K.: Inverted Files versus Signature Files for Text Indexing. ACM Transactions on Database Systems, 23(4), pp. 453-490, 1998