TECHNICAL UNIVERSITY OF LODZ
Faculty of Electrical, Electronic,
Computer and Control Engineering
Computer Engineering Department
mgr inż. Tomasz Marek Kowalski
Ph.D. Thesis
Transparent Indexing in
Distributed Object-Oriented Databases
Supervisor:
prof. dr hab. inż. Kazimierz Subieta
Łódź 2009
To my wife Kasia…
INDEX OF CONTENTS
SUMMARY
ROZSZERZONE STRESZCZENIE (EXTENDED ABSTRACT)
CHAPTER 1 INTRODUCTION
1.1 Context
1.2 Short State of the Art of Indexing in Databases
1.3 Research Problem Formulation
1.4 Proposed Solution
1.5 Main Theses of the PhD Dissertation
1.6 Thesis Outline
CHAPTER 2 INDEXING IN DATABASES - STATE OF THE ART
2.1 Database Index Properties
2.1.1 Transparency
2.1.2 Indices Classification
2.2 Index Data-Structures
2.2.1 Linear-Hashing
2.2.2 Scalable Distributed Data Structure (SDDS)
2.3 Relational Systems
2.4 OODBMSs
2.4.1 db4o Database
2.4.2 Objectivity/DB
2.4.3 ObjectStore
2.4.4 Versant
2.4.5 GemStone’s Products
2.5 Advanced Solutions in Object-Relational Databases
2.5.1 Oracle’s Function-based Index Maintenance
2.6 Global Indexing Strategies in Parallel Systems
2.6.1 Central Indexing
2.6.2 Strategies Involving Decentralised Indexing
2.7 Distributed DBMSs
CHAPTER 3 THE STACK-BASED APPROACH
3.1 Abstract Data Store Models
3.1.1 AS0 Model
3.1.2 Abstract Store Models Supporting Inheritance
3.1.3 Example Database Schema
3.1.4 Example Store with Static Inheritance of Objects
3.2 Environment and Result Stacks
3.2.1 Bind Operation
3.2.2 Nested Function
3.3 SBQL Query Language
3.3.1 Expressions Evaluation
3.3.2 Imperative Statements Evaluation
3.4 Static Query Evaluation and Metabase
3.4.1 Type Checking
3.5 Updateable Object-Oriented Views
CHAPTER 4 ORGANISATION OF INDEXING IN OODBMS
4.1 Implementation of a Linear Hashing Based Index
4.1.1 Index Key Types
4.1.2 Example Indices
4.2 Index Management
4.2.1 Index Creating Rules and Assumed Limitations
4.3 Automatic Index Updating
4.3.1 Index Update Triggers
4.3.2 The Architectural View of the Index Update Process
4.3.3 SBQL Interpreter and Binding Extension
4.3.4 Example of Update Scenarios
4.3.4.1 Conceptual Example
4.3.4.2 Path Modification
4.3.4.3 Keys with Optional Attributes
4.3.4.4 Polymorphic Keys
4.3.5 Optimising Index Updating
4.3.6 Properties of the Solution
4.3.7 Comparison of Index Maintenance Approaches
4.4 Indexing Architecture for Distributed Environment
4.4.1 Global Indexing Management and Maintenance
4.4.2 Example on Distributed Homogeneous Data Schema
CHAPTER 5 QUERY OPTIMISATION AND INDEX OPTIMISER
5.1 Query Optimisation in the ODRA Prototype
5.2 Index Optimiser Overview
5.2.1 General Algorithm
5.3 Selection Predicates Analysis
5.3.1 Incommutable Predicates
5.3.2 Matching Index Key Values Criteria
5.3.3 Processing Inclusion Operator
5.4 Role of a Cost Model
5.4.1 Estimation of Selectivity
5.5 Query Transformation – Applying Indices
5.5.1 Index Invocation Syntax
5.5.2 Rewriting Routines
5.5.3 Processing Disjunction of Predicates
5.5.4 Optimising Existential Quantifier
5.5.5 Reuse of Indices through Inheritance
5.6 Secondary Methods
5.6.1 Factoring Out Independent Subqueries
5.6.2 Pushing Selection
5.6.3 Methods Assisting Invoking Views
5.6.4 Syntax Tree Normalisation
5.6.5 Harmful Methods
5.7 Optimisations involving Distributed Index
5.7.1 Rank Queries Optimisation
5.7.1.1 Hoare’s Algorithm in Distributed Environment
5.7.1.2 Modification of Hoare’s Algorithm
5.8 Increasing Query Flexibility with Respect to Indices Management
CHAPTER 6 INDEXING OPTIMISATION RESULTS
6.1 Test Data Distribution
6.2 Sample Index Optimisation Test
6.3 Omitting Key in an Index Call Test – enum Key Types
6.4 Multiple Index Invocation Test
6.5 Complex Expression Based Index Test
6.6 Disjunction of Predicates Test
CHAPTER 7 INDEXING FOR OPTIMISING PROCESSING OF HETEROGENEOUS RESOURCES
7.1 Volatile Indexing
7.1.1 Conditions for Volatile Indexing Optimisation
7.1.2 Index Materialisation
7.1.3 Solution Properties
7.1.4 Proof of Concept Test
7.2 Optimising Queries Addressing Heterogeneous Resources
7.2.1 Overview of a Wrapper to RDBMS
7.2.2 Volatile Indexing Technique Test
CHAPTER 8 CONCLUSIONS
8.1 Future Work
INDEX OF FIGURES
INDEX OF TABLES
BIBLIOGRAPHY
SUMMARY
The Ph.D. thesis focuses on the development of a robust, transparent indexing
architecture for distributed object-oriented databases. The solution comprises
management facilities, an automatic index updating mechanism and an index optimiser.
From the conceptual point of view, transparency is the most essential property of a
database index. It implies that programmers need not include explicit operations on
indices in an application program. Usually a query optimiser automatically inserts
references to indices into a query execution plan when necessary. The second aspect of
transparency concerns a mechanism maintaining the cohesion between existing indices
and indexed data. So-called automatic index updating detects data modifications and
reflects them in indices accordingly.
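To picture this cohesion requirement, the following sketch keeps a toy key-value index consistent with every data modification through trigger-like hooks. It is purely illustrative: all names are hypothetical, and ODRA's trigger-based mechanism described in Chapter 4 is far more general.

```python
class IndexedStore:
    """Toy store whose index is maintained by update triggers."""

    def __init__(self, key_expr):
        self.key_expr = key_expr   # deterministic, side-effect-free key function
        self.objects = {}          # oid -> object
        self.index = {}            # key value -> set of oids

    def insert(self, oid, obj):
        self.objects[oid] = obj
        self._trigger_add(oid)     # trigger keeps the index cohesive

    def update(self, oid, obj):
        self._trigger_remove(oid)  # remove the stale entry before overwriting
        self.objects[oid] = obj
        self._trigger_add(oid)

    def delete(self, oid):
        self._trigger_remove(oid)
        del self.objects[oid]

    def _trigger_add(self, oid):
        key = self.key_expr(self.objects[oid])
        self.index.setdefault(key, set()).add(oid)

    def _trigger_remove(self, oid):
        key = self.key_expr(self.objects[oid])
        self.index[key].discard(oid)
        if not self.index[key]:
            del self.index[key]

    def lookup(self, key):
        # what a transparent optimiser would call instead of a full scan
        return [self.objects[oid] for oid in sorted(self.index.get(key, set()))]
```

Because every modification path runs through the triggers, a `lookup` always reflects the current data, which is exactly the cohesion property the text describes.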
The thesis has been developed in the context of the Stack-Based Architecture
(SBA) [1, 117], a theoretical and methodological framework for developing
object-oriented query and programming languages. The developed query optimisation
methods are based on the corresponding Stack-Based Query Language (SBQL).
The orthogonality of SBQL constructs makes it simple to define complex
selection predicates accessing arbitrary data. The main goal of the work is to design an
indexing architecture facilitating the processing of a possibly wide family of predicates.
This requires a generic and complete approach to the problem of index transparency.
The solution presented in the thesis provides transparent indexing employing
single- or multiple-key indices in a distributed homogeneous object-oriented
environment. The selection of an index structure, either centralised or distributed, is not
restricted. The work extensively describes optimisation methods facilitating processing
in the context of the where operator, i.e. selection, considering the role of a cost model,
conjunction and disjunction of predicates, and class inheritance.
The author proposes a robust approach to automatic index updating capable of
dealing with index keys based on arbitrary, deterministic and side-effect-free
expressions. Consequently, optimised selection predicates can be freely composed of
various SBQL constructs, in particular algebraic and non-algebraic operators, path
expressions, aggregate functions and class method invocations. The solution also takes
inheritance and polymorphism into consideration.
A part of the thesis concerns optimisation methods devoted to distributed
object-oriented databases, enabling efficient parallel processing of queries. In particular,
one of the designed methods concerns the optimisation of rank queries. It enables taking
advantage of distributed and scalable index structures.
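The rank query optimisation mentioned here builds on Hoare's selection algorithm (better known as quickselect), which finds the element of a given rank without fully sorting the data. A minimal centralised sketch is shown below; it is only an illustration of the base algorithm, not the distributed variant the thesis develops in Chapter 5.

```python
import random

def quickselect(items, k):
    """Return the element of rank k (0-based, ascending) from items."""
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    if k < len(smaller):                      # answer lies among the smaller elements
        return quickselect(smaller, k)
    if k < len(smaller) + len(equal):         # the pivot itself has the requested rank
        return pivot
    larger = [x for x in items if x > pivot]  # otherwise recurse on the larger part
    return quickselect(larger, k - len(smaller) - len(equal))
```

The appeal in a distributed setting is that each partitioning step needs only per-site counts of elements below the pivot, which is why the algorithm lends itself to scalable, distributed index structures.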
A particularly difficult query optimisation domain concerns processing queries
addressing heterogeneous resources. The volatile indexing technique proposed by the
author is a significant step in this matter. This solution relies on the developed indexing
architecture. Additionally, it can be applied to data virtually accessible through SBQL
views. In contrast to regular indices, a volatile index is materialised only during query
evaluation. Therefore, the efficacy of this technique becomes apparent when the index is
invoked multiple times, which mainly concerns the processing of complex and
laborious queries.
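The lifetime of a volatile index can be sketched as follows. This is a toy simplification invented for this summary (the function name and structure are assumptions, not the thesis's implementation; the actual materialisation conditions are worked out in Chapter 7): the index is built lazily on first use within one query evaluation and discarded afterwards, so it pays off only when probed more than once.

```python
def evaluate_query(resource, key_expr, probes):
    """Probe `resource` (e.g. data pulled through a view over a wrapped
    RDBMS) once per key in `probes`, via a query-scoped volatile index."""
    volatile_index = None            # materialised only during this evaluation
    results = []
    for key in probes:
        if volatile_index is None:   # first invocation: materialise the index
            volatile_index = {}
            for obj in resource:
                volatile_index.setdefault(key_expr(obj), []).append(obj)
        results.append(volatile_index.get(key, []))
    return results                   # the index goes out of scope here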
A key aspect of the development of database query optimisation methods is the
preservation of the original query semantics. Consequently, for the designed
optimisation methods the author has determined rules in the context of the assumed
object data model and the SBQL query language. With this knowledge a database
programmer can be assisted and advised, e.g. by the compiler, on how to design safe
and optimisable queries. Moreover, the conducted research can also aid database
designers. Among other things, the potential influence of other optimisation methods on
indexing has been verified.
A significant part of the algorithms and solutions developed in the thesis has been
verified and confirmed in the prototype ODRA OODBMS implementation [58, 59].
Keywords: indexing, database, distributed database, object-oriented, query
optimisation, SBA, SBQL, ODRA
TECHNICAL UNIVERSITY OF LODZ
Faculty of Electrical, Electronic, Computer and Control Engineering
Department of Applied Computer Science
Ph.D. thesis:
Transparent Indexing in Distributed Object-Oriented Databases
ROZSZERZONE STRESZCZENIE (EXTENDED ABSTRACT)
Databases form the foundation of many large and, nowadays, often distributed
computer systems. Object-oriented technologies facilitate the management of systems
of such size and complexity. The industry, however, leans towards relational solutions
whenever performance is the key issue. This aspect is still neglected in databases based
on object-oriented paradigms, owing to the scarcity of advanced optimisation
procedures.
Indexing is the most important optimisation method in databases. The underlying
concept of indexing in object-oriented databases does not differ from indexing in
relational systems [15, 20, 29, 54, 55, 65]. From the conceptual point of view, the most
essential property of a database index is transparency. It means that a database
application programmer need not be aware of the existence of indices. Most often a
query optimiser is responsible for the automatic employment of indices. The second
important aspect of transparency is related to maintaining the cohesion between indices
and the indexed data. This is the problem of so-called automatic index updating:
modifications in the database should be automatically detected and reflected in the
appropriate indices.
In distributed databases the most advanced solutions rely on static index
partitioning. They are implemented in the leading object-relational products. However,
they only allow index keys to be defined with simple expressions referring to data
located in a single table. Concerning the global optimisation of queries addressing
heterogeneous resources, the author has found no formalised indexing-based methods
in the scientific literature. The analysis of the state of the art clearly indicates the need
to develop indexing methods and an indexing architecture for distributed
object-oriented databases.
The orthogonality of the SBQL language makes it exceptionally easy to define
complex selection predicates referring to arbitrary data. The main goal of the work is to
develop an indexing architecture that facilitates the processing of a possibly wide
family of predicates. This requires a generic and complete approach to the problem of
transparency. Since the work concerns distributed object-oriented databases, another
important goal is to develop optimisation methods that enable parallel processing, in
particular through the use of distributed scalable index structures.
A particularly difficult domain in the context of optimisation is the processing of
queries addressing distributed heterogeneous resources. For this reason, as a further
goal of the work, the author set out to identify a transparent and efficient indexing
strategy that could be applied at the level of the global database schema.
A key aspect of the work on all the optimisation methods is the preservation of
the original query semantics. To this end, the author has determined rules that apply to
the developed methods in the context of the assumed object data model and the SBQL
query language. Knowledge of these rules can help programmers build queries in a
form that enables automatic optimisation. Additionally, the conducted research can also
be helpful to database designers. Among other things, the potential influence of other
optimisation methods on the work of the index-based optimiser has been determined.
The solutions proposed by the author in the dissertation are presented in the
context of the Stack-Based Architecture (SBA) [1, 117] and the query language
resulting from it (SBQL, Stack-Based Query Language). The Stack-Based Architecture
is a formal methodology concerning object-oriented database query and programming
languages.
The theses of the dissertation are as follows:
1. Processing of selection predicates based on arbitrary key expressions referring
to data in a distributed object-oriented database can be optimised by
centralised or distributed transparent indexing.
2. Execution of complex queries addressing distributed heterogeneous resources
can be facilitated by techniques employing transparent indexing-based
optimisation.
To substantiate the above theses, the index management system and the
index-applying query optimiser designed by the author have been used.
An additional element, closely related to the first thesis, is the author's approach
to the problem of automatic index updating. The presented solution provides
transparent indexing employing indices based on one or many keys. The optimisation
concerns the processing of selection predicates based on arbitrary, deterministic and
side-effect-free expressions, which may comprise, e.g., path expressions, aggregate
functions and class method invocations (taking inheritance and polymorphism into
account). The proposed indexing architecture can be applied to distributed
homogeneous data sources. The choice of the index structure, centralised or distributed,
is not restricted in any way. The author has also proposed a rank query optimisation
method that enables taking advantage of both existing local indices and a distributed,
scalable global index.
The solution proposed by the author to prove the second thesis of the work is the
volatile indexing technique. It relies on the same indexing architecture, but additionally
it can be applied to the processing of heterogeneous data virtually accessible through
SBQL views. In contrast to regular indices, a volatile index is materialised only during
query evaluation. The presented technique is effective in processing complex queries in
which the index is invoked more than once.
The algorithms and solutions related to the theses of the work have been verified
and confirmed to a significant extent in the prototype implementation in the ODRA
object-oriented database [58, 59].
The dissertation is divided into eight chapters, briefly described below:
Chapter 1 Introduction
The first chapter introduces the subject of the work and presents its context, a
brief description of the state of the art in the field, and the author's motivation. The
goals of the work are formulated and the problems related to them are identified. In this
context the dissertation theses are discussed in detail and the solutions developed by the
author are outlined.
Chapter 2 Indexing in Databases - State of the Art
The state-of-the-art description presents the basic notions related to indexing in
databases. Representative examples of existing solutions in industry and in the
scientific literature are cited. The chapter contains an overview of various indexing
structures, with particular attention to linear hashing, which has been employed in the
author's solution. Additionally, centralised and distributed indexing strategies in various
distributed systems are examined.
Chapter 3 The Stack-Based Approach
The chapter concerns the theoretical foundations of the dissertation theses, i.e.
the Stack-Based Architecture (SBA) and the SBQL query language resulting from it.
Descriptions of the basic notions are given: the environment stack, the result stack,
name binding, static query evaluation, and updateable object-oriented views.
Chapter 4 Organisation of Indexing in OODBMS
This part of the work presents the indexing architecture designed and, to a
significant extent, implemented in the ODRA object-oriented database. The basic
properties of the employed index structure and of the index management module are
described. The author's mechanism providing transparent, automatic index updating,
based on the idea of index update triggers, is also presented. The presented concept is
extended for the purposes of global indexing in the context of the distributed
architecture developed in the ODRA project.
Chapter 5 Query Optimisation and Index Optimiser
The chapter focuses on the methods developed by the author for the transparent
employment of indices in query optimisation. Algorithms concerning transformations
of the intermediate query tree, together with the rules associated with them, are
presented. Particular emphasis is put on preserving the original query semantics in the
optimisation process. The developed methods are supported by suitable real examples
of transformations in SBQL. The author also discusses the influence of other query
optimisation methods on indexing. The chapter additionally covers optimisation
methods dedicated to the processing of global queries in a distributed environment. In
this regard, the author's approach to the optimisation of rank queries in a distributed
database architecture, based on a modified Hoare's algorithm, is presented.
Chapter 6 Indexing Optimisation Results
The chapter presents the results of tests of the implemented indexing system.
The results confirm the effectiveness and efficiency of the developed methodology. The
tests empirically confirm the correctness of the applied solutions described in Chapters
4 and 5. Altogether this constitutes the proof of the first dissertation thesis.
Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
This part of the work proves the second dissertation thesis. The chapter presents
the so-called volatile indexing technique and its application in the optimisation of
queries addressing distributed heterogeneous data. The effectiveness of the proposed
technique is confirmed by a test in which an SBQL query addressing both resources of
an object-oriented database and resources located in a relational database is optimised.
Chapter 8 Conclusions
The final chapter summarises the work on the indexing system architecture for a
distributed object-oriented database. The developed solutions and the research results
that unambiguously confirm the dissertation theses are listed. Finally, directions for
further research in this field are indicated.
Chapter 1
Introduction
Databases are a fundamental feature of many large computer applications. In
many cases databases need to be geographically distributed. The size and complexity of
such systems require developers to take advantage of modern software engineering
methods, which as a rule are based on the object-oriented approach (cf. the UML
notation). In contrast, the industry still widely uses relational databases. While their
efficiency in the majority of applications cannot be questioned, many professionals
point out their drawbacks. One of the major drawbacks is the so-called impedance
mismatch. The mismatch concerns many incompatibilities between object-oriented
design and relational implementation. It also concerns incompatibilities between
object-oriented programming (in languages such as C++, Java and C#) and SQL, the
primary programming interface to relational databases. For this reason, over the last
two decades ever new object-oriented database management systems have been
proposed. Some of them are well recognized on the market (e.g. ObjectStore,
Objectivity/DB, Versant, db4o, and others); however, the scale of their applications is
at least an order of magnitude smaller than that of relational systems (some of which
are extended with object-oriented features).
One of the reasons for the relatively low acceptance of commercial object-oriented
databases concerns their query languages, which are considered very limited and
treated as secondary in the development of applications. This is in sharp contrast to
relational systems, where SQL is considered the primary factor stimulating their
success.
In this research we focus on equipping object-oriented database systems with a
powerful and efficient query language. The power of such a language should not be
lower than the power of SQL. The performance of such a language requires powerful
query optimisation methods. Query optimisation in object-oriented database
management systems has been deeply investigated over the last two decades.
Unfortunately, this research remains largely unimplemented in today's OODBMSs for
many reasons: limited query languages, proposed methods that turned out to be
non-implementable, lack of interest from commercial companies, etc.
In this thesis we investigate a well-known and the most important method of
performance improvement known as indexing. The research addresses this subject in the
context of the Stack-Based Architecture (SBA), which is a theoretical and
methodological framework for developing object-oriented query and programming
languages. The solutions that we have developed are implemented and tested in the
ODRA OODBMS prototype [58, 59], which is based on SBA and its own query language
SBQL (Stack-Based Query Language).
1.1 Context
The Stack-Based Architecture (SBA) is a formal methodology addressing
object-oriented database query and programming languages [1, 117]. It assumes the
object relativism principle, which claims that there is no conceptual difference between
objects of different kinds or objects stored at different levels of the object hierarchy.
Everything (e.g. a Person object, a salary attribute, a procedure returning the age of
a person and a view returning well-paid employees) is considered an object with its own
unique identifier. SBA reconstructs query language concepts from the point of view of
programming languages (PLs), introducing notions and methods developed in the domain
of programming languages (e.g. environment stack, result stack, nesting and binding of
names).
ODRA (Object Database for Rapid Application development) is a prototype
object-oriented database management system based on the Stack-Based Architecture
(SBA) [2, 119]. ODRA introduces its own query language SBQL that is integrated with
programming capabilities and abstractions, including database abstractions: updatable
views, stored procedures and transactions. The main goal of the ODRA project is to
develop new paradigms of database application development together with a distributed
database-oriented and object-oriented execution environment.
1.2 Short State of the Art of Indexing in Databases
The general idea of indices in object-oriented databases does not differ from
indexing in relational databases [15, 20, 29, 54, 55, 65]. The most characteristic
property of database indexing is transparency. A programmer of database
applications does not need to be aware of the indices' existence, as they are utilised by
the database engine automatically. This is usually accomplished by a query optimiser
that automatically inserts references to indices into a query execution plan when
necessary. The second important aspect of transparency concerns maintaining cohesion
between existing indices and the data that is indexed. Data modifications are
automatically detected and corresponding changes are reflected in indices. This process
is called automatic index updating.
Many indexing methods can be adopted from relational database systems, and
their applicability can even be significantly extended. There are also situations where
indexing methods from RDBMSs become outdated in object-oriented databases. In
particular, join operations do not need to be supported, because in object databases the
necessity for joins is much lower due to object identifiers and explicit pointer links in
the database.
In the object-oriented database domain, research into indexing has mainly
focused on path expression processing and inheritance hierarchies inside indexed
collections [10, 11, 12, 21, 67, 77, 81, 111]. Some papers propose generic approaches to
providing automatic index maintenance transparency [43, 46]. However, there is no
information that these proposals have actually been incorporated into commercial or open
source database products.
Indexing is also an important subject in a distributed environment. Most of the
research concerns the development of various distributed index structures and global
indexing strategies. Many works are conducted in the context of data exchange in P2P
networks. In databases, the most advanced solutions are based on static index
partitioning. They are implemented in leading object-relational products. Nevertheless,
an index key definition is limited to expressions accessing data from only one table. The
author has not found in the research literature any formalised global optimisation
methods based on indexing for processing queries involving heterogeneous resources.
The analysis of the state of the art unambiguously indicates that the development of
indexing methods and architectures dedicated to distributed object-oriented
databases is still a valid and challenging subject.
1.3 Research Problem Formulation
The orthogonality of SBQL language constructs allows selection
predicates to be defined using complex and robust expressions accessing arbitrary data. The
transparent indexing of objects to facilitate the processing of queries involving such predicates
requires the development of a generic and complete solution. In particular, achieving
automatic index updating transparency is simple only in the case of indices defined on
simple keys, i.e. direct attributes and table columns. Inheritance, method
polymorphism, data distribution, etc. make it difficult to identify the objects influencing
the value of an index key.
Data processing in a distributed environment enables parallel processing of
queries and may take advantage of distributed and scalable index structures. This
creates a demand for introducing an appropriate indexing architecture and specific
optimisation methods.
An even more complex task concerns the evaluation of queries addressing a
heterogeneous distributed environment. From the point of view of performance, it is
vital to exploit local resource optimisation methods and to develop robust techniques
improving query processing at the global schema level. Identifying effective transparent
global indexing strategies is, in this context, a significant but particularly challenging
subject.
Finally, each optimisation method improving query performance must ensure
preservation of query semantics. Therefore, in the context of a query language and an
object model, the appropriate rules for exploiting such methods must be determined.
With this knowledge, a database programmer can be assisted and advised, e.g. by a
compiler, on how to design properly optimisable queries.
1.4 Proposed Solution
In order to provide transparent indexing in distributed object-oriented databases,
the author of this thesis proposes the following tenets:
• precisely defined index management facilities and a convenient syntax for an index call, to be used in query optimisation,
• a set of algorithms, optimisation methods and rules composing the index optimiser, i.e. the module responsible for detecting the parts of a query that can be substituted with an index call and for performing the appropriate query transformations,
• a generic automatic index maintenance solution based on index update definitions assigned to indices and the associated index update triggers assigned to the objects participating in indexing,
• a volatile indexing technique enabling taking advantage of the developed indexing architecture while omitting the troublesome issue of automatic index maintenance in the processing of a specific family of queries addressing heterogeneous resources.
The most important properties necessary to provide the desired index behaviour
have been implemented in the ODRA OODBMS prototype and are operational [59].
1.5 Main Theses of the PhD Dissertation
The summarised theses are:
1. Processing of selection predicates based on arbitrary key expressions
accessing data in a distributed object-oriented database can be optimised by
centralised or distributed transparent indexing.
2. Evaluation of complex queries involving distributed heterogeneous
resources can be facilitated by techniques taking advantage of transparent
index optimisation.
The common basis for accomplishing the theses is formed by the developed index
management facilities and the index optimiser.
The first thesis is additionally supported by the author’s generic approach to
automatic index maintenance. The proposed approach provides transparent indexing
using single- or multiple-key indices. It applies to selection predicates based on arbitrary,
deterministic and side-effect-free expressions consisting of, e.g., path expressions,
aggregate functions and class method invocations (addressing inheritance and
polymorphism).
An extensive part of the work comprises optimisation methods
facilitating processing in the context of a where operator (i.e. selection), considering the
role of a cost model, conjunction and disjunction of predicates, and class inheritance.
The proposed architecture can handle homogeneous data distribution and distributed
index structures. The selection of an index structure, either centralised or distributed, is
not restricted. The author also introduces an efficient method for optimisation of rank
queries taking advantage of indexing in a distributed environment.
The solution proposed by the author addressing the second thesis is the volatile
indexing technique. It relies on the same indexing architecture, but also addresses
data virtually accessible through SBQL views. A volatile index differs from a regular
index in that it is materialised only during query evaluation. Therefore, the efficacy of this
technique can be seen in the processing of laborious queries where the index is invoked
more than once.
A significant part of the theses has been verified and confirmed by a prototype
implementation in the ODRA OODBMS. The only important aspect to be implemented
and validated in the future concerns data and index distribution in the context of the first
thesis. This element is planned to be finished together with the development of a
distributed infrastructure in the ODRA prototype.
1.6 Thesis Outline
The thesis is organised as follows:
Chapter 1 Introduction
The chapter presents a general overview of the thesis subject, its context, the
author's motivation, the formulation of the problem and the objectives of the research, the
theses, and a description of the developed solutions.
Chapter 2 Indexing In Databases - State of the Art
The state of the art chapter introduces basic concepts concerning indexing in
databases, together with an overview of solutions existing in commercial products and in
the research literature. Additionally, an inspection of the varieties of index structures and
indexing strategies applying to centralised and distributed environments is provided.
Chapter 3 The Stack-based Approach
The theoretical foundation for the thesis is the Stack-Based Architecture (SBA)
and the corresponding query language SBQL. The chapter introduces the basic notions
relevant to the work, including environment and result stacks, static query evaluation
and updateable object-oriented views.
Chapter 4 Organisation of Indexing in OODBMS
The chapter presents the indexing architecture designed and implemented in the
ODRA OODBMS. It focuses particularly on the basic properties of the employed index
structure, the designed index management facilities and the module providing automatic
index updating transparency (based on the author's index update triggers concept).
Finally, extending the architecture to distributed databases is discussed.
Chapter 5 Query Optimisation and Index Optimiser
The algorithms and rules responsible for taking advantage of indices in the
transparent optimisation of queries, with respect to query semantics, are presented and
explained with examples. The chapter includes a description of indexing methods designed
for a distributed environment and a discussion of the influence of secondary methods on
indexing.
Chapter 6 Indexing Optimisation Results
The chapter presents the results of tests confirming the efficiency of the methods
presented in the thesis.
Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
The chapter focuses on the volatile indexing technique and presents its
application in the optimisation of queries addressing heterogeneous resources. The
description is supported by an appropriate test proving the efficacy of the technique.
Chapter 8 Conclusions
The chapter gives conclusions concerning the achieved objectives and outlines
areas of future work.
Chapter 2
Indexing In Databases - State of the Art
Indices are auxiliary (redundant) data structures stored at a server. A database
administrator manages a pool of indices, generating new ones or removing existing ones
depending on current needs. Just as the index at the end of a book is used to find pages
quickly, a database index makes it possible to quickly retrieve the objects (or records)
matching given criteria. Because indices are relatively small (compared to the whole
database), the gain in performance fully justifies the extra storage space. Due to
single-aspect search, which allows a very efficient physical organisation, the gain
in performance can reach several orders of magnitude.
In general, an index can be considered a two-column table where the first
column consists of unique key values and the other one holds non-key values, usually
references to objects or database table rows. Key values are used as input for an
index search procedure. As a result, the procedure returns the corresponding non-key
values from the same table row. In query optimisation, indices are usually used in the
context of a where operator whose left operand refers to a collection indexed by the key
values composing the right operand [29, 33, 118].
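This two-column view of an index can be illustrated with a minimal sketch (the data, names and dict-based structure below are purely illustrative assumptions; the thesis prototype uses ODRA/SBQL with a linear hashing index, cf. section 2.2.1):

```python
# A minimal sketch of an index as a key -> references mapping.
# All names and data are hypothetical illustrations.

# "Database": objects identified by references (here, plain integers).
objects = {
    1: {"name": "Smith", "city": "Lodz"},
    2: {"name": "Jones", "city": "Lodz"},
    3: {"name": "Brown", "city": "Warsaw"},
}

# Secondary index on the "city" attribute: key value -> list of references.
city_index = {}
for ref, obj in objects.items():
    city_index.setdefault(obj["city"], []).append(ref)

def select_by_city(city):
    """Index-assisted evaluation of: Person where city = <city>."""
    return [objects[ref] for ref in city_index.get(city, [])]

print([p["name"] for p in select_by_city("Lodz")])   # ['Smith', 'Jones']
```

Instead of scanning the whole collection for each where predicate, the search procedure jumps directly from the key value to the matching references.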
High-level language query → syntactic analysis and validation → intermediate query representation → query optimiser → query evaluation plan → query code generator → query code → runtime database processor → query result
Fig. 2.1 Typical stages of high-level language query optimisation [29]
A database query is expressed in a high-level query language (e.g. SQL, OQL).
Fig. 2.1 presents the general steps of processing a query. First, the query is subject to
syntactic analysis (parsing). Next, it is validated for semantic correctness and
accordance with the current database schema. The database uses an internal query
representation, usually organised into a tree or graph structure. There might be many
execution strategies that a DBMS can follow to obtain an answer to a query. In terms of
query results, all execution plans are equivalent, but the cost difference between
alternative plans can be enormous. The cost is usually measured as the time needed to
complete query execution. A database query optimiser should efficiently estimate the
cost of a plan. The final steps of query processing consist of code generation according to
the designed execution strategy and, eventually, its execution [29, 54, 118].
An important part of designing an execution plan is the analysis of database
indices. The query optimiser should be capable of identifying the parts of the query whose
evaluation can be assisted by indexing. Next, with the help of a database cost model, it
has to decide which combination of indices would minimise the cost of query execution.
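The cost-based choice among candidate plans can be sketched as a toy comparison (all costs, selectivities and formulas below are hypothetical illustrations, not any real cost model):

```python
# Toy cost-based choice between a full scan and available indices.
# All numbers are hypothetical illustrations, not a real cost model.

TABLE_ROWS = 1_000_000
PAGE_ROWS = 100

def scan_cost():
    return TABLE_ROWS / PAGE_ROWS          # read every page once

def index_cost(selectivity, lookup_cost=3):
    # index traversal plus one page access per matching row (worst case)
    return lookup_cost + selectivity * TABLE_ROWS

plans = {
    "full scan": scan_cost(),
    "index on city (sel. 1%)": index_cost(0.01),
    "index on name (sel. 0.001%)": index_cost(0.00001),
}
best = min(plans, key=plans.get)
print(best)   # index on name (sel. 0.001%)
```

Even this toy model shows why selectivity estimates matter: a poorly selective index can cost as much as a full scan, while a highly selective one wins by orders of magnitude.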
An important task of database administrators is to manage the pool of indices,
which is part of the physical design and tuning of a database. The obvious advantages
of a database index follow from its physical and conceptual properties. However,
when the design is improper, processing queries through indices may degrade the
global processing time. Such degradation is usually caused by frequent database updates,
which may totally undermine the gain in query processing due to an index, because the
cost of updating the index exceeds the gain from faster query processing.
2.1 Database Index Properties
Indices are an essential constituent of a database's architecture. Obviously, their
central feature is a data structure that can be efficiently organised, searched and
maintained. Nonetheless, their actual strength lies in the unique properties and versatile
utilisation of database indices. A significant advantage, and a partial cause of the
success of large database systems, is indexing transparency.
2.1.1 Transparency
In the common approach, the programmer does not include explicit operations
on indices in an application program. To make indexing transparent from the point of
view of a database application programmer, the database ought to ensure two important
functionalities: index optimisation and automatic index updating.
The first functionality means that indices are used automatically during query
evaluation. Therefore, the administrator of a database can freely establish new indices
and remove them without changing application code. The responsibility for
ensuring such transparency lies with query optimisation and, particularly, with the index
optimiser.
The second functionality, i.e. automatic index updating, is also referred to in the
research literature as index maintenance or dynamic index adaptation. It is a response to
changes in the database. Indices, like all redundant structures, can lose cohesion with the data
if the database is updated. An automatic mechanism should alter, remove or rebuild an
index in the case of database updates that affect the validity of its contents. Consequently, the
gain in query performance coming from indexing compromises the speed of insertions,
deletions and data modifications, since such operations require suitable updates to indices.
Thus, it is the administrator's responsibility to manage indices judiciously so as not to
cause overall database performance deterioration, particularly in update-intensive
systems.
In general, databases provide full transparency to the user. Nevertheless, some
approaches let administrators and application designers decide on the degree of
index transparency and explicitly control the state of indices depending on need.
Occasionally, transparency is supported only to a limited extent, burdening the
database user.
2.1.2 Indices Classification
According to [29] there are three essential kinds of database index:
• primary index – physically orders data on disk or in memory according to some unique field (each record must contain a unique value for such a field – the so-called primary key),
• clustering index – introduces a physical data order according to a non-unique property (i.e. when several data records can have an equal value of the ordering fields),
• secondary index – provides alternative access to data according to designated criteria without affecting their actual location (also called secondary access paths or methods).
Since only one physical ordering is possible, a data table or collection can have at most one
primary index or one clustering index, but not both. The limit concerning the number of
secondary indices depends on the database. In reality, departures from the above
definitions often occur. For example, in some databases data and indices are stored separately,
and even primary or clustering indices contain only references to the actual data, which are
stored physically, e.g. in a linked list.
Indices can also be classified according to the relation between keys and indexed
data. Usually the division is the following:
• dense index – contains an entry for each key value occurring in the database,
• sparse index – associates a block of ordered indexed data with only a single key value (e.g. the lowest one).
Primary indices are usually sparse, since data are often physically divided into blocks.
In addition to dense and sparse, a range index can be considered, since an index can be
split into slots representing specified ranges of key values.
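The dense/sparse distinction can be sketched as follows (the block size, data and helper names are illustrative assumptions):

```python
# Dense vs. sparse index over ordered data blocks (hypothetical layout).
import bisect

# Ordered records split into fixed-size blocks, as on disk pages.
records = [3, 7, 12, 18, 25, 31, 44, 52]
BLOCK = 2
blocks = [records[i:i + BLOCK] for i in range(0, len(records), BLOCK)]

# Dense index: one entry per key value, pointing at its block.
dense = {key: i for i, block in enumerate(blocks) for key in block}

# Sparse index: one entry per block -- only the lowest key of each block.
sparse_keys = [block[0] for block in blocks]

def sparse_lookup(key):
    """Find the block that may contain `key` using the sparse index."""
    i = bisect.bisect_right(sparse_keys, key) - 1
    return blocks[i] if i >= 0 and key in blocks[i] else None

print(dense[25], sparse_lookup(25))   # 2 [25, 31]
```

The sparse index trades one extra block scan per lookup for an index that is smaller by a factor of the block size, which is why primary indices over block-ordered data are usually sparse.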
Another obvious classification of indices concerns their data structure, e.g. a hash
table or a B-tree. In databases, many index kinds are sometimes combined into one
so-called multilevel index. The next subchapter describes the most popular kinds of
data structures employed in database indexing systems.
2.2 Index Data-Structures
The most popular data structures used for index organisation are various kinds of
B-trees proposed by Bayer and McCreight [8] and hash tables invented by H. P. Luhn
[14]. The efficiency improvement for selection or sorting queries varies with the choice of a
proper data structure for indexing the given data. However, each index consumes some
amount of database storage space and adds some overhead to the time of
inserting, modifying or removing indexed data. The individual properties of different
structures have been presented in thousands of papers and books devoted to databases
and algorithms, e.g. [14, 23, 29]. In the context of this dissertation the kind of
physical index organisation exploited is generally insignificant. Only some properties of
the index structure are important, in particular:
• key order preservation, i.e. support for range queries,
• support for indexing using multiple keys,
• distribution of an index over multiple servers.
From the point of view of the database, the same index interface can be used for a variety of
index structures. Therefore, this work omits a detailed discussion of this subject, focusing
mainly on the index structure used in the author's implementation, i.e. linear hashing.
The hash table uses the hash coding method, based on a hash function which
maps key values into a limited set of integer values. A calculated hash value points to an
area of a table in memory (called a bucket) holding the corresponding non-key values. This method
allows indexed values to be looked up or updated in very short, constant time,
particularly when the hash function distributes key values evenly. A disadvantage of this
technique is the necessity of specifying the size of the index table in advance. However, dynamic
hashing and linear hashing (described in the next section) deal with this issue. Another
problem appears when two or more keys are mapped to the same location in the table.
Similarly, it may happen that two or more objects have the same key value of an
attribute. Resolving these so-called collisions leads to deterioration of index
performance. There are many techniques allowing such items to be put in a hash table and
queried in a fairly fast way: a rehash function, a linked list approach (separate
chaining), a linked list inside the table (coalesced hashing) and buckets. Methods
involving linear or dynamic hashing use load control algorithms that automatically force the
hash table to expand in order to prevent performance loss.
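Separate chaining, mentioned above, can be sketched as follows (a deliberately tiny table to force collisions; all names are illustrative assumptions):

```python
# Separate chaining: colliding keys share a bucket that holds a chain of
# entries (the "linked list approach" above). Hypothetical minimal sketch.

class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # overwrite an existing entry
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

t = ChainedHashTable(n_buckets=2)        # tiny table guarantees collisions
for k in ("Smith", "Jones", "Brown"):
    t.put(k, k.lower())
print(t.get("Jones"))                    # jones
```

With only two buckets, at least two of the three keys must collide, yet lookups still succeed; they merely degrade from constant time towards a linear scan of the chain, which is exactly the deterioration that load control in linear and dynamic hashing prevents.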
Another very popular indexing technique is based on B-trees. A B-tree is slightly
worse than a hash table from the point of view of search time and frequent data updates,
which often involve tree reorganisation. However, its advantages are the simplicity of the
algorithm and economical memory consumption. B-trees store keys in non-descending
order, so they can be very helpful in laborious queries involving sorting or
ranking data. Many different kinds of tree structures have been proposed in the literature and
incorporated in commercial products, e.g.:
• B+ tree, B# tree, B* tree – varieties of the B-tree,
• AVL tree, splay tree – balanced binary search trees,
• radix tree – optimised to store a set of strings in lexicographical order,
• and many more [14].
Indexing techniques used in data warehousing applications are somewhat different
from those used in on-line transaction processing. Bitmap indices are stored as
bitmaps (often compressed) [17, 66]. Consequently, the answer to most queries can
be obtained by performing bitwise logical operations. Bitmap indices are most effective on
keys with a limited set of values (e.g. a gender field) and often use a combination of
such keys (i.e. a multiple-key index). When these conditions are met, bitmap indices
offer reduced storage requirements and greater efficiency than regular indices. On the
other hand, the performance of index maintenance is their serious drawback. Bitmap
indices are primarily intended for non-volatile systems, since the method is very
sensitive to updates of the indexed data. Reflecting a change requires keeping locks on the
segments storing a bitmap index, which is very time-consuming. In typical
cases bitmap indices are easier to destroy and re-create than to maintain.
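A bitmap index answering a conjunctive predicate with a single bitwise AND can be sketched as follows (data and names are illustrative assumptions; real systems compress the bitmaps):

```python
# Bitmap index sketch on low-cardinality keys, combined with bitwise AND,
# as in data-warehouse queries. All data and names are hypothetical.

rows = [
    {"gender": "F", "dept": "IT"},
    {"gender": "M", "dept": "IT"},
    {"gender": "F", "dept": "HR"},
    {"gender": "F", "dept": "IT"},
]

def bitmap(attr, value):
    """One bit per row; bit i is set iff rows[i][attr] == value."""
    bits = 0
    for i, row in enumerate(rows):
        if row[attr] == value:
            bits |= 1 << i
    return bits

# gender = 'F' AND dept = 'IT' answered by a single bitwise operation.
answer = bitmap("gender", "F") & bitmap("dept", "IT")
hits = [i for i in range(len(rows)) if answer & (1 << i)]
print(hits)   # [0, 3]
```

The sketch also makes the maintenance drawback visible: inserting or updating a single row touches one bit in every bitmap of the index, which is why such indices suit read-mostly workloads.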
Other variants of indices for data warehousing have also been developed [84]:
• projection index – quite useful in cases where column values must be retrieved for all selected rows, because they are likely to be found on the same index page,
• bit-sliced index – based on processing bitmaps; provides an efficient means of calculating aggregates.
Many other index structures have evolved to facilitate various index applications:
• inverted files, signature-based files – two principal indexing methods for text document databases [134] and for indexing according to set-valued attributes with low cardinality [41],
• multi-index, path index, access support relations, T-Index, path dictionary index – for path expression processing in OODBMSs [10, 11, 12, 67, 81],
• inherited multi-index, nested-inherited index, triple-node hierarchies, H-tree, CH-tree, hcC-tree (hierarchy class Chain tree), signature file hierarchies, signature graphs – oriented towards facilitating the processing of collections organised in class hierarchies in OODBMSs [10, 11, 12, 21, 77, 111],
• R-Tree, UB-tree, kd-tree, X-Tree, Parametric R-Tree, TPR-Tree (Time Parameterized R-Tree), TPR*-Tree, grid file – for spatial (i.e. multidimensional) and spatio-temporal data, e.g. in Geographic Information Systems [30, 32, 40, 125],
• etc.
Another group of index structures can be defined in the distributed environment
domain. In general, as an indexed dataset grows, an index can be split
into small parts maintained on independent servers, thus utilising their storage (e.g.
main memory or disks) and processing power. In contrast to local indices, such an index:
• enables exploiting parallel computing power (therefore such indices are usually referred to as parallel or distributed indices),
• can be scalable, freely spreading its parts between network nodes without compromising its primary efficiency,
• provides a higher level of concurrency.
An overview of the properties of a distributed index structure based on the idea of
linear hashing is given in section 2.2.2. Similarly to local indices, parallel indices have
been developed in many variants for various applications and systems, e.g.:
• scalable distributed data structure variants (cf. section 2.2.2),
• distributed hash table (DHT) [34] for data indexing in peer-to-peer (P2P) networks, e.g. Chord [113],
• scalable distributed B-tree [3],
• a multi-key distributed index combining a bit vector, a graph structure and a grid file [45],
• psiX, a hierarchical distributed index for XML documents in P2P networks [105],
• DiST, PN-tree – structures for indexing multidimensional (spatial) datasets [4, 16].
All the index structures mentioned in this subchapter are only a small fraction of the
existing solutions, which are described in thousands of research papers. The next
section concerns the linear hashing index, which is an important part of the author's
prototype implementation verifying the theses.
2.2.1 Linear Hashing
Linear hashing is a dynamic indexing structure invented by Witold Litwin [72].
Similarly to a regular hash table, it comprises buckets which store index entries
according to some hash function. Linear hashing strives to keep a relation between
the number of index entries and the number of buckets in order to ensure constant
search, insertion and deletion efficiency and to minimise bucket capacity
overflows. Buckets are added (through splitting) and removed (through merging) one
at a time, which is made possible by taking advantage of a family of dynamic hashing functions.
At the start, a linear hashing structure consists of N0 empty buckets numbered
from 0 to N0−1. Three important parameters describe an index state:
• n – the number of the bucket to be split next if necessary (initially equal to 0),
• j – the current lowest bucket level of the index (initially equal to 0),
• N – the number of buckets, equal to N0·2^j + n (consequently, initially equal to N0).
The buckets from the n-th to the (N0·2^j − 1)-th belong to the j-th level, while the rest of the
buckets, i.e. from the 0-th to the (n−1)-th and all from the N0·2^j-th to the (N−1)-th, belong to the
(j+1)-th level. Index entries are spread over the index according to hash functions h(j, key)
depending on the level of an index bucket. The target bucket T for a key is determined
according to the following formula:

T(j, key) = h(j+1, key),  if h(j, key) ∈ [0, n[
T(j, key) = h(j, key),    if h(j, key) ∈ [n, N0·2^j[

where:
• h(j, key) := hash(key) mod (N0·2^j),
• hash(key) is the basic key hashing function,
• [minvalue – stands for the inclusive left limit of the defined values range,
• maxvalue[ – stands for the exclusive right limit of the defined values range.
The most crucial operation, i.e. splitting, is triggered after an insertion when the
index load becomes too high. A new bucket is appended to the buckets table and the
elements of the n-th bucket are divided between the n-th bucket and the new bucket
n + N0·2^j according to the h(j+1, key) function. It is worth noticing that:

h(j+1, key) ∈ {h(j, key), h(j, key) + N0·2^j}

Next, the parameters n and N are incremented by one. Eventually, when n reaches N0·2^j,
indicating that the hash table has doubled its size from N0·2^j to N0·2^(j+1), the n parameter
is reset to 0 and the index level j is incremented by one.
Conversely to splitting, if during a deletion the index load falls below some fixed
threshold, then merging of buckets is performed.
An example bucket split procedure is presented in Fig. 2.2. Bucket entries are
represented by the values of their hash(key) functions. The state before the split is presented
in Fig. 2.2a. The parameters of the index were the following: n = 0, N0 = N = 100, j = 0,
so all buckets are addressed using the h(0, key) function.
Fig. 2.2 Example of a bucket split operation [72]
The split is performed on the n-th bucket (bucket 0), which was already overflowed. A new
bucket is allocated at the end of the buckets table and filled with the entries moved
from bucket 0 for which h(1, key), i.e. hash(key) mod 200, equals 100. Finally, n
and N are incremented. The index state after the split is shown in Fig. 2.2b. As
shown, the dynamic expansion of the linear hashing table helps to minimise bucket
overflow.
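The splitting behaviour described above can be sketched as a compact, single-site implementation (the load-factor trigger, parameters and names are assumptions of this sketch; bucket merging on deletion is omitted):

```python
# A compact linear hashing sketch (N0 initial buckets, level j, split
# pointer n). Splits when the load factor exceeds a fixed threshold.
# Parameters and names are illustrative; merging is omitted.

class LinearHash:
    def __init__(self, n0=4, max_load=0.75):
        self.n0, self.max_load = n0, max_load
        self.n, self.j = 0, 0                  # split pointer, level
        self.buckets = [[] for _ in range(n0)]
        self.count = 0

    def _h(self, j, key):
        return hash(key) % (self.n0 * 2 ** j)

    def _target(self, key):
        t = self._h(self.j, key)
        return self._h(self.j + 1, key) if t < self.n else t

    def insert(self, key, value):
        self.buckets[self._target(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        self.buckets.append([])                # new bucket n + N0*2^j
        old = self.buckets[self.n]
        self.buckets[self.n] = []
        for key, value in old:                 # redistribute with h(j+1)
            self.buckets[self._h(self.j + 1, key)].append((key, value))
        self.n += 1
        if self.n == self.n0 * 2 ** self.j:    # table doubled: next level
            self.n, self.j = 0, self.j + 1

    def search(self, key):
        return [v for k, v in self.buckets[self._target(key)] if k == key]

lh = LinearHash(n0=2)
for i in range(20):
    lh.insert(i, str(i))
print(lh.search(13))   # ['13']
```

Note that buckets are split strictly in order 0, 1, 2, …, regardless of which bucket actually overflowed; the overflowing bucket simply waits until the split pointer reaches it, which is what keeps the addressing formula so simple.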
An overview of SDDS, a structure based on the idea of linear hashing that is efficient
for distributed indexing, is presented in the following section.
2.2.2 Scalable Distributed Data Structure (SDDS)
SDDS is a scalable distributed data structure introduced by W. Litwin [73, 74]
which deals with storing index positions in a file distributed over a given network. Its
properties make it a good candidate for indexing global data in a distributed
infrastructure (e.g. a grid). SDDS uses LH*, which generalises the linear hashing method
described in section 2.2.1 to distributed memory or disk files.
In contrast to linear hashing, SDDS buckets can be located on different sites.
The LH* structure does not require a central directory, and it grows gracefully, through
splits of one bucket at a time, to virtually any number of servers. SDDS strategies
differ in their approach to bucket splitting, which can be managed by a coordinator site,
triggered by a bucket overflow, or driven by controlling the index load factor.
Applying SDDS significantly extends the features of linear hashing.
The major advantages of an SDDS index for distributed indexing are the
following:
• avoiding a central address calculation spot,
• support for parallel and distributed query evaluation,
• concurrency transparency,
• scalability – it does not assume any constraints on size or capacity,
• an SDDS file expands over new servers when current servers are overloaded,
• index updating does not demand a global refresh on servers or clients,
• over 65% of an SDDS file space is used,
• in general, a small number of messages between servers (1 per random insert;
2 per key search),
• parallel operations on M SDDS buckets require at most 2·M+1 messages and
between 1 and O(log(M)) rounds of messages.
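The low message cost per key search can be illustrated with a toy, single-process simulation of LH*-style addressing (a hedged sketch of Litwin's client/server address calculation, not a network implementation; the class and method names are hypothetical): a client computes an address from a possibly outdated image (n', j') of the file, and each "server" corrects the address using its bucket's local level, reaching the right bucket in at most two forwardings.

```python
N0 = 1  # start from a single bucket to keep the simulation small

def h(j, key):
    return key % (N0 * 2 ** j)

class LHStarFile:
    """Grow-only simulation: tracks the file state (n, N, j) and each
    bucket's local level, which servers use to forward misdirected keys."""
    def __init__(self):
        self.n, self.N, self.j = 0, 1, 0
        self.level = [0]               # per-bucket level

    def split(self):
        self.level[self.n] += 1        # bucket n is rehashed at level j+1
        self.level.append(self.j + 1)  # new bucket N gets level j+1
        self.n += 1
        self.N += 1
        if self.n == N0 * 2 ** self.j:
            self.n, self.j = 0, self.j + 1

    def true_address(self, key):
        a = h(self.j, key)
        return h(self.j + 1, key) if a < self.n else a

    def route(self, key, n_img, j_img):
        """Client guess from a (possibly outdated) image, then
        server-side address adjustment with forwarding."""
        a = h(j_img, key)
        if a < n_img:
            a = h(j_img + 1, key)
        hops = 0
        while True:
            a1 = h(self.level[a], key)
            if a1 == a:                # this bucket is the right one
                return a, hops
            a2 = h(self.level[a] - 1, key)
            if a < a2 < a1:            # forward no further than needed
                a1 = a2
            a, hops = a1, hops + 1
```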
In efficiency, SDDS outperforms the centralised index
directory approach (described in detail in section 2.6.1) and any static data structure.
Variants of the SDDS index include implementations:
• preserving key order and supporting range queries (e.g. the RP* family of SDDS
structures [75]),
• providing high availability, i.e. tolerating the unavailability of some server
sites composing the SDDS (e.g. LH*RS [76]).
2.3 Relational Systems
System R, developed by IBM (International Business Machines Corporation) Research
between 1972 and 1981, was the first database management system implementing the
relational model [6]. Innovative solutions developed within the system included a query
optimiser utilising indices [15, 65]. Overviews of relational query optimisation, including
the fundamentals of the approach to indexing, are collected in [20, 54, 55]. Almost 40 years
of research on relational systems has resulted in the development of various indexing
aspects. Numerous indexing-based solutions are incorporated into available commercial
products. The major RDBMSs currently are SQL Server by Microsoft [109], DB2 by IBM
[24], Informix by IBM [53] and Oracle by Oracle Corporation [91]. The most popular
open-source relational systems are PostgreSQL by the PostgreSQL Global Development
Group [103], MySQL by Sun Microsystems [83] and Firebird by the Firebird Foundation [31].
The well-known indexing solutions designed for RDBMSs are the following:
• primary index, clustering index, secondary access paths (cf. section 2.1.2),
• multi-key index – enables indexing using a combination of multiple fields,
• derived key index (iSystem DB2/400 by IBM) [51], function-based index
(Oracle) [115], functional indexes (Informix) [49] – indices on expressions,
built-in or user functions that exactly match selection predicates within an SQL
where clause,
• computed-column indices (MS SQL Server) – a solution similar to the previous one
but relying on an additional table column (a computed column), which can define
an indexable expression using derived attributes and user functions (the index
maintenance relies on maintenance of the computed column) [110],
• temporary index – a transient internal structure created automatically by the DB
engine or defined manually (described below in this subchapter) [51, 110],
• development of diverse index structures (the topic of subchapter 2.2),
• other product-specific solutions.
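As a concrete, non-Oracle illustration of the expression-index idea, SQLite also supports indexes on deterministic expressions; the sketch below uses hypothetical table and index names and SQLite (not Oracle or Informix) syntax:

```python
# Expression index sketch: the pre-computed lower(name) values become
# index keys, and the optimiser uses the index only when the predicate
# matches the indexed expression exactly.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, salary REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("Kuc", 1200.0), ("Nowak", 1500.0)])

con.execute("CREATE INDEX emp_lower_name_idx ON emp (lower(name))")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM emp WHERE lower(name) = 'kuc'"
).fetchall()
row = con.execute(
    "SELECT salary FROM emp WHERE lower(name) = 'kuc'"
).fetchone()
```

Inspecting `plan` shows the search being served by `emp_lower_name_idx` rather than a full table scan.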
In RDBMSs the keys that are used for defining an index on a table are usually
simple values stored in columns. Developers of such an index can use various index
structures and mechanisms for assuring index transparency. A query optimiser can
easily identify where clauses addressing indexed selection predicates. Modifications to
an indexed table are also easy to detect by the DB engine during run-time or even
earlier through the analysis of an intermediate form of DML (Data Manipulation Language) statements. Insertion or
deletion of table rows transparently triggers addition or removal of an appropriate index
entry. Analogously, modification to any value in a key column results in changes inside
an index. Therefore, details of automatic index updating in RDBMSs are usually
omitted in technical RDBMS specifications and considered rather as an implementation
issue.
Function-based indices and similar solutions, which enable defining keys using
expressions that address more than one table column and internal or user-written
functions, generally do not introduce conceptual difficulties. The functions
supporting such indices can be written in a native database language (e.g. PL/SQL) or
an external programming language (C++, Java, etc.). Furthermore, they must be
deterministic (i.e. depend only on the state of the database store) and side-effect free (i.e.
do not introduce any changes to data). The idea of function-based indices is derived
from optimisation through method (or function) pre-computation or materialisation. It is
widely discussed in the research literature [5, 9, 13, 27, 57, 80]. The optimisation gain
relies on pre-calculating the result of a given function or a derived attribute for all
objects of a collection. The obtained results are used as keys to index objects and are
stored inside the index. Thus, when queries are evaluated, the optimiser strives to use
the result computed earlier in order to avoid laborious execution of functions and
derived attributes.
Automatic maintenance of function-based indices simply requires tracking
modifications to the values stored in the columns used in the key definition.
Nevertheless, this aspect of indexing becomes complex when object-oriented model and
language extensions are considered. In extreme cases, it may even lead to serious errors
(see section 2.5.1).
If appropriate indices do not exist then the optimiser can try to facilitate query
processing using temporary indices instead. Their applications are described in detail
for iSystem DB2/400 by IBM [51]. A temporary index can be created solely for
performing joins (e.g. nested loop join), ordering, grouping, distinct and record
selection. It is applied by the optimiser to satisfy a specific query request. Such an index
can be built as a part of a query plan. After query execution it is destroyed; in effect, it
is not reused or shared across jobs and queries. Sometimes a temporary index
can be created for a longer period. Such a decision can be made by the DB engine
based on the analysis of query requests over time. In order to reuse and share such an
index, it has to be altered whenever the underlying table changes. The advantage of a
temporary index is shorter access time, as it is stored only in main memory.
2.4 OODBMSs
Index organisation and optimisation in object-oriented database management
systems have been deeply researched, see [7, 10, 11, 12, 21, 43, 46, 77, 79, 81, 93, 111].
Experimental database prototypes include, among others: IRIS by Hewlett-Packard,
ORION by MCC (Microelectronics and Computer Technology Corporation),
OPENOODB by Texas Instruments and the ENCORE/ObServer project at Brown
University. A few former commercial OODBMSs are: ONTOS by Ontos, ARDENT by
ARDENT Software (formerly O2 by O2 Technology), ODE by AT&T Bell Labs and
POET by POET Software [29].
OODBMSs are based on a hierarchical object-oriented data model. One of the important
notions of the object-oriented model is a reference, i.e. a pointer link to an object.
Pointer links express relationships (associations) between objects. As a result of
attempts to standardise object-oriented database management systems, the ODMG
(Object Data Management Group) [18] proposed OQL (Object Query Language) [29, 33],
which to some extent influenced the development of object-oriented query languages.
Differences in data models and query languages imply that some indexing techniques
are specialised to the relational or the object-oriented approach only.
OQL involves path expressions composed of object names separated by dots in
order to navigate easily via pointers to objects. Navigation to a pointed object in
OODBMSs can be fast, as it is usually resolved at a low level with a direct link. In the
relational model such relationships (i.e. primary-foreign key dependencies) require
performing joins and, for efficient query evaluation, require indices. Nevertheless, some
object-oriented systems may implicitly rely on a flat, relational-like data model. In such
a case, navigation along a pointer link still requires performing an implicit join among
objects. Thus the assumption limiting OQL path expressions is that the operand before
the dot operator should not deliver a collection.
Much work in OODBMS research has been dedicated to improving the efficiency of
processing nested predicates, i.e. predicates based on derived attributes defined using
path expressions. These works additionally extend path expression indexing with
consideration of inheritance issues. The most important proposed solutions are
Multi-Index, Inherited Multi-Index, Nested-Inherited Index, Path Index, Access Support
Relations [10, 11, 12], Triple-node hierarchies [77] and T-Index (focused on
semi-structured data) [81]. The efficiency of these methods has been deeply studied,
described through appropriate cost models and verified by prototype implementations.
The solutions focus on various criteria, such as the cost of retrieval, the cost of update
operations or the cost of storage. However, the transparency aspect of automatic index
updating is not always precisely explained. Generally, it is assumed that each
modification of an attribute of a class instance, as well as the creation or deletion of an
instance, should cause appropriate index updating actions. However, instances of a
class accessed by an indexed path expression can be located in different collections.
Moreover, these collections can contain an arbitrary number of objects not associated
with the indexed objects. These circumstances can make automatic index updating
routines inapplicable or seriously affect the database's performance. Let us consider an
example of an OQL query returning data concerning departments that are supervised by
the employee John Doe:
SELECT * FROM Departments d
WHERE d.supervisedBy.name = “JOHN DOE”
A path expression based index supporting this query's evaluation concerns only those
employees who are department supervisors. Unfortunately, modifying the name of any
employee would be burdened by the index maintenance mechanisms. This
inconvenience is, however, justified. In the approach to automatic index updating
presented in [10, 11, 12] all instances of classes associated with a path expression based
index need to be taken into consideration to ensure index validity after data
modifications. Hence, an index structure often preserves some additional information
concerning objects currently not accessed by the given index but located in collections
processed during the path expression evaluation. An overview of the architecture of a
system oriented towards indexing based on path expressions is given in section 2.4.5.
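To make the maintenance problem concrete, the toy sketch below (hypothetical names; not the actual Multi-Index algorithms of [10, 11, 12]) keeps a path index on Departments.supervisedBy.name together with a reverse map, so that renaming an employee updates only the entries of the departments it actually supervises:

```python
class PathIndex:
    """Toy path index: maps the employee name at the end of the path
    Departments.supervisedBy.name to department ids. The reverse map
    records which departments each employee supervises, illustrating why
    updates to Emp.name must be intercepted even though the index
    'belongs' to Departments."""
    def __init__(self):
        self.entries = {}      # name -> set of dept ids
        self.reverse = {}      # emp id -> set of dept ids it supervises

    def insert(self, dept_id, emp_id, name):
        self.entries.setdefault(name, set()).add(dept_id)
        self.reverse.setdefault(emp_id, set()).add(dept_id)

    def rename(self, emp_id, old, new):
        # Only the affected departments move between index entries.
        for d in self.reverse.get(emp_id, ()):
            self.entries[old].discard(d)
            self.entries.setdefault(new, set()).add(d)

    def lookup(self, name):
        return self.entries.get(name, set())
```

Renaming an employee who supervises no department touches nothing, while renaming a supervisor moves exactly the affected entries.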
The distributed object management system H-PCTE (High-performance Portable
Common Tool Environment), developed at the University of Siegen [47], has proposed a
different solution to automatic index maintenance. It is independent of the index
structure kind and its contents. This work relies on an extended OQL variant, P-OQL [42],
designed to reflect the data model of H-PCTE. The approach is based on so-called index
update definitions, which consist of a description of the event causing the need for an
index update, a reference to the affected index structure, a query determining the
elements for which the respective index entries have to be updated, and a corresponding
update operation. These index update definitions can be generated during index creation.
The solution handles complex derived attributes, for instance, employing regular path
expressions and exploiting OQL aggregate functions. On the other hand, the authors
outline some limitations of this approach concerning efficiency and the handling of user
methods, giving general suggestions on how these disadvantages could be overcome [43].
Another approach to index maintenance is discussed in detail for function-based
indexing [46] developed in the context of Thor, a distributed object-oriented database
system developed at the Massachusetts Institute of Technology [71]. It descends from
works on optimisation of methods and functions in databases [9, 57]. Indices are
maintained using a so-called object registration schema. Registration concerns only
objects whose modification can affect an index. An index update is triggered by a
mechanism that checks registration information during object modification. Despite
the theoretical genericity of this approach, it has not been fully implemented, since Thor
provides object persistence for applications but without support for queries.
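A minimal sketch of such a registration mechanism (hypothetical API, not Thor's actual interface): objects are registered with the indices whose keys depend on them, and a write barrier consults the registrations so that only the affected index entries are refreshed.

```python
class Index:
    """Function-based index: key_fn computes the key from an object."""
    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.entries = {}            # key -> set of objects

    def add(self, obj):
        self.entries.setdefault(self.key_fn(obj), set()).add(obj)

    def refresh(self, obj, old_key):
        self.entries[old_key].discard(obj)
        self.add(obj)

class Registry:
    """Registration schema: records which indices each object affects."""
    def __init__(self):
        self.regs = {}               # id(obj) -> list of indices

    def register(self, obj, index):
        self.regs.setdefault(id(obj), []).append(index)
        index.add(obj)

    def modify(self, obj, field, value):
        # Write barrier: remember old keys, apply the change, refresh
        # only the registered (i.e. affected) indices.
        affected = self.regs.get(id(obj), [])
        old_keys = [ix.key_fn(obj) for ix in affected]
        setattr(obj, field, value)
        for ix, old in zip(affected, old_keys):
            ix.refresh(obj, old)
```

Unregistered objects are modified without any index work, which is the point of the scheme.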
In [123] the authors present an approach generalising index-based methods by
means of stored queries. They propose to store a response, i.e. the result of a query for
the current database state, alongside the query. The universality of this solution enables
taking advantage of indexing that exploits complex predicates, e.g. aggregate functions.
However, in the context of the traditional approach to database indices, this work is
closer to optimisation by query caching.
To the best of the author's knowledge, only a few indexing techniques proposed in
the scientific literature have been incorporated into commercial OODBMS products
and major prototypes. Careful inspection of the applied indexing facilities is possible
through analysis of the major object-oriented DBMSs. The prototypes mentioned above
and the commercial products presented in the next sections represent only a part of the
existing object-oriented database management systems landscape. Nevertheless, they
provide a sufficiently complete overview of the indexing state of the art in OODBMSs.
2.4.1 db4o Database
The db4o database system by db4objects [25] is designed as a tool providing
transparent persistence for objects of object-oriented languages. Native Queries in db4o
supply an advanced query interface using the semantics of the Java programming
language [22]. Another query interface is SODA.
Transparent indexing in the db4o OODBMS is provided only for attributes of
classes defining indexed collections [26]. This means that db4o handles index
maintenance and query optimisation automatically. The documentation does not present
details of index properties, only their usage. SODA query optimisation
allows db4o to use indices. Native Queries are converted to SODA where possible;
otherwise a Native Query is executed by instantiating all objects.
2.4.2 Objectivity/DB
The approach of Objectivity/DB by Objectivity [85, 86, 87] to object persistence
in programming languages is similar to db4o's; however, it is considered more as an
alternative to the traditional understanding of a query language. Objectivity/DB, besides
C++, Java and Smalltalk support, provides Objectivity/SQL++, which complies
with ANSI-standard SQL-92 and extends it with some object-oriented features.
Storage objects in Objectivity/DB are used to group other objects and
their indices in order to meet space utilisation, performance and concurrency requirements.
There are three kinds of storage objects corresponding to three levels of grouping in
the Objectivity/DB storage hierarchy: the federated database, the database and the container.
An index structure maintains references to persistent objects of a particular class (the
so-called indexed class) and its derived classes within a particular storage object.
Objectivity/DB supports indexing on a single class field or a concatenated index on
several attributes (key fields). The indexed class is specified when an index is created.
An index can be created on any persistence-capable class, i.e. a class
whose behaviour enables storing its instances persistently in Objectivity/DB. Indices
can be regarded as sorted collections of references to indexed objects. The order of key
values in an index is very relevant to the proper operation of predicate scans. By
default, indexed objects are stored in ascending order of the values of their key fields;
this can be specified while creating the index.
Let us consider the index usage in Objectivity/DB. The main goal of an index is
to optimise predicate scans. The predicate used in the scan can be one of the following:
• a single optimised condition (=, ==, >, <, >=, <=, =~ – string match) that tests the
first key field of the index,
• a conjunction (&&) of conditions in which the first conjunct is an optimised
condition that tests the first key field of the index (disjunction, i.e. OR, is not
supported).
Objectivity provides a way to declare the uniqueness property of an index
for a combination of values in the key fields of indexed objects. This can be specified
when creating an index. The DB, however, does not automatically enforce the property;
Objectivity simply indexes only objects with a unique combination of key field values.
A subsequent object with the same combination of key field values will not be
considered for indexing.
Modifications concerning objects of an indexed class in the relevant storage
object cause appropriate changes in the index automatically. Additionally, to control
updates, the session's index mode can be used. It enables determining the time of an
index update relative to when indexed objects are modified. The index modes are as
follows:
• INSENSITIVE – the update is applied when the transaction commits,
• SENSITIVE – the update is applied when the next predicate scan is performed
in the transaction or, if no scans are performed, when the transaction commits,
• EXPLICIT_UPDATE – suppresses automatic updating of indices; an update-intensive
application that works with this index mode can update indices
explicitly after every relevant change.
2.4.3 ObjectStore
Similarly to db4o and Objectivity/DB, one of the goals of ObjectStore by
Progress [88, 89] is making access to a database transparent to the programming
language.
In ObjectStore, a collection is an object that groups together other objects.
When adding an index to a collection, its order and uniqueness can be
specified (by default it is unordered and allows duplicates). In ObjectStore
the place of index storage can be chosen at creation time: a designated database
segment or a specified database. By default, the index is stored in the same database,
segment, and cluster as the collection to which it was added.
ObjectStore introduces a so-called multistep index. It can be created using a
complex path expression, which can access multiple public data members and
methods, as a key. Additionally, to optimise queries involving types
that have many subtypes, the idea of a superindex was implemented. By default, adding
an index on a type results in recursively adding indices to all its subtypes; still, for
queries over a large and intricate hierarchy of subtypes, regular indexing can seriously
deteriorate processing. A superindex added to a type with many subtypes differs from a
default index in one essential feature: there is only one superindex. It eliminates the
recursion; consequently, only one parent query operation occurs, in contrast to multiple
queries when using regular indices. A superindex is updated automatically, just as a
default one is. However, the superindex has some flaws:
• using a superindex with a small number of subtypes will not bring significant
gain,
• starting a query from a subtype gains nothing from the supertype's superindex,
• a superindex cannot be applied to types with subtypes in different segments of the
same database or in a different database.
The last limitation of the superindex can be used to prevent adding new subtypes
located in different databases to a superindexed type.
The ObjectStore ODBMS automatically optimises a query applied to a
collection. If an index is added to a collection, the database first evaluates indexed fields
and establishes a preliminary result set. Next, it applies non-indexed fields and methods
to elements in the preliminary result set. In ObjectStore optimisation can be done
explicitly (by preparing a query) or automatically (otherwise). The latter means that a
query is optimised to use exactly those indices which are available on the collection being
queried. Automatic optimisation is convenient and effective. Nevertheless, when a
query is to be run many times against multiple collections, with potentially different
indices, it is recommended to take manual control over the optimisation strategy.
ObjectStore supports multistep indices but provides only partial index
maintenance transparency. It automatically updates an index when
elements are added to or removed from an indexed collection. However, updating an index
entry after a data modification must be explicitly triggered by the programmer.
Besides all the indexing capabilities mentioned above, ObjectStore can create a
primary index for an unordered collection that does not allow duplicates. It is an index
used for queries and for looking up objects in such a collection. Therefore, the primary
index must contain no duplicate keys and must contain all elements of the collection.
Thanks to this solution, in some cases look-ups, insertions and removals from
the collection are faster.
2.4.4 Versant
The Versant Object Database by Versant [126] requires explicit use of query
language statements in programming language code. The statements appear
in the code as strings and are processed at run time by a special utility in order to find
and manipulate objects in a database. Versant exploits its native query language VQL
(Versant Query Language), similar to SQL with some object-oriented extensions.
In Versant, indices are set on a single attribute of a class and affect all instances
of the class. Versant uses two kinds of index structures: B-trees and hash tables. Both
maintain a separate storage area containing attribute values but differ in organisation.
An attribute can be associated with two indices, one of each kind. A B-tree index is
useful for value-range comparisons, while a hash index is better for exact-match
comparisons. Versant does not support index inheritance: an index can be
created on an attribute of only one class, and no class inheriting from the one with the
index will inherit it. To index subclass attributes, indices need to be set specifically
on each subclass. This results in the need for the administrator to provide index
consistency.
Advanced transparent indexing in Versant is achieved on virtual attributes,
a technique similar to indexing on a computed column in Microsoft SQL
Server. This approach enables indexing of derived virtual attributes built using one or
more normal attributes of a class [127].
Indices in Versant do not have names and they are maintained automatically
while adding, removing or updating the value of an attribute.
An extra constraint that a Versant index can enforce is uniqueness. A
unique index ensures that each instance of a class has a value of the indexed attribute that
is unique with respect to the attribute values in all other instances of the class. In other
words, once an attribute receives a unique index, no duplicate value for this attribute
can be committed; the database server process first checks the uniqueness
constraint. However, such uniqueness must be assured by the index administrator and
can only be changed by removing the index.
2.4.5 GemStone’s Products
The last commercial OODBMSs evaluated by the thesis' author are the GemStone
products [37]: GemFire, the Enterprise Data Fabric [35], which supports a subset of
OQL, and Facets [36], which provides transparent persistence for the Java programming
language using an SQL-92-based language with object extensions. These are the only
tested databases which support transparent indexing employing path expressions. Both
databases originate from the GemStone database, whose approach to indexing has been
discussed in [79].
In GemStone, indices address path expressions. The variable name
appearing at the beginning of a path is called the path prefix. Then a path contains a
sequence of links and a path suffix, e.g. Employee.worksIn.manager. For each link (for
instance, an instance variable of an object) in the path suffix one index is available, thus forming a
sequence of index components. GemStone supports five basic storage formats for
objects, one of which is the non-sequenceable collection (NSC). When an object in this
format grows large, its representation switches from a contiguous one to a B-tree which
maintains the members by OOP (GemStone uses unique surrogates called
object-oriented pointers, OOPs, to refer to objects, and an object table to map an OOP
to a physical location). Every NSC object has an instance variable named NSCDict
that is not accessible to the user. If there are no indices on an NSC, then the
value of NSCDict is nil; otherwise, it is the OOP of an index
dictionary. An index dictionary contains the OOPs of one or more dictionary entries.
A dictionary entry contains information about the kind of the index (either equality or
identity), the length of the path suffix and two arrays: the first representing the offset
representation of the path suffix and the second holding the OOP of
the index component for each instance variable in the path suffix. These index
components are implemented using B+-trees. An index component stores
information about the ordering of keys in the component's B-tree. If the path suffixes of
two or more indices into an NSC have a common prefix, then the indices share the
index components on the common prefix.
In GemStone, identity indices directly support exact-match lookups, whereas
equality indices, and identity indices on booleans, characters and integers, directly support
=, >, >=, <, <= and range lookups.
Objects in GemStone may be tagged with a dependency list. For every index
component in which an object is a value in the component’s B-tree, the object’s
dependency list will contain a pair of values consisting of the OOP of an index
component and an offset. The pair indicates that if a value at the specified offset is
updated then an update must be made to the corresponding index component.
Consequently, an index component is automatically dependent on the value of the
object at the given offset.
2.5 Advanced Solutions in Object-Relational Databases
A very promising feature of relational systems extended with object-oriented
capabilities is indexing using keys defined on expressions consisting of derived
attributes, internal and user methods (described in subchapter 2.3: the derived key index,
functional indexes, function-based indices and computed-column indices). Plain
relational systems impose the following limits on index key definition:
• an index key can only be calculated using data in the current tuple, since SQL
does not enable defining an index using data from other tables associated through
a primary-foreign key relationship,
• SQL aggregate functions are forbidden in index definitions, since a simple SQL
expression used in a selection predicate for a table returns a single value.
Without advanced object-oriented extensions there is no support for methods
associated with tuples, polymorphism and path expressions. Such limitations also apply
to the majority of indexing techniques in ORDBMSs. The author has not found any
object-relational DBMS supporting indexing using aggregate functions or path expressions.
Relatively complex indices involving method invocations and polymorphism in the
object-relational environment can be created with the use of the Oracle function-based
indices feature [108, 115]. The Oracle documentation does not provide extensive
information concerning the automatic maintenance of such indices. To identify the
properties of Oracle's approach, the author performed the tests described in the next
section.
Besides regular indexing facilities, some products introduce robust extensions
for advanced indexing purposes. As an example let us consider two solutions provided
by IBM, i.e. Virtual-Index in Informix [50] and Index Extensions in DB2 [114]. These
tools are dedicated to experienced database programmers who require indexing
mechanisms going beyond standard database capabilities, e.g.:
• creating secondary access methods (i.e. indexing) that provide SQL access to
non-relational and other data that does not conform to built-in access methods
(e.g. a user-defined access method retrieving data from an external location),
• creating specialised index support to take the semantics of structured types into
account,
• introducing various index structures.
Nevertheless, to take advantage of such extensions the user often needs to define
specialised routines, in particular ones responsible for index maintenance (a key generator)
and for performing index scans (a range producer). Therefore, the solutions presented above
do not fulfil the indexing transparency property.
DB2 additionally introduces a transparent indexing technique for semi-structured
XML (Extensible Markup Language) data [48]. The so-called pureXML feature allows
storing well-formed XML documents, kept in their native hierarchical form, in table
columns that have the XML data type. XQuery (XML Query Language), SQL, or a
combination of both can be used to query and update XML data. An index over XML
data indexes a part of a column, according to a definition limited to an XPath (XML
Path Language) expression. Hence, an index key can be the value of an atomic-type
element nested in an XML structure stored in a column. XML data are stored entirely
in table columns, so modifications to XML data can easily be reflected in the index.
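As a toy illustration (not DB2's pureXML implementation; document ids and paths are hypothetical), an index over XML data maps atomic values reached by a simple path expression to the identifiers of the documents containing them:

```python
# Build a key -> document-ids index from values selected by a path
# expression, using ElementTree's limited XPath subset.
import xml.etree.ElementTree as ET

docs = {
    1: "<order><customer><name>Kuc</name></customer></order>",
    2: "<order><customer><name>Nowak</name></customer></order>",
}

def build_xml_index(docs, path):
    index = {}
    for doc_id, xml in docs.items():
        root = ET.fromstring(xml)
        for node in root.findall(path):
            index.setdefault(node.text, set()).add(doc_id)
    return index

idx = build_xml_index(docs, "./customer/name")
```

Because each document lives entirely in one "column" value, an update to a document only requires removing and re-inserting that document's entries.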
The author did not encounter any transparent indexing solutions that would
enable indexing using keys more advanced than the solutions presented above. The next
subsection focuses on an evaluation of the function-based index technique, which is one
of the most advanced and the most relevant to the author's work.
2.5.1 Oracle’s Function-based Index Maintenance
In order to verify the properties of function-based index maintenance, we
introduce the following example database schema (Fig. 2.3):
Fig. 2.3 Example object-relational schemata
The method getTotalIncomes of EmpType returns the value of the salary attribute of a
tuple. It is overloaded in empStudentType in order to take the value of
scholarship into account. The emp table consists of tuples of both types. Creating an index on such a
method associated with a table is straightforward:
CREATE INDEX emp_gettotalincomes_idx ON emp e
(e.getTotalIncomes());
Such an index is automatically used by the query optimiser. Efficiency of the selection
process is improved not only through reducing the number of processed rows but also
through avoiding method invocation since calculated results can be taken from an index.
The index efficacy has been tested on a series of simple queries. Modifications to the salary
or the scholarship attribute, e.g.
UPDATE emp e SET e.salary = 1500 WHERE e.name = 'KUC';
trigger appropriate changes in the index. As anticipated, the time of such a
data alteration deteriorates after adding the emp_gettotalincomes_idx index. Processing
an update takes more than three times longer, because automatic index maintenance
needs to alter the corresponding index entries. As far as it was possible to check, the tests have shown that
the created index works correctly.
Unfortunately, defining an index on method calls in Oracle reveals some
unexpected disadvantages. The index update operations are also triggered by the
modification of any name attribute in the emp collection. Hence, after creating the
emp_gettotalincomes_idx index, altering any attribute of an emp tuple is similarly more
than three times slower. This is caused by unnecessary index updating routines: the
Oracle approach to index updating in the case of method-based indices consists in
triggering index update routines during modifications done to any data in a tuple with
associated index entries.
The disadvantage mentioned above grows into a serious problem when the
method used to define an index key accesses data outside the indexed tuples. For example,
the method getYearCost of DeptType has the following definition:
CREATE OR REPLACE TYPE BODY dept_type IS
  MEMBER FUNCTION getyearcost RETURN NUMBER DETERMINISTIC IS
  BEGIN
    DECLARE
      counter NUMBER;
    BEGIN
      SELECT sum(salary) INTO counter
        FROM emp e WHERE e.dept.name = self.name;
      RETURN counter * 12;
    END;
  END;
END;
It accesses not only the data of the given DeptType tuple but also reaches the emp collection.
Oracle also enables indexing the dept collection according to the getYearCost method:
CREATE INDEX dept_getyearcost_idx ON dept d (d.getYearCost());
As in the case of emp_gettotalincomes_idx, a command altering dept tuples
triggers an update of the index. However, any modifications done to emp tuples, e.g.
INSERT INTO EMP
SELECT emp_type ('John Smith', 350, REF(d))
FROM DEPT d WHERE d.name = 'HR';
are not taken into consideration and the dept_getyearcost_idx index loses consistency with
the data. Unfortunately, queries which use the index, e.g.:
SELECT d.name, d.getyearcost() FROM DEPT d
WHERE d.getyearcost() < 24500;
can return incorrect answers, since the selection process and the final results depend on the
index contents. Hence, the applied index updating solution is not suitable for handling
indices with keys based on “too complex” methods. In practice, the function-based
index feature in Oracle can lead to erroneous behaviour of database queries and
applications.
The reference dept in EmpType associates employee tuples with departments. It
can be used to formulate selection predicates employing path expressions, e.g.:
SELECT e.name FROM emp e where e.dept.name = 'HR';
Nevertheless, using such path expressions to define an index is forbidden:
CREATE INDEX emp_deptname_idx ON emp e (e.dept.name);
ORA-22808: REF dereferencing not allowed
because it would require accessing a tuple from another table, which obviously would
make the index maintenance impossible.
2.6 Global Indexing Strategies in Parallel Systems
Various indexing approaches have been developed in distributed systems over
the last two decades. Most of the interesting solutions have been implemented in the
domain of p2p networks [104, 107].
Work [124] introduces a detailed taxonomy of indexing strategies (described as
index partitioning schemes) for distributed DBMSs. It analyses index maintenance
strategies and storage requirements in the context of data partitioning in a relational
system. It assumes that an index is partitioned over the same nodes as the data. The factors considered
as the foundations of the given taxonomy are:
• the degree of index replication between system nodes (non-, partial-, full-),
• index partitioning in the context of data partitioning, i.e. the method determining how index entries are distributed among system nodes.
Generally, the local indexing strategy implies that indices are built locally on the local data.
Distributed indexing occurs when the partitioning of the index differs from the partitioning
of the data. The taxonomy, however, omits the centralised indexing strategy, which is very
important.
Local data indexing is the most common optimisation method used in database
systems. Moreover, it is also applicable to indexing the data of a single peer in a distributed
environment. There are several advantages of the local indexing strategy in a
distributed database environment. The knowledge of indices existing in local stores
need not be available at the level of the global schema. A query addressing the global
schema can in many cases be decomposed, during optimisation, into subqueries addressing
particular servers. Such a subquery concerns data stored locally on a
target site. Before evaluation, it can be optimised by a local optimiser in order
to take advantage of existing local indices. Global query optimisation is divided
between servers, and the global optimiser need not take local
optimisations into account. Consequently, local indexing is transparent for global applications. Since
data and indices are located on the same machine and in the same repository, the
implementation of all indexing mechanisms, including index management and
maintenance, is standard and far less complex than indexing techniques for a
distributed environment.
However, local indexing does not always exploit the full computational
power of a distributed database. Global indices can be kept in a global store. This
approach has significant potential for the optimisation of global queries. Idle time of
a global store can be devoted to indexing and cataloguing data held by local servers.
From the user’s point of view, distributed technology should satisfy the
following general requirements: transparency, security, interoperability, efficiency and
pragmatic universality. Distributed or federated databases and data-intensive grid
technologies, which can be perceived as their successors, aim at providing transparency
in many forms: location, concurrency, implementation, scaling, fragmentation,
replication, failure transparency, etc. [39]. Transparency is the most important
feature for reducing the complexity of a design and for supporting the programming and
maintenance of applications addressing distributed data and services. It greatly reduces
the complexity of a global application. One form of transparency concerns
indexing. As in centralised databases, programmers should not explicitly involve indices
in the application code. Any performance enhancements should be on
the side of database tuning, which is the job of database administration. There are several
important aspects connected with transparent indexing in distributed databases:
• location and access transparency - the geographical location of indices should not affect the users’ work,
• scaling and migration transparency - indices should be maintained in such a way that server data may be migrated, added or removed without any impact on the consistency of applications,
• failure transparency - indices should be updated or migrated if some of the nodes fail,
• implementation and fragmentation transparency - the user need not know how indices are implemented or partitioned,
• concurrency transparency - users can access indexed resources simultaneously and need not know that other users exist.
The next sections discuss the basic properties of centralised and distributed
approaches to indexing.
2.6.1 Central Indexing
The most common practice for indexing distributed resources is dedicating one
server as an index repository. This strategy is called central indexing and has certainly
proved its value in many internet applications. It played a particularly important role in
the development of p2p networks. For example, Napster, an application for
sharing music files, used a directory server to locate desired resources [104, 107].
The features of this approach include:
• a small amount of necessary communication,
• efficiency for selective queries,
• simplicity of the architecture.
However, there are also some disadvantages resulting from central indexing.
The indexing server becomes a single point of failure. Moreover, the query evaluation
performance deteriorates if the server is overloaded (i.e. too many clients use the index
simultaneously) or fails. Also, this approach does not take advantage of parallel
computations.
2.6.2 Strategies Involving Decentralised Indexing
In the Gnutella [38] p2p network each participating node is responsible for
answering and forwarding search requests (a so-called flooded request model). It is an
example implementation of the local indexing strategy. However, the features of the
Napster solution have proved to be superior, resulting in better performance than
Gnutella’s.
An efficient option for decentralised indexing is the use of global distributed
and parallel indices, e.g. SDDS (see section 2.2.2). These kinds of indices assume that a
searched key value points to another server, where it can be further forwarded or where the desired
non-key values can be found. A simple example of such a technique could be indexing
employees by their profession: one server stores references to all employees whose
profession starts with the letter A, another server those starting with the letter B, etc.
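The letter-range scheme above can be sketched as follows; the server names, the routing rule and all data are illustrative assumptions, not part of any cited system:

```python
# Hypothetical sketch of a global index partitioned by key range: each server
# is responsible for professions starting with letters in a given range.
from collections import defaultdict

class DistributedProfessionIndex:
    def __init__(self, servers):
        # servers: list of (server_name, first_letter, last_letter), inclusive
        self.servers = servers
        self.entries = defaultdict(list)  # server_name -> [(profession, employee_ref)]

    def _route(self, profession):
        # Determine which server covers the first letter of the key.
        letter = profession[0].upper()
        for name, lo, hi in self.servers:
            if lo <= letter <= hi:
                return name
        raise KeyError(f"no server covers letter {letter!r}")

    def insert(self, profession, employee_ref):
        self.entries[self._route(profession)].append((profession, employee_ref))

    def lookup(self, profession):
        # Only the one responsible server is contacted.
        server = self._route(profession)
        return [ref for p, ref in self.entries[server] if p == profession]

index = DistributedProfessionIndex([("srv1", "A", "M"), ("srv2", "N", "Z")])
index.insert("Analyst", "emp#1")      # routed to srv1
index.insert("Programmer", "emp#2")   # routed to srv2
```

A lookup for "Programmer" touches only srv2, which is the point of such range partitioning: selective queries involve a single node while different key ranges can be served in parallel.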
The performance comparison of local indexing strategies (described as partial indexes) and distributed indexing (referred to as partitioned global indexes) in query
processing of horizontally fragmented data is the topic of [70]. The evaluation was in
favour of the strategy utilising a distributed index. A similar investigation has been
performed in the context of an inverted index for parallel text retrieval systems [19].
The conducted research indicated that the local index strategy should be preferred
when queries exploiting indices are infrequent.
The advantages of the distributed indexing strategy over the centralised
one are the following:
• it uses the computing potential of a grid (enabling parallel query evaluation),
• it is insensitive to overloading,
• it decentralises the necessary communication.
The organisation and architecture of such an index are more complex than
those of a central index. Sites can dynamically join or leave a community, forcing the
reorganisation of part or the whole of the index. Achieving scalability in index
distribution requires advanced algorithms and data structures whose
complexity can have a disadvantageous impact on index performance. It is common to
all global indexing techniques that index positions stored on server X are not associated
with data stored on server X, so maintaining consistency between data and an index
is more difficult and has to be done at the global level.
Some works, e.g. [7], consider indexing schemas for a distributed page-server
OODB, recognizing local caching of a centralised index as a distributed indexing strategy.
Nevertheless, this technique does not introduce a significant performance improvement
for parallel query processing.
2.7 Distributed DBMSs
Despite the relatively large number of distributed relational and object-oriented DBMSs, only a small fraction of them have global indexing capabilities. The most
advanced solutions are based on index partitioning, e.g. SQL Server and Oracle.
In databases, partitioning usually refers to tables or indices. The common model
of table partitioning in distributed databases relies on a static division of data into
independent datasets [92]. Data are partitioned horizontally by “declustering”
relations based on a function (usually a hash function or a range index). This kind of
partitioning is static, since the rules assigning datasets to designated partitions do not change
without an administrator’s interference. With a hash function, the data can be partitioned
according to one attribute or a combination of several attributes. Such an approach
enables efficient processing of exact-match queries, often independently within
only one partition.
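A minimal sketch of such hash-based horizontal partitioning; the hash function, the attribute and the data are illustrative assumptions, not those of any particular DBMS:

```python
# Static horizontal partitioning: each row goes to the partition determined by
# hashing one attribute, so an exact-match query touches a single partition.

def partition_of(key, n_partitions):
    # A toy deterministic hash; real DBMSs use their own hash functions.
    return sum(key.encode()) % n_partitions

partitions = [[] for _ in range(3)]
rows = [("Kowalski", 1200), ("Kuc", 1000), ("Smith", 350)]
for name, salary in rows:
    partitions[partition_of(name, 3)].append((name, salary))

# An exact-match query on the partitioning attribute visits only one partition:
target = partition_of("Kuc", 3)
result = [r for r in partitions[target] if r[0] == "Kuc"]
```

Note that the assignment rule is fixed: rows move between partitions only if an administrator changes the function or the number of partitions, which is exactly the static character described above.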
As a representative example, Oracle’s approach to index partitioning is
discussed in this subchapter. Its details are described in [82, 133]. Oracle enables
creating a partitioned index for both partitioned and non-partitioned tables. On the other
hand, a partitioned table can have partitioned and non-partitioned indices. If the key
partitioning a table is identical to the key partitioning a corresponding index, then the
index is local. In the remaining cases, we deal with a global index. Nevertheless,
partitioning of tables uses the same mechanisms, so the number of index partitions in
the case of global or local indices usually does not differ. Moreover, local indexing is
superior to global indexing in the efficiency of index maintenance. Consequently, it is the most
commonly used indexing strategy. Partitioned indices inherit the majority of regular index
features, e.g. they can be defined using function-based expressions. In Oracle’s
approach, partitioning is not a means of data integration and partitions are not managed
autonomously. Therefore, a global or local partitioned index can be created only on an
entire table.
The research [45] concerns the architecture of a multi-key distributed index. It
proposes a distributed index composed of two types of index structures: a Global Index
(GI) and Local Indices (LI). The GI is managed at the distributed database level and
each LI is created and maintained by local database components. Within this architecture
different indexing aspects are described, e.g. query optimisation, index implementation
and maintenance (referred to as dynamic adaptation [44]), together with a performance
evaluation. Generally, the capabilities of the presented approach do not surpass
the Oracle index partitioning solution presented above.
The SDDS index structure was employed in the SD-SQL (Scalable Distributed SQL) Server database [106]
in order to distribute data dynamically and transparently between separate database
instances. Table rows are moved between sites by SDDS algorithms according to
primary key values. This approach overcomes the limitations of static table partitioning,
improving data load balancing. SD-SQL Server automatically manages and accordingly
queries database instances. This solution is built on top of SQL Server using database
stored procedures.
Chapter 3
The Stack-based Approach
The Stack-based Approach (SBA) [1, 117, 118] is a formal methodology
concerning the construction and semantics of database query languages, especially
object-oriented ones. SBA is a coherent theory that enables creating a powerful query
language for practically any known data model. The basic assumption behind SBA is
that query languages are variants of programming languages. Consequently, notions,
concepts and methods developed in the domain of programming languages should be
applied also to query languages. In particular, the main semantic and implementation
notion in the majority of programming languages is the environment stack. It is an
elementary structure used for defining a name space, binding names, calling procedures
[120] (including recursive calls), passing parameters and supporting object-oriented
notions such as encapsulation, inheritance and polymorphism. The Stack-based
Approach to query languages exploits the environment stack mechanism in order to
define and implement operators specific to query languages, such as selection,
projection, navigation, join and quantifiers. Taking advantage of the semantics based on
the environment stack, SBA makes it possible to achieve full orthogonality and
compositionality of the operators. Moreover, SBA enables seamless integration of the query
language with imperative constructs and other programming abstractions, including
procedures, types and classes. This chapter contains a brief description of basic SBA
notions and presents the model query language SBQL (Stack-Based Query Language)
developed according to SBA. More details on SBA and SBQL can be found in [117,
118].
3.1 Abstract Data Store Models
SBA deals with several universal models of object stores. Depending on their
complexity, they are referred to as abstract store models AS0, AS1, AS2 and AS3
(previously M0, M1, M2 and M3 were used, correspondingly). Each successive model
extends the previous one with new features. The mentioned models do not exhaust
all possibilities; however, they cover most of the currently known ones.
3.1.1 AS0 Model
The AS0 model is built according to the relativity and internal object identification
principles. It is a very simple data store model that is capable of representing semi-structured data [121]. In AS0 each object comprises an internal identifier (implicit for
the programmer), an external identifier (an object name available to the programmer)
and a value. There are three kinds of objects: atomic, reference and complex. Assuming
that I denotes the set of all acceptable internal identifiers, N the set of acceptable
external names of objects, V the set of simple values like numbers, strings, etc., and O
denotes any set of AS0 objects, we can define objects as the following triples (where i1,
i2 ∈ I, n ∈ N and v ∈ V ):
• Atomic objects <i1, n, v> - the simplest kind of objects. They are identified by the internal identifier i1, have the name n and hold an atomic value v.
• Reference objects <i1, n, i2> - they model relations between objects. As in the previous case, they are identified by an internal identifier i1 and have a name n. Their value is an identifier i2 referring to some object.
• Complex objects <i1, n, O> - used to model object nesting. An object with an internal identifier i1 and name n consists of the objects belonging to O. Elements of O are considered subobjects of the object having i1 as its identifier.
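The three kinds of AS0 objects can be rendered, for illustration only, as Python tuples; the identifiers and names are invented, and the ("ref", …) tag is a convenience assumed here to distinguish reference values from atomic strings:

```python
# Toy AS0 triples <i, n, v>: atomic, reference and complex objects.
atomic      = ("i62", "name", "Marek")                     # <i1, n, v>
reference   = ("i70", "worksIn", ("ref", "i131"))          # <i1, n, i2>
complex_obj = ("i66", "address",                           # <i1, n, O>
               frozenset({("i67", "city", "Kraków"),
                          ("i68", "street", "Bracka")}))

def kind(obj):
    # Classify a triple by the shape of its value component.
    _, _, value = obj
    if isinstance(value, frozenset):
        return "complex"                 # value is a set O of subobjects
    if isinstance(value, tuple) and value[0] == "ref":
        return "reference"               # value is another object's identifier
    return "atomic"                      # value is a simple value from V
```

The relativity principle is visible here: a complex object's value is simply a set of further triples, so nesting to any depth needs no extra machinery.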
3.1.2 Abstract Store Models Supporting Inheritance
The AS1 store model extends AS0 with classes and static inheritance. A class is
a plain complex object containing subobjects which represent the invariants of a certain
group of objects. Additionally, an inheritance relation between class objects can be
defined. Apart from the inheritance relation, there is a relation defining an object's
membership in a corresponding class.
The AS2 model introduces the notion of an object's dynamic role. Each object can
be associated with one or more such roles. If an object is the owner of a role, its situation
is similar to being a class instance. However, whereas inheritance has a static character,
during run-time an object can take on new roles and lose old ones, and inheritance between roles is
dynamic [56, 98].
The AS3 model extends the AS1 (AS3.1) or AS2 (AS3.2) model with an
encapsulation mechanism. It is assumed that each class can be equipped with an export
list, which is a set of class field names that are explicitly visible outside the
class instances. Other fields are not visible and are treated as private.
Page 51 of 181
Chapter 3
The Stack-based Approach
3.1.3 Example Database Schema
The example schema in Fig. 3.1 is introduced as a basis for the
conceptual examples in this and the following chapters. The abstraction level of the
schema corresponds to the AS1 store model described in the previous section. The
schema consists of hierarchical objects, pointer links between objects, classes, static
inheritance and multiple inheritance. These are the most relevant elements from the
point of view of object-oriented modelling. The indexing solution for the ODRA
database management system supports many features that result from adapting the AS1
store model.
[Fig. 3.1 (class diagram): PersonClass (name : String, surname : String, age : Integer, married : Boolean, getFullName() : String); StudentClass (scholarship : Integer, getFullName() : String, getScholarship() : Integer, setScholarship(value : Integer)) and EmpClass (salary : Integer, getFullName() : String, getTotalIncomes() : Integer) specialise PersonClass; EmpStudentClass (getFullName() : String, getTotalIncomes() : Integer) specialises both. Dept : DeptType (name : String) is linked with Emp by the worksIn (1) / employs (0..*) associations; each Dept and Person contains an address : AddressType (city : String, street : String, zip : Integer[0..1]).]
Fig. 3.1 Example of an object-oriented database schema for a company
The example schema illustrates the personnel records of a company. It introduces
several classes PersonClass, StudentClass, EmpClass, EmpStudentClass and two
structure types DeptType and AddressType. Persistent instances of the classes
mentioned above can be accessed using their instance names Person, Student, Emp and
EmpStudent. Objects called Dept have the DeptType structure with a primary
attribute name and represent departments of the company. Each Person object stands for
a person connected with the company. Its attributes provide basic information:
name, age and marital status. Additionally, each Dept and Person object includes an
address subobject which specifies a city, a street name and optionally a zip code
according to the AddressType structure. Instances of EmpClass represent current
employees of the company and extend the Person object attributes with the salary attribute.
Emp and Dept objects are associated by references. The worksIn reference of an Emp
object leads to a department. Dept objects contain employs references to department
employees. Another class which extends PersonClass is StudentClass. Its
objects refer to students who are granted a scholarship by the company. For that reason,
this class introduces the scholarship attribute. The last class presented in the schema is
called EmpStudentClass and, as its name suggests, it inherits from EmpClass and
StudentClass. It is introduced to represent students who are simultaneously employees
of the company. In SBQL, using the name Person returns all instances of
PersonClass and its subclasses. Similarly, via the name Emp the programmer refers to both
EmpClass and EmpStudentClass instances.
Besides attributes, classes are composed of methods. Taking advantage of
polymorphism, some methods are overridden in derived subclasses. E.g. the
getTotalIncomes() method of EmpClass returns the value of the salary attribute, but for
instances of EmpStudentClass it returns the sum of the salary and scholarship attributes.
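The overriding described above can be sketched in Python; this is a hypothetical rendering of the schema for illustration, not ODRA code:

```python
# Sketch of the schema's polymorphism: getTotalIncomes is overridden so that
# EmpStudent instances add the scholarship to the salary.

class Emp:
    def __init__(self, salary):
        self.salary = salary

    def get_total_incomes(self):
        return self.salary

class EmpStudent(Emp):
    def __init__(self, salary, scholarship):
        super().__init__(salary)
        self.scholarship = scholarship

    def get_total_incomes(self):            # overridden in the subclass
        return self.salary + self.scholarship

# As in SBQL, iterating over "Emp" covers EmpStudent instances as well,
# and the late-bound method picks the right variant for each object.
emps = [Emp(1200), EmpStudent(1000, 500)]
incomes = [e.get_total_incomes() for e in emps]   # [1200, 1500]
```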
3.1.4 Example Store with Static Inheritance of Objects
Referring to the data schema in Fig. 3.1 we introduce the example store shown
in Fig. 3.2, consistent with the AS1 model (cf. section 3.1.2), presenting classes and
objects, their values, identifiers and the most important relations between them. An
identifier is a property of every database entity. This sample store consists of two
objects of DeptType type and two instances of EmpClass (one of them is also an
EmpStudentClass instance). One Emp object describes Marek Kowalski – a person who
works in a CNC department. The EmpStudent object depicts Piotr Kuc, a student who is
employed by the HR department. Classes PersonClass and StudentClass are omitted but
according to the schema in Fig. 3.1 they are present in the database.
[Fig. 3.2 (object diagram): classes EmpClass (i11) with methods getFullName (i13) and getTotalIncomes (i14), and EmpStudentClass (i21) with getFullName (i23) and getTotalIncomes (i24). An Emp object (i61): name "Marek" (i62), surname "Kowalski" (i63), age 28 (i64), married true (i65), address (i66) with city "Kraków" (i67), street "Bracka" (i68), zip 99999 (i69), worksIn (i70), salary 1200 (i71). An EmpStudent object (i31): name "Piotr" (i32), surname "Kuc" (i33), age 30 (i34), married false (i35), address (i36) with city "Warszawa" (i37), street "Koszykowa" (i38), scholarship 500 (i39), worksIn (i40), salary 1000 (i41). Dept objects: (i131) with name "CNC" (i132) and address (i133: city "Opole" (i134), street "Wiejska" (i135), zip 80043 (i136)), and (i141) with name "HR" (i142) and address (i143: city "Kraków" (i144), street "Reymonta" (i145), zip 08797 (i146)); both hold employs references (i151, i152).]
Fig. 3.2 Sample store with classes and objects
3.2 Environment and Result Stacks
The semantics of a query language in the Stack-based Approach is explained
using two stacks: the environment stack (which was mentioned earlier) and the result
stack.
The environment stack (ENVS) controls the name binding space. This stack
consists of sections and each section holds binders. A binder is a construct used
to bind a name to an appropriate run-time entity. It is assumed that binders are
written as n(r), where n ∈ N, r ∈ R; the set R denotes all possible query results. This
brings us to the second stack, the result stack (QRES). It is used for storing temporary
and final query results.
The following r elements belong to the set R of query results:
• Value results (number, character string, logical value, date, etc.) - results of literal expressions; they also arise through dereference (the process of acquiring a value) of atomic database objects.
• Reference results (identifiers of internal objects) - plain results of expressions referring to database objects through names. They are usually results of name binding; however, they can also appear through dereference of reference objects.
• Binder results (the pairs n(r) mentioned earlier, where n ∈ N, r ∈ R) - created when operators introducing an auxiliary name are used (as, groupas) or as a result of the nested(iX) operation (described further), where iX is an identifier of a reference or complex object.
• Structure results (struct{r1, r2, …, rn}, where r1, r2, …, rn ∈ SR and SR ⊂ R is the set of query results which are not collections) - such results are sequences of single results (elements of SR). Structures are usually created by the comma expression, a join, or as a result of dereference of a complex object.
• Collections of single results (bag{r1, r2, …, rn}, sequence{r1, r2, …, rn}, where r1, r2, …, rn ∈ SR) - they consist of any elements of the R set except other collections. Result collections can be nested in other collections only if they are values of binders. Collections are typically created as a result of binding names or using set operators (e.g. union). There are two main collection types: order-preserving sequences and unordered bags. Nevertheless, other collections can be introduced if necessary, e.g. an array.
The following sections describe the bind and nested operations, which are defined
using the stacks presented above. All these SBA elements are essential from the
point of view of the author's work.
3.2.1 Bind Operation
Each name occurring in a query is bound to an appropriate run-time entity
according to the name binding space. Name binding is performed using the so-called bind
operation. This operation works on the environment stack in order to find appropriate
binders in its sections. At the beginning of a query evaluation, ENVS comprises one
section (the base section) which holds binders to all database root objects. During
query evaluation new sections, empty or holding several binders, are pushed onto or popped
off the environment stack, but the base section remains untouched. Generally, binding of
a name n consists in searching ENVS from top to bottom for the first
section which holds at least one binder with the name n. Since binder names
can repeat inside one section, the result of a binding operation may be a
collection of all found binder values. In particular, if no section holds binders
with the name n, an empty collection is returned.
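The binding procedure can be sketched as follows, assuming for illustration that ENVS is a list of sections (bottom to top), each section a list of (name, value) binders; all identifiers are invented:

```python
# Sketch of the bind operation: search ENVS top-down for the first section
# holding at least one binder with the given name; return all its matching
# binder values (possibly an empty collection).

def bind(envs, name):
    for section in reversed(envs):              # top of the stack first
        values = [v for (n, v) in section if n == name]
        if values:
            return values                       # all binders of that section
    return []                                   # no section holds the name

envs = [
    [("Emp", "i61"), ("Emp", "i31"), ("Dept", "i131")],  # base section
    [("salary", "i71"), ("name", "i62")],                # pushed section
]
bind(envs, "Emp")     # two binders found in the base section
bind(envs, "salary")  # one binder found in the top section
```

Note that the search stops at the first section containing the name, so a binder near the top of the stack shadows same-named binders below it, while repeated names within one section yield a collection.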
3.2.2 Nested Function
The nested function formalises all cases that require pushing new sections onto
ENVS, in particular the concept of pushing the interior of an object. This function takes
any query result as a parameter and returns a set of binders. The following results of the
nested operation are defined depending on the kind of the parameter:
• Reference to a complex object - the result is a set of binders created from the subobjects of the given complex object. For each subobject a binder is created whose name is the subobject's name and whose value is the subobject's internal identifier.
• Reference to a pointer object - the result is a set holding one binder whose name is the name of the object pointed to by the pointer and whose value is the internal identifier of the pointed object.
• Binder - the result is a set holding the identical binder.
• Structure - the result is the union of the results of the nested function applied to all elements of the structure.
• In other cases, the result is the empty set.
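The cases above can be sketched over a toy AS0-like store; the store layout (dictionaries with "sub" and "ref" keys, binders as pairs, structures as frozensets) is an assumption made purely for illustration:

```python
# Sketch of the nested function: given a query result, return the set of
# binders that would be pushed on ENVS. The store maps identifiers to
# (name, value) pairs; complex objects list subobject ids under "sub",
# pointer objects hold a target id under "ref", atomic objects hold a value.

store = {
    "i66": ("address", {"sub": ["i67", "i68"]}),
    "i67": ("city", "Kraków"),
    "i68": ("street", "Bracka"),
    "i70": ("worksIn", {"ref": "i131"}),
    "i131": ("Dept", {"sub": []}),
}

def nested(result):
    if isinstance(result, str) and result in store:      # a reference
        _, value = store[result]
        if isinstance(value, dict) and "sub" in value:   # complex object
            return {(store[i][0], i) for i in value["sub"]}
        if isinstance(value, dict) and "ref" in value:   # pointer object
            target = value["ref"]
            return {(store[target][0], target)}
    if isinstance(result, tuple) and len(result) == 2:   # a binder n(r)
        return {result}
    if isinstance(result, frozenset):                    # a structure
        out = set()
        for r in result:                                 # union over fields
            out |= nested(r)
        return out
    return set()                                         # other cases
```

For example, `nested("i66")` yields binders for the city and street subobjects, which is exactly the "interior of an object" that selection and navigation operators push onto ENVS before evaluating their right-hand subquery.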
3.3 SBQL Query Language
Queries in the Stack-based Approach are treated in the same way as traditional
programming languages treat expressions. Therefore, in this thesis the terms expression
and query are used interchangeably.
Even though SBA is independent of syntax, in order to explain some semantic
constructs the abstract syntax called SBQL (Stack-Based Query Language) is used.
Stack-Based Query Language is a formalised object-oriented query language in the SQL
or OQL style; however, its syntax has been significantly reduced, particularly to avoid
large syntactic constructs like select…from…where known from SQL.
3.3.1 Expressions Evaluation
SBQL expressions follow the compositionality principle, which means that the
semantics of a query is a function of the semantics of its components, recursively.
As in programming languages, the simplest queries are names and literals. More
complex queries are created by freely connecting subqueries with operators
(provided typing constraints are preserved). There are no constraints on nesting
queries.
SBA uses operational semantics in order to define operators. The most important
SBQL operators and their semantics are described in the tables below.
Tab. 3-1 Evaluation of traditional arithmetic operators

Unary operators: +
Evaluation steps:
1. Execute the subquery.
2. Take the result from QRES.
3. Verify it is a single result (if not, a run-time exception is raised).
4. For a reference result, dereference is performed.
5. Execute the appropriate operation on the value.
6. Push the final result on QRES.

Binary operators: + - * / = != < <= > >= or and
Evaluation steps:
1. Execute both subqueries in sequence.
2. Take both results from QRES.
3. Verify they are single results (if not, a run-time exception is raised).
4. For each reference result, dereference is performed.
5. Execute the appropriate operation on the values.
6. Push the final result on QRES.
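The binary-operator steps of Tab. 3-1 can be sketched over an explicit QRES; the toy store, the dereference rule and the representation of subqueries by ready results are illustrative assumptions:

```python
# Sketch of binary-operator evaluation over an explicit result stack (QRES),
# with a toy store for dereferencing reference results (ids like "i71").

store = {"i71": 1200, "i41": 1000}
qres = []

def deref(r):
    # Step 4: a reference result is replaced by the stored value.
    return store[r] if isinstance(r, str) and r in store else r

def eval_binary(op, left_result, right_result):
    # 1. "Execute" both subqueries in sequence (here: push their results).
    qres.append(left_result)
    qres.append(right_result)
    # 2. Take both results from QRES (right was pushed last).
    right, left = qres.pop(), qres.pop()
    # 3./4. Single results assumed; dereference references.
    left, right = deref(left), deref(right)
    # 5./6. Execute the operation and push the final result on QRES.
    qres.append(op(left, right))
    return qres[-1]

eval_binary(lambda a, b: a + b, "i71", "i41")   # salaries are dereferenced first
eval_binary(lambda a, b: a > b, "i71", 1500)    # reference compared with a literal
```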
Tab. 3-2 Evaluation of operators working on collections

Structure constructor operator: , (comma)
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute both subexpressions in sequence.
3. Take both results from QRES (first e2res and next e1res).
4. For each element (e1) of the e1res result do:
   4.1. For each element (e2) of the e2res result do:
        4.1.1. Create the structure {e1, e2}. If e1 and/or e2 is a structure then its fields are used.
        4.1.2. Add the structure to eres.
5. Push eres on QRES.
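The comma operator's Cartesian-product semantics can be sketched as follows, modelling structures as Python tuples and single results as 1-tuples (an assumption for illustration):

```python
# Sketch of the comma operator: a Cartesian product of the two subresults,
# flattening the fields of elements that are already structures.

def as_struct(e):
    # A single result is treated as a one-field structure.
    return e if isinstance(e, tuple) else (e,)

def comma(e1res, e2res):
    eres = []                                        # 1. empty bag
    for e1 in e1res:                                 # 4. each element of e1res
        for e2 in e2res:                             # 4.1. each element of e2res
            eres.append(as_struct(e1) + as_struct(e2))  # {e1, e2}, fields reused
    return eres                                      # 5. pushed on QRES

comma([1, 2], ["a", "b"])   # four two-field structures
comma([(1, 2)], ["x"])      # the existing structure's fields are flattened
```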
Bag and sequence constructors: bag sequence
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute the subquery.
3. Take the result from QRES.
4. The result is treated as a structure and each structure field is added to eres.
5. Push eres on QRES.
Existence operator: exists
Evaluation steps:
1. Execute the subquery.
2. Take the result from QRES.
3. Push false on QRES if the result is an empty collection, otherwise true.
Removing duplicates: unique, uniqueref
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute the subquery.
3. Take a result collection from QRES (colres).
4. For each element (el) of the colres result do:
   4.1. If there is no element in eres equal to el then add el to eres.
5. Push eres on QRES.
In order to evaluate the unique operator, elements from colres are subjected to the dereference operation if necessary.
Sum of sets: expr1 union expr2
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute both subexpressions in sequence.
3. Take both results from QRES.
4. Insert all elements from both results into eres.
5. Push eres on QRES.
Traditional set operators: expr1 minus expr2, expr1 intersect expr2
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute both subexpressions in sequence.
3. Take both results from QRES (first e2res and next e1res).
4. For each element (e1) of the e1res result do:
   4.1. In case of minus: if e2res does not contain an element equal to e1, add e1 to eres.
        In case of intersect: if e2res contains an element equal to e1, add e1 to eres.
5. Push eres on QRES.
In order to compare elements e1 and e2, the operator performs the necessary dereference operations.
Inclusion operator: expr1 in expr2
Evaluation steps:
1. Execute both subexpressions in sequence.
2. Take both results from QRES (first e2res and next e1res).
3. For each element (e1) of the e1res result do:
   3.1. If e2res does not contain an element equal to e1 then the false logical literal is pushed on QRES and evaluation of the operator is stopped.
4. Push the true logical literal on QRES.
Traditional aggregate operators: sum, min, avg, max, count
Evaluation steps:
1. Execute the subquery.
2. Take a result collection from QRES (colres).
3. The final result is initialised (0 in case of the sum and count operators, the value of the first colres collection element otherwise).
4. For each element (el) of the colres result do:
   4.1. Suitably to the given operator, the final result is updated considering the el element or its value.
5. Push the final result on QRES.
In order to evaluate the sum, min, max operators, elements from colres are subjected to the dereference operation if necessary. The evaluation of the avg operator consists of the evaluation of the sum and count operators.
Tab. 3-3 Evaluation of non-algebraic SBQL operators
Projection/navigation: leftquery . (dot) rightquery
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute the left subquery.
3. Take a result collection from QRES (colres).
4. For each element (el) of the colres result do:
   4.1. Open a new section on ENVS.
   4.2. Execute the function nested(el).
   4.3. Execute the right subquery.
   4.4. Take its result from QRES (elres).
   4.5. Insert the elres result into eres.
5. Push eres on QRES.
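The interplay between ENVS and the dot operator can be sketched in Python under heavy simplifying assumptions: complex objects are dicts whose values are bags (lists), ENVS is a list of binder sections, nested(el) returns the interior of a complex object, and the right subquery is a bare name resolved by binding on ENVS (the section is closed after each iteration, which the table above leaves implicit):

```python
ENVS = []                                    # environment stack

def nested(el):
    # For a dict (complex object) the nested environment binds its fields.
    return el if isinstance(el, dict) else {}

def bind(name):
    # Search ENVS from the top section downwards for the name (simplified).
    for section in reversed(ENVS):
        if name in section:
            return section[name]
    raise NameError(name)

def eval_dot(colres, right_name):
    eres = []                                # 1. initialise an empty bag
    for el in colres:                        # 4. for each element el
        ENVS.append(nested(el))              # 4.1-4.2. open section, nested(el)
        elres = bind(right_name)             # 4.3-4.4. execute right subquery
        ENVS.pop()                           #          close the section
        eres.extend(elres)                   # 4.5. insert elres into eres
    return eres                              # 5. (pushed on QRES)

emps = [{"name": ["Kim"], "salary": [1200]},
        {"name": ["Lee"], "salary": [1500]}]
print(eval_dot(emps, "salary"))              # [1200, 1500]
```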
Selection: leftquery where rightquery
Evaluation steps:
Similarly as in the case of the dot operator, except for:
   4.5. Verify whether elres is a single result (if not, a run-time exception is raised).
   4.6. If elres is equal to true, add el to eres.

Dependent/navigational join: leftquery join rightquery
Evaluation steps:
Similarly as in the case of the dot operator, except for:
   4.5. Perform the Cartesian product operation on el and elres.
   4.6. Insert the obtained structure into eres.
Universal quantifier: leftquery forall rightquery
Evaluation steps:
1. Execute the left subquery.
2. Take a result collection from QRES (colres).
3. For each element (el) of the colres result do:
   3.1. Open a new section on ENVS.
   3.2. Execute the function nested(el).
   3.3. Execute the right subquery.
   3.4. Take its result from QRES (elres).
   3.5. Verify whether elres is a single result (if not, a run-time exception is raised).
   3.6. If elres is equal to false then the false logical literal is pushed on QRES and evaluation of the operator is stopped.
4. Push the true literal on QRES.
Existential quantifier: exists leftquery such that rightquery
Evaluation steps:
Similarly as in the case of the forall operator, except for:
   3.6. If elres is equal to true then the true logical literal is pushed on QRES and evaluation of the operator is stopped.
4. Push the false literal on QRES.
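The short-circuiting behaviour of both quantifiers can be sketched compactly, with the right subquery modelled as a Python predicate evaluated for each element (a hypothetical simplification of the ENVS machinery described above):

```python
def eval_forall(colres, predicate):
    for el in colres:                 # 3. for each element el
        if not predicate(el):         # 3.5-3.6. a false result ends
            return False              #          evaluation early
    return True                       # 4. push the true literal

def eval_exists_q(colres, predicate):
    for el in colres:
        if predicate(el):             # 3.6. a true result ends evaluation early
            return True
    return False                      # 4. push the false literal

ages = [25, 31, 47]
print(eval_forall(ages, lambda a: a >= 18))   # True
print(eval_exists_q(ages, lambda a: a > 40))  # True
```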
Sorting: leftquery orderby rightquery
Evaluation steps:
1. Execute the join operation.
2. Take the result from QRES.
3. Sort the obtained structures according to the second structure field, then the third, fourth, etc.
4. Create a new collection using the first structure fields.
5. Push the final collection on QRES.
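The orderby procedure reduces to a sort over structures produced by join, followed by a projection onto the first field. A minimal Python sketch, with structures as tuples whose first field is the original element and whose remaining fields are the sort keys:

```python
def eval_orderby(join_result):
    # 3. Sort by the second field, then the third, etc. (Python's sort is
    #    stable, so equal keys keep their original relative order.)
    ordered = sorted(join_result, key=lambda s: s[1:])
    # 4. Create the new collection using the first structure fields.
    return [s[0] for s in ordered]

pairs = [("Kim", 1500), ("Lee", 1200), ("Ada", 1200)]   # (element, salary key)
print(eval_orderby(pairs))                              # ['Lee', 'Ada', 'Kim']
```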
Tab. 3-4 Evaluation of auxiliary names defining operators
Assigning auxiliary names to collection elements: subquery as name
Evaluation steps:
1. Execute the subquery.
2. Take its result from QRES.
3. Replace each element of the obtained collection with a binder using the name given as the operator parameter and the given element as a value.
4. Push the final collection on QRES.

Assigning an auxiliary name to the whole collection: subquery groupas name
Evaluation steps:
1. Execute the subquery.
2. Take its result from QRES.
3. Create a binder using the name given as the operator parameter and the obtained result as a value.
4. Push the binder on QRES.
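The difference between as and groupas can be made concrete with binders sketched as (name, value) tuples, an illustrative representation only: as wraps each element of the collection in its own binder, whereas groupas wraps the whole collection in a single binder.

```python
def eval_as(collection, name):
    # One binder per element of the collection.
    return [(name, el) for el in collection]

def eval_groupas(collection, name):
    # A single binder whose value is the whole collection.
    return (name, collection)

print(eval_as([1, 2], "x"))        # [('x', 1), ('x', 2)]
print(eval_groupas([1, 2], "y"))   # ('y', [1, 2])
```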
Tab. 3-5 Evaluation of sequences ranking operators
Assigning auxiliary ranking binders to a sequence: seqquery rangeas name
Evaluation steps:
1. Execute the seqquery returning a sequence (all sequences are indexed starting from 1).
2. Take its result from QRES (seqres).
3. Replace each element of the obtained sequence (seqres) with a structure consisting of:
   • the given element and
   • a binder using the name given as the operator parameter and the index of the element in the sequence as a value.
4. Push the final sequence converted to a bag on QRES.
Extracting elements from a sequence: seqquery[subquery]
Evaluation steps:
1. Initialise an empty bag (eres).
2. Execute the seqquery returning a sequence (all sequences are indexed starting from 1).
3. Take its result from QRES (seqres).
4. Execute the subquery returning a collection of integers.
5. Take a result collection from QRES (intres).
6. For each element (el) of the intres result do:
   6.1. Insert the seqres element with the index el into eres.
7. Push eres on QRES.
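Both sequence operators can be sketched in a few lines of Python; the only subtlety is that SBQL indexes sequences from 1 while Python lists are indexed from 0, hence the +1/-1 adjustments (the representations are illustrative only):

```python
def eval_rangeas(seqres, name):
    # Each element is replaced by a structure {element, binder(name, index)},
    # with indices starting from 1; binders are modelled as (name, value).
    return [(el, (name, i + 1)) for i, el in enumerate(seqres)]

def eval_extract(seqres, intres):
    # For each integer el, insert the seqres element with index el into eres.
    return [seqres[el - 1] for el in intres]

seq = ["a", "b", "c"]
print(eval_rangeas(seq, "rank"))
# [('a', ('rank', 1)), ('b', ('rank', 2)), ('c', ('rank', 3))]
print(eval_extract(seq, [3, 1]))   # ['c', 'a']
```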
In this section the most important SBQL operators have been presented. SBA also enables introducing more sophisticated operators, e.g. transitive closures and fixed point equations. Nonetheless, the operators essential for the author's work are presented above.
3.3.2 Imperative Statements Evaluation
The following operators used to modify the state of data are also a part of the
SBQL language; however, they cannot construct expressions that can be used by other
operators to form complex queries.
Tab. 3-6 Evaluation of imperative operators
Assigning a value to an object: :=
Evaluation steps:
1. Execute the right subexpression.
2. Take its result from QRES.
3. Verify that it is a single result (if not, a run-time exception is raised).
4. Perform dereference on the result if necessary to obtain a value.
5. Execute the left subexpression.
6. Take its result from QRES (it is assumed that the result of the left subquery should be a reference to a suitable object).
7. Verify that it is a single reference (if not, a run-time exception is raised).
8. Assign the value of the right subquery result to the object pointed to by the reference result.
Creating an object inside an existing object: :<<
Evaluation steps:
1. Execute the right subexpression.
2. Take its result from QRES (it is assumed that the right subquery returns a binder).
3. Verify that it is a single result (if not, a run-time exception is raised).
4. Execute the left subexpression.
5. Take its result from QRES (it is assumed that the result of the left subquery should be a reference to a suitable complex object).
6. Verify that it is a single result (if not, a run-time exception is raised).
7. Create a database object according to the binder name and its value. If the binder has an atomic value inside then the new object is atomic. If the binder contains another binder or a structure then a complex object is created (for nested binders appropriate new subobjects are created).
8. Nest the new object inside the object referenced by the left subquery result.
Removing an object: delete
Evaluation steps:
1. Execute the subquery.
2. Take a result collection from QRES (colres). It is assumed that this collection holds references to existing objects.
3. For each element (ref) of the colres result do:
   3.1. Remove the object pointed to by ref from the database together with its subobjects and objects referencing it.
3.4 Static Query Evaluation and Metabase
During compilation, SBQL queries are subjected to static analysis. This
process is indispensable in order to perform static type control and most
optimisations. Static analysis consists of mechanisms similar to the evaluation of a query.
The task of such an evaluation is to simulate the greatest number of possible
situations that may occur at run-time, however using data appropriate for
compile-time. Hence, static analysis does not refer to real data. Instead, it uses a
metabase, i.e. a graph of a database schema constructed from the declarations of program
entities. A database schema graph is a structure similar to a database graph. It is also
modelled using simple, complex and reference objects. The significant differences in
contrast to a database graph are the following:
• a metabase, instead of particular occurrences of objects, stores only the information about the minimal and maximal numbers of objects, i.e. the cardinality of a collection;
• instead of specific values, the metabase stores information on data types and the relationships (e.g. static inheritance) between them;
• the metabase additionally contains information which can be used during cost-based optimisations, e.g. data-related statistics.
For example the following source code fragment:
i : integer [0..*];
setvar : record { txt : string; note : string [1..5] };
would result in the following metabase written according to the AS0 model:
<i0, entry,
 <i1, i,
  <i2, meta_object_kind, META_VARIABLE>
  <i3, type_kind, PRIMITIVE>
  <i4, type, INTEGER>
  <i5, minimal_cardinality, 0>
  <i6, maximal_cardinality, +∞>
 >
 <i7, setvar,
  <i8, meta_object_kind, META_VARIABLE>
  <i9, type_kind, COMPLEX>
  <i10, type, i13>
  <i11, minimal_cardinality, 1>
  <i12, maximal_cardinality, 1>
 >
 <i13, $x_struct_type,
  <i14, meta_object_kind, META_STRUCTURE>
  <i15, fields,
   <i16, txt,
    <i17, meta_object_kind, META_VARIABLE>
    <i18, type_kind, PRIMITIVE>
    <i19, type, STRING>
    <i20, minimal_cardinality, 1>
    <i21, maximal_cardinality, 1>
   >
   <i22, note,
    <i23, meta_object_kind, META_VARIABLE>
    <i24, type_kind, PRIMITIVE>
    <i25, type, STRING>
    <i26, minimal_cardinality, 1>
    <i27, maximal_cardinality, 5>
   >
  >
 >
>
The compile-time equivalent of a query result is an operation signature.
The following kinds of signatures can be distinguished:
• a static reference, that is, a reference to a metabase object;
• a static binder that contains a name and an associated signature as a value;
• a variant that contains several possible signatures (used when an unambiguous signature cannot be determined during static analysis);
• a value type representation that contains the identifier of the primitive type which it represents (usually it concerns literals and static references to atomic objects when dereference is applied);
• a static structure that contains a set of signatures representing the fields of that structure.
Page 63 of 181
Chapter 3
The Stack-based Approach
Each of these signatures contains additional information, e.g. concerning
possible cardinality of the run-time result returned by a query represented by the
signature.
Besides signatures, the context query analyser is equipped with static equivalents
of the environment and query result stacks. Unlike the run-time stacks, these
structures work with signatures and the database schema graph rather than with query
results.
3.4.1 Type Checking
The compile-time static query analysis allows for performing static type control
[112]. According to type determining rules, which are specified for every operator, the
compiler can determine the type of a value returned by a complex query through
analysis of its individual parts. The following example is a single rule concerning the
union operator:
bag[a..b](type) union bag[c..d](type) => bag[a+c..b+d](type)
This rule describes a set-theoretic sum of a bag comprising at least a and at most b
elements, represented by the left union operand signature, with another bag whose
cardinality is from c to d elements, represented by the right union operand signature.
It additionally assumes that the types of elements in these collections must be
identical. Consequently, the rule indicates that the final collection preserves the type
of the input collections and comprises at least a + c and at most b + d elements.
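The rule can be sketched as a function over signatures, modelled here as (kind, (min_card, max_card), elem_type) tuples with "*" standing for an unbounded upper cardinality; this representation is hypothetical and does not reflect the actual compiler's data structures:

```python
def add_card(x, y):
    # Cardinalities add; an unbounded bound stays unbounded.
    return "*" if "*" in (x, y) else x + y

def type_union(left, right):
    lkind, (a, b), ltype = left
    rkind, (c, d), rtype = right
    if (lkind, rkind) != ("bag", "bag") or ltype != rtype:
        raise TypeError("union expects bags of an identical element type")
    # bag[a..b](type) union bag[c..d](type) => bag[a+c..b+d](type)
    return ("bag", (add_card(a, c), add_card(b, d)), ltype)

sig = type_union(("bag", (0, 3), "integer"), ("bag", (1, "*"), "integer"))
print(sig)   # ('bag', (1, '*'), 'integer')
```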
The set of similar rules is an internal part of almost every programming
language. Yet SBQL is also a query language and thus operation signatures are
enhanced with additional information concerning collections like cardinality and order.
The rules created for arithmetic operators are usually more restrictive. For
example, for + operator the following rule can be designed:
value[1..1](integer) + value[1..1](real) => value[1..1](real)
This rule assumes that the addition of integer and real values can be executed only if
the operands are single values (i.e. the cardinality of both arguments is [1..1]);
otherwise a typing error occurs. Nevertheless, one can wonder whether such an
assumption concerning cardinality is too restrictive and consequently whether some
part of the type checking should be moved to run-time. The rule above can be
rewritten in an alternative form:
value[0..*](integer) + value[0..*](real) => value[1..1](real)
In this case the + operator allows a situation in which the actual number of arguments
is unknown at compile-time. The appropriate check ensues at run-time. Therefore, if
the left or right subquery does not return a single value, the interpreter reports a
run-time error. Such a solution leads to the so-called semi-strong type system.
Let us consider the example query:
(Person where surname = "Kuc").age + 1
It illustrates why a semi-strong type system is more comfortable for a programmer.
In this example it is assumed that only one person with the given surname exists.
This assumption is checked dynamically, i.e. if it is not fulfilled then a run-time error
indicates a typing error. In the case of a more restrictive type system, the compiler
would reject the above construction.
3.5 Updateable Object-Oriented Views
A database view is a collection of virtual objects that are arbitrarily mapped from stored
objects. In the context of distributed applications (e.g. web applications) views can be
used to resolve incompatibilities between heterogeneous data sources, enabling their
integration [60, 61].
The idea of updateable object views relies on augmenting the definition of a view with
information on users' intents with respect to updating operations. Only the view
definer is able to express the semantics of view updating. To achieve this, a view
definition is subdivided into two parts. The first part is a functional procedure which
maps stored objects into virtual objects (similarly to SQL). It returns entities called
seeds that unambiguously identify virtual objects (in particular, seeds are OIDs of stored
objects). The second part contains redefinitions of generic operations on virtual objects.
These procedures express the view definer's intentions with respect to the update, delete,
insert and retrieve operations performed on virtual objects. Seeds are (implicitly) passed
as procedures' parameters. A view definition usually contains definitions of subviews,
which are defined according to the same rules, following the relativism principle. Because a
view definition is a regular complex object, it may also contain other elements, such as
procedures, functions, state objects, etc. The above assumptions and SBA semantics
allow achieving the following properties: (1) full transparency of views – after defining
a view, its user uses the virtual objects in the same way as stored objects; (2) views can
be recursive and (as procedures) may have parameters.
Chapter 4
Organisation of Indexing in OODBMS
This chapter concerns primarily the architecture and rules applying to index
management and maintenance. The actual optimisation of query processing is the topic of
the next chapter. Nonetheless, improving performance depends on the diversity of exploited
index structures and flexibility in defining an index. The properties of SBQL, in
particular orthogonality and compositionality, make it easy to formulate complex
selection predicates, including the usage of complex expressions with polymorphic methods
and aggregate operators. The proposed organisation of indexing provides all necessary
mechanisms so that the database administrator is unconstrained in creating local or global
indices with keys based on such expressions.
The implementation exploits the linear hashing index structure (see section
2.2.1). Nevertheless, the solution does not limit the possibility to apply different
indexing techniques, e.g. B-Trees. Details of this aspect of database indexing are
omitted since it is generally orthogonal and independent of index management,
maintenance and query optimisation.
4.1 Implementation of a Linear Hashing Based Index
The primary reason for implementing a linear hashing index is the possibility of
extending this structure to its distributed SDDS version (see section 2.2.2) in order to
optimally utilise distributed database resources. Moreover, the author wants to provide
extensive query optimisation support by enabling:
• dense indexing for integer, real, string, date and reference key values (the dense index key type),
• support for optimising range queries on integer, real, string and date key values (the range and enum key types),
• indexing using multiple keys,
• the enum key type – special support facilitating indexing of integer, real, string, date, reference and boolean keys with a countable, limited set of distinct values (low key value cardinality).
The enum key type provides additional flexibility when applied to multiple
key indices, since such keys can be skipped in an index invocation, i.e. can be
considered optional.
4.1.1 Index Key Types
The mentioned properties of indexing are introduced through different designs of
the hash function.
The dense key type implies that the optimisation of selection queries which use
the given key as a condition will be applied only for selection predicates based on the = or
in operators. Therefore, a hash function can distribute objects in the index randomly,
disregarding key value order. Such an index does not support optimising range queries;
however, it is faster in processing index invocations with exact match selection criteria.
The range key type additionally supports optimisation concerning selection
predicates based on the range operators: >, ≥, < and ≤. This is achieved through a range
partitioning [62, 75] variant implemented by the author. Within an index, the hash function
groups object references in individual buckets (see the bucket definition in section 2.2.1)
according to key value ranges. The ranges are dynamically split as the index grows,
increasing its selectivity.
The last key type – enum – is introduced in order to take advantage of keys with a
countable, limited set of distinct values, i.e. keys with low value cardinality. The
performance of an index can be strongly deteriorated if key values have low cardinality,
e.g. person eye colour, marital status (a boolean value) or the year of birth. To prevent
this, the index internally stores all possible key values (or key value range limits in the case
of integer values) and uses this information to facilitate index hashing. The enum key
type can deal with optimising selection predicates exactly as the range
key type does, i.e. for the =, in, >, ≥, < and ≤ operators.
Multiple key indexing is introduced by defining the overall hash function as a
composition of the individual keys' hash functions. An enum type key hash function assigns
key values to consecutive hash codes. As a result, enum type keys are particularly
effective in multiple key indices. First, they can be omitted in index calls generated
during the optimisation of queries, which improves the flexibility of the index optimiser (details
are described in section 5.5.2). Furthermore, index invocation evaluation proves very
efficient if all index keys are enum and the number of indexed objects is large enough.
In such conditions each key value combination points to a separate bucket of object
references, which eliminates the necessity to verify search criteria for retrieved objects.
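The composition of per-key hash functions can be sketched as follows, assuming (for illustration only) that every key is of the enum type: each enum key maps its possible values to consecutive codes 0..n-1, and the composed function combines the codes positionally as a mixed-radix number, so that with enough buckets every key value combination lands in its own bucket.

```python
class EnumKey:
    """An enum type key: a fixed, countable set of possible values."""
    def __init__(self, values):
        self.codes = {v: i for i, v in enumerate(values)}  # value -> 0..n-1
        self.radix = len(values)

    def hash(self, value):
        return self.codes[value]

def composed_hash(keys, values):
    # Combine per-key codes positionally (mixed-radix composition), so each
    # key value combination maps to a distinct overall code.
    code = 0
    for key, value in zip(keys, values):
        code = code * key.radix + key.hash(value)
    return code

city = EnumKey(["Lodz", "Warsaw", "Cracow"])
married = EnumKey([False, True])
print(composed_hash([city, married], ["Warsaw", True]))   # 1*2 + 1 = 3
```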
4.1.2 Example Indices
The example schema given in Fig. 3.1 makes it possible to present a wide variety
of indices supported by the OODBMS indexing engine implemented by the author.
The prefix idx is used to distinguish names of indices from other database entities.
Firstly, let us discuss simple single key indices created on objects' attributes:
• idxPerAge – returns Person objects according to the value of the age attribute. It is assumed that this index is capable of processing range queries.
• idxEmpAge – identical as above, but only for Emp objects.
• idxEmpSalary – returns Emp objects queried by their salary attribute. This index can similarly use a salary range as a selection criterion.
• idxPerSurname – a dense index returning Person objects according to the string type surname attribute.
• idxDeptName – a dense index returning Dept objects according to the name attribute.
• idxPerZip – a range index which returns Person objects queried by the zip attribute of their address subobject. It is important to note that the zip attribute is optional and therefore this index stores only Person objects containing this attribute.
• idxEmpCity – returns instances of the Emp class according to the address.city complex attribute. It is assumed that this index is dense.
• idxAddrStreet – a dense index which returns address subobjects of Person objects according to the street attribute. Differently than in the case of other indices, non-key objects are defined by a path expression, i.e. Person.address.
The following indices use derived and complex attributes as keys:
• idxEmpDeptName – a dense index which uses the derived attribute worksIn.Dept.name to retrieve Emp objects.
• idxEmpWorkCity – an index using the derived attribute worksIn.Dept.address.city for Emp objects. Additionally, in order to take advantage of the fact that company departments are located in a limited number of cities (low key value cardinality), the key type is enum.
• idxDeptYearCost – the most complex of the indices for Dept objects. The key is based on the expression sum(employs.Emp.salary) * 12, which returns the approximate total cost of salaries of a given department for a year period. It is assumed that this index is range.
• idxEmpTotalIncomes – a range index which uses the Emp class method getTotalIncomes() as a key for selecting Emp objects. This method is overridden for instances of the EmpStudent class.
Another powerful feature of the proposed indexing solution is multiple key
indexing. Using such an index can strengthen the selectivity property (cf. section 5.4.1),
in particular when individual keys return only few distinct values.
• idxEmpAge&WorkCity – an index for instances of Emp objects. It consists of two dense keys. The first key is set on the age attribute and the second one on the derived attribute worksIn.Dept.address.city. It is assumed that it is necessary to specify both attributes to take advantage of this index.
• idxPerAge&Surname – the last index, indexing Person objects, which also uses two keys. The first key is set on the age attribute and supports range queries; low cardinality of its values is assumed (the enum key type). The second, dense key is set on the person surname attribute. This index offers greater flexibility, since the age key can be omitted in an index call.
In order to take advantage of indexing, the administrator only has to create proper
indices. The rest of the optimisation issues are completely transparent.
4.2 Index Management
All indices existing in the database are registered and managed by the index
manager. Besides the list of meta-references to objects describing indices, it also holds
auxiliary redundant information needed by the index optimiser and the static evaluator,
i.e. a list of structures (called Nonkey Structures), maintained for each indexed collection
of objects, containing information about:
• the query defining the given collection,
• a reference to the metabase object representing objects belonging to this collection,
• the indices set on the given collection along with their meta-references,
• the list of keys used to index the given collection of objects, holding precise information about each key:
  o an expression defining the key,
  o a list of indices using the given key.
Efficient access to elements of the lists mentioned above is provided by
auxiliary indices. The structure of the index manager is presented in Fig. 4.1.
Fig. 4.1 Index manager structure
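The information held per indexed collection can be sketched with Python dataclasses; the class and field names below are illustrative, not the actual ODRA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class KeyInformation:
    key_expression: str                          # an expression defining the key
    indices: list = field(default_factory=list)  # indices utilising this key

@dataclass
class NonkeyStructure:
    query: str                                   # query defining the collection
    metabase_ref: str                            # reference to the metabase object
    indices: list = field(default_factory=list)  # indices set on the collection
    keys: dict = field(default_factory=dict)     # key expression -> KeyInformation

    def register_index(self, index_name, key_expressions):
        self.indices.append(index_name)
        for pos, expr in enumerate(key_expressions, start=1):
            info = self.keys.setdefault(expr, KeyInformation(expr))
            info.indices.append((index_name, pos))  # record the key position

emp = NonkeyStructure(query="Emp", metabase_ref="meta:EmpClass")
emp.register_index("idxEmpSalary", ["salary"])
emp.register_index("idxEmpAge&WorkCity", ["age", "worksIn.Dept.address.city"])
print(emp.keys["age"].indices)   # [('idxEmpAge&WorkCity', 1)]
```

In the actual system the `keys` dictionary corresponds to the index of Key Structures according to the key expression field value, which gives the optimiser fast access from a predicate's expression to the indices that can answer it.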
The index manager assists in the index optimisation process by making well-organised
information about existing indices available (details can be found in section
5.2.1 and subchapter 5.3). For instance, if the index optimiser processes a where clause
which selects objects from the whole Emp collection, then the Nonkey Structures Index
returns the necessary information about indices set on EmpClass instances. Such
information, in the case of the indices introduced in section 4.1.2, is presented in Fig. 4.2.
[Fig. 4.2 shows the Nonkey Structure maintained for the Emp collection: the query defining the collection (Emp) with a meta-reference to the EmpClass objects variable; the list of associated indices (idxEmpSalary, idxEmpCity, idxEmpWorkCity, idxEmpTotalIncomes, idxEmpAge&WorkCity); and the list of keys (salary, city, getTotalIncomes(), age, worksIn.Dept.address.city), each with the indices utilising it and the key position. Key Structures are indexed according to the key expression field value.]
Fig. 4.2 Example Nonkey structure for Emp collection
Taking advantage of the Nonkey Structure presented above, the index optimiser can
efficiently match the selection predicates of a where clause with the associated indices'
keys.
4.2.1 Index Creating Rules and Assumed Limitations
Each index has to be unique in its namespace. Because of the wide range of
topics discussed in this work, the concept of modules and namespaces connected with
them, developed in the ODRA prototype [68, 119], is omitted.
The administrator issues the add index command to create a new index in the
database. The syntax of this command is the following:
add index <indexname> ( <typeind_1> [ | <typeind_2> ... ] ) on
<nonkeyexpr> ( <keyexpr_1> [ , <keyexpr_2> ... ] )
where:
• indexname – stands for a unique name of the index,
• typeind_i – the type indicator of the i-th index key, specified by one of the following values: dense, range and enum (described in section 4.1.1),
• nonkeyexpr – a path expression defining indexed objects,
• keyexpr_i – a query defining the i-th key used to retrieve indexed objects.
The number of type indicators corresponds to the number of keys forming an
index.
Indexed objects are defined by the nonkeyexpr expression, which must be bound
in the lowest database section (the database root) of the environment stack. For
simplification, it is assumed that this definition should be built using a path expression
(name expressions connected using dot non-algebraic operators). Moreover, this path
expression should return a collection of distinct objects to be indexed. This is important
because some of the optimisation methods are not currently designed to deal with
collections containing duplicates. Using reference objects in defining nonkeyexpr can
result in the possibility of indexing duplicates; e.g. usually more than several employees
work in a single company department, and hence the following path expression:
Emp.worksIn.Dept
would probably return the same Dept objects many times, because worksIn references
associate different employees with the same department.
Nevertheless, the mentioned limitation concerning the indexed objects definition is
normal in the context of typical indexing solutions for databases. Additionally, such an
index can be used to enforce a constraint that a collection should comprise distinct
objects. Enabling support for more complex definitions of non-key objects is possible;
however, it would result in some limitations concerning applying optimisations and in
increased implementation complexity of automatic index updating, which is presented
in subchapter 4.3.
Each key value expression keyexpr_i should be defined in the context of the
objects defined by the nonkeyexpr expression. Consequently, the query:
nonkeyexpr join (keyexpr_1 [ , keyexpr_2 ... ] )
returns non-key objects together with the corresponding key values. Each keyexpr_i should
depend on the join operator. Index keys should return values of the following types:
integer, real, string, date, reference or boolean. Moreover, each key expression has to
be deterministic, i.e. for a given non-key object it must return exactly the same result
provided that the data used to calculate it have not changed (for example, this excludes
the usage of a random method).
An important property of a created index is the cardinality of its keys. For each key
it indicates the possible number of returned values. Usually keys return a single value, so
their cardinality is [1..1] and a key value exists for each non-key object. As a result,
the whole non-key collection is indexed.
When the minimal key cardinality is zero, e.g. the address.zip key for the idxPerZip
index, some objects can be omitted in indexing, since their key value may not exist.
This situation does not disable indexing; however, it introduces several requirements for
the database programmer in order not to thwart index optimisation (this problem is
explained in detail in subchapter 5.3 and section 5.5.2).
Currently the author has not provided support for indexing when a key's
maximum cardinality is above one, because of the ambiguity in generating key
values for an object, i.e. more than one key value combination can be generated for a
single object. Supporting such a scenario would require introducing minor changes in
generating the index structure and extending index optimisation methods to deal
properly with selection predicates working on collections.
If all conditions described above are met, the index manager initialises an index
structure and creates an index related meta-object. Next, it proceeds to organise
information required for optimisation. First, the Nonkey Structure corresponding to the given nonkeyexpr
Page 74 of 181
Chapter 4
Organisation of Indexing in OODBMS
expression is located or a new one is created. Then, the structure is
updated with information depicted in Fig. 4.1 concerning the index and all its keys.
Each keyexpr_i expression is marked with the index being created in the proper Key
Structure.
This completes the creation of a new index. However, it is crucial that the index
manager enables the index updating mechanism in order to fill the index structure with
appropriate objects. This topic together with related issues is discussed in subchapter 4.3.
4.3 Automatic Index Updating
Indices, like all redundant structures, can lose consistency if the data stored in the
database are altered. Rebuilding an index should be transparent for application
programmers and should ensure the validity of maintained indices. For these reasons
automatic index updating has been designed and implemented. Furthermore, the
additional time required for an index update in response to a data modification should
be minimised. This is critical from the point of view of the efficiency of large databases.
Any change to data must not cause a long-lasting verification of existing indices or
rebuilding the whole index from scratch. To achieve this, a database system should
efficiently find indices which became outdated because of a performed data
modification. Next, the appropriate index entries should be corrected so that all index
invocations would provide valid answers. Such index updating routines should not
influence the performance of retrieving information from the database and the overhead
introduced to writing data should be minimal (particularly when no index has been
affected by changes to the database). However, finding a general and optimal solution
for index updating is not possible because of the complexity of DBMSs. Such a task
requires analysis of many different real-life situations occurring in the database
environment in order to minimise deterioration of performance.
4.3.1 Index Update Triggers
Each modification performed on objects (creation, update and deletion) is
executed through the ODRA object store CRUD interface (CRUD is an acronym for
Create, Read, Update and Delete), which generally is responsible for access to
persistent data and other database entities. The proposed approach to automatic index
updating concentrates on this element of the system as it is the easiest and most
reliable way to trace data modifications.
Possible modifications that can be performed on an object are the following:
• updating a value of an integer, double, string, boolean, date or object reference,
• deleting,
• adding a child object (in case of a complex object),
• other database implementation dependent modifications, e.g. adding a child to an aggregate object (the role of this kind of objects in index updating is described in section 4.3.5).
The author has introduced a group of special auxiliary structures called Index
Update Triggers (IUT) together with Trigger Definitions (TD). These elements are
essential to perform index updating.
Each IUT associates one database object with an appropriate index through a
TD. Existing IUTs automatically initialise the index updating mechanism when a
modification concerning the given object is about to occur. More than one IUT can be
connected with a single object. TDs provide means to find objects which should be
equipped with IUTs. Additionally, a TD specifies the type of an IUT.
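The relationships just described (one IUT ties one database object to one index through a TD, several IUTs may be attached to the same object, and a K-IUT additionally carries a non-key object identifier) can be sketched as follows. This is an illustrative Python sketch with hypothetical names, not the ODRA Java implementation:

```python
# Illustrative sketch of Trigger Definitions (TD) and Index Update Triggers
# (IUT); all names are hypothetical, not taken from the ODRA implementation.

class TriggerDefinition:
    """Specifies the type of an IUT and the index it serves."""
    def __init__(self, iut_type, index_name):
        self.iut_type = iut_type        # "R-IUT", "NP-IUT", "NK-IUT" or "K-IUT"
        self.index_name = index_name

class IndexUpdateTrigger:
    """Associates one database object with an index through a TD."""
    def __init__(self, td, nonkey_id=None):
        self.td = td                    # the TD gives the trigger its type
        self.nonkey_id = nonkey_id      # extra parameter carried by K-IUTs

# More than one IUT can be connected with a single object identifier.
iuts = {}                               # object id -> list of IUTs
td = TriggerDefinition("K-IUT", "idxPerAge")
iuts.setdefault("i34", []).append(IndexUpdateTrigger(td, nonkey_id="i31"))
```

A modification of an object would then simply look up `iuts` by the object identifier to decide whether the index updating mechanism must be started.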
An object is associated with IUTs when it participates in accessing non-key
objects or in calculating key values for indices. Therefore, modifications to objects not
linked with any index do not trigger unnecessary index updating. Altering objects
equipped with IUTs is likely to influence the validity of indices and IUTs.
Four basic types of IUTs (each IUT refers to different TD type) are proposed:
1. Root Index Update Trigger (R-IUT) – is by default associated with the root
database entry which is a direct or indirect parent for all indexed database
objects. When a new object is created in the database's root, the trigger can cause
generation of a NonkeyPath- or Nonkey- Index Update Trigger (described
below) for the new child object. This trigger is also used to initialise or terminate
all triggers associated with an index.
2. NonkeyPath Index Update Trigger (NP-IUT) – a type of a trigger associated
with objects which are potential direct or indirect parent objects for new indexed
objects. This type of a trigger is generated when an index non-key object is
defined by a path expression (e.g. idxAddrStreet index), i.e. when non-key
objects are not direct children of the database's root. Similarly to an R-IUT, this
trigger can cause generation of a NonkeyPath- or Nonkey- Index Update Trigger
for the new child object.
3. Nonkey Index Update Trigger (NK-IUT) – a trigger that is assigned to indexed
(non-key) objects. It is generated by direct parent object’s update triggers
(R-IUTs or NP-IUTs). The process of creating a NK-IUT consists of the
following steps:
• first a NK-IUT is assigned to the given indexed object,
• the key value is calculated,
• the corresponding index entry is created (if a valid key value is found),
• Key Index Update Triggers (described below) are generated and parameterised with the indexed object identifier.
Creating a child object inside a non-key object initialises routines identical to a
Key Index Update Trigger.
4. Key Index Update Trigger (K-IUT) – associated with objects used to evaluate a
key value for a specific non-key object (identifier passed together with TD as an
additional parameter). Each modification to such objects can potentially modify
the process of evaluating a key and hence its value. Therefore, a K-IUT is
responsible for updating a corresponding index entry and maintaining
appropriate K-IUTs corresponding to the given non-key object.
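The propagation rules implied by the four types can be summarised in a small table: R-IUTs and NP-IUTs may generate NP- or NK-IUTs for newly created children, an NK-IUT generates K-IUTs for the objects used to compute its key, and a K-IUT only maintains K-IUTs for its own non-key object. A sketch of this dispatch table in Python (hypothetical names, not the ODRA implementation):

```python
# Hypothetical sketch of which trigger types each IUT type may generate for
# objects below it, following the four types described above.

PROPAGATES = {
    "R-IUT":  {"NP-IUT", "NK-IUT"},  # root entry: children are non-key objects
                                     # or (indirect) parents of non-key objects
    "NP-IUT": {"NP-IUT", "NK-IUT"},  # intermediate parents on the non-key path
    "NK-IUT": {"K-IUT"},             # indexed object: triggers for key objects
    "K-IUT":  {"K-IUT"},             # revising may equip other objects with
                                     # K-IUTs for the same non-key object
}

def child_trigger_types(parent_type):
    """Trigger types that may be assigned below an object of parent_type."""
    return PROPAGATES[parent_type]
```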
Based on the sample store depicted in Fig. 3.2, for the indices idxPerAge,
idxEmpWorkCity and idxAddrStreet introduced in section 4.1.2 the example IUTs shown in
Fig. 4.3, Fig. 4.4 and Fig. 4.5 would be generated. Let us assume that i0 is the identifier
of the database's root. Non-key objects associated with K-IUTs are stated in parentheses.
Fig. 4.3 Example Index Update Triggers generated for idxPerAge index
Fig. 4.4 Example Index Update Triggers generated for idxEmpWorkCity index
Fig. 4.5 Example Index Update Triggers generated for idxAddrStreet index
4.3.2 The Architectural View of the Index Update Process
The overview of the index update process that has been proposed and
implemented by the author is presented in Fig. 4.6.
Fig. 4.6 Automatic index updating architecture
When the administrator adds an index, TDs are created before IUTs (this step is
shown using the green coloured arrows numbered 1a and 1b):
• The index manager initialises a new index and issues the triggers manager a message to build TDs.
• Next, the triggers manager activates the index updating mechanism which, based on the knowledge about indices and TDs, proceeds to add IUTs:
o This process is initialised by introducing an R-IUT for the database's root entry.
o The R-IUT trigger propagates the remaining triggers to database objects.
o When an NK-IUT is added to an indexed non-key object, a key value is evaluated and an adequate entry is added to the index.
Removing an index causes the removal of IUTs and TDs. Together with NK-IUTs,
the corresponding index entries are deleted. The mediator managing the addition and
removal of IUTs is a special extension of the CRUD interface.
The second case in which the index updating mechanism is activated occurs when the
database store CRUD interface receives a message to modify an object which is
marked with one or more IUTs (shown in Fig. 4.6 using the blue coloured arrow with
number 2). CRUD notifies the index updating mechanism about the forthcoming
modification, and all necessary preparations are performed before the database alteration.
This step is particularly important in case of changes which can affect a key value for
the given non-key object. It consists of:
• locating an index entry which corresponds to the non-key object (a key value is necessary),
• identifying objects that are accessed in order to calculate the key value (they are equipped with an identical K-IUT).
After gathering required information CRUD performs requested modifications
and the index updating mechanism proceeds to:
• update index entries for the given non-key object by:
o moving an entry corresponding to the non-key object according to a new key value,
o removing the outdated entry if there is no proper new key value,
o inserting a new entry into the index if a proper key value was calculated only after the database alteration,
• update existing IUTs by generating new ones or removing outdated ones.
This finishes servicing the trigger caused by alteration of the database.
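The three cases of adjusting index entries listed above (moving, removing, inserting) can be sketched as one routine. This is an illustrative Python sketch in which an index is simplified to a dictionary from key values to sets of non-key object identifiers; the function name is hypothetical, not the ODRA implementation:

```python
# Sketch of adjusting an index entry after a data modification, covering the
# three cases above: move, remove, insert. Names are hypothetical.

def update_entry(index, nonkey_id, old_key, new_key):
    """index: dict mapping a key value to a set of non-key object ids."""
    if old_key is not None:
        index[old_key].discard(nonkey_id)     # the old entry is outdated
        if not index[old_key]:
            del index[old_key]                # drop empty key buckets
    if new_key is not None:
        index.setdefault(new_key, set()).add(nonkey_id)   # (re)insert

idx = {30: {"i31"}}
update_entry(idx, "i31", 30, 31)     # key moved, e.g. a Person's age 30 -> 31
update_entry(idx, "i31", 31, None)   # key no longer exists: entry removed
update_entry(idx, "i31", None, 31)   # key appears only after the alteration
```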
4.3.3 SBQL Interpreter and Binding Extension
A significant element used by the index updating mechanism is a query
execution engine, i.e. the SBQL interpreter (also shown in Fig. 4.6), extended with the
ability to:
1. Log database objects that occur during evaluation of an index key expression.
Logging takes place during binding object names on ENVS (other database
entities like procedures, views, etc. and literals are discarded) – this feature is
used to locate all objects which are or should be equipped with K-IUTs.
2. Limit the first performed binding only to one specified object – this feature
significantly accelerates and facilitates verification whether a new child
subobject added to an object with R-IUT or NP-IUT should be equipped with
NP-IUT or NK-IUT, i.e. to check whether a new child is a non-key object or the
potential direct or indirect parent of a non-key object.
The only module of the SBQL interpreter which required the author's modifications
is the run-time binding manager. The proposed extension is introduced using Java static
inheritance; therefore, applying different binding mechanisms to the SBQL interpreter is
straightforward. The interpreter is used by the index updating mechanism in order to:
• traverse from the database's root or objects equipped with NP-IUT to non-key objects,
• generate a key value for a given non-key object.
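The binding extension itself can be sketched as a subclass of the run-time binding manager that records the identifiers of data objects bound during key evaluation and discards other entities. This is an illustrative Python sketch (in ODRA the extension is written in Java and introduced through inheritance); the class and store layout are hypothetical:

```python
# Sketch of the extended run-time binding manager: identifiers of data
# objects bound while evaluating a key expression are logged; procedures,
# views and literals are discarded. Hypothetical names.

class BindingManager:
    def __init__(self, env):
        self.env = env                      # name -> (object id, entity kind)
    def bind(self, name):
        return self.env[name]

class LoggingBindingManager(BindingManager):
    """Records bound data-object identifiers for the index updating mechanism."""
    def __init__(self, env):
        super().__init__(env)
        self.logged = []
    def bind(self, name):
        oid, kind = super().bind(name)
        if kind == "data":                  # other entities are discarded
            self.logged.append(oid)
        return oid, kind

bm = LoggingBindingManager({"age": ("i34", "data"),
                            "getTotalIncomes": ("i14", "procedure")})
bm.bind("age")
bm.bind("getTotalIncomes")
```

After evaluating a key expression, `bm.logged` holds exactly the identifiers of the objects that should carry K-IUTs.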
Let us consider the following example of adding IUTs starting with R-IUT
during creation of the idxAddrStreet index. The SBQL interpreter is used first to
evaluate the query Person, which returns identifiers of instances of PersonClass and its
subclasses EmpClass, StudentClass and EmpStudentClass, i.e. in this case the i31 EmpStudent
and i61 Emp objects. Consequently, NP-IUTs are added to the object i31 and the object
i61. In order to propagate triggers to non-key objects, first, the index updating
mechanism performs the nested operation on the i31 object to prepare a suitable context for
the SBQL interpreter by pushing the necessary binders onto the environment stack. Then, the
query address is evaluated and returns the i36 object. Similar actions are taken for the i61
Emp object and the i66 identifier is returned. Therefore, NK-IUTs are added to the
address objects i36 and i66. Next, in the context of both non-key objects the SBQL
interpreter evaluates the key expression street. Accordingly, street objects i34 and i64
containing key values are returned. This procedure allows inserting two non-key objects
into the idxAddrStreet index and building the R-IUT, NP-IUT and NK-IUT triggers.
Nevertheless, this is insufficient to find objects which should be equipped with a K-IUT
because of the possible key expression complexity, which is not limited only to path
expressions. However, the enhancement to the run-time binding manager enables finding
those objects during the calculation of a key value. All objects used in the evaluation of a key
expression by the SBQL interpreter occur during the binding operation. Moreover, this
enhancement allows finding aggregate objects which implicitly facilitate binding. Such
objects can be also useful in improving performance of index updating (cf. section
4.3.5). To conclude the example, as a result IUTs have been generated according to Fig.
4.5.
The next section discusses more complex examples concerning K-IUTs in order
to present versatility of the proposed approach.
4.3.4 Example of Update Scenarios
In order to trace example scenarios of index updating, let us refer to the sample
store in Fig. 3.2. In the examples presented below the most important are object and
method identifiers. The classes PersonClass and StudentClass occur during the nesting
operation; however, in the presented examples they do not affect the binding operation.
We assume that all examples are correct, so run-time errors do not occur during
evaluation. In particular, the left operand of an assign expression always returns
precisely one object to be modified.
4.3.4.1 Conceptual Example
The given statement concerns updating the age attribute of the Person object
whose surname is equal to “Kuc”:
(Person where surname = “Kuc”).age := 31
According to the store state depicted in Fig. 3.2, the left operand of the assignment
returns the age attribute with the identifier i34. The interpreter sends a message to the
ODRA database CRUD mechanism to update a value of the i34 integer attribute to 31.
Before the update operation, the CRUD mechanism checks for IUT triggers connected
with the attribute being modified. Let us assume that according to Fig. 4.3 there is a
K-IUT described by the following properties: < index: idxPerAge, non-key object: i31 >
associated with the object i34. Consequently, the index updating mechanism is triggered
to calculate a key value used to access the i31 object in the idxPerAge index. It is
important that additionally during this step the objects affecting the key value are
identified. To obtain the key value, the updating routines initialise a new SBQL interpreter
instance with an empty ENVS and perform the following operations:
1. A reference to the i31 object is put onto the QRES and nested operation is
performed.
2. New frames are created on the ENVS. The lowest stack section contains
components of classes according to the inheritance hierarchy: first for
PersonClass, followed by EmpClass and StudentClass (however, the order is not
predictable because of the multiple inheritance) and above them
EmpStudentClass. The top ENVS frame is filled with subobjects of the i31
EmpStudent object.
Fig. 4.7 Calculating the idxPerAge index key value for i31 object
Next, the interpreter proceeds to evaluate the idxPerAge index key expression
age. The evaluation steps are shown in Fig. 4.7. The bind operation with the age name
parameter is performed. The i34 attribute is put onto QRES. During binding the i34
identifier is stored by the index updating mechanism, as it has influenced the value of
the key. The key value is obtained by dereferencing the i34 attribute.
The index updating mechanism uses the non-key object i31 and the calculated
key value to locate the idxPerAge index entry corresponding to i31. This is necessary
for modifying the index after updating the key value for the given EmpStudent object.
Now the CRUD mechanism can alter the i34 attribute and assign it the new value 31.
After the age update, the process of calculating the key value is repeated. In this case, it
does not differ from the preceding one presented in Fig. 4.7. The index updating
mechanism uses all gathered information to:
• Update the idxPerAge index: the entry for the non-key object i31 with the key value 30 is properly adjusted to the new key value 31.
• Revise IUTs for the idxPerAge index and the non-key object i31: before as well as after modifying the i34 attribute the index updating mechanism identified that i34 is the only object influencing the key value. Since the K-IUT < index: idxPerAge, non-key object: i31 > associated with this object is still valid, no changes are made to the existing IUTs.
This finishes the index updating routines for the example presented above.
4.3.4.2 Path Modification
The next, more complex example concerns reassigning the Emp object whose
surname is equal to “Kowalski” to the HR department through updating the worksIn reference:
(Emp where surname = “Kowalski”).worksIn := ref Dept where name = “HR”
This operation causes the assignment of the i141 Dept object reference to the i70 worksIn
attribute. If the idxEmpWorkCity index exists, then CRUD finds a K-IUT described by
the following properties < index: idxEmpWorkCity, non-key object: i61 > associated
with the object i70. Therefore, before CRUD proceeds to modify the value of the worksIn
attribute, the routines presented in Fig. 4.8 are performed in order to calculate a
corresponding key value, i.e. value of the i134 object – “Opole”, and to identify objects
which compose the key, i.e. identifiers occurring during binding: i70 worksIn, i131 Dept,
i133 address and i134 city (written with the green colour in Fig. 4.8).
Fig. 4.8 Calculating the idxEmpWorkCity index key value before update
The CRUD mechanism performs update on the i70 worksIn attribute value.
Fig. 4.9 Calculating the idxEmpWorkCity index key value after update
From the point of view of the index updating mechanism this modification introduces
significant changes in the evaluation of the key expression. As can be seen in Fig. 4.9,
not only has the key value changed to “Kraków”, i.e. the value of the i144 city attribute, but
also the set of identifiers of objects which affect the key value is different, i.e. i70 worksIn,
i141 Dept, i143 address and i144 city.
The index updating mechanism uses all gathered information to:
• Update the idxEmpWorkCity index: the entry for the non-key object i61 with the key value “Opole” is adjusted to the new key value “Kraków”.
• Revise IUTs for the idxEmpWorkCity index and the non-key object i61: before as well as after modifying the i70 attribute the index updating mechanism has identified that the i70 object influences the key value. However, objects i131, i133 and i134 no longer affect the key value and therefore the K-IUTs < index: idxEmpWorkCity, non-key object: i61 > associated with them are removed. On the other hand, currently objects i141, i143 and i144 assist in computing the key value, so K-IUTs < index: idxEmpWorkCity, non-key object: i61 > are assigned to them.
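The trigger revision above amounts to a set difference over the identifiers logged before and after the modification. A sketch using the identifiers from this example (illustrative Python; the function name is hypothetical):

```python
# Sketch of revising K-IUTs by comparing the sets of identifiers that were
# logged during key evaluation before and after the modification.

def revise_triggers(before, after):
    """Return (ids to strip of the K-IUT, ids to newly equip with it)."""
    return before - after, after - before

before = {"i70", "i131", "i133", "i134"}   # logged while computing "Opole"
after  = {"i70", "i141", "i143", "i144"}   # logged while computing "Kraków"
to_remove, to_add = revise_triggers(before, after)
```

Identifiers present in both sets (here i70) keep their K-IUT untouched.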
It is important to note that modifying the worksIn attribute of i61 Emp should be
followed by suitable changes in Dept objects to ensure consistency between worksIn
and employs references. However, this is not an issue of the automatic index updating but
rather of a particular database application.
4.3.4.3 Keys with Optional Attributes
The given example consists of two statements removing and adding a zip
attribute for an address subobject of an Emp object whose surname is equal to
“Kowalski”. It shows the main idea of how the automatic index updating deals with
the deletion and creation of objects. The following statement causes the database
CRUD mechanism to remove the i69 object:
delete((Emp where surname = “Kowalski”).address.zip)
Let us assume that the idxPerZip index has been created and hence before the deletion the index
updating mechanism finds a K-IUT described by the following properties < index:
idxPerZip, non-key object: i61 > associated with the object i69. The index updating
mechanism calculates the key value corresponding to the non-key object (states of
SBQL stacks during evaluation are presented in Fig. 4.10), i.e. value of the i69 object –
99999. Additionally, objects which influence the key value are identified, i.e. identifiers
occurring during binding: the i66 address and the i69 zip. Because of the removal, the
latter identifier will not be further considered by the index updating mechanism during
the update triggers revision.
Fig. 4.10 Calculating the idxPerZip index key value before removing zip attribute
The CRUD mechanism deletes the i69 zip attribute together with all associated
IUTs. Consequently, a successful evaluation of the key value is not possible (as
depicted in Fig. 4.11). Despite the lack of a key value, the index update mechanism finds
the identifiers of objects which are used during the key value calculation, i.e. the i66
address.
Fig. 4.11 Calculating the idxPerZip index key value without zip attribute
The index updating mechanism uses gathered information to:
• Update the idxPerZip index: the entry for the non-key object i61 is removed.
• Revise IUTs for the idxPerZip index and the non-key object i61: before as well as after the modification the i66 attribute influences the key value. The i69 zip attribute no longer affects the key value; however, the K-IUT < index: idxPerZip, non-key object: i61 > associated with it was already removed during the deletion. As a result, no other changes are made to existing IUTs.
Let us analyse how the automatic index updating deals with inserting a new zip
attribute into the address object:
(Emp where surname = “Kowalski”).address :<< zip(99726)
The index updating mechanism finds a K-IUT described by the following properties <
index: idxPerZip, non-key object: i61 > associated with the object i66. Before the
insertion the key value corresponding to the non-key object is calculated. The state of the
key value has not changed, so, identically as in Fig. 4.11, no key value is found. During
the key value computation the i66 address object is used.
The CRUD mechanism creates the new zip attribute with the value 99726 and
the identifier i104 and inserts it into the i66 address object. The index updating
mechanism proceeds to the evaluation of the key expression according to steps
presented in Fig. 4.12. Objects which influence the key value are identified, i.e. the i66
address object and the i104 new zip object.
Fig. 4.12 Calculating the idxPerZip index key value after inserting zip attribute
Consequently the index updating mechanism:
• Updates the idxPerZip index: the entry for the non-key object i61 with the key value 99726 is added.
• Revises IUTs for the idxPerZip index and the non-key object i61: before as well as after inserting the new zip attribute the index updating mechanism identified that i66 influences the key value. Additionally, the i104 object is assigned the K-IUT < index: idxPerZip, non-key object: i61 > because it is important in computing the key value.
4.3.4.4 Polymorphic Keys
The following example concerns the idxEmpTotalIncomes index with a key based
on the getTotalIncomes() method which is polymorphic depending on the class of the
non-key object. For EmpClass instances getTotalIncomes() returns value of the salary
attribute:
return deref(salary)
whereas for EmpStudentClass it also takes into consideration the scholarship attribute:
return deref(salary) + deref(scholarship)
The statement below concerns updating the salary attribute of the Emp object
whose surname is equal to “Kowalski”:
(Emp where surname = “Kowalski”).salary := 2000
As a result of the evaluation, the SBQL interpreter sends a message to the CRUD
mechanism to modify the i71 salary attribute of the i61 Emp object. Before the
modification is executed the index updating mechanism finds the K-IUT < index:
idxEmpTotalIncomes, non-key object: i61 > associated with the i71 salary attribute. The
key value for the non-key object is computed and amounts to 1200. Binding operations
performed during the evaluation presented in Fig. 4.13 indicate that objects which
influence the key value are the i14 getTotalIncomes, i.e. the EmpClass procedure object,
and the i71 salary object. The procedure identifier (written with red colour) can be
discarded by the index updating mechanism during update triggers revision.
Fig. 4.13 Calculating the idxEmpTotalIncomes index key value for i61 object before update
After the salary attribute update the second calculation of the key value is similar to one
presented above in Fig. 4.13. Only the final value changes to 2000. Considering this
information the index updating mechanism:
• Updates the idxEmpTotalIncomes index: the entry for the non-key object i61 is adjusted to the new salary key value 2000.
• Revises IUTs for the idxEmpTotalIncomes index and the non-key object i61: before as well as after the update the same IUTs were identified by the index updating mechanism, hence no changes are made.
Let us consider how the automatic index updating deals with an
EmpStudentClass instance. The given statement concerns updating the scholarship
attribute for EmpStudent objects whose age is equal to 30:
(EmpStudent where age = 30).setScholarship(1500)
According to the sample schema in Fig. 3.2, due to the invocation of the setter
method setScholarship(), the i39 scholarship attribute of the i31 EmpStudent object will
be updated. Again, before the modification the key value for the idxEmpTotalIncomes
index is calculated. The interpreter routines depicted in Fig. 4.14 show that the
objects i24 getTotalIncomes (the EmpStudentClass procedure object), the i41 salary
attribute and the i39 scholarship attribute influence the key value. The procedure identifier can
be discarded during the update triggers revision.
Fig. 4.14 Calculating the idxEmpTotalIncomes index key value for i31 object before update
After the scholarship attribute update the second calculation of the key value is similar
to the one presented above in Fig. 4.14. Only the final steps differ (see Fig. 4.15). In order
to conclude the CRUD operations the index updating mechanism:
• Updates the idxEmpTotalIncomes index: the entry for the non-key object i31 is adjusted to the new key value 2500.
• Revises IUTs for the idxEmpTotalIncomes index and the non-key object i31: before as well as after the update the same IUT triggers were identified by the index updating mechanism, hence no changes are made.
Fig. 4.15 Last steps of computing the idxEmpTotalIncomes index key value for i31 after update
Due to its generality, the proposed approach presented in the examples above is
capable of dealing with updating indices with even more complex keys. Extending this
solution to support AS2 and the following abstract store types (which consider dynamic
inheritance and encapsulation as depicted in section 3.1.2) does not require introducing
significant changes.
4.3.5 Optimising Index Updating
The presented solution to index updating is universal and versatile; however,
without optimisations it can cause unnecessary performance deterioration particularly in
simple updating cases.
In the most common scenario, a key value is defined by a path expression (e.g.
indices idxPerAge, idxPerZip, idxEmpDeptName, idxEmpWorkCity). Often alterations
concerning an indexed object's key update the object which holds the key value. Such
an object could be equipped with a different type of trigger, the Index Key Value Update
Trigger (KV-IUT), instead of the K-IUT. Modifying the value of an object equipped with
this trigger does not require using the SBQL interpreter to recalculate the key value.
Moreover, revising IUTs is also unnecessary. This would significantly simplify the
index updating mechanism. For example, in case of the database state presented in Fig. 3.2
and the idxEmpWorkCity index, the following statement, changing the city of the HR
department to Warszawa:
(Dept where name = “HR”).address.city := “Warszawa”
would execute the KV-IUT associated with the i144 city object and the i31 EmpStudent
non-key object. However, in order to calculate the key value, instead of executing the query
worksIn.Dept.address.city in the context of the non-key object, the index updating
mechanism directly dereferences the city object. Moreover, revising K-IUTs is skipped.
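The KV-IUT fast path can be contrasted with the general K-IUT path in a short sketch: for a KV-IUT the written value itself is the key, so no interpreter run and no trigger revision is needed. This is an illustrative Python sketch with hypothetical names, not the ODRA implementation:

```python
# Sketch of the KV-IUT fast path: when the modified object itself holds the
# key value, the new key is taken directly from the written value, skipping
# the SBQL interpreter and the trigger revision. Hypothetical names.

def on_update(trigger_type, index, nonkey_id, old_value, new_value,
              recalculate_key=None):
    if trigger_type == "KV-IUT":
        old_key, new_key = old_value, new_value   # direct dereference only
    else:                                          # K-IUT: full recalculation
        old_key, new_key = recalculate_key()       # runs the SBQL interpreter
    index[old_key].discard(nonkey_id)              # adjust the index entry
    index.setdefault(new_key, set()).add(nonkey_id)

idx = {"Kraków": {"i31"}}
on_update("KV-IUT", idx, "i31", "Kraków", "Warszawa")
```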
A next optimisation takes advantage of aggregate objects which are used to
model a collection of objects with the same name and type. Aggregate objects are a
physical optimisation for searching subobjects, used when the cardinality of a subobject is
not singular. The parent object, instead of multiple subobjects with the same name,
contains one aggregate subobject. Calling all subobjects by their common name is
achieved through the mediation of the aggregate, which is their direct parent.
If similar IUTs refer to such a collection of objects, then their aggregate parent can be
equipped with the identical IUT; consequently, it can be automatically propagated to
newly created aggregate object children. For example, let us consider the idxPerAge index
and adding a new Person object to the database. The new Person object is not added
directly to the database's root, but to a Person aggregate object. Therefore, the correct
NK-IUT is simply propagated from the aggregate and does not need to be generated by the
R-IUT, which is more complex since it requires an additional verification procedure.
In the current implementation the index updating mechanism works within the
scope of an atomic database CRUD operation. Still, often even a single statement can
cause several changes to the database. In many cases it would be optimal to gather the
necessary information during the execution of a series of atomic operations and to delay
index updating and index update triggers revision to the very end of a complex
operation. This would however require cooperation with the database transaction
mechanism, which is still under development in the implemented prototype. The
following example statement:
(Dept where name = “HR”).address := ("Warszawa" as city, "Koszykowa" as street)
consists of four atomic CRUD operations: first the deletion of the i144 city and i145
street objects and next the creation of new city and street objects. In case of the
idxEmpWorkCity index it results in running index updating at least three times for the
i31 EmpStudent non-key object (only deletion of the i145 street object is not connected
with K-IUTs). However, it would be efficient (approximately three times faster) to
execute the index updating mechanism only before the first atomic deletion to gather
necessary information and again after completing the creation of objects. The depicted lazy
index updating strategy would be optimal if most of the index maintenance routines
occurred in a database's idle time, i.e. after a data modifying statement's execution
but before the next index invocation.
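The lazy strategy can be sketched as collecting the affected (index, non-key object) pairs during the atomic operations and recalculating only once at the end of the complex operation. An illustrative Python sketch with hypothetical names (not the ODRA implementation, which would additionally need the transaction mechanism):

```python
# Sketch of lazy index updating: atomic CRUD operations only record which
# (index, non-key object) pairs are affected; the actual recalculation runs
# once, when the complex operation finishes. Hypothetical names.

class LazyIndexUpdater:
    def __init__(self):
        self.pending = set()
        self.recalculations = 0

    def on_atomic_modification(self, index_name, nonkey_id):
        self.pending.add((index_name, nonkey_id))    # remember, do not update

    def flush(self, recalculate):
        for index_name, nonkey_id in self.pending:   # one update per pair
            recalculate(index_name, nonkey_id)
            self.recalculations += 1
        self.pending.clear()

lazy = LazyIndexUpdater()
# three atomic operations of one statement touch the same index entry,
# e.g. replacing the address of a Dept observed through idxEmpWorkCity
for _ in range(3):
    lazy.on_atomic_modification("idxEmpWorkCity", "i31")
lazy.flush(lambda idx_name, nk: None)
```

Because the pending set deduplicates the pairs, the three trigger activations collapse into a single recalculation, matching the roughly threefold saving estimated above.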
The last approach to the optimisation of index updating concerns not only efficiency
but also decreasing the database load by removing unnecessary IUTs.
If an identical K-IUT refers to a collection of identical subobjects, then only their
aggregate parent can be equipped with the K-IUT instead. This solution would reduce
space occupied by the IUTs in case of more complex indices (e.g. in the
idxDeptYearCost index employees of a department are accessed using references inside
employs aggregate object); nevertheless, the index updating mechanism has to
additionally check for update triggers of the parent aggregate objects.
As a result of introducing the transaction mechanism together with aggregate
objects, maintaining some of the IUTs would be unnecessary. This method must precisely
consider the architecture of a database's store and the properties of an object-oriented
query language. In the current implementation aggregate objects are automatically created
for a complex object containing subobjects with cardinality different than singular.
Therefore, a statement cannot create a new direct child of an existing complex object
unless the appropriate subobject was earlier deleted within the processing of the given
statement. The K-IUT connected with complex objects would not be necessary
because a similar trigger, responsible for the preparation of the index update
mechanism, would have been started earlier, during the deletion of subobjects of the
complex object.
For instance, in the previous example concerning the idxEmpWorkCity index and
modifying the address of the HR department, deleting the i144 city subobject should
initialise the index updating mechanism. Therefore, the K-IUT associated with the i143
address object would not be necessary. Similarly, for the given i31 EmpStudent non-key
object the K-IUT for the i141 Dept object can also be omitted.
The majority of the optimisations sketched above, proposed by the author, are
implemented in the ODRA database prototype. Modifications which take
advantage of the transaction mechanism are planned to be implemented together with
the development of transactions in ODRA. Another source of potential optimisations
concerning index maintenance for indices based on path expressions is the research
literature, e.g. [10, 11, 12].
4.3.6 Properties of the Solution
The proposed index updating mechanism meets the guidelines set out in the
introduction to subchapter 4.3 and offers several supplementary advantages, i.e.:
• each modification to the indexed data is automatically reflected in the contents of the appropriate indices,
• index updating routines do not influence the performance of retrieving information from the database,
• index updates are triggered only in case of modifications concerning objects used to access the indexed objects or to determine key values,
• a modification to a single key value introduces an additional time overhead comparable to the time of calculating the given key value twice and modifying the index records,
• automatic index updating performance can be improved by the many optimisations described in subchapter 4.3.5,
• the basic solution is independent of:
o the query language and execution environment (it does not require additional routines during compile time or run time),
o the index structure,
• generic support is provided for a variety of index definitions (including usage of complex expressions with polymorphic methods and aggregate operators).
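The overhead profile stated above for a single key-value modification can be sketched as follows (hypothetical names; the real ODRA routine differs): two evaluations of the key expression, plus the removal of the old index record and the insertion of the new one.

```python
# An illustrative sketch (hypothetical names, not ODRA code) of the cost of
# handling a modification to an object carrying a K-IUT: two evaluations of
# the key expression (before and after the change) plus two index-record
# modifications (removing the old entry, inserting the new one).

def on_key_object_modified(index, key_expr, non_key_ref, store, apply_change):
    old_key = key_expr(store, non_key_ref)    # 1st key evaluation
    apply_change(store)                       # the triggering modification
    new_key = key_expr(store, non_key_ref)    # 2nd key evaluation
    if new_key != old_key:
        index[old_key].remove(non_key_ref)                  # record removal
        index.setdefault(new_key, []).append(non_key_ref)   # record insertion
```

Here `key_expr` stands in for an arbitrary deterministic key expression, e.g. a path expression navigating from the non-key object to a department name.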
On the other hand, the proposed solution to the index updating issue introduces
many additional database structures. Unfortunately, almost every object used to access
indexed objects or to calculate a key value must be equipped with appropriate IUTs (some
exceptions are depicted in subchapter 4.3.5). This is caused by the properties of the SBQL
query language, which in many situations make it difficult or even impossible to predict,
based only on an index definition, which objects should trigger an index update.
Nevertheless, the author does not exclude the possibility of developing more optimisation
methods for this aspect of the index updating process. In particular, such space-preserving
optimisations can be easily introduced for very simple indices, e.g. on object attributes.
4.3.7 Comparison of Index Maintenance Approaches
The ODRA OODBMS, like the implemented index updating mechanism, is a
proof-of-concept prototype. There is a great deal of available relational,
object-relational and object-oriented databases based on different paradigms and
exploiting different index structures for diverse applications. Such systems generally
lack detailed efficiency comparisons with other existing solutions with respect to
maintaining index cohesion. A fair comparison of approaches can be conducted by
considering general properties of index maintenance and its influence on the capabilities
of database indexing. Thus, a comparison of efficiency between the proposed
solution and the solutions applied in other systems is omitted.
The overview of the indexing features of many existing products and some
prototype approaches was given earlier in subchapters 2.3, 2.4 and 2.5. Routines
responsible for index maintenance in relational and object-relational databases are
straightforward and therefore simple. An undoubted advantage of the index updating
approach in the majority of relational databases is economical usage of the data store. The
information necessary for the mechanisms maintaining cohesion between data and
indices is associated with table columns, as it is identical for each row. The quantity
of such information is therefore independent of the quantity of data stored in tables
(i.e. the number of table rows). Similarly, object-oriented databases associate automatic
index updating mechanisms with a whole collection or a class rather than with an
object. In contrast, in the implemented solution for the ODRA object-oriented database,
IUTs are in many cases stored together with complex objects and atomic objects
containing values. Fortunately, the majority of the database store space is occupied by the
data rather than by the redundant information. This situation is acceptable considering
that nowadays databases administer a very large amount of memory (or disk) space.
The proposed implementation of fully transparent indexing in the Stack-Based
Approach enables the creation and automatic maintenance of indices with keys
defined using arbitrary deterministic expressions, including method invocations (also
polymorphic, and aggregate functions), e.g.:
• idxEmpDeptName – key based on the worksIn.Dept.name path expression,
• idxDeptYearCost – key based on the sum(employs.Emp.salary) * 12 expression,
• idxEmpTotalIncomes – key based on the Emp class method getTotalIncomes().
This method is overridden for instances of the EmpStudent class.
As stated in subchapter 2.5, the properties of a query language (e.g. SQL), a lack
of appropriate object-oriented extensions or a primitive approach to index
maintenance limit the definition of advanced indices. For these reasons, most of the
advanced object-relational transparent indexing approaches, including SQL Server
(computed columns) and Informix (functional indexes), do not provide sufficient
support to introduce indices with complexity similar to the ones presented above. Similarly,
the IBM DB2 Universal Database, in spite of offering Index Extensions, which are
very powerful indexing tools, has not provided sufficient transparent solutions.
Among OODBMSs only GemStone products enable indexing based on
a path expression like the idxEmpDeptName index.
The Oracle function-based index feature, despite the lack of support for path-expression-based
indices, provides facilities for creating an index similar to the
idxEmpTotalIncomes index. The test conducted in section 2.5.1 used the schema in
Fig. 2.3, partially corresponding to the object store in Fig. 3.1. The created
emp_gettotalincomes_idx Oracle index is based on an analogous polymorphic method.
The disadvantage of Oracle's equivalent of the discussed index concerns its
influence on database performance. Index updates occur in the case of any
modification to the indexed table, not only modifications concerning the columns used to determine
key values. An attempt to introduce in Oracle an index dept_getyearcost_idx
corresponding to the idxDeptYearCost index was unsuccessful. Modifications to the
table with data about employees, which were used to calculate the index key, caused
dept_getyearcost_idx to lose cohesion with the data. No similar errors occur in
maintaining the idxDeptYearCost index in the ODRA implementation.
Advanced approaches to indices based on path expressions are described in
many research documents, e.g. [10, 43]. The index maintenance issue is usually solved
by preserving additional information inside the index structure, which enables efficient
and correct index updating. To the best of the author's knowledge, the implemented solutions
concerning transparent index maintenance presented in the research literature or
incorporated in commercial products apply to a specific family of index definitions
and cannot be considered generic.
A generic, though not implemented, solution for the maintenance of function-based
indexes is defined in [46]. Similarly to the ODRA implementation, index updating
information is connected with objects associated with indices.
In contrast to all the solutions to the automatic index updating issue presented above,
the author's approach based on Index Update Triggers implemented in ODRA
provides transparent, complete and generic support for a variety of index definitions.
Moreover, the additional data modification cost associated with index maintenance
concerns exclusively objects used to access the indexed objects or to determine a key
value. One can argue about the increased storage cost caused by IUTs. Nevertheless, as
shown in [10, 11, 12], the maintenance of indices defined using complex expressions
requires introducing a lot of additional information into the index structure (not only
entries of indexed objects according to the values pointed to by path expressions). Another
advantage of the author's IUTs set on objects used to determine a key value is that they
include a direct reference to the indexed object, whereas other solutions [10, 43] are often
forced to identify it indirectly (e.g. by reverse navigation methods, or by accessing the key
value first and looking up the indexed object in the index).
4.4 Indexing Architecture for Distributed Environment
The different aspects of indexing presented in this chapter form a complete
architecture of local index management and maintenance. The local
indexing strategy is explained in subchapter 2.6. It relies completely on the local indexing architecture
and general optimisation methods for distributed query processing (i.e. global query
decomposition). Therefore, analysis of this strategy is considered straightforward and is
omitted in this subchapter.
The discussed global indexing architecture concerns homogeneous, horizontally
fragmented data at the integration schema level. It is the approach to the integration of
distributed resources currently being developed in the ODRA prototype. The integration schema
describes how data and services residing on local servers are to be integrated. It consists
of individual schemas. The idea of a schema is a combination of an interface known
from object-oriented programming languages and a typical database schema. A
schema is an abstract description specifying objects with attributes and methods that
must be provided by a group of servers contributing to the given schema. Nonetheless,
local servers implementing the schema retain wide autonomy. Contributed objects
can be either materialised or virtual using SBA views. They can contain additional
attributes and methods not included in the schema. Moreover, the contributing
servers provide their own implementations of object methods and can transparently take
advantage of inheritance and polymorphism. Generally, schemas enable type-safe
querying of integrated horizontally fragmented data. A query addressing an integration
schema is decomposed into parts referring to individual schemas, and the appropriate subqueries are sent to servers to be evaluated locally in parallel. The local evaluation
differs depending on the local schema implementation.
According to the taxonomy presented in [124], the global indexing strategy
proposed in this subchapter corresponds to a Non-Replicated Index with Index
Partitioning Attribute indexing schema. This is the result of the following factors:
• the selected distributed index structure – i.e. the basic SDDS variant – does not replicate parts of an index on different servers,
• in the global indexing strategy, data partitioning and index partitioning are orthogonal.
The data integration approach does not imply any concrete data partitioning method;
hence, a distributed index can be spread over more servers than the data. In that
context, the described indexing schema is not entirely compatible with the presented
taxonomy. Similarly, the centralised indexing strategy is not taken into consideration in
the taxonomy.
More advanced and complex data integration, e.g. involving mixed
fragmentation, data heterogeneity and replication, can be implemented on top of the
presented integration schemas using updateable views (see subchapter 3.5). Such
solutions are a topic described in works, e.g. [68], and many research papers, e.g. [2, 39,
60, 61], including ones contributed by the author [63, 64, 131].
The next section discusses the proposed approach to indexing management and
index maintenance in a distributed object-oriented database. To conclude this subchapter,
an example of indexing in a global schema is presented.
4.4.1 Global Indexing Management and Maintenance
Let us consider creating a global index defined on a schema, addressing a
horizontally fragmented collection stored on several servers (contributing sites). First, an
appropriate index structure is created. Stored non-key values consist of an indexed
object reference together with information on its origin, i.e. the contributing site
identifier. A global index can be centralised, i.e. located on one server, or distributed
between several indexing sites over the database. Regardless of the indexing strategy,
such an index must be made available to many servers. Locally it can be represented by
a proxy forwarding index calls. A centralised index communicates with the proxies on
servers directly. In the case of a distributed indexing strategy, an individual proxy can
forward index calls to an arbitrary indexing site hosting an index part. Optimally, a
proxy may transparently become a part of the distributed index. Further processing of an
index call and communication between indexing sites depend on a particular index
implementation. For example, the linear hashing implementation discussed in subchapter
4.1 can be used for centralised indexing. It can be extended to an SDDS distributed
index in order to preserve indexing properties and enable parallel processing.
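The proxy arrangement described above can be sketched as follows (invented class names, not ODRA code; a plain hash of the key stands in for linear hashing / SDDS addressing). With a single indexing site the proxy behaves as a front for a centralised index; with several sites it forwards each call to the site owning the key.

```python
# A minimal sketch (invented names) of index-call forwarding: entries carry
# the indexed object reference plus its contributing-site identifier, and a
# local proxy routes each call either to the one centralised site or, in the
# distributed variant, to the site selected by a partitioning function.

class IndexingSite:
    def __init__(self):
        self.entries = {}   # key -> list of (object reference, origin site)

    def insert(self, key, ref, origin_site):
        self.entries.setdefault(key, []).append((ref, origin_site))

    def lookup(self, key):
        return self.entries.get(key, [])

class IndexProxy:
    def __init__(self, sites):
        self.sites = sites  # one site = centralised; several = distributed

    def _site_for(self, key):
        # Stand-in for linear hashing / SDDS addressing.
        return self.sites[hash(key) % len(self.sites)]

    def insert(self, key, ref, origin_site):
        self._site_for(key).insert(key, ref, origin_site)

    def lookup(self, key):
        return self._site_for(key).lookup(key)
```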
The next step of creating the global index is its registration. Subchapter 4.2
proposes an organisation of index management which can be applied in its entirety to
indices at the level of the integration schema. The auxiliary information provided by the
index manager, which is needed by the index optimiser and the static evaluator, is
used in the same way as in the case of local indexing. The main difference lies in the
fact that information about global indices must be replicated together with the
integration schema on all servers which can utilise it. Obviously, the indices referenced
by the index manager can be local proxies enabling communication with the
appropriate centralised or distributed index.
Next, the index manager initialises populating the index. According to
the author's approach presented in subchapter 4.3, this is connected with the activation of
automatic index updating. Again, this mechanism relies mainly on the local index
maintenance architecture. It is essential that the currently considered data distribution
model does not allow storing references to remote objects; therefore, in the presented solution it is
assumed that each key value can be calculated within an indexed object's site. As a result,
the index manager delegates the activation of index maintenance to the contributing sites,
where appropriate Trigger Definitions are created according to the index definition.
Next, Index Update Triggers are generated locally and independently. During this
operation objects are inserted into the global index. If local index maintenance routines
evaluating non-key or key expressions encounter elements, such as view
invocations or links to remote databases, that make automatic index updating
impossible, then an appropriate error message is sent to the global index manager and
the creation of the global index is cancelled.
Concluding, populating the global index and the further transparent index
maintenance are provided mainly locally by the architecture presented in Fig. 4.6, where
only the index manager and database indices are global (in contrast to the case discussed
in subchapter 4.3).
The final element of the indexing architecture, i.e. the approach to index
transparency from the point of view of query processing, is the topic of Chapter 5. The
presented solution is general, as it applies equally to indexing at the local and global levels.
4.4.2 Example on Distributed Homogeneous Data Schema
Let us consider a schema describing horizontally fragmented data presented in
the Figure below:
Fig. 4.16 Example database schema for data integration
It comprises three interfaces defining what attributes and methods the contributed
Person, Emp and Dept collections of objects must contain. Contributing sites have to share
data fitting the given integration schema. The actual schemas of contributing sites can be
distinct. An example database schema that matches the one presented above was introduced
in Fig. 3.1. Differences in a local schema, e.g. other collections, inheritance
relations between collections, or extra attributes and methods, do not matter as long as the
local schema contains the elements required by the integration schema.
Let us consider the creation and operation of an idxEmpDeptName global index
using the derived attribute worksIn.Dept.name to retrieve Emp objects. First, an
appropriate empty, centralised or distributed index structure is initialised and made
available among the distributed database servers. Next, it is registered by the index
manager and information is generated which is necessary for the index optimiser and static
evaluator modules working on queries addressing the integration schema. In the final
step of the global index creation, the index manager initialises automatic index updating
mechanisms on the contributing sites. On each site this operation causes the following steps
(described in detail in section 4.3.2):
• according to the index definition, Trigger Definitions are created,
• a Root Index Update Trigger is added to the database's root,
• Nonkey Index Update Triggers associated with objects belonging to the Emp collection are generated,
• for each non-key object the key value is calculated, the objects used to determine it are equipped with Key Index Update Triggers, and a corresponding index entry is added to the global index.
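The per-site population step listed above can be sketched as follows (hypothetical names and structures): for every local non-key object the key expression is evaluated fully locally and an entry carrying the object reference plus the contributing-site identifier is added to the global index.

```python
# A sketch (hypothetical names) of populating the global index from one
# contributing site: the key expression is evaluated within the indexed
# object's own site, and each entry records both the object reference and
# the identifier of the site it originates from.

def populate_from_site(global_index, site_id, emp_objects, key_expr):
    for ref, emp in emp_objects.items():
        key = key_expr(emp)  # e.g. worksIn.Dept.name, computable locally
        global_index.setdefault(key, []).append((ref, site_id))
```

Because each site runs this step independently, entries from different contributing sites accumulate under the same key in the shared structure.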
It is significant that the evaluation of the key expression worksIn.Dept.name in the context
of an Emp object can be performed completely on a contributing site, since the
integration model restricts a worksIn reference to point to a local Dept object. This
makes the indexing architecture simple and effective. Consequently, changes affecting
indexed data are detected locally within a contributing site, and an
appropriate global index update command is issued independently by the local index maintenance
mechanisms. Finally, the index optimiser does not distinguish between local and global
queries when applying the indices available in the schema a query addresses.
Similarly, it is possible to create and utilise in the integration schema depicted in
Fig. 4.16 almost all of the indices introduced in section 4.1.2 which apply to Fig. 3.1. The
only index that cannot be created by the administrator as a global one is idxEmpSalary, because
in the integration schema Emp objects are devoid of a salary attribute.
There exist several other aspects that an implementation of indexing in a distributed
environment should consider. The main problems concern the dynamic joining and
disconnecting of contributing or indexing sites and distributed transaction
management. However, there exists a variety of solutions addressing these issues that
can be applied, e.g. [29, 76].
Chapter 5
Query Optimisation and Index Optimiser
The research on the optimisation of SBQL queries resulted in the work [93], which deeply
investigates this issue, and in many papers, e.g. [94, 95, 96, 97, 98, 99, 100, 122]. The
goal of the developed optimisation methods is similar to that of optimisation in
RDBMSs [20, 29, 54, 55]. The original query is processed in order to improve its
efficiency by modifying its default evaluation plan while at the same time preserving its
semantics.
In the implemented approach query optimisation is achieved through query
transformations, mostly efficient, reliable and easy-to-implement query rewriting
methods. In contrast to relational optimisers, no other intermediate query
representations, e.g. an object-oriented algebra, are applied. The transformation processes
are facilitated by static query analysis (sketched in subchapter 3.4). Query optimisation
exploits information about the size of the environment stack during the evaluation of
query parts in order to:
• equip each non-algebraic operator occurring in a query with the number of the ENVS section which it opens,
• assign the current size of ENVS to each name when it is bound, together with the number of the section where the binding is performed.
Static query analysis also facilitates locating query parts which raise a threat of run-time
errors.
One of the most important methods exploiting information from the static
analysis is factoring out independent subqueries [95, 97]. Frequently a database query
contains a subquery for which all names are bound in sections different from the one opened by
the currently evaluated non-algebraic operator. Such a subquery can be evaluated before
this operator puts its section onto ENVS. Consequently, the calculation of this subquery
is planned earlier than would result from the original query syntax tree. This operation
is vital in the optimisation of non-algebraic operator evaluation, because it prevents
processing the subquery multiple times when its result is always the same. Let
us consider the query which retrieves the surnames of the employees who earn as much as the
employee with the surname “Kuc”:
(Emp where salary = (Emp where surname = “Kuc”).salary).surname
The SBQL optimiser rewrites it to the following form:
((Emp where surname = “Kuc”).salary groupas salaux).
(Emp where salary = salaux).surname
The independent subquery, which determines the salary of the given employee, is
factored out and therefore calculated only once, at the very beginning. Its result is
stored inside the salaux binder and is repeatedly accessed by the where clause in order
to compare the salaries of all employees.
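The effect of this rewriting can be illustrated with a procedural analogy in Python (invented sample data, not SBQL): the inner selection for "Kuc" is invariant with respect to the outer iteration, so evaluating it once, as the salaux binder does, yields the same result as re-evaluating it for every outer Emp.

```python
# Invented sample data illustrating factoring out an independent subquery.

emps = [{"surname": "Kuc", "salary": 3000},
        {"surname": "Nowak", "salary": 3000},
        {"surname": "Bak", "salary": 2500}]

def kuc_salary():
    # Corresponds to (Emp where surname = "Kuc").salary
    return next(e["salary"] for e in emps if e["surname"] == "Kuc")

# Original form: the independent subquery runs once per outer element.
naive = [e["surname"] for e in emps if e["salary"] == kuc_salary()]

# Factored form: the subquery is evaluated once, up front, and bound to an
# auxiliary name (playing the role of the salaux binder).
salaux = kuc_salary()
factored = [e["surname"] for e in emps if e["salary"] == salaux]
```

Both forms produce the same result, but the factored one evaluates the inner selection once instead of once per employee.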
Other SBQL optimisations, which are also implemented in the ODRA prototype,
take advantage of the distributivity property of some SBQL operators (e.g. pushing
selection before join), use redundant database structures (e.g. indices, caching) or
perform other query transformations (e.g. removing auxiliary names, removing dead
subqueries). Some of these methods are discussed in the context of indexing in
further sections.
5.1 Query Optimisation in the ODRA Prototype
ODRA (Object Database for Rapid Application development) [2, 119] is a
research platform providing database application development tools. The essential
features of the prototype are: a run-time environment integrated with an
OODBMS, the SBQL query language, an optimisation framework, etc.
This subchapter presents the internal architecture of the ODRA
optimisation framework. Its schema is presented in Fig. 5.1; it contains data structures
(dashed-line figures) and program modules (grey boxes). The architecture reflects only
the most important components from the point of view of query optimisation and
processing. Each ODRA instance can work as a client and as a server; the client/server
subdivision is introduced to increase comprehensibility. A server can service many
clients and a client can communicate with many servers.
Fig. 5.1 also illustrates the general SBQL query processing flow. First, a query is
parsed from its textual form to an equivalent query syntax tree. The processing then
proceeds through suitable transformations of the syntax tree, according to the
numbers in the schema:
1. Static evaluation adds necessary operators (e.g. casts and dereferences), and
equips the query syntax tree with signatures which facilitate the optimisers.
2. The query syntax tree is processed through the chain of optimisers in an appropriate
order. Each optimiser rewrites the query and returns its syntax tree with the current
set of signatures. The index optimiser is considered one of these optimisers;
however, it additionally employs the index manager module.
3. The syntax tree of the optimised and type-checked query is sent for further
compilation and evaluation to a suitable ODRA module.
Fig. 5.1 ODRA optimisation architecture [2]
5.2 Index Optimiser Overview
The index optimiser is the main mechanism responsible for reorganising queries
in order to take advantage of the available indices. It is one of the optimisers which can be
used in the query optimisation process. The index optimiser is essential to ensure one of
the most important indexing properties – index transparency. During the compilation of ad
hoc SBQL queries or of ODRA modules, which often contain unoptimised queries in
procedures, updateable views, generic procedures and class methods, queries are
processed by the index optimiser in order to improve their efficiency.
Fig. 5.2 illustrates the index optimisation process and all vital cooperating
ODRA elements.
Fig. 5.2 Schema of the index optimiser
The index optimiser's input is a query which has already passed through static
evaluation. Therefore, its syntax tree nodes are equipped with signatures containing
typing information. The index optimiser adds index calls to the query and performs the
necessary modifications. The most important issue concerning all optimisation methods
is to preserve query semantics while rewriting, so that the optimisation does not affect the
query evaluation result. The transformed query must also preserve typing constraints.
The index optimiser communicates with the following ODRA modules:
• Index manager – provides information about the indices set on the database's objects. This information is internally ordered and enables the index optimiser to find indices according to their non-keys as well as their keys.
• Metabase – provides a detailed description of the database's schema. The index optimiser uses information about indices from the metabase to determine if an index call can substitute a fragment of the query.
• Cost model – holds statistical information about the properties of database objects' attributes. When choosing between alternative index combinations, the index optimiser uses the cost model to pick the best solution.
• Static evaluator – calculates signatures in a query syntax tree. Each time the index optimiser applies an index, the modified part of the syntax tree is filled with the description of types.
An example scenario of a query syntax tree transformation applied by the index
optimiser is shown in Fig. 5.3.
Fig. 5.3 Example optimisation applied by the index optimiser
The given query retrieves persons with the surname “KOWALSKI” who are 28
years old:
Person where ((surname = “KOWALSKI”) and (age = 28))
The index optimiser applies the idxPerAge index, which retrieves Person objects
according to their age attribute, and rewrites the query to the following form:
$index_idxPerAge(28 groupas $equal) where surname = “KOWALSKI”
Fig. 5.3 shows that first the predicate age = 28 is selected and removed. The index
optimiser replaces the left operand of where (Person) with an index invocation exactly
matching the removed predicate. This transformation preserves semantic equivalence.
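The transformation in this example can be sketched as follows (a simplified model with invented structures, not the actual ODRA optimiser): from a conjunction of selection predicates, the one matching an index key is removed, and the collection name is replaced by an index invocation for the matching key value.

```python
# A simplified sketch (invented structures) of the rewriting: the first
# predicate covered by an available index is cut out of the conjunction and
# the collection name is substituted by an index call for that key value.

def apply_index(collection, predicates, indices):
    # predicates: list of (attribute, value) pairs joined by "and";
    # indices: collection name -> {key attribute -> index name}.
    for i, (attr, value) in enumerate(predicates):
        if collection in indices and attr in indices[collection]:
            remaining = predicates[:i] + predicates[i + 1:]
            call = f"${indices[collection][attr]}({value!r} groupas $equal)"
            return call, remaining
    return collection, predicates   # no applicable index: query unchanged
```

Run on the example above, the age predicate is consumed by the index call while the surname predicate stays as an ordinary selection.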
5.2.1 General Algorithm
The proposed and implemented solution works in the context of where operators
(which in SBA are responsible for selection) whose left operand is
indexed by key values matched against the selection predicates of the right operand. However, it is also possible
to take advantage of indexing when dealing with another non-algebraic SBQL
operator, i.e. the forany quantifier. This case is explained in section 5.5.4.
The general index optimiser algorithm (shown in Fig. 5.4) attempts optimisation
for each single where operator or group of nested where operators found in the query. Let us consider
optimising the given where branch:
qOBJ where qP1 where qP2 where … where qPn
All subqueries – the left operand qOBJ generating objects for selection and all operands
qP1, qP2, …, qPn defining selection predicates – may contain internal where clauses. In the
case of queries qP1, qP2, …, qPn some of the where clauses can be pushed outside the
analysed branch using the method of factoring out independent subqueries (described in
section 5.6.1), but not all of them. Some selection predicates can contain potentially index-applicable
where clauses which partially depend on the main where operator, e.g.
Emp where ((age as empage).
(salary > avg((Emp where age = empage).salary)))
Such selection predicates could be the subject of a standalone indexing optimisation.
Therefore, the operands qP1, qP2, …, qPn should be processed by the index optimiser
separately.
Where clauses can also be found inside the query tree of the left operand qOBJ. In this
case, no regular index is applicable to the main selection clause (only simple path
expressions can define regularly indexed objects, cf. section 4.2.1). However, before
optimising where clauses inside qOBJ, the index optimiser should first try to optimise the
given branch using other index-related optimisation techniques, e.g. volatile indexing
(described in subchapter 7.1). This order should be preserved so that the qOBJ query
remains unchanged during the analysis of the main branch.
Fig. 5.4 Index optimiser algorithm
For each where branch the first object of the analysis is the left operand qOBJ.
The query qOBJ has to be completely independent, so the optimiser checks the node
signatures of the query to verify that it is bound in the lowest ENVS section of the database
(numbered 1). The necessary information is provided by the static evaluator. If there are
multiple base sections, it is necessary to check whether binding will be performed in a
database section. Next, qOBJ is used as a key for the Nonkey Structures Index, which is
maintained by the index manager (a detailed description is in subchapter 4.2). As a result,
the index optimiser has access to the necessary information concerning the indices set on
the left operand. If suitable indices are found, the algorithm proceeds to match selection
predicates. If not, it skips to the next where branch.
5.3 Selection Predicates Analysis
The most important and complex index optimiser routines concern the analysis of
selection predicates. The analysis directly precedes selecting the best index and query
rewriting. The central part of the algorithm focuses on clauses which consist of one or
several nested where operators:
qOBJ where(1st) qP1 where(2nd) qP2 where(3rd) … where(n-th) qPn
The right operand of the first where operator, qP1, defines selection predicates which
address the objects returned by the query qOBJ. Consecutively, qP2, …, qPn concern the
following where expressions. The most frequently used form of a where clause is
defined by a single where expression:
qOBJ where qP1
An object for which all queries qP1, qP2, …, qPn return true passes the selection.
First all objects are confronted with query qP1. Those which match qP1 predicates are
passed to the next where expression and query qP2 is evaluated. This process is repeated
for all where expressions. From the point of view of a single object, the where operator
behaves like the short-circuit conjunction operator && known from many programming
languages. Therefore, when the qPi predicates return false, the subsequent predicates qPj
(where j > i) are skipped. This property is
often used to prevent run-time errors. E.g. in this way the following query can be
executed without a run-time error:
Person where exists(address.zip) where address.zip = 99726
whereas the query:
Person where exists(address.zip) and address.zip = 99726
will cause a run-time error when at least one Person object does not contain the
subattribute zip derived from the address attribute. The SBQL semantics of the and
operator assume that both the left and the right operand are always evaluated.
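The difference between the nested where form and the eager and operator can be mimicked in Python (a sketch only, not SBQL semantics in full; a missing zip subattribute is modelled by a KeyError, and Python's `&` is used to force evaluation of both operands, as SBQL's and does):

```python
# Sketch contrasting the two queries above. A missing address.zip raises
# KeyError, mimicking the SBQL run-time error; the nested where operators
# short-circuit per object, a plain `and` would not.

people = [
    {"name": "Jan", "address": {"zip": 99726}},
    {"name": "Ala", "address": {}},            # no zip subattribute
]

def zip_of(p):
    return p["address"]["zip"]                 # KeyError when zip is absent

# Person where exists(address.zip) where address.zip = 99726
# Python's `and` short-circuits, like the nested where operators:
safe = [p for p in people if "zip" in p["address"] and zip_of(p) == 99726]
print([p["name"] for p in safe])               # ['Jan']

# Person where exists(address.zip) and address.zip = 99726
# `&` evaluates both operands for every object, so Ala raises:
try:
    [p for p in people
     if ("zip" in p["address"]) & (zip_of(p) == 99726)]
except KeyError:
    print("run-time error")                    # reached for Ala
```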
The goals of the index optimiser rewriting are the following:
• Preserve semantic equivalence between the query rewritten by the index optimiser and the original input query, so that their evaluation is identical from the point of view of a database user and of the database’s and program’s state.
• Optimise selection by reducing the amount of data to be processed. The index optimiser takes advantage of indices by modifying qOBJ and thereby reducing the number of objects evaluated by where operators. Next, it adequately eliminates some selection predicates from the queries qP1, qP2, …, qPn.
Each query qP1, qP2, …, qPn can be represented as a conjunction of predicates,
i.e. sub-predicates joined with and operators; e.g. qP1 stands for m conjunct sub-predicates:
p1,1 and p1,2 and ... and p1,m
In the simplest case, there may be only a single sub-predicate. Each pi,j (where i ∈ {1, 2,
…, n}, j ∈ {1, 2, …, mi} and mi is the number of conjunct sub-predicates in the query qPi) is an
expression that should return a single boolean literal true or false. In particular it can be:
• a binary expression based on operators comparing a pair of values, e.g. =, <, >, ≥, ≤,
• a binary expression based on operators working with sets, e.g. in, contains,
• a binary expression based on non-algebraic quantifiers,
• other binary expressions, e.g. instanceof,
• some unary expressions, e.g. exists, not,
• a disjunction binary expression, i.e. the or operator (which may combine many selection predicates),
• other expressions not listed above.
The method by which the index optimiser deals with selection predicates based
on the or operator is an important issue, described in section 5.5.3.
5.3.1 Incommutable Predicates
The first step to identify indices that can be used is to find which predicates are
able to take part in the optimisation process. This is the most important stage from the
point of view of query semantics.
When a given index is applied, the where clause must contain sub-predicates
which specify the key-value criteria of the indexing function. The query qOBJ is
substituted by the index call and the mentioned sub-predicates are removed. Therefore,
the evaluation of these sub-predicates is moved to the very beginning, i.e. to the index
invocation. The number of objects evaluated by where operators decreases. Such an
operation is possible due to the commutativity of conjunction, known from logic theory:
(p1 and p2) = (p2 and p1)
However, in the case of SBQL where clauses not all sub-predicates can be freely
moved before the evaluation of the first where operator. In some conditions such a move
may lead to a discrepancy between the semantics of the original and the optimised query.
As a result of moving a sub-predicate pi,j (which belongs to the i-th where expression),
some objects normally evaluated by the i-th and preceding where expressions may be
skipped (this is the goal of the optimisation). Usually it is desired to decrease the amount
of data processed by predicates and where operators. Still, there are cases where moving
a predicate must be forbidden, i.e. when its evaluation:
• is not run-time safe,
• produces side effects, i.e. changes of the database or program state.
The first case occurs when an undesired or unpredicted (by a database
programmer) state of the database causes a run-time error during the predicate
evaluation. This situation is shown in the following example:
Person where address.zip = 99726
The zip attribute has cardinality [0..1]; therefore, if one of the evaluated Person objects
does not contain an address.zip attribute, a run-time error will occur. Using an index call,
which reduces the number of objects evaluated by such predicates, lessens the threat
of a run-time error in the optimised query. Unfortunately, such a transformation is
semantically incorrect.
The second case concerns predicates which contain side-effect-producing calls
to user-defined SBQL procedures, views or class methods. E.g. the predicate calling the
getScholarship method of Student objects in the query:
Student where getScholarship() = 1000
should be evaluated for all Student objects according to the query semantics. Nevertheless,
if getScholarship() only returns the scholarship attribute and does not introduce any
side-effects (e.g. incrementing an internal Student object counter of accesses to the
scholarship attribute), then the number of Student objects evaluated by this predicate can
be decreased. Otherwise such an optimisation may lead to unexpected query behaviour.
Both situations described above influence other database optimisers and
therefore should be identified by common compile-time processes.
In conclusion, the index optimiser verifies the queries qP1, qP2, …, qPn for predicates
that are run-time unsafe or cause side-effects. If such a predicate is found in the query qPk (the k-th
where clause), then no sub-predicates located in the queries qPk, …, qPn can be used by the
index optimiser. Otherwise, the k-th where operator would not always process all
objects, as it does in the case of the original query. After this verification, the index optimiser
focuses on the following part of the main where clause:
qOBJ where(1st) qP1 where(2nd) qP2 where(3rd) … where((k-1)-th) qPk-1
and ignores all sub-predicates in queries qPk, …, qPn.
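The verification step can be sketched as follows (illustrative Python; the `safe` flag stands for the combined run-time-safety and side-effect checks, which are not modelled here, and all names are hypothetical):

```python
# Sketch of the cut-off rule: given the queries qP1..qPn of consecutive
# where clauses, each flagged as safe (run-time safe and side-effect free)
# or not, only the prefix before the first unsafe clause may feed
# sub-predicates to the index optimiser.

def usable_prefix(clauses):
    """Return qP1..qP(k-1), where qPk is the first unsafe where clause."""
    usable = []
    for q in clauses:
        if not q["safe"]:
            break
        usable.append(q)
    return usable

clauses = [
    {"name": "qP1", "safe": True},
    {"name": "qP2", "safe": False},  # e.g. calls a side-effecting method
    {"name": "qP3", "safe": True},   # ignored even though safe on its own
]
print([q["name"] for q in usable_prefix(clauses)])   # ['qP1']
```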
5.3.2 Matching Index Key Values Criteria
After verifying the predicates, the index optimiser proceeds to match indices with
the selection criteria defined in the queries qP1, …, qPk-1. Let us assume that there exist m
indices ix1, ix2, …, ixm established on the objects defined by the qOBJ query, each consisting of
one or several keys iki,j (the j-th key of the i-th index).
Generally, an optimiser processes binary expression predicates based on =, <, >,
≥, ≤ and in operators. Such a single predicate pi,j (where i ∈ {1, 2, …, k-1}) consists of
two operands and a comparison operator:
left_operand operator right_operand
The index optimiser checks whether left_operand or right_operand defines any of the
keys iki,j used for constructing available indices. If one of the operands matches an
index key (key_operand), then the other is treated as the criterion value (value_operand).
The construction of an index key is described in section 4.2.1. The value_operand is
any query processed within a where operator but, in contrast to the key_operand, it must be
independent of this operator. For instance, the query:
Person where exists(salary) where salary > (age * 100)
prevents applying the index idxEmpSalary set on employees’ salary, because age changes
during the evaluation for each employee, as it is dependent on the nearest where operator.
If a processed predicate meets the conditions described above then the index
optimiser updates the information about suitable index key values criteria iki,j. All keys
support criteria based on the = and in operators. Range criteria (operators <, >, ≥, ≤)
require keys defined as range or enum (cf. section 4.1.1).
In order to process unary expressions which return a boolean value (e.g. the
exists operator or boolean attributes), they are treated as the following simple binary
expression:
unary_expression = true
which is semantically correct. Boolean keys are characterised by weak selectivity; thus,
they are not suitable for constructing single-key indices. However, they are useful in
multi-key indexing and therefore unary expression predicates are supported by the index
optimiser.
5.3.3 Processing Inclusion Operator
Generally, both comparison operands have singular cardinality, which is ensured
by the verification of predicates described in subchapter 5.3. The only exception concerns
the in operator, because its semantics do not constrain cardinality. Let us consider two
important variants of the in operator operands, depending on the location of the
key_operand.
The first variant is when the left_operand defines an index key:
key_operand in value_operand
If the left_operand has the cardinality [0..1], then replacing this predicate with an index
call would not be possible, because the in operator returns true if the left operand returns an
empty bag. For example, the query:
Person where address.zip in 99726
returns Person objects who have zip code equal to 99726 or have no address.zip
attribute; however, the index call:
idxPerZip(99726)
returns only objects with address.zip attribute equal to 99726. Such a transformation
would cause semantic inconsistency.
Since the key_operand must have maximal cardinality 1 (cf. section 4.2.1) the
index optimiser should in the discussed case accept only the singular cardinality of the
left_operand. The cardinality of the right_operand is not relevant because an index
invocation can deal with a collection of alternative key values (section 5.5.1).
The second variant is when the left_operand defines a key search value:
value_operand in key_operand
According to the inclusion operator semantics, it returns true when the left operand returns
an empty bag. On the other hand, if the left operand returned a collection of several
different values, the result would be false, because the key_operand can include
only one value. In both these cases the result of the inclusion does not depend on the
key_operand; hence, when the cardinality of the value_operand is not singular, the index
optimiser skips processing this predicate. In order to use an index, the value_operand
must return a single value.
It is worth noticing that in the second variant, when the cardinality of the
key_operand is [0..1], there is no threat of a run-time error: indices can be applied and
the inclusion operator can be used instead of the equality operator. For example, in the
query:
Person where 99726 = address.zip
the evaluation of the predicate may cause a run-time error; whereas, the following form
of predicate ensures safe evaluation:
Person where 99726 in address.zip
To conclude this section: the index optimiser can start matching the key-value
criteria defined by the inclusion operator only if the left_operand has singular
cardinality.
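The inclusion semantics discussed in this section can be summarised with a small sketch (illustrative Python, not SBQL; bags are modelled as lists and sub-bag containment plays the role of the in operator):

```python
# Sketch of the inclusion semantics: `x in y` holds when every element of
# the left bag occurs in the right bag, so an empty left operand always
# yields true and a multi-valued left operand yields false against a
# single-valued key.

from collections import Counter

def sbql_in(left, right):
    """True iff the left bag is a sub-bag of the right bag."""
    need, have = Counter(left), Counter(right)
    return all(have[v] >= n for v, n in need.items())

# Person where address.zip in 99726 -- an absent zip gives an empty left bag:
print(sbql_in([], [99726]))        # True: object passes despite missing zip
print(sbql_in([99726], [99726]))   # True
print(sbql_in([99725], [99726]))   # False

# Second variant, value_operand in key_operand with key cardinality [0..1]:
print(sbql_in([99726], []))        # False: no run-time error, just false
```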
5.4 Role of a Cost Model
Once all predicates are processed, the index optimiser possesses, for all available
indices, information about the key-value criteria found in the analysed where
clause. If the criteria are sufficient for applying one index or a few indices, the next step of
the optimisation is executed, i.e. choosing the best index (or a combination of indices).
The cost model is used to check whether applying an index improves efficiency and to select
the most selective index.
In some cases using two or more available indices in a single where clause is
difficult or impossible. For example in the query:
Person where surname = ”Nowak” and age = 30
let us assume that both indices idxPerSurname and idxPerAge exist and can be used.
After applying one of them e.g.:
idxPerSurname(“Nowak”) where age = 30
the second one has to be omitted, because the name Person was removed and the left
operand of the where lacks suitable objects for idxPerAge. In the similar situation of using idxPerAge:
idxPerAge(30) where surname = ”Nowak”
applying the idxPerSurname index becomes impossible. In such cases there exists the
possibility to use set intersection, by transforming the given query to the following
form:
idxPerSurname(“Nowak”) ∩ idxPerAge(30)
However, it is not certain that this operation will reduce the cost of evaluation. The profit
from using both indices can be seriously decreased by the necessity of computing the
intersection of partial results. In this simple example there are three possible ways to transform
and evaluate this query, and it is difficult to decide which of them is optimal in the sense
of evaluation time cost. Often the use of a single index, as in the previous examples, is
optimal; therefore, the index optimiser considers only this possibility.
The selection can be assisted by a proper model of the query evaluation cost, called the
cost model. Data collections differ in size, selectivity, cardinality, distribution and other
logical and physical features; thus, building a complete theoretical model of costs is
impossible. The cost model is therefore a heuristic-empirical model which can be
approximated through many experiments in a real environment. It can take advantage
of all measurable index properties and of the database meta-model. The
following elements of this model can be taken into account:
• the size of indexed object sets,
• index selectivity, i.e. the average number of non-key values returned as a result of a random index use,
• the data read time from disk (or another permanent memory),
• the execution time of operators used in the transformed query, e.g. the set intersection operator,
• the selectivity of a condition in a where clause, e.g. expressed as a percentage of selected objects,
• etc.
It is possible to take into account many other factors. The publications
concerning the optimisation of relational queries [90, 101, 102] give many patterns for building
such a model, which can be creatively adapted to a new optimisation approach
and a new database environment. A better cost model guarantees better optimisation
and, in the case of the index optimisation process, a better selection of indices to apply.
The idea of calculating the index selectivity is used in the implemented solution
and therefore is described in the next section.
5.4.1 Estimation of Selectivity
The selectivity is determined using the known concept of reduction factors
[102]. For SBA indexing the theoretical basis is presented in [93]. The reduction factor
for a selection predicate is the estimated ratio of the number of objects selected by the predicate to
the number of all objects to which that predicate was applied in the following query:
qOBJ where pSELECTION_PREDICATE
In case of the most popular atomic predicates:
key_operand operator value_operand
example reduction factors can be defined as follows [102] (we assume that values
generated by the key_operand are uniformly distributed):
• for key_operand = value_operand as
1 / valuesCardinalityOf(key_operand)
• for key_operand in value_operand as
countValuesOf(value_operand) / valuesCardinalityOf(key_operand)
• for key_operand > value_operand and key_operand ≥ value_operand (where both operands are real numbers) as
(HighestValueOf(key_operand) − value_operand) / (HighestValueOf(key_operand) − LowestValueOf(key_operand))
• for key_operand > value_operand (where both operands are integer numbers) as
(HighestValueOf(key_operand) − value_operand) / (HighestValueOf(key_operand) − LowestValueOf(key_operand) + 1)
• for key_operand ≥ value_operand (where both operands are integer numbers) as
(HighestValueOf(key_operand) − value_operand + 1) / (HighestValueOf(key_operand) − LowestValueOf(key_operand) + 1)
where valuesCardinalityOf(key_operand) returns the number of different values which
are returned by the query qOBJ.key_operand and countValuesOf(value_operand) returns
the number of values returned by the value_operand query.
In practice, estimating a reduction factor is very difficult, particularly in the case of
the range operators <, >, ≥, ≤, because the value_operand value can be unknown at
compile-time if it is not a literal. Therefore, the average reduction factor for atomic range
operators has been assumed to be 0.5. Similarly, calculating the reduction factor for
predicates based on the in operator requires knowledge about the number of elements
returned by the value_operand. Because of the simplified cost model, this number has
been assumed to be a constant value of 5 in all cases.
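These estimates can be summarised in a short sketch (illustrative Python; the function name and signature are not taken from the implementation, but the constants 0.5 and 5 follow the text above):

```python
# Sketch of the reduction-factor estimates, including the implementation
# constants mentioned in the text: 0.5 for range operators with an
# unanalysed bound, and 5 assumed elements for the `in` value_operand.

RANGE_DEFAULT = 0.5    # assumed factor for <, >, <=, >= with unknown bound
IN_COUNT_DEFAULT = 5   # assumed size of the value_operand collection

def reduction_factor(op, distinct_key_values, value_count=None):
    if op == "=":
        return 1.0 / distinct_key_values
    if op == "in":
        return (value_count or IN_COUNT_DEFAULT) / distinct_key_values
    if op in ("<", ">", "<=", ">="):
        return RANGE_DEFAULT
    raise ValueError(op)

print(reduction_factor("=", 1000))    # 0.001  (surname = "NOWAK")
print(reduction_factor(">", 100))     # 0.5    (age > 30, bound not analysed)
print(reduction_factor("in", 1000))   # 0.005  (5 assumed values / 1000)
```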
The cost model must also deal with complex selection predicates, i.e. ones made up of
two or more atomic conjunct predicates. The implemented solution generally assumes
that the sub-predicates assembling them are statistically independent. Consequently, the
reduction factor of a complex predicate is calculated as the product of the reduction factors
of the atomic sub-predicates it is formed of. For example, the query:
Person where surname = ”NOWAK” and age > 30
retrieves persons named Nowak who are older than thirty. The reduction factor s1 for
the atomic sub-predicate surname = “NOWAK” depends on the number of different
names in the database:
s1 = 1 / valuesCardinalityOf(surname) = 1 / 1000 = 0.001
and selectivity for another sub-predicate age > 30 is assumed as
s2 = 0.5
which gives the reduction factor of the whole complex predicate:
sel = s1 * s2 = 0.001 * 0.5 = 0.0005
Let us assume that two indices could be applied, i.e.
idxPerSurname(“NOWAK”) where age > 30
and
idxPerAge( ]30, ∞] ) where surname = “NOWAK”
where in a definition of a values range:
• ]minvalue – stands for the exclusive left limit of the defined values range,
• [minvalue – stands for the inclusive left limit of the defined values range,
• maxvalue[ – stands for the exclusive right limit of the defined values range,
• maxvalue] – stands for the inclusive right limit of the defined values range.
The index optimiser should select the first one, because it has a smaller reduction factor;
thus, it is more selective. Nevertheless, the best solution would be to apply the
multi-key index constructed on both keys surname and age (as a range key):
idxPerAge&Surname( ]30, ∞]; “NOWAK”)
In the author’s implementation the rule that sub-predicates are
statistically independent has one exception, namely, when there are two opposing range
predicates on the same key. Such predicate pairs improve selectivity and
therefore the cost model additionally multiplies the obtained reduction factor by a
constant value, i.e. 0.25. The selection of this value should be heuristic and empirical, as
it depends on the size of the range occurring in a processed query. For example, the
reduction factor for predicates in the query:
Person where age >= 23 and age < 28
is calculated in the following way:
sel = s1 * s2 * 0.25 = 0.5 * 0.5 * 0.25 = 0.0625
and the following index can be applied:
idxPerAge( [23, 28[ )
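The combination rule, together with the 0.25 correction for opposing range predicates, can be sketched as follows (illustrative Python with hypothetical names):

```python
# Sketch of combining atomic reduction factors: independence gives a
# product, and a pair of opposing range predicates on the same key is
# additionally scaled by the constant 0.25 used in the author's
# implementation.

RANGE_PAIR_BONUS = 0.25

def combined_factor(factors, has_opposing_range_pair=False):
    sel = 1.0
    for f in factors:
        sel *= f
    if has_opposing_range_pair:
        sel *= RANGE_PAIR_BONUS
    return sel

# surname = "NOWAK" and age > 30:
print(combined_factor([0.001, 0.5]))        # 0.0005
# age >= 23 and age < 28 (opposing ranges on one key):
print(combined_factor([0.5, 0.5], True))    # 0.0625
```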
Additionally, in the case of where clauses which consist of sub-predicates defined
with the or operator (cf. section 5.5.3), the cost model must enable estimating the
selectivity of predicates joined by a union of two or more where expressions. To
calculate the reduction factor of such a union, the reduction factors of the predicates in the
individual where clauses are summed. Let us analyse the following example.
The query:
uniqueref((Person where surname = ”NOWAK”) union
(Person where age > 30))
retrieves persons either with the surname Nowak or ones who are older than 30. The
reduction factors of the individual atomic predicates were calculated in the previous examples:
s1 = 0.001 and s2 = 0.5
The following formula constitutes the selectivity of the union of where clauses with
those predicates:
sel = s1 + s2 = 0.001 + 0.5 = 0.501
In this case, the cost model omits the cost of evaluating the union and uniqueref expressions
in order to simplify the index selection process. The example query can be
transformed to the following form:
uniqueref(idxPerSurname(“NOWAK”) union idxPerAge( ]30, ∞] ))
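The union estimate can be sketched in the same illustrative style (the cap at 1.0 is an added safeguard, not part of the described model):

```python
# Sketch of the union estimate: the reduction factors of the individual
# where clauses are summed; capping at 1.0 is an extra safeguard added
# here, since a factor above 1 would be meaningless.

def union_factor(branch_factors):
    return min(1.0, sum(branch_factors))

# (surname = "NOWAK") union (age > 30):
print(union_factor([0.001, 0.5]))   # 0.501
```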
5.5 Query Transformation – Applying Indices
After successfully selecting an index for optimising the query evaluation, the
index optimiser must rewrite the given where clause. A simple example of applying a
dense, single-key index was shown earlier in Fig. 5.3. The general idea behind rewriting
the query by the index optimiser is not complex; however, a few cases and the proposed
solutions should be discussed. Some elements presented in this subchapter, like the index
invocation semantics, are only implementation issues.
Generally, in the first step the algorithm generates a proper index call, which is next
used to substitute qOBJ, i.e. the left operand of the main where clause. Finally, the
unnecessary predicates are removed. In effect, the index optimiser generates a
completely new where clause containing the index invocation and replaces the original
where clause in the given query.
Generating an index call is the most complex task; therefore, the knowledge of
the proposed syntax is necessary.
5.5.1 Index Invocation Syntax
From the SBQL syntax point of view an index invocation is simply a procedure
invocation:
$index_<indexname> ( <key_param_1> [; <key_param_2> ...] )
The number of parameters is equal to the number of index keys. Each key parameter
defines a desired value of a key. An index function call returns references to objects
matching the specified criteria.
Names used to invoke indices contain the prefix $index_ for two reasons:
• to prevent database users from calling indices explicitly ($ is not accepted by the SBQL parser),
• to make it easier for optimisation developers and testers to identify index calls in optimised queries.
A key parameter expression can define a single value as a criterion. In that case
its evaluation should return an integer, double, string, reference or boolean value, or a
reference to such a value. In the author’s implementation, to pass a dense key value to an
index call, a binder named $equal is created using the groupas operator. E.g. in the
following call the parameter is a binder containing the integer value 28:
$index_idxPerAge(28 groupas $equal)
Binders are used to increase readability and to make introducing new types of
parameters for index calls easier.
To specify a values range criterion as a key value the parameter expression
should return a structure consisting of four parameters:
(<lower_limit>, <upper_limit>, <lower_closed>, <upper_closed>)
where:
<lower_limit> and <upper_limit> are key values specifying the range,
<lower_closed> is a boolean value indicating whether <lower_limit> belongs to
the criterion range,
<upper_closed> is a boolean value indicating whether <upper_limit> belongs to
the criterion range.
An example invocation of the index idxPerAge&Surname returns references to persons
whose age is in the range [23, 28[ and whose surname is “KOWALSKI”:
$index_idxPerAge&Surname( (23, 28, true, false) groupas $range;
“KOWALSKI” groupas $equal)
Similarly to single-value key parameters, the parameters specifying a range
are passed using the value of a binder named $range.
The key parameter can also specify a collection of key values as a criterion. This
is done when the key parameter returns a bag of key values, e.g.:
$index_idxPerAge((25 union 30 union 35) groupas $in)
The binder named $in is used to pass a collection of key values. If the criterion
parameter returns an empty bag, then the index call returns an empty bag too.
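The three binder kinds can be modelled compactly (an illustrative Python sketch; tagged tuples stand for the $equal, $range and $in binders, and `matches` shows how an index lookup might interpret a parameter; none of this mirrors the actual implementation):

```python
# Sketch of the $equal / $range / $in parameter binders as tagged tuples,
# and of how an index lookup could test a stored key value against them.

def equal(v):
    return ("$equal", v)

def range_(lo, hi, lo_closed, hi_closed):
    return ("$range", (lo, hi, lo_closed, hi_closed))

def in_(values):
    return ("$in", list(values))

def matches(param, key_value):
    tag, payload = param
    if tag == "$equal":
        return key_value == payload
    if tag == "$in":
        return key_value in payload
    lo, hi, lo_closed, hi_closed = payload
    return ((key_value > lo or (lo_closed and key_value == lo)) and
            (key_value < hi or (hi_closed and key_value == hi)))

# $index_idxPerAge(28 groupas $equal)
print(matches(equal(28), 28))                      # True
# (23, 28, true, false) groupas $range -- 28 is excluded:
print(matches(range_(23, 28, True, False), 28))    # False
print(matches(range_(23, 28, True, False), 23))    # True
# (25 union 30 union 35) groupas $in
print(matches(in_([25, 30, 35]), 30))              # True
```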
5.5.2 Rewriting Routines
The majority of the rewriting routines are straightforward and consist of generating
an index call and removing unnecessary predicates, as described earlier. This
section describes rewriting routines for complex combinations of predicates; however, it
focuses on the general idea rather than on implementation details.
Firstly, let us discuss the application of an index with a key specifying a range. If
the selection predicates of the optimised query specify only one limit of the range (lower or upper),
then the second limit is generated automatically, i.e. the smallest or biggest possible value
for the given key. For example, the query:
((sum(Person.age) / count(Person)) groupas auxavg).
Person where age > auxavg and surname = “KOWALSKI”
can be transformed by the index optimiser in order to use the idxPerAge&Surname
index:
((sum(Person.age) / count(Person)) groupas auxavg).
$index_idxPerAge&Surname(
(auxavg, 2147483647, false, true) groupas $range;
“KOWALSKI” groupas $equal)
If there is more than one predicate, or there are two opposite predicates, describing the
range on the given key, then the min, max, union and comparison operators are used to
obtain a correct key range parameter. E.g. the query:
((sum(Person.age) / count(Person)) groupas auxavg).
Person where age > auxavg and 23 <= age and age < 28
can be rewritten using the index invocation with a complex key value parameter
expression:
((sum(Person.age) / count(Person)) groupas auxavg).
$index_idxPerAge(
(max(auxavg union 23), 28, 23 > auxavg, false)
groupas $range)
The value of the lower limit is the maximum of the values auxavg and
23. When 23 is greater than auxavg, the lower limit should belong to the criterion range;
otherwise it should not. This is ensured by the <lower_closed> parameter: 23 > auxavg.
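The max/comparison rule for merging several lower limits of a range can be sketched as follows (illustrative Python; bounds are (value, closed) pairs and the function name is hypothetical):

```python
# Sketch of merging several lower-bound predicates on one key into a
# single $range lower limit: the largest value wins, and at equal values
# the open (exclusive) bound is the tighter one.

def merge_lower_bounds(bounds):
    """bounds: list of (value, closed) lower limits; returns the tightest."""
    best_v, best_closed = bounds[0]
    for v, closed in bounds[1:]:
        if v > best_v or (v == best_v and not closed):
            best_v, best_closed = v, closed
    return best_v, best_closed

# age > auxavg and 23 <= age, with auxavg evaluated at run-time to 41.3:
print(merge_lower_bounds([(41.3, False), (23, True)]))   # (41.3, False)
# if auxavg were 20.0, the inclusive limit 23 wins instead:
print(merge_lower_bounds([(20.0, False), (23, True)]))   # (23, True)
```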
In some cases the index optimiser uses an if-then expression to detect that a
given query returns no result and invoking the index is unnecessary, i.e. when the selection
predicates are in contradiction. This has to be checked, e.g., when for a given key there
exists more than one selection predicate and at least one of them is based on the = or in operator. If
any selection predicate contradicts a predicate based on the = or in operator, then such
a query will return an empty bag. E.g. the query:
((sum(Person.age) / count(Person)) groupas auxavg).
(Person where age >= auxavg and 30 = age)
can be transformed into the following form:
((sum(Person.age) / count(Person)) groupas auxavg).
if (30 >= auxavg) then $index_idxPerAge(30 groupas $equal)
which guarantees that the index idxPerAge will not be invoked and the empty bag will
be returned if the condition 30 >= auxavg is false.
In some cases multi-key indices allow omitting a key in an index call (cf. the enum
type key described in section 4.1.1); hence, the index optimiser supports the scenario when
there are no selection predicates referring to such a non-obligatory key. In this case the
lower and upper bounds are set to the smallest and the biggest key value, respectively. E.g.:
Person where surname in “NOWAK”
can be rewritten to use the multi-key index idxPerAge&Surname with an omitted age
key:
$index_idxPerAge&Surname(
(-2147483648, 2147483647, true, true) groupas $range;
“NOWAK” groupas $in)
To omit a boolean key in an index call, the following set key parameter criterion is used:
(false union true) groupas $in
Predicates based on operators <, >, ≥, ≤, = need operands with a singular
cardinality. Because of the threat of a run-time error such selection predicates consisting
of key operands with optional cardinality cannot be used to apply the suitable index. As
it was shown in subchapter 5.3, in order to prevent run-time errors the exists operator
can be used on a given key. E.g. the query:
Person where exists(address.zip)
where address.zip > 99720 and address.zip <= 99727 and age <= 28
can be rewritten to take advantage of the idxPerZip index. Additionally, after applying
the index, the index optimiser removes the unnecessary exists expression:
$index_idxPerZip((99720, 99727, false, true) groupas $range) where age <= 28
This solution was included in the author’s implementation, because it enables using range
indices on keys with optional cardinality.
The presented rewriting rules concern the most common and important
situations that the index optimiser has to deal with. Nevertheless, these solutions do not
cover all possible scenarios in which applying indexing is possible (e.g. rank queries, or the
count operator used instead of exists). This issue needs further research focused on all
SBQL operators and on the analysis of queries occurring in the database. Finally, the index
optimiser rewriting routines concerning disjunction predicates (based on or operator)
are also very significant and therefore they are described separately in the next section.
There are also other methods, auxiliary to the index optimiser, that increase
indexing potential in queries. They are the topic of subchapter 5.6.
5.5.3 Processing Disjunction of Predicates
The index optimiser is prepared to deal with queries with selection predicates
joined by the or operator. This is possible due to the distributivity of conjunction over
disjunction, known from logic theory:
[(p1 or p2) and p3] = [(p1 and p3) or (p2 and p3)]
As or weakens a selection it also makes optimisation more complex. Therefore,
if applying an index is possible without considering predicates joined by or, then the
index optimiser may skip deeper analysis and use the index.
In other cases, in order to check all possibilities for indexing, the index optimiser
removes the or operator and splits the non-algebraic where operator expression into two
partial selection expressions. Objects returned by these two expressions can be
duplicated; so, it is necessary to leave only distinct object references. This is achieved
through the uniqueref expression. Indexing may reduce the amount of data processed by
such a query only if it can be applied to both partial expressions. This procedure is
recursive if there is more than one or operator. Let us consider the following example
of optimising the query:
Emp where age = 28 and (address.city = “Szczecin” or
”Szczecin” in worksIn.Dept.address.city)
When there is no single-key index set on the age attribute of Emp objects, the query is split
by the index optimiser into the following form:
uniqueref((Emp where age = 28 and address.city = “Szczecin”)
union (Emp where
age = 28 and “Szczecin” in worksIn.Dept.address.city))
Depending on the current cost model both indices can be applied:
uniqueref(
($index_idxEmpCity(”Szczecin” groupas $equal) where age = 28)
union
($index_idxEmpAge&workCity(28 groupas $equal; ”Szczecin” groupas $equal)))
The implementation of the or operator support does not actually perform
splitting of the where clause into several where clauses. Instead, the index optimiser
works with several different combinations of predicates (containing predicates from or
operator child branches). After finding indices matching all combinations, the cost
model selects the best indices for each combination. Next, the cost model is used to find
the most selective set of combinations. A set containing x combinations of predicates,
needed to build a union of x where clauses equivalent to the original query, is used to
generate the optimised query.
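The combination generation can be sketched with a cartesian product (illustrative Python; predicates are plain strings and the function name is hypothetical):

```python
# Sketch of the combination generation: each conjunct contributes its
# or-branches (a single-branch list stands for a plain predicate), and the
# cartesian product yields the predicate sets of the equivalent union of
# where clauses.

from itertools import product

def predicate_combinations(conjuncts):
    """conjuncts: list where each item is a list of or-branches."""
    return [list(combo) for combo in product(*conjuncts)]

# age = 28 and (address.city = "Szczecin" or "Szczecin" in worksIn...):
combos = predicate_combinations([
    ["age = 28"],
    ['address.city = "Szczecin"',
     '"Szczecin" in worksIn.Dept.address.city'],
])
for c in combos:
    print(c)
# ['age = 28', 'address.city = "Szczecin"']
# ['age = 28', '"Szczecin" in worksIn.Dept.address.city']
```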
5.5.4 Optimising Existential Quantifier
In SBQL the existential quantifier is a non-algebraic operator assuming the
following syntax:
exists q1 such that q2
The query q2 must return true or false for each object defined by q1. If the query q2 is
true for at least one object returned by q1, then the expression returns true; otherwise
it returns false.
There is a method to reuse the index optimiser routines presented in the previous
sections of this chapter in order to apply indices that relate to predicates in the query
q2. The existential quantifier first has to be rewritten to a form containing the
selection operator:
exists(q1 where q2)
where exists is the SBQL algebraic unary operator. Both queries are semantically
equivalent, but in the second form the index optimiser can be used to process the where
clause in order to reduce the amount of evaluated data. If the optimisation succeeds, the
query can be further transformed. When all predicates from the query q2 have been used
by the index, the final optimised query form is the following:
exists(indexCallExpression)
On the other hand, when some q2 predicates are not associated with the selected index
(q2’ stands for the expression describing these predicates):
exists(indexCallExpression where q2’)
it is better to finally transform the unary exists operator into an expression with the
existential quantifier:
exists indexCallExpression such that q2’
The latter form is more efficient because, in contrast to the where operator, the
evaluation stops as soon as a processed element returned by the indexCallExpression
matches the predicates defined in query q2’ (see Tab. 3-3).
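The efficiency argument can be illustrated with a short Python sketch (a hypothetical function, not SBQL runtime code): a quantifier-style loop returns as soon as the first matching element is found, so only a prefix of the index invocation result is inspected, whereas a where clause would filter the whole result.

```python
def exists_such_that(elements, predicate):
    """Short-circuiting existential quantifier; also reports how many
    elements were inspected before the answer was known."""
    checked = 0
    for e in elements:
        checked += 1
        if predicate(e):        # stop at the first match
            return True, checked
    return False, checked

# Out of 1000 candidate elements, only 4 are inspected before the match.
found, checked = exists_such_that(range(1000), lambda x: x == 3)
print(found, checked)
```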
5.5.5 Reuse of Indices through Inheritance
In the AS0 store model any two different path expressions defining an index
non-key value always return different objects. This is not true in the case of static or
dynamic inheritance (AS1, AS2, etc.). The name of a subclass instances collection usually
denotes a subset of a bigger collection. For example, according to the schema shown in
Fig. 3.1, the query EmpStudent returns the common subset of the results returned by the
queries Emp and Student. Similarly, all objects belonging to the collection Emp can be
found among the instances of the superclass PersonClass.
As a result, all indices addressing objects of a superclass also contain subclass
instances; therefore, such indices should also be used in optimisation of selection
queries that concern the subclass instances subset of the indexed objects. An
invocation of an index set on a collection of superclass instances often returns more
objects than required. For example, the following query:
Emp where age = 28 and surname = “KUC”
cannot be directly optimised using any index mentioned in section 4.1.2. The selection
predicates concern attributes of EmpClass’s superclass, i.e. PersonClass, and for that
reason the administrator has probably equipped the whole Person collection with suitable
indices, e.g. idxPerAge, idxPerSurname, idxPerAge&Surname. Such indices return not
only EmpClass instances; therefore, the index optimiser applying one of them has to
introduce a facility that removes non-EmpClass instances from the index invocation
result. This can be done using the SBQL coerce operator. In the AS1 and AS2 models it
can be used to convert an object into an object of a more specific or a more general
class. Additionally, this conversion rejects objects that are not instances of the specified
class. The syntax of the coerce operator follows the typical cast convention known
from languages such as C, C++, Java, etc. Consequently, the
example query above can be rewritten to one of the following forms:
(Emp) idxPerAge(28 groupas $equal) where surname = “KUC”
(Emp) idxPerSurname(”KUC” groupas $equal) where age = 28
(Emp) idxPerAge&Surname(28 groupas $equal; ”KUC” groupas $equal)
The method presented in this section requires extending the cost model, because the
selectivity estimate should take into consideration the unwanted objects returned by an
index and the additional cost of a coerce operation.
Conversely, optimisation using an index set on a subclass instances collection is not
possible, since determining the objects that were not taken into account would
require inspecting the whole collection of objects. This would not decrease the amount
of data processed within the query, which is the main idea of indexing. For example, the
idxEmpCity index cannot be applied by the index optimiser in the following query:
Person where address.city = “Warszawa”
Concluding, when considering an attribute as an index key, the best rule
for the administrator is to create indices over all instances of the
class which introduces the given attribute. Such indices are more versatile, as they can
also be used for optimising selection queries addressing subclass collections.
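The coerce step can be illustrated by the following Python sketch (the class names and the `coerce` helper are illustrative only, not ODRA's API): an index on the superclass collection returns Person objects, and the cast keeps only Emp instances before the remaining predicate is checked.

```python
class Person:
    def __init__(self, surname, age):
        self.surname, self.age = surname, age

class Emp(Person):
    pass

def coerce(objects, cls):
    """Reject objects that are not instances of the target class,
    mimicking the (Emp) cast applied to an index invocation result."""
    return [o for o in objects if isinstance(o, cls)]

# idxPerSurname("KUC") would return all persons named KUC, Emp or not:
index_result = [Person("KUC", 28), Emp("KUC", 28), Emp("KUC", 35)]

# (Emp) idxPerSurname("KUC" groupas $equal) where age = 28
matching = [o for o in coerce(index_result, Emp) if o.age == 28]
print(len(matching))
```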
5.6 Secondary Methods
The efficacy of indexing depends on several factors, e.g. a good selection of indices
generated by the database administrator, clear and correct construction of
selection queries, etc. This subchapter focuses on how various query
rewriting methods can facilitate the index optimiser.
[Fig. 5.5 shows a pipeline: a query passes through auxiliary methods 1..N, which prepare it for applying indices; the index optimiser then equips the query with index calls; undesired methods 1..M may follow, yielding the optimised query.]
Fig. 5.5 Query optimisation with the index optimiser pre-processing
Fig. 5.5 presents how auxiliary methods assist in indexing. These methods can
be divided into several types:
• optimisation methods, e.g. [95, 97]:
o factoring out independent subqueries,
o pushing selection,
• methods assisting optimisation of queries invoking views, e.g. [122]:
o query modification,
o removing unnecessary auxiliary names,
• other methods, e.g. [93]:
o query syntax tree normalisation.
The following sections show by short examples how secondary methods enable
indexing. Details concerning these methods, their algorithms or their role in query
processing are not described. The given examples do not cover all possible situations of
facilitating the index optimiser, and other methods, not listed here, may also exist and be
useful. This work does not focus on establishing the proper order of applying these
methods, because this would require deeper analysis and testing of the query
optimisation environment.
The process shown in Fig. 5.5 also includes undesired methods, which should be
placed in the optimisation sequence after the index optimiser, as they may negatively
affect the application of indices. An example of such a harmful routine regarding indices
is shown in section 5.6.5.
5.6.1 Factoring Out Independent Subqueries
Factoring out is one of the most important optimisation methods. It has its roots in
the optimisation of nested queries in relational DBMSs. In SBA it is used in the context
of a non-algebraic operator which contains a subquery independent of this operator. As
depicted at the beginning of Chapter 5, the general idea of factoring out consists in
moving the subquery before the non-algebraic operator. Thus, it is evaluated only once,
before the non-algebraic operator loop.
Let us consider the following query selecting persons who earn a salary equal to
the lowest salary in the CNC department:
Emp where
salary = min((Dept where name = “CNC”).employs.Emp.salary)
At compile-time the number of employees working in the CNC department is
unknown, so in case the subquery
(Dept where name = “CNC”).employs.Emp.salary
returns an empty bag, the evaluation of the min operator would not be possible and a
run-time error would occur. According to the conditions described in section 5.3.1, a
selection predicate containing such an operand disallows applying indexing. However,
the situation improves after factoring out independent subqueries. The subquery
calculating the minimal salary in the CNC department is independent and hence
can be calculated before the selection:
min((Dept where name = “CNC”).employs.Emp.salary)
groupas $aux0. Emp where salary = $aux0
The advantage of such a transformation is that the selection predicates are free of the
run-time error threat and the idxEmpSalary index can be safely applied:
min((Dept where name = “CNC”).employs.Emp.salary) groupas $aux0.
$index_idxEmpSalary(($aux0) groupas $equal)
Factoring out independent subqueries is a very important secondary method for
applying indices, because hazardous predicates in where clauses completely disallow
indexing. Similar situations were identified and verified in [78].
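The effect of factoring out can be sketched in Python (illustrative data and helper names, not ODRA code): the naive form re-evaluates the min subquery inside the selection loop, while the factored form computes it once into an auxiliary value, mirroring the groupas $aux0 rewriting above.

```python
emps = [
    {"name": "A", "salary": 1000, "dept": "CNC"},
    {"name": "B", "salary": 1500, "dept": "CNC"},
    {"name": "C", "salary": 1000, "dept": "HR"},
]

def select_naive(emps):
    # the min(...) subquery is re-evaluated for every Emp object
    return [e for e in emps
            if e["salary"] == min(x["salary"] for x in emps
                                  if x["dept"] == "CNC")]

def select_factored(emps):
    # the independent subquery is evaluated once, before the loop ($aux0)
    aux0 = min(x["salary"] for x in emps if x["dept"] == "CNC")
    return [e for e in emps if e["salary"] == aux0]

print(select_factored(emps))  # employees A and C
```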
5.6.2 Pushing Selection
Pushing selection uses the distributivity property of some non-algebraic
operators. It is a generalised equivalent of pushing a selection before a join, known from
relational DBMSs.
The example query retrieving the age of persons whose surname is Nowak:
Person.(age where surname = “NOWAK”)
is formed in an unfortunate manner that prevents using the idxPerSurname index. The
selection predicate is independent of the where operator because it does not relate to the
age attribute; therefore, the predicate can be pushed before the where clause and the
selection can be applied to Person objects:
(Person where surname = “NOWAK”).age
Although in this case the transformation itself would not cause a high efficiency gain,
it enables using the above-mentioned index:
$index_idxPerSurname(“NOWAK” groupas $equal).age
which could improve the query performance even by orders of magnitude. This would
not be possible without the pushing selection method.
The second example involves pushing a selection before a join operator. The
following query:
(Dept join (sum(employs.Emp.salary) * 12)) where name = “HR”
returns the “HR” department with the overall year-long cost of salaries of its employees.
The sum of employees’ salaries is calculated unnecessarily for all departments other
than HR. Examining the binding levels of the selection predicate name = “HR” shows
that name depends on where, because it is bound in the scope opened by that operator.
However, this predicate relates only to Dept objects and consequently can be applied
directly to the left operand of the join:
(Dept where name = “HR”) join (sum(employs.Emp.salary) * 12)
After rewriting, the subquery:
sum(employs.Emp.salary) * 12
will be evaluated for a significantly smaller number of Dept objects. Moreover, this
form makes it possible to apply the idxDeptName index:
$index_idxDeptName(“HR” groupas $equal) join
(sum(employs.Emp.salary) * 12)
Both examples show the positive influence of pushing selection on the work of the
index optimiser.
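The second example can be mirrored by a Python sketch (illustrative data, not ODRA code): in the naive form the yearly cost aggregate is computed for every department before filtering, whereas the pushed form filters the left operand of the join first.

```python
depts = [{"name": "HR", "salaries": [1000, 1200]},
         {"name": "IT", "salaries": [2000, 2500, 1800]}]

def naive(depts):
    # the yearly cost is computed for every department, then filtered
    joined = [(d["name"], sum(d["salaries"]) * 12) for d in depts]
    return [row for row in joined if row[0] == "HR"]

def pushed(depts):
    # selection applied to the left operand first; aggregate only for HR
    return [(d["name"], sum(d["salaries"]) * 12)
            for d in depts if d["name"] == "HR"]

print(pushed(depts))
```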
5.6.3 Methods Assisting Invoking Views
Views introduce a higher level of abstraction in designing applications.
Unfortunately, this may result in serious performance deterioration because of the
limited access of optimisation methods to the body of invoked views.
Let us consider the following example of a query operating on the
UnderpaidEmp view, which refers to Emp objects with the salary attribute lower than
1000:
(UnderpaidEmp where City = “Szczecin”).FullName
The subview City returns an employee’s city of residence and the subview FullName
returns the concatenated name and surname of an employee. An invocation of the
UnderpaidEmp view alone can only be optimised by utilising the idxEmpSalary index.
With such a wide range, applying the index would probably bring a small gain in query
performance because of weak selectivity.
The query modification technique (known also from relational DBMSs),
whose idea lies in combining a query with the definitions of the invoked views, would
properly replace the UnderpaidEmp view and its subviews City and FullName with their
definitions:
(((Emp where salary < 1000) as up) where
((up.address.city) as upac).upac = “Szczecin”).
((up.name + “ “ + up.surname) as upn).upn
Directly after query modification, applying all possible indices is not available to the
index optimiser because of the auxiliary names introduced by the view
definitions. Hence the removing unnecessary auxiliary names method should be applied
to the obtained query:
((Emp where salary < 1000) where address.city = “Szczecin”).
(name + “ “ + surname)
This form of the transformed query is still semantically equivalent to the initial query,
but it enables using the last predicate, concerning the derived attribute address.city of a
Person object, to apply an index:
($index_idxEmpCity(“Szczecin” groupas $equal) where
salary < 1000).(name + “ “ + surname)
The idxEmpCity index has relatively good selectivity; hence using it is more profitable
than applying the idxEmpSalary index.
5.6.4 Syntax Tree Normalisation
Syntax tree normalisation can be used to convert two semantically
equivalent subqueries into identical expressions. This operation can be applied to any
commutative binary algebraic operator:
left_operand operator right_operand
The method associates every SBQL expression and literal with a metric which makes
subquery comparison possible. If the right_operand has a smaller metric than the
left_operand, then the operands of the binary operator are swapped:
right_operand operator left_operand
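A hypothetical sketch of such metric-based normalisation in Python (the concrete metric is an assumption; any total order over subexpressions would do): operands of a commutative operator are reordered so that semantically equal subqueries obtain identical trees.

```python
def metric(node):
    # assumption: integer literals get larger metrics than compound
    # expressions, so "12 * sum(...)" normalises to "sum(...) * 12"
    if isinstance(node, int):
        return (1, str(node))
    return (0, node[0])  # tuple node: (operator, *operands)

def normalise(node):
    """Canonicalise commutative '*' and '+' nodes by the metric order."""
    if isinstance(node, tuple) and node[0] in {"*", "+"}:
        left, right = normalise(node[1]), normalise(node[2])
        if metric(right) < metric(left):
            left, right = right, left
        return (node[0], left, right)
    return node

q1 = ("*", 12, ("sum", "employs.Emp.salary"))
q2 = ("*", ("sum", "employs.Emp.salary"), 12)
print(normalise(q1) == normalise(q2))  # both become the same tree
```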
In the context of indexing, syntax tree normalisation addresses keys built
on derived attributes involving complex expressions that exploit commutative operators.
As a simple example of such a key, let us consider the overall year-long cost of
salaries of employees calculated in the context of a department:
sum(employs.Emp.salary) * 12
Generation of the idxDeptYearCost index on this key for Dept objects could greatly
increase the evaluation performance of the following example query:
Dept where sum(employs.Emp.salary) * 12 > 1000000
because the index key would exactly match the complex predicate used for department
selection. Although the query above is not very selective, rewriting it to the
following form:
$index_idxDeptYearCost
((1000000, 2147483647, false, true) groupas $range)
would improve efficiency, since calculating the overall year-long cost of salaries in
departments will be omitted. The problem arises when the user forms the
predicate in the following manner:
Dept where 12 * sum(employs.Emp.salary) < 1000000
Even though the left operand of the selection predicate:
12 * sum(employs.Emp.salary)
is semantically equal to the one used in the previous example query, applying the
idxDeptYearCost index is not possible because the factors of the product are swapped.
Assuming that an integer literal has a larger metric than the sum operator, syntax tree
normalisation would convert the query accordingly:
Dept where sum(employs.Emp.salary) * 12 < 1000000
enabling the use of the idxDeptYearCost index.
Syntax tree normalisation itself does not improve the performance of query
evaluation; on the contrary, it introduces a small delay at compile-time.
Its value can be appreciated only in the context of other optimisation methods like
indexing or caching.
5.6.5 Harmful Methods
As an example of a method that may make applying indexing impossible, let us
consider factoring out common path-subexpressions [93] and the following query:
Emp where worksIn.Dept.address.city = “Warsaw” and
worksIn.Dept.address.street = “Sienkiewicza”
which returns employees who work in a department in Warsaw at Sienkiewicza Street. In
this query the derived attribute worksIn.Dept.address is accessed twice for each
Emp object; worksIn.Dept.address is the common path-subexpression for the
left operands of both selection predicates. Using the mentioned factoring out method, the
obtained optimised query:
Emp where worksIn.Dept.address.
(city = “Warsaw” and street = “Sienkiewicza”)
computes the worksIn.Dept.address expression only once. Nonetheless, in the case of the
discussed query it would be better to apply the idxEmpWorkCity index:
$index_idxEmpWorkCity(“Warsaw” groupas $equal)
where worksIn.Dept.address.street = “Sienkiewicza”
which is not possible because of the disadvantageous predicate transformation done by
factoring out common path-subexpressions.
The major optimisation methods significantly facilitate indexing. Still, some
methods, like the one presented in the example, should be considered for placement
after the index optimiser. Ordering the optimisation methods is a process
which involves heuristic analysis of various queries and the common sense of the
optimiser designer. All presented examples outline the index optimiser in the context of
the whole optimisation process and should be considered useful.
5.7 Optimisations involving Distributed Index
As described in section 2.2.2, the main advantages of a distributed index are
parallel access and increased capacity. Therefore, it greatly improves the efficiency of
an index concurrently exploited by multiple queries.
Nevertheless, the performance of a distributed index in the case of a single call is
usually similar to that of a centralised index. Let us consider the following query issued
by a client:
<nonkeyexpr> where min <= <keyexpr> and <keyexpr> <= max
It returns the part of the collection defined by <nonkeyexpr> for which the derived
attribute <keyexpr> is within the range determined by min and max. Additionally, let us
assume the following:
• data is equally distributed on qDATA servers,
• n is the number of elements in the processed collection,
• s is the number of elements selected by the query.
The where clause is evaluated in parallel on qDATA servers, each storing approximately
n/qDATA elements; therefore, local evaluation on a server has the computational
complexity O(n/qDATA) expressed in the big O notation. Overall performance could be
improved by local indexing on the servers only if all servers provided an appropriate
index; however, this cannot be guaranteed by the global schema administrator.
A significant efficiency improvement can be obtained by creating a distributed
range SDDS index appropriate for optimising query evaluation. Such an index is spread
over qIDX machines among the many available in a distributed database. Assuming that
data are also equally distributed inside the index, each server contains n/qIDX elements.
Therefore, performing selection of all elements from a fragment of the index on an
individual server has the complexity O(n/qIDX) and does not depend on the actual data
distribution. It is important to note that qIDX grows dynamically with the number of
indexed elements, so it is usually much greater than qDATA.
Regardless of existing indices, the time complexity of merging partial results on
the client is O(s), since it depends on the query selectivity. The computational complexity
of a centralised index based on linear hashing is comparable (see section 2.2.1).
Nevertheless, a distributed index can significantly improve performance in
relation to a centralised index in the case of a count query issued by a client:
count(<nonkeyexpr> where min <= <keyexpr> and <keyexpr> <= max)
The query running time expressed in the big O notation is O(n). The evaluation is done
in parallel on qDATA sites containing data. The client only calculates the sum of the
results obtained from the qDATA servers.
In the case of employing a centralised index, the complexity remains
the same as in the case of a where clause. Nevertheless, the execution can be faster:
regardless of the indexing strategy, such a count query is efficiently computed by the
index itself. Actual data are not used, or their participation in the evaluation of the query
is very small.
The properties of a distributed index enable computing a count query in parallel
on the index sites. The time of calculating the sum on a client is negligible, especially as
the partial results are usually obtained from one or several of the qIDX servers. Therefore,
the distributed index should prove its efficacy in the case of the discussed count query.
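The scheme can be sketched in Python (the fragments standing for index servers are hypothetical): each site counts the keys falling into the range in parallel, and the client merely sums the partial counts instead of receiving any actual data.

```python
from concurrent.futures import ThreadPoolExecutor

# key values held by three hypothetical index servers (qIDX = 3)
fragments = [[25, 28, 31], [28, 28, 40], [19, 28]]
lo, hi = 26, 30

def partial_count(fragment):
    # each index site counts matching keys in its own fragment
    return sum(1 for k in fragment if lo <= k <= hi)

# partial counts are computed in parallel; the client only sums them
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(partial_count, fragments))

print(total)
```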
In the next section the author proposes an optimisation concerning another type of
SBQL queries, which can take advantage of the efficiency of the distributed index in
processing count queries.
5.7.1 Rank Queries Optimisation
Queries can return a sequence. Consequently, there should be a possibility to
select the kth sequence element, the last sequence element, etc. This is achieved through
rank queries, which are nowadays very widespread; in particular, Internet search engines
provide answers according to internal or explicit rankings. The subject of rank queries is
researched in the context of regular query optimisation, the construction of database
systems devoted to ranking and the development of their theoretical foundations (e.g.
relational ranking algebras), e.g. [52, 69].
The most frequently used ranking query looks up the top-k elements. Such a query
can be easily facilitated with the use of a suitable ordered index. However, the author
would like to discuss optimisation of ranking queries in a more general case. SBQL
allows expressing rank queries using the square brackets operator and the rangeas
operator (see Tab. 3-5). A query concerning objects defined by a <nonkeyexpr>
expression, ordered using a <keyexpr> expression, whose rank is between the integers
defined by <min> and <max>, can be formulated in at least three semantically
equivalent forms:
Query 5.1 Ranking Queries in SBQL
1. Using square brackets and a bag of integers:
(<nonkeyexpr> orderby <keyexpr>)
[bag(<min>, <min>+1, <min>+2, …, <max>)]
2. Using square brackets and a range-of-integers bag constructor (not yet supported in
the ODRA database):
(<nonkeyexpr> orderby <keyexpr>)[<min>..<max>]
3. Using the rangeas operator and appropriate selection predicates:
((<nonkeyexpr> orderby <keyexpr>) as <name> rangeas <rank>
where <rank> >= <min> and <rank> <= <max>).<name>
where <name> and <rank> are auxiliary names.
The third solution was introduced in the Loqis system [116]. It is the most universal, as
it can be freely used with other options of SBQL. A query which returns a
sequence{res1, res2, ..., resn} can be further processed by the rangeas operator. It
equips individual results with binders, e.g. rank, which store ordered natural numbers:
bag{struct{res1, rank(1)}, struct{res2, rank(2)}, …, struct{resn, rank(n)}}. This
solution enables the query language to freely form conditions on the rank binders.
For example, the following rank query:
((Emp orderby salary) rangeas n) where n <= 10
returns a bag of the 10 lowest-earning employees with an additional binder called n:
bag {struct{Emp1, n(1)}, struct{Emp2, n(2)}, …, struct{Emp10, n(10)}}
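The semantics of rangeas can be mimicked in Python (illustrative data): results ordered by the key are paired with consecutive rank numbers, on which a predicate such as n <= 2 can then be formed.

```python
emps = [("NOWAK", 1500), ("KOWALSKI", 900), ("KUC", 700)]

# orderby salary, then attach consecutive rank binders (rangeas n)
ranked = [(e, n) for n, e in enumerate(sorted(emps, key=lambda e: e[1]),
                                       start=1)]

# where n <= 2 : the two lowest-earning employees, ranks attached
lowest = [(e, n) for e, n in ranked if n <= 2]
print(lowest)
```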
In the case of rank queries, sorting is not the goal of the query although the orderby
operator is used. Creating a sorted sequence of data significantly reduces the optimisation
potential. Sorting usually deteriorates the average performance of query
evaluation to O(n·log2(n)) running time (where n is the number of elements to
sort) [23]. When data is distributed between several servers, queries forcing data
ordering cannot be completely decomposed onto particular sites in order to execute in
parallel. If order is forced, then each element is processed separately: a client has to
request all the data required by a query from the servers and process it to obtain the result,
i.e. the so-called total data shipping strategy. Therefore, the performance of this strategy is
at least linear. Most of the methods based on rewriting and indexing assume that data
are not ordered; therefore, queries involving sorting require dedicated optimisation
methods.
In order to avoid sorting the whole collection defined by <nonkeyexpr>, the
author considered an approach based on looking up the <min>th and <max>th elements
according to the given ranking key. Assuming that a method returning the kth element of
a given set is defined, e.g. $findKthElement(k:integer), the ranking query forms in
Query 5.1 can be transformed accordingly to a semantically equivalent form:
Query 5.2: Evaluation of Rank Query Without Sorting
($findKthElement(<min>).<keyexpr> as val_min join                (I)
 $findKthElement(<max>).<keyexpr> as val_max join
 ((<min> - count(<nonkeyexpr>
    where <keyexpr> < val_min)) as delta))                       (II)
.((<nonkeyexpr> where val_min <= <keyexpr>
    and <keyexpr> <= val_max)                                    (III)
  orderby <keyexpr>)[delta..<max>-<min>+delta]                   (IV)
The transformed query:
I. finds the ranking key values of the <min>th and <max>th elements and stores them as the
values of the auxiliary binders named val_min and val_max,
II. next, since there can be more elements with the same value as the <min>th
element, calculates which element in a row with the value val_min is the
<min>th element and stores this number using an auxiliary binder named delta,
III. extracts the elements with a ranking key value between val_min and val_max
inclusively,
IV. finally, sorts the extracted elements and removes those before the <min>th element
and after the <max>th element using a ranking operator.
Assuming that the number of elements extracted by a ranking query is usually
small (the s coefficient), the performance of the strategy presented above depends greatly
on the $findKthElement method and on both where clauses. The evaluation complexity of
parts II and III of Query 5.2 is O(n). Without indexing, the computations are done in
parallel on qDATA sites; facilitated by a global distributed index, the evaluation splits
between the qIDX servers. The next subsections discuss variants of Hoare’s algorithm
resolving the problem of finding the kth element and their influence on the overall
performance of ranking queries.
5.7.1.1 Hoare’s Algorithm in Distributed Environment
Hoare’s algorithm is based on bisection, like the well-known quicksort sorting
algorithm [23]. During each iteration the algorithm splits the examined data into two
parts (elements smaller than, and elements equal to or greater than, a randomly selected
pivot element). In contrast to quicksort, after dividing a set Hoare’s algorithm executes
itself recursively only on the part of the set that contains the wanted kth element,
omitting the other part. Such an approach results in linear evaluation complexity, so it is
faster than the obvious algorithm based on sorting the given set. Nevertheless, similarly
to sorting, it consumes additional resources to store a copy of the data so that elements
can be freely swapped according to the algorithm.
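A compact Python sketch of this selection scheme (a standard quickselect, shown for plain values rather than SBQL objects): each call partitions around a pivot and recurses only into the side containing the kth element.

```python
import random

def quickselect(items, k):
    """Return the kth smallest element (k is 1-based)."""
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    if k <= len(smaller):
        # the kth element lies in the part below the pivot
        return quickselect(smaller, k)
    if k <= len(smaller) + len(equal):
        return pivot
    # otherwise recurse only into the part above the pivot
    return quickselect([x for x in items if x > pivot],
                       k - len(smaller) - len(equal))

print(quickselect([7, 1, 5, 3, 9], 2))
```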
Applying the classic Hoare’s algorithm for centralised processing of a rank query
would result in:
• total shipping of the distributed data to be processed on a main server (usually on a
client),
• obtaining (in big O notation) linear evaluation complexity O(n),
• large consumption of memory and a large number of write operations on the
main server.
The straightforward parallelisation of Hoare’s algorithm on the qDATA servers
storing the processed data requires introducing:
• a controlling algorithm on the main server, deciding on iteration parameters,
• peer algorithms on the qDATA servers, processing the actual data.
An iteration of the parallelised Hoare’s algorithm consists of the following steps:
1. The controlling algorithm randomly selects a pivot value within the given range.
The selection can be performed variously, e.g. using:
• a query for a random element within the given range from a random data server,
• a median calculated for a collection of random elements within the range from
all data servers,
• an average value based on the known minimal and maximal limits of the range.
2. The controlling algorithm sends a message to the peer algorithms to divide the given
range according to the selected pivot.
3. The peer algorithms divide the range and inform the main server about the cardinality
of the obtained parts (additionally, random elements from both parts can be sent).
4. The controlling algorithm determines the part of the given range containing the kth
element, and this part becomes the new range for the next iteration.
The controlling algorithm stops after O(log2(n)) iterations, when the number of
elements in the range is reasonably small. Those elements are sent to the main server
and the kth element is selected. Actually, the absolute kth element is the (k - mincount)th
element of the final range, where mincount is the number of elements smaller
than the final range.
Concluding, the parallel evaluation of Hoare’s algorithm has the following
properties:
• total data shipping is avoided,
• O(log2(n)) rounds of communication are required,
• there is large consumption of memory and a large number of write operations on the
servers storing the data,
• existing local or global indices cannot be used.
Using this algorithm to evaluate Query 5.2 guarantees a running time complexity of O(n)
and distribution of the calculations over the qDATA servers. Further improvements require
taking advantage of indexing.
5.7.1.2 Modification of Hoare’s Algorithm
In the proposed modification of Hoare’s algorithm, splitting the examined data into two
parts is not done physically. Instead, during an iteration the numbers of elements smaller
and greater than the pivot value are determined. This allows selecting the side of the range
divided by the pivot that holds the kth element. Such a simple bisection algorithm can be
entirely defined on the client side as an SBQL program:
set_min := min(<nonkeyexpr>.<keyexpr>);
set_max := max(<nonkeyexpr>.<keyexpr>);
set_position := 0;
do {
pivot := (set_min + set_max)/2;
less_count := count(<nonkeyexpr> where set_min <= <keyexpr>
and <keyexpr> < pivot);
if (set_position + less_count >= k) set_max := pivot;
else {
set_min := pivot;
set_position += less_count;
}
} while (less_count > stop_const);
return ((<nonkeyexpr> where set_min <= <keyexpr>
and <keyexpr> <= set_max) orderby <keyexpr>)[k - set_position];
where:
• set_min and set_max are the boundary values of the examined set,
• pivot is the arithmetic centre of the set boundary values,
• less_count holds the number of set elements with values smaller than pivot,
• set_position indicates the number of elements with a value smaller than set_min,
• stop_const is a constant used for terminating the main loop, as the examined set
size reduces together with the less_count value.
The performance of the proposed algorithm is O(n*log2(n)). Since the loop is executed
O(log2(n)) times, the overall evaluation, similarly to Hoare’s algorithm, requires
O(log2(n)) rounds of communication. The regular evaluation of statements containing
the min, max and count operators can be decomposed onto the qDATA servers; if an
appropriate SDDS index exists, the evaluation is split between the qIDX servers. The
method for determining the pivot and the loop stop condition does not influence the
running-time complexity. Still, it can be implemented differently to tune the algorithm
performance.
There are many advantages of the proposed algorithm:
• implementation simplicity,
• total data shipping is successfully avoided,
• small memory usage and a small number of write operations (data is mainly read),
• a very small amount of data is sent through the network,
• data processing is transparently facilitated by existing local or global indices.
The features of this algorithm, along with its performance, also apply to the rank queries
evaluation strategy shown in Query 5.2. The table below compares different approaches
to the execution of rank queries.
Tab. 5-1 Features of Rank Queries Evaluation Strategies

Feature List                      Unoptimised    Query 5.2 based on     Query 5.2 based on
                                  Query 5.1      Distributed Hoare’s    author’s approach
Reduced network traffic           NO             YES                    YES
Small memory usage                NO             NO                     YES
Can utilise local indices         NO             NO                     YES
Algorithm simplicity              YES            NO                     YES
Computational complexity          O(n*log2(n))   O(n)                   O(n*log2(n))
Parallel evaluation on qIDX
sites with SDDS support           NO             NO                     YES
Despite the slightly better performance of the rank queries evaluation strategy employing Hoare's algorithm, the author's approach possesses more advantages. Unfortunately, efficiency verification of the proposed rank queries optimisation method has not been done yet because of the immature stage of the ODRA platform implementation. Appropriate tests are planned for the future.
5.8 Increasing Query Flexibility with Respect to Indices
Management
In the current ODRA prototype each change in the database data definition (usually triggered through DDL commands) forces recompilation of the applications which may depend on the modified entity. Such situations obviously occur also during index management operations. After adding an index, compilation and optimisation are required to introduce index calls in queries inside existing applications (an example is shown in Fig. 5.3), whereas after removing an index the compiled form of the query syntax tree must be freed from all calls to the non-existing index.
On the other hand, recompiling would additionally require terminating some running applications; therefore, in many cases it is impossible or troublesome. In order to solve this problem, the author proposes a solution that yields a more flexible compiled form of the optimised query. The necessary changes can be introduced by the index optimiser at the stage of query syntax tree rewriting. Let us assume that, as in many solutions in RDBMSs, an index can be disabled or enabled by the administrator. First, it is crucial to provide a mechanism ensuring validity of the query even if an index remove or index disable command is issued before or during the evaluation of the query. Therefore, the author proposes a new built-in SBQL method:
$request(index_name) : boolean
that can be introduced into the query syntax tree at compile-time to facilitate query processing. Its argument is the name of a database entity, i.e. an index name in this case, intended to be called, and its return type is boolean. If the given index exists and is valid for usage, the $request method returns true and additionally prevents the database from disabling or removing this index before its successive call finishes. The $request method returns false if the specified index is not accessible or valid.
Consequently, assuming the idxPerAge index exists, the following example query:
Person where surname = "NOWAK" and age = 30
can be rewritten to a form preventing any problems with evaluation in case of removal or disabling of the idxPerAge index:
if ($request(idxPerAge))
  $index_idxPerAge(30 groupas $equal) where surname = "NOWAK"
else Person where surname = "NOWAK" and age = 30
Such an approach allows even more flexible and independent exploitation of indices by user applications. The administrator, apart from adding currently necessary indices, can register information about indices that are anticipated to be used in the future. This is often predictable already at the stage of designing the data schema, since the administrator can usually estimate which attributes will be used to construct selection predicates. The suitable information can be introduced to the index manager by issuing a command adding an index in the disabled state. The index optimiser can consider using registered indices during query evaluation by applying an appropriate transformation to the query syntax tree. For example, assuming the administrator has added information about the idxPerAge, idxPerSurname and idxPerAge&Surname indices, the query above can be rewritten to increase its flexibility:
if ($request(idxPerAge&Surname)) then
  $index_idxPerAge&Surname(30 groupas $equal;
    "NOWAK" groupas $equal)
else if ($request(idxPerSurname)) then
  $index_idxPerSurname("NOWAK" groupas $equal) where age = 30
else if ($request(idxPerAge)) then
  $index_idxPerAge(30 groupas $equal) where surname = "NOWAK"
else Person where surname = "NOWAK" and age = 30
Determining the precedence of indices should be facilitated by the cost model, e.g. according to the selectivity property, so that the best available index is always used at run-time. This solution permits the administrator to freely disable and enable available indices without the necessity to compile the query again.
Consequently, partial recompilation of user applications would be essential only in case of adding a different, not yet registered index in order to improve performance of queries that can exploit it. The most important benefit of the proposed solution is that applications can work continuously, independently of index management actions, and flexibly exploit available indices.
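The $request-based rewriting can be illustrated with a small run-time sketch. The registry below is a hypothetical model, not ODRA's actual API: an index is pinned while a call is in progress, disabling fails while the index is pinned, and a disabled index makes the query fall back to a full scan.

```python
import threading

class IndexRegistry:
    """Hypothetical sketch of the proposed $request(index_name) mechanism."""

    def __init__(self):
        self._lock = threading.Lock()
        self._enabled = {}   # index name -> enabled flag
        self._pinned = {}    # index name -> number of active calls

    def add(self, name, enabled=True):
        with self._lock:
            self._enabled[name] = enabled
            self._pinned.setdefault(name, 0)

    def disable(self, name):
        with self._lock:
            if self._pinned.get(name, 0) > 0:
                return False             # still used by a running evaluation
            self._enabled[name] = False
            return True

    def request(self, name):             # the compile-time-injected test
        with self._lock:
            if self._enabled.get(name, False):
                self._pinned[name] += 1  # pin until the call finishes
                return True
            return False

    def release(self, name):
        with self._lock:
            self._pinned[name] -= 1

def find_person(reg, people, surname, age, idx_by_age):
    # run-time equivalent of:  if ($request(idxPerAge)) <index call>
    #                          else <full scan>
    if reg.request("idxPerAge"):
        try:
            return [p for p in idx_by_age.get(age, [])
                    if p["surname"] == surname]
        finally:
            reg.release("idxPerAge")
    return [p for p in people
            if p["surname"] == surname and p["age"] == age]
```

The same compiled query keeps working, and returns the same result, whether the index is enabled or not.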
Chapter 6
Indexing Optimisation Results
Test results are average values from 20 subsequent measurements performed on the example schema presented in Fig. 3.1, populated with random data.
Tests were performed on the following single machine:
Tab. 6-1 Optimisation testbench configuration

Property  | Value
----------+--------------------------------------------------
Processor | Intel Mobile Core 2 Duo T2300, 1.66 GHz
RAM       | 2.00 GB
HDD       | 120 GB, 5400 rpm
OS        | MS Windows Server 2003 R2 Service Pack 2, 32 bit
JVM       | Sun JRE SE 1.6.0_03
The data store of the ODRA OODBMS prototype is entirely mapped into RAM using memory-mapped file access. The current implementation allows performing tests on 300000 objects representing people related to the company.
6.1 Test Data Distribution
The data distribution is presented in the following figures.

Fig. 6.1 Department's location distribution (locations: Warszawa, Łódź, Kraków, Wrocław, Poznań, Gdańsk, Szczecin)

Fig. 6.2 Employee's department distribution (departments: production, retail, wholesale, research, warehousing, CNC, customer service, logistics, security, payments, HR, employment, BHP)
Fig. 6.3 Employee's salary range distribution (salary ranges: 300 - 600, 600 - 1000, 1000 - 1400, 1400 - 1800, 1800 - 2400 and 2400 - 6000)

Fig. 6.4 Female person's first name distribution (the 50 most frequent names, from ANNA to STEFANIA)

Fig. 6.5 Female person's surname distribution (the 50 most frequent surnames, from NOWAK to WRÓBLEWSKA)

Fig. 6.6 Male person's first name distribution (the 50 most frequent names, from JAN to WALDEMAR)

Fig. 6.7 Male person's surname distribution (the 50 most frequent surnames, from NOWAK to WRÓBLEWSKI)
Instances of the classes PersonClass, StudentClass, EmpClass and EmpStudentClass are distributed equally. The age of regular and employed students is distributed randomly between 19 and 30 inclusive. Employees' age is distributed between 18 and 65 inclusive. The remaining persons are between 1 and 100 years old inclusive. The value of a student's scholarship is randomly 0, 200 or 500. Sex values are equally distributed. The actual distribution of data gets closer to the assumed one as the number of employees increases.
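Under the assumptions listed above, a comparable test population can be regenerated. The sketch below is illustrative: the location weights are rough readings from Fig. 6.1, not the thesis' exact values.

```python
import random

LOCATIONS = ["Warszawa", "Łódź", "Kraków", "Wrocław",
             "Poznań", "Gdańsk", "Szczecin"]
LOC_WEIGHTS = [0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05]  # assumed values

def make_person(rng):
    # the four classes are instantiated in equal shares
    cls = rng.choice(["Person", "Student", "Emp", "EmpStudent"])
    if cls in ("Student", "EmpStudent"):
        age = rng.randint(19, 30)              # students: 19..30 inclusive
        scholarship = rng.choice([0, 200, 500])
    elif cls == "Emp":
        age, scholarship = rng.randint(18, 65), None
    else:
        age, scholarship = rng.randint(1, 100), None
    return {"class": cls, "age": age, "scholarship": scholarship,
            "city": rng.choices(LOCATIONS, weights=LOC_WEIGHTS, k=1)[0],
            "sex": rng.choice(["M", "F"])}     # equally distributed

rng = random.Random(42)
people = [make_person(rng) for _ in range(1000)]
```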
6.2 Sample Index Optimisation Test
The main tests compare times between query executions with the index optimiser enabled and disabled. Additional elements of execution taken into account are static evaluation (i.e. type-checking) and optimisation (cf. subchapter 5.1). For each test the set of existing indices is specified. Each query is given a plot of reference (ref. avg. time) and index-optimised (opt. avg. time) execution times for 10, 100, 1000, 3000, 10000, 30000, 100000 and 300000 person objects, together with the optimisation gain (the ratio of the evaluation times). A time measurement can be disrupted by unexpected actions of the OS, hardware or applications running in the background. In order to eliminate the influence of such interferences, the test results are estimated with an average of 20 subsequent measurements. Multiple measurements particularly increase the precision of tests with short query evaluation times. Therefore, for longer-lasting tests a smaller number of measurements was performed, i.e. 5 measurements for tests longer than 10 minutes and 1 measurement for tests longer than 30 minutes.
To improve readability, results on the plots are presented using a logarithmic scale on the x-axis. All queries below are shown without the decoration introduced by the static evaluator (e.g. implicit dereferences and coercions) and without transformations done by standard ODRA optimisation methods other than indexing.
Query 6.1a: Retrieves persons named KOWALSKI who are 28 years old or less
reference:
  Person where surname = "KOWALSKI" and age <= 28
index optimised:
  idxPerAge&Surname((-2147483648, 28, true, true) groupas $range;
    "KOWALSKI" groupas $equal)
Fig. 6.8 Evaluation times and optimisation gain for Query 6.1 (ref. avg. time and opt. avg. time in seconds, gain as a ratio, vs. the number of persons)
The optimisation gain for this simple query proves the effectiveness of ODRA indexing in the case of the idxPerAge&Surname index. The data distribution indicates that the surname KOWALSKI occurs in two or three cases out of 100 people. The optimisation gain plot shows that for more than 30000 persons the amount of data processed by the query is reduced more than 50 times, i.e. out of 100 people only one or two persons are processed. This is the result of combining both selection predicates in a single index call. Creating similar indices for collections smaller than 100 objects is not beneficial.
The second type of performed tests is designed to verify index properties. It is achieved by comparing the optimisation gains obtained with the use of different indices.
Query 6.1b
idxPerAge&Surname optimisation:
  idxPerAge&Surname((-2147483648, 28, true, true) groupas $range;
    "KOWALSKI" groupas $equal)
idxPerAge optimisation:
  idxPerAge((-2147483648, 28, true, true) groupas $range)
    where surname = "KOWALSKI"
idxPerSurname optimisation:
  idxPerSurname("KOWALSKI" groupas $equal) where age <= 28
Fig. 6.9 Indices optimisation gain for Query 6.1 (idxPerAge&Surname, idxPerAge and idxPerSurname gain ratio vs. the number of persons)
The plot confirms that index calls reach the desired optimisation gain as the number of non-key objects grows. The performance improvement using the idxPerAge index is small (a gain ratio slightly below 2) because the age predicate covers a large value range. The efficiency of the idxPerAge&Surname multiple-key index significantly exceeds that of the single-key indices for larger databases (much more than 10000 persons).
6.3 Omitting Key in an Index Call Test – enum Key Types
In the case of multiple-key indices, e.g. idxPerAge&Surname, the index optimiser can sometimes omit specifying a value of a key in an index call (the enum type key described in section 4.1.1). This feature makes an index more flexible, and consequently the set of existing indices can be reduced.
Query 6.2a: Counts persons named KOWALSKI, KOWALSKA or NOWAK
reference:
  count(Person where surname in ("KOWALSKI" union "KOWALSKA" union "NOWAK"))
index optimised:
  count(idxPerAge&Surname((-2147483648, 2147483647, true, true) groupas $range;
    ("KOWALSKI" union "KOWALSKA" union "NOWAK") groupas $in))
Fig. 6.10 Evaluation times and optimisation gain for Query 6.2 (ref. avg. time and opt. avg. time in seconds, gain as a ratio, vs. the number of persons)
The optimisation gain for a large number of objects is comparable to the reduction of the amount of data processed according to the selection predicate, i.e. approximately one out of ten people has the surname KOWALSKI, KOWALSKA or NOWAK. This result meets expectations; however, creating a suitable single-key index, i.e. idxPerSurname, can improve efficiency even further.
Query 6.2b
idxPerAge&Surname optimisation:
  count(idxPerAge&Surname((-2147483648, 2147483647, true, true) groupas $range;
    ("KOWALSKI" union "KOWALSKA" union "NOWAK") groupas $in))
idxPerSurname optimisation:
  count(idxPerSurname(("KOWALSKI" union "KOWALSKA" union "NOWAK") groupas $in))
Despite the fact that both index calls (see Query 6.2b) return the same collection of objects, the plot in Fig. 6.11 indicates that the optimisation gain for the idxPerSurname index is up to 30 times greater than for the idxPerAge&Surname index. An additional reason for such high performance is that the index-optimised queries do not post-process the selected objects with a where clause (as in the original query).
Omitting an index key is a useful feature but, depending on the index structure, it has an impact on the index efficiency.
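The efficiency effect of omitting a key can be illustrated with a small sketch. The structure below is hypothetical, not ODRA's linear-hashing implementation: omitting the age key corresponds to calling with the full integer range, so every age partition must be visited, while a dedicated single-key surname index would reach the matching entries directly.

```python
class TwoKeyIndex:
    """Illustrative two-key index keyed by (age, surname)."""

    def __init__(self):
        self.by_age = {}                     # age -> surname -> [objects]

    def insert(self, person):
        self.by_age.setdefault(person["age"], {}) \
                   .setdefault(person["surname"], []).append(person)

    def lookup(self, age_lo, age_hi, surnames):
        out = []
        for age, by_surname in self.by_age.items():
            if age_lo <= age <= age_hi:      # full range: always true
                for s in surnames:
                    out.extend(by_surname.get(s, []))
        return out

idx = TwoKeyIndex()
for p in [{"age": 25, "surname": "NOWAK"},
          {"age": 40, "surname": "NOWAK"},
          {"age": 25, "surname": "KOWALSKI"}]:
    idx.insert(p)

# omitted age key, as in Query 6.2a: every age partition is visited
everyone_nowak = idx.lookup(-2147483648, 2147483647, ["NOWAK"])
```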
Fig. 6.11 Indices optimisation gain for Query 6.2 (idxPerAge&Surname and idxPerSurname gain ratio vs. the number of persons)
6.4 Multiple Index Invocation Test
If an index call is located on the right side of a non-algebraic operator, e.g. the dot, then it is likely to be evaluated more than once during the query execution. This is shown in the example Query 6.3 with the idxEmpTotalIncomes index. In Fig. 6.12 and Fig. 6.13 the logarithmic scale has been used also on the y-axis.
In Fig. 6.12 the dependency between the optimisation gain and the number of persons is close to linear and grows to 457 for 300000 person objects.
Query 6.3a: For 61 years old, married employees living in Łódź and working in Łódź or Wrocław, retrieves the name concatenated with the surname and the number of employees with an equal amount of total incomes.
reference:
  ((Emp where address.city = "Łódź" and
    worksIn.Dept.address.city in ("Łódź" union "Wrocław") and
    married = true and age = 61) as e).
  (e.name + " " + e.surname,
    count(Emp where getTotalIncomes() = e.getTotalIncomes()))
index optimised:
  ((Emp where address.city = "Łódź" and
    worksIn.Dept.address.city in ("Łódź" union "Wrocław") and
    married = true and age = 61) as e).
  (e.name + " " + e.surname,
    count(idxEmpTotalIncomes(e.getTotalIncomes())))
Fig. 6.12 Evaluation times and optimisation gain for Query 6.3 (logarithmic scale on both axes; ref. avg. time and opt. avg. time in seconds, gain as a ratio, vs. the number of persons)
Additionally, introducing another index – idxEmpAge&WorkCity – to optimise the evaluation of the first part of the query can significantly influence the performance:
Query 6.3b
idxEmpTotalIncomes optimisation:
  ((Emp where address.city = "Łódź" and
    worksIn.Dept.address.city in ("Łódź" union "Wrocław") and
    married = true and age = 61) as e).
  (e.name + " " + e.surname,
    count(idxEmpTotalIncomes(e.getTotalIncomes())))
idxEmpAge&WorkCity optimisation:
  ((idxEmpAge&WorkCity(61 groupas $equal;
    ("Łódź" union "Wrocław") groupas $in) where
    address.city = "Łódź" and married = true) as e).
  (e.name + " " + e.surname,
    count(Emp where getTotalIncomes() = e.getTotalIncomes()))
both indices optimisation:
  ((idxEmpAge&WorkCity(61 groupas $equal;
    ("Łódź" union "Wrocław") groupas $in) where
    address.city = "Łódź" and married = true) as e).
  (e.name + " " + e.surname,
    count(idxEmpTotalIncomes(e.getTotalIncomes())))
For a database consisting of 300000 persons the combination of the two indices gives an optimisation gain approximately 40 times greater (see Fig. 6.13). Despite this difference, the most important index is the repeatedly invoked one, i.e. idxEmpTotalIncomes; without it the query performance does not improve noticeably.
Fig. 6.13 Indices optimisation gain for Query 6.3 (idxEmpTotalIncomes, idxEmpAge&WorkCity and both-indices gain ratio, logarithmic scale, vs. the number of persons)
6.5 Complex Expression Based Index Test
The test concerns processing complex selection predicates. As an example, optimisation involving the idxDeptYearCost index is shown.
Query 6.4: Gets the names of departments whose employees earn in total more than 10000 a year.
reference:
  (Dept where sum(employs.Emp.salary) * 12 > 10000).name
index optimised:
  idxDeptYearCost((10000, 2147483647, false, true) groupas $range).name
Fig. 6.14 Evaluation times and optimisation gain for Query 6.4 (ref. avg. time and opt. avg. time in seconds, gain as a ratio, vs. the number of persons)
The query selects some of the 13 departments. The non-optimised execution time grows linearly with the increasing number of Person objects. In the case of the idxDeptYearCost index a precise key value is pre-calculated and stored inside the index structure; therefore, the index call execution is extremely fast and independent of the number of persons.
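The pre-calculated key can be sketched as follows; the department data and helper names below are illustrative, not taken from the thesis. The derived key sum(salaries) * 12 is computed once per department when the index is materialised, so a range call touches only as many keys as there are departments, regardless of how many employees contribute to each sum.

```python
import bisect

# illustrative sample data
depts = [{"name": "retail",   "salaries": [1000, 1200]},
         {"name": "research", "salaries": [2000, 2500, 1800]}]

# materialised (key, non-key) pairs, kept sorted by the derived key
idx_dept_year_cost = sorted((sum(d["salaries"]) * 12, d["name"])
                            for d in depts)

def names_with_year_cost_over(threshold):
    # range scan for keys strictly greater than the threshold;
    # the sentinel character makes equal keys compare as smaller
    pos = bisect.bisect_right(idx_dept_year_cost, (threshold, chr(0x10FFFF)))
    return [name for _, name in idx_dept_year_cost[pos:]]
```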
6.6 Disjunction of Predicates Test
The query optimisation presented below is the result of the process described in section 5.5.3. It is assumed that the administrator has created the idxEmpCity and idxEmpWorkCity indices. The lack of either of them would make optimisation impossible, because the execution of Query 6.5 would still require processing all the data.
Query 6.5a: Counts employees aged 57 or more and less than 61 who live or work in Szczecin.
reference
count(Emp where age >= 57 and age < 61 and
(address.city = "Szczecin" or
worksIn.Dept.address.city in "Szczecin"))
index
optimised
count(uniqueref(
(idxEmpCity("Szczecin" groupas $equal) where
deref(age) >= 57 and deref(age) < 61) union
(idxEmpWorkCity("Szczecin" groupas $equal) where
deref(age) >= 57 and deref(age) < 61)))
Fig. 6.15 Evaluation times and optimisation gain for Query 6.5 (ref. avg. time and opt. avg. time in seconds, gain as a ratio, vs. the number of persons)
The plot confirms that the implemented solution improves the query execution time even for a database of 1000 person objects. The gain decrease at 100000 persons results from random variations in the distribution of the objects specified by the index parameters (the number of objects matching the given criteria grows linearly only on average).
In some situations it is possible to take advantage of an index which does not require rewriting the predicates in the disjunction. In this case the idxEmpAge index can be introduced by the administrator, and splitting the given query into two where clauses can be avoided. The plot in Fig. 6.16, however, indicates that in this particular case the latter index optimisation mostly produced a smaller optimisation gain. The proper approach for selecting the most efficient solution should rely on a fast and (as far as possible) accurate cost model (see subchapter 5.4).
Query 6.5b
idxEmpCity and idxEmpWorkCity optimisation:
  count(uniqueref(
    (idxEmpCity("Szczecin" groupas $equal) where
      deref(age) >= 57 and deref(age) < 61) union
    (idxEmpWorkCity("Szczecin" groupas $equal) where
      deref(age) >= 57 and deref(age) < 61)))
idxEmpAge optimisation:
  count(idxEmpAge((57, 61, true, false) groupas $range) where
    (address.city = "Szczecin" or
     worksIn.Dept.address.city in "Szczecin"))
Fig. 6.16 Indices optimisation gain for Query 6.5 (idxEmpCity with idxEmpWorkCity vs. idxEmpAge gain ratio vs. the number of persons)
Chapter 7
Indexing for Optimising Processing of
Heterogeneous Resources
The updateable views described in subchapter 3.5 allow seamless integration of heterogeneous data sources. As a result, users can transparently process and modify data shared by the contributing resources. Because of its complex multi-layer architecture such an environment is not efficiency-oriented; therefore, it requires dedicated optimisation methods. In this context the approach to index maintenance presented in subchapter 4.3 is inappropriate: external resource objects or table rows used to determine non-key or key values are not easy to identify, and additionally the distributed database is unaware of local data modifications. A generic solution to this problem is out of the scope of the dissertation, since it is a wide research topic in itself.
The author's approach, exploiting the indexing architecture presented in the thesis, is based on the observation that in many index-optimised queries an index is invoked multiple times. In such a scenario query performance would benefit even if the index were created during the evaluation of the query itself. Consequently, index maintenance becomes unnecessary.
7.1 Volatile Indexing
The idea of a volatile index is similar to a temporary index used in RDBMSs (see subchapter 2.3). A regular index, as a redundant structure, requires an automatic updating mechanism to keep it consistent with the data. In the case of volatile indices the database permanently stores only the index definition and materialises the index during query evaluation; therefore, automatic updating of volatile indices becomes superfluous. The main and obvious disadvantage of this approach is the necessity to perform index materialisation during query evaluation.
The time of index generation is at least the time of a single evaluation of the where clause on which the optimisation occurs. Therefore, the query evaluation performance will not improve if such an index is invoked only once. The index optimiser should predict such situations to avoid unnecessarily generating a volatile index.
7.1.1 Conditions for Volatile Indexing Optimisation
When a volatile index is called multiple times during the query evaluation, the optimisation gain can be comparable to a regular index. Such a situation can occur when the optimised where clause is situated on the right side of a non-algebraic operator, as presented in the following figure.
Fig. 7.1 Query suitable for applying a volatile index (a non-algebraic expression with left subexpression leftexpr and right subexpression rightexpr; rightexpr is a where expression whose left subexpression is refsexpr and whose right subexpression is predicates)
When the leftexpr expression returns a collection, the rightexpr containing the where clause is evaluated once for each element of the result collection (see Tab. 3-3). It is assumed that there exists an index defined on objects returned by the refsexpr expression. Consequently, refsexpr has to be independent of the non-algebraic operator. Moreover, the predicates expression should contain selection predicates defining key values for the given index which are dependent on the given non-algebraic operator, so that the index key is context dependent (key values should not be constant for all iterations of the non-algebraic operator). Otherwise, the whole where clause would be independent and should be evaluated only once, before the non-algebraic expression (see the factoring out independent subqueries method in section 5.6.1).
7.1.2 Index Materialisation
In ODRA the volatile index materialisation occurs directly before the first index invocation. It consists of the following steps:
1. The non-key and key values are calculated through execution of the query:
<nonkeyexpr> join (<keyexpr_1> [ , <keyexpr_2> ... ])
which is generated on the basis of the index definition. The nonkeyexpr expression is equal to the refsexpr expression from the optimised where clause (see Fig. 7.1) and the keyexpr_i expressions form the predicates. The query returns a collection of structures consisting of a non-key value and the corresponding key values.
2. An index structure is initialised.
3. The cached query run-time result is made available for the index structure.
4. Non-key values are indexed according to key values.
In that way, cached run-time results are directly returned by index calls during the evaluation of a query optimised with a volatile index. A regular index constructs run-time results from the values stored in the database; therefore, an individual volatile index invocation might even be faster. After the query execution the volatile index contents are removed and only the index definition remains.
It is vital that the query determining index non-key and key values is executed as efficiently as possible; hence, participation of the available optimisers is often indispensable. This is particularly important when the query addresses a distributed and heterogeneous collection of objects.
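The four steps can be sketched with a hypothetical helper; a Python dictionary stands in for the index structure, and plain objects stand in for the cached run-time results.

```python
from collections import defaultdict

def materialise_volatile_index(nonkey_objects, key_of):
    # step 1: <nonkeyexpr> join <keyexpr> -> (non-key, key) structures,
    # evaluated exactly once
    pairs = [(obj, key_of(obj)) for obj in nonkey_objects]
    index = defaultdict(list)        # step 2: initialise the structure
    for obj, key in pairs:           # steps 3-4: index the cached results
        index[key].append(obj)
    return index

# illustrative sample data
emps = [{"name": "NOWAK", "income": 3000},
        {"name": "KOWALSKI", "income": 3000},
        {"name": "MAZUR", "income": 1200}]
vltl_idx = materialise_volatile_index(emps, lambda e: e["income"])
# an index call during the optimised evaluation is now a plain lookup:
matches = vltl_idx[3000]
```

After the query finishes, only the definition (here, the `key_of` expression) would be retained; the dictionary itself is discarded.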
7.1.3 Solution Properties
Many important properties of a regular index also concern a volatile index:
• from the user's point of view it is used like a regular index, except that it is created using a different command, e.g.:
add vltlindex
• the index transparency is achieved using standard index optimiser routines,
• a volatile index call in an SBQL syntax tree and in compiled ODRA intermediate byte-code is the same as a regular index call.
For that reason, the architecture of the volatile indexing technique relies on the developed indexing architecture described in Chapter 4 and Chapter 5.
The next sections show by example the effectiveness of the volatile indexing technique and its application to indexing heterogeneous and distributed resources.
7.1.4 Proof of Concept Test
Let us consider the example query introduced in subchapter 6.4:
((Emp where address.city = "Łódź" and worksIn.Dept.address.city
in ("Łódź" union "Wrocław") and married = true and age = 61) as
e).(e.name + " " + e.surname,
count(Emp where getTotalIncomes() = e.getTotalIncomes()))
Its syntax tree is depicted below.
Fig. 7.2 Syntax tree of the example query
The dot expression (at the root of the syntax tree) represents the root non-algebraic operator from Fig. 7.1. Its left subquery returns a collection of binders named e containing references to 61 years old, married employees living in Łódź and working in Łódź or Wrocław. The right dot subquery is evaluated for each binder with an employee object. The count expression, marked with a dashed line, calculates the number of employees whose total incomes equal those of the processed employee. In the example from subchapter 6.4 the where clause in this subexpression was substituted with an index call:
count(idxEmpTotalIncomes(e.getTotalIncomes()))
The index key is the total income of the currently processed employee; thus, it depends on the dot non-algebraic operator. The query meets all the conditions, presented in section 7.1.1, that are necessary to take advantage of a volatile index vltlIdxEmpTotalIncomes, defined similarly to the idxEmpTotalIncomes index. Consequently, the count subexpression can be transformed accordingly:
count(vltlIdxEmpTotalIncomes(e.getTotalIncomes()))
The plot in Fig. 7.3 shows the optimisation gain for the given query optimised using the indices mentioned above. The gain for the query optimised with the volatile indexing technique is in general smaller than in the case of a regular index. Nevertheless, for a database consisting of more than 30000 persons the query performance improvement is significant: after applying a volatile index the query execution is more than 39 times faster.
Fig. 7.3 Optimisation gain for volatile and regular indices (vltlIdxEmpTotalIncomes and idxEmpTotalIncomes gain ratio, logarithmic scale, vs. the number of persons)
7.2 Optimising Queries Addressing Heterogeneous
Resources
The most important feature distinguishing a volatile index from a regular index concerns limitations on non-key values. In the current solution, regular indices can index database objects defined using simple path expressions which return object references. This limitation is caused by the index updating mechanism; therefore, it does not concern the volatile indexing technique, where the non-key definition can be an arbitrary expression returning:
• remote object references,
• virtual object references (updateable views seeds – see subchapter 3.5),
• binders and literals.
The basic assumption concerning the non-key object and key value definition for a volatile index is determinism, i.e. it must return exactly the same result provided that the data used to calculate it has not changed.
The significant advantage of the volatile indexing technique is its practicability for the integration of distributed and heterogeneous resources. The next section gives an overall description of the wrapper enabling transparent integration of RDBMS resources into the ODRA distributed database repository. The following section gives an example involving processing of a heterogeneous schema exploiting the wrapper. The test proves that the volatile indexing technique can significantly facilitate evaluation of queries in such an environment.
7.2.1 Overview of a Wrapper to RDBMS
Wrapping relational resources into the ODRA prototype has been developed
under the eGov-Bus project12 [28]. The ODRA database server is used as a virtual
repository the only component accessible for the top-level users and applications. A
virtual repository presents a global schema. Virtually integrated data from the
underlying heterogeneous resources are made available using SBQL views’ definitions
(described in subchapter 3.5). In the virtual repository concept neither data nor services
are to be be copied, replicated and maintained in the global schema, as they are
supplied, stored, processed and maintained on their autonomous sites.
The research devoted to object-oriented wrappers to relational databases
supporting query optimisation has been described in [128] and in many papers
[63, 64, 129, 130, 131, 132]. The author has contributed to the virtual repository and
wrapper development.
An ODRA resource (an ODRA engine) denotes any data resource providing an
interface capable of executing SBQL queries and returning SBQL result objects. The
nature of such a resource is irrelevant, as only the mentioned capability is important. In
the simplest case, where a resource is an ODRA database, its interface has direct
access to an ODRA database engine (DBMS). However, as the virtual repository aims to
integrate existing business resources, whose models are mainly relational, the
interface becomes much more complicated, as there is no directly available data store –
SBQL result objects must be created dynamically based on results returned from SQL
queries evaluated directly in a local RDBMS.
Such cases (the most common in real-life applications) force the introduction of
additional middleware: an object-relational wrapper designed as a client-server solution.
A standard ODRA database can be extended with as many wrappers as needed (e.g. for
relational or semi-structured data stores) and plugged into any resource model without
any loss of its original performance. Furthermore, a wrapper server can be developed
independently, providing a communication protocol to its client. Of course, an ODRA
database with a wrapper’s client can work on a separate machine.
¹² Advanced eGovernment Information Service Bus, supported by the European Community
under the “Information Society Technologies” priority of the Sixth Framework Programme, contract
number FP6-IST-4-026727-STP.
A query evaluation process exploiting the wrapper is depicted in Fig. 7.4. One
of the global applications sends a query (arrow 1). This query is expressed in SBQL,
as it refers to the business object-oriented model available to global (top-level) users.
According to the global schema and its information on data fragmentation, replication
and physical location (obtained from integration schemata), the query is sent to
appropriate resources. In Fig. 7.4 this stage is realised with arrows 2, 2a and 2b.
[Figure: the global application issues a query to the global virtual store (global schema, SBQL query result composition), which dispatches partial SBQL queries through ODRA interfaces to ODRA resources and – via the wrapper client and wrapper server (SQL optimisation information, resource model, JDBC connection) – to the RDBMS; arrows 1–10 mark the evaluation stages described below.]
Fig. 7.4 Query evaluation through the wrapper [64]
The partial query aimed at a given relational resource is further processed by
the resource’s ODRA interface. First, the interface performs query optimisation. Apart
from the efficient SBQL optimisation rules applied at any resource’s interface, queries can
be transformed so that powerful native SQL optimisers can work and the amount of data
retrieved from the RDBMS is acceptably small. Relational optimisation information
(indices, cardinalities, primary-foreign key relationships, etc.) is provided by the
wrapper server’s resource model (arrow 3) and the appropriate SBQL query syntax tree
transformations are performed. The appropriate tree branches (responsible for such SQL
queries) are substituted with calls to execute immediately procedures with optimisable
SQL queries.
Once the syntax tree transformations are finished, the interface starts regular SBQL
query evaluation. Whenever it finds an execute immediately procedure, an SQL query is
sent to the server via the client (arrows 4; the client passes SQL queries without any
modification). The server executes SQL queries as a resource client (JDBC connection,
arrow 5) and their results (arrow 6) are encapsulated and sent to the client (arrow 7).
Subsequently, the client creates SBQL result objects from the results returned from the
server (this cannot be accomplished at the resource site, which is another crucial reason
for a client-server architecture) and puts them on regular SBQL stacks for further
evaluation (arrow 8). In the preferred case (which is not always possible), results
returned from the server are supplied with TIDs (tuple identifiers), which enables
parametrising SQL queries within the SBQL syntax tree with intermediate results of
SBQL subqueries. Having finished its evaluation, the interface sends a “partial result”
upwards (arrow 9), where it is combined with results returned from other resources
(arrows 9a and 9b) and the global query result is composed (depending on
fragmentation types, redundancies and replication). This result is returned to the global
application (arrow 10).
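The arrows 4–8 round trip can be sketched as follows – a hedged Python illustration with sqlite3 standing in for the JDBC-connected RDBMS; the toy schema and all function names are hypothetical, not the ODRA API:

```python
import sqlite3

# Stand-in for the RDBMS behind the wrapper server (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Kim", 1200.0), (2, "Lee", 900.0)])

def wrapper_server_execute(sql, params=()):
    # Arrows 5/6: execute the SQL at the resource site, return raw rows.
    return conn.execute(sql, params).fetchall()

def wrapper_client_execute(sql, params=()):
    # Arrow 4: the client passes the query through unchanged;
    # arrow 8: it builds SBQL-like result objects from the raw rows.
    rows = wrapper_server_execute(sql, params)
    return [{"id": r[0], "name": r[1], "salary": r[2]} for r in rows]

results = wrapper_client_execute(
    "SELECT id, name, salary FROM employees WHERE salary > ?", (1000.0,))
```

The split mirrors the text: the server only ships encapsulated raw rows, while turning them into result objects happens at the client site.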
The test presented in the next section takes advantage of the above-presented
features of ODRA and the wrapper to RDBMS.
7.2.2 Volatile Indexing Technique Test
For the following test, a local ODRA OODBMS data schema and an external data
schema are combined. The local data represent one company, which will be referred to
as company O. Its schema and distribution are the same as in the tests from this and the
previous chapter.
The external relational schema (Fig. 7.5) concerns another company, which will
be referred to as company R. Its records are automatically wrapped to a simple internal
object-oriented schema.
Fig. 7.5 Example relational schema of company R
Each table row virtually corresponds to a complex object, which consists of primitive
(atomic) subobjects according to the table columns. Finally, the schema is transformed by
the designed updateable views (Fig. 7.6) and extended with a virtual pointer worksIn
associating employees with departments.
Fig. 7.6 Views object-oriented schema of company R
The RDBMSEmp and RDBMSDept views allow ODRA users to transparently access, process
and even modify data shared by the RDBMS. For the test purposes, the relational
schema created on the PostgreSQL 8.2 RDBMS has been populated with data about 100
employees and departments according to the distribution depicted in subchapter 6.1.
The example usage of the volatile indexing technique is tested on the following
query:
Query 7.1a: For each company O employee return a name concatenated with a surname and the number
of employees with a higher salary working in the R company in the same department as the given
employee.
original query:
Emp as empaux.(empaux.name + " " + empaux.surname,
(empaux.worksIn.Dept.name as deptnameaux).
((empaux.salary as empsalaryaux).count(RDBMSEmp
where worksIn.RDBMSDept.name = deptnameaux
and salary > empsalaryaux)))
Its syntax tree is consistent with the pattern shown in Fig. 7.1. The left dot subquery
addresses all employees of company O. The right dot subquery is evaluated for each
binder containing an employee object. The count expression calculates the number of
company R employees selected according to the where clause. The values of the selection
predicates are specified by the currently processed company O employee.
In order to optimise query evaluation, the optimiser uses the methods mentioned in
section 5.6.3, i.e. the query modification technique and removal of unnecessary auxiliary
names. In this form the transformed query can be processed by the wrapper optimiser
and the whole count expression can be substituted with an appropriate SQL query (see
Query 7.1b – reference). This query also depends on the currently processed company O
employee. Therefore, it is sent multiple times to the relational data resource (connected
through the wrapper), where it can be evaluated with the assistance of native optimisations
provided by the RDBMS (e.g. indices, projections, joins).
Query 7.1b – reference:
Emp as empaux.(empaux.name + " " + empaux.surname,
(empaux.worksIn.Dept.name as deptnameaux).
((empaux.salary as empsalaryaux).
execsql("select COUNT(*) from employees, departments
where departments.name = '" + deptnameaux + "' AND
departments.id = employees.department_id AND
employees.salary > '" + empsalaryaux + "'"),
"<0 $employees | | e | none | binder 0>",
"admin.rdbms")
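The reference strategy above – one SQL round trip per outer company O employee – can be mimicked with a minimal sketch (sqlite3 as a stand-in RDBMS; the data values are hypothetical, the column names follow Fig. 7.5):

```python
import sqlite3

# Toy relational schema of company R (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER, name TEXT);
CREATE TABLE employees (id INTEGER, department_id INTEGER, salary REAL);
INSERT INTO departments VALUES (1, 'Sales'), (2, 'IT');
INSERT INTO employees VALUES (10, 1, 1000), (11, 1, 2000), (12, 2, 1500);
""")

def count_better_paid(dept_name, salary):
    # One SQL round trip per outer (company O) employee -- exactly the
    # communication cost the volatile index later removes.
    sql = ("SELECT COUNT(*) FROM employees, departments "
           "WHERE departments.name = ? "
           "AND departments.id = employees.department_id "
           "AND employees.salary > ?")
    return conn.execute(sql, (dept_name, salary)).fetchone()[0]
```

Evaluating Query 7.1a this way calls the function once per company O employee.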
Query 7.1b – index optimised:
Emp as empaux.(empaux.name + " " + empaux.surname,
(empaux.worksIn.Dept.name as deptnameaux).
((empaux.salary as empsalaryaux).
count($vltlIdxRDBMSEmp(deptnameaux groupas $equal;
(empsalaryaux, 1.7976931348623157E308, false, true)
groupas $range))))
[Figure: plot of the reference and optimised average evaluation times in seconds (left axis) and the optimisation gain ratio (right axis) against the number of persons, from 10 to 1 000 000.]
Fig. 7.7 Evaluation times and optimisation gain for Query 7.1
The alternative way to improve the performance of Query 7.1 evaluation is to
take advantage of the volatile indexing technique. The administrator can create a
volatile index on RDBMSEmp defined using multiple keys. The first key is the name of the
department where an employee works (definition worksIn.RDBMSDept.name) and
the second one is the employee’s salary (definition salary). The second key should enable
optimisation of range queries. Let us assume that such an index exists and its name is
vltlIdxRDBMSEmp. The index optimiser would transform the given query in order to
exploit the volatile indexing technique (see Query 7.1b – index optimised).
The test evaluation times plotted in Fig. 7.7 depend on the number of company O
employees. The gain indicates that the second approach to query optimisation results in
more than 40 times better performance.
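The index call in the optimised form combines an equal key on the department name ($equal) with a range key on the salary ($range, whose two boolean flags mark the inclusiveness of the bounds). A hypothetical Python sketch of such a two-key lookup over a materialised volatile index (this is an illustration, not the actual ODRA structure):

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def materialise(records):
    """records: (dept_name, salary) pairs produced once by the bulk query."""
    idx = defaultdict(list)
    for dept, salary in records:
        idx[dept].append(salary)
    for salaries in idx.values():
        salaries.sort()  # sorted per equal-key group -> cheap range counts
    return idx

def index_count(idx, dept, low, high, low_incl, high_incl):
    """Equal key on dept; range key on salary with inclusiveness flags."""
    salaries = idx.get(dept, [])
    lo = bisect_left(salaries, low) if low_incl else bisect_right(salaries, low)
    hi = bisect_right(salaries, high) if high_incl else bisect_left(salaries, high)
    return max(0, hi - lo)

idx = materialise([("IT", 100.0), ("IT", 200.0), ("HR", 150.0)])
# salary > 100 (exclusive lower bound, inclusive upper), as in Query 7.1b:
assert index_count(idx, "IT", 100.0, 1.7976931348623157e308, False, True) == 1
```

Once the index is materialised, each call is a dictionary lookup plus two binary searches, which is why repeated invocations become very cheap.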
A significant advantage of the index-optimised query evaluation is the reduction
of communication between ODRA and the RDBMS resource to a minimum. The volatile
index materialisation has a significant influence on the Query 7.1b evaluation.
According to the description in section 7.1.2, the execution of the following query is
necessary:
RDBMSEmp join (worksIn.RDBMSDept.name, salary)
A naïve evaluation would first return the whole contents of the employees table
corresponding to the RDBMSEmp expression. Next, for each table row an appropriate
SQL query would be issued to the RDBMS in order to determine the name of the
employee’s department. To improve the evaluation of the query determining non-key and
key values, the query modification technique, the removal of unnecessary auxiliary names
and the wrapper optimiser’s methods have been used. Consequently, the query
has been transformed to the following optimised form:
execsql("select employees.info, employees.department_id,
employees.surname, employees.salary, employees.id,
employees.sex, employees.name, employees.birth_date,
departments.name from employees, departments where
((departments.id = employees.department_id) AND (departments.id
= departments.id))", "<0 | | | none | struct <1 $employees |
| e | none | binder 1> <1 $departments | $name | | string |
value 1> <1 $employees | $salary | | real | value 1> 0>",
"admin.rdbms")
As a result, sending only one SQL query to the wrapper is necessary to materialise the
vltlIdxRDBMSEmp volatile index, i.e. to cache results containing seeds of virtual objects
corresponding to company R employee records together with the required key values. The
evaluation of this query on the given RDBMS generally takes longer and retrieves a larger
amount of data than a single SQL query invocation from the reference query.
Nevertheless, the profit of using a volatile index is considerable. The gain indicates that a
single invocation of the volatile index is more than 40 times faster than the execsql evaluation
in the reference query.
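The single-round-trip materialisation can be sketched as follows (sqlite3 again standing in for the wrapped RDBMS; the toy rows are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER, name TEXT);
CREATE TABLE employees (id INTEGER, department_id INTEGER, salary REAL);
INSERT INTO departments VALUES (1, 'Sales'), (2, 'IT');
INSERT INTO employees VALUES (10, 1, 900), (11, 1, 1500), (12, 2, 1200);
""")

# One SQL round trip retrieves every (seed, key) triple the volatile
# index needs -- instead of issuing one query per employee row.
rows = conn.execute(
    "SELECT employees.id, departments.name, employees.salary "
    "FROM employees, departments "
    "WHERE departments.id = employees.department_id").fetchall()

cache = {}
for emp_id, dept_name, salary in rows:
    cache.setdefault(dept_name, []).append((salary, emp_id))
```

One bulk join yields all seed and key values at once; the per-row queries of the naïve evaluation disappear.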
The test has been performed on a single machine, so the communication
between the wrapper and the RDBMS storing company R data was realised through a
local loopback interface. Nevertheless, the wrapper can access resources in a distributed
environment without obstruction. Additional factors, e.g. network throughput
and traffic delays, negatively influence the overall query evaluation performance.
Particularly in the context of the current example, they would significantly deteriorate the
execution of the reference query, because it requires processing multiple SQL queries.
The only contraindication to introducing the volatile indexing technique would
occur if index materialisation consumed too many local resources and deteriorated the
performance. This, however, is not an issue in the given example and should be
considered by the database administrator.
The volatile indexing solution is generic. It can be applied to any schema
consisting of deterministic views, e.g.:
• views transforming one schema (a schema of actual data or a view schema) into another,
• views providing access to external, legacy resources (e.g. returning objects provided by the wrapper to an RDBMS) regardless of their location,
• views integrating data from several distributed and heterogeneous resources into a common object schema (see the integration schema description in subchapter 4.4).
Testing the volatile indexing technique against the last type of views requires
extending the ODRA OODBMS with mechanisms supporting such schemas.
Chapter 8
Conclusions
The theses stated in the Ph.D. dissertation have been proved valid:
1. Processing of selection predicates based on arbitrary key expressions accessing
data in a distributed object-oriented database can be optimised by centralised or
distributed transparent indexing.
The designed indexing architecture provides the assumed level of transparency
in the established distributed and homogeneous environments. The architecture
comprises the optimisation module, which is able to employ indices in a
query transparently, automatic updating of indices in response to modifications of the
corresponding data, and the administration module for organising and managing indices.
In order to enable the creation and maintenance of indices supporting keys defined
using arbitrary, deterministic and side-effect-free expressions, the author has introduced
a special kind of database triggers. Each individual database object is associated with
an Index Update Trigger (IUT) if it belongs to an indexed collection, contains nested
indexed objects or is used to determine a key value. Any modification submitted to such
objects triggers a procedure necessary to update the corresponding index. Triggers
associated with objects used in key value evaluation are called Key Index Update
Triggers (KIUTs). Besides the information about an associated index, KIUTs hold
references to the corresponding indexed object. Determining the objects participating in
the calculation of the key value is made possible by an extension of the query execution
engine enabling logging of objects that occur during binding. As a result, together with the
re-calculation of a key value for an indexed object, it is possible to validate and correct
existing KIUTs. This approach is generic regardless of the expression defining an index
key.
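The trigger mechanism summarised above can be sketched as follows – a simplified Python model in which the store layout and all names are hypothetical, not the ODRA implementation:

```python
class Index:
    """Keeps the current key value for each indexed object."""
    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.key_of = {}  # indexed object id -> current key value

    def refresh(self, obj_id, store):
        self.key_of[obj_id] = self.key_fn(store[obj_id])

class KIUT:
    """Key Index Update Trigger: attached to an object used in key
    evaluation; holds the index and a reference to the indexed object."""
    def __init__(self, index, indexed_obj_id):
        self.index = index
        self.indexed_obj_id = indexed_obj_id

    def fire(self, store):
        # Re-calculate the key for the indexed object after a modification
        # of any object participating in its evaluation.
        self.index.refresh(self.indexed_obj_id, store)

# Toy store: employee e1 works in department d1; key = department name.
store = {"d1": {"name": "Sales"}, "e1": {"worksIn": "d1"}}
idx = Index(lambda emp, s=store: s[emp["worksIn"]]["name"])
idx.refresh("e1", store)
trigger = KIUT(idx, "e1")          # attached to d1, used in key evaluation

store["d1"]["name"] = "Marketing"  # modify a key-contributing object
trigger.fire(store)                # the index entry is re-computed
```

In the real architecture the set of key-contributing objects is discovered by logging bindings during key evaluation; here the trigger placement is fixed by hand.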
In general, query optimisation consists in the analysis of a query syntax tree and
replacing its parts with index calls using selection predicates as call
parameters. The index optimiser is capable of processing both conjunctions and
disjunctions of predicates. The solution considers the established object model and the
properties of the SBQL query language, providing rules ensuring that optimisations preserve
query semantics. There are no restrictions concerning supported selection predicates or
the level of the transformed query, i.e. whether it addresses a local or a global schema.
Finally, no restrictions concern the selection of an index structure. The employed
indexing technique, i.e. linear hashing, is exemplary. However, it is essential that linear
hashing can be substituted with its scalable distributed equivalent (LH* SDDS) without
any changes to other elements of the presented indexing architecture.
Furthermore, this would enable the parallelisation of computation and increase the
concurrency of the index. The author has proposed a rank query optimisation
method efficient in a distributed environment, particularly when taking advantage of a
distributed index is possible.
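For completeness, the linear hashing technique mentioned above (bucket-at-a-time growth driven by a split pointer) can be sketched as follows – a simplified in-memory Python model, not the ODRA implementation:

```python
class LinearHash:
    def __init__(self, capacity=4):
        self.n0 = 2        # initial number of buckets
        self.level = 0     # doubling round
        self.split = 0     # next bucket to split
        self.capacity = capacity
        self.buckets = [[] for _ in range(self.n0)]

    def _addr(self, key):
        h = hash(key)
        a = h % (self.n0 * 2 ** self.level)
        if a < self.split:                       # already-split bucket:
            a = h % (self.n0 * 2 ** (self.level + 1))  # use finer hash
        return a

    def insert(self, key, value):
        a = self._addr(key)
        self.buckets[a].append((key, value))
        if len(self.buckets[a]) > self.capacity:
            self._split_next()

    def _split_next(self):
        # Split exactly one bucket: the one the split pointer designates.
        self.buckets.append([])
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1
            self.split = 0
        for k, v in old:                 # redistribute with the finer hash
            self.buckets[self._addr(k)].append((k, v))

    def lookup(self, key):
        return [v for k, v in self.buckets[self._addr(key)] if k == key]

lh = LinearHash()
for i in range(10):
    lh.insert(i, str(i))
assert lh.lookup(7) == ["7"]
```

Because only one bucket splits at a time, the table grows gradually without global reorganisation – the property that makes the LH* SDDS distributed variant a natural substitute.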
2. Evaluation of complex queries involving distributed heterogeneous resources
can be facilitated by techniques taking advantage of a transparent index
optimisation.
Heterogeneity introduces a higher level of complexity of the database architecture,
which makes employing global indexing difficult. External schemas can be imported
into an object-oriented database and their data can be processed transparently, i.e.
indistinguishably from purely object-oriented data. This approach has been applied in
the wrapper to a relational database developed for an OODBMS based on SBA and SBQL.
Furthermore, updateable object-oriented views, which are defined in SBQL, can
envelope and arbitrarily transform an external schema to build a top-level object-oriented schema.
The author’s volatile indexing technique addresses the environment depicted
above. It relies on a significant part of the developed indexing architecture, i.e. the index
optimiser and the management facilities, omitting the mechanisms responsible for maintaining
cohesion between indices and indexed data. In contrast to regular indices, volatile
indices are materialised at query evaluation time. The indexed results are
cached, making individual index calls very efficient. The improvement of query
performance depends on the number of times a volatile index is invoked during the
query evaluation. Therefore, this approach proves efficacious in the optimisation of complex
and laborious global queries.
The theses have been confirmed by the implementation of indices in
the ODRA prototype and by tests presenting the example optimisation gain for the majority
of the proposed solutions.
8.1 Future Work
The thesis is a significant contribution to the topic of transparent indexing in
distributed object-oriented databases. Nevertheless, there still exist many unexplored
directions of research in this domain. The presented solutions can be considered a
strong foundation for further work.
The current considerations lead to work on creating robust global indices for
horizontally fragmented homogeneous data. Vertical and mixed fragmentations involve
SBQL views constituting the global schema. In order to extend the capabilities of indexing
to such a model, the semantics of views should be taken into consideration. The important
research issues concern determining:
• a method of persisting inside an index the data made virtually available by views,
• rules which constrain the creation of such global indices.
A workaround to those problems, applicable to a specified family of queries, is
the author’s volatile indexing technique. Currently, control over this technique is
given to the administrator. Automating the creation of volatile indices by the
database engine would enable better adaptation of an index definition to a particular
query. This would require constructing algorithms for finding the parts of queries that
would gain from indexing, and analysing selection predicates to determine the best
combination of index keys. The author believes that research on this subject in the
context of SBA and SBQL would result in original query optimisation methods.
Another challenging subject is designing transparent global non-volatile indices
facilitating the processing of distributed and heterogeneous resources. The first problem
concerns identifying external data, e.g. relational tuples, within the index.
Consequently, a mechanism enabling fast materialisation of individual objects wrapping
external data should be provided. Finally, full transparency involves the development of an
architecture ensuring automatic updating of indices in response to external data
alterations. Solving those problems may require introducing special facilities for
registering external resources, e.g. a global object register, and extending the wrapping
mechanisms with additional functionality, e.g. update trigger support.
The efficacy of all future solutions should additionally be proved by tests. Therefore,
extending the indexing implementation for the ODRA prototype is an essential issue. The
closest works will involve support for distributed transactions.
Index of Figures
Fig. 2.1 Typical stages of high-level language query optimisation [29]...................... 20
Fig. 2.2 Example of a bucket split operation [72] .......................................................... 28
Fig. 2.3 Example object-relational schemata.................................................................. 42
Fig. 3.1 Example of an object-oriented database schema for a company....................... 52
Fig. 3.2 Sample store with classes and objects ............................................................... 54
Fig. 4.1 Index manager structure .................................................................................... 71
Fig. 4.2 Example Nonkey structure for Emp collection ................................................. 72
Fig. 4.3 Example Index Update Triggers generated for idxPerAge index ..................... 77
Fig. 4.4 Example Index Update Triggers generated for idxEmpWorkCity index........... 78
Fig. 4.5 Example Index Update Triggers generated for idxAddrStreet index ................ 78
Fig. 4.6 Automatic index updating architecture.............................................................. 78
Fig. 4.7 Calculating the idxPerAge index key value for i31 object ................................. 82
Fig. 4.8 Calculating the idxEmpWorkCity index key value before update ..................... 84
Fig. 4.9 Calculating the idxEmpWorkCity index key value after update ........................ 84
Fig. 4.10 Calculating the idxPerZip index key value before removing zip attribute...... 86
Fig. 4.11 Calculating the idxPerZip index key value without zip attribute .................... 86
Fig. 4.12 Calculating the idxPerZip index key value after inserting zip attribute .......... 87
Fig. 4.13 Calculating the idxEmpTotalIncomes index key value for i61 object before update .............. 88
Fig. 4.14 Calculating the idxEmpTotalIncomes index key value for i31 object before update .............. 89
Fig. 4.15 Last steps of computing the idxEmpTotalIncomes index key value for i31 after update .............. 90
Fig. 4.16 Example database schema for data integration ............................................... 99
Fig. 5.1 ODRA optimisation architecture [2] ............................................................... 103
Fig. 5.2 Schema of the index optimiser ........................................................................ 104
Fig. 5.3 Example optimisation applied by the index optimiser .................................... 105
Fig. 5.4 Index optimiser algorithm ............................................................................... 107
Fig. 5.5 Query optimisation with the index optimiser pre-processing.......................... 126
Fig. 6.1 Department’s location distribution .................................................................. 142
Fig. 6.2 Employee’s department distribution................................................................ 142
Fig. 6.3 Employee's salary range distribution............................................................... 143
Fig. 6.4 Female person’s first name distribution .......................................................... 143
Fig. 6.5 Female person’s surname distribution............................................................. 143
Fig. 6.6 Male person’s first name distribution.............................................................. 143
Fig. 6.7 Male person’s surname distribution ................................................................ 144
Fig. 6.8 Evaluation times and optimisation gain for Query 6.1.................................... 145
Fig. 6.9 Indices optimisation gain for Query 6.1 .......................................................... 146
Fig. 6.10 Evaluation times and optimisation gain for Query 6.2.................................. 147
Fig. 6.11 Indices optimisation gain for Query 6.2 ........................................................ 148
Fig. 6.12 Evaluation times and optimisation gain for Query 6.3.................................. 149
Fig. 6.13 Indices optimisation gain for Query 6.3 ........................................................ 150
Fig. 6.14 Evaluation times and optimisation gain for Query 6.4.................................. 150
Fig. 6.15 Evaluation times and optimisation gain for Query 6.5.................................. 151
Fig. 6.16 Indices optimisation gain for Query 6.5 ........................................................ 152
Fig. 7.1 Query suitable for applying a volatile index.................................................... 154
Fig. 7.2 Syntax tree of the example query .................................................................... 156
Fig. 7.3 Optimisation gain for volatile and regular indices .......................................... 157
Fig. 7.4 Query evaluation through the wrapper [64] .................................................... 159
Fig. 7.5 Example relational schema of company R ...................................................... 161
Fig. 7.6 Views object-oriented schema of company R ................................................. 161
Fig. 7.7 Evaluation times and optimisation gain for Query 7.1.................................... 162
Index of Tables
Tab. 3-1 Evaluation of traditional arithmetic operators.................................................. 57
Tab. 3-2 Evaluation of operators working on collections............................................... 57
Tab. 3-3 Evaluation of non-algebraic SBQL operators .................................................. 59
Tab. 3-4 Evaluation of auxiliary names defining operators............................................ 60
Tab. 3-5 Evaluation of sequences ranking operators...................................................... 60
Tab. 3-6 Evaluation of imperative operators .................................................................. 61
Tab. 5-1 Features of Rank Queries Evaluation Strategies ............................................ 139
Tab. 6-1 Optimisation testbench configuration ............................................................ 142
Bibliography
1. Adamus R., Habela P., Kaczmarski K., Lentner M., Stencel K., Subieta K.: Stack-Based Architecture and Stack-Based Query Language. ICOODB 2008, Berlin: http://www.odbms.org/download/030.02%20Subieta%20StackBased%20Architecture%20and%20StackBased%20Query%20Language%20March%202008.PDF
2. Adamus R., Kowalski T.M., Subieta K., et al: Overview of the Project ODRA.
Proceedings of the First International Conference on Object Databases,
ICOODB 2008, Berlin, ISBN 078-7399-412-9, pp. 179-197
3. Aguilera M. K., Golab W., Shah M. A.: A practical scalable distributed B-tree.
Proceedings of the VLDB Endowment, 1(1), pp. 598-609, 2008
4. Ali M. H., Saad A. A., Ismail M. A.: The PN-Tree: A Parallel and Distributed
Multidimensional Index. Distributed and Parallel Databases 17(2), pp. 111-133,
2005
5. Andrzejewski W., Królikowski Z., Masewicz M., Wrembel R.: Hidden Markov
Models as prediction mechanism for object oriented database systems with
hierarchical materialisation. II Krajowa Konferencja Naukowa “Technologie
Przetwarzania Danych” Poznań, September 2007 (in Polish)
6. Astrahan M. M.: System R: A relational approach to data management. ACM
Transactions on Database Systems, 1(2), pp. 97-137, June 1976
7. Basu J., Keller A. M., Pöss M.: Centralized versus Distributed Index Schemes in
OODBMS - A Performance Analysis. Proc. of ADBIS 1997, pp. 162-169
8. Bayer R., McCreight E.: Organization and maintenance of large ordered
indexes. Acta Inf. 1, 1972, 173-189
9. Bertino E.: Method precomputation in object-oriented databases. SIGOIS
Bulletin, 12 (2, 3), 1991, pp. 199-212
10. Bertino E. et al.: Indexing Techniques for Advanced Database Systems. Kluwer
Academic Publishers, Boston/Dordrecht/London, 1997
11. Bertino E., Catania B., Chiesa L.: Definition and Analysis of Index
Organizations for Object-Oriented Database Systems. Information Systems,
v.23 n.2, p.65-108, April 1, 1998
12. Bertino E., Foscoli P.: Index Organizations for Object-Oriented Database
Systems. IEEE Transactions on Knowledge and Data Engineering archive
Volume 7, Issue 2 (April 1995), pp. 193-209
13. Bębel B., Wrembel R.: Method Materialization Using the Hierarchical
Technique: Experimental Evaluation. Proc. of Joint Conference on Knowledge-Based Software Engineering (JCKBSE), Slovenia, 2002
14. Black P.E.: Dictionary of Algorithms and Data Structures [online], Paul E.
Black, ed., U.S. National Institute of Standards and Technology. 17 November
2008.: http://www.nist.gov/dads
15. Blasgen M. W., Casey R. G., Eswaran K. P.: An Encoding Method for Multifield
Sorting and Indexing. Communications of the ACM, Nov. 1977, p. 874.
16. Nam B., Sussman A.: DiST: Fully Decentralized Indexing for Querying
Distributed Multidimensional Datasets. Proceedings of the 20th IPDPS 2006,
IEEE 2006
17. Burleson D.: Turbocharge SQL with advanced Oracle9i indexing. March 26,
2002: http://www.dba-oracle.com/art_9i_indexing.htm
18. Cattell R.G.G., Barry D.K.(Eds.): The Object Data Standard: ODMG 3.0.
Morgan Kaufmann 2000
19. Cambazoglu B.B., Catal A., Aykanat C.: Effect of Inverted Index Partitioning
Schemes on Performance of Query Processing in Parallel Text Retrieval
Systems. ISCIS 2006, Istanbul, Turkey, pp. 717-725
20. Chaudhuri S.: An Overview of Query Optimization in Relational Systems.
Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium
on Principles of database systems, Seattle, Washington, United States, pp. 34-43, 1998
21. Chen Y., Chen Y.: Signature file hierarchies and signature graphs: a new index
method for object-oriented databases. Proc. of the SAC, pp. 724-728, 2004
22. Cook W.R., Rosenberger C.: Native Queries for Persistent Objects: A Design
White Paper. 2006:
http://www.db4o.com/about/productinformation/whitepapers/Native%20Queries
%20Whitepaper.pdf
23. Cormen T. H. et al.: Introduction to Algorithms, Second Edition. MIT Press and
McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 9: Medians and Order
Statistics, pp.183–196.
24. DB2: http://ibm.com/software/data/db2
25. db4o: http://www.db4o.com/
26. db4o Tutorial for Java. Production Release V6.3:
http://www.db4o.com/about/productinformation/resources/db4o-6.3-tutorialjava.pdf
27. Eder J., Frank H., Liebhart W.: Optimization of Object-Oriented Queries by
Inverse Methods. Proceedings of the 2nd International East/West Database
Workshop, September 1994, Klagenfurt, Austria, pp. 108-120
28. eGov-Bus, http://www.egov-bus.org/web/guest/home
29. Elmasri R., Navathe S. B.: Fundamentals of Database Systems. 4th Edition,
Pearson Education, Inc., publishing as Addison-Wesley, 2004, ISBN 0-321-12226-7
30. Fenk R., Markl V., Bayer R.: Interval Processing with the UB-Tree. Proc.
IDEAS Conf., IEEE Computer Society, 2002, pp. 12-22
31. Firebird: http://www.firebirdsql.org/
32. Gaede V., Günther O.: Multidimensional Access Methods, ACM Computing
Surveys, 30(2), pp. 170-231, June 1998
33. Garcia-Molina H., Ullman J.D., Widom J.: Database Systems: The Complete
Book. 1st edition, Pearson Education, Inc., publishing as Prentice Hall, 2002
34. Garcés-Erice L., et al.: Data Indexing in Peer-to-Peer DHT Networks.
Proceedings of the ICDCS 2004, pp. 200-208
35. GemFire Enterprise Developer’s Guide. Version 5.7, GemStone, September
2008:
http://www.gemstone.com/docs/5.7.0/product/docs/html/Manuals/wwhelp/wwhimpl/js/html/wwhelp.htm
36. GemStone Facets™ Programming Guide, Version 4.0, GemStone, June 2006:
http://www.facetsodb.com/downloads/facets/Programming.pdf
37. GemStone Systems, Inc.: http://www.gemstone.com/
38. Gnutella Protocol Development: http://rfc-gnutella.sourceforge.net/
39. Habela P., Kaczmarski K., Kozankiewicz H., Lentner M., Stencel K., Subieta
K.: Data-Intensive Grid Computing Based on Updateable Views. ICS PAS
Report 974, June 2004
40. Hadjieleftheriou M., Hoel E. G., Tsotras V. J.: SaIL: A Spatial Index Library for
Efficient Application Integration. GeoInformatica 9(4), pp. 367-389, 2005
41. Helmer S., Moerkotte G.: A performance study of four index structures for
set-valued attributes of low cardinality. VLDB Journal, 12(3): pp. 244-261,
October 2003
42. Henrich A.: P-OQL: an OQL-oriented query language for PCTE. In Proc. 7th
Conf. on Software Engineering Environments (SEE ’95), pages 48-60,
Noordwijkerhout, The Netherlands, 1995. IEEE
43. Henrich A.: The Update of Index Structures in Object-Oriented DBMS.
Proceedings of the Sixth International Conference on Information and
Knowledge Management (CIKM'97), Las Vegas, Nevada, November 10-14,
1997. ACM 1997, ISBN 0-89791-970-X: pp. 136-143
44. Hosain M. S., Newton M. A. H., Rahman M. M.: Dynamic Adaptation of
Multi-key Index for Distributed Database System. Proceedings of the 9th WSEAS
International Conference on Computers, Athens, Greece, July 2005.
45. Hosain M. S., Newton M. A. H.: Multi-Key Index for Distributed Database
System, International Journal of Software Engineering and Knowledge
Engineering, Vol. 15, No. 2, May 2005, pp. 433–438
46. Hwang D. J.: Function-based indexing for object-oriented databases. PhD
thesis, Massachusetts Institute of Technology, February 1994
47. H-PCTE: http://pi.informatik.uni-siegen.de/pi/hpcte/hpcte.html
48. IBM® DB2 Information Center. version 9.5, 6 August 2008:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/
49. IBM® Informix® Dynamic Server Information Center. version 11.50, 20 August
2008:
http://publib.boulder.ibm.com/infocenter/idshelp/v115/
50. IBM® Informix® Virtual-Index Interface, Programmer’s Manual. Version
11.50, SC23-9439-00, May 2008:
http://publibfp.boulder.ibm.com/epubs/pdf/c2394390.pdf
51. IBM® System i™ and i5/OS® Information Center. version 6, 1st edition, 2008:
http://publib.boulder.ibm.com/infocenter/systems/scope/i5os/index.jsp
52. Ilyas I.F., Aref W.G. et al.: Adaptive rank-aware query optimization in
relational databases. ACM Transactions on Database Systems (TODS), Vol. 31,
No. 4, pp. 1257-1304, December 2006
53. Informix: http://ibm.com/informix
54. Ioannidis Y. E.: Query Optimization, ACM Computing Surveys, symposium
issue on the 50th Anniversary of ACM, Vol. 28, No. 1, 1996, pp. 121-123
55. Jarke M., Koch J.: Query Optimization in Database Systems. ACM Computing
Surveys 16(2), 1984, pp. 111-152
56. Jodłowski A.: Dynamic Object Roles in Conceptual Modelling and Databases.
Ph.D. Thesis, The Institute of Computer Science, The Polish Academy of
Sciences, 2002
57. Kemper A., Kilger C., Moerkotte G.: Function Materialization in Object Bases:
Design, Realization and Evaluation. IEEE Transaction on Knowledge and Data
Engineering, Vol. 6, No. 4, August 1994, pp. 587-608
58. Kowalski T.M., Kuliberda K., Adamus R., Wiślicki J., Murlewski J.: Local and
Global Indexing Strategies and Data-Structures in Distributed Object-Oriented
Databases. SiS 2006 Proceedings, Łódź, Poland, 2006, pp. 153-156
59. Kowalski T.M., Wiślicki J., Kuliberda K., Adamus R., Subieta K.: Optimization
by Indices in ODRA. Proceedings of the First International Conference on
Object Databases, ICOODB 2008, Berlin, ISBN 078-7399-412-9, pp. 97-117
60. Kozankiewicz H., Leszczyłowski J., Subieta K.: Implementing Mediators
through Virtual Updateable Views. Engineering Federated Information Systems,
Proceedings of the 5th Workshop EFIS 2003, July 17-18 2003, UK, pp.52-62
61. Kozankiewicz H., Stencel K., Subieta K.: Integration of Heterogeneous
Resources through Updatable Views. Workshop on Emerging Technologies for
Next Generation GRID (ETNGRID-2004), June 2004, Proc. published by IEEE
62. Kroll B., Widmayer P.: Distributing a search tree among a growing number of
processors. In Proc. of ACM-SIGMOD, May 1994
63. Kuliberda K., Adamus R., Wiślicki J., Kaczmarski K., Kowalski T. M., Subieta
K.: Autonomous Layer for Data Integration in a Virtual Repository. 3rd
International Conference on Grid computing, high-performAnce and Distributed
Applications (GADA'06), France, Springer 2006 LNCS 4276, pp. 1290-1304
64. Kuliberda K., Meina M., Wiślicki J., Kowalski T.M., Adamus R., Kaczmarski
K., Subieta K.: On Distributed Data Processing in Data Grid Architecture for a
Virtual Repository. SiS 2008 Proceedings, Łódź, Poland, 2008 (to appear)
65. Kwan S. C., Strong H. R.: Index Path Length Evaluation for the Research
Storage System of System R. IBM Research Report RJ2736, San Jose, CA.,
January 1980
66. Lane P. et al.: Oracle® Database Data Warehousing Guide. 11g Release 1
(11.1), Part Number B28313-02, September 2007:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28313/toc.htm
67. Lee W.-C., Lee D. L.: Path dictionary: a new approach to query processing in
object-oriented databases. IEEE Transactions on Knowledge and Data
Engineering, Volume 10, Issue 3 (May/June 1998), pp. 371-388
68. Lentner M.: Integration of data and applications using virtual repositories. PhD
Thesis, PJIIT, Warszawa 2008
69. Li C., Chang K. C.-C., et al.: RankSQL: query algebra and optimization for
relational top-k queries. Proceedings of the 2005 ACM SIGMOD international
conference on Management of data, June 14-16, 2005, Baltimore, Maryland
70. Liebeherr J., Omiecinski E., Akyildiz I. F.: The Effect of Index Partitioning
Schemes on the Performance of Distributed Query Processing. IEEE
Transactions on Knowledge and Data Engineering archive, Volume 5, Issue 3,
1993, pp. 510-522
71. Liskov B. et al.: Safe and Efficient Sharing of Persistent Objects in Thor. In Proc.
of ACM SIGMOD International Conference on Management of Data, pages
318-329, Montreal, Canada, June 1996
72. Litwin W.: Linear Hashing: a new tool for file and table addressing. Reprinted
from VLDB-80 in Readings in Databases, 2nd ed., Morgan Kaufmann
Publishers, Inc., 1994, Stonebraker M. (Ed.)
73. Litwin W., Neimat M.-A., Schneider D. A.: LH*: linear hashing for distributed
files. In Proc. of ACM-SIGMOD, May 1993
74. Litwin W., Neimat M.-A., Schneider D. A.: LH*: A Scalable, Distributed
Data Structure. ACM Trans. Database Syst., 21(4), pp. 480-525, 1996
75. Litwin W., Neimat M.-A., Schneider D. A.: RP*: A family of order-preserving
scalable distributed data structures. In Proc. of VLDB, September 1994
76. Litwin W., Schwarz T. J. E.: LH*RS: A High-Availability Scalable Distributed
Data Structure using Reed Solomon Codes. SIGMOD Conference 2000, pp. 237-248
77. Luk F. H.-W., Fu A. W.: Triple-node hierarchies for object-oriented database
indexing. In Proceedings of the 7th international conference on Information and
knowledge management, ACM Press (1998), pp. 386-397
78. Łaski M.: Query optimisation in object-oriented databases on example of ODRA
system implementation. MSc thesis, Computer Engineering Department,
Technical University of Łódź, 2007 (in Polish)
79. Maier D., Stein J.: Indexing in an object-oriented DBMS. In Proceedings of the
1986 International Workshop on Object-Oriented Database Systems, IEEE
Computer Society Press, pp. 171-182
80. Masewicz M., Wrembel R., Jezierski J.: Optimising Performance of
Object-Oriented and Object-Relational Systems by Dynamic Method
Materialisation. Proc. of ADBIS 2005, Tallinn, Estonia
81. Milo T., Suciu D.: Index structures for path expressions. In Proc. of the 7th Int.
Conf. on Database Theory (ICDT’99), pp 277-295, 1999
82. Morales T. et al.: Oracle® Database VLDB and Partitioning Guide. 11g Release
1 (11.1), Part Number B32024-01, July 2007:
http://download.oracle.com/docs/cd/B28359_01/server.111/b32024/toc.htm
83. MySQL: http://www.mysql.com/
84. O’Neil P.E., Quass D.: Improved Query Performance with Variant Indexes.
Proceedings of SIGMOD, pp. 38-49, 1997
85. Objectivity: http://www.objectivity.com/
86. Objectivity for Java Programmer’s Guide. Part Number: 93-JAVAGD-0,
Release 9.3, October 13, 2006
87. Objectivity/SQL++. Part Number: 93-SQLPP-0, Release 9.3, October 9, 2006
88. ObjectStore: http://www.progress.com/objectstore/
89. ObjectStore Java API User Guide, ObjectStore. Release 7.1 for all platforms,
Progress Software Corporation, August 2008:
http://www.psdn.com/library/servlet/KbServlet/download/5894-10229715/osjiug.pdf
90. Olken F., Rotem D.: Simple Random Sampling for Relational Databases.
Proceedings of VLDB, pp. 160-169, 1986
91. Oracle: http://www.oracle.com/
92. Özsu M. T., Valduriez P.: Distributed and Parallel Database Systems. ACM
Computing Surveys, Volume 28(1), March 1996, pp. 125-128
93. Płodzień J.: Optimization Methods in Object Query Languages. PhD Thesis.
IPIPAN, Warszawa 2000
94. Płodzień J., Kraken A.: Object Query Optimization in the Stack-Based
Approach. Proc. ADBIS Conf., Springer LNCS 1691, 1999, pp. 303-316
95. Płodzień J., Kraken A.: Object Query Optimization through Detecting
Independent Subqueries. Information Systems, Pergamon Press, 2000
96. Płodzień J., Subieta K.: Applying Low-Level Query Optimization Techniques by
Rewriting. Proc. DEXA Conf., Springer LNCS 2113, 2001, pp. 867-876
97. Płodzień J., Subieta K.: Optimization of Object-Oriented Queries by Factoring
Out Independent Subqueries. Institute of Computer Science Polish Academy of
Sciences, Report 889, 1999
98. Płodzień J., Subieta K.: Query Processing in an Object Data Model with
Dynamic Roles. Proc. WSEAS Intl. Conf. on Automation and Information
(ICAI), Puerto de la Cruz, Spain, CD-ROM, ISBN: 960-8052-89-0, 2002
99. Płodzień J., Subieta K.: Query Optimization through Removing Dead
Subqueries. Proc. ADBIS Conf., Springer LNCS 2151, 2001, pp. 27-40
100. Płodzień J., Subieta K.: Static Analysis of Queries as a Tool for Static
Optimization. Proc. IDEAS Conf., IEEE Computer Society, 2001, pp. 117-122
101. Poosala V., Ioannidis Y.E.: Selectivity Estimation without the Attribute Value
Independence Assumption. Proceedings of VLDB, pp. 486-495, 1997
102. Ramakrishnan R.: Database Management Systems. WCB/McGraw-Hill, 1998
103. PostgreSQL: http://www.postgresql.org/
104. Ranjan R., Harwood A., Buyya R.: Peer-to-Peer Based Resource Discovery in
Global Grids: A Tutorial. IEEE Communications Surveys and Tutorials,
Volume 10, Number 2, pp: 6-33, ISSN: 1553-877X, USA, 2008.
105. Rao P., Moon B.: psiX: Hierarchical Distributed Index for Efficiently Locating
XML Data in Peer-to-Peer Networks. Technical Report 05-10, University of
Arizona, 2005
106. Sahri S., Litwin W., Schwartz T.: SD-SQL Server: a Scalable Distributed
Database System. CERIA Research Report, December 2005
107. Schoder D., Fischbach K.: Core Concepts in Peer-to-Peer (P2P) Networking.
In: Subramanian, R.; Goodman, B. (eds.): P2P Computing: The Evolution of a
Disruptive Technology, Idea Group Inc, Hershey. 2005
108. Shiela R. et al.: Oracle® Database Advanced Application Developer’s Guide.
11g Release 1 (11.1), Part Number B28424-03, August 2008:
http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28424/toc.htm
109. SQL Server: http://www.microsoft.com/sqlserver/
110. SQL Server 2008 Books Online. 2008:
http://msdn.microsoft.com/en-us/library/ms130214.aspx
111. Sreenath B., Seshadri S.: The hcC-tree: An Efficient Index Structure for Object
Oriented Databases. Proc. 20th VLDB Conf., Santiago de Chile, pp. 203-213,
1994
112. Stencel K.: Semi-strong Type Checking in Database Programming Languages.
(in Polish), PJIIT - Publishing House, Warszawa 2006, 207 pages
113. Stoica I., et al.: Chord: a scalable peer-to-peer lookup protocol for internet
applications. IEEE/ACM Transactions on Networking, Volume 11, Number 1,
pp. 17-32, 2003
114. Stolze K., Steinbach T.: DB2 Index Extensions by example and in detail. 2003:
http://www3.software.ibm.com/ibmdl/pub/software/dw/dm/db2/dm0312stolze/0312stolze.pdf
115. Strohm R. et al.: Oracle® Database Concepts. 11g Release 1 (11.1), Part
Number B28318-05, October 2008:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28318/toc.htm
116. Subieta K.: LOQIS: The Object-Oriented Database Programming System.
Proc. 1st Intl. East/West Database Workshop on Next Generation Information
System Technology, Kiev, USSR 1990, Springer Lecture Notes in Computer
Science, Vol. 504, pp. 403-421
117. Subieta K.: Stack-Based Approach (SBA) and Stack-Based Query Language
(SBQL). http://www.sbql.pl , 2008
118. Subieta K.: Theory and Construction of Object-Oriented Query Languages.
PJIIT - Publishing House, ISBN 83-89244-28-4, 2004, 522 pages (in Polish)
119. Subieta K. et al.: ODRA Manual. August 2008:
http://www.sbql.pl/various/ODRA/ODRA_manual.html
120. Subieta K., Kambayashi Y., Leszczyłowski J.: Procedures in Object-Oriented
Query Languages. Proc. 21st VLDB Conf., Zurich, pp. 182-193, 1995
121. Subieta K., Leszczyłowski J., Ulidowski I.: Processing Semi-Structured Data
in Object Bases. ICS PAS Report 852, February 1998
122. Subieta K., Płodzień J.: Object Views and Query Modification. (in) Databases
and Information Systems (eds. J. Barzdins, A. Caplinskas), Kluwer Academic
Publishers, pp. 3-14, 2001
123. Subieta K., Rzeczkowski W.: Query Optimization by Stored Queries.
Proceedings of VLDB, pp. 369-380, 1987
124. Taniar D., Rahayu J. W.: A Taxonomy of Indexing Schemes for Parallel
Database Systems. Distributed and Parallel Databases, Volume 12, Number 1,
Kluwer Academic Publishers, pp. 73-106, 2002
125. Tao Y., Papadias D., Sun J.: The TPR*-Tree: An Optimized Spatio-Temporal
Access Method for Predictive Queries. Proceedings of VLDB, Berlin, Germany,
pp. 790-801, 2003
126. VERSANT: http://www.versant.com/
127. VERSANT Database Fundamentals Manual. Release 7.0.1.0, July 2005:
http://www.versant.com/developer/resources/objectdatabase/documentation/database_fund_man.pdf
128. Wiślicki J.: An object-oriented wrapper to relational databases with query
optimisation. PhD Thesis, Technical University of Łódź, Łódź 2008
129. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R.: Integration of
Relational Resources in an Object-Oriented Data Grid. SiS 2006 Proceedings,
Łódź, Poland, 2006, pp. 277-280
130. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R.: Implementation of a
Relational-to-Object Data Wrapper Back-end for a Data Grid, SiS 2006
Proceedings, Łódź, Poland, 2006, pp. 285-288
131. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R.: Integration of relational
resources in an object-oriented data grid with an example. Journal of Applied
Computer Science (2006), Vol. 14 No. 2, Łódź, Poland, 2006, pp. 91-108
132. Wiślicki J., Kuliberda K., Kowalski T.M., Adamus R., Subieta K.:
Implementation and Testing of SBQL Object-Relational Wrapper Supporting
Query Optimisation. Proceedings of the First International Conference on Object
Databases, ICOODB 2008, Berlin, ISBN 078-7399-412-9, pp. 39-56
133. Wrembel R., Bębel B.: Oracle: Designing of Distributed Databases.
Wydawnictwo Helion, 2003 (in Polish)
134. Zobel J., Moffat A., Ramamohanarao K.: Inverted Files versus Signature Files
for Text Indexing. ACM Transactions on Database Systems, 23(4): pp. 453-490,
1998