STUDIA
INFORMATICA
Formerly: Zeszyty Naukowe Politechniki Śląskiej, seria INFORMATYKA
Quarterly
Volume 32, Number 3B (99)
Silesian University of Technology Press
Gliwice 2011
Nr kol. 1850
Editor in Chief
Dr. Marcin SKOWRONEK
Silesian University of Technology
Gliwice, Poland
Editorial Board
Dr. Mauro CISLAGHI
Project Automation
Monza, Italy
Prof. Olgierd A. PALUSINSKI
University of Arizona
Tucson, USA
Prof. Bernard COURTOIS
Lab. TIMA
Grenoble, France
Prof. Svetlana V. PROKOPCHINA
Scientific Research Institute BITIS
Sankt-Petersburg, Russia
Prof. Tadeusz CZACHÓRSKI
Gliwice, Poland
Prof. Karl REISS
Universität Karlsruhe
Karlsruhe, Germany
Prof. Jean-Michel FOURNEAU
Université de Versailles - St. Quentin
Versailles, France
Prof. Jean-Marc TOULOTTE
Université des Sciences et Technologies de Lille
Villeneuve d'Ascq, France
Prof. Jurij KOROSTIL
IPME NAN Ukraina
Kiev, Ukraine
Prof. Sarma B. K. VRUDHULA
University of Arizona
Tucson, USA
Dr. George P. KOWALCZYK
Networks Integrators Associates, President
Parkland, USA
Prof. Hamid VAKILZADIAN
University of Nebraska-Lincoln
Lincoln, USA
Prof. Stanisław KOZIELSKI
Gliwice, Poland
Prof. Stefan WĘGRZYN
Gliwice, Poland
Prof. Peter NEUMANN
Otto-von-Guericke Universität
Barleben, Germany
Prof. Adam WOLISZ
Technical University of Berlin
Berlin, Germany
STUDIA INFORMATICA is indexed in INSPEC/IEE (London, United Kingdom)
© Copyright by Silesian University of Technology Press, Gliwice 2011
PL ISSN 0208-7286, QUARTERLY
Printed in Poland
The paper version is the original version
ZESZYTY NAUKOWE POLITECHNIKI ŚLĄSKIEJ
EDITORIAL COMMITTEE
EDITOR IN CHIEF – Prof. dr hab. inż. Andrzej BUCHACZ
SECTION EDITOR – Dr inż. Marcin SKOWRONEK
EDITORIAL SECRETARY – Mgr Elżbieta LEŚKO
CONTENTS
1. Mirosław Błocho, Zbigniew J. Czech: An improved route minimization algorithm for the vehicle routing problem with time windows
2. Jan Sadolewski, Zbigniew Świder: Specification and verification of simple logic control programs using Frama C
3. Artur Siążnik, Bożena Małysiak-Mrozek, Dariusz Mrozek: Exploration of genetic data from GenBank using web services
4. Miłosz Góralczyk, Jarosław Koszela: Architecture of object database MUTDOD
5. Marcin Karpiński, Jarosław Koszela: Object oriented distribution in MUTDOD
6. Jarosław Koszela, Miłosz Góralczyk, Michał Jasiorowski, Marcin Karpiński, Emil Wróbel, Kamil Adamowski, Joanna Bryzek, Mariusz Budzyn, Michał Małek: Executive environment of distributed object database MUTDOD
7. Aleksandra Bieńkowska: Simulation researches of protocol for commiting mobile transactions
8. Jerzy Martyna: Machine learning for the identification of the DNA variations for diseases diagnosis
Mirosław BŁOCHO, Zbigniew J. CZECH
Silesian University of Technology, Institute of Informatics
AN IMPROVED ROUTE MINIMIZATION ALGORITHM FOR THE
VEHICLE ROUTING PROBLEM WITH TIME WINDOWS
Summary. A route minimization algorithm for the vehicle routing problem with
time windows is presented. It was developed as an improvement of the algorithm
proposed by Nagata and Bräysy (A powerful route minimization heuristic for the
vehicle routing problem with time windows, Operations Research Letters 37, 2009,
333-338). Using the improved algorithm, two new best solutions for
Gehring and Homberger’s (GH) benchmarks were found. The experiments showed
that the algorithm constructs the world-best solutions with the minimum route
numbers for the GH tests in a short time.
Keywords: vehicle routing problem with time windows, guided local search,
heuristics, approximation algorithms
1. Introduction
Reducing transportation costs has always been important for transport companies.
Both overestimating and underestimating the number of required vehicles cause problems.
Overestimating the number of routes leads to an ineffective allocation of funds, which
matters especially because vehicle maintenance is expensive. Underestimating it, in turn,
may leave some scheduled services uncovered and risks the loss of reputation and
customers. Therefore, a proper prediction of the minimum number of vehicles needed to
carry out the transportation tasks is becoming increasingly important. This problem occurs
not only in the distribution of products from a depot to customers but also in school bus
routing, newspaper and mail delivery, armoured car routing, rail distribution, airline fleet
routing, repairmen scheduling and many others.
The vehicle routing problem with time windows (VRPTW) is an extension to the
capacitated vehicle routing problem (CVRP) which is formulated in the following
manner [1]. There is a central depot of goods and n customers geographically scattered
around the depot. The locations of the customers (i = 1, 2, …, n) and the depot (i = 0), as well
as the shortest distances dij and the corresponding travel times tij between any two customers
i and j (including the depot) are known. Each customer asks for a quantity qi of goods which
has to be delivered by a vehicle of capacity Q. Each vehicle after serving a subset of
customers must return to the depot for reloading. Each route starts and terminates at the
depot. A solution to the CVRP is a set of routes of minimum travel distance (or travel time)
which visits each customer i exactly once. The total demand for each route cannot exceed Q.
The CVRP is extended into the VRPTW by introducing for each customer and the depot
a service time window [ei, fi] and a service time si (s0 = 0). The values ei and fi determine the
earliest and the latest start time of service, respectively. Each customer i has to be served
within the time window [ei, fi] and the service of all customers must be accomplished within
the time window of the depot [e0, f0]. The vehicle can arrive at the customer i before the
earliest start time of service ei, but then it has to wait until the service can begin.
If the vehicle arrives at the customer i after the latest start time of service fi, then the solution
is not feasible. The routes are travelled simultaneously by a fleet of K homogeneous vehicles
(i.e. of equal capacity) with each vehicle assigned to a single route. A feasible solution to the
VRPTW is the set of routes which guarantees the delivery of goods to all customers and
satisfies the time window and the vehicle capacity constraints. The primary objective is to
minimize the number of needed vehicles and the secondary objective is to minimize the total
travel distance.
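The feasibility conditions described above can be illustrated with a short sketch. It is not the authors' C++ implementation; the function name route_is_feasible and the list-based data layout (index 0 is the depot) are assumptions made only for this example.

```python
from typing import List, Sequence, Tuple


def route_is_feasible(
    route: Sequence[int],
    demand: Sequence[float],
    window: Sequence[Tuple[float, float]],  # (e_i, f_i) for every node, depot = 0
    service: Sequence[float],               # s_i, with s_0 = 0
    travel: Sequence[Sequence[float]],      # t_ij travel times
    capacity: float,                        # Q
) -> bool:
    """Check one route (depot implicit at both ends) against capacity and time windows."""
    # Capacity constraint: the total demand of the route cannot exceed Q.
    if sum(demand[c] for c in route) > capacity:
        return False

    start = window[0][0]  # the vehicle leaves the depot at e_0
    prev = 0
    for node in list(route) + [0]:  # visit the customers, then return to the depot
        arrival = start + service[prev] + travel[prev][node]
        earliest, latest = window[node]
        if arrival > latest:
            return False  # arrived after the latest start time of service
        start = max(arrival, earliest)  # wait if the vehicle is early
        prev = node
    return True


# Toy instance (hypothetical data): depot 0 and two customers.
travel = [[0, 10, 20], [10, 0, 5], [20, 5, 0]]
demand = [0, 4, 5]
window = [(0, 100), (15, 30), (0, 40)]
service = [0, 10, 10]
print(route_is_feasible([1, 2], demand, window, service, travel, capacity=10))
print(route_is_feasible([2, 1], demand, window, service, travel, capacity=10))
```

In this toy instance the order 1, 2 respects all windows, whereas the order 2, 1 reaches customer 1 after its latest start time.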
Formally, there are three types of decision variables in this two-objective optimization
problem. The first decision variable $x_{i,j,k}$, $i, j \in \{0, 1, \ldots, n\}$, $k \in \{1, 2, \ldots, K\}$, $i \neq j$, is 1 if
vehicle k travels from customer i to j, and 0 otherwise. The second decision variable, $t_i$,
indicates the time when a vehicle arrives at the customer i, and the third decision variable, $b_i$,
denotes the waiting time at this customer. The objective is to:

minimize $K$,    (1)

and then

minimize $\sum_{i=0}^{n} \sum_{j=0}^{n} \sum_{k=1}^{K} d_{i,j}\, x_{i,j,k}$    (2)

subject to the following constraints:

$\sum_{k=1}^{K} \sum_{j=1}^{n} x_{i,j,k} = K$, for $i = 0$,    (3)

$\sum_{j=1}^{n} x_{i,j,k} = \sum_{j=1}^{n} x_{j,i,k} = 1$, for $i = 0$ and $k \in \{1, 2, \ldots, K\}$,    (4)

$\sum_{k=1}^{K} \sum_{j=0,\, j \neq i}^{n} x_{i,j,k} = \sum_{k=1}^{K} \sum_{i=0,\, i \neq j}^{n} x_{i,j,k} = 1$, for $i, j \in \{1, 2, \ldots, n\}$,    (5)

$\sum_{i=1}^{n} q_i \sum_{j=0,\, j \neq i}^{n} x_{i,j,k} \leq Q$, for $k \in \{1, 2, \ldots, K\}$,    (6)

$\sum_{k=1}^{K} \sum_{i=0,\, i \neq j}^{n} x_{i,j,k}\, (t_i + b_i + h_i + t_{i,j}) \leq t_j$, for $j \in \{1, 2, \ldots, n\}$ and $t_0 = b_0 = h_0 = 0$,    (7)

$e_i \leq (t_i + b_i) \leq f_i$, for $i \in \{1, 2, \ldots, n\}$.    (8)

Formulas (1) and (2) identify the minimized functions. Eq. (3) specifies that there are
K routes starting at the depot. Eq. (4) denotes that every route starts and terminates at the
depot. Eq. (5) assures that every customer is visited only once by a single vehicle. Eq. (6)
denotes the capacity constraints. Eqs. (7)-(8) identify the time window constraints. Eqs. (3)-(8) define the feasible solutions to the VRPTW.
In this paper the improvements of the route minimization algorithm for the VRPTW by
Nagata and Bräysy [7] are presented. Like this heuristic, the improved algorithm is
based on the idea of the ejection pool, which originated in the heuristic by Lim and Zhang [5],
combined with guided local searches and a diversification strategy [12]. The powerful
insertion procedure, which temporarily accepts an infeasible solution and then attempts
to restore its feasibility, has also been included in the improved algorithm.
Moreover, additional modifications which speed up the local search strategies have been
introduced. The experimental tests indicated which steps of the algorithm needed
substantial modification in order to achieve better results. The remainder
of this paper is arranged as follows. In Section 2 the algorithm modifications are described.
Section 3 contains the discussion of the experimental test results. Section 4 concludes the
paper.
2. Route minimization algorithm
2.1. Algorithm description
The first version of the algorithm was implemented based upon the route minimization
heuristic for the vehicle routing problem with time windows [7]. The main framework of the
heuristic consists of the consecutive route elimination steps performed until the total
computation time reaches a specified time. These route elimination steps are included in the
RemoveRoute function [7], which is repeatedly called during the algorithm execution.
function RemoveRoute(φ)
begin
 1: choose a random route and remove it from the solution φ
 2: initialize Ejection Pool (EP) with the customers from the removed route
 3: initiate the penalty counters for all customers p[j] := 1 (j = 1,…,n)
 4: while EP ≠ Ø and currTime < maxTime do
 5:    select and eject the customer vins from EP using LIFO stack
 6:    if Ninsert(vins, φ) ≠ Ø then
 7:       φ := the new solution φ’ selected randomly from Ninsert(vins, φ)
 8:    else
 9:       φ := Squeeze(vins, φ)
10:    end if
11:    if vins is not inserted into φ then
12:       p[vins] := p[vins] + 1 (increase a penalty counter for the customer vins)
13:       select φ’ from Nej(vins, φ) with minimized Psum = p[vout(1)] + p[vout(2)] + … + p[vout(k)]
14:       update φ := φ’
15:       add the ejected customers: vout(1), vout(2), …, vout(k) to EP
16:       φ := Perturb(φ)
17:    end if
18: end while
19: if EP ≠ Ø then
20:    restore φ to the beginning solution
21: end if
22: return φ
end
The Ejection Pool (EP) holds the unserved customers, which come either from the removed
route (line 2) or from the list of customers ejected (line 15) while searching for the solution
with the minimum penalty sum. According to the proposition of Nagata and Bräysy [7], the
ejection pool should be organized as a LIFO (Last In First Out) stack so that the customers
which are hard to reinsert do not remain in the ejection pool. The penalty counters for all
customers are initialized each time the function RemoveRoute is called. These counters
indicate how many times the attempts to insert the given customers have failed. The larger
the value of the penalty counter for a given customer, the more difficult it is to reinsert this
customer into the solution. After the random route is removed from the current solution,
repeated attempts are made to include the ejected customers in the remaining routes. These
attempts are carried out until all customers from the Ejection Pool are inserted or the
execution time of the algorithm reaches the specified time limit maxTime.
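The bookkeeping described above (a LIFO pool plus per-customer penalty counters) might be organized as in the following minimal sketch. The class name EjectionPool and its method names are hypothetical and are not taken from [7] or from the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class EjectionPool:
    """LIFO pool of unserved customers together with the penalty counters p[j]."""

    def __init__(self, removed_route_customers: Iterable[int]) -> None:
        # Line 2 of RemoveRoute: the pool starts with the customers of the removed route.
        self._stack: List[int] = list(removed_route_customers)
        # Line 3: every penalty counter starts at 1.
        self.penalty: Dict[int, int] = defaultdict(lambda: 1)

    def pop(self) -> int:
        """Line 5: take the most recently added customer (LIFO order)."""
        return self._stack.pop()

    def push_ejected(self, customers: Iterable[int]) -> None:
        """Line 15: customers ejected from the solution go back to the pool."""
        self._stack.extend(customers)

    def record_failed_insertion(self, customer: int) -> None:
        """Line 12: increase the penalty counter after a failed insertion."""
        self.penalty[customer] += 1

    def __len__(self) -> int:
        return len(self._stack)


pool = EjectionPool([3, 5, 8])
v_ins = pool.pop()                   # customer 8, the last one added
pool.record_failed_insertion(v_ins)  # p[8] becomes 2
pool.push_ejected([2, 9])            # two customers were ejected to make room
```

Because the pool is a stack, the customers ejected most recently are tried first, while the counters record how often each customer has resisted insertion.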
Ninsert(vins, φ) is constructed as a set of the feasible partial solutions, that are obtained by
inserting the customer vins into all insertion positions in the solution φ. In this set only the
insertions between two consecutive nodes are considered. If the constructed Ninsert(vins, φ) set
is empty (line 6) then the function Squeeze is called in order to help the insertion of the
selected customer into the solution φ.
function Squeeze(vins, φ)
begin
 1: φ := the selected solution φ’ from Ninsert(vins, φ) with min Fp(φ’)
 2: while Fp(φ’) ≠ 0 do
 3:    randomly choose an infeasible route r
 4:    select a solution φ’ from Nr(φ), such that Fp(φ’) is minimum
 5:    if Fp(φ’) < Fp(φ) then
 6:       φ := φ’
 7:    else
 8:       break
 9:    end if
10: end while
11: if Fp(φ) ≠ 0 then
12:    restore φ to the beginning solution
13: end if
14: return φ
end
The idea of this method is to choose the temporarily infeasible insertion with the minimal
penalty function value Fp(φ) defined as follows:
$F_p(\varphi) = P_c(\varphi) + \alpha \cdot P_{tw}(\varphi)$,    (9)
where
Fp(φ) – the solution penalty function,
Pc(φ) – the capacity penalty (the sum of total excess of the demands in all routes [7]),
α – the penalty coefficient, whose value is adapted iteratively according to the Pc(φ)
and Ptw(φ) comparisons,
Ptw(φ) – the sum of the total time window penalties Ptw(i,φ), (i = 0,1,..,n) of all customers
and the depot in the solution φ [6]; Ptw(i,φ) is defined by:
where
$a_{v_i}$ – the earliest possible start time of service at customer $v_i$, defined recursively
by the following equations:
$a_{v_0} = e_0$    (11)
$a_{v_i} = \max\{\, a_{v_{i-1}} + s_{v_{i-1}} + c_{v_{i-1} v_i},\; e_{v_i} \,\}$, $i = 1, 2, \ldots, n+1$.    (12)
After that the local search moves are performed in order to restore the feasibility of the
solution. In the Squeeze function no customer ejections are allowed.
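As an illustration of Eq. (9), the sketch below evaluates the penalty of a set of routes. The per-customer time window penalty is assumed here to be the lateness max(a − f, 0) of the service start computed by Eqs. (11)-(12); this form is an assumption of the example, as are all function names and the data layout (index 0 is the depot).

```python
from typing import Sequence, Tuple


def capacity_penalty(routes, demand: Sequence[float], capacity: float) -> float:
    """P_c: total excess of demand over the capacity Q, summed over all routes."""
    return sum(max(sum(demand[c] for c in r) - capacity, 0) for r in routes)


def time_window_penalty(
    routes,
    window: Sequence[Tuple[float, float]],
    service: Sequence[float],
    travel: Sequence[Sequence[float]],
) -> float:
    """P_tw: summed lateness of the earliest service starts (assumed penalty form)."""
    total = 0.0
    for r in routes:
        start = window[0][0]  # Eq. (11): a_{v_0} = e_0
        prev = 0
        for node in list(r) + [0]:  # customers, then the return to the depot
            # Eq. (12): earliest possible start of service at the next node
            start = max(start + service[prev] + travel[prev][node], window[node][0])
            total += max(start - window[node][1], 0)  # assumed per-node lateness
            prev = node
    return total


def penalty(routes, demand, capacity, window, service, travel, alpha: float = 1.0) -> float:
    """Eq. (9): F_p = P_c + alpha * P_tw."""
    return capacity_penalty(routes, demand, capacity) + alpha * time_window_penalty(
        routes, window, service, travel
    )


# Toy instance (hypothetical data): depot 0 and two customers.
travel = [[0, 10, 20], [10, 0, 5], [20, 5, 0]]
demand = [0, 4, 5]
window = [(0, 100), (15, 30), (0, 40)]
service = [0, 10, 10]
print(penalty([[2, 1]], demand, 10, window, service, travel))
```

A solution is feasible exactly when this value is zero, which is the stopping test used by the Squeeze function.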
If the Squeeze function fails, the selected customer vins has not been inserted
into the solution φ. The value of the penalty counter of this customer is then
increased and, after that, the ejections of customers are tested. In these ejections a limit
kmax on the number of removed customers was introduced [7]. Before the search, the set
Nej(vins, φ) has to be constructed; it contains the possible solutions obtained by inserting
vins at different route positions while ejecting various combinations of customers. Only the
solutions with the minimum penalty sum value (line 13) are taken into account during the
local searches in the Perturb function. Choosing the solution with the minimum sum of
penalty counters gives the largest probability [7] of finding a feasible solution after the
local search moves.
2.2. Algorithm improvements
In the RemoveRoute function the Ejection Pool is initialized with the customers from the
randomly removed route (line 2). It was observed that if this function fails to insert the
customers from the removed route and fails to insert other customers from the removed
routes in the next RemoveRoute calls, then after some steps it generates again the same
solution which was previously considered. In such a case the customers are inserted into the
ejection pool exactly in the same order which may lead to a similar execution of the
algorithm as in the previous phase. For that reason it was proposed to insert the customers
from the removed route into the ejection pool always in a random order to obtain the
diversification of the search. This situation may happen very frequently while checking the
ejections for the solutions with the route number close to the minimum number of routes for
a given test case.
In the original algorithm [7] the attempts to insert the ejected customers into the solution
are carried out until all customers are inserted or the execution time of the algorithm reaches
a specified time limit maxTime. While analyzing the algorithm we observed that in some
cases the same customers were always ejected and the algorithm got stuck inside the while
loop (lines 4 to 18 in RemoveRoute) no matter how large the values of the penalty
counters for those customers were. In the original algorithm it is of course forbidden to eject
the just-inserted customer vins, which is especially important when more than two
customers are ejected. Based on these observations we suggest the following improvements
in the algorithm:
• ejecting the customers which have been inserted into the solution during the last lc loop
executions should be forbidden (recommended, experimentally established value:
lc = 4 or 5),
• the attempts to insert the customers from the ejection pool should be abandoned
(in addition to the original break conditions) if the loop count exceeds the specified limit
pmaxTrials (recommended pmaxTrials = 1000).
These algorithm improvements are vital while removing the routes from the solutions with
the routes count very close to the number of routes of the best known solutions.
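The first of these rules can be sketched as a small bookkeeping structure. The class RecentInsertions is hypothetical, and the exact boundary is one possible reading of the rule: a customer inserted in loop execution t is assumed to be ejectable again from execution t + lc onwards.

```python
from typing import Dict


class RecentInsertions:
    """Remembers when each customer was inserted, to protect it from ejection."""

    def __init__(self, lc: int = 5) -> None:
        self.lc = lc  # number of loop executions during which ejection is forbidden
        self._inserted_at: Dict[int, int] = {}

    def record(self, customer: int, iteration: int) -> None:
        """Call when `customer` is inserted during loop execution `iteration`."""
        self._inserted_at[customer] = iteration

    def may_eject(self, customer: int, iteration: int) -> bool:
        """True if `customer` was not inserted within the last lc loop executions."""
        inserted = self._inserted_at.get(customer)
        # Assumed boundary: protected for iterations inserted .. inserted + lc - 1.
        return inserted is None or iteration - inserted >= self.lc


recent = RecentInsertions(lc=5)
recent.record(customer=7, iteration=10)
print(recent.may_eject(7, iteration=12))  # still protected
print(recent.may_eject(7, iteration=15))  # protection has expired
```

Candidates for ejection would be filtered with may_eject before the penalty sums are compared, so a customer cannot bounce in and out of the solution on consecutive iterations.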
During the process of continuous insertions and ejections of the customers in the
RemoveRoute function, the ejection pool size usually increases and decreases alternately.
During experimental tests we observed that in many cases, if the ejection pool size exceeds
a certain limit, it is quite difficult to insert the customers from this pool into the
solution again. Therefore we propose an additional limit on the maximum ejection pool size,
EPsmax. This limit should be calculated each time a random route is removed from the
solution. The recommended formula for EPsmax is as follows:
EPsmax = rrs + rrx
(13)
where
rrs – the removed route size,
rrx – the coefficient which denotes the number of customers which may reside in the ejection
pool together with the customers from the removed route.
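A minimal sketch of how Eq. (13) and the pmaxTrials limit could act together as stopping conditions is given below. The function names are hypothetical, and treating an oversized ejection pool as a reason to abandon the current route removal is an assumption of this example; the default values are the recommended ones.

```python
def ejection_pool_limit(removed_route_size: int, rrx: int = 7) -> int:
    """Eq. (13): EPsmax = rrs + rrx."""
    return removed_route_size + rrx


def should_give_up(
    loop_count: int,
    pool_size: int,
    removed_route_size: int,
    rrx: int = 7,
    p_max_trials: int = 1000,
) -> bool:
    """True if the insertion attempts for the current removed route should stop."""
    too_many_trials = loop_count > p_max_trials
    pool_too_large = pool_size > ejection_pool_limit(removed_route_size, rrx)  # assumed rule
    return too_many_trials or pool_too_large


# A removed route with 12 customers allows at most 12 + 7 = 19 customers in the pool.
print(ejection_pool_limit(12))
print(should_give_up(loop_count=40, pool_size=20, removed_route_size=12))
```

These checks would be evaluated together with the original conditions of the while loop (an empty pool or an exhausted time limit).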
If the Squeeze function fails to insert the selected customer into the solution, then
customer ejections are tested in order to find a feasible solution. It was observed that
ejecting more than one customer is useful only if the route count of the current solution is
very close to the minimum number of routes for a given test case. If k (RemoveRoute
function, line 13) is set to 3, 4 or more, then testing the ejections takes too much
time. In many cases setting k = 1 is sufficient and allows feasible solutions to be found in
a much shorter time. We therefore suggest testing the ejections with k = 1 first;
if they fail, the ejections with k = 2 are tested, and so on, up to a specified maximum limit kmax.
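The staged search over k can be sketched as follows. This is an illustration only: the function select_ejection, the callback feasible_after and the toy data are hypothetical, and a real implementation would evaluate the ejections per route and per insertion position.

```python
from itertools import combinations
from typing import Callable, Dict, Optional, Sequence, Tuple


def select_ejection(
    candidates: Sequence[int],
    penalty: Dict[int, int],
    feasible_after: Callable[[Tuple[int, ...]], bool],
    k_max: int = 4,
) -> Optional[Tuple[int, ...]]:
    """Try k = 1, 2, ..., k_max and return the first feasible ejection set found,
    choosing the one with the smallest penalty sum among the sets of that size."""
    for k in range(1, k_max + 1):
        best, best_sum = None, None
        for combo in combinations(candidates, k):
            if not feasible_after(combo):
                continue  # inserting v_ins is still infeasible after this ejection
            p_sum = sum(penalty[c] for c in combo)
            if best_sum is None or p_sum < best_sum:
                best, best_sum = combo, p_sum
        if best is not None:
            return best  # larger k is never examined once a smaller k succeeds
    return None


# Toy example: an ejection works if it frees enough capacity for v_ins.
freed = {1: 3, 2: 1, 3: 2, 4: 1}
penalty = {1: 5, 2: 1, 3: 2, 4: 1}
needs = 4
choice = select_ejection(
    [1, 2, 3, 4], penalty, lambda combo: sum(freed[c] for c in combo) >= needs
)
print(choice)
```

In the toy example no single customer frees enough capacity, so the search moves on to pairs and returns a pair with the smallest penalty sum.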
The experiments indicated that constructing Ninsert(vins, φ) in the RemoveRoute function
takes much more time than expected. The set Ninsert(vins, φ) has to contain only the feasible
partial solutions. Therefore there is no need to construct a new solution object for each test
case while checking whether the partial solution is feasible or not. We suggest that it can be
checked in advance whether the potential solution (after inserting a new customer) will
remain feasible or not. For all tested customer insertions, the positions which give the feasible
solutions are recorded on a list. Then a random item from this list is chosen and a new
solution object is created together with all customers’ updates of the earliest arrival times, the
latest start times of service, etc. In order to check whether the solution will be feasible after
the customer insertion, the forward and backward time window penalty slacks described in
[8] were implemented. This feature makes it possible to calculate the change in the time
window penalty (for feasible solutions this change is equal to zero) in constant time,
which decreases the time needed to construct the Ninsert(vins, φ) set.
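The idea of checking an insertion in constant time can be sketched for the simpler case of a route that is already feasible. The code below is not the penalty-slack scheme of [8]; it is a reduced illustration that precomputes the earliest and latest service start times once per route and then tests each position with a constant number of operations. All names and the data layout (index 0 is the depot) are assumptions.

```python
from typing import List, Sequence, Tuple


def earliest_starts(route, window, service, travel) -> Tuple[List[int], List[float]]:
    """Forward pass: earliest service start at every node of depot, route, depot."""
    path = [0] + list(route) + [0]
    a = [window[0][0]]
    for prev, cur in zip(path, path[1:]):
        a.append(max(a[-1] + service[prev] + travel[prev][cur], window[cur][0]))
    return path, a


def latest_starts(path, window, service, travel) -> List[float]:
    """Backward pass: latest service start that keeps the rest of the path feasible."""
    z = [0.0] * len(path)
    z[-1] = window[0][1]
    for i in range(len(path) - 2, -1, -1):
        cur, nxt = path[i], path[i + 1]
        z[i] = min(window[cur][1], z[i + 1] - travel[cur][nxt] - service[cur])
    return z


def feasible_positions(route, v, demand, capacity, window, service, travel) -> List[int]:
    """Indices i such that route.insert(i, v) keeps a feasible route feasible."""
    if sum(demand[c] for c in route) + demand[v] > capacity:
        return []
    path, a = earliest_starts(route, window, service, travel)
    z = latest_starts(path, window, service, travel)
    positions = []
    for i in range(len(path) - 1):  # constant work per candidate position
        prev, nxt = path[i], path[i + 1]
        start_v = max(a[i] + service[prev] + travel[prev][v], window[v][0])
        if start_v <= window[v][1] and start_v + service[v] + travel[v][nxt] <= z[i + 1]:
            positions.append(i)
    return positions


# Toy instance (hypothetical data): depot 0, customers 1-3; customer 3 is inserted.
travel = [[0, 10, 20, 10], [10, 0, 5, 5], [20, 5, 0, 5], [10, 5, 5, 0]]
demand = [0, 4, 5, 1]
service = [0, 10, 10, 10]
window = [(0, 100), (15, 30), (0, 40), (0, 20)]
print(feasible_positions([1, 2], 3, demand, 10, window, service, travel))
```

No solution object is built during the scan; only the positions that pass the test are recorded, and a new solution is created for the position finally chosen.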
In a similar manner the implementation of finding the solution with the minimum penalty
value Fp(φ’) in the Squeeze function (line 1) was modified. Only one solution is required and
since the change in the time window penalty may be calculated in a constant time, the best
solution may be found very fast. This feature was implemented using the 2Opt*, OutRelocate
and Exchange local search moves [8] for the Squeeze as well as in the Perturb function for
the same operators but with accepting only the feasible solutions. If more than one solution
with a minimum penalty is found, it is suggested to record all solutions with the current
minimum penalty (during the consecutive customer insertions while constructing
Ninsert(vins , φ) set) and then to choose randomly one of them.
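The tie-breaking rule suggested above can be sketched in a few lines. The function pick_min_penalty and its parameters are hypothetical; any candidate type and penalty function can be plugged in.

```python
import random
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def pick_min_penalty(
    candidates: Iterable[T],
    penalty_of: Callable[[T], float],
    rng: random.Random = random.Random(),
) -> Optional[T]:
    """Scan the candidates once, keep all those with the current minimum penalty,
    and return one of them chosen uniformly at random."""
    best = None
    ties = []
    for cand in candidates:
        value = penalty_of(cand)
        if best is None or value < best:
            best, ties = value, [cand]  # a strictly better penalty resets the list
        elif value == best:
            ties.append(cand)           # another candidate with the same minimum
    return rng.choice(ties) if ties else None


penalties = {"a": 3, "b": 1, "c": 1, "d": 2}
chosen = pick_min_penalty(penalties, penalties.get, rng=random.Random(0))
print(chosen)  # either "b" or "c"
```

The random choice among equally good candidates is a cheap source of diversification, in the same spirit as the random insertion order of the ejection pool.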
The experimental tests of finding the solution from Nej(vins, φ) with the minimum penalty
sum of the ejected customers indicated that changes similar to those made in the
construction of Ninsert(vins, φ) are needed here as well. The solutions constructed
after inserting the customer vins and ejecting particular customers must be partially
feasible. It is proposed to test first the backward and forward time window penalty slacks as
well as the changes in the route demands. This makes it possible to check very quickly
whether the changed solution is feasible or not.
3. Analysis of the experimental results
The improved algorithm was implemented in C++ and tested with the following settings:
maxTime = 240 – the maximum time (in seconds) for a single test,
α = 1 – the initial penalty coefficient,
lc = 5 – the number of most recent loop executions during which the inserted customers
are not allowed to be ejected,
pmaxTrials = 1000 – the loop count limit in the RemoveRoute function,
rrx = 7 – the coefficient for the maximum ejection pool size,
kmax = 4 – the maximum number of ejected customers,
Irand = 400 – the Perturb function execution count,
npSq = 60 – the percentage of close customers for the Squeeze function.
The algorithm was executed on an Intel Core 2 Duo 2.4 GHz processor with 2 GB of RAM.
Figures 1-5 show the Gehring and Homberger’s test results for 200, 400, 600, 800 and
1000 customers for all problem instances. The achieved test results are compared with the
best known (world) results based on the cumulative number of vehicles (CVN).
Fig. 1. Test Results – 200 Customers (bar chart of the CVN for the groups C1, C2, R1, R2, RC1, RC2; World Result vs. Achieved Result)
Fig. 2. Test Results – 400 Customers (bar chart of the CVN for the groups C1, C2, R1, R2, RC1, RC2; World Result vs. Achieved Result)
In Table 1 the results obtained with the improved algorithm (MBL) are compared with
the results known from the literature:
• GH (Gehring and Homberger [2])
• IBA (Ibaraki et al. [3])
• LZ (Lim and Zhang [5])
• GD (Gagnon and Desaulniers [10])
• PR (Pisinger and Ropke [9])
The results are compared based on the number of routes only. The route minimization
step is carried out independently in all those algorithms, so the comparison is well-founded.
The CVNs, computer specifications and average CPU time together with the number of runs
are listed.
Fig. 3. Test Results – 600 Customers (bar chart of the CVN for the groups C1, C2, R1, R2, RC1, RC2; World Result vs. Achieved Result)
Fig. 4. Test Results – 800 Customers (bar chart of the CVN for the groups C1, C2, R1, R2, RC1, RC2; World Result vs. Achieved Result)
Fig. 5. Test Results – 1000 Customers (bar chart of the CVN for the groups C1, C2, R1, R2, RC1, RC2; World Result vs. Achieved Result)
The obtained results demonstrate the efficiency of the improved algorithm.
The maximum execution time of the algorithm was set to 4 minutes and within this limit the
algorithm found the best known (world) minimum route number in 96% of the test cases (Table 3).
Compared to the results of other well-known algorithms, the results in Table 1 show a similar,
competitive cumulative number of vehicles obtained in a much shorter time,
especially for 400, 600 and 1000 customers. Moreover, the real
average times needed to obtain those results are gathered in Table 2 and show that in most
cases the 4-minute limit was not necessary. The real average time for all 300 Gehring and
Homberger’s tests was only 25 seconds.
Furthermore, during the experiments two new world-best results were found: for test
RC2_10_1 (1000 customers) the solution with 20 routes and for test C1_8_2 (800 customers)
the solution with 73 routes. These results have been published on the Sintef website:
http://www.sintef.no/Projectweb/TOP/Problems/VRPTW/Homberger-benchmark/.
Table 1
The results for all problem sizes – compared to other algorithms GH, IBA, LZ, GD, PR

200 customers     GH         IBA        LZ         GD         PR         MBL
C1                189        189        189        189        189        188
C2                60         60         60         60         60         60
R1                182        182        182        182        182        182
R2                40         40         40         40         40         40
RC1               181        180        180        180        180        180
RC2               44         43         43         43         43         43
Total CVN         696        694        694        694        694        693
CPU               P 400M     P 2.8G     P 2.8G     O 2.3G     P 3.0G     P 2.4G
(min.) x runs     8.4 x 3    N/A        10 x 2     53 x 5     7.7 x 10   4 x 1

400 customers     GH         IBA        LZ         GD         PR         MBL
C1                380        377        376        376        376        377
C2                120        120        117        119        120        120
R1                364        364        364        364        364        364
R2                80         80         80         80         80         80
RC1               361        360        360        360        360        360
RC2               88         86         85         86         85         85
Total CVN         1392       1387       1382       1385       1385       1386
CPU               P 400M     P 2.8G     P 2.8G     O 2.3G     P 3.0G     P 2.4G
(min.) x runs     28.4 x 3   N/A        20 x 4     89 x 5     15.8 x 5   4 x 1

600 customers     GH         IBA        LZ         GD         PR         MBL
C1                577        575        574        574        575        576
C2                178        174        174        175        175        175
R1                545        545        545        545        545        545
R2                110        110        110        110        110        110
RC1               550        550        550        550        550        550
RC2               119        116        115        117        116        116
Total CVN         2079       2070       2068       2071       2071       2072
CPU               P 400M     P 2.8G     P 2.8G     O 2.3G     P 3.0G     P 2.4G
(min.) x runs     51.6 x 3   N/A        30 x 6     105 x 5    18.3 x 5   4 x 1

800 customers     GH         IBA        LZ         GD         PR         MBL
C1                761        757        754        754        756        755
C2                237        234        234        235        237        235
R1                728        728        728        728        728        728
R2                150        150        150        150        150        150
RC1               723        724        720        720        730        720
RC2               161        157        156        158        157        157
Total CVN         2760       2750       2742       2745       2758       2745
CPU               P 400M     P 2.8G     P 2.8G     O 2.3G     P 3.0G     P 2.4G
(min.) x runs     92.8 x 3   N/A        40 x 8     129 x 5    22.7 x 5   4 x 1

1000 customers    GH         IBA        LZ         GD         PR         MBL
C1                954        945        944        943        946        944
C2                297        294        293        295        297        295
R1                919        919        919        919        922        919
R2                190        190        190        190        190        190
RC1               901        900        900        900        900        900
RC2               185        183        183        185        183        184
Total CVN         3446       3431       3429       3432       3438       3432
CPU               P 400M     P 2.8G     P 2.8G     O 2.3G     P 3.0G     P 2.4G
(min.) x runs     120.4 x 3  N/A        50 x 10    162 x 5    26.2 x 5   4 x 1
Table 2
Comparison of average times needed to find test solutions

        200      400      600      800      1000     Avg
C1      <1 s     11 s     16 s     56 s     95 s     36 s
C2      <1 s     2 s      37 s     115 s    130 s    57 s
R1      <1 s     <1 s     2 s      <1 s     1 s      1 s
R2      1 s      1 s      1 s      2 s      2 s      1 s
RC1     <1 s     1 s      1 s      29 s     1 s      6 s
RC2     1 s      38 s     66 s     73 s     73 s     50 s
Avg     1 s      9 s      21 s     46 s     50 s     25 s

Table 3
Compared percentage of found solutions with minimal world-best number of routes

        200            400            600            800            1000           Avg
C1      100 %          90 % (9/10)    80 % (8/10)    80 % (8/10)    80 % (8/10)    86 % (43/50)
C2      100 %          70 % (7/10)    100 %          90 % (9/10)    90 % (9/10)    90 % (45/50)
R1      90 % (9/10)    90 % (9/10)    100 %          90 % (9/10)    100 %          94 % (47/50)
R2      100 %          100 %          100 %          100 %          100 %          100 %
RC1     100 %          100 %          100 %          100 %          100 %          100 %
RC2     100 %          100 %          100 %          100 %          90 % (9/10)    98 % (49/50)
Avg     98 % (59/60)   92 % (55/60)   97 % (58/60)   93 % (56/60)   93 % (56/60)   96 % (288/300)
4. Conclusions
The improved algorithm proved very competitive with respect to other well-known
heuristics solving the VRPTW. The main advantage of the proposed algorithm is the short time
needed to obtain solutions whose number of routes is equal to, or only slightly worse than,
that of the best known solutions. Therefore the improved algorithm can be used
as the first stage, minimizing the route count, in other heuristic algorithms.
It would be interesting to examine whether the parallelization of the improved route
minimization algorithm could give better results while solving the VRPTW. Our further
research concentrates on this topic and the results are very promising, as the parallel heuristic
has been able to discover new best solutions to Gehring and Homberger’s benchmarks.
BIBLIOGRAPHY
1. Czech Z. J.: A Parallel Simulated Annealing Algorithm as a Tool for Fitness Landscapes Exploration. Parallel and Distributed Computing, InTech, 2010, chapter 13, p. 247–271.
2. Gehring H., Homberger J.: Parallelization of a two-phase metaheuristic for routing problems with time windows. Asia-Pacific Journal of Operational Research 18, 2001, p. 35–47.
3. Ibaraki T., Imahori S., Nonobe K., Sobue K., Uno T., Yagiura M.: An iterated local search algorithm for the vehicle routing problem with convex time penalty functions. Discrete Applied Mathematics 156, 2008, p. 2050–2069.
4. Lenstra J., Rinnooy Kan A.: Complexity of vehicle routing and scheduling problems. Networks 11, 1981, p. 221–227.
5. Lim A., Zhang X.: A two-stage heuristic with ejection pools and generalized ejection chains for the vehicle routing problem with time windows. Informs Journal on Computing 19, 2007, p. 443–457.
6. Nagata Y.: Efficient evolutionary algorithm for the vehicle routing problem with time windows: Edge assembly crossover for the VRPTW. Proc. of the 2007 Congress on Evolutionary Computation, 2007, p. 1175–1182.
7. Nagata Y., Bräysy O.: A Powerful Route Minimization Heuristic for the Vehicle Routing Problem with Time Windows. Operations Research Letters 37, 2009, p. 333–338.
8. Nagata Y., Bräysy O., Dullaert W.: A Penalty-based edge assembly memetic algorithm for the vehicle routing problem with time windows. Computers & Operations Research 37, 2010, p. 724–737.
9. Pisinger D., Ropke S.: A general heuristic for vehicle routing problems. Computers & Operations Research 34, 2007, p. 2403–2435.
10. Prescott-Gagnon E., Desaulniers G., Rousseau L.-M.: A branch-and-price-based large neighborhood search algorithm for the vehicle routing problem with time windows. Working Paper, University of Montreal, Canada, 2007.
11. Toth P., Vigo D. (Eds.): The vehicle routing problem. SIAM Monographs on Discrete Mathematics and Applications, Philadelphia, PA, 2002.
12. Voudouris C., Tsang E.: Guided local search. Handbook of Metaheuristics, Kluwer, 2003, p. 185–218.
Reviewer: Dr hab. Urszula Boryczka
Received by the Editorial Office on 16 December 2010.
Overview
The paper presents an improved route minimization algorithm for the vehicle routing problem with time windows. The algorithm is based on a modified heuristic proposed by Nagata and Bräysy [7]. In the first phase, customers are removed from a randomly selected route in order to minimize the number of routes in the solution; in the second phase, multi-stage attempts are made to insert the removed customers into the solution being constructed. The insertion attempts use an ejection pool of removed customers together with guided local searches and solution diversification. An essential feature of the algorithm is the temporary acceptance of solutions that violate the maximum vehicle capacity constraint or the time window constraints of individual customers.
The improved algorithm was developed on the basis of experimental tests of an implementation of the heuristic given in [7]. The most important modifications are:
• inserting customers into the ejection pool in random order,
• forbidding the ejection of customers that were inserted into the current solution during a given number of the most recent executions of the main iteration of the algorithm,
• an additional limit on the maximum capacity of the ejection pool,
• an improved look-ahead function that checks whether potential changes to the solution will yield the desired result.
Tests of the improved algorithm on Gehring and Homberger's benchmarks showed that it is highly competitive with other well-known algorithms for solving the VRPTW [2, 3, 5, 9, 10]. A decisive advantage of the improved algorithm is the short time needed to find solutions whose number of routes is equal to or slightly greater than the best known results for the benchmarks. In 96% of the test cases the number of routes equalled the best known results, with an average real running time of 25 seconds per test. For two test cases (RC2_10_1 and C1_8_2), solutions with fewer routes than the best known results were obtained.
Addresses
Mirosław BŁOCHO: Silesian University of Technology, Institute of Informatics, ul. Akademicka 16,
44-100 Gliwice, Poland, [email protected]
Zbigniew J. CZECH: Silesian University of Technology, Institute of Informatics, ul. Akademicka 16,
44-100 Gliwice, Poland, [email protected]
Jan SADOLEWSKI, Zbigniew ŚWIDER
Rzeszów University of Technology, Department of Computer and Control Engineering
SPECIFICATION AND VERIFICATION OF SIMPLE LOGIC
CONTROL PROGRAMS USING FRAMA C¹
Summary. The paper presents an approach to the verification of simple logic
control programs written in ANSI C. The software is verified with open source
tools: Frama C, Jessie and Coq. The process of writing a specification and
verifying that the implementation conforms to it is demonstrated on
several examples involving combinatorial logic, sequential logic and sequential logic
with time constraints.
Keywords: software verification, Frama C, control programs
SPECIFICATION AND VERIFICATION OF SIMPLE LOGIC
CONTROL PROGRAMS USING FRAMA C
Abstract. The article proposes a verification process for simple logic control programs
written in ANSI C. The programs are verified with publicly available tools
such as Frama C, Jessie and Coq. The process of defining the specification
and verifying that the implementation conforms to it is presented on several examples of combinational, sequential, and time-dependent sequential circuits.
Keywords: program verification, Frama C, control systems
1. Introduction
Control systems must be reliable and equipped with correct software. Small embedded
systems are usually programmed in ANSI C. Although the programs are usually written by
experienced programmers, the final code may contain mistakes and side effects. Software tests
can reveal most mistakes, but never give assurance that the code is completely error free.
Formal verification of compliance between specification and implementation can help to find
the remaining mistakes and side effects.
¹ This research has been partially supported by MNiSzW under the grant N N516 415638 (2010-2011), and
partially by the European Union under the European Social Fund.
The paper presents a verification process for simple control programs and function blocks
written in ANSI C. It employs open source tools for code analysis: Frama C [5] with the
Jessie plug-in [9], and Coq [3]. Frama C supports C code analysis with specification
annotations written in the ACSL language [2]. It contains an internal module for static value
analysis of the whole program (or its main part), but this module is too weak to prove the
correctness of even simple programs. That weakness motivates the use of the Jessie plug-in, which generates
condition lemmas based on Dijkstra's preconditions [4]. The lemmas consist of ensures proof
obligations and safety proof obligations. Ensures proof obligations guarantee program
correctness, while safety proof obligations assure that variable overflow does not occur at
run time. The lemmas can be proved semi-automatically with Coq.
This research extends previous work [10] by replacing the somewhat obsolete
Caduceus with Frama C. It is also related to [1, 7], but uses Coq as a back-end prover rather
than as a modeling tool. The paper is divided into three parts, each focusing on examples:
combinatorial logic (a binary multiplexer and temperature control), sequential logic (an RS
flip flop and detection of truck movement), and sequential logic with time constraints (a
blinking LED and a cargo lift). The examples demonstrate that even for simple programs
serious effort is required to prove correctness formally.
2. Combinatorial logic
A specification of combinatorial logic can be given as output formulas, as text, or as truth
tables. The examples presented here use the first and the second form. A truth table can be
converted to a minimal formula using Karnaugh maps [6].
2.1. Binary multiplexer
The first example involves the binary multiplexer of Fig. 1, with three inputs and one output.
The s input switches the output y between the x0 and x1 inputs. The specification formula is shown in
(1), followed by the C program in listing 1 (function). The specification written as a code annotation
for Frama C is in listing 2.
$y = (s \land x_0) \lor (\lnot s \land x_1)$ (1)
// ----- Listing 1 -----
char mux(char s, char x0, char x1)
{
  char y;
  if(s == 1)
    y = x0;
  else
    y = x1;
  return y;
}
// ----- Listing 2 -----
requires (s==0 || s==1) && (x0==0 || x0==1) && (x1==0 || x1==1);
ensures (\result <==> (s && x0) || (!s && x1));
Fig. 1. Binary multiplexer
The specification involves two clauses, requires and ensures. The first one expresses
constraints on input values at the function call (preconditions). The other expresses constraints
which must be satisfied upon the function return (postconditions). The variable \result
represents the value returned by the function, i.e. the multiplexer output. Formula (1) is inserted
into the ensures clause.
Using the Jessie plug-in available in Frama C, one can check whether the program satisfies
its postconditions when the preconditions are met. The Jessie module generates a file with obligation
lemmas, which can be proved using Coq. For the multiplexer, four ensures proof obligations
and two safety proof obligations are obtained.
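Because the multiplexer has only eight admissible input combinations, formula (1) can also be cross-checked by exhaustive testing before attempting the proof. The following is a minimal sketch of such a check, not part of the original verification flow; the helper names mux_spec and mux_count_mismatches are hypothetical, and the assert merely mirrors the requires clause of listing 2 at run time.

```c
#include <assert.h>

/* Multiplexer function, as in listing 1. */
char mux(char s, char x0, char x1)
{
    char y;
    if (s == 1)
        y = x0;
    else
        y = x1;
    return y;
}

/* Formula (1): y = (s AND x0) OR (NOT s AND x1).
   The assert mirrors the requires clause of listing 2. */
static int mux_spec(int s, int x0, int x1)
{
    assert((s == 0 || s == 1) && (x0 == 0 || x0 == 1) && (x1 == 0 || x1 == 1));
    return (s && x0) || (!s && x1);
}

/* Hypothetical helper (not from the paper): counts the binary input
   combinations, out of 8, for which mux() disagrees with formula (1). */
int mux_count_mismatches(void)
{
    int s, x0, x1, mismatches = 0;
    for (s = 0; s <= 1; s++)
        for (x0 = 0; x0 <= 1; x0++)
            for (x1 = 0; x1 <= 1; x1++)
                if (mux((char)s, (char)x0, (char)x1) != mux_spec(s, x0, x1))
                    mismatches++;
    return mismatches;
}
```

A result of zero mismatches only confirms the eight tested cases; unlike the Jessie safety obligations, it says nothing about overflow or about inputs outside the precondition.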
a) context: s: int8; H4: integer_of_int8 s = 0 -> False; goal: False
b) context: x0: int8; y: int8; HW_6: y = x0; goal: integer_of_int8 x0 = 0
c) context: y: int8; H2: integer_of_int8 y = 0; goal: integer_of_int8 y = 0
Fig. 2. Examples of goals and contexts
All Coq proofs implemented here begin with the intuition tactic, which splits the given lemma
into simpler forms. The splitting generates additional hypotheses collected into a space called
the context. In the next step, the user has to choose which hypotheses to apply for further proof
using either the apply _, rewrite _ or assumption tactic. The _ character (underscore) must be
replaced by the user with one of the hypotheses from the context. Apply is used when the
conclusion of the hypothesis is equal to the current goal (Fig. 2a), rewrite when the hypothesis
and the goal are both equalities (Fig. 2b), and assumption when the whole goal is equal to a
hypothesis (Fig. 2c). Assumption is a part of intuition, so using assumption immediately
after intuition cannot close any remaining goal. The presented tactics prove all lemmas belonging to the ensures
proof obligations group. Safety proof obligations are proved instantly after the initial
intuition tactic.
2.2. Heater control
Temperature is monitored by a thermometer with contacts a, b, c, d (Fig. 3a). Heaters H1, H2
are turned on by the relays r1, r2, r3 (Fig. 3b) according to the following rules:
1. t < ta – H1, H2 in parallel,
2. ta ≤ t < tb – H1 only,
3. tb ≤ t < tc – H2 only,
4. tc ≤ t < td – H1 and H2 in series,
5. t ≥ td – both switched off.
Fig. 3. Heater control system (panels a and b)
Fig. 4. Input and output table
$r_1 = \lnot b, \quad r_2 = \lnot a \lor (b \land \lnot d), \quad r_3 = c$ (2)
The control program can be written using either the truth table (Fig. 4) or the explicit formulae (2)
derived with Karnaugh maps. Here the truth table is used for the implementation (listing 3) and (2)
for the specification (listing 4). The requires clause specifies the admissible sets of signal values
(none of the contacts faulty). Assigns indicates the global variables modified by the function body
(necessary for Jessie), and ensures contains formulas (2) with the last one written as an implication.
The implication is necessary because the Karnaugh map leaves some values undetermined.
// ----- Listing 3 -----
void termometr()
{
  if(!a) { r1 = 1; r2 = 1; r3 = 0; }
  else {
    if(!b) { r1 = 1; r2 = 0; r3 = 0; }
    else {
      if(!c) { r1 = 0; r2 = 1; r3 = 0; }
      else {
        if(!d) { r1 = 0; r2 = 1; r3 = 1; }
        else   { r1 = 0; r2 = 0; r3 = 0; }
      }
    }
  }
}
// ----- Listing 4 -----
requires (a==0 && b==0 && c==0 && d==0) || (a==1 && b==0 && c==0 && d==0) ||
         (a==1 && b==1 && c==0 && d==0) || (a==1 && b==1 && c==1 && d==0) ||
         (a==1 && b==1 && c==1 && d==1);
assigns r1, r2, r3;
ensures (!b <==> r1) && ((!a || (b && !d)) <==> r2) && (r3 ==> c);
Here, the file generated by the Jessie module contains twenty two (22) ensures proof
obligation lemmas. No safety proof obligations are generated because all assignments in the
code are constant. The lemmas can be proved by one of the following tactics:
1. intuition,
2. Sequence of intros, rewrite _ in _, contradiction,
3. Sequence of intros, rewrite _, omega.
In some cases the 2nd and 3rd tactic require additional left, right and decompose [and] in _.
The first two are used when the goal is a disjunction, and decompose when a context
hypothesis is a conjunction.
The tactics presented here can be generalized to other combinatorial examples.
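The relation between the truth-table implementation and formulae (2) can be illustrated by running the function on the five contact states admitted by the requires clause and evaluating the ensures expression of listing 4 as an ordinary C condition. This is only a sketch of a run-time cross-check, assuming char globals as in the listings; the helper heater_count_violations is hypothetical.

```c
#include <assert.h>

/* Contacts and relays as char globals, as assumed in listings 3 and 4. */
char a, b, c, d;
char r1, r2, r3;

/* Truth-table implementation, as in listing 3. */
void termometr(void)
{
    if (!a) { r1 = 1; r2 = 1; r3 = 0; }
    else if (!b) { r1 = 1; r2 = 0; r3 = 0; }
    else if (!c) { r1 = 0; r2 = 1; r3 = 0; }
    else if (!d) { r1 = 0; r2 = 1; r3 = 1; }
    else { r1 = 0; r2 = 0; r3 = 0; }
}

/* Hypothetical helper (not from the paper): returns the number of
   admissible contact states for which the ensures expression of
   listing 4 does not hold after calling termometr(). */
int heater_count_violations(void)
{
    /* The five admissible states (a, b, c, d) from the requires clause. */
    static const char admissible[5][4] = {
        {0, 0, 0, 0}, {1, 0, 0, 0}, {1, 1, 0, 0}, {1, 1, 1, 0}, {1, 1, 1, 1}
    };
    int i, ok, violations = 0;
    for (i = 0; i < 5; i++) {
        a = admissible[i][0];
        b = admissible[i][1];
        c = admissible[i][2];
        d = admissible[i][3];
        termometr();
        /* Relay values must stay binary. */
        assert((r1 == 0 || r1 == 1) && (r2 == 0 || r2 == 1) && (r3 == 0 || r3 == 1));
        /* (!b <==> r1) && ((!a || (b && !d)) <==> r2) && (r3 ==> c) */
        ok = ((!b) == (r1 != 0))
          && ((!a || (b && !d)) == (r2 != 0))
          && (!r3 || c);
        if (!ok)
            violations++;
    }
    return violations;
}
```

The third conjunct is deliberately an implication, as in listing 4: r3 is 1 in only one of the states in which c is 1.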
3. Sequential logic
In sequential logic the output depends on the current input and on the internal state from the previous
calculation cycle. Implementations as C functions require additional parameters, usually
passed as pointers. Dereferencing those pointers at the beginning retrieves the
previous state. The specification can be given as text, as a formula, or derived from signal-time
plots and automaton graphs.
3.1. RS flip flop
Suppose the RS flip flop (Fig. 5) is specified by the following text: logic 1 at input R
resets output Q to 0; 1 at input S sets Q to 1, but only when R is 0; if R and S are both 0 then
the value of Q is preserved (state); if R and S are both 1 then R is predominant and Q is reset to 0. Such
a specification is written as a code annotation in listing 5.
// ----- Listing 5 -----
requires \valid(state) && (R==0||R==1) && (S==0||S==1) && (*state==0||*state==1);
assigns *state;
ensures ((\old(*state)==0) && (\old(S)==0) ==> (\result==0) && (*state==0)) &&
        ((\old(*state)==0) && (\old(S)==1) && (\old(R)==0) ==>
          (\result==1) && (*state==1)) &&
        ((\old(*state)==1) && (\old(R)==0) ==> (\result==1) && (*state==1)) &&
        ((\old(*state)==1) && (\old(R)==1) ==> (\result==0) && (*state==0));
Fig. 5. RS flip flop symbol
The requires clause contains the \valid predicate, which states that the state pointer can be
safely dereferenced at the function call (in particular, it is not NULL). The assigns clause
must list *state, since the pointed-to value is modified by the function (even though the
modification is made through pointer dereferencing). The \old construct in the ensures
clause denotes the value at function entry, i.e. the state from the previous execution.
// ----- Listing 6 -----
char rs_ff_aut(char R, char S, char* state)
{
  switch(*state)
  {
    case 0:
      if(S == 1 && R == 0) { *state = 1; return 1; }
      return 0;
      break;
    case 1:
      if(R == 1) { *state = 0; return 0; }
      return 1;
      break;
  }
}
Listing 6 shows an implementation of the RS flip flop function as an automaton. Jessie
analysis generates 46 proof obligation lemmas and 9 safety proof obligations. Most of those
46 lemmas can be proved similarly as before. The other lemmas require operations on pointer
arithmetic, which involve the subst _, rewrite select_store_eq, rewrite pset_singleton in _
tactics, and some combinations of the tactics mentioned before. Moreover, some lemmas
involve contradictory hypotheses, so additional tactics like absurd _, contradiction and omega
are needed. The safety proof obligations are simpler, and 7 of them are proved with
intuition. The remaining 2 lemmas can be proved with the sequence of intros, rewrite _
and omega.
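The textual specification of the flip flop can be compared with the automaton of listing 6 over all eight combinations of R, S and the stored state. The sketch below is an illustration only: rs_expected and rs_count_mismatches are hypothetical helpers, the unreachable break statements of listing 6 are omitted, and a final return is added so that the function returns a value on every path (that path is not reached when the requires clause of listing 5 holds).

```c
#include <assert.h>

/* RS flip flop automaton, following listing 6. */
char rs_ff_aut(char R, char S, char *state)
{
    switch (*state) {
    case 0:
        if (S == 1 && R == 0) { *state = 1; return 1; }
        return 0;
    case 1:
        if (R == 1) { *state = 0; return 0; }
        return 1;
    }
    return 0; /* not reached when *state is 0 or 1 */
}

/* Expected output Q according to the textual specification:
   R dominates, S sets only when R is 0, otherwise Q is preserved. */
static int rs_expected(int R, int S, int q_old)
{
    if (R == 1) return 0;
    if (S == 1) return 1;
    return q_old;
}

/* Hypothetical helper (not from the paper): counts the combinations of
   (state, R, S), out of 8, where the automaton deviates from the text
   or where the stored state differs from the returned output. */
int rs_count_mismatches(void)
{
    int q, R, S, mismatches = 0;
    for (q = 0; q <= 1; q++)
        for (R = 0; R <= 1; R++)
            for (S = 0; S <= 1; S++) {
                char state = (char)q;
                char out;
                assert(state == 0 || state == 1); /* part of the requires clause */
                out = rs_ff_aut((char)R, (char)S, &state);
                if (out != rs_expected(R, S, q) || state != out)
                    mismatches++;
            }
    return mismatches;
}
```

The check also exercises the case R = S = 1 with the state equal to 0, which the textual specification covers (R is predominant) even though no conjunct of the ensures clause in listing 5 mentions \old(R)==1 together with \old(S)==1 explicitly for that state.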
3.2. Detecting truck movement
The direction of movement of a truck, to the left or to the right, is detected by two photo
sensors S1, S2 and signaled by the lamps L1, L2 (Fig. 6). If the truck moves to the left, L1 is on.
The distance between the sensors is smaller than the length of the truck. It is assumed that the truck
can stop only outside the sensors.
Fig. 6. Detecting direction of truck movement
The specification can be represented by the Moore automaton shown in Fig. 7, and coded
accordingly as in listing 7. As before, to simplify proof obligations, it is assumed that all
variables are declared globally with char type. Listing 8 contains the implementation of the control
program.
Fig. 7. Automaton detecting direction of truck movement
Jessie analysis produces 97 correctness proof obligation lemmas and 2 safety proof
obligations. Thanks to the global declarations, 89 lemmas are proved immediately by the
intuition tactic. Additional rewrite _ and assumption tactics prove the remaining 8 lemmas.
The two safety lemmas are proved by intuition.
// ----- Listing 7 -----
requires (state >= 0) && (state <= 2) && (cl==0 || cl==1) && (cp==0 || cp==1);
assigns state, l1, l2;
ensures ((\old(state)==0 && (\old(cl)==0) && (\old(cp)==0)) ==> (state==0)) &&
        ((\old(state)==0 && (\old(cl)==1) && (\old(cp)==0)) ==> (state==1)) &&
        ((\old(state)==0 && (\old(cl)==0) && (\old(cp)==1)) ==> (state==2)) &&
        ((\old(state)==1 && (\old(cl)==1 || \old(cp)==1)) ==> (state==1)) &&
        ((\old(state)==1 && (\old(cl)==0 && \old(cp)==0)) ==> (state==0)) &&
        ((\old(state)==2 && (\old(cl)==1 || \old(cp)==1)) ==> (state==2)) &&
        ((\old(state)==2 && (\old(cl)==0 && \old(cp)==0)) ==> (state==0));
// ----- Listing 8 -----
void truck_movement()
{
  switch(state)
  {
    case 0:
      l1 = 0;
      l2 = 0;
      if(cl && !cp)
        state = 1;
      if(!cl && cp)
        state = 2;
      if(!cl && !cp)
        state = 0;
      break;
    case 1:
      l1 = 1;
      l2 = 0;
      if(cl || cp)
        state = 1;
      else
        state = 0;
      break;
    case 2:
      l1 = 0;
      l2 = 1;
      if(cl || cp)
        state = 2;
      else
        state = 0;
      break;
  }
}
Verification of sequential logic with Frama C and Jessie does not differ much from
the combinatorial case. The only difference is the use of pointers in function declarations. It leads
to somewhat complicated verification lemmas, which require knowledge of how to prove Jessie
pointer arithmetic obligations with Coq.
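The behaviour described by the automaton can be illustrated by feeding the control function of listing 8 a short sequence of sensor samples, one per program cycle. The following sketch makes several assumptions: the sample sequences are invented for illustration, truck_demo is a hypothetical helper, and no claim is made about which physical direction corresponds to cl or cp being triggered first.

```c
#include <assert.h>

/* Globals of char type, as assumed for listings 7 and 8. */
char state, cl, cp, l1, l2;

/* Control program, as in listing 8. */
void truck_movement(void)
{
    switch (state) {
    case 0:
        l1 = 0; l2 = 0;
        if (cl && !cp) state = 1;
        if (!cl && cp) state = 2;
        if (!cl && !cp) state = 0;
        break;
    case 1:
        l1 = 1; l2 = 0;
        if (cl || cp) state = 1; else state = 0;
        break;
    case 2:
        l1 = 0; l2 = 1;
        if (cl || cp) state = 2; else state = 0;
        break;
    }
}

/* Hypothetical helper (not from the paper): simulates one pass of the
   truck. If cl_first is nonzero, sensor cl is covered first, otherwise
   cp is covered first. Returns a bit mask of the lamps that were lit
   in any cycle: bit 0 for l1, bit 1 for l2. */
int truck_demo(int cl_first)
{
    /* Assumed samples: first sensor, both sensors, second sensor, none. */
    static const char first[5]  = {1, 1, 0, 0, 0};
    static const char second[5] = {0, 1, 1, 0, 0};
    int i, seen = 0;
    state = 0; l1 = 0; l2 = 0;
    for (i = 0; i < 5; i++) {
        cl = cl_first ? first[i] : second[i];
        cp = cl_first ? second[i] : first[i];
        assert(state == 0 || state == 1 || state == 2); /* requires clause */
        truck_movement();
        if (l1) seen |= 1;
        if (l2) seen |= 2;
    }
    return seen;
}
```

Because the automaton is of Moore type, a lamp is switched on one cycle after the corresponding state is entered, and switched off one cycle after the automaton returns to state 0.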
4. Sequential logic with time constraints
Software implementations of sequential logic with time constraints involve a time span
called the cycle time. Measured time is a multiple of the cycle time. Specifications
expressed by automata do not differ much from previous examples, except that the formulae
describing states and transitions are extended by time.
4.1. Flashing LED
LED L flashes while key K is pressed (Fig. 8). Th and Tl denote the time periods when the LED is
on and off, respectively. The LED is turned off immediately when the key is released. The
automaton in Fig. 9 is derived directly from the time plot. The specification based on it is shown
in listing 9. Expressions describing states and transitions consist of a condition written above the
separating line and a time operation (if needed) below the line. As seen from Fig. 9, the
condition K==0 labels each of the two edges entering state 0. This makes it possible to write
a single part of the ensures clause for (K==0). The other parts of the clause involve time operations.
Fig. 8. Time-signal plots of flashing LED
Fig. 9. State automaton for flashing LED
// ----- Listing 9 -----
requires ((K==0) || (K==1)) && ((state>=0) && (state<=2)) && (t1>=0) && (t2>=0)
         && (t1<=T1) && (t2<=T2);
assigns L, state, t1, t2;
ensures ((K==0) ==> (state==0)) &&
        ((K==1) && (\old(state)==0) ==> (state==1) && (t1==0)) &&
        ((K==1) && (\old(state)==1) && (\old(t1)<Th) ==> (state==1) && (t1==\old(t1)+1)) &&
        ((K==1) && (\old(state)==1) && (\old(t1)==Th) ==> (state==2) && (t2==0)) &&
        ((K==1) && (\old(state)==2) && (\old(t2)<Tl) ==> (state==2) && (t2==\old(t2)+1)) &&
        ((K==1) && (\old(state)==2) && (\old(t2)==Tl) ==> (state==1) && (t1==0));
The software implementation is presented in listing 10 (variables are char globals). T1 and T2
are assumed to have the const modifier and to be initialized with some values (e.g. 6 and 3, not shown in
the listing). Those values represent the numbers of program execution cycles needed for the
required times to elapse.
// ----- Listing 10 -----
void flashingled()
{
  switch(state)
  {
    case 0:
      L = 0;
      if(K == 1) {
        state = 1;
        t1 = 0; }
      break;
    case 1:
      L = 1;
      if(K == 0) {
        state = 0; } else
      if(K==1 && t1<Th) {
        state = 1; t1++; } else
      if(K==1 && t1==Th) {
        state = 2; t2 = 0; }
      break;
    case 2:
      L = 0;
      if(K == 0) {
        state = 0; } else
      if(K==1 && t2<Tl) {
        state = 2; t2++; } else
      if(K==1 && t2==Tl) {
        state = 1; t1 = 0; }
      break;
  }
}
Jessie analysis produces 209 lemmas to prove program correctness, and 28 lemmas for
variable overflow safety. Most of the correctness lemmas are verified by the intuition tactic.
The remaining ones can be proved similarly as in the multiplexer and truck movement examples.
Safety lemmas are however not so trivial. Four of them cannot be proved because of the
weakness of the requires clause in the current form. The main problem is that Th and Tl are treated
by Jessie not as constants but as variables. Besides, there are no assumptions that t1 <= Th
and t2 <= Tl, nor that t1 <= 126 and t2 <= 126, which would guarantee overflow safety on the char type for
that program. Strengthening the requires clause with these assumptions makes it possible to prove that
overflow does not occur.
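The role of the additional assumptions can be illustrated at run time: if the counters start within 0..Th and 0..Tl, and both limits do not exceed 126, each increment stays within the range of a signed 8-bit value. The sketch below checks the strengthened precondition with assert before every cycle. It is an illustration, not the authors' proof; the constants 6 and 3 are the example values mentioned in the text, signed char is used explicitly because the signedness of plain char is implementation-defined, and led_on_cycles is a hypothetical helper.

```c
#include <assert.h>

/* Example limits, in program cycles (assumed values, see lead-in). */
static const signed char Th = 6, Tl = 3;

/* Globals of the flashing LED program. */
signed char K, L, state, t1, t2;

/* Control program, following listing 10. */
void flashingled(void)
{
    switch (state) {
    case 0:
        L = 0;
        if (K == 1) { state = 1; t1 = 0; }
        break;
    case 1:
        L = 1;
        if (K == 0) { state = 0; }
        else if (t1 < Th) { state = 1; t1++; }
        else if (t1 == Th) { state = 2; t2 = 0; }
        break;
    case 2:
        L = 0;
        if (K == 0) { state = 0; }
        else if (t2 < Tl) { state = 2; t2++; }
        else if (t2 == Tl) { state = 1; t1 = 0; }
        break;
    }
}

/* Hypothetical helper (not from the paper): holds the key pressed for
   n cycles starting from state 0 and returns the number of cycles in
   which the LED output was on. The asserts express the strengthened
   requires clause; they would fail if a counter left its range. */
int led_on_cycles(int n)
{
    int i, on = 0;
    K = 1; L = 0; state = 0; t1 = 0; t2 = 0;
    for (i = 0; i < n; i++) {
        assert(t1 >= 0 && t1 <= Th && Th <= 126);
        assert(t2 >= 0 && t2 <= Tl && Tl <= 126);
        flashingled();
        if (L)
            on++;
    }
    return on;
}
```

With these values the counters never exceed 6 and 3, so the increments t1++ and t2++ cannot overflow; the same argument holds for any limits up to 126.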
4.2. Cargo lift
The lift can move up or down as shown in Fig. 10. If it is reversed while moving, it
first stops for a moment, e.g. 2 seconds, and then moves in the opposite direction. The signals are as follows:
Inputs: pup, pdn – pushbuttons (up or down); lup, ldn – limit switches
Outputs: mup, mdn – motor
Times: rup, rdn – reverse stop
Fig. 10. Cargo lift
The automaton representing the algorithm is shown in Fig. 11. Listing 11 contains the
corresponding specification in code annotation form (variables are char globals). Variables
rdn and rup are declared with the const modifier and initialized with the value 10. Listing 12
presents the lift program.
// ----- Listing 11 -----
requires (state>=0 && state<=5) && (mup==0 || mup==1) && (mdn==0 || mdn==1) &&
         (lup==0 || lup==1) && (ldn==0 || ldn==1) && (pup==0 || pup==1) &&
         (pdn==0 || pdn==1);
assigns state,mup,mdn,tim;
ensures ((\old(state)==0 && pup==1) ==> state==1) &&
        ((\old(state)==1 && pdn==1 && lup==0) ==> ((state==4) && (tim==0))) &&
        ((\old(state)==1 && pdn==0 && lup==1) ==> state==2) &&
        ((\old(state)==2 && pdn==1) ==> state==3) &&
        ((\old(state)==3 && pup==1 && ldn==0) ==> ((state==5) && (tim==0))) &&
        ((\old(state)==3 && ldn==1 && pup==0) ==> state==0) &&
        ((\old(state)==4 && (\old(tim)<rdn)) ==> ((state==4) && (tim==\old(tim)+1))) &&
        ((\old(state)==4 && (\old(tim)==rdn)) ==> state==3) &&
        ((\old(state)==5 && (\old(tim)<rup)) ==> ((state==5) && (tim==\old(tim)+1))) &&
        ((\old(state)==5 && (\old(tim)==rup)) ==> (state==1));
Fig. 11. Automaton of lift control with time stop
// ----- Listing 12 -----
void liftcontrol()
{
  switch(state)
  {
    case 0:
      mup = 0;
      mdn = 0;
      if(pup == 1) state = 1;
      break;
    case 1:
      mup = 1;
      mdn = 0;
      if(pdn == 1) { state = 4; tim = 0; } else
      if(lup == 1) { state = 2; }
      break;
    case 2:
      mup = 0;
      mdn = 0;
      if(pdn == 1) { state = 3; }
      break;
    case 3:
      mup = 0;
      mdn = 1;
      if(pup == 1) { state = 5; tim = 0; } else
      if(ldn == 1) { state = 0; }
      break;
    case 4:
      mup = 0;
      mdn = 0;
      if(tim < rdn) { tim++; } else
        state = 3;
      break;
    case 5:
      mup = 0;
      mdn = 0;
      if(tim < rup) { tim++; } else
        { state = 1; }
      break;
  }
}
Jessie analysis generates 196 correctness proof obligation lemmas and 26 safety proof
obligations. As before, most of the correctness lemmas can be proved by intuition. The
remaining lemmas require the sequence intros, rewrite _, assumption or
intros, rewrite _, rewrite _ in _, assumption.
Only two safety lemmas cannot be proved because of the weakness of the requires
clause: the variable tim can exceed its maximum value. By strengthening requires
with the expression tim>=0 && tim <= 126, those lemmas can be proved with the
omega tactic.
Verification of sequential logic with time constraints does not differ from the earlier cases.
Declaring variables globally reduces the proofs of most lemmas to the intuition tactic and frees
them from pointer arithmetic. However, if no constraints are given for time variables, the absence of
overflow cannot be proved.
5. Summary
A verification approach for simple control programs written in ANSI C has been presented.
The examples cover combinatorial logic, sequential logic, and sequential logic with time
constraints. The specification can be given as a formula, verbally, as an automaton graph, or as a
time plot. The implementation is checked against the specification using the open-source tools
Frama C, Jessie and Coq. Besides correctness, variable overflow safety lemmas are also
evaluated.
The verification provides formal justification of program correctness but is quite
cumbersome. Therefore, to make it more practical, future work will focus on so-called
dynamic verification, somewhat similar to debugging, where intermediate results are checked
symbolically “on-line”.
BIBLIOGRAPHY
1. Affeldt R., Kobayashi N.: Formalization and Verification of a Mail Server in Coq. Lecture Notes in Computer Science, v. 2609, Springer, 2003.
2. Baudin P., Cuoq P., Filliâtre J.Ch., Marché C., Monate B., Moy Y., Prevosto V.: ACSL: ANSI/ISO C Specification Language. INRIA, 2009.
3. Coq homepage [online] http://coq.inria.fr.
4. Dijkstra E.W.: A Discipline of Programming. Prentice-Hall Inc., Englewood Cliffs, NJ 1976.
5. Frama C homepage [online] http://frama-c.ceq.fr.
6. Kalisz J.: Podstawy elektroniki cyfrowej. WKŁ, Warszawa 2007.
7. Kerbœuf M., Novak D., Talpin J. P.: Specification and Verification of a Steam-Boiler with Signal-Coq, University of Oxford, 2000.
8. Świder Z. (red.): Sterowniki mikroprocesorowe. Oficyna Wydawnicza Politechniki Rzeszowskiej, 2002.
9. Moy Y., Marché C.: Jessie Plugin Tutorial. [online] http://why.lri.fr.
10. Sadolewski J.: Introduction to verification simple programs in ST language with Coq, Why and Caduceus. Metody Informatyki Stosowanej. No. 2 (19) 2009 (in Polish).
Reviewer: Prof. dr hab. inż. Bolesław Pochopień
Received by the Editorial Office on 7 February 2011.
Overview
The paper presents the verification of simple control programs written
in ANSI C using the publicly available tools Frama C with the Jessie plug-in
and Coq. Verification consists in showing that the implementation conforms to the specification, and
in showing that no situation can occur in which a variable would overflow. The specification can be given as formulas, e.g. (1) and (2), in words, as an automaton (Fig. 4, 6, 7),
or as a time plot (Fig. 5). Each of these forms is written as code annotations
and then analyzed by Frama C with the Jessie plug-in, which check
that the specification agrees with the code. The analysis yields correctness lemmas (ensures proof obligations) and variable range safety lemmas (safety proof obligations). These lemmas are proved semi-automatically
with Coq.
The process of constructing proofs is shown on examples of combinational circuits, sequential circuits, and sequential circuits with time dependencies. Attention is drawn to cases
in which the code must allow several instances to be declared. Examples that do not require multiple instances are presented using global variables, which eliminate pointer arithmetic and thereby simplify the proofs. Failure to prove even one lemma
prevents the correctness or safety of the verified unit from being established. Possible
causes are an error in the implementation, an error in the specification, or preconditions
that are too weak. Two cases show how the preconditions should be strengthened.
Addresses
Jan SADOLEWSKI: Rzeszow University of Technology, Department of Computer and
Control Engineering, ul. W. Pola 2, 35-959 Rzeszów, Poland, [email protected].
Zbigniew ŚWIDER: Rzeszow University of Technology, Department of Computer and
Control Engineering, ul. W. Pola 2, 35-959 Rzeszów, Poland, [email protected].
Artur SIĄŻNIK, Bożena MAŁYSIAK-MROZEK, Dariusz MROZEK
Silesian University of Technology, Institute of Informatics
EXPLORATION OF GENETIC DATA FROM THE GENBANK DATABASE
USING WEB SERVICES
Abstract. The National Center for Biotechnology Information (NCBI) collects
huge amounts of data describing various biological organisms in many different
ways. These data are stored in dedicated databases managed
by the NCBI. GenBank is one of the world's best-known NCBI databases,
storing tens of millions of DNA and RNA nucleotide sequences.
This article presents an original system for exploring genetic data in the GenBank database. The search GenBank system makes it possible not only to search and browse the biological data in GenBank, but also to link the GenBank entries found with data in other NCBI databases, thus enabling cross-database data exploration.
Keywords: bioinformatics, DNA, RNA, databases, web services
EXPLORATION OF GENETIC DATA FROM GENBANK USING WEB
SERVICES
Summary. The National Center for Biotechnology Information (NCBI) collects huge
amounts of data describing various biological organisms in several ways. These data
are stored in appropriate databases managed by the NCBI. GenBank is one of the
world's best-known NCBI databases, storing tens of millions of DNA and RNA
nucleotide sequences. In this article, we present a new system designed to explore
genetic data in the GenBank database. The search GenBank system not only allows users to
search and browse biological data in GenBank, but also combines GenBank
database entries with items in other NCBI databases. The search GenBank system
thus provides cross-database exploration.
Keywords: bioinformatics, DNA, RNA, databases, web services
1. Introduction
Thanks to the rapid development of computer science observed over recent years, many other fields of science can now benefit from the solutions and technologies that this development has produced.
Combining the medical sciences with computer science was once only the dream of science fiction
writers, yet today joint work by scientists from entirely different fields is
nothing unusual. Moreover, cooperation between scientists with different
interests has given rise to intermediate fields of science that
harmoniously combine knowledge and experience from seemingly distant worlds.
It seems that without cooperation across the divisions between
fields of science, today's scientific work would look completely different. Fortunately, we will never have to find out what would have happened had history taken a different
path.
A few years ago, medicine and the life sciences became branches of science whose research generates enormous amounts of data. This unimaginably large amount of data had to be
processed and stored in some way. The task was taken up by computer scientists and bioinformaticians who, having suitable database technologies and a good theoretical grounding in the life sciences, were able to solve the problem of storing
and processing large amounts of biological information.
Because the body of information keeps growing, the solutions and techniques
for storing medical data in databases are being refined all the time. One of the main reasons for the creation of bioinformatics databases [1, 2]
is the very large amount of genetic information, including nucleotide sequences, for which there is no better storage method than database systems. Practically since 1981,
when the Sanger sequencing method was invented, the problem of storing genetic information has remained topical. One of the world's best-known
databases storing genetic information is GenBank [3], maintained
by the National Center for Biotechnology Information (NCBI)¹.
This article presents a tool for exploring genetic data in the
GenBank database. A specially designed web portal, using web services, allows searching for information in this database and in databases linked
to GenBank.
¹ National Center for Biotechnology Information (NCBI), http://www.ncbi.nlm.nih.gov
2. Retrieving information from NCBI databases
Information in the databases maintained by the NCBI is searched and retrieved with the Entrez system [4], which also serves as a tool for indexing database
records. The first version of the system was distributed on CD-ROM (1991).
At that time Entrez (Fig. 1) supported the GenBank database with its
nucleotide sequences, the Proteins database of amino acid sequences [5], which
stored the protein sequences corresponding to the nucleotide sequences in
GenBank, and the PubMed database of abstracts of scientific papers [6].
The Entrez system is built on links between nodes, each of which corresponds to a specific database. Fig. 2 shows the structure and connections
of the Entrez nodes.
Fig. 1. Result of sample, global query submitted to Entrez system
The system is under continuous development, and the number of nodes served by Entrez keeps growing. The original three-node system, i.e. GenBank (Nucleotide), PDB (Protein) and PubMed, has evolved over recent years by adding nodes such as [7]:
• Taxonomy, organized around the names of organisms and the phylogenetic relationships between them;
• Structure, organized around three-dimensional structures of proteins and nucleic acids;
• Genomes, in which each record represents a chromosome of a given organism;
• PopSet, a collection of sequences from a single population study;
• OMIM, a database of all known diseases with a genetic basis;
• SNP, organized around single nucleotide polymorphisms;
• others.
Fig. 2. Relationships between database nodes in Entrez
3. Query structure and syntax
Fig. 3. Graphical representation of information searching in Entrez
When searching the NCBI databases, users usually type the keywords or phrases to be found into the NCBI Entrez search box. The character strings entered into Entrez are converted into queries of the following form:
Term1[field1] Op Term2[field2] Op Term3[field3] Op ...
where Term is a query term, field is the search field (always written in square brackets), and Op is one of the available logical operators: AND, OR or NOT (operators must be written in upper case).
Example: human[organism] AND topoisomerase[protein name]
Entrez splits the entered query into a series of items that were separated by spaces in the original query. If the query contains logical operators, the system splits it first at the operators and then at the spaces. Each item is processed separately, and the search results are then combined according to the operators used in the query (Figure 3). The default logical operator is AND.
If the query consists solely of a list of UIDs² or accession numbers, Entrez returns only the records to which those identifiers refer. No further query processing takes place in that case.
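The query format described in this section can be illustrated with a short Python sketch. The function name build_query and the (operator, term, field) tuple layout are assumptions made for this illustration only; they are not part of Entrez or of the portal described in the article.

```python
from typing import Iterable, Optional, Tuple

OPERATORS = {"AND", "OR", "NOT"}


def build_query(clauses: Iterable[Tuple[Optional[str], str, Optional[str]]]) -> str:
    """Join (operator, term, field) clauses into 'Term[field] Op Term[field]'."""
    parts = []
    for index, (op, term, field) in enumerate(clauses):
        piece = f"{term}[{field}]" if field else term
        if index == 0:
            parts.append(piece)  # the first clause carries no operator
            continue
        op = (op or "AND").upper()  # AND is the default operator
        if op not in OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        parts.extend([op, piece])
    return " ".join(parts)


query = build_query([
    (None, "human", "organism"),
    ("and", "topoisomerase", "protein name"),
])
print(query)
```

The operator is upper-cased before it is emitted because, as stated above, Entrez only recognizes operators written in capital letters.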
4. Automatic term mapping in Entrez queries
While processing a query, Entrez automatically looks up each query term in the database against the following criteria:
1. Taxonomy node: each term is limited to the field [organism] or [All Fields]. For example, the term mouse is automatically mapped to: "mus musculus"[organism] OR mouse[All Fields].
2. Journal name: the database is searched by journal names, e.g. science is mapped to science[Journal].
3. Author name: the query result is narrowed to the [Author] field. Not every term can be mapped to the author field, because only a word followed by one or two letters is treated as a valid author name. For example, Siaznik A is mapped to "Siaznik A"[Author].
If the system returns no results after automatic mapping, the rightmost term of the query is removed and the mapping is repeated until the system returns results. If the system still returns no results, all query terms are limited to the All Fields field and joined with the logical operator AND.
Table 1 shows sample queries before and after automatic mapping [7].
² UID (Unique Identifier): the name of the unique record identifier in NCBI databases.
Table 1
Examples of automatic term mapping in Entrez queries
Original query: cancer cell receptor
After automatic mapping: ("Cancer Cell"[Journal] OR ("cancer"[All Fields] AND "cell"[All Fields]) OR "cancer cell"[All Fields]) AND receptor[All Fields]
Original query: cell receptor cancer
After automatic mapping: cell[All Fields] AND receptor[All Fields] AND ("Cancer"[Organism] OR cancer[All Fields])
Original query: mouse p53
After automatic mapping: ("Mus musculus"[Organism] OR mouse[All Fields]) AND p53[All Fields]
Original query: wheat nuclear protein
After automatic mapping: ("Triticum aestivum"[Organism] OR wheat[All Fields]) AND nuclear protein[All Fields]
Original query: wheat w nuclear protein
After automatic mapping: wheat w[Author] AND ("nuclear proteins"[MeSH Terms] OR ("nuclear"[All Fields] AND "proteins"[All Fields]) OR "nuclear proteins"[All Fields] OR ("nuclear"[All Fields] AND "protein"[All Fields]) OR "nuclear protein"[All Fields])
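The idea of trying a phrase and dropping its rightmost word when nothing matches can be sketched in Python as follows. This is a deliberately simplified toy model, not NCBI's implementation: the lookup tables, the author pattern and the function names are all assumptions chosen to mirror the three criteria of this section.

```python
import re

# Toy lookup tables; the real Entrez indexes are far larger (assumption).
ORGANISMS = {"mouse": "mus musculus", "wheat": "triticum aestivum"}
JOURNALS = {"science", "cancer cell"}
AUTHOR = re.compile(r"^[a-z]+ [a-z]{1,2}$")  # a word followed by one or two letters


def map_phrase(phrase):
    """Return the mapped form of phrase, or None when no table matches."""
    if phrase in ORGANISMS:
        return f'("{ORGANISMS[phrase]}"[Organism] OR {phrase}[All Fields])'
    if phrase in JOURNALS:
        return f'("{phrase}"[Journal] OR {phrase}[All Fields])'
    if AUTHOR.match(phrase):
        return f'"{phrase}"[Author]'
    return None


def map_query(query):
    words = query.lower().split()
    mapped = []
    while words:
        # Try the longest phrase first, dropping the rightmost word on failure.
        for end in range(len(words), 0, -1):
            hit = map_phrase(" ".join(words[:end]))
            if hit:
                break
        else:
            # No table matched even a single word: fall back to All Fields.
            end, hit = 1, f"{words[0]}[All Fields]"
        mapped.append(hit)
        words = words[end:]
    return " AND ".join(mapped)


print(map_query("mouse p53"))
print(map_query("wheat w nuclear protein"))
```

With these toy tables the first call reproduces the shape of the "mouse p53" row of Table 1, while the second shows why "wheat w" is read as an author name rather than as an organism.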
5. Entrez Programming Utilities
Table 2
Available eUtils tools
Tool name: Description
EInfo: Provides information about the available databases or about a particular database, such as the number of records indexed for each search field, the date of the last update and the available links to other databases.
EGQuery: Responds to a query by returning the number of records matching the given search terms in each database served by Entrez.
ESearch: Responds to a query by returning the list of unique identifiers (UIDs) of the records that match it.
ESummary: Returns record summaries from a particular database for a given list of UIDs.
EPost: Accepts a list of UIDs and uploads it to the history server, returning the corresponding address as the parameters WebEnv and query_key.
EFetch: Responds to a list of UIDs by returning the complete records from the given database.
ELink: Returns the list of UIDs of records in the given database that are linked to the original list of UIDs of records from the input database.
ESpell: Returns spelling suggestions for the query entered by the user.
The name Entrez Programming Utilities [8, 9] covers a set of eight server-side programs that provide a stable interface to the Entrez system at NCBI.
eUtils, as the toolset is called for short, can be used in two ways. In the first, an application sends a suitably composed URL to the server hosting the tools and then receives the tool's response in XML format. The second, more sophisticated way is to use the SOAP protocol [10]. NCBI publishes on its web pages links to the WSDL files describing the web services that make up eUtils.
Table 2 lists all eUtils tools with a short description of their function [9].
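The URL-based way of calling eUtils can be sketched as follows. The base address and the parameters db, term and retmax follow NCBI's public E-utilities documentation rather than this article, the helper names are assumptions, and the XML sample is hand-written with invented UIDs so that the sketch runs without network access.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base address taken from NCBI's public documentation, not from the article.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Compose the URL for one E-utility call, e.g. tool='esearch'."""
    return f"{BASE}{tool}.fcgi?{urlencode(params)}"


def parse_esearch(xml_text):
    """Extract the hit count and the UID list from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    uids = [node.text for node in root.findall("./IdList/Id")]
    return count, uids


url = eutils_url("esearch", db="nucleotide", term="mouse[organism]", retmax=2)
print(url)

# Hand-written sample shaped like an ESearch reply; the UIDs are made up.
SAMPLE = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""
count, uids = parse_esearch(SAMPLE)
print(count, uids)
```

In a real application the composed URL would be fetched over HTTP and the reply passed to the parser; that step is left out here on purpose.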
6. Combining the Entrez Programming Utilities
To build an application able to search and retrieve information from the Entrez databases, the available tools must be combined skilfully. The basic combination of programs in the eUtils package is:
ESearch → EFetch/ESummary
This combination makes it possible to search a database and retrieve the records that satisfy the conditions of the given query.
ESearch generates the list of identifiers of the records in the given database that match the entered query, while ESummary or EFetch retrieves from the database the records whose identifiers are on the given UID list (Figure 4).
Fig. 4. Diagram showing the collaborative functioning of the ESearch tool and the EFetch/ESummary tool
If the application has to find records in one database and then find the records linked to them in another database, the tools needed for the task are:
ESearch → ELink → EFetch/ESummary
In this combination ELink generates the list of UIDs of the records in the target database that are linked to the records of the database that was originally searched.
If a wider range of links is of interest, for example when the task is to find the amino acid sequences linked to genes that are in turn linked to nucleotide sequences from population sets of mouse sequences, then ELink is called three times:
ESearch → ELink → ELink → ELink → EFetch
In this case ESearch returns the list of UIDs of all mouse population sets in the PopSet database, the first ELink call finds the UIDs of the nucleotide sequences in the Nucleotide database that belong to those sets, the second ELink call produces the list of genes in the Gene database linked to those nucleotide sequences, and the last ELink call yields the list of UIDs of proteins in the Protein database that are linked to the genes found earlier.
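The chaining of one ESearch call, any number of ELink calls and a final EFetch call can be sketched in Python. The three tools are passed in as plain functions, and the in-memory stand-ins below use invented UIDs and lower-case database names chosen for this illustration; nothing here calls NCBI or reproduces the authors' code.

```python
from typing import Callable, List, Sequence


def run_chain(
    search: Callable[[str, str], List[str]],
    link: Callable[[str, str, Sequence[str]], List[str]],
    fetch: Callable[[str, Sequence[str]], List[str]],
    start_db: str,
    query: str,
    link_dbs: Sequence[str],
) -> List[str]:
    """ESearch in start_db, one ELink hop per entry of link_dbs, then EFetch."""
    uids = search(start_db, query)
    current_db = start_db
    for target_db in link_dbs:
        if not uids:  # nothing left to link, stop early
            return []
        uids = link(current_db, target_db, uids)
        current_db = target_db
    return fetch(current_db, uids)


# In-memory stand-ins for the three tools; all UIDs below are invented.
HITS = {("popset", "mouse"): ["ps1"]}
LINKS = {
    ("popset", "nucleotide", "ps1"): ["n1", "n2"],
    ("nucleotide", "gene", "n1"): ["g1"],
    ("gene", "protein", "g1"): ["p1", "p2"],
}


def fake_search(db, query):
    return HITS.get((db, query), [])


def fake_link(db_from, db_to, uids):
    found = []
    for uid in uids:
        found.extend(LINKS.get((db_from, db_to, uid), []))
    return found


def fake_fetch(db, uids):
    return [f"{db}:{uid}" for uid in uids]


records = run_chain(fake_search, fake_link, fake_fetch,
                    "popset", "mouse", ["nucleotide", "gene", "protein"])
print(records)
```

Passing the tools in as functions keeps the chaining logic separate from the transport, so the same loop would work with URL-based or SOAP-based calls.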
7. The search GenBank application
search GenBank is a program written as a web portal that lets users search and retrieve information from the GenBank database. Besides the nucleotide sequence database, the application also lets users search other databases.
The program has been tested with the following databases (in the original layout, names in bold mark the databases for which a dedicated record presentation has been prepared):
• Nucleotide: the main nucleotide sequence database (GenBank),
• dbEST: database of EST³ sequences,
• dbGSS: database of GSS⁴ sequences,
• Genome: database of genomes of various organisms,
• PopSet: database of sequences from single population studies,
• Taxonomy: taxonomy database,
• Gene: database of known genes,
• OMIM: database of all known diseases with a genetic basis,
• SNP: database of single nucleotide polymorphisms,
• PubMed: database of abstracts of scientific publications,
• PMC: database of available scientific publications,
• Journals: database of scientific journals,
• Protein: database of amino acid sequences.
³ Expressed Sequence Tag: a short fragment of a cDNA (mRNA) sequence. It can be used to identify gene transcripts.
⁴ Genome survey sequence: sequences similar to ESTs, except that most of them are genomic sequences rather than ones obtained from mRNA/cDNA.
Fig. 5. Main page of search GenBank
The application lets users search the databases listed above in the same way as on the NCBI web pages⁵. Besides the standard search method, in which the user types a query by hand into the search box, there is also an advanced search module that allows suitable limits on the query results to be defined.
⁵ http://www.ncbi.nlm.nih.gov/guide/
The search GenBank portal is also equipped with a macro builder. Macros automate the search for records in other databases that are linked to the records returned by the original query. This module is a novelty; no counterparts offering the same functions were found.
The interface was designed following current trends in web page templates; it was composed with the requirements of modern web users in mind and follows the accepted principles for presenting information clearly on web pages.
Logged-in users can use a simple facility for saving the queries they enter and the macros they build. Saved items can then be reused without having to recall the configuration of a macro or rewrite a complicated query.
The application is available at: http://sgb.bioaut.pl
7.1. Simple data search
Simple search is available on every page of the search GenBank portal. At the top of each page there is a box into which a single word or phrase to be searched for can be entered. In addition, the database to be searched must be chosen from the drop-down list on the right. Figure 6 shows the results of searching the Nucleotide database for the sample term mouse.
Note the form into which the term mouse has been converted. The following blocks may appear on the left of the results page:
• Wyniki dla zapytania (Results for query): shows the query and the number of records found, and offers the option of saving the query (zapisz zapytanie);
• Inne warianty (Other variants): shows a list of suggested alternative queries with the expected number of results;
• Czy chodziło Ci o? (Did you mean?): points out possible spelling errors in the entered query.
Fig. 6. Results of a sample query in search GenBank
Fig. 7. Content of the clipboard and links to other data sources available through search GenBank
Records returned by a query can be browsed and added to a clipboard. While viewing a single record or the clipboard contents, a record can also be linked to data in another database. For example, when searching for genetic data in the Nucleotide database, the results can be linked to protein sequences in the Proteins database or to bibliographic data in PubMed. This is done through a set of links offered to the user on the left of the application window (Figure 7, section Powiązania, i.e. Links).
7.2. Advanced data search
Advanced search in the search GenBank portal allows a query to be composed precisely, using the search fields appropriate to the chosen database and additional limits. Queries are built according to the rules presented in Section 3, usually by joining many simple search conditions with logical operators. For example:
"oxygen"[TITL] AND "hemoglobin"[GENE] AND "Arabidopsis thaliana"[ORGN]
The search GenBank portal provides a query builder for composing complex queries, shown in Figure 8.
Fig. 8. Advanced search web page in the search GenBank system
In brief, advanced search proceeds as follows:
1. Choose a database in the main query form.
2. Build your query:
a) choose the logical operator joining the query terms,
b) choose the search field,
c) type the search term,
d) press the Dodaj do pola zapytania (Add to query box) button,
e) repeat steps a) to d) as needed.
3. Press the Szukaj (Search) button next to the database selector at the top of the web page.
The advanced search module also allows additional limits to be placed on the range of search results. These limits are a range of publication dates or of record modification dates in the given database. To use this feature, fill in the corresponding fields on the search form page.
The results of a composed query are presented in the same way as the results of a basic search. The clipboard and link options are available and work identically for the results of queries built with the advanced search form.
7.3. Creating macros
Fig. 9. Construction of macros in search GenBank
Macros automate the search for records that are linked in some way to records in another database or in the same one. To build a macro, the user must specify the database in which records matching the entered query will be searched for. The user then chooses additional links to other databases from a list. In theory the number of links added to a macro is unlimited, but it should be kept in mind that not all records in NCBI databases carry annotations pointing to related items in other databases. The number of links between NCBI databases grows over time, so macros may well become a useful alternative to the tedious process of exploring data across databases. Macros are built with a form available in search GenBank, shown in Figure 9.
Once constructed, a macro can be run or, if the user is logged in, saved to the macro dictionary. Table 3 shows a few examples of macros.
Table 3
Examples of macros
Problem: Find all gene records available in the database for the records of the amino acid sequence database that correspond to the protein named topoisomerase.
Query: topoisomerase[protein name]
Database: Protein
Link: Gene Links
Found: 19
Problem: Find the nucleotide sequences for mouse and then all articles available for them in the PubMed database.
Query: mouse
Database: Nucleotide
Link: Pubmed Links
Found: 264
Problem: Find all possible records in the PopSet database that match the query on breast cancer, then find the nucleotide sequences linked to them. Link the nucleotide sequences found with protein sequences.
Query: Breast cancer
Database: PopSet
Links: Nucleotide Links, Protein Links
Found: 881
8. Summary
The search GenBank web portal offers extensive facilities for simple and advanced searching of GenBank and of the other databases maintained in the United States by the National Center for Biotechnology Information. In addition, macros enable cross-database exploration of linked data, a feature unique to search GenBank. The strength of this solution cannot yet be fully exploited, because the links between records of different databases are not yet very extensive. Nevertheless, the potential of solutions that automate information searching in bioinformatics databases may be fully realized in the near future.
The search GenBank portal was designed for people who analyse biological data, including biochemists, molecular biologists, physicians, staff of genetic laboratories and molecular pathologists. Registered, logged-in users can save the queries and macros they have constructed in dedicated dictionaries so that they can return to them when conducting similar research in the future.
Although search GenBank concentrates largely on genetic data, on the assumption that genetic data are currently the most widely used data in the life sciences, it also allows other NCBI databases to be searched and browsed.
BIBLIOGRAPHY
1. Mrozek D.: Bioinformatyczne bazy danych – rola, miejsce i klasyfikacja. [in] Bazy danych: Struktury, Algorytmy, Metody. Wydawnictwa Komunikacji i Łączności, Warszawa 2006, pp. 117-128.
2. Mrozek D., Małysiak B.: Bioinformatyczne bazy danych – poziomy opisu funkcjonowania organizmów. [in] Bazy danych: Struktury, Algorytmy, Metody. WKŁ, Warszawa 2006, pp. 107-116.
3. Benson D. A., Karsch-Mizrachi I., Lipman D. J., Ostell J., Wheeler D. L.: GenBank: update. Nucleic Acids Res., Vol. 32, 2004, pp. 23-26.
4. Hogue C., Ohkawa H., Bryant S.: A dynamic look at structures: WWW-Entrez and the Molecular Modelling Database. Trends Biochem. Sci. 21, 1996, pp. 226-229.
5. Wheeler D. L., Chappey C., Lash A. E., Leipe D. D., et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 28(1), 2000, pp. 10-14.
6. McEntyre J., Lipman D.: PubMed: bridging the information gap. CMAJ. 164(9), 2001, pp. 1317-1319.
7. Ostell J.: The Entrez Search and Retrieval System. http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch15 [accessed 02.02.2011].
8. Sayers E., Wheeler D.: Building Customized Data Pipelines Using The Entrez Programming Utilities (eUtils). http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework&part=eutils [accessed 02.02.2011].
9. Sayers E.: The E-Utilities In-Depth: Parameters, Syntax and More. http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpeutils&part=chapter4 [accessed 02.02.2011].
10. SOAP Version 1.2 Part 1: Messaging Framework (Second Edition), http://www.w3.org/TR/soap12-part1/ [accessed 02.02.2011].
Reviewers: Prof. dr hab. inż. Mieczysław Muraszkiewicz, Dr Ewa Romuk
Received by the Editor on 31 January 2011.
Abstract
The rapid development of computer science over the past years has given many other fields of science access to the solutions and technologies that this development has produced.
Medicine and the natural sciences have become branches of science whose research generates huge amounts of data. This data has to be processed and stored in some way. The task fell to IT and bioinformatics specialists who, having the appropriate database technologies and a sound theoretical grounding in the natural sciences, could solve the problem of storing and processing large amounts of biological information.
As the collection of information keeps growing, solutions and techniques that focus on storing specific medical data in databases have to be tuned ever more closely to that purpose. One of the main reasons for establishing biological databases is the very large amount of genetic information, including nucleotide sequences, for which there is no better storage method than database systems. Practically since 1977, when the Sanger sequencing method was invented, the problem of storing and processing genetic information has remained topical.
GenBank is one of the world's best-known databases, storing tens of millions of nucleotide sequences of DNA and RNA. In this article we present a new system designed to explore genetic data in the GenBank database. The search GenBank system not only allows users to search and browse biological data in GenBank, but also combines GenBank entries with items in other NCBI databases. It thus provides cross-database exploration.
Addresses
Artur SIĄŻNIK: Silesian University of Technology, Institute of Informatics, ul. Akademicka 16, 44-100 Gliwice, Poland, [email protected].
Bożena MAŁYSIAK-MROZEK: Silesian University of Technology, Institute of Informatics, ul. Akademicka 16, 44-100 Gliwice, Poland, [email protected].
Dariusz MROZEK: Silesian University of Technology, Institute of Informatics, ul. Akademicka 16, 44-100 Gliwice, Poland, [email protected].
STUDIA INFORMATICA, Volume 32, Number 3B (99), 2011
Miłosz GÓRALCZYK, Jarosław KOSZELA
Military University of Technology, Informatics System Institute
ARCHITECTURE OF OBJECT DATABASE MUTDOD
Summary. This article gives an overall description of the architecture of the distributed object database MUTDOD, which is being developed at the Military University of Technology. It describes the metamodel schema and the unified data model, which is shared between the application and the database environment. The division of the query execution process is also explained.
Keywords: architecture, metamodel, distributed object database, MUTDOD
1. Introduction
According to an IDC report [4], in 2008 computer systems and their users produced about 427 EB (exa, 10^18) of electronic data, and in 2012 they will produce over 2 ZB (zetta, 10^21), which is over 2 billion terabytes. This means that the volume of data produced grows by roughly 50% every year, and the growth will probably be even higher in the following years. To handle this, software engineers have to make data storage and processing mechanisms more effective. One way to address the problem may be large-capacity distributed database systems, which store and process large volumes of data.
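The annual growth rate implied by the two IDC figures quoted above can be checked with a short calculation. The sketch assumes compound growth over the four years from 2008 to 2012 and takes exactly 2 ZB as the 2012 value, so the result is a lower bound for "over 2 ZB".

```python
start_eb = 427.0   # data produced in 2008, in exabytes
end_eb = 2000.0    # 2 ZB expressed in exabytes (1 ZB = 1000 EB)
years = 2012 - 2008

# Compound annual growth: end = start * factor ** years
growth_factor = (end_eb / start_eb) ** (1 / years)
annual_growth_pct = (growth_factor - 1) * 100

# 2 ZB = 2 * 10^21 bytes, 1 TB = 10^12 bytes
terabytes = 2 * 10**21 // 10**12

print(f"{annual_growth_pct:.0f}% per year, {terabytes:,} TB")
```

Under these assumptions the two figures correspond to growth of a little under 50% per year, and 2 ZB is indeed 2 billion terabytes.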
Computer systems are becoming increasingly complex. The time and budget needed to create a computer system may be reduced by using object-oriented languages, which allow a system to be divided into modules. OOP (Object-Oriented Programming) is also known for being developer-friendly. However, using an object-oriented language may cause problems with data access, especially when the system works with a relational database, the most popular kind of database nowadays. There is an impedance mismatch [5, 10] between the object model in the application and the relational data model in the database. That is why ORM (Object-Relational Mapping) tools are used to transform data between these two structures. Unfortunately, the additional mapping reduces the system's performance and requires a lot of additional code and configuration to be written (or generated).
2. What is MUTDOD?
In the future, object database systems [3, 10] may replace relational databases in most cases. At present there is no object database system that offers software engineers full and natural (intuitive) object orientation; these shortcomings are discussed in the following sections. MUTDOD is an attempt to create such a solution. MUTDOD stands for Military University of Technology Distributed Object Database, a system being developed at that university in Warsaw, Poland. MUTDOD is meant to be a platform for creating (complex) object applications with object data storage. This paper describes the object side of MUTDOD. More information about distribution in MUTDOD can be found in [5, 14].
MUTDOD is a project which includes its own object database engine and an ODBMS [6, 7] (Object Database Management System) which can be installed on distributed (Windows) hosts. There are two kinds of MUTDOD servers: a central server and a data server. The central server manages data hosts and query execution. At present MUTDOD works with only one instance of the central server, but in the future it will be divided and distributed to eliminate this performance bottleneck. A data server physically stores data and executes queries. There can be many data servers.
MUTDOD also contains an execution environment for client applications. At present it is built as an ODBC-like provider for .NET applications, but in the future a separate runtime environment (like CLR¹ or JRE²) may be required.
All MUTDOD elements, including the object database engine, are designed and implemented from scratch at the Military University of Technology. MUTDOD is built using .NET Framework 4.0 in C# and C++. At present it works only with .NET-based client applications. Support for other platforms and languages will be added in the future.
¹ CLR stands for Common Language Runtime; it is the runtime environment of the Microsoft .NET Framework platform; more information about the CLR can be found in [9].
² JRE: Java Runtime Environment; a platform for running applications created in Java.
3. Unified data model
The von Neumann architecture specifies [11] that an application's executable code and its data are kept in the same memory. This holds to this day: all of these elements are in RAM, but bulk data is stored on a hard disk or in a database, usually on a remote host. As a result only objects in RAM are easily accessible to the programmer (all operations can be performed on the fly). Stored data is loaded in a cumbersome way. First, a mapping has to be created from the database's relational structure to the application's business objects. Even if a generator is used, customizing or optimizing the metadata still takes a lot of effort. Second, to use (remote) data from a database the programmer has to create a database connection, send a query, check whether execution produced any errors, etc. This is additional and needless effort. The programmer should be able to access remote data in the same way as local data.
Over the years software engineers have got used to accessing database data in that tedious way, but there is no need to do it like that. For the programmer there should be no difference between remote and local data (except for access latency). MUTDOD is meant to provide unified access to both types of data. This means that data access and processing are transparent to the programmer: the developer does not have to care about data location. For example, consider a computer system that registers all cars and their owners in Poland. The database holds data for the whole country, probably many gigabytes. There is also an object application which provides CRUD³ operations on all objects. To implement registration of a new car, the programmer has to get data about the owner and his other cars. He creates a new instance of the Car class, fills in its attributes and... that is all; nothing else has to be done. The programmer does not have to care about synchronizing data with the database, which is the role of MUTDOD. Note that with MUTDOD the programmer creates a database object as if it were a local object. Update and delete operations work the same way: the programmer just works on objects and does not need to worry about how to get access to them.
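The car registration scenario can be mimicked in a few lines of Python to show what "creating a database object as if it were a local object" means in practice. This is only an analogy: ObjectStore, Persistent and STORE are hypothetical names invented for this sketch, the store is a plain in-memory list, and none of it reflects the actual MUTDOD interface.

```python
class ObjectStore:
    """Stand-in for a database: remembers every persistent object created."""

    def __init__(self):
        self.objects = []

    def register(self, obj):
        self.objects.append(obj)

    def find(self, cls, **attrs):
        """Return stored instances of cls whose attributes equal attrs."""
        return [o for o in self.objects
                if isinstance(o, cls)
                and all(getattr(o, k) == v for k, v in attrs.items())]


STORE = ObjectStore()  # hypothetical; would be set up from a configuration file


class Persistent:
    """Base class whose instances are stored as a side effect of creation."""

    def __new__(cls, *args, **kwargs):
        obj = super().__new__(cls)
        STORE.register(obj)
        return obj


class Person(Persistent):
    def __init__(self, name):
        self.name = name


class Car(Persistent):
    def __init__(self, vin, owner):
        self.vin = vin
        self.owner = owner


owner = Person("Simpson")
Car("VIN-001", owner)  # no connection, query or save() call
Car("VIN-002", owner)

owned = STORE.find(Car, owner=owner)
print([car.vin for car in owned])
```

The application code only instantiates Car; registration with the store happens behind the scenes, which is the division of labour the paragraph above describes.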
To provide this functionality there is one key requirement: the data model of the application and of the database, or at least a shared part of it, has to be unified (more about this in the next paragraph). This means that objects in both schemas have the same attributes and methods, because objects have to be serialized and deserialized on both sides⁴. Using proper CASE tools will also reduce system design time, because only one data model is designed, a single object model shared by the database and the application. This contrasts with using two different models, a relational one for the database and an object-oriented one for the application, plus additional ORM mappings.
³ CRUD stands for the create, read, update and delete operations.
Fig. 1. Integral parts of an object-oriented computer system
What is more, thanks to the unified data model the programmer only has to create a configuration file to access objects in the database; there is no need to create any database connector objects in code⁵. MUTDOD thus allows software engineers to think about systems in which data from an object database is an integral and easily accessible part of the whole system (see Fig. 1). They can think in terms of one unified data model.
It has to be noted that a unified object model for the database and the application does not imply that there is no difference between the two models. Some application objects (attributes or methods) may be available for local use only, while some objects in the database may be protected from client applications. When creating a data model the programmer may use extended encapsulation visibility modifiers besides the common public, private and protected. The new ones are database (for elements used only in the database) and application (for elements used only in the application). These will be the only two additional encapsulation states, and both will be provided with MUTDOD.
⁴ Note that there will be no mapping, because both data models are object models.
⁵ Such as SQL providers, ODBC, JDBC, etc. in current solutions.
4. Easy data selecting
For the majority of computer systems the most common operation is data selection, i.e. select queries. It is rare, however, to select all stored data without any filtering or grouping. A typical select operation also joins several tables to fetch data across relationships. For example, to get the VIN numbers of cars whose owner's name is "Simpson", two tables have to be joined: Car (which stores information about cars) and Person (which stores information about car owners).
Building such a query in OQL (Object Query Language) requires creating a database connector object, sending the query, and mapping the results⁶ (database objects) to the application's objects. In addition, an OQL query is usually hard-coded as a string in the application's code, so there is no syntactic or semantic check of the query at compile time. Even the OQL syntax seems at odds with the "object-oriented world", because there are no (or just a few) benefits of using OQL queries (and an object database) instead of SQL (and a relational database). As long as developers use OQL, they will still have to think in a relational way when writing queries.
Last but not least, a disadvantage of OQL is that there is no way to use the built-in functions (methods) of the application's language, e.g. .NET or Java string operations such as "To Lower Case", "Split", "Remove", "Replace", etc. Programmers working in high-level languages are used to them in application code, so why can they not use these functions in a database query? Consider again the previously mentioned car owners named "Simpson", in a situation where there is no validation in the application or the data has been migrated from many different systems. In that case the person's name attribute may contain values like "Simpson", "simpson", "SIMPSON", etc. To get full results, a developer has to write every expected combination of upper and lower case of the name string in the query's where clause (see the sample queries in Table 1).
Another way of selecting data from a database is to use Linq⁷,⁸. It provides ORM, a keyword-based syntax and compile-time error checking. This is useful for programmers, because they can use standard syntax autocompletion tools and do not have to check at runtime whether a query has a syntax error: compilation stops if any error is found in a Linq query. A particularly good thing about Linq is that programmers can use the built-in functions of .NET objects. So, in the case of different spellings of a car owner's name, programmers can reduce query complexity by using code like this: Owner.Name.ToLower().Equals("simpson"). For the whole query, see Table 1.
⁶ Notice that even if there is no data structure impedance mismatch between the models of an object application and a common object database, mapping the results to the application's objects is required if the database works with OQL.
⁷ LINQ (Language-INtegrated Query): an extension for .NET which makes access to data in a database easier.
⁸ Of course there are many other equivalents of Linq for .NET and other languages.
Table 1
Sample queries in object query languages

OQL:
    select CAR.VIN
    from CAR
    join PERSON on CAR.OWNERID = PERSON.ID
    where PERSON.NAME = 'Simpson'
       or PERSON.NAME = 'simpson'
       or PERSON.NAME = 'SIMPSON'

Linq:
    from car in DB.Cars
    join person in DB.Persons
        on car.OWNERID equals person.ID
    where person.Name.ToLower().Equals("simpson")
    select car.VIN

SBQL:
    Car(WHERE Owner.Name.ToLower().Equals("simpson")).VIN
As you can see, Linq still has an OQL-like syntax. In other versions of Linq⁹ all references are kept in the metamodel, so programmers do not have to use the join operation. Instead, a Person object contains a list of references to the owned cars, and a Car object has a reference to its owner, an instance of Person. It is a step in the right direction, because it reduces the complexity of queries, and thus also the time needed to develop them and the chance of making mistakes.
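The difference between the join-based and the reference-based formulation can be illustrated outside any database. The following is a minimal Python sketch over in-memory objects, not MUTDOD or Linq code; the classes Person and Car, the helper add_car and the sample VIN values are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    id: int
    name: str
    cars: list = field(default_factory=list)  # references to owned Car objects


@dataclass
class Car:
    vin: str
    owner: Person  # direct reference instead of an OWNERID foreign key


def add_car(owner: Person, vin: str) -> Car:
    """Create a car and register it on both sides of the association."""
    car = Car(vin=vin, owner=owner)
    owner.cars.append(car)
    return car


def vins_by_join(cars, persons, name):
    """Join-style query: pair every car with the person whose id matches."""
    return [c.vin for c in cars for p in persons
            if c.owner.id == p.id and p.name.lower() == name]


def vins_by_navigation(cars, name):
    """Navigation-style query: follow the owner reference, no join needed."""
    return [c.vin for c in cars if c.owner.name.lower() == name]


homer = Person(1, "Simpson")
marge = Person(2, "SIMPSON")
ned = Person(3, "Flanders")
cars = [add_car(homer, "VIN-001"), add_car(marge, "VIN-002"), add_car(ned, "VIN-003")]
```

Both functions return the same VIN numbers for the argument "simpson", but the navigation variant needs neither the second collection nor an explicit join condition, and the lower-casing replaces the enumeration of spelling variants.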
The next step in object query languages¹⁰ is moving to a fully object-oriented expression syntax. This kind of syntax is familiar to programmers who use any object-oriented language. Using a query language should be natural and intuitive for developers of object-oriented applications. There should be no influence of SQL, because it is unnecessary. As in Linq, an object query language should include the built-in keywords of the programming language in which the application is coded. An example of such a language is SBQL (Stack-Based Query Language) [2, 8].
At this point MUTDOD provides most of the basic functionality of SBQL. In the near future the query language will provide additional functionality to control the distribution of query processing. MUTDOD's query language will also handle unified data models, so some stages of query execution will be moved to the client's environment (see part 6).
⁹ E.g. Linq to Objects, Linq to Entities, etc.
¹⁰ Linq is also available for finding objects in .NET collections, but still with the same, eclectic SQL-like syntax.
5. Object metamodel
MUTDOD's data metamodel is based on the ODMG [1] metamodel. It has to keep many kinds of information about the stored data. The most important is information about the database schema, the classes used and the relations between objects. There are four kinds of relation between objects:
• generalization, which allows operating on an object's dynamic roles, so that in different contexts programmers can operate on different attributes of an object (e.g. cast a Person object to a Student object); the base class of all objects is the class Object;
• aggregation, which allows software engineers to design relations between a main object (container) and its parts; the main object can exist without any of its parts;
• composition, as above, but the main object cannot exist without its parts. In MUTDOD a composition is also used when a class is divided horizontally¹¹, i.e. some of the object's attributes are stored in a different location than the rest;
• association, a reference to another object.
A relation in MUTDOD is an ordered triple. It contains the OID of object A, the OID of object B and the type of the relation. OID stands for Object IDentifier [1]. The OID of any object in MUTDOD is unique. Except for primitive types (like int, double, OID, etc.), all attributes of a complex object are also objects.
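The ordered triple described above can be sketched as a small data structure. This is only an illustration in Python under stated assumptions: OIDs are modelled as plain integers, and the names RelationKind, Relation and related are hypothetical and do not come from MUTDOD.

```python
from dataclasses import dataclass
from enum import Enum


class RelationKind(Enum):
    GENERALIZATION = "generalization"
    AGGREGATION = "aggregation"
    COMPOSITION = "composition"
    ASSOCIATION = "association"


@dataclass(frozen=True)
class Relation:
    """Ordered triple: (OID of object A, OID of object B, kind of relation)."""
    oid_a: int   # assumption: an OID is represented as an integer
    oid_b: int
    kind: RelationKind


def related(relations, oid, kind):
    """Return the OIDs that `oid` points to through relations of `kind`."""
    return [r.oid_b for r in relations if r.oid_a == oid and r.kind is kind]


relations = [
    Relation(100, 200, RelationKind.ASSOCIATION),   # car 100 -> owner 200
    Relation(200, 300, RelationKind.COMPOSITION),   # person 200 -> separately stored part 300
    Relation(200, 100, RelationKind.ASSOCIATION),   # owner 200 -> car 100
]
```

Because the triple is ordered, the relation from 100 to 200 and the relation from 200 to 100 are two distinct entries, which is how a bidirectional association between a car and its owner would be recorded in this sketch.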
Another thing stored in the metamodel is information about the encapsulation of elements. In MUTDOD there are three common encapsulation states: public, protected and private. These are used to hide attributes and methods in inheritance. But, as written above, due to the unified data model for database and application there are two additional states: database and application. The first one makes an element "visible" on the database side, while the second one works on the application side. The new encapsulation visibility modifiers can be combined with the "old" ones, so one can create, for example, a database protected attribute or an application public method. The new modifiers allow creating objects which are used on one side only, such as a temporary object on the client's side.
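How the two families of modifiers might combine can be sketched as follows. The visibility rule below is an assumption made for illustration (the paper does not define the exact semantics): an element is visible only on the side(s) it is declared for, and within that side the usual public/protected/private rules apply. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, Flag, auto


class Access(Enum):
    PUBLIC = "public"
    PROTECTED = "protected"
    PRIVATE = "private"


class Side(Flag):
    DATABASE = auto()
    APPLICATION = auto()
    BOTH = DATABASE | APPLICATION   # element without a side modifier


@dataclass
class Member:
    name: str
    access: Access
    side: Side


def is_visible(member, side, from_subclass=False, from_same_class=False):
    """Assumed rule: the side is checked first, then the classic access level."""
    if not (member.side & side):
        return False                      # element does not exist on this side
    if member.access is Access.PUBLIC:
        return True
    if member.access is Access.PROTECTED:
        return from_subclass or from_same_class
    return from_same_class                # PRIVATE


temp_cache = Member("temp_cache", Access.PUBLIC, Side.APPLICATION)
pin = Member("pin", Access.PROTECTED, Side.DATABASE)
```

In this sketch temp_cache plays the role of an "application public" element (a temporary client-side object), while pin is a "database protected" attribute that never reaches client code.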
Fig. 2. MUTDOD's simplified metamodel schema without distribution elements
¹¹ When a class is divided horizontally, the metamodel creates two or more subclasses and links them to the base class with composition. The division of an object is transparent to programmers and has no influence on query syntax unless it is declared by the programmer.
The metamodel also keeps security information, because some information may be secret or protected by law, like personal identity data. MUTDOD's data metamodel is designed to store all of the above information¹²; its simplified schema is shown in Fig. 2 as a modification of the ODMG metamodel schema.
MUTDOD's distribution requires that information about data location be stored in the metamodel too. In MUTDOD data can be divided vertically [12]: some instances of a class can be stored on one host, while other objects of the same class are stored on another host. This may improve performance, e.g. when host A stores information about car owners while host B contains objects of motorcycle owners.
Data can also be divided horizontally [13], which means that some attributes can be stored on a different (e.g. secured or faster) host. Imagine a situation in which a Person object contains personal information about a car owner, a list of owned cars, a login, and an image used as an avatar for web access. This object can be divided horizontally into two parts: the first part, with the personal data and the car list, is for government use only and is stored on a secure, encrypted MUTDOD host; the second part, with the image and the login, is stored on a second (less secure) MUTDOD host available to the web application.
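The division of a Person object by attributes can be sketched in a few lines. This is a hypothetical Python illustration, not MUTDOD's storage code: the object is a plain dictionary, and the host names and the placement map are assumptions.

```python
def split_object(obj: dict, placement: dict, default_host: str) -> dict:
    """Group the attributes of one object by the host that should store them."""
    parts = {}
    for attr, value in obj.items():
        host = placement.get(attr, default_host)
        parts.setdefault(host, {})[attr] = value
    return parts


def merge_parts(parts: dict) -> dict:
    """Reassemble the logical object from its per-host parts."""
    merged = {}
    for part in parts.values():
        merged.update(part)
    return merged


person = {"name": "Simpson", "cars": ["VIN-001"],
          "login": "hsimpson", "avatar": "homer.png"}
# Assumed placement: web-facing attributes go to the less secure host.
placement = {"login": "web-host", "avatar": "web-host"}
parts = split_object(person, placement, default_host="secure-host")
```

A query that needs only the login and the avatar could then be answered by the web host alone, while merge_parts shows that the division stays transparent: the logical object can always be rebuilt from its parts.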
¹² Not all features are implemented at the moment.
6. Divided execution
The main problem of object applications which use ORM to operate on multiple rows from a database is that almost all operations on this data (especially updates) have to be processed iteratively, e.g. with foreach loops. This produces additional, unnecessary processing cost and network traffic: the application has to select the data, retrieve it (transformed into objects), update the object's attributes for each element of the collection, and send it back to the server. This is caused by the imperative nature of programming languages. An object database environment should provide a query language which combines features of both imperative and declarative languages, and MUTDOD's query language will do so.
MUTDOD will provide a runtime environment for the client side, which will allow using a unified data model and enable programmers to use declarative operations between imperative operations in the application's code. It will thus be possible to update an attribute of multiple objects with a single expression and with no data transfer from the server, as the whole expression will be executed on the server side. It could be something like this: Employee(Where Name=="Simpson").Salary *= 1.1, which increases the salary by 10% for each employee whose name is Simpson. Such a query has to be distinguished from the other application operations, which are coded in a programming language.
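The cost difference between the iterative and the declarative update can be made concrete with a toy model. The sketch below is plain Python with a hypothetical in-memory Server class that merely counts how many objects cross the "network"; it is not MUTDOD's client API. A 10% raise is expressed with integer arithmetic to keep the numbers exact.

```python
class Server:
    """Toy in-memory 'server' that counts objects sent over the network."""

    def __init__(self, employees):
        self.employees = employees            # list of dicts
        self.transferred = 0

    def select(self, predicate):
        rows = [e for e in self.employees if predicate(e)]
        self.transferred += len(rows)         # objects shipped to the client
        return [dict(e) for e in rows]        # the client works on copies

    def store(self, row):
        self.transferred += 1                 # object shipped back to the server
        for e in self.employees:
            if e["id"] == row["id"]:
                e.update(row)

    def update_where(self, predicate, attr, func):
        """Declarative update executed entirely on the server side."""
        for e in self.employees:
            if predicate(e):
                e[attr] = func(e[attr])


def make_server():
    return Server([{"id": 1, "name": "Simpson", "salary": 1000},
                   {"id": 2, "name": "Nowak", "salary": 2000}])


def raise_imperative(server):
    # ORM style: fetch the objects, change each one, send each one back.
    for emp in server.select(lambda e: e["name"] == "Simpson"):
        emp["salary"] = emp["salary"] * 11 // 10
        server.store(emp)


def raise_declarative(server):
    # One expression; only the request itself would travel to the server.
    server.update_where(lambda e: e["name"] == "Simpson",
                        "salary", lambda s: s * 11 // 10)
```

Both variants leave the data in the same state, but the imperative one moves every matching object to the client and back, whereas the declarative one moves none.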
To handle this, MUTDOD's client environment is designed as an extension of the server environment. Unlike a common database connection, MUTDOD's client has to take over two additional stages of query execution¹³. As shown in Fig. 3, these stages are query optimization and query interpretation.
Query optimization is the stage responsible for choosing the fastest way of executing a query, dividing the query into subqueries, etc. In MUTDOD it gets the additional role of identifying which operations need to return only references (OIDs) of the selected objects and which ones have to return all the data. For example, if the application uses the selected collection only to modify a value in each element, the database may return only OIDs, because no value from the objects will be used and there is no need to dereference anything. Another example is when all the data is displayed to the end user in a grid; in this case whole objects (with their values) have to be returned. That is why MUTDOD's client environment has to take responsibility for query optimization (or at least a part of it).
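The decision between returning references and returning whole objects can be sketched as below. This is a deliberately simplified Python illustration: the flag reads_values stands in for whatever analysis the client-side optimizer would perform on the application code, and the function names and the dictionary-based store are assumptions.

```python
def plan_result_shape(reads_values: bool) -> str:
    """Choose what the server should send back for a selection."""
    return "objects" if reads_values else "oids"


def execute(store: dict, predicate, reads_values: bool):
    """Run a selection over {oid: object} and return OIDs or full objects."""
    matches = {oid: obj for oid, obj in store.items() if predicate(obj)}
    if plan_result_shape(reads_values) == "oids":
        return sorted(matches)                        # references only
    return [matches[oid] for oid in sorted(matches)]  # dereferenced objects


store = {1: {"name": "Simpson"}, 2: {"name": "Nowak"}, 3: {"name": "Simpson"}}
```

For an update-only use of the result, execute(store, ..., reads_values=False) yields just the OIDs 1 and 3; for a grid display the same selection is called with reads_values=True and yields the objects themselves.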
Query interpretation complements the previous stage. It allows client applications to send a semi-processed (compiled) query, so its execution time should be lower than for a query fully processed on the server side.
Fig. 3. Division of the query execution process in an application working with MUTDOD
¹³ Of course these stages are automated and require no additional effort from the programmer.
Both stages require metadata from the database. The idea of a unified data model assumes that the database schema is shared with the client environment. Some other elements of metadata have to be shared too, because they are required by the optimization and interpretation processes. Among these elements are indexes, the current state of elements, and the visibility of elements (e.g. a secured object should not be available on the client side).
7. Summary
MUTDOD as a project is at the beginning of its research. The main idea, creating a platform for object systems with an object database, is constant, but minor ideas and features are evolving. MUTDOD is a project meant to find answers to questions such as: what an object database platform should look like; how to make coding complex applications easier; how to join the code of operations with collections of data; what features an object database server should have to become as popular and useful as relational databases; and many more.
Future steps of the MUTDOD project research will be:
• deeper unification of the client and database environments;
• making data processing more transparent for the programmer;
• extending the query language and trying to make it programming-language independent (to run on .NET, Java, Ruby, etc.);
• working on local processing distribution and system effectiveness;
• discovering the best FAR (Fragmentation, Allocation and Replication) strategy for data.
BIBLIOGRAPHY
1. Cattell R. G. G.: The Object Database Standard: ODMG 3.0. Morgan Kaufmann, 2000.
2. Subieta K.: Teoria i konstrukcja obiektowych języków zapytań. Wydawnictwo PJWSTK, Warszawa 2004.
3. Lausen G., Vossen G.: Models and Languages of Object-Oriented Databases. Addison Wesley Longman Limited, 1997.
4. IDC Group: As the Economy Contracts, the Digital Universe Expands. http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm, valid on 10th November 2009.
5. Brzozowska P., Góralczyk M., Jesionek Ł., Karpiński M., Kędzierski G., Kędzierski P., Koszela J., Wróbel E.: System obiektowy = obiektowa baza danych + obiektowa aplikacja. Studia Informatica, Vol. 31, No. 2B (90), Gliwice 2010.
6. Góralczyk M.: Projekt oprogramowania zarządzającego obiektową bazą danych. WAT, Warszawa 2010.
7. Karpiński M.: Projekt mechanizmu replikacji i synchronizacji elementów obiektowej bazy danych. WAT, Warszawa 2010.
8. Stack-Based Architecture (SBA) and Stack-Based Query Language (SBQL). www.sbql.pl, valid on 1st April 2011.
9. Microsoft Corporation: Common Language Runtime (CLR). http://msdn.microsoft.com/pl-pl/library/8bs2ecf4.aspx, valid on 1st April 2011.
10. Subieta K.: Słownik terminów z zakresu obiektowości. Akademicka Oficyna Wydawnicza PLJ, Warszawa 1999.
11. Rojas R., Hashagen U.: The First Computers: History and Architectures. MIT Press, Cambridge MA 2000.
12. Bellatreche L., Simonet A., Simonet M.: Vertical fragmentation in distributed object database systems with complex attributes and methods. 7th International Workshop on Database and Expert Systems Applications (DEXA'96), 1996.
13. Bellatreche L., Simonet A.: Horizontal fragmentation in distributed object database systems. Lecture Notes in Computer Science, Volume 1127/1996, Springer, Berlin Heidelberg 1996.
14. Karpiński M., Koszela J.: Object-oriented distribution in MUTDOD. Studia Informatica, Vol. 32, No. 3B (99), Gliwice 2011.
Reviewers: Dr inż. Ewa Płuciennik-Psota
Dr inż. Aleksandra Werner
Overview
MUTDOD (Military University of Technology Distributed Object Database) is a distributed object database being developed at the Military University of Technology. MUTDOD is designed as a platform for systems built in object-oriented programming languages.
The project aims to simplify the software development process, among other things by unifying the data model of the application and the database. The goal is to treat objects in the database in the same way as local objects. The database should be an integral part of the system rather than an isolated fragment of it (Fig. 1).
An important element of the system is the query language. The languages currently in use require an ORM layer to map database tables onto objects on the application side. On the other hand, OQL seems to be too strongly tied to SQL, and therefore does not seem to fit the object-oriented reality. The goal of MUTDOD is to integrate the query language with the programming language in such a way that the problem of storing objects is transparent to the programmer. The language is to allow local objects to be used in the same way as objects in the database, without the need to map the database class model on the client side. The differences between the languages are presented in Table 1.
Another important element is the data metamodel. In MUTDOD it is based on the metamodel proposed by ODMG. However, it is being extended to include information about security and distribution, and to work with the shared data model. A simplified schema of the metamodel is shown in Fig. 2.
Unifying the data model requires moving two stages of query processing to the client side: query optimization and query interpretation, as presented in Fig. 3. The MUTDOD client environment automates this process.
Addresses
Miłosz GÓRALCZYK: Wojskowa Akademia Techniczna, Wydział Cybernetyki,
ul. gen. Sylwestra Kaliskiego 2, 00-908 Warszawa, Poland, [email protected].
Jarosław KOSZELA: Wojskowa Akademia Techniczna, Wydział Cybernetyki,
STUDIA INFORMATICA
Volume 32
2011
Number 3B (99)
Marcin KARPIŃSKI, Jarosław KOSZELA
OBJECT ORIENTED DISTRIBUTION IN MUTDOD
Summary. The paper gives an overall description of distribution in the object-oriented database created at the Military University of Technology. It describes possible distribution architectures and the ones provided in MUTDOD. The article contains information about distributed data storage, data processing, and the architecture of the synchronization and replication units. Details of the solutions provided in MUTDOD are also presented.
Keywords: distribution, query planner, replication, synchronization, MUTDOD
Polish title: OBIEKTOWO ORIENTOWANE ROZPROSZENIE W MUTDOD
Abstract. The article gives a general description of the distribution architecture in the MUTDOD object database, which is being developed at the Military University of Technology. It describes possible architectural solutions together with the solutions planned for MUTDOD. The article contains information about the synchronization and replication modules and about the solutions used for distributed data storage and processing.
Keywords: distribution, query planning, replication, synchronization, MUTDOD
1. Introduction
Military University of Technology Distributed Object Database (MUTDOD) is a new approach to storing and manipulating data. Nowadays systems usually need to process large amounts of data. Simple machines with standard database systems are slowly reaching their limits. To process large amounts of data, people have created computing clouds and clusters. Unfortunately, neither of them is a perfect solution for processing billions of records. Clouds are usually beyond the control of the organization: their units are controlled by a third-party company, which eliminates problems with scalability but reduces awareness of how crucial data is stored and processed. Using a self-managed cluster is also possible, but it needs a lot of management. Even if it is easily manageable, a cluster of relational databases still does not solve one problem: the system is probably written using the object-oriented paradigm, whereas the database is relational. This creates the need for an efficient ORM¹ [1, 3]. A database which can store data in a way that an object-oriented system understands (i.e. in object form) and which is capable of processing data on multiple units whenever possible would eliminate the need for an ORM solution and provide a tool for faster data processing. MUTDOD is an example of such a database. Not only is it able to perform as a single unit, but it can also operate like a cluster. It stores not only the attributes or properties of objects but also their methods, and in addition MUTDOD is able to execute stored methods in a distributed environment.
2. Central Server vs. P2P
Fig. 1. Central Server Architecture
The MUTDOD architecture is based on two different ideas that are nevertheless similar in some aspects. MUTDOD was designed as a single database unit which also has to be able to operate as an element of a distributed environment [9, 10]. A natural way of meeting that requirement would be a central server architecture. This type of architecture has some serious disadvantages which largely rule it out for crucial data. The first and most important disadvantage is the central unit. Its failure causes a failure of the whole database, and crucial data becomes unavailable. Moreover, the central unit is a gateway for all communication, so it has to process all incoming requests, which slows down the whole system or even brings it down when the machine is no longer able to handle such a large number of connections. A P2P architecture is more efficient in such systems, but it causes other significant problems. First of all, there is no central point for locking and control. Performing a transaction in such an environment is very complicated, because before anything happens, locks have to be set up on all machines. There are also serious timing problems. There is no simple solution for a situation in which two data manipulation requests of equal priority (because in a P2P architecture all machines are treated equally) are received. Which one should be executed first? A P2P architecture carries a high risk of deadlocks and starvation. Finally, all information has to be propagated to all devices to keep the database in a consistent state.
¹ ORM: object-relational mapping, necessary because of the impedance mismatch.
Fig. 2. P2P Architecture
Fig. 3. Hybrid Architecture
The third possible solution is a hybrid architecture. It mixes the characteristics of both architectures already mentioned. A hybrid architecture [5] is simply a central machine in a standard peer-to-peer environment. Such an add-on to a P2P environment solves the problems of timing and synchronization, but unfortunately brings back another one: the weak point of a central server.
Besides all the advantages and disadvantages of each described architecture, there is one more important characteristic which we should consider: transparency [8]. Transparency is a feature which can be applied at all levels of the database structure. First of all, the user should not be aware of the database structure. The fact that the user is working with a cluster rather than with a single node should be invisible to them. Users working with the database should not notice changes in the database structure either. They must not be forced to reconfigure the client software because some additional nodes have joined the cluster or because additional data replicas are now available. Information about distributed data processing [10] (i.e. that a user request is processed by more than one node) should not be visible to users². The user should not be aware of data localization, fragmentation, or the details of data processing, such as the number of nodes used for query execution and their localization. The MUTDOD system was designed to provide transparency on all available levels [3, 7, 11].
3. Federation vs. Election
The MUTDOD system is capable of working in two architectures: hybrid and P2P. The main idea of the MUTDOD architecture is distinguishing node types. Data nodes and management nodes cooperate in the environment, but they are not necessarily on the same machines. Of course, if a node does not have both a data server and a management server, it is not capable of performing as a single unit or a single database. Both the hybrid and the P2P mode have a single starting point: metadata replication. Mode switching depends on the configuration of the database. If all nodes store up-to-date replicas of the metadata, the mode can be switched in real time. If some nodes do not have up-to-date metadata, replication has to be performed before mode switching. MUTDOD is even capable of working in a semi-P2P architecture, in which a number of management servers are running but not all data servers have dedicated ones.
While working in the P2P architecture, the MUTDOD database acts like a federation of single units. They can exchange data on the fly or perform periodic updates. There is also a mode in which, when operations causing a synchronization problem occur, all of them are rolled back, and the system decides the order of the requests before they are executed once more. All of the provided mechanisms cause some additional network transfer and some delays due to the synchronization operations.
² Only when it is not important for the user; the user is able to manipulate query execution by using new keywords available in DDQL [6].
On the other hand there is the simple hybrid architecture. MUTDOD's hybrid architecture has a significant advantage over the standard version: the central server is not appointed permanently, and the system decides which node should be the central one at a given moment. When creating a cluster, the administrator decides which node is the chosen one and also sets up which node is its deputy. When one of these nodes goes down, the other performs a re-election and finds its substitute. In this way the whole database is protected from a single node failure.
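The central/deputy failover can be sketched as a small state machine. The Python below is only an illustration under stated assumptions: the paper does not say how the substitute is chosen, so the policy used here (the alphabetically first live node that holds no role) is invented for the example, as are the class and method names.

```python
class Cluster:
    """Tracks which node is central and which is its deputy."""

    def __init__(self, nodes, central, deputy):
        self.alive = {n: True for n in nodes}
        self.central = central      # chosen by the administrator
        self.deputy = deputy        # also set up by the administrator

    def _pick_substitute(self, exclude):
        # Assumed policy: first live node (by name) that holds no role yet.
        candidates = sorted(n for n, up in self.alive.items()
                            if up and n not in exclude)
        return candidates[0] if candidates else None

    def fail(self, node):
        """Mark a node as down and re-elect roles if it held one."""
        self.alive[node] = False
        if node == self.central:
            self.central = self.deputy                    # deputy takes over
            self.deputy = self._pick_substitute({self.central})
        elif node == self.deputy:
            self.deputy = self._pick_substitute({self.central})


cluster = Cluster(["a", "b", "c", "d"], central="a", deputy="b")
cluster.fail("a")   # "b" becomes central, "c" becomes the new deputy
```

As long as at least one node survives, some node is central, which is the property that protects the database from a single node failure.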
Which mode is better? It is hard to say. Both modes have advantages and disadvantages. The cost of the peer-to-peer mode is the amount of data transfer required to set up locks on objects and to recover from starvation or deadlocks. In the election mode these problems do not occur, but there is a weak point: all requests have to pass through the central unit. The database administrator or designer should choose whichever is more suitable for their needs.
4. Data replication
The core idea of distributed systems is processing data on more than one machine. This is a natural way of improving system performance. However, before distributed processing becomes possible we have to take care of one more thing: before a query can be processed by more than one node, all those nodes have to have the necessary data. The process of copying data from one unit to another is called replication. There are three types of replication: snapshot replication, merge replication and transactional replication [14].
In the situation described above we have to use the first type of replication, snapshot replication. As the name suggests, it is similar to taking a snapshot of the data available on one unit and transferring it to a different machine. This is not only a starting point for parallel data processing but also a way of protecting data: if the first node fails, we still have access to the data on the second one. Unfortunately, it is also a starting point for some trouble. In relational databases, holding more than one version of data is called redundancy and is usually something to avoid, because it is a potential route to database inconsistency. Having the same information saved in more than one record³ means that all records have to be modified when the data needs to be changed. If even one copy is not updated in time, the database will be in an inconsistent state, and resolving which record is up to date will be impossible. The process of keeping the database in a consistent state by updating all copies of the same data is called transactional replication or data synchronization. Unfortunately there is one more problem: what if the second unit already has some data, and what if some of its data is a copy of data from the first node?
³ In relational databases, and in more than one object in object-oriented ones.
To solve this problem, the database (or any distributed environment) has to be able to perform merge replication. Merge replication is a process of synchronizing data between two or more nodes. Units exchange data to reach a state in which the nodes are copies of each other, so that any node has the data from all other nodes. This type of replication has to solve the problem of two versions of the same data. If the first node possesses the same data as the second node but in a different version, we have to find out which version is up to date.
Merge replication in the MUTDOD system uses a simple mechanism to solve this problem. The administrator or database designer has to choose which version should be used. The MUTDOD system has some built-in solutions for this, namely algorithms in which the version problem is solved using node priority. For example, one algorithm always uses the version from the node that is joining the system, whereas a second algorithm always uses the version that is already in the system. Moreover, the MUTDOD system offers the database designer the possibility of designing and implementing their own algorithm to solve this problem. At this point the administrator can implement a custom comparison algorithm in the C# language using the built-in IComparer interface.
Fig. 4. IComparer Interface
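The pluggable conflict resolution described above can be sketched as a function parameter. The following Python is an analogue written for illustration, not MUTDOD's C# code and not the IComparer interface itself: the Version record, its logical timestamp field and all function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    oid: int
    value: str
    modified: int   # hypothetical logical timestamp, used by the custom policy


def joining_wins(existing, joining):
    """Built-in style policy: always take the version from the joining node."""
    return joining


def existing_wins(existing, joining):
    """Built-in style policy: always keep the version already in the system."""
    return existing


def newest_wins(existing, joining):
    """Example of a designer-supplied policy (an assumption, not from the paper)."""
    return joining if joining.modified > existing.modified else existing


def merge(cluster: dict, joining_node: dict, resolve) -> dict:
    """Merge {oid: Version} maps, resolving version conflicts with `resolve`."""
    merged = dict(cluster)
    for oid, incoming in joining_node.items():
        merged[oid] = resolve(merged[oid], incoming) if oid in merged else incoming
    return merged


cluster_data = {1: Version(1, "old", 5), 2: Version(2, "x", 1)}
joining_data = {1: Version(1, "new", 3), 3: Version(3, "y", 1)}
```

Objects present on only one side are simply copied; only object 1, which exists in two versions, is passed to the chosen policy.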
As mentioned above, having more than one copy of the same data means that nodes have to be synchronized when data is updated. The transactional replication algorithm has to use all the advantages of the system architecture, because this replication runs constantly and can produce a large amount of transfer between nodes.
The transactional replication algorithm available in the MUTDOD system is simple and effective in both possible architectures. The designed algorithm is based partially on previous work on this problem [13]. In many present-day databases transactional replication is performed by choosing one node which will be a gateway for all modifications. Such a node has to track all modifications and the localization of all objects, and perform all operations necessary to keep the database in a consistent state. This approach, however, has one security risk: node failure⁴. The MUTDOD algorithm is based on this approach but removes the risk. When an update operation (i.e. any operation which changes data) is performed, the central server (or the management node to which the client is connected) chooses the node which will be the gateway for this particular operation only. The node-choosing algorithm may work in many different ways: the designer can implement their own, or a load-balancing one can be used. This node is responsible for updating the data on all nodes that hold a copy of the modified data. To speed up the process, data distribution is divided. The whole process is shown in Fig. 5.
⁴ The same situation as with the central-server architecture.
An important aspect of the replication mechanisms in MUTDOD is that they have to consider fragmentation [4] of data. Data can be fragmented both vertically and horizontally [13]. In an object-oriented environment these two types of fragmentation [12] are very similar. To distinguish them, in MUTDOD the vertical mode stores two objects on different nodes, while horizontal division is based on distinguishing parts of an object and assigning an individual OID to each of these parts.
Fig. 5. Data population algorithm
Client sends a query to the management node.
1. Management server starts the query execution.
2. All necessary objects are locked.
3. Management server chooses the node which will perform the operation; in the case of an add operation, the servers which should carry replicas are also chosen.
4. Server passes the necessary data to the chosen node.
5. The node updates its own data.
6. Node starts the operation on other nodes.
7. Node sends back information about the successful update.
8. Other nodes update their data and send information to the management server and the chosen node.
9. Information about successful replication on all nodes reaches the management server and can be replicated to other management servers (if more than one exists).
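The numbered steps above can be sketched as a single-process simulation. This Python is a simplified illustration, not the MUTDOD implementation: message passing is replaced by direct method calls, the replica map and all names are assumptions, and the gateway policy (the replica holder storing the least data) is just one possible load-based choice.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, oid, value):
        self.data[oid] = value
        return True                     # acknowledgement of a successful update


class ManagementServer:
    def __init__(self, nodes, replicas):
        self.nodes = {n.name: n for n in nodes}
        self.replicas = replicas        # oid -> names of nodes holding a copy
        self.locked = set()
        self.log = []

    def choose_gateway(self, oid):
        # Assumed policy: the replica holder with the least stored data.
        return min(self.replicas[oid],
                   key=lambda n: (len(self.nodes[n].data), n))

    def update(self, oid, value):
        if oid in self.locked:
            raise RuntimeError("object is locked by another operation")
        self.locked.add(oid)                                  # step 2
        try:
            gateway = self.choose_gateway(oid)                # step 3
            acks = [self.nodes[gateway].apply(oid, value)]    # steps 4, 5 and 7
            for name in self.replicas[oid]:                   # steps 6 and 8
                if name != gateway:
                    acks.append(self.nodes[name].apply(oid, value))
            ok = all(acks)                                    # step 9
            self.log.append((oid, gateway, ok))
            return ok
        finally:
            self.locked.discard(oid)                          # release the lock


n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
n1.data[99] = "filler"                  # n1 is more loaded than n2
server = ManagementServer([n1, n2, n3], {7: ["n1", "n2"]})
```

Because the gateway is chosen per operation, no single node has to track every modification, which is the point of the algorithm as described in the text.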
5. Distributed processing
When all necessary data replication operations have been performed, we can start dividing the work of executing the query. To divide the work among all nodes we have to use a powerful distributed query planner module [2]. This module has to answer a few questions. First of
all, we have to find out whether the query is worth dividing. Let us consider a simple query which
sums two values. We can easily say that it should be executed immediately on the node to
which the client is connected, because the time and CPU power needed for planning are much greater than those needed to simply execute the operation.
The second question is whether it is possible to divide the query. We can think of situations
in which one server has all the data that the query needs. In such situations there is no point in dividing the
work, because only one unit is able to execute the query. After these two main questions have been answered, the division can be performed.
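The two questions asked by the planner can be written down as a simple predicate. This is a minimal sketch under stated assumptions: the function name `should_divide`, the abstract cost units and the way divisibility is judged (more than one node holds the required data) are illustrative and do not come from the MUTDOD sources.

```python
def should_divide(estimated_exec_cost, planning_cost, nodes_holding_data):
    """Return True only if the query is both worth dividing and divisible.

    Costs are in arbitrary, hypothetical units; nodes_holding_data is the
    set of nodes that store objects needed by the query.
    """
    # Question 1: is planning cheaper than simply executing the query?
    worth_dividing = estimated_exec_cost > planning_cost
    # Question 2: can more than one unit take part in the execution?
    can_divide = len(nodes_holding_data) > 1
    return worth_dividing and can_divide


sum_of_two_values = should_divide(1, 10, {"n1", "n2"})   # too cheap to plan
single_holder = should_divide(100, 10, {"n1"})           # only one unit has the data
large_query = should_divide(100, 10, {"n1", "n2"})
```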
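[Figure 6: token tree. START leads to the non-algebraic operator "where". Its first subquery is the object name Person. Its second subquery is the algebraic operator '=', whose first subquery is a dereference operator applied to Name and whose second subquery is the literal 'Nowak'.]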
Fig. 6. Token tree
The process of dividing work among all nodes is based on a token tree [15]. This process has two
steps. The first step tries to divide the work among a number of nodes, based on the load of the nodes and
the data they possess. The result of the division is a query plan in the form of a number of queries, together with
information about the order in which the queries should be executed and how their results should be
combined. Sometimes a perfect division cannot be made and some queries need to be executed in
a certain order. After this is completed, the second step is performed: the token trees of the subqueries are analyzed and, if the machine has more than one CPU unit (see footnote 5), the work is further divided
among the CPUs.
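To make the notion of a token tree more concrete, the tree from figure 6 can be modelled with a few node types and a recursive evaluator. The class names (`Name`, `Literal`, `Eq`, `Where`), the `evaluate` function and the in-memory `store` are assumptions made for this sketch; the actual MUTDOD token classes are not described in the text.

```python
from dataclasses import dataclass


@dataclass
class Name:            # object name or attribute name
    value: str


@dataclass
class Literal:
    value: object


@dataclass
class Eq:              # algebraic operator '='
    left: object
    right: object


@dataclass
class Where:           # non-algebraic operator 'where'
    source: object
    condition: object


def evaluate(node, store, env=None):
    """Evaluate a token tree against a dict-based store (illustrative only)."""
    if isinstance(node, Literal):
        return node.value
    if isinstance(node, Name):
        # Inside a 'where', a name is dereferenced against the current object.
        return env[node.value] if env is not None else store[node.value]
    if isinstance(node, Eq):
        return evaluate(node.left, store, env) == evaluate(node.right, store, env)
    if isinstance(node, Where):
        return [obj for obj in evaluate(node.source, store)
                if evaluate(node.condition, store, obj)]
    raise TypeError(f"unknown token: {node!r}")


# Person where Name = 'Nowak'
tree = Where(Name("Person"), Eq(Name("Name"), Literal("Nowak")))
store = {"Person": [{"Name": "Nowak"}, {"Name": "Kowalski"}]}
result = evaluate(tree, store)
```

Because each subtree is a self-contained value, a planner could hand the two subqueries of an operator to different nodes or cores, which is the kind of division the two steps above describe.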
6. Other consequences of distribution
Distribution has far more consequences than those described in this article. They include query language modifications [6] needed to control distribution, such as data partitioning, localization or performance
switches. The metamodel [11] also needs to be adapted to the requirements of distribution. The data
about the location and partitioning of each object has to be saved and kept somewhere, so the metamodel has to provide for storing it. The last issue is data fragmentation: the metamodel and the
language, like the distribution module itself, have to be able to work with data partitioned both
vertically and horizontally.
7. Summary
As we have shown, the MUTDOD system is something more than a simple cluster. It
is able to work both as a single-unit database and as a cluster with either P2P or central
management architecture. MUTDOD addresses the problems connected with distribution so that
the user is not even aware of working with more than one unit. The system tries
to be as transparent as possible on all levels: architecture, structure, processing,
etc. MUTDOD offers a new look at database systems, one which fits present trends
in cloud computing and which can indicate directions of database evolution. The Military University
of Technology will continue working on MUTDOD and on finding the best FAR strategy for
data.
5 Or more than one core
BIBLIOGRAPHY
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
Jesionek Ł.: Projekt generatora planów wykonywania zapytań dla obiektowej bazy danych. Praca dyplomowa WAT, Warszawa 2010.
Góralczyk M.: Projekt oprogramowania zarządzającego obiektową bazą danych. Praca
dyplomowa WAT, Warszawa 2010.
danych. Praca dyplomowa WAT, Warszawa 2010.
Coulouris G., Dollimore J., Kindberg T.: Distributed Systems: Concepts and Design. Addison Wesley Longman, 2005.
Brzozowska P.: Projekt analizatora syntaktycznego i semantycznego obiektowej języka
zapytań. Praca dyplomowa WAT, Warszawa 2010.
Wróbel E.: Projekt interfejsu programistycznego dostępu do obiektowej bazy danych.
Praca dyplomowa WAT, Warszawa 2010.
Tanenbaum A. S., Steen M.: Systemy rozproszone Zasady i paradygmaty. WNT, Warszawa 2006.
Date C. J.: Twelve rules for a distributed database. Computer World 21(23), 1987.
Orfali R., Harkey D., Edwards J.: The Essential Distributed Objects Survival Guide. John
Wiley & Sons, 1996.
Koszela J., Góralczyk M.: Architecture of object database, in progress.
Malinowski E.: Fragmentation Techniques for Distributed Object-Oriented Databases.
University of Florida, 1996.
Sikorska M.: Praktyczne porównanie mechanizmów 2PC i 3PC. Wydział Inżynierii Mechanicznej i Informatyki Politechniki Częstochowskiej, 2006.
Garcia-Molina H., Ullman J. D., Widom J.: Systemy baz danych. WNT, Warszawa 2006.
Date C. J.: Wprowadzenie do systemów baz danych. WNT, Warszawa 2000.
Reviewers: Dr inż. Ewa Płuciennik-Psota
Dr inż. Aleksandra Werner
Overview
The MUTDOD project (Military University of Technology Distributed Object Database)
is a distributed object database operating in a federated distribution model.
The project, developed at the Military University of Technology, is intended above all to solve
the impedance mismatch problem, and thereby to provide an ideal platform for systems based on the object-oriented paradigm.
In contemporary systems we encounter the situation in which the objects of a system must
be transformed into tables of a relational database. To solve this problem, objects should be stored without changing their form, e.g. without unnecessary conversion to a textual form.
There is also a present-day tendency to move to Cloud Computing technologies,
i.e. distributed data processing. MUTDOD is meant to offer similar capabilities, i.e. to allow work with a single node as well as with many. If
the database has more than one node at its disposal, the system is to allow both data and computations to be distributed over all nodes. One of the assumptions of the MUTDOD system was
to reduce the work of the administrator and the designer to a minimum when distributing processing. The processing itself is to be as maintenance-free as possible, so as to
relieve the programmer as much as possible.
The query language is an important element of the system. The languages currently in use, both declarative and imperative, do not fully meet the requirements of the MUTDOD system.
For the programmer or administrator to be able to control the distribution process,
the language must be given syntactic forms dedicated to this task,
which would let the user not only mark the instructions that should be parallelized but also control the automatic version of this process. It must be kept in mind, however, that the placement of objects significantly affects the possibility of distribution.
Both facts mean that the MUTDOD system requires a language with specific features, which translates into the need to implement its own language.
MUTDOD is a new-generation database system. It can work both
as a simple single-machine database and as a federated multi-node system capable of processing and storing distributed data. The operating modes
with a central server or in the full P2P version, together with the ability to define custom algorithms for placing objects or comparing them, give administrators and
programmers full control over the operation of the system.
Addresses
Marcin KARPIŃSKI: Wojskowa Akademia Techniczna, Wydział Cybernetyki,
Jarosław KOSZELA, Miłosz GÓRALCZYK, Michał JASIOROWSKI,
Marcin KARPIŃSKI, Emil WRÓBEL, Kamil ADAMOWSKI,
Joanna BRYZEK, Mariusz BUDZYN, Michał MAŁEK
EXECUTIVE ENVIRONMENT OF DISTRIBUTED OBJECT
DATABASE MUTDOD
Summary. This paper gives an overall view of the main elements of the execution
environment of the distributed object database MUTDOD (Military University of Technology Distributed Object Database), which is being designed at the Military University of
Technology in Warsaw.
Keywords: environment, distributed object database, MUTDOD
EXECUTION ENVIRONMENT OF A DISTRIBUTED OBJECT DATABASE
Abstract. The material presents a general outline of the main mechanisms of the execution
environment of the distributed object database MUTDOD (Military
University of Technology Distributed Object Database), which is being developed at the Military University of Technology in Warsaw.
Keywords: environment, distributed object database, MUTDOD
1. Introduction to MUTDOD’s architecture
MUTDOD stands for Military University of Technology Distributed Object Database,
which is being created at the Military University of Technology. MUTDOD is meant to be a next
step in the evolution of database engines: it aims to be a platform for object systems containing
applications coded in object languages (e.g. C#, Java, etc.) that work with data
stored in an object database.
MUTDOD has several advantages. First of all, it is a purely object-oriented environment
for all elements of the object system. Because applications and the database share a unified data model,
there is no need to use any ORM (Object-Relational Mapping). What is more, programmers do not have to use special connectors to get data from the database; they can simply
query for data with an easy and intuitive syntax, similar to the one used to access an application's
local objects stored in memory.
A second advantage of MUTDOD is its distributed environment. Nowadays it is nothing
special to work in a multi-core and multi-host environment, but unfortunately most applications
use only one core. "Common distribution" requires additional effort from the programmer, who has to divide algorithms into parts manually. MUTDOD is meant to help the
programmer handle this problem by providing a distributed database environment that is as
easy to use as a single-node one.
At this point many aspects of MUTDOD's inner processing are still planned for development, some are already designed and a few are already implemented. This
paper provides an overall description of the main mechanisms of the MUTDOD execution environment. The authors have tried to describe the most interesting aspects of their work.
2. Object-oriented language
The query language is a very important aspect of the MUTDOD project. Its semantics, syntax and pragmatics include new ideas, but the language remains user-friendly and enables
easy searching by various criteria. The query language designed for MUTDOD
contains elements of both declarative and imperative languages. Declarative statements are
independent and isolated, so they can be executed independently, which makes distributing the calculations
much easier.
Table 1
Simple query in SQL and DODQL
SQL:
  select e.Name, e.Surname
  from Employees e
  where e.profession = 'director';
DODQL (see footnote 1):
  Employees.Director(Name, Surname);
Queries should be generated in a way that allows optimization of the token tree. This makes it possible to compute query results not only on one machine but also on multiple computers. To perform multi-computer calculations, the base form of the token tree has to be divided
into separate, smaller, unrelated queries, which are then executed in parallel. The query
language stays consistent even when some object-oriented paradigms are added to it. This allows
the user to use the new language intuitively, without having to learn the syntax from the beginning. For example, to get the name and surname of a company's director, we
could write the query in the way presented in table 1.
1 DODQL: the MUTDOD query language, based on SBQL (Stack Based Query Language); see more at
www.sbql.pl
This is much shorter than the same query written in another language. Thanks to that,
programmers do not have to focus on syntax, because it is clear and intuitive. On the other
hand, which syntax of the following statements will be more intuitive for programmers: the one where the object from which data is taken is at the beginning, at
the end, or somewhere in the middle of the query?
University.Department.Dean(Degree, Name, Surname);
(Degree, Name, Surname)Dean.Department.University;
(Degree, Name, Surname) University.Department.Dean;
Nowadays some query languages, e.g. OQL and LINQ, aim to be object-oriented, but
they are still based on relational databases and translate operations on objects into queries. They generate queries from objects and return objects as results, but all operations are still executed on
tables in the database. The following example compares the same query written in OQL,
LINQ and the MUTDOD language:
Table 2
Simple query in OQL, LINQ and DODQL
OQL:
  select struct ( E: e.name, :e.dept.name )
  from e in Employees as e
  where e.id='10'
LINQ:
  from e in db.Employees
  where e.Id = 10
  select new{ e.Name, e.DeptName}
DODQL:
  (name,deptName)
  Employees(where Id = 10)
MUTDOD is going to work without query languages such as SQL, OQL or LINQ. Both the query language and the database model will be object-oriented. Thanks to that we can select and union
different objects. When a user's query asks for an object that does not exist in the database
and does not have its own class, e.g. an average salary, a temporary object is generated, which exists as
long as the session exists.
Many different issues have to be solved in MUTDOD, and the syntax, semantics and
pragmatics of the object-oriented language are the most important of them. Even the best solutions will not be
adopted unless they are clear, easy and user-friendly.
3. Storing of objects
The core part of every database management system is data storage. There are many problems concerning the storage of object data that the MUTDOD design needs to consider. Some of
them are related to the physical layer (close to hardware) and the optimization of I/O operations; others
concern logical data structures, which need to be organized so that
the smallest possible number of mappings is needed to retrieve objects based on their OID (Object Identifier). The storage also needs to keep track of the objects and metadata stored in one
instance of the database engine, which ensures consistency of data. We also have to resolve problems concerning efficient serialization and deserialization of objects so that they can be
stored as binary data in a persistent data store. Another important requirement of data
storage is the ability to access every single property of an object without activating (creating an
instance of) the object. This is needed for performing fast queries on object properties
that are not indexed.
Read/write operations on the database are often performed within a short time on
the same piece of data (the same object in this case), so the cache mechanism is
a very important part of the storage: it speeds up this kind of operation and reduces costly
persistent-storage I/O operations.
Fig. 1. Data access hierarchy diagram
After facing these problems during the design and implementation of MUTDOD, we arrived
at a solution that persists object data in such a way that the complex data structures represented by the stored objects can be accessed easily and
efficiently.
4. Indexes
An object-oriented environment requires a few different types of indexes for base types and
user objects. Usually an object consists of other objects, and that must be reflected in the local
indexes of the object. A two-way inclusion is easy to delete from the indexes when one object is
disposed of. The problem comes with one-way inclusion. After the destruction of an object contained
in another object, all index entries for this object should be deleted. This requires iterating
through all local indexes of all objects in the system, which is very costly for the system
and may lock a local index for a while. One idea for handling one-way inclusion is to
delete index entries not when the object is deleted but after the first attempt to access the non-existing object. The
result is a growing index size. An idea to prevent that is to iterate through all objects and
local indexes from time to time and remove unnecessary data. Another problem arises after casting an object to another type. After casting, the object should lose its local index entry in the other object (one-way
inclusion), but both objects should stay alive, so all
local indexes should be checked in order to keep the indexes consistent. In that situation the index entry cannot be deleted after the first failed access attempt, because both objects are still in the system and the OID (Object Identifier) is static. A solution
for all problems with one-way inclusion is to collect indexes in both directions. That simplifies
operations on indexes after objects change state, but increases the size of the indexes.
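The lazy deletion idea and the periodic cleanup can be illustrated with a small sketch. The `LocalIndex` class, its method names and the shared `store` dictionary are assumptions made for this example; they only show the trade-off described above and are not MUTDOD code.

```python
class LocalIndex:
    """Hypothetical local index of objects contained in one owner object
    (one-way inclusion: the contained objects do not know their owner)."""

    def __init__(self, store):
        self.store = store      # shared dict: OID -> object
        self.entries = set()    # OIDs of contained objects

    def add(self, oid):
        self.entries.add(oid)

    def get(self, oid):
        """Lazy cleanup: drop a stale entry on the first failed access."""
        if oid not in self.entries:
            return None
        if oid not in self.store:
            self.entries.discard(oid)
            return None
        return self.store[oid]

    def sweep(self):
        """Periodic cleanup: remove every entry whose object no longer exists."""
        stale = {oid for oid in self.entries if oid not in self.store}
        self.entries -= stale
        return len(stale)


store = {1: "a", 2: "b", 3: "c"}
index = LocalIndex(store)
for oid in store:
    index.add(oid)

del store[2]                       # object destroyed, index entry left behind
stale_before_access = 2 in index.entries
missing = index.get(2)             # first access removes the stale entry
```

Deleting from `store` stays cheap because no index is touched at that moment; the cost is paid later, either on the first failed access or during `sweep`.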
Another type of index is the global OID-based index of all objects in the database. This kind of
index is required to quickly find the place in memory where an object is stored. The index key should
be created in a way that helps in comparing objects. For example, if the system gets a query, with
an OID, to look for similar objects, then after looking in the indexes the system should obtain a set of objects to
compare in detail. Comparing index keys first would discard objects that
are surely not similar, reduce the number of elements to compare and decrease query execution time.
In a multi-node database system the global OID index should also contain information about the server where the data is stored. This information should be used for locating files, but it is also useful
for the mechanism that divides query execution between computers in the database system.
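A global OID index that also records the node holding each object might look like the sketch below. The class `GlobalOidIndex` and its methods are hypothetical names chosen for the example; the real index key structure is not specified in the text.

```python
from collections import defaultdict


class GlobalOidIndex:
    """Hypothetical global index: OID -> name of the node storing the object."""

    def __init__(self):
        self._location = {}

    def register(self, oid, node):
        self._location[oid] = node

    def locate(self, oid):
        """Return the node holding the object, or None if the OID is unknown."""
        return self._location.get(oid)

    def group_by_node(self, oids):
        """Group the requested OIDs by node, e.g. as input for query division."""
        groups = defaultdict(list)
        for oid in oids:
            groups[self._location[oid]].append(oid)
        return dict(groups)


index = GlobalOidIndex()
index.register(1, "n1")
index.register(2, "n2")
index.register(3, "n1")
groups = index.group_by_node([1, 2, 3])
```

The grouping is the piece of information a query divider needs: each group can be turned into a subquery that runs where its objects already are.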
The extension index is an index typical of object-oriented architectures. The indexing mechanism should
contain two types of this index: simple and compound. The simple index collects all objects in the database that are of the same class. The compound index gathers all objects of
a class together with the objects of classes that inherit from it; in fact it is the union of the
simple indexes over the inheritance tree. Dynamic inheritance generates many operations,
so extension indexes should be very fast and react immediately to object changes. This
kind of index should be consistent at all times.
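The relation between the simple and the compound extension index can be shown in a few lines. The `ExtensionIndex` class below is an illustrative sketch with assumed names; it supports single inheritance only and recomputes the compound index on demand rather than maintaining it incrementally.

```python
from collections import defaultdict


class ExtensionIndex:
    """Hypothetical extension index over a single-inheritance class tree."""

    def __init__(self):
        self.parent = {}                  # class name -> parent class name (or None)
        self.simple = defaultdict(set)    # class name -> OIDs of direct instances

    def add_class(self, name, parent=None):
        self.parent[name] = parent

    def add_object(self, oid, cls):
        self.simple[cls].add(oid)

    def subclasses(self, cls):
        """Return cls plus every class inheriting from it, directly or not."""
        result = {cls}
        changed = True
        while changed:
            changed = False
            for name, parent in self.parent.items():
                if parent in result and name not in result:
                    result.add(name)
                    changed = True
        return result

    def compound(self, cls):
        """Union of the simple indexes over the inheritance tree rooted at cls."""
        return set().union(*(self.simple[c] for c in self.subclasses(cls)))


index = ExtensionIndex()
index.add_class("Person")
index.add_class("Employee", parent="Person")
index.add_class("Director", parent="Employee")
index.add_object(1, "Person")
index.add_object(2, "Employee")
index.add_object(3, "Director")
all_persons = index.compound("Person")
```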
The MUTDOD system is going to provide several kinds of indexes. In fact, the most important indexes for the user will be user-defined indexes. The system needs a complete set of indexes that is fast and kept in a consistent state. Testing different ideas and algorithms for storing
and operating on indexes yields schemes with the best balance between index size and the number of index operations needed after typical changes of object state.
5. Query Execution
DOD's architecture provides two types of servers: the central machine and data machines.
The central machine is a logic server. It manages the other machines and supports the pre-execution of queries. If the central server stops working, the other machines choose a new one. Data machines
store objects and execute queries provided by the central machine. More information about
MUTDOD's server architecture can be found in the related work [Object-Oriented Distribution
in MUTDOD].
Query execution in the MUTDOD system must be managed and optimized for a distributed
environment. An incoming query goes to the current central machine or to the node the operator has chosen (in full P2P mode). The first step is to analyze the query and check which machine has
the most free resources to resolve it. Another factor in choosing a machine is how
many of the objects needed to resolve the query are stored on it. Query analysis also checks whether
the query execution can be divided among several machines. When this is possible, the query is divided and sent to the selected machines, together with information about which machine should receive the results of
executing each part. The machine that manages the execution process waits for all
partial results and combines them into a complete result that can be sent back to the user.
When the query must be executed on one machine, it is sent to that machine with information on how to
send the result to the user.
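The selection of a machine and the split-and-combine step can be sketched as below. The dictionary layout of a machine (`name`, `free`, `oids`), the function names and the ranking rule (data locality first, free resources second) are assumptions for illustration; the text does not say how the two criteria are weighted.

```python
def choose_machine(machines, needed_oids):
    """Pick one machine for an undivided query.

    Assumption: prefer the machine storing most of the needed objects and
    use free resources (a hypothetical 0..1 score) as a tie-breaker.
    """
    best = max(machines, key=lambda m: (len(needed_oids & m["oids"]), m["free"]))
    return best["name"]


def execute_distributed(machines, needed_oids, run_part):
    """Send each machine the part it can answer locally and merge the results."""
    combined = []
    for machine in machines:
        local = sorted(needed_oids & machine["oids"])
        if local:
            # run_part stands in for remote execution of one query part.
            combined.extend(run_part(machine["name"], local))
    return combined


machines = [
    {"name": "m1", "free": 0.2, "oids": {1, 2}},
    {"name": "m2", "free": 0.9, "oids": {3}},
]
single = choose_machine(machines, {1, 2, 3})
merged = execute_distributed(machines, {1, 2, 3},
                             lambda name, oids: [(name, oid) for oid in oids])
```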
Integrating the query language with a programming language makes it possible to use the full potential of an object-oriented database, including the absence of impedance mismatch. Moving the query
interpretation and optimization module to the client side decreases the usage of server resources and
allows the server to handle more clients. This solution requires storing some metadata on the client side: the metamodel and the available procedures and functions.
6. Cache
Caching is one of the basic mechanisms for reducing query result retrieval time, and it has a key impact on the performance of a database system. Giving an immediate answer to
a re-executed identical query is not the mechanism's main function. It also shortens the evaluation time of
a query that is only partly similar to one that has already been evaluated. The token tree
is divided into independent parts and each part is stored separately. That solution requires a bigger
buffer but gives extra opportunities, as the example in table 3 shows.
Finding independent subqueries is one way to increase the probability that a required result is already stored
(and to avoid data redundancy), but it is not the only one. Sooner or later the buffer gets full,
and the mechanism deletes some of the results to make space for new ones. Which results are the least valuable is a very important question: the algorithm for estimating data value is crucial for attaining the best buffer hit ratio. There are many approaches to the issue: LRU (based on the last data
reference), NFU (based on the number of data references), Aging (mixed) and others. All of these
approaches will be available in MUTDOD. The mechanism will be able to adapt: it will
work for a period of time, gather statistics and choose the best algorithm for the present environment.
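Two of the replacement policies named above can be sketched in a few lines each. The classes `LRUCache` and `NFUCache` are generic textbook-style illustrations written for this example, not the MUTDOD buffer; Aging and the adaptive selection between policies are left out.

```python
from collections import Counter, OrderedDict


class LRUCache:
    """Evicts the entry whose last reference is the oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)            # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)     # drop the least recently used entry


class NFUCache:
    """Evicts the entry with the smallest number of references."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}
        self.hits = Counter()

    def get(self, key):
        if key not in self.items:
            return None
        self.hits[key] += 1
        return self.items[key]

    def put(self, key, value):
        if key not in self.items and len(self.items) >= self.capacity:
            victim = min(self.items, key=lambda k: self.hits[k])
            del self.items[victim]
            self.hits.pop(victim, None)
        self.items[key] = value


lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")          # "a" becomes the most recently used entry
lru.put("c", 3)       # evicts "b"

nfu = NFUCache(2)
nfu.put("a", 1)
nfu.put("b", 2)
nfu.get("b")
nfu.get("b")
nfu.get("a")
nfu.put("c", 3)       # evicts "a", referenced once versus twice for "b"
```

An adaptive mechanism of the kind described could run such policies against the same reference stream, compare their hit ratios and keep the better one.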
Dividing a query into subqueries, using indexes, removing dead subqueries, removing auxiliary names and pushing selection before join are the domain of other optimization mechanisms. Applying them all will greatly increase database performance.
Table 3
Caching process steps
Original query:
  Employee(where Name = "John"
    && Seniority = "10"
    && Superior = "Marvin").Salary
The original query is evaluated and stored as three separate subqueries:
  Employee(where Name = "John").Salary
  Employee(where Seniority = "10").Salary
  Employee(where Superior = "Marvin").Salary
That results in an immediate answer for the following query:
  Employee(where Seniority = "10"
    && Superior = "Marvin").Salary
The caching mechanism returns the intersection of suitable, already stored query results. In addition, the evaluation time of the following query:
  Employee(where Employee.Name = "John"
    && Employee.Seniority = "10"
    && Employee.Superior = "Marvin"
    && Employee.Occupation = "accountant").Salary
is the same as that of this subquery:
  Employee.Occupation = "accountant"
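The caching steps of table 3 can be reproduced with a small sketch that stores one result set per condition and answers conjunctions by intersection. The `SubqueryCache` class, the `(attribute, value)` representation of a condition and the sample `employees` data are assumptions made for the example.

```python
class SubqueryCache:
    """Hypothetical cache storing one result set per independent condition."""

    def __init__(self, evaluate):
        self.evaluate = evaluate    # callable: condition -> set of OIDs
        self.results = {}           # condition -> cached set of OIDs
        self.misses = 0

    def _lookup(self, condition):
        if condition not in self.results:
            self.misses += 1
            self.results[condition] = self.evaluate(condition)
        return self.results[condition]

    def query(self, conditions):
        """Answer a conjunction by intersecting the per-condition results."""
        sets = [self._lookup(c) for c in conditions]
        return set.intersection(*sets) if sets else set()


# Illustrative data only.
employees = {
    1: {"Name": "John", "Seniority": "10", "Superior": "Marvin", "Occupation": "accountant"},
    2: {"Name": "Ann", "Seniority": "10", "Superior": "Marvin", "Occupation": "clerk"},
    3: {"Name": "John", "Seniority": "5", "Superior": "Zoe", "Occupation": "accountant"},
}


def evaluate(condition):
    attribute, value = condition
    return {oid for oid, e in employees.items() if e[attribute] == value}


cache = SubqueryCache(evaluate)
original = [("Name", "John"), ("Seniority", "10"), ("Superior", "Marvin")]
first = cache.query(original)                 # three subqueries evaluated and stored
second = cache.query(original[1:])            # answered from the cache alone
misses_after_second = cache.misses
third = cache.query(original + [("Occupation", "accountant")])  # one new subquery
```

After the first query only the `Occupation` condition has to be evaluated for the extended query, which is the behaviour the last two rows of the table describe.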
7. Client environment
The problem of integrating programming languages and databases has remained open ever since database systems came into common use. Despite many studies on this subject, a widely accepted
combination of these elements still has not been developed. The most important reason
may be the lack of standardization and the fragmentation of the current solutions.
The first attempts to define the problem in the field of object database integration may be found
in the work of Carey and DeWitt [1], where client integration is described as one of the key elements of further research, and in the ODMG Language Bindings presented in the ODMG specification [2].
Integrating common programming languages (such as Java or C#) with database query
languages is complex, because they are based on different semantic foundations. Imperative
expressions describe the steps needed to obtain the desired result, whereas declarative
queries emphasize the result over the way to reach it.
In most cases direct data mapping is impossible due to differences in primitive types and
object definitions. Furthermore, database providers may take different views on topics
such as encapsulation, multiple inheritance or method persistence, which remain open
according to "The object-oriented database system manifesto" [3].
Related work [4] highlights aspects such as compile-time query optimization, fostering programmers' habits and IDE integration as keys to the popularization of a specific solution.
There are two main types of client integration:
- Explicit Query Execution, which can be realized either through provided APIs, where all database operations are hidden behind an interface unifying data access, or through Embedded Query
Languages (LINQ, JSQL). The second method is much safer and easier to use, because it is
not based on string expressions that cannot be checked, but it makes the language more complex.
- Orthogonal Persistence: every object is persistent during its lifetime, and no special actions are needed
to save its state. Since in the most popular scenarios not all objects
have to be stored for further use, Degree Persistence, where explicit
transaction calls are required to store an object's state, is used more often.
In accordance with the adopted database model, SBQL was chosen as the query language. The main task was to create a MUTDOD connectivity wrapper providing an easy query call
mechanism. It was developed as an API-level access mechanism allowing the execution of string-based queries. The first implementation was created to be called from C# 4.0, using
its late binding features. It supported basic type mapping and access to nested object
structures in a C#-like way, with the possibility of method calls.
Fig. 2. Simple database class diagram
Currently the main objective is the creation of an SBQL-based embedded query language. It
should be fully integrated with the host language, with basic type mapping and the possibility of calling database objects
from native code. We assume that no database model mapping is needed before use
and that no object mapping is performed, because of possible metamodel differences. Additional work is being carried out to provide IDE support, including auto-completion, query checking and precompilation
optimization.
8. Summary
As a research project, the Distributed Object Database System is constantly evolving. The described solutions, mechanisms and theses are constantly verified and modified, and new
ones are often developed. The final shape of MUTDOD is not yet specified. This article is meant to show
only the basis and direction of MUTDOD's evolution. The described problems show
that the design and implementation of an environment as complex as an object-oriented database
able to distribute both data storage and query execution is not an easy task. The presented
aspects show that the design of the MUTDOD system is based on usability and the comfort of its users.
Every module of MUTDOD conforms to these characteristics, but security
and performance are not forgotten either. A fully developed MUTDOD has a good chance of becoming
a widely used platform for object systems.
BIBLIOGRAPHY
1. Carey M. J., DeWitt D. J.: Of Objects and Databases: A Decade of Turmoil. http://www.cs.ubc.ca/~rap/teaching/504/2005/readings/objects.pdf, downloaded on 31st January 2011.
2. Barry D.: ODMG 2.0: A Standard for Object Storage. http://www.inf.puc-rio.br//~casanova/INF1731-BD/Referencias/odmg20-storage.pdf, downloaded on 31st January 2011.
3. Atkinson M., Bancilhon F., DeWitt D., Dittrich K., Maier D., Zdonik S.: The Object-Oriented Database System Manifesto. http://reference.kfupm.edu.sa/content/o/b//the_object_oriented_database_system_mani_85098.pdf, downloaded on 31st January 2011.
4. Cook W. R., Rosenberger C.: Native Queries for Persistent Objects. A Design White Paper. http://www.odbms.org/download/010.01%20Cook%20Native%20Queries%20for%20Persistent%20Objects%20August%202005.pdf, downloaded on 31st January 2011.
5. Cattell R. G. G.: The Object Database Standard: ODMG 3.0. Morgan Kaufmann, 2000.
6. Subieta K.: Teoria i konstrukcja obiektowych języków zapytań. Wydawnictwo PJWSTK, Warszawa 2004.
7. Lausen G., Vossen G.: Obiektowe bazy danych Modele danych i języki. Warszawa 2000.
8. Góralczyk M.: Projekt oprogramowania zarządzającego obiektową bazą danych. Praca dyplomowa WAT, Warszawa 2010.
9. danych. Praca dyplomowa WAT, Warszawa 2010.
10. Brzozowska P.: Projekt analizatora syntaktycznego i semantycznego obiektowego języka zapytań. Praca dyplomowa WAT, Warszawa 2010.
11. Wróbel E.: Projekt interfejsu programistycznego dostępu do obiektowej bazy danych. Praca dyplomowa WAT, Warszawa 2010.
12. Connolly T., Begg C.: Database Systems: A Practical Approach to Design, Implementation, and Management. Addison-Wesley, 2002.
Reviewers: Dr inż. Łukasz Wyciślik
Dr inż. Hafed Zghidi
Overview
MUTDOD is the Military University of Technology Distributed Object Database, i.e. a distributed object database being developed at the Military University of Technology. MUTDOD
is being created as a platform for systems built in object-oriented programming languages.
The basic element of MUTDOD is a new programming language designed from scratch, combining the advantages of both declarative and imperative languages. Such a language
must above all be very pragmatic and introduce as few elements
unfamiliar to programmers as possible. In addition, the new language will have to cope with
manipulating not only data but also the whole execution environment.
An important problem that the creators of MUTDOD had to face was also the efficient storage of data on computer media. The main problem here turned out to be
access to objects that would not require instantiating them.
The indexing mechanism also required thorough consideration. Because it can work in distributed mode, the database needs to store not only local indexes but also a global index containing the placement of objects on the nodes.
It was also important for the indexes to find an ideal mechanism for solving the problems with one-way relationships.
Much of the work on the MUTDOD system has been devoted to distributed query
execution. The system has to cope with dividing tasks into smaller elements and with merging
the results of subqueries into the final result, which is the answer to the user's query.
Addresses
ul. gen. Sylwestra Kaliskiego 2, 00-908 Warszawa, Polska, [email protected].
Miłosz GÓRALCZYK: Wojskowa Akademia Techniczna, Wydział Cybernetyki,
Michał JASIOROWSKI: Wojskowa Akademia Techniczna, Wydział Cybernetyki,
Marcin KARPIŃSKI: Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego,
Emil WRÓBEL: Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego,
Kamil ADAMOWSKI: Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego,
Joanna BRYZEK: Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego,
Mariusz BUDZYN: Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego,
Michał MAŁEK: Wojskowa Akademia Techniczna im. Jarosława Dąbrowskiego,
Aleksandra BIEŃKOWSKA
Uniwersytet Jagielloński, Instytut Informatyki
SIMULATION STUDY OF A COMMIT PROTOCOL
FOR MOBILE TRANSACTIONS
Abstract. This article is devoted to a simulation study of the mobile transaction commit protocol known as Transaction Commit on Timeout
(TCOT). It presents the results of simulations of the correctness of the TCOT protocol, which were carried out using the author's own program that enables simulation and visualisation of transaction execution in a mobile environment.
The results of the simulation study made it possible to introduce modifications
and improvements to the TCOT protocol.
Keywords: mobile transaction commit protocol, mobile transactions, mobile systems, simulation
SIMULATION RESEARCH ON A PROTOCOL FOR COMMITTING
MOBILE TRANSACTIONS
Summary. The article deals with problems related to the Transaction Commit on
Timeout protocol. The author created an application that enables simulation and visualisation of protocol execution in a mobile system. The program was used to carry
out simulations of TCOT execution for a number of events. On the basis of the simulation results, several modifications of the protocol are recommended.
Keywords: mobile systems, transaction commit protocols, mobile transactions,
simulation
1. Wprowadzenie
Systemy mobilne, opisane m.in. w monografii dotyczącej radiokomunikacji ruchomej
[11], to jeden z najnowszych i najbardziej dynamicznie rozwijających się obszarów informatyki. W ostatnich latach nastąpił znaczny wzrost rynku zastosowań mobilnych oraz obszaru
90
A. Bieńkowska
badań naukowych i rozwojowych z nim związanych. Z tego względu coraz więcej uwagi poświęca się rozwiązaniom przeznaczonym dla środowisk mobilnych. Do istotnych zagadnień
należą między innymi kwestie związane z działaniem mobilnych baz danych.
Database systems intended for mobile environments, described among others in the monographs [5, 7, 10], must take into account problems related to user mobility and to the specific nature of the mobile environment, such as: limited quality of communication between the mobile unit and the base station, resulting from the low quality of wireless links; limited energy and memory resources in the mobile unit (MU); and the possibility that the MU enters a so-called doze state or disconnects from the system in order to work in disconnected mode. These problems can lead to unexpected situations, such as: loss of communication between the MU and the base station, for example because the battery is exhausted or the MU is damaged; exhaustion of disk space in the MU; or disturbances on the link between the MU and the base station. User mobility often also entails the need to reconfigure the initial system settings while a transaction is executing, as well as the possibility that the mobile station moves to a neighbouring cell (handoff).
Because of these factors, query processing in transactions, and transaction commit protocols in particular, require a special approach. Such transactions, referred to as mobile transactions, are executed differently than in static distributed environments. In recent years many transaction commit protocols have been proposed for mobile systems, including a modified Two Phase Commit (2PC) protocol [1] and Transaction Commit on Timeout (TCOT) [7]. Descriptions of many other commit protocols can be found in [3, 9]. A formal approach to transaction commitment is given, among others, in [6].
The aim of this paper is a simulation study of the basic properties of the TCOT protocol. One of the most important is the use of a timeout parameter to verify that the individual mobile subtransactions have executed correctly. The simulations were devoted to checking the correctness of mobile transaction commitment in TCOT, above all the preservation of the atomicity of the global transaction and of data consistency in the mobile system.
The next section introduces the basic concepts used in the paper. Section 3 describes the TCOT commit protocol. Section 4 presents the results of the simulation study. Conclusions concerning improvements to the TCOT protocol are given in Section 5. Section 6 summarises the paper.
2. Characteristics of protocols in mobile systems
A Mobile Database System (MDS) is a distributed system composed of nodes on which data processing takes place, built on top of a cellular telephony system [1, 11]. A mobile system differs from a traditional distributed system in that the user changes location, which causes the network topology to change continuously. A mobile database system consists of stationary nodes and mobile nodes. The stationary nodes are base transceiver stations (BTS), stationary database servers (DBS) and fixed hosts (FH). The mobile element is the mobile unit (MU). Stationary nodes are interconnected by fast fixed links (e.g. wired or satellite). Mobile nodes connect to the stationary elements of the network over wireless links via the base stations.
As in a classical distributed system, there is no shared memory at all; each node has its own local memory. The distributed processing that takes place in this system is also fully asynchronous. Nodes communicate and pass information only by exchanging messages. There is no global clock in a mobile system; one can only attempt to synchronise the quartz clocks installed in the mobile computers.
A mobile distributed transaction (mobile global transaction) is a mobile transaction whose DML statements refer to tables located in at least two nodes of a distributed database. A mobile distributed transaction consists of a set of local transactions. One local transaction is created in each database that the mobile distributed transaction refers to. Each local transaction, as well as the distributed transaction itself, should have the properties of durability, consistency, isolation and atomicity. The participants of the mobile system, such as DBS, MU and BTS, take part in a transaction executed in that system.
Given the constraints resulting from the specific nature of mobile systems, the following essential properties of a transaction commit protocol have been identified. The protocol should:
• generate as few messages as possible in the wireless network, which reduces both the message load on the system and the risk of losing a message in the unreliable wireless network,
• as in any distributed system, guarantee atomic commitment of the transaction and consistency of the data in the system after the transaction completes,
• be non-blocking, i.e. allow every participant to complete the transaction independently, without waiting for the outcome of the transaction in other nodes and, in particular, without waiting for a failed node to recover (be repaired); this prevents deadlocks in the system (waiting indefinitely for a reply from another participant) and limits the number of messages sent.
Protocols used in distributed systems, such as 2PC or 3PC, are not a sufficient solution in a mobile system because of its communication constraints. For example, 2PC requires three phases of message exchange between the participants when the transaction executes correctly and five phases of communication when a failure occurs. In a mobile system this number of messages may prove too large. Communication disturbances may cause many messages to be lost and transactions to be aborted repeatedly and unnecessarily. Moreover, these protocols may block resources, which rules out their use in a mobile environment.
3. The TCOT protocol for committing mobile transactions
The TCOT protocol presented in [7] proposes solutions that address the above requirements for protocols in mobile systems.
Several kinds of participants of the mobile database system (MDS) take part in the TCOT protocol for committing a global transaction. Each kind of participant has a strictly defined role. The following roles are distinguished:
Coordinator (CO): this role is played by a base station (BTS). The coordinator manages the execution of the global transaction and is responsible for its atomic commit or abort.
Static participant: a database server (DBS) to which the coordinator sends requests to execute a local transaction.
Mobile participant: a mobile unit (MU), which may change its location while the transaction executes. It initiates the global transaction by communicating with the coordinator. The MU stores a local copy (cache) of the database residing on the database server (DBS). After the local transaction has executed, the log of changes made to the cache is sent to the coordinator.
A detailed description of the execution of the TCOT protocol is given in [7].
Because the protocol uses a timeout parameter, both static and mobile participants decide independently whether to commit or abort their local transaction, which avoids blocking in the system.
During protocol execution, the course of the transactions carried out by the coordinator and by the members of the commit set is recorded in transaction logs. The coordinator keeps the log of the global transaction. When the transaction ends, the log kept in the MU is sent to the coordinator, which then forwards it to the appropriate DBSs so that the data in the server databases can be updated. This preserves global consistency in the system. In addition, placing the transaction log in the safer, stationary part of the system makes it possible to restore the data held in the MU cache in the event of a failure.
Handoff also deserves attention in the context of this protocol. Since a large distance between the coordinator and the MU may degrade the quality of the connection, a so-called migrating coordination model is adopted: when a handoff occurs, the transaction coordinator is changed.
Fig. 1. Graphical user interface
4. Simulation study of the TCOT protocol
To simulate the execution of the protocol, an application was created that makes it possible to analyse its successive steps. The application implements the TCOT protocol and provides textual and graphical visualisation of its execution. The graphical user interface is shown in Fig. 1.
Fig. 2. Example timeline of a single simulation run
The program makes it possible, among other things, to inspect the contents of the local databases after a simulation and to display the flow of messages between the participants of the system during protocol execution. An example timeline obtained from a single simulation run is shown in Fig. 2, and example steps of the TCOT protocol are shown in Fig. 3.
The purpose of the program is to let the user specify the operating conditions of the protocol as parameters and to simulate the protocol under those parameters, so that its execution can be analysed in various operating situations of the system.
The application makes the following additional assumptions:
• the simulated system contains only one mobile unit (MU), a number of BTSs computed while the program runs, and a user-defined number n of database servers DBS (n ≥ 1),
• a single global transaction is executed during a simulation (issues of transaction isolation and of concurrency among concurrent transactions are not considered).
Fig. 3. Example steps of the TCOT protocol
The program was used to simulate the execution of the protocol when various events specific to the mobile environment occur in the system. The simulation results made it possible to observe the behaviour of the protocol closely in different situations and revealed its important properties.
For the purposes of the simulation, a system was designed consisting of one mobile unit named mu0, two DBS nodes (dbs0 and dbs1) and multiple BTS stations (bts0, bts1...bts_n).
Each of the servers dbs0 and dbs1 hosts one database (dbs0_db and dbs1_db, respectively).
The database dbs0_db on server dbs0 contains two tables, Products and Clients, and the database dbs1_db on server dbs1 contains one table, Jobs.
The mobile unit holds a cache named mu0_db of a fragment of the database residing on server dbs0. The cache contains a copy of the Products table from dbs0_db.
The structure of the mobile database is shown in Fig. 4.
Fig. 4. Structure of the mobile database
The global transaction defined in the transaction parameter is divided into three local transactions to be executed by the simulation participants:
• Transaction fragment executed by mu0: insertion of the row (1, computer, 5000) into the Products table.
• Transaction fragment executed by dbs0: insertion of the row (1, Kowalski) into the Clients table.
• Transaction fragment executed by dbs1: insertion of two rows, (1, assistant) and (2, manager), into the Jobs table.
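The database layout and the three transaction fragments can be reproduced with a few in-memory SQLite databases. This is only a minimal sketch and not the author's simulator: the column names (id, name, price, title) and the helper functions build_databases, run_fragments and row_counts are assumptions made for illustration, since the paper gives only the table names and the inserted values.

```python
import sqlite3

# Hypothetical schemas: only the table names come from the paper.
SCHEMAS = {
    "mu0_db": ["CREATE TABLE Products (id INTEGER, name TEXT, price INTEGER)"],
    "dbs0_db": [
        "CREATE TABLE Products (id INTEGER, name TEXT, price INTEGER)",
        "CREATE TABLE Clients (id INTEGER, name TEXT)",
    ],
    "dbs1_db": ["CREATE TABLE Jobs (id INTEGER, title TEXT)"],
}

# One local transaction (fragment) per participant database.
FRAGMENTS = {
    "mu0_db": ("INSERT INTO Products VALUES (?, ?, ?)", [(1, "computer", 5000)]),
    "dbs0_db": ("INSERT INTO Clients VALUES (?, ?)", [(1, "Kowalski")]),
    "dbs1_db": ("INSERT INTO Jobs VALUES (?, ?)",
                [(1, "assistant"), (2, "manager")]),
}


def build_databases():
    """Create one empty in-memory database per participant."""
    dbs = {}
    for name, statements in SCHEMAS.items():
        conn = sqlite3.connect(":memory:")
        for statement in statements:
            conn.execute(statement)
        conn.commit()
        dbs[name] = conn
    return dbs


def run_fragments(dbs, commit):
    """Execute every fragment, then commit or roll back each local transaction."""
    for name, (sql, rows) in FRAGMENTS.items():
        dbs[name].executemany(sql, rows)
        if commit:
            dbs[name].commit()
        else:
            dbs[name].rollback()


def row_counts(dbs):
    """Return the number of rows in every table, keyed as 'database.table'."""
    counts = {}
    for name, conn in dbs.items():
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        for table in tables:
            counts[f"{name}.{table}"] = conn.execute(
                f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts
```

Rolling back all fragments leaves every table empty, which matches the "tables are empty" outcome reported for aborted transactions; committing all of them leaves the new Products row only in the MU cache until the update log is forwarded to dbs0.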
The following scenario illustrates an execution of the TCOT protocol in which no failure occurred and all subtransactions executed correctly:
• the MU sends the coordinator a request to start the transaction (TRANSACTION INIT),
• the coordinator sends tokens to the transaction participants,
• on receiving its token, each participant starts executing its local transaction,
• after executing the transaction, the DBS servers send a COMMIT message to the CO, while the MU sends the log of its changes in an UPDATES LOG message,
• once the time allowed for a possible GLOBAL ABORT has elapsed, the participants regard the transaction as committed,
• the MU cache update is forwarded to the DBS servers,
• when the protocol executes correctly, the local databases of all participants hold correct and consistent data after the global transaction; atomicity and consistency of the transaction are preserved.
If the local transaction of one of the participants fails, that participant sends an ABORT message to the coordinator and cancels its local transaction. On receiving ABORT, the coordinator cancels the global transaction and sends a GLOBAL ABORT message to those participants that have sent COMMIT. On receiving GLOBAL ABORT, the participants roll back their local transactions.
In this case the tables in the participants' databases are empty after the global transaction ends. The global transaction is not committed, all subtransactions are rolled back, and data consistency and transaction atomicity are preserved.
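The two scenarios without link failures can be condensed into a short sketch of the decision rules. It is a simplification written for illustration, not the simulator used in the paper: time is not modelled, links are assumed reliable, and the function name run_round and the state labels are hypothetical.

```python
PARTICIPANTS = ("mu0", "dbs0", "dbs1")


def run_round(failed_locally=()):
    """Simulate one TCOT round over reliable links.

    failed_locally: names of participants whose local transaction fails.
    Returns the coordinator's decision and the final state of each participant.
    """
    # Each participant executes its local transaction and reports the result.
    votes = {}
    for name in PARTICIPANTS:
        if name in failed_locally:
            votes[name] = "ABORT"
        elif name == "mu0":
            votes[name] = "UPDATES LOG"   # the MU sends its change log
        else:
            votes[name] = "COMMIT"        # DBS servers send COMMIT

    # The coordinator aborts the global transaction if any ABORT arrives.
    global_abort = "ABORT" in votes.values()
    # GLOBAL ABORT goes only to participants that reported success.
    notified = {name for name, vote in votes.items()
                if global_abort and vote != "ABORT"}

    # A participant that hears nothing before its timeout commits locally.
    states = {}
    for name in PARTICIPANTS:
        if votes[name] == "ABORT" or name in notified:
            states[name] = "aborted"
        else:
            states[name] = "committed"
    return ("abort" if global_abort else "commit"), states
```

Calling run_round() models the failure-free scenario, while run_round(failed_locally={"dbs1"}) models the case in which one local transaction fails and all participants end up rolled back.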
Table 1 presents the results of simulations carried out for selected failure cases of the mobile system.
Table 1
Simulation results for selected failure cases of the mobile system

Case A. Link failure while the coordinator is sending the token to the mobile unit
Exceptional events during protocol execution: While the token is being sent to the MU, the link fails and the token does not reach the MU. When the time allowed for the token expires in the mobile unit, the MU sends an ABORT message to the coordinator (the MU does not repeat its request for a token). On receiving ABORT from the MU, the coordinator sends GLOBAL ABORT to all participants that have sent COMMIT (dbs0 and dbs1). On receiving GLOBAL ABORT, servers dbs0 and dbs1 roll back their local transactions and compensate the changes they made in their local databases.
Result of protocol execution: As a result of these operations, the tables in the databases of all participants remain empty. The global transaction is not committed, but transaction atomicity and data consistency in all nodes are preserved.

Case B. Coordinator failure after the tokens have been sent to the participants
Exceptional events during protocol execution: After the tokens have been sent, the coordinator fails. The UPDATES LOG and COMMIT messages are not processed and the global transaction is not committed. The participants do not receive GLOBAL ABORT within the specified time and regard the transaction as committed.
Result of protocol execution: Transaction atomicity is not violated, because all local transactions have been executed. Data inconsistency may nevertheless arise in the system, because the updates of the MU cache are not passed on by the CO to server dbs0.

Case C. Link failure while the mobile unit (MU) is sending UPDATES LOG to the coordinator
Exceptional events during protocol execution: While the MU is sending UPDATES LOG, the link fails and the message does not reach the coordinator. Because the CO receives no message from the MU within the specified time (a timeout occurs), it does not commit the global transaction and sends GLOBAL ABORT to all participants that have sent COMMIT (here, servers dbs0 and dbs1). On receiving GLOBAL ABORT, servers dbs0 and dbs1 roll back their local transactions.
Result of protocol execution: The data in the system are inconsistent. The Products table in the MU cache contains one entry. The tables on servers dbs0 and dbs1 are empty. The atomicity of the global transaction is also violated, because the local transaction in the MU has not been rolled back.

Case D. Link failure while the coordinator is sending GLOBAL ABORT to the mobile unit
Exceptional events during protocol execution: Server dbs1, in which one of the operations failed, sends ABORT to the coordinator and cancels its local transaction. On receiving ABORT, the coordinator cancels the global transaction and sends GLOBAL ABORT to the participants that have sent COMMIT (dbs0 and mu0). While GLOBAL ABORT is being sent to the MU, the link fails and the message does not reach the mobile unit. Once the time allowed for a possible GLOBAL ABORT has elapsed, the MU regards the transaction as committed.
Result of protocol execution: After the global transaction ends, the Products table in the MU cache contains one entry. The tables on servers dbs0 and dbs1 are empty. The data in the system are inconsistent. The atomicity of the global transaction is also violated, because the local transaction has not been rolled back at every participant.

Case E. Link failure while a database server (DBS) is sending ABORT to the coordinator
Exceptional events during protocol execution: In node dbs1 the local transaction fails and is rolled back, and an ABORT message is sent to the coordinator. While dbs1 is sending ABORT, the link fails and the message does not reach the CO. Because the CO receives no message from dbs1 within the specified time (timeout), it does not commit the global transaction and sends GLOBAL ABORT to all participants that have sent COMMIT or UPDATES LOG (here, mu0 and dbs0). On receiving GLOBAL ABORT, mu0 and dbs0 roll back their local transactions.
Result of protocol execution: After the global transaction ends, the tables in the participants' databases are empty. Data consistency in the nodes has not been violated. The global transaction has not been committed, all subtransactions have been rolled back, and transaction atomicity has been preserved.
On the basis of these observations, modifications and improvements to the TCOT protocol have been proposed.
5. Conclusions and proposed improvements to the TCOT protocol
Although from a theoretical point of view the global decision to commit or abort the transaction is made correctly, certain events can leave the data in the system in an inconsistent state after the protocol completes.
The analysis below covers the simulation results described in the previous section that reveal situations in which the protocol behaves incorrectly, together with proposals for improving the protocol appropriate to each case.
5.1. Analysis of case B (coordinator failure after the tokens have been sent to the participants)
When the coordinator fails after sending tokens to (one or more) participants, data inconsistency arises in the nodes because the coordinator does not deliver the MU cache updates to server dbs0.
One possible solution is for the DBS to contact the coordinator and send an additional message reporting that it has not received UPDATES LOG. Once the failure has been repaired, the coordinator could re-establish communication with the MU and pass the update logs on to the DBS server.
5.2. Analysis of case C (link failure while the mobile unit is sending UPDATES LOG to the coordinator)
Because the UPDATES LOG message from the mobile unit does not reach the coordinator, the CO takes the lack of a message as a sign that the mobile unit has failed and does not send it GLOBAL ABORT. GLOBAL ABORT is sent only to those participants that have sent COMMIT (or UPDATES LOG). Meanwhile, the mobile unit commits its local transaction because it receives no GLOBAL ABORT. Had this message reached the MU, no data inconsistency would have arisen in the system.
An improvement that would make such situations less frequent would be for the coordinator, whenever it decides to terminate the global transaction, to send GLOBAL ABORT to all participants regardless of the kind of messages received from them.
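The difference between the original recipient rule and the proposed one can be shown with a minimal sketch of case C. It is an illustration under stated assumptions rather than the paper's implementation: the function simulate_case_c and its flag abort_all_participants are hypothetical names, and the link from the coordinator to the MU is assumed to work again by the time GLOBAL ABORT is sent.

```python
PARTICIPANTS = ("mu0", "dbs0", "dbs1")


def simulate_case_c(abort_all_participants):
    """Final participant states when UPDATES LOG from mu0 is lost.

    abort_all_participants=False models the original rule (GLOBAL ABORT only
    to participants whose vote arrived); True models the proposed change.
    """
    # Votes that actually reached the coordinator before its timeout.
    received = {"dbs0": "COMMIT", "dbs1": "COMMIT"}

    # A vote is missing, so the coordinator does not commit the transaction.
    if abort_all_participants:
        recipients = set(PARTICIPANTS)
    else:
        recipients = set(received)

    # A participant commits on timeout unless GLOBAL ABORT reached it.
    return {name: "aborted" if name in recipients else "committed"
            for name in PARTICIPANTS}
```

Under the original rule mu0 ends up committed while both servers are rolled back, which is the inconsistency reported for case C; with the flag set, all three participants are rolled back.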
5.3. Analysis of case D (link failure while the CO is sending GLOBAL ABORT to the mobile unit)
In this case the data inconsistency after the protocol completes is a consequence of the assumption that, in the mobile unit, not receiving GLOBAL ABORT from the coordinator within a specified time after sending UPDATES LOG is equivalent to commitment of the transaction.
A possible modification of the protocol would be for the CO to send the participants an additional GLOBAL COMMIT message at time Et, confirming that the global transaction has been committed. A participant would then unambiguously interpret the absence of any message from the CO as meaning that the transaction has not been committed.
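The effect of the proposed GLOBAL COMMIT message on a participant's decision can be sketched as a single rule evaluated when the participant's timeout expires. This is an illustrative sketch only: the function participant_outcome and the flag require_global_commit are hypothetical, and the set of received messages stands in for whatever the coordinator managed to deliver.

```python
def participant_outcome(received, require_global_commit):
    """Decide the local outcome once the participant's timeout has expired.

    received: set of coordinator messages that reached the participant.
    require_global_commit=False models the original TCOT rule (silence means
    commit); True models the proposed rule (silence means no commit).
    """
    if "GLOBAL ABORT" in received:
        return "aborted"
    if require_global_commit:
        # Commit only on an explicit confirmation from the coordinator.
        return "committed" if "GLOBAL COMMIT" in received else "aborted"
    return "committed"
```

In case D the GLOBAL ABORT addressed to the MU is lost, so the MU receives nothing: participant_outcome(set(), False) commits the local transaction, while participant_outcome(set(), True) rolls it back.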
6. Summary
This article presented the execution of the TCOT protocol and the results of simulations carried out with an application created for this purpose.
The analysis of the simulation results shows that, for some kinds of failure, the protocol leads to data inconsistency in the system and to a violation of the atomicity of the global transaction. In many cases the inconsistency arises because a participant does not receive the information that the global transaction has failed and wrongly interprets the absence of a message from the coordinator. Adopting such an assumption may be beneficial if only a small percentage of global transactions are aborted (GLOBAL ABORT messages are sent rarely) and the data in the mobile unit can be easily updated. The mobile unit can then finish the transaction earlier and quickly release the resources used by the transaction, which may matter when the MU executes a large number of transactions.
The simulations also illustrate the non-blocking property of the TCOT protocol. Thanks to the timeout, in none of the above cases was a participant blocked in an endless waiting loop. This is a particularly important property for communication over an unreliable wireless network, where messages are often lost and frequent blocking would hinder transaction processing. As the analysis of the simulation results shows, however, this often comes at the cost of losing the atomicity of the global transaction and data consistency in the system. This property of the protocol may be acceptable provided that such situations do not seriously disrupt the operation of the system. Otherwise, appropriate compensation mechanisms and strategies for using them must be introduced.
REFERENCES
1. Bernstein Ph. A., Hadzilacos V., Goodman N.: Concurrency Control and Recovery in Database Systems. Addison Wesley, Reading MA 1987.
2. Kumar Y., Dash K., Dunham M. H., Seydin A. Y.: A Timeout-Based Mobile Transaction Commitment Protocol. [in:] Proc. of ADBIS-DASFAA 2000, Advances in Database Systems and Information Systems, in cooperation with ACM SIGMOD, Prague, Czech Republic 2000.
3. Bobineau C., Pucheral P., Abdallah M.: A Unilateral Commit Protocol for Mobile and Disconnected Computing. [in:] Proceedings of the International Conference on Parallel and Distributed Computing Systems (PDCS), USA, August 2000.
4. Lin Y. W., Wu H. U.: Commit Protocol for Low-Powered Mobile Clients. IEICE Trans. on Information and System, Vol. E82-D, No. 8, 1999, pp. 1167–1179.
5. Imieliński T., Korth H. F.: Mobile Computing. Kluwer Academic Publishers, 1996.
6. Korth H. F., Levy E., Silberschatz A.: A formal approach to recovery by Compensating Transactions. [in:] Proceedings of the 16th VLDB Conference, Brisbane, Australia 1990.
7. Kumar V.: Mobile Database Systems. John Wiley and Sons, Hoboken 2006.
8. Nouali N., Doucet A., Drias H.: A Two-Phase Commit Protocol for Mobile Wireless Environment. [in:] Proceedings of the 16th Australasian database conference, Newcastle, Australia 2005.
9. Nouali N., Drias H., Doucet A.: Revisiting Distributed Protocols for Mobility at the Application Layer. [in:] The Third World Enformatika Conference, WEC'05, April 27-29, Istanbul, Turkey 2005.
10. Tanenbaum S., van Steen M.: Systemy rozproszone. Zasady i paradygmaty. WNT, Warszawa 2006.
11. Wesołowski K.: Systemy radiokomunikacji ruchomej. WKŁ, Warszawa 2003.
12. Wrembel R., Bębel B.: Oracle. Projektowanie rozproszonych baz danych. Helion, Gliwice 2003.
Reviewer: Dr inż. Andrzej Sikorski
Received by the Editorial Office on 3 March 2011.
Abstract
The design of transaction commit protocols intended for mobile systems requires the consideration of specific issues related to the characteristics of a mobile environment and to users' mobility. The article discusses in detail the Transaction Commit on Timeout protocol, which meets the requirements of a mobile system. Its main idea is to use a timeout parameter to identify the successful completion of a transaction in order to achieve the non-blocking property.
The author created an application that enables simulation and visualisation of protocol execution in a mobile system. The program was used to carry out simulations that illustrate different ways of TCOT execution for a number of events. The structure of the mobile database system designed for the purpose of the simulations is depicted in Fig. 4. The results of the simulations are presented in Table 1 and discussed in detail. Issues connected with transaction atomicity and consistency of data in the mobile system are considered in the context of TCOT execution. On the basis of the results, several modifications of the protocol are recommended.
Address
Aleksandra BIEŃKOWSKA: Uniwersytet Jagielloński, Instytut Informatyki,
ul. Prof. S. Łojasiewicza 6, 30-348 Kraków, Poland, [email protected].
STUDIA INFORMATICA
Volume 32
2011
Number 3B (99)
Jerzy MARTYNA
Jagiellonian University, Institute of Computer Science
MACHINE LEARNING FOR THE IDENTIFICATION OF THE DNA
VARIATIONS FOR DISEASES DIAGNOSIS
Summary. In this paper we give an overview of basic computational haplotype
analysis, including pairwise association with the use of clustering and tagged
SNP prediction (using Bayesian networks). Moreover, we present several machine learning
methods for exploring the association between human genetic variations and
diseases. These methods include clustering SNPs by a similarity measure
and selecting one SNP per cluster, support vector machines, etc. The
presented machine learning methods can help to generate plausible hypotheses for
some classification systems.
Keywords: computational haplotype analysis, SNP selection
Abstract. The paper presents basic machine learning methods for haplotype selection, including pairwise association with the use of clustering and prediction, tagged SNPs (Single Nucleotide Polymorphisms), Support Vector Machines (SVM), etc. These methods are applied in disease prediction. They can also help to generate plausible hypotheses for disease classification systems.
1. Introduction
The human genome can be viewed as a sequence of three billion letters from the nucleotide alphabet {A, G, C, T}. More than 99% of the positions of the genome possess the same
nucleotide across individuals. However, in the remaining 1% of the genome numerous genetic variations occur, such as
the deletion or insertion of a nucleotide, multiple repetitions of a nucleotide, etc. Many diseases are caused by variations in human DNA.
More than one million common DNA variations have been identified and published
in a public database [29]. These common variations are called single nucleotide
polymorphisms (SNPs). The nucleotides which occur most often in the population are referred
to as the major alleles. Analogously, the nucleotides which occur seldom are defined as the
minor alleles. For instance, nucleotide A (a major allele) occurs at a certain position of the genome, whereas nucleotide T (a minor allele) can be found at the same position of the genome.
Several diseases are identified by means of one of the SNP variations. Identifying
a mutation among the SNP variations at a statistically significant level allows one to postulate
a disease diagnosis. This is increasingly done by means of machine learning methods.
Currently, haplotype analysis is used to identify the DNA variations relevant for
the diagnosis of several diseases. Recall that a haplotype is a set of SNPs present
in one chromosome. Machine learning methods are therefore used for effective haplotype analysis
in order to identify several complex diseases.
The main goal of this paper is to present some computational machine learning methods
which are used in haplotype analysis. This analysis includes haplotype phasing,
tag SNP selection and identifying the association between a haplotype or a set of haplotypes and the target disease.
2. Basic Concepts in the Computational Analysis
All sexually reproducing species have two sets of chromosomes:
one inherited from the father and the other inherited from the mother. Every individual in a
sample therefore also has two alleles for each SNP, one of them in the paternal chromosome and the
other in the maternal chromosome. Thus, for each SNP the two alleles can be either the same or
different. When they are identical, the SNP is called homozygous. Otherwise, when the
alleles are different, the SNP is called heterozygous.
Fig. 1. Difference between haplotypes, genotypes and phenotypes
Let the major allele of an SNP be colored gray and the minor allele black. Let us assume that each individual's haplotypes are composed of six SNPs taken from his/her two
chromosomes. Thus, a haplotype is a set of the SNPs present in one chromosome. Each pair of
haplotypes stems from a pair of chromosomes, and each pair is associated with
one individual.
A genotype is represented by the combination of the two alleles at each SNP. When the combined allele is composed
of two major alleles, it is colored gray (see Fig. 1). In turn, when the SNP has two minor alleles, it is colored black. Finally, when the SNP has one
minor allele and one major allele, it is colored white.
A phenotype is an observable manifestation of a genetic trait. In other words,
the phenotype of an individual indicates the presence or absence of a disease (see Fig. 1c).
Haplotype analysis has advantages over single-SNP analysis. Single-SNP
analysis cannot identify a combination of SNPs in one chromosome. For example, the haplotype CTTCTA marked with an arrow in Fig. 1a indicates the lung cancer phenotype, whereas
the other individuals do not have lung cancer.
Haplotype analysis can be carried out in a traditional or a computational way. In the traditional analysis [22], [26], the chromosomes are separated, the DNA is cloned, a hybrid is constructed and,
as a result, the haplotype indicating the disease is identified.
The traditional haplotype analysis is carried out using biomolecular methods. However, it
is more costly than the computational analysis.
Computational haplotype analysis (which includes haplotype phasing and tag
SNP selection) has been successfully applied to the study of diseases associated with haplotypes. This analysis can be carried out by means of data mining methods.
3. Selected Methods of the Haplotype Phasing
3.1. The Pairwise Associated with the Use Clustering
The goal of the haplotype phasing is to find a set of haplotype pairs that can resolve all
the genotypes from the genotype data. Formally, let the haplotype phasing problem be formulated as follows:
Let $G = \{g_1, g_2, \ldots, g_n\}$ be a given set of $n$ genotypes, where each genotype $g_i$ consists of
the allele information of $m$ SNPs, $s_1, s_2, \ldots, s_m$, namely
$$
g_{ij} = \begin{cases}
0 & \text{when the two alleles of the SNP are major homozygous,} \\
1 & \text{when the two alleles of the SNP are minor homozygous,} \\
2 & \text{when the two alleles of the SNP are heterozygous,}
\end{cases}
$$
where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$.
The allele information of an SNP of a genotype is either major, minor or heterozygous.
Each genotype represents the allele information of SNPs in two chromosomes. Like the genotype, each haplotype $h_i \in H$ consists of the same $m$ SNPs $s_1, s_2, \ldots, s_m$. Each haplotype
represents the allele information of SNPs in one chromosome. We define haplotype $h_i$
($i = 1, 2, \ldots, 2^m$, $j = 1, 2, \ldots, m$) as follows:
$$
h_{ij} = \begin{cases}
0 & \text{when the allele of the SNP is major,} \\
1 & \text{when the allele of the SNP is minor.}
\end{cases}
$$
Fig. 2. Finding a set of haplotype pairs and ambiguous genotypes
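Now we can formulate the haplotype phasing problem as follows:
Problem: Haplotype phasing
Input: A set of genotypes $G = \{g_1, g_2, \ldots, g_n\}$
Output: A set of $n$ haplotype pairs
$$
O = \{ \langle h_{i1}, h_{i2} \rangle \mid h_{i1} \oplus h_{i2} = g_i,\; h_{i1}, h_{i2} \in H,\; i = 1, 2, \ldots, n \}
$$
where $h_{i1} \oplus h_{i2} = g_i$ means that the pair of haplotypes resolves the genotype $g_i$.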
The haplotype phasing is shown in Fig. 2. Three genotype data are given on the left side.
When the two alleles of an SNP are homozygous, the SNPs have the same color. When
a genotype contains at most one heterozygous SNP, the haplotype pair is
identified unequivocally. When a genotype contains two (or more) heterozygous SNPs, the
haplotype pair cannot be identified unequivocally. Such a genotype is then resolved by means
of an additional biological analysis method.
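The ambiguity described above can be illustrated with a minimal Python sketch. It assumes the 0/1/2 genotype coding and the 0/1 haplotype coding defined earlier in this section; the function name `resolving_pairs` is hypothetical and is not taken from the cited works.

```python
from itertools import product

def resolving_pairs(genotype):
    """Return all unordered haplotype pairs that resolve a 0/1/2-coded genotype."""
    # Positions of heterozygous SNPs (code 2); the others are fixed on both chromosomes.
    het = [j for j, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1 = [0 if g == 2 else g for g in genotype]
        h2 = list(h1)
        for j, b in zip(het, bits):
            h1[j], h2[j] = b, 1 - b  # the two chromosomes carry different alleles
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

unambiguous = resolving_pairs((0, 2, 1))  # one heterozygous SNP
ambiguous = resolving_pairs((2, 0, 2))    # two heterozygous SNPs
```

In this sketch a genotype with a single heterozygous SNP yields exactly one pair, while two heterozygous SNPs already yield two candidate pairs, which is the situation that requires additional information to resolve.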
We can use the following methods in the haplotype phasing:
1) parsimony,
2) phylogeny,
3) the maximum likelihood (ML),
4) the Bayesian inference.
The first two methods treat the haplotype phasing as a combinatorial problem [14]. The last two methods
are based on the data mining approach and are therefore presented here.
3.2. The maximum likelihood (ML) method for the haplotype phasing
The maximum likelihood method can be based on the expectation-maximization (EM)
method. This method, described among others in [14], works as follows:
Let $D$ be the genotype data of a group of individuals, where each genotype consists of $m$ SNPs.
Let $n$ be the number of distinct genotypes. We denote the $i$-th distinct genotype by $g_i$, the
frequency of $g_i$ in the data set $D$ by $f_i$, and the number of the haplotype pairs resolving
$g_i$ ($i = 1, 2, \ldots, n$) by $c_i$. When $H$ is the set of all haplotypes consisting of the same $m$
SNPs, the number of haplotypes in $H$ is equal to $2^m$. Although the haplotype population frequencies $\Theta = \{p_1, p_2, \ldots, p_{2^m}\}$ are unknown, we can estimate them by maximizing the probability of the
genotypes comprising the genotype data $D$, namely
$$
L(D) = \Pr(D \mid \Theta) = \prod_{i=1}^{n} \Pr(g_i \mid \Theta)^{f_i} = \prod_{i=1}^{n} \left( \sum_{j=1}^{c_i} \Pr(h_{1ij}, h_{2ij} \mid \Theta) \right)^{f_i} \tag{1}
$$
where $h_{1ij}, h_{2ij}$ are the haplotype pairs resolving the genotype $g_i$.
The EM method depends on the initial assignment of values and does not guarantee
a global optimum of the likelihood function. Therefore, this method should be run multiple
times with several initial values.
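The EM iteration for the likelihood (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [14]: it assumes random mating, so that the probability of a haplotype pair is the product of the two haplotype frequencies (doubled when the haplotypes differ), it uses genotype counts in place of the frequencies $f_i$, it restricts the haplotypes to those compatible with at least one observed genotype, and it starts from uniform frequencies. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

def resolving_pairs(genotype):
    """Unordered haplotype pairs that resolve a 0/1/2-coded genotype."""
    het = [j for j, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1 = [0 if g == 2 else g for g in genotype]
        h2 = list(h1)
        for j, b in zip(het, bits):
            h1[j], h2[j] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM (assumes random mating)."""
    counts = Counter(tuple(g) for g in genotypes)    # count of each distinct genotype
    pairs = {g: resolving_pairs(g) for g in counts}  # the c_i pairs of each genotype
    haps = sorted({h for ps in pairs.values() for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}        # uniform starting point
    n_chromosomes = 2 * sum(counts.values())
    for _ in range(n_iter):
        expected = dict.fromkeys(haps, 0.0)
        for g, f in counts.items():
            # E-step: posterior weight of each resolving pair under current frequencies
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs[g]]
            total = sum(weights)
            for (a, b), w in zip(pairs[g], weights):
                expected[a] += f * w / total
                expected[b] += f * w / total
        # M-step: new frequencies are the normalized expected haplotype counts
        freq = {h: expected[h] / n_chromosomes for h in haps}
    return freq

genotypes = [(0, 0)] * 3 + [(1, 1)] * 3 + [(2, 2)]
freq = em_haplotype_frequencies(genotypes)
```

In this toy data set the ambiguous genotype (2, 2) is explained almost entirely by the haplotypes 00 and 11, because these are the ones supported by the unambiguous genotypes. Running the sketch from several starting points, as recommended above, would require replacing the uniform initialization.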
3.3. The Bayesian Inference with Markov Chain Monte Carlo
for the Haplotype Phasing Problem
The Bayesian inference methods are based on the computational statistical approach. In
comparison with the EM method, the Bayesian inference method aims to find the posterior
distribution of the model parameters given the genotype data. In other words, whereas
the EM method yields an estimate of the unknown haplotype population frequencies $\Theta$,
the Bayesian inference method provides the a posteriori probability
$\Pr(H \mid D)$. The Markov Chain Monte Carlo (MCMC) method draws approximate samples from $\Pr(H \mid D)$.
Some of the basic MCMC algorithms are:
a) the Metropolis-Hastings algorithm,
b) the Gibbs sampling.
Ad a) The Metropolis-Hastings algorithm was introduced in the papers [15], [25]. The method starts at $t = 0$ with the selection of $X^{(0)} = x^{(0)}$ drawn at random from some starting distribution $g$, with the requirement that $f(x^{(0)}) > 0$. Given $X^{(t)} = x^{(t)}$, the algorithm generates
$X^{(t+1)}$ as follows:
1) Sample a candidate value $X^*$ from the proposal distribution $g(\cdot \mid x^{(t)})$.
2) Compute the Metropolis-Hastings ratio $R(x^{(t)}, X^*)$, where
$$
R(u, v) = \frac{f(v)\, g(u \mid v)}{f(u)\, g(v \mid u)} \tag{2}
$$
$R(x^{(t)}, X^*)$ is always defined, because the proposal $X^* = x^*$ can only occur if
$f(x^{(t)}) > 0$ and $g(x^* \mid x^{(t)}) > 0$.
3) Sample a value for $X^{(t+1)}$ according to the following:
$$
X^{(t+1)} = \begin{cases}
X^* & \text{with probability } \min\{R(x^{(t)}, X^*), 1\}, \\
x^{(t)} & \text{otherwise.}
\end{cases}
$$
4) Increment $t$ and return to step 1.
A chain constructed by the Metropolis-Hastings algorithm is Markov, since $X^{(t+1)}$ depends only
on $X^{(t)}$. Whether the chain is irreducible and aperiodic depends on the choice of the proposal distribution
and has to be checked. If this check confirms irreducibility and aperiodicity, then
the chain generated by the Metropolis-Hastings algorithm has a unique limiting stationary
distribution.
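The four steps of the Metropolis-Hastings algorithm can be sketched in Python as follows. The target (a standard normal density known only up to a constant) and the Gaussian random-walk proposal are illustrative assumptions chosen for brevity; they are not related to the haplotype phasing problem, and all names are hypothetical.

```python
import math
import random

def metropolis_hastings(f, sample_proposal, proposal_density, x0, n_steps, rng):
    """Return the chain x^(0), ..., x^(n_steps) for an unnormalized target f.

    sample_proposal(u, rng) draws X* from g(. | u);
    proposal_density(v, u) evaluates g(v | u).
    """
    if f(x0) <= 0:
        raise ValueError("the starting point must satisfy f(x0) > 0")
    chain = [x0]
    x = x0
    for _ in range(n_steps):
        cand = sample_proposal(x, rng)                         # step 1
        ratio = (f(cand) * proposal_density(x, cand)) / (
            f(x) * proposal_density(cand, x))                  # step 2, ratio R
        if rng.random() < min(ratio, 1.0):                     # step 3
            x = cand
        chain.append(x)                                        # step 4
    return chain

def target(x):
    # Assumed example target: standard normal density up to a constant
    return math.exp(-0.5 * x * x)

def sample_proposal(u, rng):
    return rng.gauss(u, 1.0)

def proposal_density(v, u):
    return math.exp(-0.5 * (v - u) ** 2) / math.sqrt(2.0 * math.pi)

rng = random.Random(0)
chain = metropolis_hastings(target, sample_proposal, proposal_density, 0.0, 20000, rng)
kept = chain[2000:]  # discard an initial burn-in segment
mean = sum(kept) / len(kept)
variance = sum((x - mean) ** 2 for x in kept) / len(kept)
```

The sample mean and variance of the retained part of the chain should be close to 0 and 1, the moments of the assumed target. Because the proposal is symmetric here, the proposal densities cancel in the ratio; they are kept in the code so that it mirrors the general form of equation (2).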
Ad b) The Gibbs sampling method is specifically adapted for a multidimensional target
distribution. The goal is to construct a Markov chain whose stationary distribution equals the
target distribution $f$.
Let $X = (X_1, \ldots, X_p)^T$ and $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)^T$. We assume that the univariate
conditional density of $X_i \mid X_{-i} = x_{-i}$, denoted by $f(x_i \mid x_{-i})$, can be sampled for $i = 1, 2, \ldots, p$. Then,
from a starting value $x^{(0)}$, the Gibbs sampling method can be described as follows:
1) Choose an ordering of the components of $x^{(t)}$.
2) For each $i$ in the chosen ordering, sample $X_i^* \mid x_{-i}^{(t)} \sim f(x_i \mid x_{-i}^{(t)})$.
3) Once step 2 has been completed for each component of $X$ in the selected order, set
$X^{(t+1)} = X^*$.
The chain produced by the Gibbs sampler is a Markov chain. As with the Metropolis-Hastings algorithm, we can use the realization from the chain to estimate the expectation of
any function of $X$.
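A minimal Gibbs sampler can be sketched for a case in which the univariate conditionals are known in closed form. The example below assumes a standard bivariate normal target with correlation `rho`, for which $X_1 \mid X_2 = x_2$ is normal with mean $\rho x_2$ and variance $1 - \rho^2$. The target and all names are illustrative assumptions, and each component is conditioned on the most recent value of the other one, which is a common variant of step 2.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x1, x2 = start
    chain = []
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, sd)  # sample X1 | X2 = x2
        x2 = rng.gauss(rho * x1, sd)  # sample X2 | X1 = x1 (freshly updated x1)
        chain.append((x1, x2))
    return chain

rng = random.Random(1)
chain = gibbs_bivariate_normal(0.8, 20000, rng)[1000:]  # drop burn-in
m1 = sum(a for a, _ in chain) / len(chain)
m2 = sum(b for _, b in chain) / len(chain)
cov = sum((a - m1) * (b - m2) for a, b in chain) / len(chain)
v1 = sum((a - m1) ** 2 for a, _ in chain) / len(chain)
v2 = sum((b - m2) ** 2 for _, b in chain) / len(chain)
corr = cov / math.sqrt(v1 * v2)
```

The sample correlation estimated from the chain should be close to the assumed value 0.8, which illustrates how a realization of the chain is used to estimate an expectation.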
Finally, the Bayesian inference method using the MCMC can be applied to samples consisting of a large number of SNPs or to samples in which a substantial portion of haplotypes
occur only once. Furthermore, the Gibbs sampler can be combined with the coalescent, a popular genetic model that denotes
a tree describing the evolutionary history of a set of DNA sequences [16].
4. Machine Learning Methods for Selecting Tagging SNPs
4.1. The Problem Formulation
The tag SNP selection problem can be formulated as follows: Let $S = \{s_1, \ldots, s_n\}$ be a set of
$n$ SNPs in a studied region and $D = \{h_1, \ldots, h_m\}$ be a data set of $m$ haplotypes that consist of the
$n$ SNPs. As in the haplotype definition of Section 3.1, we assume that $h_i \in D$ is a vector of size $n$ whose
element is 0 when the allele of an SNP is major and 1 when
it is minor. Let the maximum number of the haplotype tagging SNPs (htSNPs) be $k$.
We assume that the function $f(T', D)$ provides a measure of how well the subset $T' \subseteq S$
represents the original data $D$. Thus, the tag SNP selection is given by
Problem: The tag SNP selection
Input:
1) a set of SNPs $S$,
2) a set of haplotypes $D$,
3) a maximum number of htSNPs $k$.
Output: a set of htSNPs $T$, which is
$$
T = \arg\max_{T' \subseteq S,\ |T'| \le k} f(T', D).
$$
In other words, the tag SNP selection consists in finding an optimal subset of SNPs of
size at most $k$, based on the given evaluation function $f$, among all possible subsets of the
original SNPs.
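The problem statement can be made concrete with an exhaustive search over all subsets of size at most $k$. The sketch below is only feasible for very small $n$ and is meant as an illustration. The evaluation function `distinct_patterns`, which counts how many haplotypes remain distinguishable when only the selected SNPs are observed, is a hypothetical choice of $f$ and is not the measure used in the cited works.

```python
from itertools import combinations

def distinct_patterns(tags, haplotypes):
    """Hypothetical f(T', D): number of distinct haplotypes seen through the SNPs in tags."""
    return len({tuple(h[s] for s in tags) for h in haplotypes})

def select_tag_snps(haplotypes, k, f=distinct_patterns):
    """Exhaustively search for the subset of at most k SNPs that maximizes f."""
    n_snps = len(haplotypes[0])
    best, best_score = (), f((), haplotypes)
    for size in range(1, k + 1):
        for tags in combinations(range(n_snps), size):
            score = f(tags, haplotypes)
            if score > best_score:  # strict improvement keeps the smallest, earliest subset
                best, best_score = tags, score
    return best, best_score

# Toy data: SNPs 0 and 1 always agree, and so do SNPs 2 and 3
D = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
tags, score = select_tag_snps(D, k=2)
```

In the toy data two SNPs suffice to distinguish all four haplotypes, because each of the remaining SNPs duplicates one of the selected ones. The number of candidate subsets grows combinatorially with $n$, which is why heuristic methods such as those presented below are used in practice.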
The tag SNP selection methods based on machine learning that are most often mentioned are [22]:
1) the pairwise association with the use of clustering,
2) the tagged SNP prediction with the use of Bayesian networks.
We now present these machine learning methods used for the tag SNP selection.
4.2. The Pairwise Association with the Use of Clustering
The cluster analysis for the pairwise association for the tag SNP selection was first used
by Byng et al. [4]. This method works as follows: The original set of SNPs is divided into
hierarchical clusters. Within each cluster, all SNPs are associated at a predefined level $\delta$ (typically
$\delta \ge 0.6$) [4].
Other works, among others [1, 5], use the pairwise linkage disequilibrium (LD) within each cluster.
Under linkage equilibrium, the
joint probability of two alleles $s_{1i}$ and $s_{2j}$ is equal to the product of the individual allele probabilities, that is, the alleles are independent. The LD
[19], [12] measures the departure from this independence and is given by
$$
\delta_{ij} = \Pr(s_{1i}, s_{2j}) - \Pr(s_{1i}) \cdot \Pr(s_{2j}) \tag{3}
$$
For two SNPs within the same discrete region, called a block, the LD is high, while for
two SNPs belonging to different regions it is small. Unfortunately, there is no agreement
on the definition of such a region [28, 13].
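As a small worked illustration of the LD measure in equation (3), consider purely hypothetical allele frequencies that are not taken from any cited data set: $\Pr(s_{1i}) = 0.50$, $\Pr(s_{2j}) = 0.50$ and a joint frequency $\Pr(s_{1i}, s_{2j}) = 0.40$. Then

$$
\Pr(s_{1i}, s_{2j}) - \Pr(s_{1i}) \cdot \Pr(s_{2j}) = 0.40 - 0.50 \cdot 0.50 = 0.15 .
$$

If the two alleles were independent, the joint frequency would be $0.50 \cdot 0.50 = 0.25$ and the same difference would be $0$. The positive value in the first case indicates that the two alleles occur together more often than expected under independence.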
In the clustering methods based on the pairwise LD, the LD parameter between the htSNP and all the other SNPs of its cluster is greater than the threshold level. These methods include:
1) the minimax clustering,
2) the greedy binning algorithm.
Ad 1) The former, the minimax clustering [1], uses the distance between two clusters defined as
$$
D_{\mathrm{minimax}}(C_i, C_j) = \min_{s \in (C_i \cup C_j)} \left( D_{\max}(s) \right),
$$
where $D_{\max}(s)$ is the maximum distance between the SNP $s$ and all other SNPs in the two clusters. According to this method, every SNP initially forms its own cluster. Further, the two closest clusters are repeatedly merged. The SNP defining the minimax distance is treated as the representative SNP for the cluster. The algorithm stops when
the smallest distance between two clusters is larger than the level $1 - \delta$. Thus, the representative SNPs are selected as the set of htSNPs.
Ad 2) The latter, the greedy binning algorithm, initially examines all the pairwise LD values between SNPs, and for each SNP counts the number of other SNPs whose pairwise LD with this
SNP is greater than the prespecified level $\delta$. The SNP with the largest count is then clustered with its associated SNPs. Thus, this SNP becomes the htSNP for this cluster. This procedure is iterated until all the SNPs are clustered.
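The greedy binning procedure can be sketched in a few lines of Python. The sketch assumes that the pairwise LD values are already available as a symmetric matrix `ld`; the matrix, the threshold value and the function name are hypothetical, and ties between SNPs with the same count are broken by the smallest index.

```python
def greedy_binning(ld, threshold):
    """Greedily bin SNPs; returns a list of (htSNP, sorted members of its bin)."""
    unbinned = set(range(len(ld)))
    bins = []
    while unbinned:
        def count(s):
            # Number of still unbinned SNPs whose LD with s exceeds the threshold
            return sum(1 for t in unbinned if t != s and ld[s][t] > threshold)
        tag = max(sorted(unbinned), key=count)  # SNP with the largest count
        members = {t for t in unbinned if t != tag and ld[tag][t] > threshold}
        members.add(tag)
        bins.append((tag, sorted(members)))
        unbinned -= members
    return bins

# Hypothetical pairwise LD values for four SNPs
ld = [
    [1.0, 0.9, 0.3, 0.1],
    [0.9, 1.0, 0.9, 0.2],
    [0.3, 0.9, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
bins = greedy_binning(ld, 0.8)
```

In this example SNP 1 is in high LD with SNPs 0 and 2 and therefore becomes the htSNP of the first bin, while SNP 3 is not associated with any other SNP and forms a bin of its own.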
The pairwise association-based method for the tag SNP selection can be used for disease diagnosis. The complexity of this method lies between $O(mn^2 \log n)$ and $O(cmn^2)$
[32, 5], where $c$ is the number of clusters, $m$ is the number of haplotypes, and
$n$ is the number of SNPs.
4.3. The Tag SNP Selection Based on Bayesian Networks (BN)
The tagged SNP prediction with the use of Bayesian networks was first used by Bafna et al.
[2]. Recently, Lee and Shatkay [23] proposed a new prediction-based tag SNP selection method,
called the BNTagger, which improves the prediction accuracy.
The BNTagger method of the tag SNP selection uses the formalism of the BN. The BN is a
graphical model of joint probability distributions that encodes conditional independence
and dependence relations between its variables [18]. There are two components of the BN: a
directed acyclic graph $G$ and a set of conditional probability distributions $\Theta = \{\theta_1, \ldots, \theta_p\}$.
With each node in the graph $G$ a random variable $X_j$ is associated. An edge between two
nodes represents the dependence between the two random variables. The lack of an edge
represents their conditional independence. This graph can be automatically learned from the
data. With the use of the learned BN it is easy to compute the posterior probability of any
random variable.
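How a posterior probability is read off a BN can be illustrated with a tiny hand-specified network. The sketch below is not the BNTagger: the structure (two SNPs A and B as parents of a SNP C), the probability tables and all names are hypothetical, and inference is done by plain enumeration of the joint distribution, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical network A -> C <- B with binary variables (0 = major, 1 = minor allele)
P_A = {0: 0.7, 1: 0.3}
P_B = {0: 0.6, 1: 0.4}
P_C1_GIVEN_AB = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}

def joint(a, b, c):
    """Joint probability factorized along the graph: Pr(A) Pr(B) Pr(C | A, B)."""
    p_c1 = P_C1_GIVEN_AB[(a, b)]
    return P_A[a] * P_B[b] * (p_c1 if c == 1 else 1.0 - p_c1)

def posterior(query, evidence):
    """Pr(query | evidence), obtained by summing the joint over all assignments."""
    names = ("A", "B", "C")
    num = den = 0.0
    for values in product((0, 1), repeat=3):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue  # assignment contradicts the evidence
        p = joint(*values)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den

p_c_given_a = posterior({"C": 1}, {"A": 1})  # predict C from an observed parent
p_a_given_c = posterior({"A": 1}, {"C": 1})  # reason from the child back to a parent
```

The first query corresponds to predicting an unobserved SNP from an observed one, which is the use made of the network in the prediction-based tag SNP selection; the second query shows that the same model also supports reasoning in the opposite direction.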
5. Machine Learning Methods for the Tag SNP Selection
for the Sake of Disease Diagnosis
5.1. The Feature Selection with the Use of the Similarity Method
The feature selection with the use of the feature similarity (FSFS) method was introduced
by Phuong et al. [27]. This method works as follows:
We assume that $N$ haploid sequences consisting of $m$ SNPs are given. They are
represented by an $N \times m$ matrix $M$ with the sequences as rows and the SNPs as columns. Each
element of this matrix, which represents the $j$-th allele of the $i$-th sequence, is equal to
0, 1 or 2, where 0 represents missing data, and 1 and 2 represent the two alleles. The SNPs represent
the attributes that are used to identify the class to which a sequence belongs.
The machine learning problem is formulated as follows: how to select a subset of SNPs
which can classify all haplotypes with the required accuracy. A measure of similarity between
pairs of features in the FSFS method is given by
$$
r^2 = \frac{(p_{AB}\, p_{ab} - p_{Ab}\, p_{aB})^2}{p_A\, p_B\, p_a\, p_b}, \qquad 0 \le r^2 \le 1 \tag{4}
$$
where $A$ and $a$ are the two alleles at one locus, $B$ and $b$ are the two alleles at the other locus, $p_{xy}$ is the frequency of observing
alleles $x$ and $y$ in the same haplotype, and $p_x$ is the frequency of allele $x$ alone.
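The similarity measure (4) can be computed directly from a haplotype matrix. The sketch below assumes, for simplicity, a 0/1 coding of the two alleles at each SNP with no missing data (0 stands for allele A or B, 1 for allele a or b); this coding, the function name and the convention of returning 0 for a monomorphic SNP are assumptions of the example.

```python
def r_squared(haplotypes, j, k):
    """r^2 between SNP columns j and k of 0/1-coded haplotypes."""
    n = len(haplotypes)

    def freq(x, y):
        # Frequency of observing allele x at SNP j and allele y at SNP k together
        return sum(1 for h in haplotypes if h[j] == x and h[k] == y) / n

    p_AB, p_Ab, p_aB, p_ab = freq(0, 0), freq(0, 1), freq(1, 0), freq(1, 1)
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab  # allele frequencies at SNP j
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab  # allele frequencies at SNP k
    denom = p_A * p_B * p_a * p_b
    if denom == 0:
        return 0.0  # assumed convention for a monomorphic SNP
    return (p_AB * p_ab - p_Ab * p_aB) ** 2 / denom

haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
linked = r_squared(haps, 0, 1)    # SNPs 0 and 1 always carry matching alleles
unlinked = r_squared(haps, 0, 2)  # SNPs 0 and 2 vary independently
```

In the toy data the first pair of SNPs reaches the maximum value of the measure and the second pair the minimum, which corresponds to the two ends of the range stated in (4).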
The details of the algorithm used in the FSFS method [27] are given in the procedure presented in Fig. 3. The input parameters are $S$, the original set of SNPs, and $K$, the
number of nearest neighbours of an SNP to consider. The algorithm initializes $R$ to $S$. In
each iteration, the distance $d_i^K$ between each SNP $F_i$ in $R$ and its $K$-th nearest neighbouring SNP is computed. Further, the FSFS algorithm finds the SNP $F_0$ with the smallest such distance $d_0^K$ and removes its $K$ nearest SNPs from $R$. The
next step compares the cardinality of $R$ with $K$ and adjusts $K$. Thus, $K$
is gradually decreased until $d_0^K$ is less than or equal to an error threshold $\varepsilon$.
The parameter $K$ is tuned until the desired prediction accuracy is achieved. According to
the experimental results given by Daly et al. [8], the FSFS method can give a prediction
accuracy of 88% with only 100 tag SNPs.
Input data:
S – set of SNPs, K – parameter of the algorithm
Output data:
R – selected tag SNPs
1. R := S;
2. for F_i ∈ R do
     d_i^K := D(F_i, F_i^K)   /* F_i^K is the K-th nearest SNP of F_i in R */
   endfor;
3. find F_0 such that d_0^K = min over F_i ∈ R of d_i^K;
   let F_0^1, F_0^2, ..., F_0^K be the K nearest SNPs of F_0 and R := R − {F_0^1, ..., F_0^K};
   initially ε := d_0^K;
4. if K > |R| − 1 then K := |R| − 1;
5. if K = 1 then goto 8;
6. while d_0^K > ε do
   begin
     K := K − 1;
     if K = 1 then goto 8;
     compute d_0^K;
   end;
7. goto 2;
8. stop; R contains the selected tag SNPs;
Fig. 3. FSFS algorithm for tag SNP selection
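One possible reading of the procedure in Fig. 3 is sketched below in Python. It is an interpretation under stated assumptions rather than the implementation of [27]: the pairwise dissimilarities between SNPs are assumed to be precomputed in a symmetric matrix `dist` (for instance one minus the similarity measure of equation (4)), the threshold is fixed to the first minimum distance found, the procedure is assumed to stop once $K$ reaches 1, and ties are broken by the smallest SNP index. All names are hypothetical.

```python
def kth_nearest(dist, i, k, pool):
    """Distance from SNP i to its k-th nearest SNP in pool, and its k nearest SNPs."""
    neighbours = sorted((dist[i][j], j) for j in pool if j != i)
    return neighbours[k - 1][0], [j for _, j in neighbours[:k]]

def fsfs(dist, k):
    """Select tag SNPs following one reading of the steps in Fig. 3."""
    if not 1 <= k < len(dist):
        raise ValueError("k must satisfy 1 <= k < number of SNPs")
    remaining = set(range(len(dist)))  # step 1: R := S
    eps = None
    while True:
        # Steps 2-3: find the SNP whose k-th neighbour is closest and drop its k neighbours
        d0, f0 = min((kth_nearest(dist, i, k, remaining)[0], i) for i in remaining)
        remaining -= set(kth_nearest(dist, f0, k, remaining)[1])
        if eps is None:
            eps = d0  # the threshold is set in the first iteration only
        k = min(k, len(remaining) - 1)  # step 4
        if k <= 1:                      # step 5
            return sorted(remaining)
        while d0 > eps:                 # step 6: shrink k until the threshold is met
            k -= 1
            if k <= 1:
                return sorted(remaining)
            d0 = min(kth_nearest(dist, i, k, remaining)[0] for i in remaining)

# Toy data: two groups of three mutually similar SNPs
groups = [0, 0, 0, 1, 1, 1]
dist = [[0.0 if i == j else (0.05 if groups[i] == groups[j] else 0.9)
         for j in range(6)] for i in range(6)]
selected = fsfs(dist, 2)
```

For the toy matrix the sketch keeps one SNP from each group of mutually similar SNPs, which is the intended effect of removing the nearest neighbours of a retained SNP.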
5.2. An Application of the SVM for the Tag SNP Selection for Disease Diagnosis
In this section, we describe an application of the SVM method for the tag SNP selection
with a simultaneous disease diagnosis.
The support vector machine (SVM) [30] is a machine learning method which has been reported to
outperform other techniques, such as neural networks or the $k$-nearest neighbor classifier.
Moreover, the SVM has been successfully applied to the prediction of multiple cancer
types with excellent forecasting results [33, 20]. We recall that the SVM method finds an
optimal maximal margin hyperplane separating two or more classes of data and at the same
time minimizes the classification error. The mentioned margin is the distance between the hyperplane and the closest data points from all the classes of data.
The solution of an optimization problem with the use of the SVM method requires a solution
of a number of quadratic programming (QP) problems. It involves two parameters: the penalty parameter $C$ and the kernel width $\sigma$. For some limiting values of $\sigma^2$ and $C$, the resulting classifier is not fit for the problem under consideration because the data contain noise. If $\sigma^2 \to \infty$ and $C = C_1 \sigma^2$, where $C_1$ is fixed, then the SVM classifier
converges to the linear SVM classifier with the penalty parameter $C_1$. A well selected
$(C, \sigma^2)$ is crucial for unknown data prediction. In the paper [3] the procedure for finding
good $C$ and $\sigma^2$ was given.
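The role of the kernel width can be illustrated with a short Python sketch. It assumes the Gaussian kernel in the form $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, which is an assumption of this example since the text above only names the kernel width; the function name and the two 0/1-coded SNP vectors are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma2):
    """Gaussian kernel value for two equally long vectors and a kernel width sigma^2."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma2))

# Two hypothetical SNP profiles that differ at two positions
x, y = (0, 1, 1, 0), (1, 1, 0, 0)
narrow = gaussian_kernel(x, y, 0.1)   # small width: the profiles look unrelated
wide = gaussian_kernel(x, y, 100.0)   # large width: the profiles look almost identical
same = gaussian_kernel(x, x, 1.0)     # a profile compared with itself
```

With a small width every sample is similar only to itself, whereas with a very large width all samples look alike. This is one way to see why the pair of parameters has to be selected jointly and carefully.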
Table 1
The prediction accuracy of existing methods
Accuracy columns: ALL/AML, Breast cancer, Colon, Multiple myeloma, SRBCT

No.  Author(s)        Method                      Accuracy
1    Cho [6]          genetic algorithm           73.53% (1), 77.3% (3)
2    Cho [7]          genetic algorithm           94.12% (17), 100% (21)
3    Deb et al. [9]   evolutionary algorithm      97% (7)
4    Deutsch [11]     evolutionary algorithms
5    Huang [17]       genetic algorithm and SVM
6    Lee [21]         Bayesian inference          100% (10)
7    Lee [24]         SVM
8    Waddell [31]     SVM                         71%
SRBCT column: 100% (21), 98.75% (6.2), 100% (20)

Note:
ALL/AML – acute lymphoblastic leukemia/acute myeloid leukemia,
SRBCT – small round blue cell tumor,
numbers in parentheses denote the number of selected genes.
According to the results given by Waddell et al. [31] concerning the case of the
multiple myeloma (about 0.035% of people over 70 and 0.002% of people between the ages of 30 and 54 in the USA), it was possible to detect differences in the SNP patterns between healthy
people and the people diagnosed with this disease.
The overall classification accuracy obtained was 71%. Although the
accuracy was not high, it is significant that only relatively sparse SNP data were used for this
classification. The comparison of the SVM method with other existing methods is given in
Table 1. It is noticeable that these methods are complementary. From Table 1 we see that the
existing methods tend to select many genes with poor prediction accuracy. However, the
SVM method selects genes with relatively high prediction accuracy.
6. Conclusion
We have presented some machine learning methods concerning the tag SNP selection,
some of which are additionally used to diagnose diseases. These methods are applied to data
sets with hundreds of SNPs. In general, they are inexpensive and offer varying accuracy for
the haplotype phasing, the tagged SNP prediction and, furthermore, disease diagnosis.
Missing alleles, genotyping errors, a low LD among SNPs, a small sample size, and a lack of
scalability with the increasing number of markers are among the basic weaknesses of the
machine learning methods currently used for the computational haplotype analysis.
Nevertheless, the machine learning methods are more and more often used in the tag SNP
selection and disease diagnosis.
BIBLIOGRAPHY
1. Ao S. I., Yip K., Ng M., Cheung D., Fong P., Melhado I., Sham P. C.: CLUSTAG: Hierarchical Clustering and Graph Methods for Selecting Tag SNPs. Bioinformatics, Vol. 21, 2005, p. 1735-1736.
2. Bafna V., Halldörsson B. V., Schwartz R., Clark A. G., Istrail S.: Haplotypes and Informative SNP Selection Algorithms: Don't Block out Information. [in:] Proc. of the Seventh Int. Conf. on Computational Molecular Biology, 2003, p. 19-26.
3. Boser B. E., Guyon I. M., Vapnik V.: A Training Algorithm for Optimal Margin Classifiers. Fifth Annual Workshop on the Computational Learning Theory, ACM, 1992.
4. Byng M. C., Whittaker J. C., Cuthbert A. P., Mathew C. G., Lewis C. M.: SNP Subset Selection for Genetic Association Studies. Annals of Human Genetics, Vol. 67, 2003, p. 543-556.
5. Carlson C. S., Eberle M. A., Rieder M. J., Yi Q., Kruglyak L., Nickerson D. A.: Selecting a Maximally Informative Set of Single-nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium. American Journal of Human Genetics, Vol. 74, 2004, p. 106-120.
6. Cho J. H., Lee D., Park J. H., Lee I. B.: New Gene Selection Method for Classification of Cancer Subtypes Considering Within-Class Variation. FEBS Letters, Vol. 551, 2003, p. 3-7.
7. Cho J. H., Lee D., Park J. H., Lee I. B.: Gene Selection and Classification from Microarray Data Using Kernel Machine. FEBS Letters, Vol. 571, 2004, p. 93-98.
8. Daly M., Rioux J., Schaffner S., Hudson T., Lander E.: High-Resolution Haplotype Structure in the Human Genome. Nature Genetics, Vol. 29, 2001, p. 229-232.
9. Deb K., Reddy A. R.: Reliable Classification of Two-Class Cancer Using Evolutionary Algorithms. Biosystems, Vol. 72, 2003, p. 111-129.
10. Dempster A. P., Laird N. M., Rubin D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Vol. 39, No. 1, 1977, p. 1-38.
11. Deutsch J.: Evolutionary Algorithms for Finding Optimal Gene Sets in Microarray Prediction. Bioinformatics, Vol. 19, No. 1, 2003, p. 45-52.
12. Devlin B., Risch N.: A Comparison of Linkage Disequilibrium Measures for Fine Scale Mapping. Genomics, Vol. 29, 1995, p. 311-322.
13. Ding K., Zhou K., Zhang J., Knight J., Zhang X., Shen Y.: The Effect of Haplotype-Block Definitions on Inference of Haplotype-block Structure and htSNPs Selection. Molecular Biology and Evolution, Vol. 22, No. 1, 2005, p. 48-159.
14. Gusfield D., Orzack S. H.: Haplotype Inference. CRC Handbook in Bioinformatics, CRC Press, Boca Raton, 2005, p. 1-25.
15. Hastings W. K.: Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, Vol. 57, 1970, p. 97-109.
16. Hedrick P. W.: Genetics of Population. Jones and Bartlett Publishers, Sudbury 2004.
17. Huang H. L., Chang F. L.: ESVM: Evolutionary Support Vector Machine for Automatic Feature Selection and Classification of Microarray Data. Biosystems, Vol. 90, 2007, p. 516-528.
18. Jensen F.: Bayesian Networks and Decision Graphs. Springer-Verlag, New York, Berlin, Heidelberg 1997.
19. Jorde L. B.: Linkage Disequilibrium and the Search for Complex Disease Genes. Genome Research, Vol. 10, 2000, p. 1435-1444.
20. Keerthi S. S., Lin C. J.: Asymptotic Behaviour of Support Vector Machines with Gaussian Kernel. Neural Computation, Vol. 15, No. 7, 2003, p. 1667.
21. Lee K. E., Sha N., Dougherty E. R., Vannucci M., Mallick B. K.: Gene Selection: A Bayesian Variable Selection Approach. Bioinformatics, Vol. 19, No. 1, 2003, p. 90-97.
22. Lee P. H.: Computational Haplotype Analysis: An Overview of Computational Methods in Genetic Variation Study. Technical Report 2006-512, Queen's University, 2006.
23. Lee P. H., Shatkay H.: BNTagger: Improved Tagging SNP Selection Using Bayesian Networks. The 14th Annual Int. Conf. on Intelligent Systems for Molecular Biology (ISMB), 2006.
24. Lee Y., Lee C. K.: Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data. Bioinformatics, Vol. 19, No. 1, 2003, p. 1132-1139.
25. Metropolis N., Rosenbluth A. W., Rosenbluth M. N., Teller A. H., Teller E.: Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, Vol. 21, 1953, p. 1087-1091.
26. Nothnagel M.: The Definition of Multilocus Haplotype Blocks and Common Diseases. Ph.D. Thesis, University of Berlin, 2004.
27. Phuong T. M., Lin Z., Altman R. B.: Choosing SNPs Using Feature Selection. Proc. of the IEEE Computational Systems Bioinformatics Conference, 2005, p. 301-309.
28. Schulze T. G., Zhang K., Chen Y., Akula N., Sun F., McMahon F. J.: Defining Haplotype Blocks and Tag Single-nucleotide Polymorphisms in the Human Genome. Human Molecular Genetics, Vol. 13, No. 3, 2004, p. 335-342.
29. Sherry S. T., Ward M. H., Kholodov M., Baker J., Phan L., Smigielski E. M., Sirotkin K.: dbSNP: the NCBI Database of Genetic Variation. Nucleic Acids Research, Vol. 29, 2001, p. 308-311.
30. Vapnik V.: Statistical Learning Theory. John Wiley and Sons, New York 1998.
31. Waddell M., Page D., Zhan F., Barlogie B., Shaughnessy J. Jr.: Predicting Cancer Susceptibility from Single-nucleotide Polymorphism Data: a Case Study in Multiple Myeloma. Proc. of BIOKDD '05, Chicago, August 2005.
32. Wu X., Luke A., Rieder M., Lee K., Toth E. J., Nickerson D., Zhu X., Kan D., Cooper R. S.: An Association Study of Angiotensinogen Polymorphisms with Serum Level and Hypertension in an African-American Population. Journal of Hypertension, Vol. 21, No. 10, 2003, p. 1847-1852.
33. Yoonkyung L., Cheol-Koo L.: Classification of Multiple Cancer Types by Multicategory Support Vector Machines Using Gene Expression Data. Bioinformatics, Vol. 19, No. 9, 2003, p. 1132.
Reviewer: Dr inż. Jerzy Respondek
Received by the Editors on 11 March 2011
Summary
The paper reviews the basic computational methods used in data mining for selecting a minimal subset of single nucleotide polymorphisms (SNPs). The selection is based on haplotypes
and makes it possible to find all the SNPs associated with a given disease. As a result, methods
such as the pairwise association with the use of clustering, the maximum likelihood method,
the Metropolis-Hastings algorithm, the support vector machine (SVM), etc., are of great importance in diagnosing oncological diseases. These methods differ both in the accuracy obtained and in the number
of genes taken into account.
Address
Jerzy MARTYNA: Jagiellonian University, Institute of Computer Science,
ul. Prof. S. Łojasiewicza 6, 30-348 Kraków, Poland, [email protected].
INFORMATION FOR AUTHORS
The journal STUDIA INFORMATICA publishes both fundamental and applied Memoirs and Notes in the field
of informatics. The Editors' aim is to provide an active forum for disseminating the original results of theoretical
research and applications practice of informatics understood as a discipline focused on the investigations of laws
that rule processes of coding, storing, processing, and transferring of information or data.
Papers are welcome from fields of informatics inclusive of, but not restricted to Computer Science,
Engineering, and Life and Physical Sciences.
All manuscripts submitted for publication will be subject to critical review. Acceptability will be judged
according to the paper's contribution to the art and science of informatics.
In the first instance, all text should be submitted as hardcopy, conventionally mailed, and for accepted paper
accompanying with the electronically readable manuscript to:
Dr. Marcin SKOWRONEK
Institute of Informatics
ul. Akademicka 16
44-100 Gliwice, Poland
Tel.: +48 32 237-12-15
Fax: +48 32 237-27-33
e-mail: [email protected]
MANUSCRIPT REQUIREMENTS
All manuscripts should be written in Polish or in English. Manuscript should be typed on one side paper only,
and submitted in duplicate. The name and affiliation of each author should be followed by the title of the paper (as
brief as possible). An abstract of not more than 50 words is required. The text should be logically divided under
numbered headings and subheadings (up to four levels). Each table must have a title and should be cited in the text.
Each figure should have a caption and have to be cited in the text. References should be cited with a number in
square brackets that corresponds to a proper number in the reference list. The accuracy of the references is the
author's responsibility. Abbreviations should be used sparingly and given in full at first mention (e.g. "Central
Processing Unit (CPU)"). In case when the manuscript is provided in Polish (English) language, the summary and
additional abstract (up to 300 words with reference to the equations, tables and figures) in English (Polish) should
be added.
After the paper has been reviewed and accepted for publication, the author has to submit to the Editor a
hardcopy and electronic version of the manuscript.
It is strongly recommended to submit the manuscript in a form downloadable from web site
http://zti.polsl.pl/makiety/.
To subscribe: STUDIA INFORMATICA (PL ISSN 0208-7286) is published by Silesian University of
Technology Press (Wydawnictwo Politechniki Śląskiej) ul. Akademicka 5, 44-100 Gliwice, Poland, Tel./Fax +48
32 237-13-81. 2011 annual subscription rate: US$60. Single number price approx. US$10-20 according to the issue
volume.
THE INSTITUTE OF INFORMATICS (INSTYTUT INFORMATYKI) offers:
- First-cycle full-time studies (engineering degree)
- Second-cycle full-time studies (master's degree)
- First-cycle part-time studies (engineering degree)
- Second-cycle part-time studies (master's degree)
- Postgraduate studies:
  - Computer networks and systems, databases
  - Geographic information systems
  - Teleinformatics in air transport
  - Internet technologies and mobile technologies
  - Methods of mining enterprise databases
- Doctoral studies
Information:
POLITECHNIKA ŚLĄSKA
Instytut Informatyki
44-100 Gliwice, ul. Akademicka 16
tel. (032) 237 24 05; 237 21 51;
fax (032) 237 27 33 (available around the clock)
e-mail: [email protected]
http://www.inf.polsl.pl (teaching)
