Grammatical Dictionary of Polish

Zygmunt Saloni, Włodzimierz Gruszczyński
Marcin Woliński, Robert Wołosz
Presentation by the Authors
Abstract
The dictionary provides a comprehensive grammatical description of Polish words. It covers
about 180,000 lexical items (lexemes). The dictionary has been compiled in an electronic
form and made accessible via a computer program. All lexemes are morphologically and
syntactically characterized by a set of features, which are displayed on the monitor. Additionally,
some regular derivatives are presented in entries. The inflectional description strives for completeness, while the derivation and syntax are described as far as a clear formalized approach
was feasible. Non-inflected lexemes are provided with their part-of-speech feature and the
valence information is added where instructive (case government for prepositions, type of
conjoined phrase for conjunctions). The compact disc containing the program is accompanied
by a printed introduction, facilitating its use and presenting the authors’ theoretical principles.
The dictionary may serve as a source of research in the domain of inflection and — to some
extent — syntax of Polish. It can also be used in the automatic processing of Polish. It may
also be useful for teaching Polish, especially to foreigners.
Key words:
dictionary, Polish, grammar, electronic dictionary, inflection, formalized approach
Streszczenie:
Słownik dostarcza obszernego opisu gramatycznego polskich słów. Zawiera ok. 180 000 jednostek leksykalnych (leksemów). Został on opracowany w postaci elektronicznej, a korzystanie
z niego odbywa się poprzez program komputerowy. Wszystkie leksemy są scharakteryzowane
za pomocą zestawu cech, które pojawiają się na monitorze. Dodatkowo, niektóre regularne
derywaty pojawiają się w hasłach. Staraliśmy się podać możliwie wyczerpujący opis fleksyjny,
natomiast słowotwórstwo i składnia były potraktowane na tyle dokładnie, na ile pozwalało
na to podejście formalne. Dla leksemów nieodmiennych podajemy część mowy; informacja
dotycząca walencji jest podana tam, gdzie jest to instruktywne (rząd przypadka dla przyimków,
typ zdania złożonego dla spójników). CD zawierającemu program towarzyszy drukowany
wstęp ułatwiający korzystanie, a także przedstawiający założenia teoretyczne autorów. Słownik
może stanowić źródło dla badań nad polską fleksją, a także, w pewnym zakresie, składnią.
Może też znaleźć zastosowanie w przetwarzaniu automatycznym języka polskiego. Będzie
też przydatny w nauczaniu języka polskiego, zwłaszcza jako obcego.
Słowa klucze:
słownik, język polski, gramatyka, słownik elektroniczny, fleksja, formalizacja gramatyki
1. Introduction
The content of this article is related to the talks given at the meeting of the Polish
Academy of Sciences Committee of Linguistics in October 2007 and at the 7th
European Conference on Formal Description of Slavic Languages (FDSL-7), held
at the University of Leipzig, Germany, in December 2007. Its purpose is to present
the Grammatical Dictionary of Polish, Słownik gramatyczny języka polskiego (Saloni
et al. 2007, henceforth: SGJP), compiled by the authors and published by Wiedza
Powszechna Publishers in December 2007.
SGJP gained its final shape as a result of the project 2 H01D 007 24 Słownik
gramatyczny języka polskiego, sponsored by the Polish Ministry of Science and carried out
at the University of Varmia and Masuria in Olsztyn, 2003–2006. The participants
of this project, besides the authors of the dictionary and of this paper, were several
colleagues who did auxiliary but important work: Monika Czerepowicka, Dorota Kopcińska, Małgorzata Sas, and Anna Śledź. We are also indebted
to volunteers who analysed particular groups of lexemes: Joanna Bartycha, Alena
Bielewicz, Patrycja Dąbrowska, Przemysław Lipski, Danuta Makowska, Teofil Mroczek, Laura Polkowska, and Joanna Szumigłowska. At various stages of our work we
obtained essential comments and suggestions from our colleagues. We thank them
all sincerely. We express special thanks to Janusz Bień, who supported our project
from the very beginning.
The dictionary provides a comprehensive grammatical description of Polish
words. It is compiled in electronic form and made accessible via a computer program.
All lexemes are characterized by a set of features, which are displayed on the monitor.
Dictionary entries can be selected by typing them into a dialog box or marking them
on the list of entries. The CD containing the program is accompanied by a printed
introduction, facilitating its use and presenting the authors’ theoretical principles.
2. The History of the Project
The idea of SGJP was conceived by Z. Saloni under the influence of A. Zaliznjak’s
grammatical dictionary of Russian (Zaliznjak 1977, cf. Saloni 1979). The project
to compile such a dictionary, formulated immediately after analyzing the Russian
model, has been carried out since then slowly and with variable intensity. It is clear
that the conception of a dictionary had to evolve during those 30 years. For example,
in 1975 the only possible format in which to publish a dictionary was a traditional
book; in 2007 this no longer makes much sense (we will comment further on this
statement later).
Nevertheless, some work was carried out as early as the eighties by Z. Saloni and his
students at the Białystok Branch of Warsaw University. The first task consisted in
analyzing the grammatical information in the main Polish dictionary, usually referred
to as Doroszewski’s dictionary (Doroszewski 1958–1969, 11 volumes, henceforth:
SJPDor.). The results of those analyses were presented in a series of master’s theses,
partially published, as well as in other articles, in the three volumes of Studies in
Polish Contemporary Lexicography (Studia z polskiej leksykografii współczesnej, vide
Saloni, ed. 1977, 1978, 1979).
At this stage the grammatical information contained in SJPDor. was transferred
into a card index (almost 130 000 items, on the basis of ca. 125 000 source entries).
The cards were useful in the next step, especially for a comprehensive analysis of the
declension of Polish common nouns, conducted by W. Gruszczyński in his PhD
thesis (Gruszczyński 1989).
The reverse index to SJPDor., Indeks a tergo do SJPDor. (Grzegorczykowa and Puzynina 1973), proved very useful for our analytical work. It was initially intended
as an additional volume to accompany the main work, but was finally published separately. We used it from the very beginning. The items in this index were given
very general grammatical characteristics (a part-of-speech symbol and main
inflectional group, without details) taken from the dictionary. Those draft notes
(the revised version of the corrected and annotated Indeks a tergo do SJPDor. was
uploaded onto the Internet by R. Wołosz) turned out to be an essential starting
point for further work.1
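A reverse (a tergo) index orders words by their endings rather than their beginnings, so that words likely to share an inflectional pattern end up next to each other. A minimal sketch of that ordering, using a few nouns quoted elsewhere in this article; the function name is hypothetical and the sort uses plain code-point order, not Polish collation:

```python
def a_tergo(words):
    """Sort words by their letters read from the end (reverse-index order)."""
    return sorted(words, key=lambda w: w[::-1])

# Illustrative word list only; not taken from the actual index.
print(a_tergo(["pirat", "chata", "termit", "aparat"]))
```

In the result, aparat and pirat (both ending in -at) become neighbours, which is the property that makes such an index useful for inflectional analysis.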
The above source was first used during final work on Jan Tokarski’s schematic
reverse index of Polish word forms (Tokarski 1993). This new index gave strings of
typical closing letters of word forms connected with a given grammatical characteristic. The preliminary version was prepared by the author in the form of a sloppy rough
draft during his work in the team compiling SJPDor. Later, although convinced of
its usefulness, Tokarski did not see the possibility of developing and publishing it, so
he bequeathed it to Z. Saloni to elaborate and publish in a well ordered form. This
index became the source material for several computer programs for morphological
analysis of Polish, two of which were constructed by members of the SGJP team
(vide Wołosz 2005, Woliński 2006).
The two reverse indexes mentioned above were an introduction to a much
more important work: an electronic version of the list of headwords of SJPDor.,
with grammatical information, prepared by Robert Wołosz. This list could be used
as source data for a spell checker, but could only serve as a first approximation of a
list of entries for a dictionary. We needed to check all the details to be included in a
well edited work.
This task began with Polish verbs. Polish conjugation is complicated (the typical
paradigm contains forms represented textually by at least 37 different words), but it
has been quite precisely analyzed and characterized by J. Tokarski (Tokarski 1951).
However, the level of exactness of his analysis was not sufficient for automated
analysis. It was necessary to examine the whole material and where necessary change
the qualification. The results were presented in the handbook of Polish conjugation
(Saloni 2001) and included in their entirety in SGJP.
1 A copy of the book with pencil annotations for every entry is preserved as documentation of this stage of our work.
3. The Grammatical Model
The work characterized in the previous section was carried out in the framework of
a model of Polish grammar developed and enriched over ca. 35 years by a team
of Polish linguists. For more than 10 years a computer specialist, Marcin Woliński,
has also been cooperating with this group.
The description used in SGJP refers to the tradition of Polish grammar and takes
over its well established achievements. However, it also contains some new ideas
proposed in the second half of the 20th century.
The predecessor of the description given by the authors of SGJP was Jan Tokarski’s
work. He developed a homogeneous classification of alternations in Polish verbal
paradigms (organized in thematic groups). He was also the author of the conception of inflectional information in SJPDor. (vide Tokarski 1958).2 Our attempts to
present Polish grammar in a formalized rigorous form can be treated as a natural
continuation of Tokarski’s work. For example, the general conception of the classification of Polish lexemes (Saloni 1974) referred to classes introduced in SJPDor.
(and in later dictionaries based on it).
Many solutions applied in SGJP were worked out earlier by members of the
group (above all, the entire conception of gender description, declension of pronouns and numerals, the two-stage presentation of the morphological description:
deep and surface, according to the function of the forms and enumeration of the
base forms — vide Bibliography). The essence of this description is presented in a
university textbook of Polish syntax (Saloni–Świdziński 1981). Some new ideas for
the description of conjunctions were taken from a later formalized description of
Polish syntax, formulated by M. Świdziński (Świdziński 1992).
The division of uninflected lexemes is based on the conceptions of R. Laskowski
and M. Grochowski (Laskowski 1984, Grochowski 1997).
We will present here two crucial aspects of our model, applicable mainly to nouns,
but having essential implications for the description of other classes of lexemes: the
repertoire of genders/subgenders, and the introduction of a new inflectional category
for nouns: depreciativity. The reader can see applications of our decisions in the
examples given below. Moreover, their presentation can help to see the character of
our grammatical model and its position in contemporary structural linguistics.
3.1. Gender
The system of genders adopted in SGJP is based on Saloni (1976b), which is a continuation of Mańczak’s analysis (Mańczak 1956), also taking into consideration the
analysis of gender differentiation in Russian by Zaliznjak (1967).
According to a long and important tradition, we understand grammatical gender
in the noun as its syntactic property, consisting in requiring a particular form of the
subordinate word. So grammatical gender of the noun manifests itself in the form of
words associated with it. (In linguistic literature grammatical genders are sometimes
called noun classes.) Theoretically, every noun must belong to one of the classes
(in practice, we permit a very few exceptions).
2 Tokarski also explicitly suggested compiling a grammatical dictionary of Polish; cf. Tokarski 1969, p. 390.
Traditional grammarians of Polish distinguished between masculine, feminine,
and neuter genders, e.g., ten dobry chleb (m) ‘this good bread’, ta dobra woda (f) ‘this
good water’, to dobre wino (n) ‘this good wine’. However, this classification is not
exhaustive. Pluralia tantum nouns cannot be put into any diagnostic context for the
nominative singular: ten dobry ___, ta dobra ___, to dobre ___. So we distinguish an additional gender, which we name co-plural and mark with the symbol p. Thus on the
basis of nominative singular forms of adjectives and verbs associated with the noun
(in the so-called agreement) we distinguish four genders of Polish nouns.
However, analogous distinctions occur in Polish also in other cases of adjectives
governed by nouns, mainly in the accusative; noun forms also require different forms
of numerals. If we take into account all of the above-mentioned features, we must
distinguish nine noun classes in Polish. Ultimately, we have the following system
of Polish genders:
Widzę jednego albo dwóch spośród tych ___, których lubię.          m1
Widzę jednego albo dwa spośród tych ___, które lubię.              m2
Widzę jeden albo dwa spośród tych ___, które lubię.                m3
Widzę jedno albo dwoje spośród tych ___, które lubię.              n1
Widzę jedno albo dwa spośród tych ___, które lubię.                n2
Widzę jedną albo dwie spośród tych ___, które lubię.               f
Widzę jedno albo dwoje spośród tych ___, których lubię.            p1
Widzę jedne albo dwoje spośród tych ___, które lubię.              p2
Widzę (jedną albo dwie pary) spośród tych ___, które lubię.        p3
(approximate English translation: I see one or two from among those ___ whom/which I like.)
We distinguish three subgenders of the masculine (vide Mańczak 1956, Saloni
1976b), two subgenders of the neuter, and three subgenders of the co-plural (vide
Saloni 1976b).
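The nine diagnostic frames listed above can be read as a lookup table: the numeral phrase and the relative pronoun that a noun requires together identify its gender. A minimal sketch, assuming the frames are transcribed exactly as listed; the names FRAMES, BY_FRAME, and gender_of are hypothetical and not part of SGJP:

```python
# Diagnostic frame per gender, transcribed from the list above:
# (numeral phrase after "Widzę", relative pronoun after the noun slot).
FRAMES = {
    "m1": ("jednego albo dwóch", "których"),
    "m2": ("jednego albo dwa", "które"),
    "m3": ("jeden albo dwa", "które"),
    "n1": ("jedno albo dwoje", "które"),
    "n2": ("jedno albo dwa", "które"),
    "f":  ("jedną albo dwie", "które"),
    "p1": ("jedno albo dwoje", "których"),
    "p2": ("jedne albo dwoje", "które"),
    "p3": ("jedną albo dwie pary", "które"),
}

# Invert the table; the frames are only diagnostic if no two genders share one.
BY_FRAME = {frame: gender for gender, frame in FRAMES.items()}
assert len(BY_FRAME) == len(FRAMES)

def gender_of(numeral_phrase, relative_pronoun):
    """Return the gender symbol whose frame matches (hypothetical helper)."""
    return BY_FRAME[(numeral_phrase, relative_pronoun)]

print(gender_of("jednego albo dwa", "które"))  # m2
```

The inversion check makes the point of the frames explicit: each combination of numeral forms and relative pronoun picks out exactly one of the nine classes.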
3.2. Depreciativity
One of the most disputable decisions in our description is introducing an additional
inflectional category: depreciativity (vide Saloni 1988).
In contemporary Polish many masculine nouns denoting humans have an additional form in the nominative plural (we call it depreciative). For example, the lexeme pirat ‘pirate’ has, beyond the regular nominative plural form piraci, a
second one: piraty, which is considerably rarer, marked, and requires other forms of the
dependent adjectives and verbs. The description of such forms in Polish grammar textbooks is irresolute. As a rule, the existence of such forms is mentioned in comments
(sometimes they are called impersonal), but not shown in paradigms. Consequently,
many educated native speakers who are not linguists may be unsure how
to qualify forms of the type piraty grammatically. The Polish version of the popular
computer program Word marks them as incorrect (and, for piraty, suggests the
corrections piaty, pirat, piryty, pirata), although they occur fairly frequently in Polish
texts (many occurrences can be found via Google).
These forms are in syntactic opposition to the regular ones, as is shown by the
following examples:
Byli to dobrzy piraci/oficerowie. ‘They were good pirates/officers.’
Były to dobre piraty/oficery.
It sometimes happens that both forms coincide morphologically; however, we can
identify them by their dependents:
Byli to tacy prości żołnierze.
Były to takie proste żołnierze.
‘They were such common soldiers.’
Therefore we introduced this opposition also into the adjective and verb paradigms.
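The opposition can be pictured as one extra dimension in the nominative plural cell, with the verb and adjective choosing their form according to its value. A minimal sketch built only from the example sentences above; the dictionary layout and the function name are assumptions made for illustration:

```python
# Nominative plural forms quoted in the examples above.
NOM_PL = {
    "pirat":  {"neutral": "piraci",     "depr": "piraty"},
    "oficer": {"neutral": "oficerowie", "depr": "oficery"},
}

# Verb and adjective forms that agree with each value of depreciativity.
AGREEMENT = {
    "neutral": ("Byli", "dobrzy"),
    "depr":    ("Były", "dobre"),
}

def sentence(lemma, depreciativity):
    """Build 'They were good Xs' with dependents agreeing with the noun form."""
    verb, adjective = AGREEMENT[depreciativity]
    return f"{verb} to {adjective} {NOM_PL[lemma][depreciativity]}."

print(sentence("pirat", "neutral"))
print(sentence("pirat", "depr"))
```

Because the agreeing forms depend on the value and not on the shape of the noun, the same mechanism also covers nouns such as żołnierz, where both noun forms coincide.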
Interestingly enough, the “neutral” form has — in terms of superficial morphology
— special features, while the depreciative one is totally “regular”, i.e., it is derived
like the only nominative plural form of nouns of other genders representing the
same inflectional pattern. For instance, for the noun pirat the depreciative form
is piraty (like the nom. pl. of the noun aparat (m3) ‘device’: aparaty, termit (m2)
‘termite’ — termity, or chata (f): chaty) and the “neutral” form piraci has no analogy
in the paradigms of nouns of other genders.
There are nouns (mostly with some tint of expressiveness) for which the depreciative forms are much more frequent, e.g., przedszkolaczek ‘kindergarten pupil’: (te)
przedszkolaczki (były), cham ‘boor’: (te) chamy (były), and the existence of the neutral,
“personal” form is debatable. In informal surveys, Poles, as a rule, are
not sure whether one can say (ci) przedszkolaczkowie (byli) or (ci) chami/chamowie (byli), and
reply that these forms do not exist; however, it is possible to find instances of them in
texts (especially on the Internet). Therefore we must treat them as possible, permitted
both by the system and by usage. We treat depreciativity as a grammatical category.
The meaning of depreciative forms varies for individual nouns; however, as a rule
they are marked. Most often they express disgust, aversion, contempt, disrespect.
The typical contrast is the following: dobrzy kierowcy ‘good drivers’ may be said with
approbation; dobre kierowce is unambiguously ironic. Nevertheless, it sometimes happens that the depreciative is used in archaization or introduces a special, informal
mood, e.g. for chłop in a secondary meaning ‘man’: dobrzy chłopi ‘good men’ (formal,
unusual) and dobre chłopy ‘good guys’.
4. The Scope of the Dictionary
The core of the SGJP vocabulary consists of words found in readily available sources:
dictionaries and texts. Although the vocabulary is large, it does not include all Polish
words. It is hoped, however, that the dictionary includes all possible inflectional
patterns for all inflecting lexemes of Polish.
Below we present data concerning main classes and subclasses of entries, as
well as the inflectional patterns in SGJP for particular classes. Among the entries,
in addition to lexemes whose characterization is the task of SGJP, there are prefixes
occurring in texts as initial parts of words. They are very productive and can be used
spontaneously to derive new lexical units, which can be found in texts rarely — such
new units are inflected like the basic lexemes without prefixes.
Class                                  Entries   Patterns
total                                  244,669      1,095
  prefixes                                  81          2
  lexemes                              244,588      1,095
    nouns                              135,529        762
      pronouns                               6          6
      gerunds                           29,590          2
      -ość                              28,980          1
      others                            68,171
        proper                           8,782
        common                          59,389
    adjectives                          65,671         71
      participles                       34,301          1
        active                          13,931
        passive                         20,370
      “regular”                         31,370         71
        comparative                        950          1
        positive                        30,420         71
    deadjectival adverbs                11,146          1
      comparative                        1,106          1
      positive                          10,040          1
    numerals                                98         45
    verbs                               29,532        215
      predicatives                          35          1
      conjugated                        29,497        214
    others                               2,612          2
      other adverbs                        491          1
      particles                            193          1
      prepositions                         113          2
      conjunctions                         121          1
      interjections and the like           458          1
      abbreviations                      1,117          1
      others                               119          1
In the above list we put before the core of a given class of lexemes the subclasses that
have properties atypical or less typical for that class. For nouns, this category includes
groups of deverbal and deadjectival nouns as well as a group of several substantive pronouns (ja, ty, my, wy, on, się) whose relation to the category of gender is complicated
(Polish nouns have stable gender, whereas, e.g., forms of the lexeme ja occur with verbal
and adjectival forms of various gender values: ja byłem ‘I was’ (m), ja byłam ‘I was’ (f)),
and whose paradigm is idiosyncratic from the point of view of Polish inflection as a
whole. Contrary to the tradition of Polish grammarians, we treat differentiated groups
of adjectival and adverbial forms of positive and comparative degrees as separate lexemes
(too few adjectives participate in this opposition to accept it as having an inflectional
character) — the comparatives are listed in the table before the positives. In the class of
verbs we distinguish the so-called predicatives, verbs derived from words of another
type and having no superficial features of conjugation — no specific verbal endings,
like potrzeba (derived from the nominative singular of the noun potrzeba ‘need’) or
niepodobna ‘it is impossible’ (derived from an adjective form), in contrast to regular
verbs, which are inflected by means of typical conjugational morphemes.
We have set apart subclasses of lexemes that have not been entered in SGJP independently: deverbal nouns (gerunds) and adjectives (participles) derived automatically from verbal entries; names of attributes (properties) and regular deadjectival
adverbs — from adjective entries. Such entries, although potentially existing, might
not have been verified in texts and corpora and have been less carefully characterized (elaboration of lexical items varies depending on their stylistic status and their
frequency in contemporary Polish texts).
The consequence of this decision is the number of entries in SGJP (shown in
the lower part of the screen): in the published version there are 244 669 entries (lexemes) and 4 223 981 word forms (counting syncretic forms of the same lexeme as
one unit). However, we treat this number as overestimated. Four classes of derived
lexemes number ca. 100 000 units; however, about half of them are quite regular,
neutral lexemes of contemporary Polish. Therefore in informative and advertising
materials we define the size of SGJP as ca. 180 000 lexemes.
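The figures quoted here can be cross-checked against the entry counts in the table of classes. A short sketch of that arithmetic; the counts are copied from the table, while the variable names and the grouping into dictionaries are ours:

```python
# Entry counts for the main classes of lexemes, as given in the table.
classes = {
    "nouns": 135_529, "adjectives": 65_671, "deadjectival adverbs": 11_146,
    "numerals": 98, "verbs": 29_532, "others": 2_612,
}
prefixes = 81

# The four classes derived automatically from verbal and adjectival entries.
derived = {
    "gerunds": 29_590, "-ość nouns": 28_980,
    "participles": 34_301, "deadjectival adverbs": 11_146,
}

lexemes = sum(classes.values())
print(lexemes)                 # all lexemes
print(lexemes + prefixes)      # all entries, prefixes included
print(sum(derived.values()))   # automatically derived lexemes
```

The class counts add up to the 244 588 lexemes and 244 669 entries reported, and the four derived classes together give 104 017 units, which is the "ca. 100 000" mentioned above.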
It is worth adding that there are groups of lexemes that are characterized in SGJP although they are not on the list of its entries. This problem will be discussed below.
SGJP, unlike the most popular general monolingual Polish dictionaries, includes
the most frequent and most useful proper names, geographical and personal (first
names used by Poles and two categories of surnames: popular or belonging to famous persons).
We present the number of inflectional patterns for the classes and subclasses
here unsystematically — for general illustration of our principles. In any case, the
subclasses of atypical lexemes as well as the classes of lexemes derived automatically are associated with few patterns; it is the main class of lexemes that produces
the diversity of patterns in each part of speech. Parenthetically, we feel obliged to
explain that the second pattern for prepositions and prefixes is connected with units
which in texts can occur with or without e, e.g. pode mną vs. pod tobą ‘under me/you’,
pode+przeć vs. pod+parł ‘support’ (two forms).
5. Information Provided by the Dictionary
The Grammatical Dictionary of Polish has as its goal a complete grammatical characterization of contemporary Polish vocabulary. However, such a rigorous description
is possible in different respects and to different degrees.
First of all, in order to describe textual units it is necessary to classify and group
them into lexical units, or lexemes, organized in a strict and fixed manner. This
grouping and organizing is the subject of the inflectional description.
Therefore we strive for completeness of the inflectional description
(complete information on form variation for virtually all Polish inflecting lexemes).
This means that each form of any lexeme is included with all values of all morphological categories (categories for which a given lexeme inflects). However, this does
not mean that all values are visible at any particular moment. Some word forms
(identified as strings of letters) are syncretic (connected with several combinations
of specific values) and sometimes their shape is more important than their function.
Therefore we decided to introduce lexeme forms in two stages. In the first stage
(surface) we are interested in the textual exponent of the given form, in the second
(deep) in its function (vide Mel’čuk 1974). The two stages of inflectional description
will be illustrated below.
Consequently, the most important syntactic functions of word forms have been
defined in the paradigms. They can be generalized to a basic syntactic characterization of lexemes. However, it is only a general, limited characterization. It can be
said that the syntax in SGJP is described to the extent to which a clear formalized
approach was feasible. For nouns the dictionary defines their gender (on a high level
of precision, with masculine, neuter, and pluralia tantum nouns split into subclasses
— vide above); for numerals it defines the type of their syntactic relation with nouns;
for verbs, besides perfective/imperfective aspect, it provides information about transitivity and reflexiveness (co-occurrence with się), obligatory or optional. Case
government (except for the nominative-subject) of verbs is rarely defined — for
reasons of organization of the whole work; however it is systematically introduced
for prepositions. Generally speaking, non-inflected lexemes are provided with their
part-of-speech feature, and valence information is added where instructive (case
government for prepositions, type of conjoined phrase for conjunctions).
Moreover, the entries of SGJP also contain some information about the derivational features of lexical units. This information is given by links between lexemes,
e.g., between elements of aspectual pairs for verbs, between a verb and its nominal
derivatives (gerunds and participles), between adjectives and their regular derivatives: adverbs and nouns (substantival names of qualities), between positive and
comparative adjectives.
5.1. What is a Lexeme in SGJP?
SGJP contains no definitions. Short glosses suggesting the meaning are included
for homonymous entries. Homonymy is treated purely formally. The main reason
to consider lexemes as different is the existence of differing inflectional paradigms.
For example, SGJP contains only one lexeme para although the word can have two
clearly distinct meanings: ‘couple’ and ‘vapor’. Both meanings lead to exactly the
same paradigm. In this case we do not need glosses.
When the difference in meaning is accompanied by a difference in some grammatical features
included in the dictionary, the lexemes are differentiated.
Thus three different lexemes bokser are distinguished, because three different
meanings are associated with three sets of forms, each having its own syntactic features. This is reflected in their gender: it is m1 for bokser ‘athlete’, m2 for bokser
‘breed of dogs’, and m3 for bokser ‘kind of car engine’.
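This purely formal treatment of homonymy amounts to grouping senses by headword and grammatical characterization, ignoring meaning. A minimal sketch, with gender standing in for the full paradigm; the data layout and function name are hypothetical, and the gender f for para is an assumption made for this sketch (it is not stated above):

```python
# (headword, gloss, gender). Genders for bokser are those given above;
# the gender of para (f) is an assumption made for this sketch.
SENSES = [
    ("para", "couple", "f"),
    ("para", "vapor", "f"),
    ("bokser", "athlete", "m1"),
    ("bokser", "breed of dogs", "m2"),
    ("bokser", "kind of car engine", "m3"),
]

def lexemes(senses):
    """Group senses into lexemes: same headword and same grammar = one lexeme."""
    grouped = {}
    for headword, gloss, gender in senses:
        grouped.setdefault((headword, gender), []).append(gloss)
    return grouped

for (headword, gender), glosses in lexemes(SENSES).items():
    print(headword, gender, glosses)
```

Under this grouping the two senses of para collapse into one lexeme that needs no gloss, while bokser yields three lexemes whose glosses serve only to tell them apart.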
Some exceptions are possible for lexical units with a clearly defined unique meaning but vague gender, e.g. człowieczysko ‘great good chap’ m1 / n2 or cabernet
‘Cabernet’ m2 / m3 / n2.
There are also lexical units that are treated in SGJP as lexemes and characterized
grammatically (i.e., inflectionally) although they are not included explicitly in the list
of entries. One example is the superlative: a search for any superlative form refers
the user to the corresponding comparative (superlatives in Polish are derived from
comparatives with the prefix naj+). Similarly, negated adjectives with the prefix
nie+ are not included in the list of entries but searching for one refers the user to
its non-negative counterpart.
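The redirection of searches described above can be sketched as a lookup that first tries the query as an ordinary entry and only then strips a prefix. The miniature entry list and the function name are hypothetical; how the SGJP program implements the search is not described here:

```python
# Hypothetical miniature list of adjective entries with their degree.
ENTRIES = {"dobry": "positive", "lepszy": "comparative", "niebieski": "positive"}

def resolve(query):
    """Return the entry a query is referred to, or None if nothing matches."""
    if query in ENTRIES:  # ordinary entry, e.g. niebieski
        return query
    # Superlative: naj+ attaches to a comparative entry.
    if query.startswith("naj") and ENTRIES.get(query[3:]) == "comparative":
        return query[3:]
    # Negated adjective: nie+ attaches to a listed adjective.
    if query.startswith("nie") and query[3:] in ENTRIES:
        return query[3:]
    return None

print(resolve("najlepszy"), resolve("niedobry"), resolve("niebieski"))
```

Checking the entry list first matters: niebieski ‘blue’ begins with nie but is an entry in its own right, so it must not be treated as a negated form.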
5.2. Examples
How we present the dictionary information will be illustrated with examples of
typical entries belonging to three main classes of lexemes. We begin with adjectives
because they best illustrate our method. Nouns and verbs involve specific problems
and will be shown in the next section.
5.2.1. Adjectives
The paradigm of krakowski ‘Cracovian’ shown in Figure 1 is organized like paradigms
traditionally given in textbooks of Polish grammar, but it contains some original solutions proposed by the authors (the system of genders, depreciativity). The paradigm
itself is contained in the upper part of the table below the header (containing four
elements: the headword, the grammatical qualification — przymiotnik ‘adjective’, the
note about its presence in SJPDor., and the symbol of its inflectional pattern, P08).
In the last two lines of this part of the paradigm, following the lines labeled with
the names of cases, we place forms that are not normally included in the paradigm:
krakowsko, marked Złoż. (Pol. złożenie ‘composition’), and krakowsku, marked
C.(po). The first is used as the non-final component of compounds (hundreds of
thousands of examples on the Internet, especially Jura Krakowsko-Częstochowska in
various cases, but with the same shape of the component krakowsko); the second is
used after the preposition po (tens of thousands of examples on the Internet). It can
be systematically derived from adjectives ending with -sk(i), -ck(i), -dzk(i), and also
from denominal adjectives produced from proper names (e.g. Putin → putinowski →
(po) putinowsku). Parallel constructions (having the same meaning) derived from other
adjectives contain the regular dative form (of masculine or neuter), e.g. traktowała
go nie po macierzyńsku, ale po macoszemu ‘she treated him not like a mother, but like a
stepmother’ (macierzyńsku is the C.(po) of macierzyński, whose masculine dative is
macierzyńskiemu; macoszemu is the regular masculine dative of macoszy). Therefore
the form under discussion is marked as a special form of the dative (C., celownik
in Polish). The forms Złoż. and C.(po) are inflectionally regular; it is strange that
they were not treated this way in earlier works.
Figure 1. An adjective entry with all forms (deep paradigm).
The lower part of the table is devoted to regular derivatives. Both lexemes given
in the sample entry, the adverb krakowsko and the nominal quality name krakowskość,
have visible drawbacks. The noun is rare and the adverb is only potential, on the
border of acceptability (the construction po krakowsku is normally used in the adverbial function parallel to krakowski). Such derivatives are included automatically in the list of
entries in SGJP, although only some of them can be easily found on the Internet in
many instances (e.g. rosyjskość, bohaterskość or mistrzowsko, barbarzyńsko; cf. rosyjski
‘Russian’, bohaterski ‘heroic’, mistrzowski ‘masterly’, barbarzyński ‘barbarous’).
Possible elimination of such units should be discussed before the next release of
our dictionary.
The screen presented in Figure 1 is shown when the option Wszystkie formy ‘All
forms’ in the menu Odmiana ‘Inflection’ is chosen; each form is then displayed together with its grammatical function. This way of presentation can be called deep morphology. On
the other hand, it is possible to choose Formy bazowe ‘Basic forms’ (we can call this
presentation surface), in which the function is disregarded and only the shape is of concern.
Figure 2. An adjective entry with basic forms only (surface paradigm).
In the array in Figure 2, the forms are presented in one column and numbered:
1–12 and 3+. It is sufficient to have 11 differentiated shapes to express all distinctions of case, number, and gender (12 and 3+ are designed for the forms Złoż. and
C.(po)). This is universal for almost all Polish adjectives (except for several so-called
adjectival pronouns, which have two additional differentiations).
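The count of 11 shapes can be illustrated by writing out the case, number, and gender cells of krakowski and collecting the distinct strings. In this sketch the forms are typed in by hand from standard Polish declension and the gender groups are merged wherever their forms coincide; both are assumptions of the example, not SGJP data, and the numbering 1–12 used by the program is not reproduced:

```python
CASES = ["nom", "gen", "dat", "acc", "inst", "loc"]

# Deep paradigm of krakowski: one row per number/gender group, in CASES order.
# Vocative and depreciative cells are left out; they repeat shapes listed here.
ROWS = {
    ("sg", "m1/m2"): "krakowski krakowskiego krakowskiemu krakowskiego krakowskim krakowskim",
    ("sg", "m3"):    "krakowski krakowskiego krakowskiemu krakowski krakowskim krakowskim",
    ("sg", "n"):     "krakowskie krakowskiego krakowskiemu krakowskie krakowskim krakowskim",
    ("sg", "f"):     "krakowska krakowskiej krakowskiej krakowską krakowską krakowskiej",
    ("pl", "m1"):    "krakowscy krakowskich krakowskim krakowskich krakowskimi krakowskich",
    ("pl", "other"): "krakowskie krakowskich krakowskim krakowskie krakowskimi krakowskich",
}

# One cell per (number, gender group, case).
deep = {
    (number, gender, case): form
    for (number, gender), row in ROWS.items()
    for case, form in zip(CASES, row.split())
}

# Surface view: only the distinct written shapes matter.
shapes = sorted(set(deep.values()))
print(len(deep), len(shapes))
```

For this entry the 36 cells written out here collapse into 11 distinct shapes; krakowsko and krakowsku, the forms marked Złoż. and C.(po), are the two additional ones.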
5.2.2. Nouns
Nominal entries are much simpler, because they are organized according to two
main inflectional categories: case and number.
As an example for Figure 3 we chose a feminine noun: kopalnia ‘mine’. It does
not have the category of depreciativity, which applies only to virile nouns (m1).
However, there is a specific grammatical opposition in it. In the genitive plural two
forms occur: the first is syncretic with some forms of the singular, the second one is
specific, used only for this combination of grammatical values. This contrast is well
known to Polish grammarians. We have introduced it as an inflectional category.
It is also possible to present textual manifestations of these forms, basic forms, as
in Figure 4.
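The relation between the two presentations can be sketched as a regrouping: the deep paradigm maps each grammatical function to its forms, and the surface view maps each distinct written shape back to the functions it expresses. The forms of kopalnia below are typed in by hand from standard Polish declension and the names are hypothetical; nothing here is copied from SGJP data:

```python
from collections import defaultdict

# Deep paradigm of kopalnia: (case, number) -> forms.
DEEP = {
    ("nom", "sg"): ["kopalnia"],   ("nom", "pl"): ["kopalnie"],
    ("gen", "sg"): ["kopalni"],    ("gen", "pl"): ["kopalni", "kopalń"],
    ("dat", "sg"): ["kopalni"],    ("dat", "pl"): ["kopalniom"],
    ("acc", "sg"): ["kopalnię"],   ("acc", "pl"): ["kopalnie"],
    ("inst", "sg"): ["kopalnią"],  ("inst", "pl"): ["kopalniami"],
    ("loc", "sg"): ["kopalni"],    ("loc", "pl"): ["kopalniach"],
    ("voc", "sg"): ["kopalnio"],   ("voc", "pl"): ["kopalnie"],
}

def surface(deep):
    """Group grammatical functions under each distinct written shape."""
    shapes = defaultdict(list)
    for function, forms in deep.items():
        for form in forms:
            shapes[form].append(function)
    return dict(shapes)

SURFACE = surface(DEEP)
print(len(SURFACE))
print(SURFACE["kopalni"])
print(SURFACE["kopalń"])
```

The shape kopalni is shared by three singular cells and the genitive plural, whereas kopalń expresses the genitive plural only, which is the contrast described above.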
Figure 3. A noun entry with all forms (deep paradigm).
The basic forms are chosen on the basis of the “type of inflection”, i.e., the
syncretisms occurring in the given type of nominal pattern. We distinguish the
following types of inflection: masculine, feminine, neuter, and a special one for non-inflecting nouns. Among the basic forms presented in Figure 4 there is no locative
singular, because in the feminine pattern its form is always syncretic with the dative.
The accusative plural is omitted in all patterns, because it is always syncretic with the
nominative or the genitive, depending on the gender.
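The role of syncretism in choosing basic forms can be illustrated with a short Python sketch. It groups the cells of a paradigm by shape, so that cells sharing a shape need only one basic form. The paradigm of kopalnia is typed in by hand here; the cell labels and the function name `syncretisms` are illustrative assumptions and do not reflect SGJP's internal representation.

```python
from collections import defaultdict

# Paradigm of kopalnia 'mine', entered by hand for illustration only.
# Keys are (number, case) cells; values list the shapes filling that cell.
KOPALNIA = {
    ("sg", "nom"): ["kopalnia"],
    ("sg", "gen"): ["kopalni"],
    ("sg", "dat"): ["kopalni"],
    ("sg", "acc"): ["kopalnię"],
    ("sg", "inst"): ["kopalnią"],
    ("sg", "loc"): ["kopalni"],
    ("sg", "voc"): ["kopalnio"],
    ("pl", "nom"): ["kopalnie"],
    ("pl", "gen"): ["kopalni", "kopalń"],  # two forms, as discussed in the text
    ("pl", "dat"): ["kopalniom"],
    ("pl", "acc"): ["kopalnie"],
    ("pl", "inst"): ["kopalniami"],
    ("pl", "loc"): ["kopalniach"],
    ("pl", "voc"): ["kopalnie"],
}


def syncretisms(paradigm):
    """Map each distinct shape to the list of cells it fills."""
    cells_by_shape = defaultdict(list)
    for cell, shapes in paradigm.items():
        for shape in shapes:
            cells_by_shape[shape].append(cell)
    return dict(cells_by_shape)


groups = syncretisms(KOPALNIA)
for shape, cells in groups.items():
    print(shape, cells)
```

In this hand-typed paradigm the locative singular shares its shape with the dative singular, and the accusative plural with the nominative plural, which is why neither needs a separate basic form.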
Let us note that in the surface paradigm, as illustrated in Figure 4, some parts
of forms are distinguished by colors: a word is divided into a stable part (the letters
which occur in any form presented in the paradigm) and a changeable part specific
for the form. Quite commonly these parts are not what could be called from the
linguistic point of view a stem or an ending.
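A minimal sketch of such a division, assuming that the stable part is simply the longest prefix shared by all shapes in the paradigm (the text does not specify how SGJP computes it). The list of shapes is typed in by hand and the function name `split_shapes` is hypothetical.

```python
from os.path import commonprefix

# Distinct shapes of kopalnia 'mine', typed in by hand for illustration.
BASIC_SHAPES = [
    "kopalnia", "kopalni", "kopalnię", "kopalnią", "kopalnio",
    "kopalnie", "kopalń", "kopalniom", "kopalniami", "kopalniach",
]


def split_shapes(shapes):
    """Split every shape into (stable part, changeable part).

    Assumption: the stable part is the longest common prefix of all shapes.
    commonprefix compares character by character, so it works on any strings.
    """
    stable = commonprefix(shapes)
    return [(stable, shape[len(stable):]) for shape in shapes]


for stable, changeable in split_shapes(BASIC_SHAPES):
    print(f"{stable}|{changeable}")
```

Under this assumption the stable part is kopal, because the shape kopalń has ń where the other shapes have n. The split therefore does not coincide with a linguistic stem and ending, in line with the remark above.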
5.2.3. Verbs
Conjugation is the most complicated part of Polish inflection. The full (deep)
paradigm with the explicit description of the functions of all forms — in a shape we
can present on paper — would be awkward and non-illustrative. Therefore we will
show several simpler illustrations.
Figure 4. A noun entry with basic forms only (surface paradigm).
To get an overall view of the main verbal categories, let us consider the
forms of the secondary predicative można ‘it is possible to’ (derived
from the feminine nominative singular of the adjective możny ‘mighty’, which is obsolete
but still occurs in contemporary Polish). It has only one synthetic (and basic) form;
however, it can be inflected analytically for mode and tense (see Figure 5).
The same scheme is also used for other verbs (we call them niewłaściwe ‘improper’)
that are used in constructions without any subject-nominative. As a result they do
not inflect for person, number, and gender (verb forms agree in that respect with
the subject-nominative). However, some verbs of this type, such as brakować ‘not
suffice’, are constructed regularly in their surface paradigms. Such a paradigm is
shown in two variants in Figures 6 and 7.
The full (deep) paradigm of a typical verbal lexeme defines, in a given mode and
tense, all distinctions of person, number, and gender with variants for the position
of movable morphemes (like (e)m or byśmy).
The conjugational tables for SGJP were worked out according to methods used
in the reference book on Polish conjugation (Saloni 2001). In Figure 8 we quote
from this book the inflectional table for the verbs bóść and ubóść ‘hit with the
horns’ (in fact, for the pattern represented by these two verbs, which are an
aspectual pair).
Figure 5. The verb entry showing the deep paradigm of można.
Figure 6. The verb entry showing the deep paradigm of brakować.
Figure 7. The verb entry showing basic forms (surface paradigm) of brakować.
Figure 8. Inflectional tables for the verbs bóść and ubóść (Table 28 from Saloni 2001, p. 80).
In SGJP the paradigms are derived separately for each lexeme, and hence for
each verb (on the basis of its pattern and other grammatical features); as a result the
tables for the imperfective bóść and the perfective ubóść are created separately. An
interested reader can look at the paradigms on the computer. To give an idea of the
complexity of the conjugation presented in SGJP, we show in Figure 9 only the
surface variant (basic forms) of the paradigm of bóść.
Figure 9. A verb entry from SGJP: surface paradigm of bóść.
The headword is given with the optional reflexive pronoun się in brackets.
Immediately below that, the header contains information about the aspect of the
lexeme bóść, its “właściwość” (occurring with a subject), transitivity, and its presence
in SJPDor., as well as its conjugational pattern (the classification of patterns is based
on Tokarski’s systematization). At the bottom are references to its aspectual counterpart and regular derivatives: nominal (odsłownik ‘gerund’) and adjectival (imiesłów
przymiotnikowy czynny i bierny ‘active and passive participles’).
The set of 12 basic forms is the minimal one: it is needed in order to derive all
forms of all verbal patterns, including derivatives. For the pattern given for bóść each
basic form has two variants, and both serve to derive non-basic forms; as a result, the
broad paradigm (i.e. including forms of both participles and the gerund) of the verb
bóść contains 85 different synthetic forms: 8 nominal, 31 adjectival (including variants),
and 46 purely verbal (including the non-finite forms bóść and bodąc). All are presented in
the full (deep) version of the paradigms, either of the verb or of its derivatives.
6. The Organization of Data in SGJP
Due to the large amount of data involved, SGJP was developed using relational
database tools. This approach proved useful in an earlier work, Czasownik polski
(Saloni 2001; cf. Saloni and Woliński 2003, 2004). In that project information on
the inflection of over 29 000 Polish verbs was entered into a database and refined
over the several years of the project. Finally, inflectional tables
for verbs and the dictionary part of the work were generated from the database and
typeset automatically.
Within the present project the entire material of the dictionary was organized
in a similar way. At an early stage Woliński developed a relational model of Polish
inflection, so we were able to describe linguistic phenomena within the database
framework (Woliński 2007). (A different organization of work, in our opinion less
convenient, would have been to use the database merely as a means of storage and to
resort to other facilities, e.g., to generate all inflected forms from the dictionary data.)
As a result, our database describes all subtleties of Polish inflection within a uniform
and relatively compact relational model. It is important to stress that the published
version of the dictionary is only one of the possible uses of the underlying database. The
data could easily be used in various systems for natural language processing.
From the technical point of view, data for each grammatical class was kept in
a separate MS Access file that was maintained by one of the authors. The form of data
used in the user interface was generated on a Linux system with Perl scripts and the
SQLite tool. In the next stages of SGJP’s development we intend to build a web-based application to enable the authors to cooperate more closely.
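To make the idea of describing inflection inside a relational database more concrete, here is a small Python sketch using the standard sqlite3 module. The schema is purely hypothetical: the tables `lexeme` and `ending`, the pattern label `F-nia`, and the tag names are invented for this illustration and are not the relational model of Woliński (2007). The point is only that inflected forms can be obtained with a join rather than by code outside the database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy, hypothetical schema."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE lexeme (lemma TEXT PRIMARY KEY, stem TEXT, pattern TEXT);
        CREATE TABLE ending (pattern TEXT, tag TEXT, suffix TEXT);
        """
    )
    # Two feminine nouns sharing one (invented) pattern label.
    con.executemany(
        "INSERT INTO lexeme VALUES (?, ?, ?)",
        [("kopalnia", "kopalni", "F-nia"), ("cukiernia", "cukierni", "F-nia")],
    )
    # Only a fragment of the singular is listed, for brevity.
    con.executemany(
        "INSERT INTO ending VALUES (?, ?, ?)",
        [
            ("F-nia", "sg:nom", "a"),
            ("F-nia", "sg:gen", ""),
            ("F-nia", "sg:acc", "ę"),
            ("F-nia", "sg:inst", "ą"),
        ],
    )
    return con


def forms(con, lemma):
    """Derive the forms of one lexeme by joining it with its pattern's endings."""
    rows = con.execute(
        "SELECT e.tag, l.stem || e.suffix "
        "FROM lexeme AS l JOIN ending AS e ON e.pattern = l.pattern "
        "WHERE l.lemma = ?",
        (lemma,),
    )
    return dict(rows)


con = build_demo_db()
print(forms(con, "kopalnia"))
print(forms(con, "cukiernia"))
```

Because both lexemes point to the same pattern, adding an ending row once makes the corresponding form available for every lexeme of that pattern.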
7. The Program
We attempted to harmonize solutions intended for various users: professional linguists
and laymen (having only basic educational background) seeking immediate grammatical help. We wanted to make our dictionary user-friendly and introduced many
graphic solutions (the organization of inflectional tables, distinctions, colors, etc.).
The structure of typical entries was described above. Below we will consider the
structure of the list of entries and the search methods.
7.1. The List of Entries
The great advantage of a computerized dictionary over a traditional one is its flexibility.
In SGJP entries can be organized in various ways. In particular, the list of entries
(in the left part of the window) can be displayed in several ways.
First, it can be put into two orders: ordinary (a fronte) and reverse (a tergo). In both
cases the headword is provided with a simplified and abbreviated qualification of the
lexeme (repeated in the full form in the header of the entry).
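The two orders can be sketched in a few lines of Python. In an a tergo list the headwords are compared from their last letter backwards, which amounts to sorting by the reversed string. The function names are hypothetical, and the sketch uses plain code-point comparison, which does not follow the Polish alphabet for letters with diacritics; the sample headwords are therefore limited to ASCII.

```python
def a_fronte(headwords):
    """Ordinary alphabetical order (plain code-point comparison)."""
    return sorted(headwords)


def a_tergo(headwords):
    """Reverse order: compare words from their last letter backwards."""
    return sorted(headwords, key=lambda word: word[::-1])


HEADWORDS = ["oda", "mama", "kopalnia", "woda"]
print(a_fronte(HEADWORDS))  # ['kopalnia', 'mama', 'oda', 'woda']
print(a_tergo(HEADWORDS))   # ['oda', 'woda', 'kopalnia', 'mama']
```

In the a tergo list, oda and woda become neighbours because they share their ending, which is what makes this order useful for studying inflection.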
In addition, the content of the displayed list is changeable. There are three possibilities: the user may display the full list of entries (more exactly, their headwords), reduce it to one of five classes of lexemes (nouns, adjectives, numerals,
verbs, other), or list all the wordforms occurring in SGJP. In any of
these displays the number of units presented on the list is shown in the lower part of
the screen (those numbers are given in the table above). The main classes of lexemes
(together with derivatives) are displayed on backgrounds of different colors.
7.2. Search
Of course, it is possible to search in the dictionary for any headword of any lexeme
(found on the list of entries or typed from the keyboard).
Moreover, it is possible to find a lexeme through any of its forms. For example,
if we type ód into the query window (when the list of entries includes nouns), we
obtain the information on the lexeme oda ‘ode’ (its genitive plural has the
shape ód). If the word typed in is homonymic, i.e., it can be interpreted as a form of
several lexemes, only one of them is shown at first. However, all possible
interpretations are easy to find. When we press the Enter key or the button Szukaj ‘search’ in the
upper part of the panel of the list of entries, an additional small window appears
containing a “sublist of suggested entries”, each of which includes the given homonymic
word among its forms. For example, such a sublist for the word mam
(see Figure 10) contains the lexemes mama (noun) ‘mom’, mamić (verb) ‘beguile’, and
mieć (verb) ‘have’; for the word żółci it contains the lexemes żółć (noun) ‘bile’, żółcić
(verb) ‘make yellow’, and żółty (adjective) ‘yellow’. When we click on one of them, we
see the chosen entry.
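A search of this kind can be pictured as a lookup in an inverted index from wordforms to lexemes. The following Python sketch is only an illustration under that assumption; it says nothing about how the SGJP program is actually implemented. The names `FORMS`, `build_index`, and `lookup` are hypothetical, and the form lists are hand-typed and deliberately incomplete for the two verbs.

```python
from collections import defaultdict

# (headword, class) -> a hand-typed selection of its forms.
FORMS = {
    ("mama", "noun"): {"mama", "mamy", "mamie", "mamę", "mamą", "mamo",
                       "mam", "mamom", "mamami", "mamach"},
    ("mamić", "verb"): {"mamić", "mamię", "mamisz", "mami", "mam"},
    ("mieć", "verb"): {"mieć", "mam", "masz", "ma", "mamy", "macie", "mają"},
    ("oda", "noun"): {"oda", "ody", "odzie", "odę", "odą", "odo",
                      "ód", "odom", "odami", "odach"},
}


def build_index(forms_by_lexeme):
    """Invert the lexeme -> forms mapping into form -> set of lexemes."""
    index = defaultdict(set)
    for lexeme, forms in forms_by_lexeme.items():
        for form in forms:
            index[form].add(lexeme)
    return index


INDEX = build_index(FORMS)


def lookup(form):
    """Return all lexemes that have the given shape among their forms."""
    return sorted(INDEX.get(form, set()))


print(lookup("ód"))   # [('oda', 'noun')]
print(lookup("mam"))  # [('mama', 'noun'), ('mamić', 'verb'), ('mieć', 'verb')]
```

The result for mam corresponds to the sublist of suggested entries described above: the shape is a genitive plural of mama, an imperative of mamić, and a first-person singular present of mieć.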
Additionally, it is possible to reconstruct with the help of the dictionary the paradigm of a lexeme that does not occur on the list of entries. Methods for such an
advanced search are discussed (in Polish) in the instructions to SGJP, contained in
the program’s help file and in the printed booklet.
Figure 10. The result of typing mam into the search box, plus Enter or Szukaj ‘search’.
Note the sublist immediately below the search box.
8. Perspectives
8.1. Planned Improvements
The dictionary may serve as a source of research in the domain of inflection and,
to some extent, the syntax of Polish. It may also be useful for teaching Polish, especially
to foreigners.
In our work on SGJP we chose the extensive method, that is, we included a great number
of entries. Even so, extending the dictionary further, mainly with proper names, seems
desirable for the future.
In some instances this breadth unfortunately came at the cost of depth of
description. We therefore plan the following improvements to the data:
— enrichment of the entries with more labels, glosses, notes, etc.;
— more in-depth study of depreciative forms, non-obvious genders for nouns,
and other debatable phenomena;
— systematic classification of inflectional patterns;
— introduction of information on corpus frequency of lexemes (e.g., a view
of the 1000, 10,000 or 50,000 most frequent lexemes of Polish).
It is also possible to make some improvements in the interface, mainly to add:
— possibility of filtering by inflectional patterns;
— possibility of user-defined views of the list of entries (by grammatical classes,
inflectional patterns, frequency, conditions on endings, arbitrary forms,
etc.).
8.2. Conclusion
The first edition of SGJP has just been published. It provides an extensive grammatical description of Polish words. We believe that this is the first time so rigorous
a description has been carried out with sufficient precision. However, we see a real possibility
of many improvements, so we hope the first edition of SGJP will not be the last.
Bibliography
DOROSZEWSKI, Witold, ed. (1958–1969): Słownik języka polskiego PAN. 1–11. — Warszawa: Wiedza
Powszechna — PWN (abbr. SJPDor.).
GROCHOWSKI, Maciej (1997): Wyrażenia funkcyjne. Studium leksykograficzne. — Kraków: IJP PAN.
GRUSZCZYŃSKI, Włodzimierz (1989): Fleksja rzeczowników pospolitych we współczesnej polszczyźnie pisanej.
— Wrocław: Ossolineum.
GRUSZCZYŃSKI, Włodzimierz, SALONI, Zygmunt (1978): Składnia grup liczebnikowych we współczesnym
języku polskim. — [In:] Roman LASKOWSKI and Zuzanna TOPOLIŃSKA (eds.): Studia gramatyczne
II, Wrocław: Ossolineum, 17–42.
GRZEGORCZYKOWA, Renata, PUZYNINA, Jadwiga, eds. (1973): Indeks a tergo do Słownika języka polskiego
pod redakcją Witolda Doroszewskiego. — Warszawa: PWN.
LASKOWSKI, Roman (1984): Wyraz — Funkcjonalna klasyfikacja leksemów — Podstawowe pojęcia fleksji
— Kategorie morfologiczne języka polskiego — fleksja funkcjonalna. — [In:] Renata GRZEGORCZYKOWA,
Roman LASKOWSKI, Henryk WRÓBEL (eds.): Gramatyka współczesnego języka polskiego. Morfologia;
Warszawa: PWN (2nd ed. 1998) 33–65 and 125–224.
MAŃCZAK, Witold (1956): Ile rodzajów jest w polskim? — Język Polski, XXXVI, 116–121.
MEL’ČUK, Igor A. (1974): Опыт теории лингвистических моделей «Смысл ⇔ Текст» — Семантика,
синтаксис. — Москва: Наука.
SALONI, Zygmunt (1974): Klasyfikacja gramatyczna leksemów polskich. — Język Polski, LIV, 3–13 and
93–101.
SALONI, Zygmunt (1976a): Cechy składniowe polskiego czasownika. — Wrocław: Ossolineum.
SALONI, Zygmunt (1976b): Kategoria rodzaju we współczesnym języku polskim. — [In:] Roman
LASKOWSKI (ed.): Kategorie gramatyczne grup imiennych. Materiały konferencji, Wrocław: Ossolineum,
43–78 and 96–106.
SALONI, Zygmunt (1977): Kategorie gramatyczne liczebników we współczesnym języku polskim.
— [In:] Roman LASKOWSKI and Zuzanna TOPOLIŃSKA (eds.): Studia gramatyczne [I], Wrocław:
Ossolineum, 145–173.
SALONI, Zygmunt (1979): [Rev. of:] ZALIZNJAK (1977). — International Review of Slavic Linguistics 4,
241–250.
SALONI, Zygmunt (1981): Uwagi o opisie fleksyjnym tzw. zaimków rzeczownych. — [In:] Acta Universitatis Lodziensis. Folia Linguistica II, Łódź: Wydawnictwo Uniwersytetu Łódzkiego, 143–153.
SALONI, Zygmunt (1988): O tzw. formach nieosobowych rzeczowników męskoosobowych we
współczesnej polszczyźnie. — Bulletin de la Société polonaise de linguistique XLI, 155–166.
SALONI, Zygmunt (1992b): Rygorystyczny opis polskiej deklinacji przymiotnikowej. — Zeszyty Naukowe
Wydziału Humanistycznego Uniwersytetu Gdańskiego. Prace Językoznawcze 16, 215–228.
SALONI, Zygmunt (2001): Czasownik polski. — Warszawa: Wiedza Powszechna; 3rd rev. ed. 2007.
SALONI, Zygmunt, ed. (1987): Studia z polskiej leksykografii współczesnej, Tom II. — Białystok: Dział
Wydawnictw Filii Uniwersytetu Warszawskiego.
SALONI, Zygmunt, ed. (1988): Studia z polskiej leksykografii współczesnej. — Wrocław: Ossolineum.
SALONI, Zygmunt, ed. (1989): Studia z polskiej leksykografii współczesnej, Tom III. — Białystok: Dział
Wydawnictw Filii Uniwersytetu Warszawskiego.
SALONI, Zygmunt, GRUSZCZYŃSKI, Włodzimierz, WOLIŃSKI, Marcin, WOŁOSZ, Robert (2007): Słownik
gramatyczny języka polskiego. — Warszawa: Wiedza Powszechna (abbr. SGJP).
SALONI, Zygmunt, ŚWIDZIŃSKI, Marek (1981): Składnia współczesnego języka polskiego. — Warszawa:
Wydawnictwa Uniwersytetu Warszawskiego; 5th ed. 2007, Warszawa: PWN.
SALONI, Zygmunt, WOLIŃSKI, Marcin (2003): A Computerized Description of Polish Conjugation.
— [In:] Peter KOSTA et al. (eds.): Investigations into Formal Slavic Linguistics. Proceedings of the 4th European
Conference on Formal Description of Slavic Languages in Potsdam, Part I; Frankfurt am Main: Peter Lang,
373–384.
SALONI, Zygmunt, WOLIŃSKI, Marcin (2004): […] „Czasownik polski”. — Bulletin de la Société polonaise de linguistique LX, 145–156.
SGJP, [see:] SALONI, Zygmunt, GRUSZCZYŃSKI, Włodzimierz, WOLIŃSKI, Marcin, WOŁOSZ, Robert
(2007): Słownik gramatyczny języka polskiego.
SJPDor., [see:] Witold DOROSZEWSKI (1958–1969): Słownik języka polskiego PAN.
ŚWIDZIŃSKI, Marek (1992): Gramatyka formalna języka polskiego. — Warszawa: Wydawnictwa UW.
TOKARSKI, Jan (1951): Czasowniki polskie. — Warszawa: Wydawnictwo S. Arcta.
TOKARSKI, Jan (1958): Formy fleksyjne. — [In:] DOROSZEWSKI (1958–1969), vol. 1, XLIX–LXXIV.
TOKARSKI, Jan (1969): Perspektywy Słownika. — Poradnik Językowy 7, 385–394.
TOKARSKI, Jan (1973): Fleksja polska. — Warszawa: PWN.
TOKARSKI, Jan (1993): Schematyczny indeks a tergo polskich form wyrazowych, ed. Z. SALONI. — Warszawa:
PWN; 2nd ed. 2002.
WOLIŃSKI, Marcin (2006): Morfeusz: A Practical Tool for the Morphological Analysis of Polish. — [In:]
Mieczysław A. KŁOPOTEK, Sławomir T. WIERZCHOŃ, and Krzysztof TROJANOWSKI (eds.): Intelligent
Information Processing and Web Mining, IIS:IIPWM’06 Proceedings; Stuttgart: Springer, 503–512.
WOLIŃSKI, Marcin (2007): A Relational Model of Polish Inflection. — [In:] Zygmunt VETULANI (ed.):
Proceedings of the 3rd Language & Technology Conference; Poznań, 59–63.
WOŁOSZ, Robert (2005): Efektywna metoda analizy i syntezy morfologicznej w języku polskim. — Warszawa:
EXIT.
ZALIZNJAK, Andrej A. (1967): Русское именное словоизменение. — Москва: Наука.
ZALIZNJAK, Andrej A. (1977): Грамматический словарь русского языка. — Москва: Русский язык;
4th rev. ed. 2003. — Москва: Русские словари.