Seria III: ePublikacje Instytutu INiB UJ. Red. Maria Kocójowa
Nr 7 2010: Biblioteki, informacja, książka: interdyscyplinarne badania i praktyka w XXI wieku
Agnieszka Smolczewska Tona*
Université Charles-de-Gaulle Lille 3
GERIICO
COMBINING WEB ANALYTICS AND COMPUTATIONAL
LINGUISTICS TO ENHANCE ACCESS TO DIGITAL LIBRARIES.
A CASE STUDY
[POŁĄCZENIE WEBOMETRII I LINGWISTYKI KOMPUTEROWEJ
JAKO SPOSÓB USPRAWNIENIA DOSTĘPU DO BIBLIOTEK CYFROWYCH.
STUDIUM PRZYPADKU]
Abstract: The article reports on a study of user searching activities carried out for the Lyon Public Library to enhance a prototype online user interface developed for the CaNu XIX Project (19th Century Digital Newspapers Project). Gathering and analyzing user searching behavior in a digital library context involves several specific challenges, such as the difficulty of reaching physically absent, geographically dispersed and culturally unbounded populations. One of several methods to achieve this purpose consists in analyzing log file data. Log files allow tracking not only the traffic (number of visitors, visit duration, etc.) at a particular website, but also what those visitors are doing at that website (individual click trails, searched keywords, etc.). The data analyzed in the present study come from the 19th Century Digital Newspapers Project website logs. We focus our attention on search terms: techniques from computational linguistics have allowed us to sort them into categories representing the major topics searched by visitors. Results from this process can be exploited by the library's online service to enhance the design of the website. The potential and the limits of this multidisciplinary approach combining web analytics and computational linguistics are thoroughly discussed in the paper.
DIGITAL HISTORIC NEWSPAPERS – INFORMATION RETRIEVAL – QUERY TERMS – TRANSACTION LOG ANALYSIS
– SEMANTIC ANALYSIS
*
AGNIESZKA SMOLCZEWSKA TONA, PhD; Associate Professor in UFR IDIST (Department of Information,
Documentation, Scientific and Technical Information), Université Charles-de-Gaulle Lille 3, France; coordinator of the Post-Graduate Degree Program in Document Engineering, Edition and Cross-Cultural Mediation (Université Lille 3); MA in
Computational Linguistics (Trinity College of Dublin, Ireland); post-graduate degree in Computer Science for Social
Sciences (Université Pierre Mendès-France, Grenoble 2, France); PhD in Information and Communication Sciences
(Université Claude Bernard, Lyon 1, France). The two most important publications: (2006) Une nouvelle lecture de la
structure d’un document en vue de la construction d’index [A New Look at the Document’s Structure in View of Index
Constructing]. [In:] Terminologie et accès à l’information [Terminology and Access to Information]. Paris: Hermes Science
Publications, p. 101–118 [co-author: G. Lallich-Boidin]; (2006) An Interpretation of the Effort Function through the
Mathematical Formalism of Exponential Infometric Process. “Information Processing and Management” 42(6), p. 1442–
1450 [co-author: T. Lafouge]. E-mail: [email protected]
[Dr AGNIESZKA SMOLCZEWSKA TONA, UFR IDIST [Instytut Informacji Naukowo-Technicznej i Dokumentacji],
Université Charles-de-Gaulle Lille 3, Francja; koordynatorka studiów III stopnia kierunku „Edytorstwo elektroniczne
i komunikacja międzykulturowa” (Université Lille 3); absolwentka lingwistyki komputerowej (MA, Trinity College of
Dublin, Irlandia) i studiów podyplomowych w zakresie informatyki dla nauk społecznych (Université Claude Bernard, Lyon
1, Francja). Dwie najważniejsze publikacje: (2006) Une nouvelle lecture de la structure d’un document en vue de la
construction d’index [Nowe spojrzenie na strukturę dokumentu w perspektywie tworzenia indeksów]. [In:] Terminologie et
accès à l’information [Terminologia i dostęp do informacji]. Paris: Hermes Science Publications, p. 101–118 [współaut.:
G. Lallich-Boidin]; (2006) An Interpretation of the Effort Function through the Mathematical Formalism of Exponential
Infometric Process [Interpretacja funkcji wytężenia w formalizmie matematycznym wykładniczego procesu
informetrycznego]. „Information Processing and Management” 42(6), p. 1442–1450 [współaut.: T. Lafouge]. E-mail:
[email protected]].
Abstrakt: Przedstawiono wyniki badania zachowań wyszukiwawczych, które zostało przeprowadzone w Miejskiej Bibliotece Publicznej w Lyonie wśród użytkowników prototypowego interfejsu biblioteki cyfrowej CaNu XIX, stworzonej w ramach projektu digitalizacji XIX-wiecznych czasopism (19th Century Digital Newspapers Project). Pozyskiwanie i analizowanie danych dotyczących zachowań informacyjnych użytkowników biblioteki cyfrowej okazuje się szczególnie trudne ze względu na wirtualność ich interakcji z systemem, rozproszenie geograficzne i niejednorodność cech społeczno-demograficznych. Jedną z metod stosowanych w tego rodzaju badaniach jest analiza plików rejestru (ang. log files), w których odnotowywane są kolejne logowania do systemu. Dane pochodzące z plików rejestru umożliwiają nie tylko ocenę parametrów ruchu w witrynie (liczba odsłon, czas odwiedzin itd.), ale także śledzenie działań podejmowanych przez użytkownika (odtwarzanie indywidualnych tras nawigowania w serwisie, użytych w wyszukiwaniu słów kluczowych itd.). Przytoczone w artykule dane, pochodzące z plików rejestru biblioteki cyfrowej 19th Century Digital Newspapers Project, analizowano głównie ze względu na ujęte w kwerendach argumenty wyszukiwawcze: dzięki zastosowaniu technik językoznawstwa komputerowego możliwe było posortowanie ich w kategorie odpowiadające najpopularniejszym tematom wyszukiwań. Uzyskane w badaniu wyniki mogą posłużyć za podstawę do optymalizacji interfejsu biblioteki cyfrowej CaNu XIX. We wnioskach końcowych omówiono potencjalne korzyści i ograniczenia podejścia interdyscyplinarnego łączącego elementy webometrii i lingwistyki komputerowej.
ANALIZA PLIKÓW REJESTRU – ANALIZA SEMANTYCZNA – DIGITALIZACJA CZASOPISM HISTORYCZNYCH – WYSZUKIWANIE INFORMACJI
*   *   *
INTRODUCTION
Old newspapers are probably one of the most important information sources for historical research. Over the
last 300 years they recorded every aspect of everyday human life at the international, national and local level. As
a consequence, their content reflects history in the making: “it is one thing to read about historical events from
the perspective of historians, narrated with the value of hindsight. It is entirely different to read the story as it
was happening” [Sweeney 2007, p. 188].
Historic newspapers, especially regional and local newspapers and those published in the 18th and 19th centuries, have always challenged researchers with practical problems regarding their access. Researching pieces of information in large-format and non-indexed newspapers has always been time-consuming and often frustrating. Without exact bibliographic information, anyone wishing to access a specific article could spend days, if not months, consulting countless volumes of crumbling newspapers or miles of microfilm surrogates.
Digitization and online publication of historic newspapers, particularly when enhanced features such as indexing and full-text search options are provided, hold out the promise of making this important information easier to find and more accessible. In this way, they allow researchers "to spend more time using material and less time finding it" [Jones 2004, doc. online, p. 20].
Many libraries, archives and other cultural institutions are digitizing their national, regional and local newspapers to make them freely available through the Internet. Recently, the Lyon Municipal Library, which is France's second largest library after the National Library in Paris, also headed down the path of digitization. Two years ago, part of its nineteenth-century regional newspaper collections was digitized and put online on the 19th Century Digital Newspapers website, developed in the framework of the collaborative project CaNu XIX (short for Canards numériques du XIXe) [Smolczewska 2008].
Several tasks were planned in the CaNu XIX project in order to promote the valuable heritage represented
by those digital collections. The research work presented in this paper is an outcome of the task devoted to the
study of the usages of the digital historic newspapers available online at the 19th Century Digital Newspapers
website, in view of improving the existing prototype interface.
The approach presented in the following is based on Web analytics, and more particularly on transaction
log analysis, an unobtrusive technique which makes it possible to reach the scattered reader population of digital
historic newspapers. We advocate the use of natural language processing tools to improve the extraction of relevant information from full-text search queries, and find out more precisely what the users are looking for.
The paper is organized as follows. The background of the study is presented in the next section, which includes a description of the CaNu XIX project and a brief state of the art of the uses of digital historical newspapers and of the applications of transaction log analysis to this field. Then our methodology is presented. First, we
describe the data and their processing for the study. Next, we provide a set of standard metrics on the interface
usage and more particularly on the full-text search facility. Finally, we present a thematic classification of search
query contents obtained in a semi-automatic way thanks to the use of semantic analysis software. The potential
and the limits of this approach are discussed at the end of the paper.
CONTEXT
The CaNu XIX project
The CaNu XIX project, financed by the Rhône-Alpes region, brought together partners from different regional institutions and research teams led by Lyon Municipal Library over a two-year period ending in December
2009. Its primary objective was to develop a prototype Web interface to provide access in digital form to the
historic newspaper collections owned by the library. The second and overall more challenging objective consisted in promoting this valuable digital heritage, by enhancing online access to the collections.
Among the possible lines of research, the study of interface uses emerged as a particularly promising one.
Indeed, in most projects involving digitization of historical newspaper collections the interfaces are designed for
professional users and tend to reproduce the uses of the original paper collection. Since few studies on the use of
online collections are available, it is difficult to assess to what extent existing interfaces meet user needs and
how they can be improved.
The first version of the 19th Century Digital Newspapers website was completed and made freely available
on the Internet in December 2007, as a part of the Lyon Municipal Library website. It presented a collection of
769 issues of Le Progrès Illustré (1890–1905), a very popular local illustrated newspaper. The collection includes digital reproductions of every page from every issue (for a total of over 7000 pages), available in various
formats.
The prototype interface supports a number of searching and browsing features. For the purposes of this article we focus our attention on the options for searching the newspapers in the collection. The first option consists in submitting a query from the full-text search box available on every page. This creates a results list of all the issues containing the searched terms, with each term highlighted in every place it occurs on the page. Then the user can choose the appropriate page format for printing or for browsing. The second possibility to find relevant articles is to limit the range to a particular year or issue, using the specific search boxes available on the homepage (full-text search, searching by year, by issue, sample thematic tracks and index of names).
Current uses of digital historical newspapers
Who is the typical reader of digital historic newspapers? The answer to this question is not an easy one: despite the growing popularity of historical newspapers among readers, very little research has been done regarding their current uses and users. There have been, however, at least two studies suggesting that old newspapers engage a broad audience, both academic and non-academic. A survey conducted under the NEWSPLAN Scotland project, exploring the use of historic newspapers on microfilm in Scottish public libraries, revealed that for non-academic readers who are seeking titles older than a hundred years, "the major reasons for using the newspapers were family and local history. Users wanted to look at birth, marriage and death records […], weather reports, shipping records, information on soldiers, school board information, and gaining insight into Scottish perspective in major historic events" [Jones 2004, doc. online, p. 26].
Concerning the use of 19th century newspapers by academic and scholarly audiences, Jones [2004, doc. online] shows that they serve as important historical sources, both primary and secondary, for scholars of multiple disciplines (historians, archaeologists, geographers, geologists, linguists, etc.), as well as for librarians, teachers, students, and genealogists [Jones 2004, doc. online, p. 2]. Nevertheless, even though considerable useful information has been forthcoming from these studies, they do not fully answer the questions about users of digital historic newspapers. Indeed, both studies inquired about users of these resources in their traditional forms (printed format or microfilm), and they do not discuss the specific ways those traditional user groups have utilized them.
This lack of studies on the specific users and uses of digital historic newspaper collections can be explained, in our opinion, by two factors. Firstly, a great many of those projects are in their infancy: the oldest still-running projects are less than ten years old. Secondly, the study of uses in a digital environment proves to be difficult.
Indeed, investigating the current and potential user information needs and behavior in the digital context
presents several specific challenges. One of them is to adopt “a robust and scientifically grounded methodology
that provides rich and detailed data on the working habits of users interacting with digital materials” [Snow et al.
2008, doc. online]. In order to collect the most detailed and rigorous data, researchers apply a number of qualitative and quantitative research methods, including questionnaires, interviews, focus group interviews, observation
and experiments. In user studies which aim to investigate information needs and behavior, observation and interviews have been found to be the most valuable methods [Gorman, Clayton 2005; Inskip et al. 2007]. However,
in the digital environment, these traditional data collection techniques pose a severe practical problem. As we argued earlier, a historical document collection in a traditional library, as a physical (and often unique, hence valuable) object, can be accessed in one place and by one user at a time. Its digital surrogate, freed from the physical constraints of the library and freely available on the library website, has the virtues of ubiquity and simultaneity [Deegan 2001, p. 402]. It can be accessed continuously and simultaneously, from anywhere (or virtually anywhere), at any time and by anyone, library subscriber or random website visitor, with access to a computer or other type of terminal connected to the network. In that context, a key question arises: how to reach this physically absent, geographically dispersed and culturally unbounded reader population?
One of the possible solutions to deal with this underlying difficulty of collecting data about the use of online
resources consists in applying a technique known as transaction log analysis (TLA).
Transaction log analysis
Transaction log analysis is a relatively recent technique whose use as a research method arose in parallel
with the development of new information and communication technologies. It refers to the practice of capturing
and analyzing log files data from a particular website and is thus also known as Web log analytics or (on-line)
Web analytics [Ferrini, Mohr 2009, p. 125]. Transaction log analysis is considered an unobtrusive method
which takes advantage of the technology that is being evaluated [Powell et al. 2004, p. 52]. Indeed, all Web
based systems can be configured to generate an automatic and real-time record of their use by everyone who
happens to access their information services. More precisely, every time the “client” computer requests a particular page, the “server” computer on which the website is hosted records a certain amount of data representing
the client’s activity on the website. It can include details about the website visitor (the visitor’s URL, or Web
address, the visitor’s geographic location, etc.), the IP address of the visitor’s computer, the language setting on
the visitor’s computer, the date and time of the visit and its duration, the visitor’s navigation paths, information
about viewed or downloaded pages, used keywords and other information that the server is currently configured
to gather (for a more extensive overview of the data that can be collected by Web analytics, see [Napier et al.
2003; Ferrini, Mohr 2009]).
The growing number of user studies based on the analysis of transaction log data in recent years shows the increasing interest of the scientific community in both subject matter and method. In this field, transaction log analysis has been extensively applied for different purposes, and particularly to observe users' information-seeking behavior when accessing various Web-based resources and services. Thus, it has been found particularly
helpful in carrying out investigations of the use of electronic journal databases [Pu 2000; Jones et al. 2000; Ke et
al. 2002; Bollen et al. 2003, doc. online; Nicolas et al. 2006, Harley et al. 2007, doc. online] and Web search
engines [Thorsten 2002, doc. online; Spink, Jansen 2004; Tjondronegoro et al. 2009]. Different aspects of
searching and browsing behavior in the digital library context were more particularly examined by Jones et al.
[2000] and Ke et al. [2002]. There are some similarities between these two earlier studies and ours: they all examine user searching activities in terms of search queries. However, in comparison with the study proposed by
Ke et al. [2002], the present study does not involve other aspects of searching activity such as browsing or downloading. Moreover, a major difference between our study and that conducted by Jones et al. [2000] lies in the
processing of search queries. To gain some qualitative insight into what users are searching for, in the study performed by Jones et al. [2000] query terms were examined and classified into categories manually. In our study,
the query content analysis has been performed with the assistance of automatic text analysis tools.
STUDY OF SEARCHING ACTIVITIES ON THE “19TH CENTURY DIGITAL
NEWSPAPERS PROJECT” WEBSITE
Log data preprocessing and access statistics
The data used for this study come from a raw transaction log file obtained from the 19th Century Digital Newspapers website server. The transaction log file covers a six-month (184-day) period, from April 15th (0:00:18) to October 16th (23:59:47), 2009, and contains 582 065 access events (or transaction records), each of them representing one user access or action on the interface. The daily average is thus 3 163 transaction records, which gives a first idea of the usage of the website. Every individual access event to the website is recorded with the
kind of information described above in the section dedicated to Web Analytics. For the purpose of this study, the
most relevant pieces of information are the date and the time of the request, the Internet Protocol (IP) address of
the requester and the nature of the request. Table 1 shows some sample records of unprocessed data.
Table 1. Sample records of unprocessed data from the transaction log file

DATE AND TIME       | IP ADDRESS | REQUEST
2009-08-03 9:45:23  | X.X.X.X    | PER001941 page Le Progrès Illustré, n°4, pp.5
2009-08-03 9:45:26  | X.X.X.X    | /presseIllustree/search?searchType=&query=marin
2009-08-03 9:45:32  | X.X.X.X    | /PER0011261
2009-08-03 9:45:51  | X.X.X.X    | /PER0012350 page Le Progrès Illustré, n°83, pp.4
2009-08-03 9:46:09  | X.X.X.X    | /presseIllustree/
2009-08-03 9:47:24  | X.X.X.X    | PER0012944 fascicule Le Progrès Illustré, n°17 1894/03/1
2009-08-03 9:48:36  | X.X.X.X    | /presseIllustree/search?query=fullname_subject:"société" page Le Progrès Illustré, n°6, pp.3

Source: Research by the author
The first column displays the date and the time at which the query was made in year-month-day
hour:minutes:seconds format, the second column contains the IP address, and the remainder of the line contains
the user’s query.
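The authors' preprocessing pipeline was written in Perl; purely as an illustration, a record of the layout just described can be parsed with a short sketch like the following (Python is used here for brevity, and the whitespace-separated field layout is an assumption based on Table 1):

```python
import re

# Illustrative sketch (not the project's actual Perl code): parse one
# transaction record of the form "DATE TIME  IP  REQUEST", as in Table 1.
RECORD_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{1,2}:\d{2}:\d{2})\s+"
    r"(?P<ip>\S+)\s+(?P<request>.+)$"
)

def parse_record(line):
    """Return (date, time, ip, request), or None if the line is malformed."""
    m = RECORD_RE.match(line.strip())
    if not m:
        return None
    return m.group("date"), m.group("time"), m.group("ip"), m.group("request")

record = parse_record(
    "2009-08-03 9:45:26 X.X.X.X /presseIllustree/search?searchType=&query=marin"
)
```

Malformed lines return None, so they can simply be skipped during a pass over the raw log.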
The original raw log file, which contained data that are not relevant for this study, was preprocessed to remove unwanted transaction records and to provide metrics corresponding to our goals. All data preprocessing and metrics computation algorithms were coded in Perl [Schwartz 2008].
The first step in preprocessing the log file was a data cleaning procedure: the transaction records generated
by the website development and maintenance team (71 181) and by some known Web robots (11 290) have been
removed from the original file. The processing reduced the size of the transaction log file from 582 065 to 499 594 access events. The number of unique IP addresses in the processed transaction log file is 7 059.
In general, there is no guarantee that each IP address corresponds to one user. Indeed, when personal information about the user ("cookies") is available in the transaction log file, it is generally accepted that using it provides a less error-prone means of identifying individual users than using the IP address. Since cookie recording was not enabled on the 19th Century Digital Newspapers Project website server, we had to work with IP addresses only.
Since this study focuses on search activities, a second preprocessing step consisted in extracting the transactions related to those activities from the log file records. In fact, only about 3 000 transactions concern the use of search facilities, the rest consisting of home page accesses and browsing actions.
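The two preprocessing steps can be sketched as follows (again in Python rather than the original Perl; the maintenance-team IP addresses and robot signatures below are placeholders, not the project's actual values, and the `/search?` marker is an assumption based on the sample requests in Table 1):

```python
# Sketch of the two preprocessing steps: (1) drop records generated by the
# maintenance team or by known robots, (2) keep only search transactions.
MAINTENANCE_IPS = {"10.0.0.1"}          # hypothetical development-team addresses
ROBOT_MARKERS = ("Googlebot", "Slurp")  # hypothetical known-robot signatures

def clean(records):
    """Remove maintenance-team and robot records from (ip, request) pairs."""
    return [
        (ip, request)
        for ip, request in records
        if ip not in MAINTENANCE_IPS
        and not any(marker in request for marker in ROBOT_MARKERS)
    ]

def search_transactions(records):
    """Keep only the transactions that used one of the search facilities."""
    return [(ip, r) for ip, r in records if "/search?" in r]

log = [
    ("10.0.0.1", "/presseIllustree/"),                   # maintenance access
    ("1.2.3.4", "/presseIllustree/search?query=marin"),  # full-text search
    ("1.2.3.4", "/PER0011261"),                          # browsing action
]
kept = search_transactions(clean(log))
```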
Basic analysis of searching activities
As in [Ke et al. 2002], we define a query as follows: one or more query terms, and possibly query operators, entered in a "search box" on the website interface. By query term we mean any unbroken string of alphanumeric characters bounded by delimiters such as blanks, quotation marks, apostrophes and commas.
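This definition translates directly into a tokenizer; a minimal sketch (in Python, not the original Perl):

```python
import re

# Direct transcription of the query-term definition above: a term is any
# unbroken run of characters, with blanks, quotation marks, apostrophes
# and commas acting as delimiters. Hyphenated forms stay whole.
DELIMITERS = re.compile(r"[\s\"',]+")

def query_terms(query):
    """Split a query string into its terms."""
    return [t for t in DELIMITERS.split(query) if t]

terms = query_terms("rue neuve-des-philistins, 'crime passionnel'")
```

Note that the hyphen is not a delimiter, so Saint-Etienne counts as a single term, in line with the examples given later in the paper.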
• Searching modes by query
As mentioned previously, three search options are proposed in the website interface. The transaction log records indicate whether the newspaper pages accessed were the result of a query in the Full-text search box or in the Issue or Year search box. The distribution of queries for each search mode is shown in the pie chart of Figure 1.
Figure 1. Distribution of search modes by query (on a total of 3097 queries)
[Pie chart] Full-text search: 2 336 (75%); Search by year: 705 (23%); Search by issue: 56 (2%)
Source: Research by the author
It can be seen that 75% of the queries were of the full-text search type, 23% corresponded to a search by year and only 2% to a search by issue.
• Searching modes by IP address
Of the 7 059 unique IP addresses which performed some actions on the interface, 558 searched the website using at least one of the three search modes.
Figure 2. Distribution of search modes by IP address (on a total of 558 IP addresses doing a query)
[Pie chart] Full text: 311 (56%); Year: 118 (22%); Year + Full text: 99 (18%); the remaining combinations involving Issue (Issue + Year, Issue + Full text, Issue + Year + Full text) account for 10 (2%), 6 (1%) and 3 (1%) addresses
Source: Research by the author
The most used search mode is thus full-text search alone (56%), followed by search by year alone (22%) or combined with full-text search (18%). Search by issue, alone or combined with the other modes, is seldom used. This is not surprising since it corresponds to a very specialized way to access the collection.
For the rest of the analysis, only full-text search queries have been considered.
• Full-text search queries by IP address
Before analyzing the structure and the content of the full-text search queries, it may be interesting to give some information about the distribution of those queries by IP address. As shown in Table 2, 44.5% of the users (supposing that the hypothesis on the correspondence between user and IP address is correct) made just one query, about 20% made two queries and slightly more than 35% made three or more queries.
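The per-IP count behind this distribution is a straightforward double tally; an illustrative sketch (in Python, not the original Perl):

```python
from collections import Counter

# Sketch of the computation behind Table 2: count full-text queries per
# IP address, then tabulate how many addresses made 1, 2, 3, ... queries.
def queries_per_ip(search_records):
    """search_records: list of (ip, query) pairs -> {n_queries: n_ips}."""
    per_ip = Counter(ip for ip, _query in search_records)
    return dict(Counter(per_ip.values()))

dist = queries_per_ip([("a", "q1"), ("a", "q2"), ("b", "q3")])
```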
Table 2. Number of full-text search queries per IP address

No. queries | No. IPs | %
 1 | 190 | 44.5
 2 |  85 | 19.9
 3 |  37 |  8.7
 4 |  32 |  7.5
 5 |  17 |  4.0
 6 |  11 |  2.6
 7 |  10 |  2.3
 8 |   8 |  1.9
 9 |   9 |  2.1
10 |   7 |  1.6
11 |   4 |  0.9
12 |   3 |  0.7
13 |   4 |  0.9
14 |   0 |  0.0
15 |   1 |  0.2
16 |   2 |  0.5
17 |   1 |  0.2
18 |   1 |  0.2
19 |   1 |  0.2
22 |   1 |  0.2
35 |   1 |  0.2
46 |   1 |  0.2
66 |   1 |  0.2

Source: Research by the author
• Search queries ranking by length
Splitting queries into single terms allows studying the distribution of query length. The query length is
measured by counting the number of query terms the user entered into the search box. As mentioned before, by
query term we mean any unbroken string of alphanumeric characters bounded by delimiters such as blank, quotation mark, apostrophe or comma. Thus, a query containing only one term (e.g. cinématographe, Saint-Etienne,
1899, etc.) is referred to as a one-term query, a query containing two terms is a two-term query (e.g. expo 1900,
rue neuve-des-philistins, crime passionnel), etc.
Figure 3 displays the ranking of all queries by length. About 2% of all queries are blank queries, which arise when a user submits a query without entering any query term. The results show that queries tend to
be quite short: more than 90% of queries contain between one and three terms. One-term queries are the most commonly encountered (68.9%), followed by two-term queries (16.3%) and three-term queries (8.4%). Less than 4.5% of the queries contain more than three terms.
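The length distribution itself reduces to counting terms per query; a minimal sketch (Python rather than the original Perl, using simple whitespace tokenization for brevity instead of the full delimiter set):

```python
from collections import Counter

# Sketch of the query-length count behind Figure 3: the length of a query
# is its number of terms (zero for a blank query).
def length_distribution(queries):
    """Map query length -> (count, percentage of all queries)."""
    lengths = Counter(len(q.split()) for q in queries)
    total = sum(lengths.values())
    return {n: (c, round(100 * c / total, 1)) for n, c in lengths.items()}

dist = length_distribution(["cinématographe", "expo 1900", "", "crime passionnel"])
```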
Figure 3. Distribution of search query length (number of search queries = 2336)
[Pie chart] 1 term: 1 609 (70%); 2 terms: 381 (16%); 3 terms: 196 (8%); 4 terms: 79 (3%); 0 terms (blank): 54 (2%); 5 terms: 12 (1%); 6 terms: 7 (0%)
Source: Research by the author
• Term occurrences in search queries
A total of 3 384 single terms were extracted from all the full-text search queries (2 336). On average, users entered approximately 1.5 terms per query. After converting uppercase terms to lowercase and eliminating duplicates, 1 348 unique terms remained. A complete rank-frequency table was built for all the unique terms left after excluding stop words such as de, aux, etc. (1 329 items).
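The rank-frequency computation can be sketched as follows (Python rather than the original Perl; the stop word set below is a tiny placeholder for the full French list actually used):

```python
from collections import Counter

# Sketch of the rank-frequency table: terms are lowercased, stop words
# removed, and the remainder ranked by decreasing frequency.
STOP_WORDS = {"de", "aux", "la", "le", "les"}  # illustrative subset only

def rank_frequency(terms):
    """Return (term, occurrences) pairs sorted by decreasing frequency."""
    counts = Counter(t.lower() for t in terms if t.lower() not in STOP_WORDS)
    return counts.most_common()

table = rank_frequency(["Bains", "bains", "de", "Guignol", "bains"])
```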
Table 3 shows an excerpt of the rank-frequency table for the top 50 most commonly encountered terms (excluding stop words).
Table 3. Rank and frequency of the 50 most commonly occurring query terms (1329 total occurrences)

Rank | Term             | Occurrence || Rank | Term          | Occurrence
  1  | bains            | 36 || 26 | alphonse      | 10
  2  | guignol          | 34 || 27 | blague        | 10
  3  | peyrebrune       | 33 || 28 | exposition    | 10
  4  | mode             | 29 || 29 | greville      | 10
  5  | lyon             | 24 || 30 | henry         | 10
  6  | brides           | 23 || 31 | observatoire  | 10
  7  | port             | 21 || 32 | saint-chamond | 10
  8  | dreyfus          | 20 || 33 | bataille      |  9
  9  | rhône            | 16 || 34 | chansons      |  9
 10  | brides-les-bains | 15 || 35 | cyclisme      |  9
 11  | carnot           | 15 || 36 | hôpital       |  9
 12  | casino           | 15 || 37 | imprimerie    |  9
 13  | cirque           | 14 || 38 | villeurbanne  |  9
 14  | police           | 14 || 39 | aix           |  8
 15  | bal              | 13 || 40 | bihin         |  8
 16  | géant            | 13 || 41 | croix-rousse  |  8
 17  | mystification    | 13 || 42 | gier          |  8
 18  | napoléon         | 13 || 43 | hotel         |  8
 19  | élégante         | 13 || 44 | juifs         |  8
 20  | allais           | 12 || 45 | kock          |  8
 21  | bellecour        | 12 || 46 | luigini       |  8
 22  | bicyclette       | 11 || 47 | montbrison    |  8
 23  | crime            | 11 || 48 | rive          |  8
 24  | ravachol         | 11 || 49 | vendanges     |  8
 25  | revue            | 11 || 50 | étudiants     |  8
Source: Research by the author
Analysis of query contents
• The need for semantic analysis
Counting occurrences of terms in queries does provide some hints about the content of queries, but cannot tell us precisely what the website users are seeking. Let us consider the following query: "Napoléon Bonaparte île Sainte-Hélène". A raw extraction would give four different terms, whereas the query clearly contains two distinct references, to one famous personage and one geographical place. To obtain more meaningful results, some form of semantic analysis could be used. For example, the use of a dictionary would allow identifying compound substantives, very common in French (e.g. pomme de terre), as unique references.
The same consideration applies to the use of specialized (historical, geographical, technical ...) thesauri. To go
further in the interpretation of queried words, several grammatical and semantic ambiguities must be solved.
Does a query containing “Carnot” refer to the fourth president of the Third French Republic, assassinated in Lyon in 1894, or to his uncle, the famous physicist? Does it refer to concepts related to the former or the latter
(“statut Carnot”, “cycle de Carnot”) or to a landmark in Lyon (“place Carnot”)?
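The dictionary-based grouping discussed above can be illustrated with a toy sketch (a small hand-made gazetteer stands in for Tropes' internal dictionaries; all entries and categories here are invented for the example):

```python
# Toy illustration of dictionary-based reference grouping: a gazetteer
# maps multi-word expressions to a single reference and a category.
GAZETTEER = {
    ("napoléon", "bonaparte"): ("Napoléon Bonaparte", "person"),
    ("île", "sainte-hélène"): ("île Sainte-Hélène", "place"),
    ("pomme", "de", "terre"): ("pomme de terre", "common noun"),
}

def group_references(terms):
    """Greedily match the longest gazetteer entry at each position."""
    refs, i = [], 0
    while i < len(terms):
        for size in range(len(terms) - i, 0, -1):
            key = tuple(t.lower() for t in terms[i:i + size])
            if key in GAZETTEER:
                refs.append(GAZETTEER[key])
                i += size
                break
        else:  # no entry matched: keep the bare term, uncategorized
            refs.append((terms[i], None))
            i += 1
    return refs

refs = group_references(["Napoléon", "Bonaparte", "île", "Sainte-Hélène"])
```

A real system would of course also need the disambiguation step discussed above (Carnot the president versus Carnot the physicist), which a flat lookup like this cannot provide.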
Content analysis [Krippendorff 2003] can be automated, at least to a certain extent, using artificial intelligence ambiguity-solving algorithms [Ghiglione et al. 1998]. Several applications to text and discourse analysis have been proposed, with some success. Unfortunately, content analysis of queries proves to be much more difficult, since little context information is available to help solve ambiguities. However, when the search is performed on a very specialized database, automatic content analysis tools can help retrieve relevant information from the queries. In the following, we show how we have used a text analysis tool, Tropes [Molette 2009],
to obtain the distribution of search queries inside a specific semantic classification for the 19th Century Digital
Newspapers Project website.
• Semi-automatic semantic analysis of query contents
There exist several software tools that can help carry out (semantic) text analysis. In the open source world, let us cite the Natural Language Toolkit [Loper et al. 2002], a set of Python modules, linguistic data and documentation for research and development in natural language processing. Several scattered modules of uneven quality also exist for Perl. Full-fledged software programs also exist, such as the aforementioned Tropes or LIWC (Linguistic Inquiry and Word Count), a text analysis program (see [Pennebaker et al. 2007] for a description of the 2007 version) designed to calculate the degree to which people use different categories of words across a wide array of texts. Compared to Tropes, LIWC has a narrower scope and fewer features, and does not provide versions or modules for processing texts in French.
The Tropes software performs a multi-step text analysis which allows word and expression senses to be identified, and can be used for semantic classification, keyword extraction, and linguistic and qualitative analysis. In order to enable the software to build up a representation of the context, groups of closely related meanings (common nouns, proper nouns, trademarks, etc.) appearing frequently throughout the text are formed to constitute the so-called “equivalent classes”. These classes are built via internal dictionaries that contain hundreds of thousands of preset semantic classifications. “Scenarios” can be built to enrich and filter the equivalent classes. This allows specific classifications to be defined, the internal dictionaries to be modified and information retrieval to be customized.
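Conceptually, a scenario can be viewed as an editable mapping from themes to equivalent classes, which the analyst can merge, extend and prune. A minimal, purely illustrative Python sketch follows (theme names and entries are invented; Tropes stores scenarios in its own internal format):

```python
# A scenario as a plain nested mapping: theme -> set of equivalent classes.
scenario = {
    "Geography": {"lyon", "rhone-alpes", "bellecour"},
    "Football": {"arsenal"},
}

# Customizing the scenario: in this corpus "arsenal" denotes a Lyon
# landmark, so we move it from the default "Football" theme to
# "Geography", then drop the theme left empty.
scenario["Football"].discard("arsenal")
scenario["Geography"].add("arsenal")
if not scenario["Football"]:
    del scenario["Football"]

print(sorted(scenario))                    # ['Geography']
print("arsenal" in scenario["Geography"])  # True
```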
To carry out our semantic analysis, we first formatted the query list with unambiguous delimiters to indicate that each query is a separate (and unrelated) text segment. Then we applied the default “Concepts” scenario, a sort of generalist thesaurus, to the formatted query list. Browsing the list of “references” generated by Tropes, we assessed which ones could not be classified in the default scenario and how the others were classified. We then modified the default scenario by merging some themes and concepts, adding new ones and deleting others that were not relevant for our analysis. The goal was to classify as many of the references found in the query list as possible, in an appropriate way. To give an example, using the default semantic classification, the Arsenal, a historical landmark in Lyon, was classified in the category “football club”. Admittedly, the definition of the new scenario required some work, but it must be underlined that the classified reference list obtained from the default scenario was already very useful in giving a first idea of the semantic content of the queries. In particular, the fact that compound words and unaccented forms in French are automatically recognized and stop words are left out really simplifies the task, compared to ad hoc text processing.
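The delimiter-formatting step just described can be sketched as follows. The triple-asterisk separator is an invented convention for this illustration, not Tropes' actual input format; the point is only that the tool must not treat consecutive queries as one continuous discourse:

```python
def format_queries(queries, delimiter="\n***\n"):
    """Join queries so each one is an isolated, unrelated text segment.

    Strips stray whitespace and drops empty queries first; the delimiter
    is an arbitrary choice for this sketch.
    """
    cleaned = [q.strip() for q in queries if q.strip()]
    return delimiter.join(cleaned)

log_queries = ["place Bellecour", "guignol", "  grippe espagnole  ", ""]
print(format_queries(log_queries))
```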
Applying the specific CaNu XIX scenario, 1 043 references could be classified out of the 1 593 detected by Tropes (Figure 4). Most of the 550 unclassified references occur just once (380 references) or twice (80 references) in all queries. Given their low frequencies, it is not worth trying to refine the classification to include them. In fact, browsing the list of unclassified references, it turns out that most of them correspond to misspelt words.
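The frequency screening used above, i.e. setting aside references that occur only once or twice, is straightforward to reproduce. A standard-library-only sketch with invented toy data (the real study worked on 1 593 detected references):

```python
from collections import Counter

def frequency_profile(references):
    """Count how often each detected reference occurs across all queries
    and report how many occur only once or twice, i.e. the candidates
    one may safely ignore when refining the classification."""
    counts = Counter(references)
    once = sum(1 for c in counts.values() if c == 1)
    twice = sum(1 for c in counts.values() if c == 2)
    return counts, once, twice

refs = ["lyon", "lyon", "guignol", "soie", "soie", "soie", "bagne"]
counts, once, twice = frequency_profile(refs)
print(counts.most_common(2))  # [('soie', 3), ('lyon', 2)]
print(once, twice)            # 2 1
```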
Figure 5. shows the distribution of most frequent classified references by theme. Only the themes containing
20 references or more have been considered.
Figure 4. Tropes graphic user interface, after the application of the CaNu XIX scenario
Source: Research by the author
Figure 5. Distribution of most frequent classified references by theme
[Bar chart: Geography is by far the most frequent theme, with 30.2% of the classified references; the twelve remaining themes range from 9.2% down to 2.0%.]
Source: Research by the author
The classes are composed as follows:
• G e o g r a p h y . This group contains more than 30% of the classified references, and is divided into subgroups related to geographical areas, from the continent and country level down to specific places at city level. About two thirds of the references concern searches focused on the Rhône-Alpes region, whose capital is Lyon, and one third on Lyon itself (10% of the total number of classified references). In this subgroup, we can find city places and landmarks such as “(place) Bellecour”, Hôtel-Dieu, Croix-Rousse, traboules and so on.
• P e r s o n a g e s a n d C h a r a c t e r s . This group contains references to famous (and less famous) personages, many of them local, but also to popular fictional and folk characters. Among the fictional characters, those from the puppet show Guignol (created in Lyon) are the most frequently searched.
• C u l t u r e a n d A r t s . We can find here query words related to arts and entertainment, among which music, cinema, theatre and circus, including peculiar forms of entertainment such as occultism (divination, magnétisme, fantômes and so on).
• S o c i e t y a n d P o l i t i c s . References related to law and justice, and more particularly to crimes and punishments, are preponderant here (assassin, banditisme, bagne, guillotine ...). Some references concern political doctrines (socialisme, anarchisme ...) and religion.
• E v e r y d a y l i f e . This group includes references concerning food, beverages and cooking; professions and craftsmanship; and fashion, fabrics, clothing and dressmaking. In this last subgroup, not unexpectedly considering Lyon's tradition, we can find words such as soie, tisseur, linge, dentelle, broderie.
• N a t u r e a n d A g r i c u l t u r e . Half of the references concern animals and plants. The others are related to agriculture and peasant life (vigne, vendanges, moisson).
• C i t y a n d T r a n s p o r t a t i o n . Air, sea and land transportation means and transportation infrastructure can be found here, together with city elements that cannot be related to a specific place (“rues pittoresques”, “vieux quartiers”).
• T i m e . Nearly all these references are dates or years.
• E v e n t s a n d f a c t s . Wars and conflicts, catastrophes and accidents dominate this group, together with a few specific detected events, not necessarily tragic (grippe espagnole, exposition universelle, affaire dreyfus).
• S p o r t s a n d R e c r e a t i o n . The group includes sports, games, spare-time activities, sightseeing and travel.
• M e d i a a n d C o m m u n i c a t i o n . Two thirds of the references included in this group concern the modern and historical press (newspapers and magazines).
• S c i e n c e a n d T e c h n o l o g y . The sciences most frequently referred to are astronomy and archeology. Industry and machines are also often referenced.
• M e d i c i n e a n d H e a l t h . References to diseases, pharmacy and treatments can be found here.
Benefits and limits of the proposed approach
In principle, the recognition and classification of query concepts could have been done manually, as has been proposed in the literature on web log analysis. However, with more than three thousand queries to analyze,
the manual approach does not seem viable. In our approach, we do not start from scratch to build the semantic classification of queries. The use of a text analysis tool with large internal dictionaries and ambiguity-resolving features allows a preliminary classification containing a significant number of query words to be obtained very quickly (note that inflections such as grammatical number or gender, conjugations and unaccented forms are, at least partly, taken into account). To achieve the final classification, we rearrange the preliminary one and enrich it with specific concepts and themes. The main drawback of this approach is that we are not processing natural language, but queries. Little contextual information can be provided to the software, and this reduces its ambiguity-resolving capability while increasing the risk of misclassification. Indeed, we cannot rule out such errors, since we have checked the content of the most frequent subgroups only. However, possible mistakes in the classification of infrequent words should not much affect the distribution of themes, especially for the largest classes. The choice of concepts and the classification structure itself may be debatable too, but in our opinion it ultimately provides a nontrivial overview of what kind of information the 19th Century Digital Newspapers Project website users are looking for.
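The inflection handling mentioned above (grammatical number and gender, unaccented forms) can be approximated, very roughly, with a toy normalizer. The sketch below is an illustration of the idea only; it both over- and under-stems real vocabulary and is far weaker than the morphological dictionaries of a real tool:

```python
import unicodedata

def light_stem(word):
    """Crude French normalization: lowercase, strip accents, and chop a
    few common inflectional endings (plural -aux/-s/-x, feminine -e),
    keeping a stem of at least three characters."""
    w = unicodedata.normalize("NFKD", word.lower())
    w = "".join(c for c in w if not unicodedata.combining(c))
    for suffix in ("aux", "s", "x", "e"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)]
    return w

# 'tisseurs', 'Tisseur' and 'tisseur' collapse onto one key, so the
# variants count as a single reference.
print({light_stem(w) for w in ["tisseurs", "Tisseur", "tisseur"]})
```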
It is also important to underline that, in order to enrich the default scenario and obtain a more relevant classification, specific dictionaries (of landmarks and historical personages, for instance) have been built. These dictionaries could be extended and applied to other studies on search behavior in web interfaces providing access to the historical regional press.
CONCLUSION
In this article, we have reported on a transaction log analysis that has provided usage data for a digital historical newspaper collection. Transaction log analysis is not only unobtrusive but also provides “a direct and
immediately available record of what people have done: not what they say they might or would do; not what
they were prompted to say, not what they thought they did” [Nicolas et al. 2006, p. 1349].
We have shown that semantic text analysis enhances the results of transaction log analysis and helps to disclose information seeking behavior. In this context, the adoption of automatic or semi-automatic text analysis tools offers the possibility of processing large transaction logs in a reasonable time. As a result, we can provide librarians with flexible support for building new thematic tracks in the interface, thus enhancing online collection access. Still, it must be underlined that what we have presented are preliminary results that need to be validated more thoroughly. In particular, the extent and influence of misclassifications must be assessed. To strengthen our findings, we plan to enhance transaction segmentation and to analyze other transaction logs from the 19th Century Digital Newspapers Project website and from other digital historical newspaper collections. We also wish to investigate the direct application of semantic analysis to the full-text articles available in the collections, in order to compare the resulting thematic distribution (the concepts present in the collection) with the one obtained from the search queries (the concepts that the users are seeking).
REFERENCES
Bollen, J.; R. Luce; S.S. Vemulapalli; W. Xu, doc. online (2003). Usage Analysis for the Identification of Research Trends in
Digital Libraries. D-Lib Magazine. http://www.dlib.org/dlib/may03/bollen/05bollen.html [visited: 28/03/2010].
Deegan, M.; S. Tanner (2001). Digital Futures: Strategies for the Information Age. London: Library Association Publishing,
288 p.
Ferrini, A.; J. Mohr (2009). Uses, Limitations, and Trends in Web Analytics. [In:] A. Spink, I. Taksa eds. (2009). Handbook of Research on Web Log Analysis. London: Information Science Reference, p. 124–142.
Ghiglione, R.; A. Landre; M. Bromberg; P. Molette (1998). L'analyse automatique des contenus. Paris: Dunod, 154 p.
Gorman, G.E.; P. Clayton (2005). Qualitative Research for the Information Professional. 2nd ed. London: Neal-Schuman Publishers, 282 p.
Harley, D.; J. Henke, doc. online (2007). Toward an Effective Understanding of Website Users. D-Lib Magazine Vol. 13,
No. 3/4. http://www.dlib.org/dlib/march07/harley/03harley.html [visited: 28/03/2010].
Jones, A., doc. online (2005). The Many Uses of Newspapers. Technical report for IMLS project “The Richmond Daily Dispatch”. http://dlxs.richmond.edu/d/ddr/docs/papers/usesofnewspapers.pdf [visited: 28/03/2010].
Jones, S.; S.J. Cunningham; R. McNab; S. Boddie (2000). A Transaction Log Analysis of a Digital Library. International
Journal on Digital Libraries Vol. 3, p. 152–169.
Ke, H.-R.; R. Kwakkelaar; Y.M. Tai; L.C. Chen (2002). Exploring Behavior of E-Journal Users in Science and Technology:
Transaction Log Analysis of Elsevier's ScienceDirect Onsite in Taiwan. Library & Information Science Research
Vol. 24, No. 3, p. 265–291.
Krippendorff, K.H. (2003). Content Analysis: An Introduction to Its Methodology. 2nd ed. London: Sage Publications, 440 p.
Loper, E.; S. Bird (2002). NLTK: The Natural Language Toolkit. [In:] Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Somerset, NJ: ACL, p. 62–69.
Molette P., doc. online (2009). De l’APD à Tropes: comment un outil d’analyse de contenu peut évoluer en logiciel de
classification sémantique généraliste. Conférence internationale « Psychologie Sociale & Communication ». Tarbes,
France. http://psc2009.iut-tarbes.fr/IMG/pdf/P_Molette_-_Colloque_Tarbes.pdf [visited: 28/03/2010].
Napier, H.A.; P. Judd; O. Rivers; A. Adams (2003). E-Business Technologies. Boston: Thomson Course Technology, p. 372–380.
Nicolas, D.; P. Huntington; H.R. Jamali; A. Watkinson (2006). The Information Seeking Behaviour of the Users of Digital Scholarly Journals. Information Processing & Management Vol. 42, No. 5, p. 1345–1365.
Pennebaker, J.W.; C.K. Chung; M. Ireland; A. Gonzales; R.J. Booth, doc. online (2007). The Development and Psychometric
Properties of LIWC2007. http://www.liwc.net/LIWC2007LanguageManual.pdf [visited: 28/03/2010].
Powell, R.; L. Silipigni Connaway (2004). Basic Research Methods for Librarians. Greenwich, Conn: Praeger Publishers,
360 p.
Pu, H.T. (2000). An Exploratory Analysis on Search Terms of Network Users in Taiwan [in Chinese]. National Central Library Bulletin Vol. 89, No. 1, p. 23–37.
Schwartz, R.L.; T. Phoenix; B. De Foy (2008). Learning Perl, 5th ed. O'Reilly Media, 352 p.
Smolczewska, A.; G. Boidin-Lallich (2008). De l'édition traditionnelle à l'édition numérique: le cas de la collection. [In:] Proceedings of the conference Document numérique et Société. Paris: Cnam.
Snow, K.; B. Ballaux; B. Christensen-Dalsgaard; H. Hofman et al., doc. online (2008). Considering the User Perspective:
Research into Usage and Communication of Digital Information. D-Lib Magazine Vol. 14, No. 5/6.
http://www.dlib.org/dlib/may08/ross/05ross.html [visited: 28/03/2010].
Spink A.; B.J. Jansen (2004). Web Search: Public Searching of the Web. Dordrecht: Springer, 199 p.
Sweeney, M. (2007). The National Digital Newspaper Program: Building on a Firm Foundation. Hawkins Serials Review
Vol. 33, p. 188–189.
Joachims, T., doc. online (2002). Optimizing Search Engines Using Clickthrough Data. [In:] Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. www.cs.cornell.edu/People/tj/publications/joachims_02c.pdf [visited: 28/03/2010].