Seria III: ePublikacje Instytutu INiB UJ. Red. Maria Kocójowa
Nr 7 2010: Biblioteki, informacja, książka: interdyscyplinarne badania i praktyka w XXI wieku

Agnieszka Smolczewska Tona*
Université Charles-de-Gaulle Lille 3
GERIICO

COMBINING WEB ANALYTICS AND COMPUTATIONAL LINGUISTICS TO ENHANCE ACCESS TO DIGITAL LIBRARIES. A CASE STUDY

[POŁĄCZENIE WEBOMETRII I LINGWISTYKI KOMPUTEROWEJ JAKO SPOSÓB USPRAWNIENIA DOSTĘPU DO BIBLIOTEK CYFROWYCH. STUDIUM PRZYPADKU]

Abstract: The article reports on a study of user searching activities carried out for the Lyon Public Library to enhance a prototype online user interface developed for the CaNu XIX Project (19th Century Digital Newspapers Project). Gathering and analyzing user searching behavior in the digital library context involves several specific challenges, such as the difficulty of reaching physically absent, geographically dispersed and culturally unbounded populations. One of several methods to achieve this purpose consists in analyzing log file data. Log files allow tracking not only the traffic (number of visitors, visit duration, etc.) at a particular website, but also what those visitors are doing at the website concerned (individual click trails, searched keywords, etc.). The data analyzed in the present study come from the 19th Century Digital Newspapers Project website logs. We focus our attention on search terms: techniques from computational linguistics have allowed us to sort them into categories representing the major topics searched by visitors. Results from this process can be exploited by the library’s online service to enhance the design of the website. The potential and the limits of this multidisciplinary approach combining web analytics and computational linguistics are thoroughly discussed in the paper.
DIGITAL HISTORIC NEWSPAPERS – INFORMATION RETRIEVAL – QUERY TERMS – TRANSACTION LOG ANALYSIS – SEMANTIC ANALYSIS

* AGNIESZKA SMOLCZEWSKA TONA, PhD; Associate Professor in UFR IDIST (Department of Information, Documentation, Scientific and Technical Information), Université Charles-de-Gaulle Lille 3, France; coordinator of the Post-Graduate Degree Program in Document Engineering, Edition and Cross-Cultural Mediation (Université Lille 3); MA in Computational Linguistics (Trinity College Dublin, Ireland); post-graduate degree in Computer Science for Social Sciences (Université Pierre Mendès-France, Grenoble 2, France); PhD in Information and Communication Sciences (Université Claude Bernard, Lyon 1, France). The two most important publications: (2006) Une nouvelle lecture de la structure d’un document en vue de la construction d’index [A New Look at the Document’s Structure in View of Index Construction]. [In:] Terminologie et accès à l’information [Terminology and Access to Information]. Paris: Hermes Science Publications, p. 101–118 [co-author: G. Lallich-Boidin]; (2006) An Interpretation of the Effort Function through the Mathematical Formalism of Exponential Infometric Process. “Information Processing and Management” 42(6), p. 1442–1450 [co-author: T. Lafouge]. E-mail: [email protected]

[Dr AGNIESZKA SMOLCZEWSKA TONA, UFR IDIST [Instytut Informacji Naukowo-Technicznej i Dokumentacji], Université Charles-de-Gaulle Lille 3, Francja; koordynatorka studiów III stopnia kierunku „Edytorstwo elektroniczne i komunikacja międzykulturowa” (Université Lille 3); absolwentka lingwistyki komputerowej (MA, Trinity College Dublin, Irlandia) i studiów podyplomowych w zakresie informatyki dla nauk społecznych (Université Claude Bernard, Lyon 1, Francja). Dwie najważniejsze publikacje: (2006) Une nouvelle lecture de la structure d’un document en vue de la construction d’index [Nowe spojrzenie na strukturę dokumentu w perspektywie tworzenia indeksów].
[In:] Terminologie et accès à l’information [Terminologia i dostęp do informacji]. Paris: Hermes Science Publications, p. 101–118 [współaut.: G. Lallich-Boidin]; (2006) An Interpretation of the Effort Function through the Mathematical Formalism of Exponential Infometric Process [Interpretacja funkcji wytężenia w formalizmie matematycznym wykładniczego procesu informetrycznego]. „Information Processing and Management” 42(6), p. 1442–1450 [współaut.: T. Lafouge]. E-mail: [email protected]].

Abstrakt: Przedstawiono wyniki badania zachowań wyszukiwawczych, które zostało przeprowadzone w Miejskiej Bibliotece Publicznej w Lyonie wśród użytkowników prototypowego interfejsu biblioteki cyfrowej CaNu XIX, stworzonej w ramach projektu digitalizacji XIX-wiecznych czasopism (19th Century Digital Newspapers Project). Pozyskiwanie i analizowanie danych dotyczących zachowań informacyjnych użytkowników biblioteki cyfrowej okazuje się szczególnie trudne ze względu na wirtualność ich interakcji z systemem, rozproszenie geograficzne i niejednorodność cech społeczno-demograficznych. Jedną z metod stosowanych w tego rodzaju badaniach jest analiza plików rejestru (ang. log files), w których odnotowywane są kolejne logowania do systemu. Dane pochodzące z plików rejestru umożliwiają nie tylko ocenę parametrów ruchu w witrynie (liczba odsłon, czas odwiedzin itd.), ale także śledzenie działań podejmowanych przez użytkownika (odtwarzanie indywidualnych tras nawigowania w serwisie, użytych w wyszukiwaniu słów kluczowych itd.).
Przytoczone w artykule dane, pochodzące z plików rejestru biblioteki cyfrowej 19th Century Digital Newspapers Project, analizowano głównie ze względu na ujęte w kwerendach argumenty wyszukiwawcze: dzięki zastosowaniu technik językoznawstwa komputerowego możliwe było posortowanie ich w kategorie odpowiadające najpopularniejszym tematom wyszukiwań. Uzyskane w badaniu wyniki mogą posłużyć za podstawę do optymalizacji interfejsu biblioteki cyfrowej CaNu XIX. We wnioskach końcowych omówiono potencjalne korzyści i ograniczenia podejścia interdyscyplinarnego łączącego elementy webometrii i lingwistyki komputerowej.

ANALIZA PLIKÓW REJESTRU – ANALIZA SEMANTYCZNA – DIGITALIZACJA CZASOPISM HISTORYCZNYCH – WYSZUKIWANIE INFORMACJI

* * *

INTRODUCTION

Old newspapers are probably one of the most important information sources for historical research. Over the last 300 years they have recorded every aspect of everyday human life at the international, national and local level. As a consequence, their content reflects history in the making: “it is one thing to read about historical events from the perspective of historians, narrated with the value of hindsight. It is entirely different to read the story as it was happening” [Sweeney 2007, p. 188]. Historic newspapers, especially regional and local newspapers and those published in the 18th and 19th centuries, have always challenged researchers with practical problems regarding their access. Searching for pieces of information in large-format, non-indexed newspapers has always been time-consuming and often frustrating. Without exact bibliographic information, anyone wishing to access a specific article could spend days, if not months, consulting countless volumes of crumbling newspapers or miles of microfilm surrogates.
Digitization and online publication of historic newspapers, particularly when enhanced features such as indexing and full-text search options are provided, hold out the promise of making this important information easier to find and more accessible. In this way, it allows researchers “to spend more time using material and less time finding it” [Jones 2004, doc. online, p. 20]. Many libraries, archives and other cultural institutions are digitizing their national, regional and local newspapers to make them freely available through the Internet. Recently, the Lyon Municipal Library, which is France’s second largest library after the National Library in Paris, also headed down the path of digitization. Two years ago, part of its nineteenth-century regional newspaper collections was digitized and put online on the 19th Century Digital Newspapers website, developed in the framework of the collaborative project CaNu XIX (for Canards numériques du XIXe) [Smolczewska 2008]. Several tasks were planned in the CaNu XIX project in order to promote the valuable heritage represented by those digital collections. The research work presented in this paper is an outcome of the task devoted to the study of the usages of the digital historic newspapers available online at the 19th Century Digital Newspapers website, with a view to improving the existing prototype interface. The approach presented in the following is based on Web analytics, and more particularly on transaction log analysis, an unobtrusive technique which makes it possible to reach the scattered reader population of digital historic newspapers. We advocate the use of natural language processing tools to improve the extraction of relevant information from full-text search queries, and to find out more precisely what the users are looking for.
The paper is organized as follows. The background of the study is presented in the next section, which includes a description of the CaNu XIX project and a brief state of the art of the uses of digital historical newspapers and of the applications of transaction log analysis to this field. Then our methodology is illustrated. First, we describe the data and their processing for the study. Next, we provide a set of standard metrics on the interface usage, and more particularly on the full-text search facility. Finally, we present a thematic classification of search query contents obtained in a semi-automatic way thanks to the use of semantic analysis software. The potential and the limits of this approach are discussed at the end of the paper.

CONTEXT

The CaNu XIX project

The CaNu XIX project, financed by the Rhône-Alpes region, brought together partners from different regional institutions and research teams led by Lyon Municipal Library over a two-year period ending in December 2009. Its primary objective was to develop a prototype Web interface to provide access in digital form to the historic newspaper collections owned by the library. The second and overall more challenging objective consisted in promoting this valuable digital heritage by enhancing online access to the collections. Among the possible lines of research, the study of interface uses emerged as a particularly promising one. Indeed, in most projects involving digitization of historical newspaper collections, the interfaces are designed for professional users and tend to reproduce the uses of the original paper collection. Since few studies on the use of online collections are available, it is difficult to assess to what extent existing interfaces meet user needs and how they can be improved. The first version of the 19th Century Digital Newspapers website was completed and made freely available on the Internet in December 2007, as a part of the Lyon Municipal Library website.
It presented a collection of 769 issues of Le Progrès Illustré (1890–1905), a very popular local illustrated newspaper. The collection includes digital reproductions of every page from every issue (for a total of over 7 000 pages), available in various formats. The prototype interface supports a number of searching and browsing features. For the purposes of this article we focus our attention on the options for searching the newspapers in the collection. The first option consists in submitting a query from the full-text search box available on every page. This creates a results list of all the issues containing the searched terms, with each term highlighted in every place it occurs on the page. Then the user can choose the appropriate page format for printing or for browsing. The second possibility to find relevant articles is to limit the range to a particular year or issue, using the specific search boxes available on the homepage (full-text search, searching by year, by issue, sample thematic tracks and an index of names).

Current uses of digital historical newspapers

Who is the typical reader of digital historic newspapers? The answer to this question is not an easy one: despite the growing popularity of historical newspapers among readers, very little research has been done regarding their current uses and users. There have been, however, at least two studies suggesting that old newspapers engage a broad audience, both academic and non-academic. A survey conducted under the NEWSPLAN Scotland project, exploring the use of historic newspapers on microfilm in Scottish public libraries, revealed that for non-academic readers seeking titles older than a hundred years, the major reasons for using the newspapers were family and local history.
“Users wanted to look at birth, marriage and death records […], weather reports, shipping records, information on soldiers, school board information, and gaining insight into Scottish perspective in major historic events” [Jones 2004, doc. online, p. 26]. Concerning the use of 19th century newspapers by academic and scholarly audiences, Jones [2004, doc. online] shows that they serve as important historical sources, both primary and secondary, for scholars of multiple disciplines (historians, archaeologists, geographers, geologists, linguists, etc.), as well as for librarians, teachers, students, and genealogists [Jones 2004, doc. online, p. 2]. Nevertheless, even though considerable useful information has been forthcoming from these studies, they do not fully answer the questions about users of digital historic newspapers. Indeed, both studies have inquired about users of these resources in their traditional forms (printed format or microfilm), and they do not discuss the specific ways those traditional user groups have utilized them. This lack of studies regarding the specific users and uses of digital historic newspaper collections can be explained, in our opinion, by two factors. Firstly, a great many of those projects are in their infancy: the oldest still running projects are less than ten years old. Secondly, the study of uses in a digital environment proves to be difficult. Indeed, investigating current and potential user information needs and behavior in the digital context presents several specific challenges. One of them is to adopt “a robust and scientifically grounded methodology that provides rich and detailed data on the working habits of users interacting with digital materials” [Snow et al. 2008, doc. online]. In order to collect the most detailed and rigorous data, researchers apply a number of qualitative and quantitative research methods, including questionnaires, interviews, focus group interviews, observation and experiments.
In user studies which aim to investigate information needs and behavior, observation and interviews have been found to be the most valuable methods [Gorman, Clayton 2005; Inskip et al. 2007]. However, in the digital environment, these traditional data collection techniques pose a severe practical problem. As we argued earlier, a historical document collection in a traditional library, as a physical (and often unique, hence valuable) object, can be accessed in one place and by one user at a time. Its digital surrogate, freed from the physical constraints of the library and freely available on the library website, has the virtues of ubiquity and simultaneity [Deegan 2001, p. 402]. It can be accessed continuously and simultaneously, from anywhere (or virtually anywhere), at any time and by anyone, library subscriber or random website visitor, with access to a computer or other type of terminal connected to the network. In that context, a key question arises: how to reach this physically absent, geographically dispersed and culturally unbounded reader population? One of the possible solutions to this underlying difficulty of collecting data about the use of online resources consists in applying a technique known as transaction log analysis (TLA).

Transaction log analysis

Transaction log analysis is a relatively recent technique whose use as a research method arose in parallel with the development of new information and communication technologies. It refers to the practice of capturing and analyzing log file data from a particular website and is thus also known as Web log analytics or (online) Web analytics [Ferrini, Mohr 2009, p. 125]. Transaction log analysis is considered an unobtrusive method which takes advantage of the technology that is being evaluated [Powell et al. 2004, p. 52].
Indeed, all Web-based systems can be configured to generate an automatic and real-time record of their use by everyone who happens to access their information services. More precisely, every time the “client” computer requests a particular page, the “server” computer on which the website is hosted records a certain amount of data representing the client’s activity on the website. This can include details about the website visitor (the visitor’s URL, or Web address, the visitor’s geographic location, etc.), the IP address of the visitor’s computer, the language setting on the visitor’s computer, the date and time of the visit and its duration, the visitor’s navigation paths, information about viewed or downloaded pages, used keywords and other information that the server is currently configured to gather (for a more extensive overview of the data that can be collected by Web analytics, see [Napier et al. 2003; Ferrini, Mohr 2009]). The growing number of user studies based on the analysis of transaction log data in recent years shows the increasing interest of the scientific community in both the subject matter and the method. In this field, transaction log analysis has been extensively applied for different purposes, and particularly to observe users’ information-seeking behavior when accessing various Web-based resources and services. Thus, it has been found particularly helpful in carrying out investigations of the use of electronic journal databases [Pu 2000; Jones et al. 2000; Ke et al. 2002; Bollen et al. 2003, doc. online; Nicolas et al. 2006; Harley et al. 2007, doc. online] and Web search engines [Thorsten 2002, doc. online; Spink, Jansen 2004; Tjondronegoro et al. 2009]. Different aspects of searching and browsing behavior in the digital library context were more particularly examined by Jones et al. [2000] and Ke et al. [2002].
There are some similarities between these two earlier studies and ours: they all examine user searching activities in terms of search queries. However, in comparison with the study proposed by Ke et al. [2002], the present study does not involve other aspects of searching activity such as browsing or downloading. Moreover, a major difference between our study and that conducted by Jones et al. [2000] lies in the processing of search queries. To gain some qualitative insight into what users are searching for, in the study performed by Jones et al. [2000] query terms were examined and classified into categories manually. In our study, the query content analysis has been performed with the assistance of automatic text analysis tools.

STUDY OF SEARCHING ACTIVITIES ON THE “19TH CENTURY DIGITAL NEWSPAPERS PROJECT” WEBSITE

Log data preprocessing and access statistics

The data used for this study come from a raw transaction log file obtained from the 19th Century Digital Newspapers website server. The transaction log file covers a 6-month (184-day) period from April 15th (0:00:18) to October 16th (23:59:47), 2009, and contains 582 065 access events (or transaction records), each of them representing one user access or action on the interface. The daily average is thus 3 163 transaction records, which gives a first idea of the usage of the website. Every individual access event to the website is recorded with the kind of information described above in the section dedicated to Web analytics. For the purpose of this study, the most relevant pieces of information are the date and the time of the request, the Internet Protocol (IP) address of the requester and the nature of the request. Table 1 shows some sample records of unprocessed data.

Table 1.
Sample records of unprocessed data from the transaction log file

DATE AND TIME        IP ADDRESS   REQUEST
2009-08-03 9:45:23   X.X.X.X      PER001941 (page: Le Progrès Illustré, n°4, p. 5)
2009-08-03 9:45:26   X.X.X.X      /presseIllustree/search?searchType=&query=marin
2009-08-03 9:45:32   X.X.X.X      /PER0011261 (page: Le Progrès Illustré, n°6, p. 3)
2009-08-03 9:45:51   X.X.X.X      /PER0012350 (page: Le Progrès Illustré, n°83, p. 4)
2009-08-03 9:46:09   X.X.X.X      /presseIllustree/
2009-08-03 9:47:24   X.X.X.X      PER0012944 (fascicule: Le Progrès Illustré, n°17, 1894/03/1)
2009-08-03 9:48:36   X.X.X.X      /presseIllustree/search?query=fullname_subject:”société”

Source: Research by the author

The first column displays the date and the time at which the request was made (in year-month-day hour:minutes:seconds format), the second column contains the IP address, and the remainder of the line contains the user’s request. The original raw log file, which contained data that are not relevant for this study, was preprocessed to remove unwanted transaction records and to provide metrics corresponding to our goals. All data preprocessing and metrics computation algorithms were coded in Perl [Schwartz 2008]. The first step in preprocessing the log file was a data cleaning procedure: the transaction records generated by the website development and maintenance team (71 181) and by some known Web robots (11 290) were removed from the original file. This processing reduced the size of the transaction log file from 582 065 to 499 594 access events. The number of unique IP addresses in the processed transaction log file is 7 059. In general, there is no guarantee that each IP address corresponds to one user. Indeed, when personal information about the user (“cookies”) is available in the transaction log file, it is generally admitted that using it provides a less error-prone means to identify individual users than using the IP address.
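The cleaning step described above can be sketched as follows. This is a minimal illustration, not the project's actual code (which was written in Perl): the tab-separated field layout and the team/robot IP lists are assumptions for the example.

```python
# Hypothetical sketch of the log cleaning procedure: drop transaction records
# generated by the maintenance team or by known Web robots, then count the
# remaining access events and unique IP addresses.

TEAM_IPS = {"10.0.0.1"}        # development/maintenance hosts (assumed)
ROBOT_IPS = {"66.249.66.1"}    # known Web robots (assumed)

def clean_log(lines):
    """Parse tab-separated records and keep only genuine visitor events."""
    kept = []
    for line in lines:
        date, time, ip, request = line.rstrip("\n").split("\t", 3)
        if ip in TEAM_IPS or ip in ROBOT_IPS:
            continue
        kept.append((date, time, ip, request))
    return kept

log = [
    "2009-08-03\t9:45:23\tX.X.X.X\tPER001941",
    "2009-08-03\t9:45:26\tX.X.X.X\t/presseIllustree/search?searchType=&query=marin",
    "2009-08-03\t9:45:30\t66.249.66.1\t/presseIllustree/",  # robot record, removed
]
records = clean_log(log)
unique_ips = {ip for _, _, ip, _ in records}
print(len(records), len(unique_ips))  # → 2 1
```

On the real file this filtering is what reduced the log from 582 065 to 499 594 events; the set of unique IP addresses then serves as a (rough) proxy for individual users.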
Since cookie recording was not enabled on the 19th Century Digital Newspapers Project website server, we had to work with IP addresses only. Since this study focuses on search activities, a second preprocessing step consisted in extracting the transactions related to those activities from the log file records. In fact, only about 3 000 transactions concern the use of search facilities, the rest consisting of home page accesses and browsing actions.

Basic analysis of searching activities

As in [Ke et al. 2002], we define a query as follows: one or more query terms, and possibly query operators, entered in a “search box” on the website interface. By query term we mean any unbroken string of alphanumeric characters bounded by delimiters such as blanks, quotation marks, apostrophes and commas.

• Searching modes by query

As mentioned previously, three search options are proposed in the website interface. The transaction log records indicate whether the newspaper pages accessed were the result of a query in the full-text search box or in the issue or year search box. The distribution of queries for each search mode is shown in the pie chart of Figure 1.

Figure 1. Distribution of search modes by query (on a total of 3 097 queries)
Full-text search: 2 336 (75%); search by year: 705 (23%); search by issue: 56 (2%)
Source: Research by the author

It can be seen that 75% of the queries were of the full-text search type, 23% corresponded to a search by year and only 2% to a search by issue.

• Searching modes by IP address

Of the 7 059 unique IP addresses which performed some actions on the interface, 558 searched the website using at least one of the three search modes.

Figure 2.
Distribution of search modes by IP address (on a total of 558 IP addresses submitting a query)
Full-text search alone: 311 (56%); search by year alone: 118 (22%); year + full-text: 99 (18%); combinations involving search by issue (issue + year, issue + full-text, issue + year + full-text): 10 (2%), 6 (1%) and 3 (1%)
Source: Research by the author

The most used search mode is thus full-text search alone (56%), followed by search by year alone (22%) or combined with full-text search (18%). Search by issue, alone or combined with the other modes, is seldom used. This is not surprising, since it corresponds to a very specialized way to access the collection. For the rest of the analysis, only full-text search queries have been considered.

• Full-text search queries by IP address

Before analyzing the structure and the content of the full-text search queries, it may be interesting to give some information about the distribution of those queries by IP address. As shown in Table 2, 44.5% of the users (assuming that the one-user-per-IP-address hypothesis is correct) made just one query, about 20% made two queries and slightly more than 35% made three or more queries.

Table 2. Number of full-text search queries per IP address

No. queries   No. IPs   %
1             190       44.5
2             85        19.9
3             37        8.7
4             32        7.5
5             17        4.0
6             11        2.6
7             10        2.3
8             8         1.9
9             9         2.1
10            7         1.6
11            4         0.9
12            3         0.7
13            4         0.9
14            0         0.0
15            1         0.2
16            2         0.5
17            1         0.2
18            1         0.2
19            1         0.2
22            1         0.2
35            1         0.2
46            1         0.2
66            1         0.2

Source: Research by the author

• Search queries ranking by length

Splitting queries into single terms allows studying the distribution of query length. The query length is measured by counting the number of query terms the user entered into the search box. As mentioned before, by query term we mean any unbroken string of alphanumeric characters bounded by delimiters such as a blank, quotation mark, apostrophe or comma.
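The term-splitting rule just given can be sketched in a few lines. This is an illustrative reimplementation, not the original Perl code: the regular expression (which keeps hyphens inside a term, so that compounds like brides-les-bains count as one term, and treats é and other accented letters as alphanumeric) is our assumption.

```python
import re
from collections import Counter

def query_terms(query):
    """Split a query into terms: unbroken runs of alphanumeric characters
    (hyphens kept inside a term), lowercased for later frequency counting."""
    return re.findall(r"[\w-]+", query.lower())

# Query-length distribution over a toy query list (examples from the text);
# a blank query yields zero terms.
queries = ["cinématographe", "expo 1900", "rue neuve-des-philistins", ""]
length_dist = Counter(len(query_terms(q)) for q in queries)
print(length_dist[1], length_dist[2], length_dist[0])  # → 1 2 1
```

Applying the same counter to the 2 336 real queries produces the distribution of Figure 3 and, after deduplication and stop-word removal, the rank-frequency table of Table 3.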
Thus, a query containing only one term (e.g. cinématographe, Saint-Etienne, 1899, etc.) is referred to as a one-term query, a query containing two terms is a two-term query (e.g. expo 1900, rue neuve-des-philistins, crime passionnel), etc. Figure 3 displays the ranking of all queries by length. About 2% of all queries are blank queries, which arise when a user submits a query without entering any query term. The results show that queries tend to be quite short: more than 90% of queries contain between one and three terms. One-term queries are the most commonly encountered (68.9%), followed by two-term queries (16.3%) and three-term queries (8.4%). Less than 4.5% of the queries contain more than 3 terms.

Figure 3. Distribution of search query length (number of search queries = 2 336)
1 term: 1 609 (70%); 2 terms: 381 (16%); 3 terms: 196 (8%); 4 terms: 79 (3%); 5 terms: 12 (1%); 6 terms: 7 (0%); 0 terms (blank): 54 (2%)
Source: Research by the author

• Term occurrences in search queries

A total of 3 384 single terms were extracted from all the full-text search queries (2 336). On average, users entered approximately 1.5 terms per query. After converting uppercase terms to lowercase and eliminating duplicate terms, 1 348 unique terms remained. A complete rank-frequency table was built for all the unique terms left when excluding stop words such as de, aux, etc. (1 329 items). Table 3 shows an excerpt of the rank-frequency table for the top 50 most commonly encountered terms (excluding stop words).

Table 3.
Rank and frequency of the 50 most commonly occurring query terms (1 329 total occurrences)

Rank  Term              Occurrence    Rank  Term            Occurrence
1     bains             36            26    alphonse        10
2     guignol           34            27    blague          10
3     peyrebrune        33            28    exposition      10
4     mode              29            29    greville        10
5     lyon              24            30    henry           10
6     brides            23            31    observatoire    10
7     port              21            32    saint-chamond   10
8     dreyfus           20            33    bataille        9
9     rhône             16            34    chansons        9
10    brides-les-bains  15            35    cyclisme        9
11    carnot            15            36    hôpital         9
12    casino            15            37    imprimerie      9
13    cirque            14            38    villeurbanne    9
14    police            14            39    aix             8
15    bal               13            40    bihin           8
16    géant             13            41    croix-rousse    8
17    mystification     13            42    gier            8
18    napoléon          13            43    hotel           8
19    élégante          13            44    juifs           8
20    allais            12            45    kock            8
21    bellecour         12            46    luigini         8
22    bicyclette        11            47    montbrison      8
23    crime             11            48    rive            8
24    ravachol          11            49    vendanges       8
25    revue             11            50    étudiants       8

Source: Research by the author

Analysis of query contents

• The need for semantic analysis

Counting occurrences of terms in queries does provide some hints on the content of queries, but cannot tell us precisely what the website users are seeking. Let us consider the following query: “Napoléon Bonaparte île Sainte-Hélène”. A raw extraction would give four different terms, whereas the query clearly contains two distinct references, to one famous personage and one geographical place. To obtain more meaningful results, some form of semantic analysis could be used. For example, the use of a dictionary would allow identifying compound substantives, very common in French (e.g., pomme de terre), as unique references. The same consideration applies to the use of specialized (historical, geographical, technical…) thesauri.
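The dictionary-based grouping of compounds into single references can be sketched as follows. The tiny compound dictionary and the greedy longest-match strategy are purely illustrative assumptions; a real system would use a full lexicon or thesaurus.

```python
# Minimal sketch of compound recognition: known multiword expressions are
# matched longest-first so that each one counts as a single reference.

COMPOUNDS = {
    ("pomme", "de", "terre"),
    ("napoléon", "bonaparte"),
    ("île", "sainte-hélène"),
}

def references(terms):
    """Group a lowercased term list into references, merging known compounds."""
    refs, i = [], 0
    while i < len(terms):
        for size in (3, 2):                     # longest match first
            chunk = tuple(terms[i:i + size])
            if len(chunk) == size and chunk in COMPOUNDS:
                refs.append(" ".join(chunk))
                i += size
                break
        else:
            refs.append(terms[i])               # no compound: single-term reference
            i += 1
    return refs

print(references(["napoléon", "bonaparte", "île", "sainte-hélène"]))
# → ['napoléon bonaparte', 'île sainte-hélène']  (two references, not four terms)
```

With such a dictionary, the example query above is correctly counted as two references, which is exactly the behavior the semantic analysis needs.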
To go further in the interpretation of queried words, several grammatical and semantic ambiguities must be solved. Does a query containing “Carnot” refer to the fourth president of the Third French Republic, assassinated in Lyon in 1894, or to his uncle, the famous physicist? Does it refer to concepts related to the former or the latter (“statut Carnot”, “cycle de Carnot”), or to a landmark in Lyon (“place Carnot”)? Content analysis [Krippendorff 2003] can be automated, at least to a certain extent, using Artificial Intelligence ambiguity-solving algorithms [Ghiglione et al. 1998]. Several applications to text and discourse analysis have been proposed, with some success. Unfortunately, content analysis of queries proves to be much more difficult, since little context information is available to help ambiguity solving. However, when the search is performed on a very specialized database, automatic content analysis tools can help retrieve relevant information from the queries. In the following, we show how we have used a text analysis tool, Tropes [Molette 2009], to obtain the distribution of search queries inside a specific semantic classification for the 19th Century Digital Newspapers Project website.

• Semi-automatic semantic analysis of query contents

There exist several software tools that can help carry out (semantic) text analysis. In the open source world, let us cite the Natural Language Toolkit [Loper et al. 2002], a set of Python modules, linguistic data and documentation for research and development in natural language processing. Several scattered modules of uneven quality also exist for Perl. Full-fledged software programs also exist, such as the aforementioned Tropes or LIWC (Linguistic Inquiry and Word Count), a text analysis software program (see [Pennebaker et al.
2007] for a description of the 2007 version) designed to calculate the degree to which people use different categories of words across a wide array of texts. Compared to Tropes, LIWC has a narrower scope and fewer features, and does not provide versions or modules for processing texts in French.

Tropes performs a multi-step text analysis that identifies word and expression senses and can be used for semantic classification, keyword extraction, and linguistic and qualitative analysis. To enable the software to build up a representation of the context, groups of closely related meanings (common nouns, proper nouns, trademarks, etc.) appearing frequently throughout the text are gathered into so-called "equivalent classes". These classes are built via internal dictionaries that contain hundreds of thousands of preset semantic classifications. "Scenarios" can be built to enrich and filter the equivalent classes. This allows defining specific classifications, modifying the internal dictionaries and customizing information retrieval.

To carry out our semantic analysis, we first formatted the query list with unambiguous delimiters to indicate that each query is a separate (and unrelated) text segment. Then we applied the default "Concepts" scenario, a sort of generalist thesaurus, to the formatted query list. Browsing the list of "references" generated by Tropes, we assessed which ones could not be classified by the default scenario and how the others were classified. We then modified the default scenario by merging some themes and concepts, adding new ones and deleting others that were not relevant for our analysis. The goal was to classify, in an appropriate way, as many of the references found in the query list as possible. To give an example, under the default semantic classification the Arsenal, a historical landmark in Lyon, was classified in the category "football club".
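The scenario mechanism can be pictured as a longest-match lookup of query tokens against a dictionary mapping expressions (including compounds) to themes. The sketch below is only an illustration of that idea with invented entries; it does not reproduce the Tropes internals or the actual CaNu XIX scenario:

```python
# Illustrative scenario: expressions (as token tuples) mapped to themes.
# The entries are invented; the real CaNu XIX scenario is far richer.
SCENARIO = {
    ("place", "bellecour"): "Geography",
    ("guignol",): "Personages and Characters",
    ("pomme", "de", "terre"): "Everyday life",
}

def classify_query(query, scenario=SCENARIO):
    """Greedy longest-match classification of a query's tokens."""
    tokens = query.lower().split()
    themes = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate expression starting at position i.
        for length in range(len(tokens) - i, 0, -1):
            candidate = tuple(tokens[i:i + length])
            if candidate in scenario:
                match = (length, scenario[candidate])
                break
        if match:
            themes.append(match[1])
            i += match[0]
        else:
            i += 1  # Unclassified token: leave it for manual inspection.
    return themes

print(classify_query("Guignol place Bellecour"))
# ['Personages and Characters', 'Geography']
```

The longest-match rule is what lets a compound such as "pomme de terre" be recognized as one reference rather than three unrelated terms.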
Defining the new scenario did require some work, but it must be underlined that the classified reference list obtained from the default scenario was already very useful in giving a first idea of the semantic content of the queries. In particular, the fact that compound words and unaccented French forms are automatically recognized, and stop words are left out, really simplifies the task compared to ad-hoc text processing.

Applying the specific CaNu XIX scenario, 1043 references could be classified out of the 1593 detected by Tropes (Figure 4). Most of the 550 unclassified references occur just once (380 references) or twice (80 references) in all queries. Given such low frequencies, it is not worth refining the classification to include them. In fact, browsing the list of unclassified references shows that most of them correspond to misspelt words. Figure 5 shows the distribution of the most frequent classified references by theme. Only the themes containing 20 references or more have been considered.

Figure 4. Tropes graphical user interface, after the application of the CaNu XIX scenario
Source: Research by the author

Figure 5. Distribution of most frequent classified references by theme: Geography 30.2%; Personages and Characters 9.2%; Culture and Arts 8.5%; Society and Politics 7.2%; Everyday life 6.9%; Nature and Agriculture 6.1%; City and Transportation 5.8%; Time 5.4%; Events and Facts 4.1%; Sports and Recreation 3.4%; Media and Communication 2.6%; Science and Technology 2.4%; Medicine and Health 2.0%
Source: Research by the author
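The percentages plotted in Figure 5 amount to normalizing each theme's reference count by the total number of classified references. A minimal sketch of that computation (the counts below are made up for illustration, not the study's data):

```python
def theme_distribution(theme_counts):
    """Return each theme's share of classified references, in percent."""
    total = sum(theme_counts.values())
    return {theme: round(100.0 * n / total, 1)
            for theme, n in theme_counts.items()}

# Made-up counts for three themes:
counts = {"Geography": 315, "Time": 56, "Medicine and Health": 21}
dist = theme_distribution(counts)
print(dist["Geography"])  # 80.4
```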
The classes are composed as follows:

• Geography. This group contains more than 30% of the classified references and is articulated into subgroups related to geographical areas, from the continent and country level down to specific places at city level. About two thirds of the references concern searches focused on the Rhône-Alpes region, whose capital is Lyon, and one third on Lyon itself (10% of the total number of classified references). In this subgroup, we find city places and landmarks such as "(place) Bellecour", Hôtel-Dieu, Croix-Rousse, traboules and so on.

• Personages and Characters. This group contains references to famous (and less famous) personages, many of them local, but also to popular fictional and folk characters. Among the fictional characters, those from the puppet show Guignol (created in Lyon) are the most frequently searched.

• Culture and Arts. Here we find query words related to arts and entertainment, among which music, cinema, theatre and circus, including peculiar forms of entertainment such as occultism (divination, magnétisme, fantômes and so on).

• Society and Politics. References related to law and justice, and more particularly to crimes and punishments, are preponderant here (assassin, banditisme, bagne, guillotine...). Some references concern political doctrines (socialisme, anarchisme...) and religion.

• Everyday life. This group includes references concerning food, beverages and cooking; professions and craftsmanship; fashion, fabrics, clothing and dressmaking. In this last subgroup, not unexpectedly considering Lyon's tradition, we find words such as soie, tisseur, linge, dentelle, broderie.

• Nature and Agriculture. Half of the references concern animals and plants.
Others are related to agriculture and peasant life (vigne, vendanges, moisson).

• City and Transportation. Air, sea and land transportation means and transportation infrastructure can be found here, together with city elements that cannot be related to a specific place ("rues pittoresques", "vieux quartiers").

• Time. Nearly all these references are dates or years.

• Events and Facts. Wars and conflicts, catastrophes and accidents dominate this group, together with a few specific detected events, not necessarily tragic ones (grippe espagnole, exposition universelle, affaire dreyfus).

• Sports and Recreation. This group includes sports, games, spare-time activities, sightseeing and travel.

• Media and Communication. Two thirds of the references in this group concern modern and historical press (newspapers and magazines).

• Science and Technology. The sciences most frequently referred to are astronomy and archeology. Industry and machines are also often mentioned.

• Medicine and Health. References to diseases, pharmacy and treatments can be found here.

Benefits and limits of the proposed approach

In principle, the recognition and classification of query concepts could have been done manually, as has been proposed in the literature on web log analysis. However, with more than three thousand queries to analyze, the manual approach does not seem viable. In our approach, we do not start from scratch to build the semantic classification of queries.
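One of the normalizations that such a tool provides out of the box, matching accented and unaccented French forms, can be approximated with the standard library alone by stripping combining diacritics after Unicode decomposition. This sketch covers only that single normalization, not the tool's handling of number, gender or conjugation:

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics so that 'hôpital' and 'hopital' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category Mn) left by decomposition.
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(strip_accents("hôpital") == strip_accents("hopital"))  # True
print(strip_accents("élégante"))  # elegante
```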
The use of a text analysis tool with large internal dictionaries and disambiguation features makes it possible to obtain, very quickly, a preliminary classification covering a significant number of query words (note that inflections such as grammatical number or gender, conjugations and unaccented forms are, at least partly, taken into account). To arrive at the final classification, we rearrange the preliminary one and enrich it with specific concepts and themes.

The main drawback of this approach is that we are not processing natural language, but queries. Little contextual information can be provided to the software, and this reduces its disambiguation capability while increasing the possibility of misclassification. Indeed, we cannot rule out such mistakes, since we have checked the content of the most frequent subgroups only. However, occasional misclassifications of infrequent words should not much affect the distribution of themes, especially for the largest classes. The choice of concepts and the classification structure itself may be debatable too, but in our opinion the result provides a nontrivial overview of what kind of information the 19th Century Digital Newspapers Project website users are looking for. It is also worth underlining that, in order to enrich the default scenario and obtain a more relevant classification, specific dictionaries (of landmarks and historical personages, for instance) have been built. These dictionaries could be extended and applied to other studies of search behavior in web interfaces providing access to historical regional press.

CONCLUSION

In this article, we have reported on a transaction log analysis that has provided usage data for a digital historical newspaper collection.
Transaction log analysis is not only unobtrusive but also provides "a direct and immediately available record of what people have done: not what they say they might or would do; not what they were prompted to say, not what they thought they did" [Nicolas et al. 2006, p. 1349]. We have shown that semantic text analysis enhances the results of transaction log analysis and helps disclose information seeking behavior. In this context, the adoption of automatic or semi-automatic text analysis tools offers the possibility of processing large transaction logs in a reasonable time. As a result, we can provide librarians with flexible support for building new thematic tracks in the interface, thus enhancing online collection access.

Still, it must be underlined that what we have presented are preliminary results that need to be validated more thoroughly. In particular, the extent and influence of misclassifications must be assessed. To strengthen our findings, we plan to enhance transaction segmentation and to analyze other transaction logs from the 19th Century Digital Newspapers Project website and from other digital historical newspaper collections. We also wish to investigate the direct application of semantic analysis to the full-text articles available in the collections, in order to compare the resulting thematic distribution (the concepts present in the collection) with the one obtained from the search queries (the concepts that the users are seeking).

REFERENCES

Bollen, J.; R. Luce; S.S. Vemulapalli; W. Xu, doc. online (2003). Usage Analysis for the Identification of Research Trends in Digital Libraries. D-Lib Magazine. http://www.dlib.org/dlib/may03/bollen/05bollen.html [visited: 28/03/2010].
Deegan, M.; S. Tanner (2001). Digital Futures: Strategies for the Information Age.
London: Library Association Publishing, 288 p.
Ferrini, A.; J. Mohr (2009). Uses, Limitations, and Trends in Web Analytics. [In:] A. Spink, I. Taksa eds. Handbook of Research on Web Log Analysis. London: Information Science Reference, p. 124–142.
Ghiglione, R.; A. Landre; M. Bromberg; P. Molette (1998). L'analyse automatique des contenus. Paris: Dunod, 154 p.
Gorman, G.E.; P. Clayton (2005). Qualitative Research for the Information Professional. 2nd ed. London: Neal-Schuman Publishers, 282 p.
Harley, D.; J. Henke, doc. online (2007). Toward an Effective Understanding of Website Users. D-Lib Magazine Vol. 13, No. 3/4. http://www.dlib.org/dlib/march07/harley/03harley.html [visited: 28/03/2010].
Jones, A., doc. online (2005). The Many Uses of Newspapers. Technical report for IMLS project "The Richmond Daily Dispatch". http://dlxs.richmond.edu/d/ddr/docs/papers/usesofnewspapers.pdf [visited: 28/03/2010].
Jones, S.; S.J. Cunningham; R. McNab; S. Boddie (2000). A Transaction Log Analysis of a Digital Library. International Journal on Digital Libraries Vol. 3, p. 152–169.
Ke, H.-R.; R. Kwakkelaar; Y.M. Tai; L.C. Chen (2002). Exploring Behavior of E-Journal Users in Science and Technology: Transaction Log Analysis of Elsevier's ScienceDirect OnSite in Taiwan. Library & Information Science Research Vol. 24, No. 3, p. 265–291.
Krippendorff, K.H. (2003). Content Analysis: An Introduction to Its Methodology. 2nd ed. London: Sage Publications, 440 p.
Loper, E.; S. Bird (2002). NLTK: The Natural Language Toolkit. [In:] Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Somerset, NJ: ACL, p. 62–69.
Molette, P., doc. online (2009). De l'APD à Tropes: comment un outil d'analyse de contenu peut évoluer en logiciel de classification sémantique généraliste. Conférence internationale « Psychologie Sociale & Communication ». Tarbes, France.
http://psc2009.iut-tarbes.fr/IMG/pdf/P_Molette_-_Colloque_Tarbes.pdf [visited: 28/03/2010].
Napier, H.A.; P. Judd; O. Rivers; A. Adams (2003). E-Business Technologies. Boston: Thomson Course Technology, p. 372–380.
Nicolas, D.; P. Huntington; H.R. Jamali; A. Watkinson (2006). The Information Seeking Behaviour of the Users of Digital Scholarly Journals. Information Processing & Management Vol. 42, No. 5, p. 1345–1365.
Pennebaker, J.W.; C.K. Chung; M. Ireland; A. Gonzales; R.J. Booth, doc. online (2007). The Development and Psychometric Properties of LIWC2007. http://www.liwc.net/LIWC2007LanguageManual.pdf [visited: 28/03/2010].
Powell, R.; L. Silipigni Connaway (2004). Basic Research Methods for Librarians. Greenwich, Conn.: Praeger Publishers, 360 p.
Pu, H.T. (2000). An Exploratory Analysis on Search Terms of Network Users in Taiwan [in Chinese]. National Central Library Bulletin Vol. 89, No. 1, p. 23–37.
Schwartz, R.L.; T. Phoenix; B. De Foy (2008). Learning Perl. 5th ed. O'Reilly Media, 352 p.
Smolczewska, A.; G. Lallich-Boidin (2008). De l'édition traditionnelle à l'édition numérique: le cas de la collection. [In:] Proceedings of the conference Document numérique et Société. Paris: Cnam.
Snow, K.; B. Ballaux; B. Christensen-Dalsgaard; H. Hofman et al., doc. online (2008). Considering the User Perspective: Research into Usage and Communication of Digital Information. D-Lib Magazine Vol. 14, No. 5/6. http://www.dlib.org/dlib/may08/ross/05ross.html [visited: 28/03/2010].
Spink, A.; B.J. Jansen (2004). Web Search: Public Searching of the Web. Dordrecht: Springer, 199 p.
Sweeney, M. (2007). The National Digital Newspaper Program: Building on a Firm Foundation. Serials Review Vol. 33, p. 188–189.
Joachims, T., doc. online (2002). Optimizing Search Engines Using Clickthrough Data. [In:] Proceedings of the ACM Conference on Knowledge Discovery and Data Mining. www.cs.cornell.edu/People/tj/publications/joachims_02c.pdf [visited: 28/03/2010].