by Diego Ceccarelli, Sergiu Gordea, Claudio Lucchese, Franco Maria Nardini, Raffaele Perego and Gabriele Tolomei
Europeana is a strategic project funded by the European Commission with the goal of making Europe's cultural and scientific heritage accessible to the public. ASSETS is a two-year Best Practice Network co-funded by the CIP PSP Programme to improve performance, accessibility and usability of the Europeana search engine. Here we present a characterization of the Europeana logs by showing statistics on common behavioural patterns of the Europeana users.
The strong inclination for culture and beauty in Europe has created a rich abundance of artifacts, starting with antiquity and continuing to today. This cultural strength is recognized globally, making Europe the preferred destination for 50% of world tourism.
Europeana is a long-term project funded by the European Commission with the goal of making Europe's cultural and scientific heritage accessible to the public. Since 2008, about 1 500 institutions have contributed to Europeana, enabling people to explore Europe's museums, libraries and archives. This huge amount of multilingual and multimedia data is made available through the Europeana Portal, a search engine that allows users to explore the content by means of textual queries.
As the amount of published information grows, finding the description of a specific masterpiece is becoming increasingly difficult, in particular when the user is not able to formulate a sufficiently discriminating query. For example, if we search today for general terms like “renaissance” or “art nouveau” we obtain more than 10 000 results. If we search for the term “Gioconda” we find a couple of hundred items, while the query “Mona Lisa, Da Vinci” returns twenty images of the well-known painting. These examples show how important it is for users to express their information needs precisely in textual queries when looking for very specific information on the web through a portal like Europeana. This is even more challenging since Europeana documents are cross-domain, multi-lingual, and multi-cultural.
To improve the Europeana users' search experience, the ASSETS project has the overall goal of enhancing the performance of the Europeana search engine. One of the most important resources for enhancing the users’ search experience in large information spaces is the information stored in query logs. The knowledge extracted from query logs can be used to enhance both the efficiency (ie, response time, throughput) and the efficacy (ie, quality of results) of information retrieval platforms.
Figure 1: Most popular queries submitted by Italian users. Size is proportional to frequency.
The Europeana logs show the behaviour of users interacting with the portal. We present here a preliminary characterization of this behaviour, and a comparison with that of general Web users. Figure 2 shows the frequency distribution of submitted queries. As expected, the popularity of the queries follows a power-law distribution, p(x) ∝ k · x^(-α), where x is the popularity rank. The best-fitting value of the exponent is α = 0.86, which indicates the skew of the frequency distribution. The larger the α, the larger the portion of the log covered by the most frequent queries. The same analysis conducted on query logs from commercial Web search engines shows larger values of α (2.4 and 1.84 for an Excite and a Yahoo! query log, respectively).
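As an illustration of how such an exponent can be estimated, the following minimal sketch fits α by ordinary least squares on the log-log rank/frequency curve. The article does not state which fitting procedure was used, so the method, the function names and the synthetic data are assumptions made for this example only.

```python
import math
from collections import Counter


def rank_frequencies(queries):
    """Return query frequencies sorted from most to least popular."""
    return sorted(Counter(queries).values(), reverse=True)


def fit_alpha(frequencies):
    """Estimate alpha in p(x) ∝ k * x^(-alpha), where x is the popularity rank.

    Assumption: a least-squares line is fitted to (log rank, log frequency);
    the slope of that line is -alpha.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic frequencies that follow the power law exactly (not real log data).
synthetic = [1000 * rank ** -0.86 for rank in range(1, 51)]
print(round(fit_alpha(synthetic), 2))
```

A least-squares fit on log-log axes is simple but sensitive to the noisy tail of rare queries; maximum-likelihood estimators are a common alternative when a more robust estimate is needed.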
Such a small value of α means that the most popular queries submitted to Europeana do not account for a significantly large portion of the query log. This might be explained by considering the characterizing features of Europeana. Indeed, since Europeana is strongly focused on the specific context of cultural heritage, its users are likely to draw on a larger and more diverse vocabulary. In addition, we found that the average length of queries is 1.86 terms, which is again lower than the typical value observed in Web search engine logs. We can argue that Europeana users use a richer vocabulary, with discriminative queries made of specific domain terms.
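The two quantities discussed above, the share of the log covered by the most popular queries and the average query length, can be computed with a few lines of code. This is a minimal sketch under stated assumptions: terms are obtained by splitting on whitespace, and the function names and sample queries are hypothetical, not taken from the Europeana logs.

```python
from collections import Counter


def top_k_coverage(queries, k):
    """Fraction of all submissions accounted for by the k most frequent queries."""
    counts = Counter(queries)
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / len(queries)


def mean_query_length(queries):
    """Average number of whitespace-separated terms per submitted query."""
    return sum(len(query.split()) for query in queries) / len(queries)


# Illustrative sample only.
sample = ["mona lisa", "mona lisa", "renaissance", "art nouveau"]
print(top_k_coverage(sample, 1))   # share covered by the single top query
print(mean_query_length(sample))   # average terms per query
```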
Figure 2: Frequency distribution of queries.
Figure 3: Distribution of the queries over the countries.
Figure 3 shows the distribution of the queries grouped by country. France, Germany, and Italy are the three major countries, accounting for about 50% of the total traffic of the Europeana portal.
Furthermore, Figure 4 reports the number of queries submitted per day. We observe a periodic behaviour on a weekly basis, with a number of peaks probably related to some Europeana dissemination or advertisement activities. For example, we observe several peaks between 18 and 22 November, probably due to the fact that, during this period, Europeana announced the indexing of new collections and contents of 14 million documents.
Figure 4: Distribution of the searches over the days.
Figure 5: Distribution of the searches over the hours.
Figure 5 shows the load on the Europeana portal on an hourly basis. We observe a particular trend. The peak of load on the Europeana portal is in the afternoon, between 15.00 and 17.00. This is slightly different from commercial Web search engines where the peak is reached in the evening, between 19.00 and 23.00. A possible explanation for this phenomenon could be that the Europeana portal is intensively used by people working in the cultural heritage field and thus mainly accessed during working hours, whereas a commercial Web search engine is used by a wider range of users with the most disparate information needs throughout the entire day.
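An hourly load profile of this kind can be derived by bucketing query timestamps by hour of day. The sketch below assumes that each log entry carries an ISO 8601 timestamp; the actual Europeana log format is not described here, and the function names and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def queries_per_hour(timestamps):
    """Count queries in each of the 24 hours of the day.

    Assumption: timestamps are ISO 8601 strings such as '2000-01-01T15:05:00'.
    """
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def peak_hour(hourly_counts):
    """Return the hour (0-23) with the highest number of queries."""
    return max(range(24), key=lambda hour: hourly_counts[hour])


# Made-up timestamps for illustration only.
sample = [
    "2000-01-01T15:05:00",
    "2000-01-01T15:40:00",
    "2000-01-02T20:10:00",
]
hourly = queries_per_hour(sample)
print(peak_hour(hourly))
```

Grouping by `datetime.date()` instead of `hour` would give a per-day series like the one in Figure 4.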
Raffaele Perego, ISTI–CNR, Italy
Tel: +39 050 3152993