by Claudio Gennaro, Raffaele Perego and Fausto Rabitti
Searching for non-text data (eg, images) is mostly done by means of metadata annotations or by extracting the text close to the data. However, supporting real content-based audio-visual search, based on similarity search on features, is significantly more expensive than searching for text. Moreover, the search exhibits linear scalability with respect to the data set size. The European project SAPIR is currently addressing this problem.
A large component of Web content consists of non-text data, such as images, music, animations, and videos. Current search engines index Web documents by their textual content. For instance, Web tools for performing image searching (such the ones provided by Google, Yahoo!, or MSN Live Search) simply index the text near the image and the ALT attribute of the IMG tag, used to provide a description of an image. One reason is that search at the level of features (such as color histograms or shapes) exhibits linear scalability with respect to the data search size, which is not acceptable for the expected dimension of the problem. The motivation is that for this kind of data the appropriate search methods are based on the similarity paradigm, which typically exploits range and nearest neighbor queries. These queries are computationally more intensive than exact match, since conventional inverted indexes used for text are not suitable for such data.
Peer-to-peer (P2P) systems are currently considered a promising way to address the problems of scalability, and several scalable and distributed search structures have been proposed covering even the most generic cases of metric space searching. A common characteristic of all these existing approaches is the autonomy of the peers with no need for central coordination or flooding strategies. Since there are no bottlenecks, the structures are scalable and high performance is achieved through parallel query execution on individual peers.
The European project SAPIR (Search on Audio-visual content using Peer-to-peer Information Retrieval) aims at developing a P2P architecture able to provide a scalable indexing structure that can be used for dealing with complex queries, ie, queries that involve more than one feature, such as: find all images in database similar to the query image with respect to the color and the shape.. SAPIR will support multimedia content for uploading/pushing, and for searching/retrieving from a variety of devices, including mobile phones, PDAs, and PCs. In general, there will be two different types of crawling methods: pull and push. The former is the more traditional: crawlers are responsible for locating, browsing, and gathering the web resources. In the latter case, the resources upload their content directly to the crawler (see Figure 1). User context such as GPS position, query history, and social networking (for groups of users with similar interests) will be used to increase result precision. Caching techniques will be developed to increase system performance. The indexing subsystem extracts the features from the data and, based on the indexing policy, indexes them in its local "audio-visual index" or distributes them over the P2P network to remote indices. Each peer of the P2P network provides a searching service, which either executes the query locally on its audio-visual index or submits the query over the P2P network to remote peers.
Complex queries are of fundamental importance in the application area of the project. For instance: find all images in database similar to the query image with respect to the color and the shape. In this situation, it is necessary to use an incremental nearest neighbor algorithm, since we do not know how many neighbors must be retrieved before one is found that satisfies the conditions. In the context of this project, ISTI-CNR is developing a first distributed incremental nearest neighbor algorithm for P2P-based systems. Our solution, based on a generalization of the priority queue algorithm proposed for hierarchical centralized structures, is optimal and not tied to a specific P2P architecture.