In the future, all data centres will be parallel: racks of modular servers housing multi-core CPUs, linked by fast interconnect fabrics. We investigate how to build peak-efficiency data analytics software for parallel text analysis, scalable enough for large corpora yet responsive enough for interactive use. The goals are clear: to achieve supercomputer performance at cloud prices, and to push the limits of text search.
Automated text search and analysis require a formal representation of text, and term frequency vectors remain one of the most reliable representations in use. Most documents and queries contain only a small fraction of the vocabulary, so very few terms have a non-zero frequency. Naturally, many search systems try to preserve this sparse structure, yet the nature of language suggests a different approach. Synonyms and hyponyms already work against sparseness, but the ability to paraphrase goes far beyond relationships amongst individual words. Query expansion, and query refinement by specifying desired and undesired search results, are technical functions that cause the same problem: loss of sparseness.
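As a toy illustration, not taken from our system, consider how little of a large vocabulary a single document actually touches, and why a sparse map is the natural encoding:

```python
from collections import Counter

# Toy illustration: a document reduced to term frequencies. Against a
# vocabulary of tens of thousands of terms, only the handful of terms
# that actually occur receive a non-zero entry, so a dictionary mapping
# term -> count is a natural sparse representation.
def term_frequencies(text):
    return Counter(text.lower().split())

doc = "parallel search on parallel hardware for parallel text analysis"
print(term_frequencies(doc))  # only 7 distinct terms are non-zero
# Expanding a query with synonyms adds further non-zero entries,
# which is precisely how sparseness is lost.
```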
One response is to use dimensionality reduction, such as latent semantic indexing (LSI), to map the sparse index vectors into a lower-dimensional space. The resulting vectors are densely populated with non-zero entries, so there is simply no sparseness left to lose in processing. Now, however, we must cope with the cost of computing on dense vectors. Clustering can drastically limit the search space, but even if we save what we can with brains, a large volume of computation always remains that must be conquered with brawn.
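The following sketch shows generic LSI-style reduction on toy data (it is not our RTVR method): a term-document matrix is factorized with a truncated SVD, and each document is re-expressed as a short, dense vector.

```python
import numpy as np

# Generic LSI-style sketch: factorize a toy term-document matrix with a
# truncated SVD and project documents into a low-dimensional space.
rng = np.random.default_rng(0)
A = (rng.random((2000, 300)) < 0.01).astype(float)  # sparse toy term-doc matrix

k = 50                                   # target dimensionality
U, s, Vt = np.linalg.svd(A, full_matrices=False)
P = U[:, :k].T                           # k x terms projection matrix

doc = A[:, 0]                            # one sparse document column
dense = P @ doc                          # its dense k-dimensional image
print(np.count_nonzero(doc), "non-zero terms ->", dense.shape, "dense entries")
```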
We tackle the problem with computational efficiency and parallelism. Parallel computing not only addresses scalability; it also allows us to partition the data into parts that fit into fast main memory, while careful algorithm design keeps communication costs to a minimum. To remain committed to performance at all times, we have begun constructing components on top of existing middleware, without subscribing to substantially new paradigms. This has allowed us to detect problems as they emerge and to develop solutions accordingly.
We have developed a new form of dimensionality reduction called rare term vector replacement (RTVR), which serves as the basis for our text representation. On the Reuters RCV1-V2 corpus, it reduces the feature count from 47,236 to 392 while preserving, and even improving, baseline retrieval performance. We have successfully parallelized this algorithm, allowing us to compute the underlying projection matrix for 800,000 Reuters documents in about 100 seconds on a 32-core Xeon cluster, instead of 20 minutes on a single core. The parallelization combines task and data parallelism. Task parallelism allows for distribution and loosely coupled processing, creating potential for fault tolerance and heterogeneous computing. Data parallelism supports data-partitioned, tightly coupled parallel computing, allowing us to scale to very large data sizes. Optimal performance was obtained with a fully parallel implementation combining both approaches.
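The RTVR algorithm itself is described in [1]; the sketch below only illustrates the generic data-parallel step, under toy dimensions: once a projection matrix is available, the corpus can be split into partitions that are projected independently.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Generic data-parallel projection sketch (not the RTVR algorithm, see [1]).
# Dimensions here are small toy values; on RCV1-V2 the projection matrix
# would be 392 x 47,236.
rng = np.random.default_rng(1)
P = rng.standard_normal((64, 5000))       # toy stand-in for the projection
docs = rng.random((5000, 1200))           # toy stand-in for 1,200 documents

def project(chunk):
    return P @ chunk                      # embarrassingly parallel per chunk

chunks = np.array_split(docs, 4, axis=1)  # one partition per worker
with ThreadPoolExecutor(max_workers=4) as ex:
    dense = np.hstack(list(ex.map(project, chunks)))
print(dense.shape)                        # (64, 1200)
```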
Search over these dense vectors is based on the vector space model (VSM), using cosine similarity as the measure of document-query similarity. We have parallelized query processing using data parallelism and found that it performs extremely well. Cosine similarity reduces to a matrix-vector product, whose cost is dominated by data transfer from main memory to the processor. Consequently, the ability to use multiple caches, cache streams and memory channels has an immensely positive impact on performance. The straightforward parallelization exhibits super-linear speed-up of 130% to 170% of the core count, i.e. the performance gain is larger than the number of processors. While this is impossible in simple theoretical models, in practice the cost of memory access is not uniform and depends on a wide range of parameters. In this extreme case, parallel processing not only reduces response time, but also increases throughput in queries per second. Data parallelism allows us to get the better of contemporary multi-core architectures.
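A minimal sketch of this scheme (assumed for illustration, not our production code): the dense document matrix is split into row blocks, each block scores the query with one matrix-vector product, and the per-block top hits are merged. With unit-length rows, a dot product equals the cosine similarity.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Data-parallel cosine search over a dense, row-normalized document matrix.
rng = np.random.default_rng(2)
D = rng.standard_normal((100_000, 392))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit rows: dot = cosine

q = rng.standard_normal(392)
q /= np.linalg.norm(q)

def top_hits(block, offset, k=10):
    scores = block @ q                    # the dominating mat-vec product
    idx = np.argpartition(scores, -k)[-k:]
    return [(float(scores[i]), offset + i) for i in idx]

blocks = np.array_split(D, 4)             # one row block per worker
offsets = np.cumsum([0] + [len(b) for b in blocks[:-1]])
with ThreadPoolExecutor(max_workers=4) as ex:
    partial = ex.map(top_hits, blocks, offsets)
best = sorted((h for hits in partial for h in hits), reverse=True)[:10]
print(best)                               # global top-10 (score, doc id)
```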
Going beyond algorithms, we have begun to investigate how middleware for parallel computing can be extended to meet our requirements. We have developed a software interface that makes it effective and convenient to model concurrent threads within parallel processes, allowing us to encapsulate interactivity and multi-user operation in threads. But many other technical challenges must be solved to realize our vision. The fault tolerance and execution models currently provided are inadequate for our purposes. Document clustering and clustered search are two key algorithmic challenges we still need to address. Persistent data storage and management of results, including caching and server-side result-set cursors, also remain interesting data management problems.
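The pattern of threads within parallel processes can be illustrated generically as follows; this is an assumed sketch of the idea, not the interface from [3]:

```python
import queue
import threading

# Generic illustration: inside each parallel process, a dedicated thread
# serves interactive queries from a queue, so interactivity and multi-user
# operation are encapsulated without blocking the compute path.
requests = queue.Queue()

def serve():
    while True:
        q = requests.get()
        if q is None:          # sentinel value requests shutdown
            break
        print(f"processing query: {q!r}")

worker = threading.Thread(target=serve)
worker.start()
requests.put("dense vector search for 'parallel computing'")
requests.put(None)             # shut the worker down
worker.join()
```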
Link: http://www.cosy.sbg.ac.at/~tberka
References:
[1] T. Berka and M. Vajteršic. Parallel Rare Term Vector Replacement: Fast and Effective Dimensionality Reduction for Text. J. Parallel Distrib. Comput., in press.
[2] T. Berka and M. Vajteršic. Parallel Retrieval of Dense Vectors in the Vector Space Model. CAI 2, 2011.
[3] T. Berka, G. Kollias, H. Hagenauer, M. Vajteršic and A. Grama. Concurrent Programming Constructs for Parallel MPI Applications. J. Supercomput., 2012.
Please contact:
Tobias Berka, Marian Vajteršic
University of Salzburg, Austria
E-mail: