Special Theme

ERCIM News 89

April 2012
Special theme: Big Data
Guest editors: Stefan Manegold, Martin Kersten (CWI) and Costantino Thanos (ISTI-CNR)
This issue in PDF
(56 pages; 10 MB)

Back Issues Online

Contents

Big Data - Introduction to the special theme

by Costantino Thanos, Stefan Manegold and Martin Kersten

'Big data' refers to data sets whose size is beyond the capabilities of current database technology. The current data deluge is revolutionizing the way research is carried out and is giving rise to a new, fourth paradigm of science based on data-intensive computing. This data-dominated science will lead to a new data-centric way of conceptualizing, organizing and carrying out research activities. That in turn could introduce new approaches to problems that were previously considered extremely hard or, in some cases, impossible to solve, and could also lead to serendipitous discoveries.

Data Stewardship in the Age of Big Data

by Daniel E. Atkins

As evidenced by a large and growing number of reports from research communities, research funding agencies, and academia, there is growing acceptance of the assertion that science is becoming more and more data-centric.

SciDB: An Open-Source DBMS for Scientific Data

by Michael Stonebraker

SciDB is a native array DBMS that combines data management and mathematical operations in a single system. It is an open-source system that can be downloaded from SciDB.org.

Data Management in the Humanities

by Laurent Romary

Owing to the growing interest in digital methods within the humanities, an understanding of the tenets of digitally based scholarship and the nature of specific data management issues in the humanities is required. To this end the ESFRI roadmap on European infrastructures has been seminal in identifying the need for a coordinating e-infrastructure in the humanities - DARIAH - whose data policy is outlined in this paper.

Managing Large Data Volumes from Scientific Facilities

by Shaun de Witt, Richard Sinclair, Andrew Sansum and Michael Wilson

One driver for the data tsunami is social networking companies such as Facebook, which generate terabytes of content. Facebook users, for instance, upload three billion photos monthly, for a total of 3,600 terabytes annually. The volume of social media is large, but not overwhelming. The data are generated by many people, but each person is limited in their rate of data production. Large scientific facilities, in contrast, are another driver, one where the data are generated automatically.
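As a rough sanity check on the figures quoted above, the short calculation below derives the average photo size they imply. It assumes decimal terabytes (1 TB = 10^12 bytes), which the abstract does not specify, and the variable names are illustrative only.

```python
# Figures quoted in the abstract.
photos_per_month = 3_000_000_000
terabytes_per_year = 3_600

# Assumption: decimal terabytes (1 TB = 10**12 bytes).
bytes_per_year = terabytes_per_year * 10**12
photos_per_year = photos_per_month * 12

# Average size per photo implied by the two figures.
avg_bytes_per_photo = bytes_per_year // photos_per_year
print(avg_bytes_per_photo)  # 100000 bytes, i.e. about 100 KB per photo
```

Under that assumption the two numbers are mutually consistent: roughly 100 KB per uploaded photo.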

Revolutionary Database Technology for Data Intensive Research

by Martin Kersten and Stefan Manegold

The ability to explore huge digital resources assembled in data warehouses, databases and files, at unprecedented speed, is becoming the driver of progress in science. However, existing database management systems (DBMS) are far from capable of meeting the scientists' requirements. The Database Architectures group at CWI in Amsterdam cooperates with astronomers, seismologists and other domain experts to tackle this challenge by advancing all aspects of database technology. The group’s research results are disseminated via its open-source database system, MonetDB.

Zenith: Scientific Data Management on a Large Scale

by Esther Pacitti and Patrick Valduriez

Modern science disciplines such as environmental science and astronomy must deal with overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed, analyzed) in all kinds of ways in order to draw new conclusions and test scientific theories. Despite their differences, scientific data of all disciplines share certain features: they are massive in scale; they are manipulated through large, distributed workflows; they are complex, with uncertainty in the data values (e.g., reflecting data capture or observation); they carry important metadata about experiments and their provenance; and they are mostly append-only (with rare updates). Furthermore, modern scientific research is highly collaborative, involving scientists from different disciplines (e.g., biologists, soil scientists, and geologists working on an environmental project), in some cases from different organizations in different countries. Since each discipline or organization tends to produce and manage its own data in specific formats, with its own processes, integrating distributed data and processes becomes increasingly difficult as the amounts of heterogeneous data grow.

Performance Analysis of Healthcare Processes through Process Mining

by Diogo R. Ferreira

Process mining provides new ways to analyze the performance of clinical processes based on large amounts of event data recorded at run-time.

A Scalable Indexing Solution to Mine Huge Genomic Sequence Collections

by Eric Rivals, Nicolas Philippe, Mikael Salson, Martine Leonard, Thérèse Commes and Thierry Lecroq

With High Throughput Sequencing (HTS) technologies, biology is experiencing a sequence data deluge. A single sequencing experiment currently yields 100 million short sequences, or reads, the analysis of which demands efficient and scalable sequence analysis algorithms. Diverse kinds of applications repeatedly need to query the sequence collection for the occurrence positions of a subword. Time can be saved by building an index of all subwords present in the sequences before performing huge numbers of queries. However, both the scalability and the memory requirement of the chosen data structure must suit the data volume. Here, we introduce a novel indexing data structure, called Gk arrays, and related algorithms that improve on classical indexes and state-of-the-art hash tables.
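To make the idea of "an index of all subwords" concrete, here is a minimal sketch of the naive hash-table baseline, not of Gk arrays themselves, whose layout is not described in this abstract. The function names and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map every length-k subword to the (read_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index

def occurrences(index, kmer):
    """Return all occurrence positions of `kmer`, or an empty list if absent."""
    return index.get(kmer, [])

# Toy data: two hypothetical reads, indexed with k = 3.
reads = ["ACGTAC", "GTACGT"]
index = build_kmer_index(reads, 3)
print(occurrences(index, "GTA"))  # [(0, 2), (1, 0)]
```

A table like this answers each query in roughly constant time but stores every subword explicitly, which is the kind of memory cost that motivates more compact structures at the scale of 100 million reads.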

A-Brain: Using the Cloud to Understand the Impact of Genetic Variability on the Brain

by Gabriel Antoniu, Alexandru Costan, Benoit Da Mota, Bertrand Thirion and Radu Tudoran

Joint genetic and neuroimaging data analysis on large cohorts of subjects is a new approach used to assess and understand the variability that exists between individuals. This approach, which to date is poorly understood, has the potential to open pioneering directions in biology and medicine. As both neuroimaging- and genetic-domain observations include a huge number of variables (of the order of 10^6), performing statistically rigorous analyses on such Big Data represents a computational challenge that cannot be addressed with conventional computational techniques. In the A-Brain project, researchers from INRIA and Microsoft Research explore cloud computing techniques to address the above computational challenge.

Big Web Analytics: Toward a Virtual Web Observatory

by Marc Spaniol, András Benczúr, Zsolt Viharos and Gerhard Weikum

For decades, compute power and storage have become steadily cheaper, while network speeds, although increasing, have not kept up. The result is that data is becoming increasingly local and thus distributed in nature. It has become necessary to move the software and hardware to where the data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on the rich centralized Web repository of the European Archive. The observatory will enable Web-scale analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension to the roadmap of Future Internet Research – it’s about time!

Computational Storage in Vision Cloud

by Per Brand

Vision Cloud is an ongoing European project on cloud computing. The novel storage and computational infrastructure is designed to meet the challenge of providing for tomorrow’s data-intensive services.

Large-Scale Data Analysis on Cloud Systems

by Fabrizio Marozzo, Domenico Talia and Paolo Trunfio

The massive amount of digital data currently being produced by industry, commerce and research is an invaluable source of knowledge for business and science, but its management requires scalable storage and computing facilities. In this scenario, efficient data analysis tools are vital. Cloud systems can be effectively exploited for this purpose as they provide scalable storage and processing services, together with software platforms for developing and running data analysis environments. We present a framework that enables the execution of large-scale parameter sweeping data mining applications on top of computing and storage services.

Big Software Data Analysis

by Mircea Lungu, Oscar Nierstrasz and Niko Schwarz

In today’s highly networked world, any researcher can study massive amounts of source code even on inexpensive off-the-shelf hardware. This leads to opportunities for new analyses and tools. The analysis of big software data can confirm the existence of conjectured phenomena, expose patterns in the way a technology is used, and drive programming language research.

Scalable Management of Compressed Semantic Big Data

by Javier D. Fernández, Miguel A. Martínez-Prieto and Mario Arias

The potential of Semantic Big Data is currently severely underexploited due to their huge space requirements, the powerful resources required to process them and their lengthy consumption time. We work on novel compression techniques for scalable storage, exchange, indexing and query answering of such emerging data.

SCAPE: Big Data Meets Digital Preservation

by Ross King, Rainer Schmidt, Christoph Becker and Sven Schlarb

The digital collections of scientific and memory institutions, many of which are already in the petabyte range, are growing larger every day. The geometric growth in the volume of archived digital content worldwide demands that the associated preservation activities become more scalable. The economics of long-term storage and access demand that they become more automated. The present state of the art fails to address the need for scalable automated solutions for tasks like the characterization or migration of very large volumes of digital content. Standard tools break down when faced with very large or complex digital objects; standard workflows break down when faced with a very large number of objects or heterogeneous collections. In short, digital preservation is becoming an application area of big data, and big data is itself revealing a number of significant preservation challenges.

Brute Force Information Retrieval Experiments using MapReduce

by Djoerd Hiemstra and Claudia Hauff

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totaling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of pages is a viable alternative to running them on a search engine’s inverted index. MIREX is open source and available for others.
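The brute-force idea can be illustrated with a small single-machine sketch: every document is scanned once and scored against all test queries, with no inverted index involved. This is not MIREX code and does not use a MapReduce framework; the toy term-frequency score, the function name `linear_scan` and the sample data are assumptions chosen for illustration.

```python
import heapq
from collections import Counter

def linear_scan(queries, docs, top_k=2):
    """Scan every document once, scoring it against every query (no index)."""
    hits = {qid: [] for qid in queries}
    for doc_id, text in docs.items():            # "map" side: one pass over all pages
        counts = Counter(text.lower().split())
        for qid, terms in queries.items():
            # Toy score: summed term frequency of the query terms.
            score = sum(counts[t] for t in terms)
            if score > 0:
                hits[qid].append((score, doc_id))
    # "reduce" side: keep the best-scoring documents per query.
    return {qid: heapq.nlargest(top_k, found) for qid, found in hits.items()}

docs = {
    "d1": "big data big science",
    "d2": "data management",
    "d3": "cloud storage",
}
queries = {"q1": ["big", "data"], "q2": ["cloud"]}
ranking = linear_scan(queries, docs)
print(ranking)  # {'q1': [(3, 'd1'), (1, 'd2')], 'q2': [(1, 'd3')]}
```

Because each document is scored independently, the outer loop is straightforward to distribute across machines, which is presumably what makes the approach practical on a cluster.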

A Big Data Platform for Large Scale Event Processing

by Vincenzo Gulisano, Ricardo Jimenez-Peris, Marta Patiño-Martinez, Claudio Soriente and Patrick Valduriez

To date, big data applications have focused on the store-and-process paradigm. In this paper we describe an initiative to deal with big data applications for continuous streams of events.

CumuloNimbo: A Highly-Scalable Transaction Processing Platform as a Service

by Ricardo Jimenez-Peris, Marta Patiño-Martinez, Kostas Magoutis, Angelos Bilas and Ivan Brondino

One of the main challenges facing next generation Cloud platform services is the need to simultaneously achieve ease of programming, consistency, and high scalability. Big Data applications have so far focused on batch processing. The next step for Big Data is to move to the online world. This shift will raise the requirements for transactional guarantees. CumuloNimbo is a new EC-funded project led by Universidad Politécnica de Madrid (UPM) that addresses these issues via a highly scalable multi-tier transactional platform as a service (PaaS) that bridges the gap between OLTP and Big Data applications.

ConPaaS, an Integrated Cloud Environment for Big Data

by Thorsten Schuett and Guillaume Pierre

ConPaaS makes it easy to write scalable Cloud applications without worrying about the complexity of the Cloud.

ConPaaS is the platform as a service (PaaS) component of the Contrail FP7 project. It provides a runtime environment that facilitates deployment of end-user applications in the Cloud. The team encompasses developers and researchers from the Vrije Universiteit in Amsterdam, the Zuse Institute in Berlin, and XLAB in Ljubljana.

Crime and Corruption Observatory: Big Questions behind Big Data

by Giulia Bonelli, Mario Paolucci and Rosaria Conte

Can we have information in advance on organized crime movements? How can fraud and corruption be fought? Can cybercrime threats be tackled in a safe way? The Crime and Corruption Observatory of the European Project FuturICT will work on answering these questions. Starting from Big Data, it will face big challenges, and will propose new ways to analyse and understand social phenomena.

Managing Big Data through Hybrid Data Infrastructures

by Leonardo Candela, Donatella Castelli and Pasquale Pagano

Long-established technological platforms are no longer able to address the data and processing requirements of the emerging data-intensive scientific paradigm. At the same time, modern distributed computational platforms are not yet capable of addressing the global, elastic, and networked needs of the scientific communities producing and exploiting huge quantities and varieties of data. A novel approach, the Hybrid Data Infrastructure, integrates several technologies, including Grid and Cloud, and promises to offer the necessary management and usage capabilities required to implement the ‘Big Data’ enabled scientific paradigm.

Cracking Big Data

by Stratos Idreos

A fundamental and emerging need with large amounts of data is data exploration: when we are searching for interesting patterns we often do not have a priori knowledge of exactly what we are looking for. Database cracking enables such data exploration features by bringing, for the first time, incremental and adaptive indexing abilities to modern database systems.
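The following is a minimal sketch of the incremental, query-driven indexing that database cracking refers to, assuming the commonly described scheme in which each range query partitions only the piece of the column that contains its bounds. The class `CrackedColumn` and its methods are hypothetical names; a real system would partition in place by swapping rather than building temporary lists.

```python
import bisect

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)   # copy of the column, reorganized by queries
        self.pivots = []           # sorted pivot values seen so far
        self.positions = []        # positions[i]: first index holding a value >= pivots[i]

    def _crack(self, pivot):
        """Partition the one piece that contains `pivot`; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]           # this boundary already exists
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.data[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high (assumes low <= high)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]

col = CrackedColumn([13, 4, 9, 2, 12, 7, 1, 19, 3, 14])
first = col.range_query(5, 13)
print(sorted(first))  # [7, 9, 12]
```

No index is built up front: the first query pays for two partitioning passes, and later queries touch ever smaller pieces as more boundaries accumulate.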