by Costantino Thanos, Stefan Manegold and Martin Kersten

'Big data' refers to data sets whose size is beyond the capabilities of current database technology. The current data deluge is revolutionizing the way research is carried out, and has resulted in the emergence of a new, fourth paradigm of science based on data-intensive computing. This data-dominated science will lead to a data-centric way of conceptualizing, organizing and carrying out research activities. It could introduce new approaches to problems that were previously considered extremely hard or, in some cases, impossible to solve, and could also lead to serendipitous discoveries.

by Laurent Romary

Owing to the growing interest in digital methods within the humanities, an understanding of the tenets of digitally based scholarship and of the specific data management issues in the humanities is required. To this end, the ESFRI roadmap on European research infrastructures has been seminal in identifying the need for a coordinating e-infrastructure in the humanities - DARIAH - whose data policy is outlined in this paper.

by Shaun de Witt, Richard Sinclair, Andrew Sansum and Michael Wilson

One driver of the data tsunami is social networking companies such as Facebook, which generate terabytes of content. Facebook, for instance, receives three billion photo uploads monthly, for a total of 3,600 terabytes annually. The volume of social media is large, but not overwhelming: the data are generated by a great many humans, but each is limited in their rate of data production. In contrast, large scientific facilities are another driver, where the data are generated automatically.

by Martin Kersten and Stefan Manegold

The ability to explore huge digital resources assembled in data warehouses, databases and files, at unprecedented speed, is becoming the driver of progress in science. However, existing database management systems (DBMS) are far from capable of meeting the scientists' requirements. The Database Architectures group at CWI in Amsterdam cooperates with astronomers, seismologists and other domain experts to tackle this challenge by advancing all aspects of database technology. The group’s research results are disseminated via its open-source database system, MonetDB.

by Esther Pacitti and Patrick Valduriez

Modern science disciplines such as environmental science and astronomy must deal with overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed, analyzed) in all kinds of ways in order to draw new conclusions and test scientific theories. Despite their differences, certain features are common to scientific data of all disciplines: massive scale; manipulation through large, distributed workflows; complexity, with uncertainty in the data values, eg reflecting the data capture or observation process; important metadata about experiments and their provenance; and mostly append-only access (with rare updates). Furthermore, modern scientific research is highly collaborative, involving scientists from different disciplines (eg biologists, soil scientists, and geologists working on an environmental project), in some cases from different organizations in different countries. Since each discipline or organization tends to produce and manage its own data in specific formats, with its own processes, integrating distributed data and processes becomes increasingly difficult as the amounts of heterogeneous data grow.

by Eric Rivals, Nicolas Philippe, Mikael Salson, Martine Leonard, Thérèse Commes and Thierry Lecroq

With High Throughput Sequencing (HTS) technologies, biology is experiencing a sequence data deluge. A single sequencing experiment currently yields 100 million short sequences, or reads, whose analysis demands efficient and scalable sequence analysis algorithms. Diverse kinds of applications repeatedly need to query the sequence collection for the occurrence positions of a subword. Time can be saved by building an index of all subwords present in the sequences before performing huge numbers of queries. However, both the scalability and the memory requirement of the chosen data structure must suit the data volume. Here, we introduce a novel indexing data structure, called Gk arrays, and related algorithms that improve on classical indexes and state-of-the-art hash tables.
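To illustrate the kind of query the Gk arrays support, the following is a minimal sketch of a naive hash-based subword (k-mer) index; it answers "at which positions does this subword occur?" but, unlike the Gk arrays themselves, makes no attempt at the memory efficiency the article is about. The function name and data are illustrative, not from the original work.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each length-k subword to its occurrence positions (read id, offset)."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].append((rid, i))
    return index

reads = ["ACGTAC", "GTACGT"]
idx = build_kmer_index(reads, 3)
print(idx["GTA"])   # occurrences of subword "GTA": [(0, 2), (1, 0)]
```

With hundreds of millions of reads, the per-entry overhead of such a hash table becomes prohibitive, which is precisely the scalability problem compact structures like the Gk arrays address.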

by Gabriel Antoniu, Alexandru Costan, Benoit Da Mota, Bertrand Thirion and Radu Tudoran

Joint genetic and neuroimaging data analysis on large cohorts of subjects is a new approach used to assess and understand the variability that exists between individuals. This approach, which to date is poorly understood, has the potential to open pioneering directions in biology and medicine. As both neuroimaging- and genetic-domain observations include a huge number of variables (of the order of 10⁶), performing statistically rigorous analyses on such Big Data represents a computational challenge that cannot be addressed with conventional computational techniques. In the A-Brain project, researchers from INRIA and Microsoft Research explore cloud computing techniques to address the above computational challenge.

by Marc Spaniol, András Benczúr, Zsolt Viharos and Gerhard Weikum

For decades, compute power and storage have become steadily cheaper, while network speeds, although increasing, have not kept up. The result is that data is becoming increasingly local and thus distributed in nature. It has become necessary to move the software and hardware to where the data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on the rich centralized Web repository of the European Archive. The observatory will enable Web-scale analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension to the roadmap of Future Internet Research – it’s about time!

by Fabrizio Marozzo, Domenico Talia and Paolo Trunfio

The massive amount of digital data currently being produced by industry, commerce and research is an invaluable source of knowledge for business and science, but its management requires scalable storage and computing facilities. In this scenario, efficient data analysis tools are vital. Cloud systems can be effectively exploited for this purpose as they provide scalable storage and processing services, together with software platforms for developing and running data analysis environments. We present a framework that enables the execution of large-scale parameter sweeping data mining applications on top of computing and storage services.
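A parameter-sweeping application of the kind described above runs the same mining task over every combination of input parameters; since the runs are independent, they map naturally onto elastic cloud resources. The sketch below shows the pattern only, with a hypothetical `mine` task and a dummy quality score standing in for a real data mining algorithm.

```python
from itertools import product
from concurrent.futures import ThreadPoolExecutor

def mine(params):
    # Stand-in for a real data mining task (eg clustering with these
    # parameters); returns a dummy quality score for the combination.
    k, eps = params
    return {"k": k, "eps": eps, "score": k * eps}

# Enumerate all parameter combinations; each task is independent, which is
# what makes parameter sweeps a natural fit for scalable cloud services.
grid = list(product([2, 4, 8], [0.1, 0.5]))
with ThreadPoolExecutor() as pool:
    results = list(pool.map(mine, grid))
best = max(results, key=lambda r: r["score"])
print(best)   # {'k': 8, 'eps': 0.5, 'score': 4.0}
```

In a cloud setting the executor would be replaced by a pool of storage-backed worker instances, but the sweep structure is the same.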

by Mircea Lungu, Oscar Nierstrasz and Niko Schwarz

In today’s highly networked world, any researcher can study massive amounts of source code even on inexpensive off-the-shelf hardware. This leads to opportunities for new analyses and tools. The analysis of big software data can confirm the existence of conjectured phenomena, expose patterns in the way a technology is used, and drive programming language research.

by Javier D. Fernández, Miguel A. Martínez-Prieto and Mario Arias

The potential of Semantic Big Data is currently severely underexploited due to their huge space requirements, the powerful resources required to process them and their lengthy consumption time. We work on novel compression techniques for scalable storage, exchange, indexing and query answering of such emerging data.

by Ross King, Rainer Schmidt, Christoph Becker and Sven Schlarb

The digital collections of scientific and memory institutions – many of which are already in the petabyte range – are growing larger every day. The fact that the volume of archived digital content worldwide is increasing geometrically demands that the associated preservation activities become more scalable. The economics of long-term storage and access demand that they become more automated. The present state of the art fails to address the need for scalable automated solutions for tasks like the characterization or migration of very large volumes of digital content. Standard tools break down when faced with very large or complex digital objects; standard workflows break down when faced with a very large number of objects or heterogeneous collections. In short, digital preservation is becoming an application area of big data, and big data is itself revealing a number of significant preservation challenges.

by Djoerd Hiemstra and Claudia Hauff

MIREX (MapReduce Information Retrieval Experiments) is a software library initially developed by the Database Group of the University of Twente for running large scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totaling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of pages is a viable alternative to running them against a search engine's inverted index. MIREX is open source and available to others.
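The brute-force alternative can be pictured as a map step that scans each page once and scores it against the query, followed by a reduce step that keeps the best results. The sketch below is an illustrative single-machine version with a simple term-frequency score; the function names and scoring are assumptions for illustration, not the actual MIREX code.

```python
from collections import Counter

def map_score(doc_id, text, query_terms):
    """Map step: emit (doc_id, score) by scanning the page text once."""
    tf = Counter(text.lower().split())
    return doc_id, sum(tf[t] for t in query_terms)

def top_k(pages, query, k=10):
    """Reduce step: keep the k highest-scoring pages."""
    terms = query.lower().split()
    scored = (map_score(d, t, terms) for d, t in pages.items())
    return sorted(scored, key=lambda x: -x[1])[:k]

pages = {"p1": "big data on hadoop clusters", "p2": "data data everywhere"}
print(top_k(pages, "data"))   # [('p2', 2), ('p1', 1)]
```

On a cluster, MapReduce distributes the scan over many machines, so even a full linear pass over terabytes of pages completes in reasonable time.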

by Ricardo Jimenez-Peris, Marta Patiño-Martinez, Kostas Magoutis, Angelos Bilas and Ivan Brondino

One of the main challenges facing next generation Cloud platform services is the need to simultaneously achieve ease of programming, consistency, and high scalability. Big Data applications have so far focused on batch processing. The next step for Big Data is to move to the online world. This shift will raise the requirements for transactional guarantees. CumuloNimbo is a new EC-funded project led by Universidad Politécnica de Madrid (UPM) that addresses these issues via a highly scalable multi-tier transactional platform as a service (PaaS) that bridges the gap between OLTP and Big Data applications.

by Thorsten Schuett and Guillaume Pierre

ConPaaS makes it easy to write scalable Cloud applications without worrying about the complexity of the Cloud.

ConPaaS is the platform as a service (PaaS) component of the Contrail FP7 project. It provides a runtime environment that facilitates deployment of end-user applications in the Cloud. The team encompasses developers and researchers from the Vrije Universiteit in Amsterdam, the Zuse Institute in Berlin, and XLAB in Ljubljana.

by Giulia Bonelli, Mario Paolucci and Rosaria Conte

Can we have information in advance on organized crime movements? How can fraud and corruption be fought? Can cybercrime threats be tackled in a safe way? The Crime and Corruption Observatory of the European project FuturICT will work towards answering these questions. Starting from Big Data, it will face big challenges, and will propose new ways to analyse and understand social phenomena.

by Leonardo Candela, Donatella Castelli and Pasquale Pagano

Long-established technological platforms are no longer able to address the data and processing requirements of the emerging data-intensive scientific paradigm. At the same time, modern distributed computational platforms are not yet capable of addressing the global, elastic, and networked needs of the scientific communities producing and exploiting huge quantities and varieties of data. A novel approach, the Hybrid Data Infrastructure, integrates several technologies, including Grid and Cloud, and promises to offer the necessary management and usage capabilities required to implement the ‘Big Data’ enabled scientific paradigm.

by Stratos Idreos

A fundamental and emerging need with large amounts of data is data exploration: when searching for interesting patterns, we often have no a priori knowledge of exactly what we are looking for. Database cracking enables such data exploration by bringing, for the first time, incremental and adaptive indexing abilities to modern database systems.
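The core idea of database cracking can be sketched in a few lines: each range query physically partitions ("cracks") the column around its bounds as a side effect of answering it, so the more the data is queried, the more indexed it becomes. The class below is a minimal illustrative crack-in-two implementation under simplifying assumptions (an in-memory list of values, no payloads or updates), not the MonetDB implementation.

```python
import bisect

class CrackedColumn:
    """Minimal sketch of database cracking: each range predicate partially
    reorganizes the column, so an index emerges as a side effect of querying."""

    def __init__(self, values):
        self.data = list(values)
        self.cracks = []          # sorted (pivot, pos): data[:pos] < pivot

    def _crack(self, v):
        """Partition the piece containing v so data[:pos] < v <= data[pos:]."""
        i = bisect.bisect_left(self.cracks, (v,))
        if i < len(self.cracks) and self.cracks[i][0] == v:
            return self.cracks[i][1]          # already cracked at this pivot
        lo = self.cracks[i - 1][1] if i > 0 else 0
        hi = self.cracks[i][1] if i < len(self.cracks) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [x for x in piece if x < v]  # crack-in-two partition
        self.data[lo:hi] = smaller + [x for x in piece if x >= v]
        pos = lo + len(smaller)
        self.cracks.insert(i, (v, pos))
        return pos

    def range(self, low, high):
        """Return all values in [low, high); cracks the column at both bounds,
        so later queries touch ever smaller pieces."""
        a, b = self._crack(low), self._crack(high)
        return self.data[a:b]

col = CrackedColumn([13, 16, 4, 21, 2, 12, 7, 16, 9])
print(col.range(10, 17))   # the four values in [10, 17), contiguous after cracking
```

After the first query the column is split into three pieces; a second query over an already-cracked bound reuses the stored position instead of scanning, which is exactly the incremental, workload-adaptive behaviour the article describes.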
