This special theme section "Tackling Big Data in the Life Sciences" has been coordinated by Roeland Merks, CWI and Marie-France Sagot, Inria.

by the guest editors Roeland Merks and Marie-France Sagot

The life sciences have traditionally been descriptive sciences, in which data collection and data analysis both play a central role. Recent decades have seen major technical advances, which have made it possible to collect biological data at an unprecedented scale. Even more than the speed at which new data are acquired, the very complexity of what they represent makes it particularly difficult to make sense of them. Ultimately, biological data science should further the understanding of biological mechanisms and yield useful predictions, to improve individual health care or public health or to predict useful environmental interferences.

by Christoph Quix, Thomas Berlage and Matthias Jarke

Biomedical research applies data-intensive methods for drug discovery, such as high-content analysis, in which a huge number of substances are investigated in a completely automated way. The increasing amount of data generated by such methods poses a major challenge for the integration and detailed analysis of the data, since, in order to gain new insights, the data need to be linked to other datasets from previous studies, similar experiments, or external data sources. Owing to its heterogeneity and complexity, however, the integration of research data is a long and tedious task. The HUMIT project aims to develop an innovative methodology for the integration of life science data, which applies an interactive and incremental approach.

by Alexander Schönhuth and Tobias Marschall

Detecting genetic variants is like spotting tiny sequential differences among gigantic amounts of text fragment data. This explains why some variants are extremely hard to detect or have even formed blind spots of discovery. At CWI, we have worked on developing new tools to eliminate some of these blind spots. As a result, many previously undiscoverable genetic variants now form part of an exhaustive variant catalogue based on the Genome of the Netherlands project data.

by Claudia Caudai and Emanuele Salerno

Within the framework of the national Flagship Project InterOmics, researchers at ISTI-CNR are developing algorithms to reconstruct the chromosome structure from "chromosome conformation capture" data. One algorithm being tested has already produced interesting results. Unlike most popular techniques, it does not derive a classical distance-to-geometry problem from the original contact data, and applies an efficient multiresolution approach to the genome under study.

by James T. Murphy, Mark Johnson and Frédérique Viard

Invasive non-native plant and animal species are one of the greatest threats to biodiversity on a global scale. In this collaborative European project, we use a computer modelling approach (in association with field studies, ecological experiments and molecular work) to study the impact of an important invasive seaweed species (Undaria pinnatifida) on native biodiversity in European coastal waters under variable climatic conditions.

by Marie-Dominique Devignes, Malika Smaïl-Tabbone and David Ritchie

Big data is a recurring problem in structural bioinformatics where even a single experimentally determined protein structure can contain several different interacting protein domains and often involves many tens of thousands of 3D atomic coordinates. If we consider all protein structures that have ever been solved, the immense structural space of protein-protein interactions needs to be organised systematically in order to make sense of the many functional and evolutionary relationships that exist between different protein families and their interactions. This article describes some new developments in Kbdock, a knowledge-based approach for classifying and annotating protein interactions at the protein domain level.

by Alberto Magi, Nadia Pisanti and Lorenzo Tattini

Where does this huge amount of data come from? What are the costs of producing it? The answers to these questions lie in the impressive development of sequencing technologies, which have opened up many research opportunities and challenges, some of which are described in this issue. DNA sequencing is the process of “reading” a DNA fragment (referred to as a “read”) and determining the exact order of the DNA bases (the four possible nucleotides: adenine, guanine, cytosine and thymine) that compose a given DNA strand. Research in biology and medicine has been revolutionised and accelerated by advances in DNA and even RNA sequencing biotechnologies.

by Mohamed Boukhebouze, Stéphane Mouton and Jimmy Nsenga

A personal on-board data-mining framework that relies on wearable devices and supports on-board data stream mining can help with disease prediction, risk prevention, personalized intervention and patient participation in healthcare. Such an architecture, which allows continuous monitoring and real-time decision-making, can help people living with diseases such as epilepsy.

by Mark Cieliebak, Dominic Egger and Fatih Uzdilli

Drugs are great! We all need and use drugs every now and then. But they can have unwanted side-effects, referred to as “adverse drug reactions” (ADRs). Although drug manufacturers run extensive clinical trials to identify these ADRs, there are still over two million serious ADRs in the U.S. every year – and more than 100,000 patients in the U.S. die due to drug reactions, according to the U.S. Food and Drug Administration (FDA) [1]. For this reason, we are searching for innovative and effective ways to find ADRs.

by Peter Kieseberg, Edgar Weippl and Andreas Holzinger

The "doctor in the loop" is a new paradigm in information-driven medicine, picturing the doctor as an authority inside a loop, supplying an expert system with data and information. Before this paradigm is implemented in real environments, the trustworthiness of the system must be assured.

by Erwan Zerhouni, Bogdan Prisacari, Qing Zhong, Peter Wild and Maria Gabrani

Most men, by the time they reach 80 years of age, get prostate cancer. The treatment is usually an operation or irradiation, which sometimes has complications. However, not every tumour is aggressive, in which case there is no urgent need to remove it. Ascertaining whether a tumour is aggressive or insignificant is difficult, but analysis of big data shows great promise in helping in this process.

by Adrien Coulet and Malika Smaïl-Tabbone

Most of the state of the art in pharmacogenomics (PGx) is based on a bank of knowledge resulting from sporadic observations, and so is not considered to be statistically valid. The PractiKPharma project is mining data from electronic health record repositories, and composing novel cohorts of patients for confirming (or moderating) pharmacogenomics knowledge on the basis of observations made in clinical practice.

by Elisabeth G. Rens, Sonja E. M. Boas and Roeland M.H. Merks

Throughout our lives our blood vessels form new capillaries whose insufficient or excessive growth is a key factor in disease. During wound healing, insufficient growth of capillaries limits the supply of oxygen and nutrients to the new tissue. Tumours often attract capillaries, giving them their own blood supply and a route for further spread over the body. With the help of biological and medical colleagues our team develops mathematical models that recapitulate how cells can construct new blood vessels. These models are helping us to develop new ideas about how to stimulate or stop the growth of new blood vessels.

by Benedetto Rugani, Paulo Carvalho and Benoit Othoniel

Even when it is defined, the metadata that accompanies Big and Open Data (OD) datasets may be hard to understand and exploit. A visual approach can support metadata re-use in integrated ecological-economic modelling. A method that allows specific model datasets to be regularly and consistently updated may improve their readability for use in the Life Cycle Assessment (LCA) modelling of ecosystem services.

by Paulo Carvalho, Patrik Hitzelberger and Gilles Venturini

Open data (OD) contributes to the spread and publication of life sciences data on the Web. Searching and filtering OD datasets, however, can be challenging since the metadata that accompany the datasets are often incomplete or even non-existent. Even when metadata are present and complete, interpretation can be complicated owing to their quantity and variety and the range of languages used. We present a visual solution to help users understand existing metadata in order to exploit and reuse OD datasets – in particular, OD life sciences datasets.
