by Marleen Balvert and Alexander Schoenhuth (CWI)
Many diseases that we cannot currently cure, such as cancer, Alzheimer’s and amyotrophic lateral sclerosis (ALS), are caused by variations in the DNA sequence. It is often unknown which genetic characteristics cause a given disease. Knowing these would greatly help our understanding of the underlying disease mechanisms, and would boost drug development. At CWI we develop methods based on artificial intelligence (AI) to help find the genetic causes of disease, with promising first results.
CWI researchers are currently developing AI techniques to help identify the genetic characteristics that lead to disease. Picture: Shutterstock.
Identifying disease-causing genetic characteristics starts with analysing datasets containing the genetic information of both healthy individuals and patients with a disease of interest. The data analysis provides direction to disease experts and lab researchers, who can experimentally test whether a genetic variant indeed causes disease. Validated disease-causing genetic variants provide insight into the cellular processes involved in disease, which is the starting point for drug development.
Today’s predominant technique for analysing genome datasets, the genome-wide association study (GWAS), ensures that cause (genetic variant) and effect (disease) can be linked in a way the human mind can grasp. A GWAS examines each genetic variant individually for correlation with disease, following well-understood statistical principles. This allows the researcher to easily interpret findings, and the approach has been very successful: many potentially disease-causing variants have been detected for various diseases.
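The one-variant-at-a-time idea can be illustrated with a minimal sketch. It assumes a simplified setting that is not taken from the article: genotypes are coded as 0/1 (carrier or not), and each variant is scored with a plain Pearson chi-square statistic on a 2x2 table. Real GWAS pipelines typically use more refined models and correct for covariates; the function names and toy data below are hypothetical.

```python
from typing import List, Sequence


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column carries no evidence either way
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def single_variant_scan(genotypes: Sequence[Sequence[int]],
                        is_patient: Sequence[bool]) -> List[float]:
    """Score every variant on its own, ignoring all other variants.

    genotypes[i][j] is 1 if individual i carries variant j, else 0
    (a simplifying assumption made for this sketch).
    """
    stats = []
    for j in range(len(genotypes[0])):
        a = b = c = d = 0
        for row, patient in zip(genotypes, is_patient):
            carrier = row[j] == 1
            if carrier and patient:
                a += 1  # carrier, patient
            elif carrier:
                b += 1  # carrier, healthy
            elif patient:
                c += 1  # non-carrier, patient
            else:
                d += 1  # non-carrier, healthy
        stats.append(chi_square_2x2(a, b, c, d))
    return stats


# Toy data: variant 0 is carried by exactly the patients, variant 1 is unrelated.
toy_genotypes = [[1, 0], [1, 1], [1, 0], [1, 1],
                 [0, 0], [0, 1], [0, 0], [0, 1]]
toy_is_patient = [True, True, True, True, False, False, False, False]

toy_stats = single_variant_scan(toy_genotypes, toy_is_patient)
print(toy_stats)  # [8.0, 0.0]
```

Because each variant receives its own score, a high value points directly at one position in the genome, which is what makes this kind of result easy to interpret.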
However, several diseases stubbornly resist such “human intelligence-based approaches”, as their genetic architecture is difficult to unravel. One architectural feature that complicates analyses considerably is epistasis: the effects of genetic variants do not necessarily just add up, but can depend on logical combinations of variants. Consider, for example, three variants A, B and C that cause disease if (and only if) A is absent, or B and C are both present. Such complex logical relationships reflect common biochemical gateways.
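The three-variant example can be written out as a small truth table. This is only an illustrative sketch of the rule stated above; the function names are hypothetical. It also shows why such a rule is hard to see one variant at a time: whether B matters depends entirely on the status of A and C.

```python
from itertools import product


def causes_disease(a: bool, b: bool, c: bool) -> bool:
    """Rule from the text: disease if and only if A is absent, or B and C are both present."""
    return (not a) or (b and c)


# Full truth table over all 8 presence/absence combinations of A, B and C.
truth_table = {
    (a, b, c): causes_disease(a, b, c)
    for a, b, c in product([False, True], repeat=3)
}


def effect_of_b(a: bool, c: bool) -> int:
    """Change in disease status (0 or 1) when B switches from absent to present."""
    return int(causes_disease(a, True, c)) - int(causes_disease(a, False, c))


for a, c in product([False, True], repeat=2):
    print(f"A present={a!s:5}  C present={c!s:5}  effect of B={effect_of_b(a, c)}")
```

In this toy rule, B changes the outcome only when A and C are both present; in every other genetic background its effect is zero.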
The analysis of diseases with a more involved genetic architecture, such as cancer, type II diabetes or ALS, using “human mind perceivable” approaches has clearly reached its limits. So, an immediate question is: if the human mind is struggling, can AI help out?
This motivated CWI researchers Marleen Balvert and Alexander Schönhuth to develop new, AI-based techniques for identifying complex combinations of genetic characteristics that are associated with disease. The challenge is twofold.
First, genome datasets contain millions of genetic variants for thousands or tens of thousands of individuals. Deep neural networks - currently among the most successful classification techniques - offer enhanced opportunities for processing large datasets. This motivated Balvert and Schönhuth to employ deep neural networks.
Second, deep neural networks have been predominantly developed for image classification. Unlike image data - the structure of which can be grasped immediately - genetic data has a structure that is governed by the laws of evolution and reproduction. Arranging genetic data to act as input to deep neural networks therefore requires expert knowledge.
Together with ALS expert Jan Veldink from UMC Utrecht, Balvert and Schönhuth took on the challenge of developing a deep neural network to distinguish healthy individuals from ALS patients using data from over 11,000 people. The data were collected through Project MinE, a global genome data project that deals with ALS. Note that the CWI researchers aimed to design a general neural network architecture for diseases with a complex genetic architecture, rather than one specialised for a particular disease.
The team implemented a two-step procedure. First, a relatively lightweight neural network identifies promoter regions - parts of the genome that initiate the reading of a gene - that are indicative of disease. Once several tens of the 20,000 promoter regions have been selected, an ultra-deep neural network predicts whether someone is affected by ALS based on the variants captured by the selected promoter regions.
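The overall shape of such a select-then-classify pipeline can be sketched as follows. This is explicitly not the team's implementation: both neural networks are replaced by simple stand-ins (a mean-difference score for region selection, and a small logistic regression for the final prediction), and all names and toy data are hypothetical. The sketch only illustrates how the output of the first step restricts the input of the second.

```python
import math
from typing import List, Sequence, Tuple


def region_scores(burden: Sequence[Sequence[float]],
                  labels: Sequence[int]) -> List[float]:
    """Step 1 stand-in: score each region by the absolute difference in
    mean variant burden between patients (label 1) and controls (label 0)."""
    patients = [row for row, y in zip(burden, labels) if y == 1]
    controls = [row for row, y in zip(burden, labels) if y == 0]
    scores = []
    for j in range(len(burden[0])):
        mean_p = sum(r[j] for r in patients) / len(patients)
        mean_c = sum(r[j] for r in controls) / len(controls)
        scores.append(abs(mean_p - mean_c))
    return scores


def select_regions(scores: Sequence[float], k: int) -> List[int]:
    """Keep the indices of the k highest-scoring regions."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def train_logistic(x: Sequence[Sequence[float]], labels: Sequence[int],
                   epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Step 2 stand-in: logistic regression fitted by stochastic gradient descent."""
    w = [0.0] * len(x[0])
    b = 0.0
    for _ in range(epochs):
        for row, y in zip(x, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            g = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, row)]
            b -= lr * g
    return w, b


def predict(w: Sequence[float], b: float, row: Sequence[float]) -> int:
    """Predict 1 (patient) if the fitted logit is positive, else 0 (control)."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, row)) > 0 else 0


# Toy data: 6 individuals x 4 regions; only region 2 differs between the groups.
toy_burden = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 1, 1],
              [1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 1]]
toy_labels = [1, 1, 1, 0, 0, 0]

selected = select_regions(region_scores(toy_burden, toy_labels), k=1)
reduced = [[row[j] for j in selected] for row in toy_burden]  # step 2 sees only these
weights, bias = train_logistic(reduced, toy_labels)
predictions = [predict(weights, bias, row) for row in reduced]
print(selected, predictions)
```

Restricting the second model to a few dozen regions, rather than all 20,000, keeps its input small enough to train a much deeper model on a cohort of limited size.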
If the neural network achieves good predictive performance, it has “learned” how to identify disease. The genetic architecture of the disease is thus captured by the wirings of the neural network.
Balvert, Schönhuth and their team were intrigued and enthusiastic to observe that their networks achieved excellent performance in predicting ALS; ALS has been marked as a disease whose genetic architecture is most difficult to disentangle. The networks achieved 76 % prediction accuracy, surpassing the simpler, “human mind perceivable” approaches that achieved 64 % accuracy at best. Further improvements are still possible.
These highly encouraging results indicate that AI can do an excellent job in understanding complex genetic disorders. However, many further issues will need to be resolved before AI finds its way into clinical practice. Most importantly, while AI can capture the genetic architecture of a disease, we are not yet able to fully disentangle the wirings a neural network uses for its predictions, so the human mind still has not been helped. But there is hope: method development that aims at human understanding of AI is one of the most active areas of research of our times.
V. Tam, et al.: “Benefits and limitations of genome-wide association studies”, Nature Reviews Genetics, 2019.
J. Schmidhuber: “Deep learning in neural networks: An overview”, Neural Networks 61, 2015: 85-117.
B. Yin, et al.: “Using the structure of genome data in the design of deep neural networks for predicting amyotrophic lateral sclerosis from genotype”, Bioinformatics (Proc. of ISMB/ECCB 2019), to appear.
Marleen Balvert, CWI, The Netherlands