Big Data Takes on Prostate Cancer

by Erwan Zerhouni, Bogdan Prisacari, Qing Zhong, Peter Wild and Maria Gabrani

Most men, by the time they reach 80 years of age, get prostate cancer. The treatment is usually an operation or irradiation, which sometimes has complications. However, not every tumour is aggressive, in which case there is no urgent need to remove it. Ascertaining whether a tumour is aggressive or insignificant is difficult, but analysis of big data shows great promise in helping in this process.

Prostate cancer (PC) is the second leading cause of cancer-related deaths in the Western world. PC is typically diagnosed on the basis of increased levels of the serum protein PSA (prostate-specific antigen) together with digital rectal examination, and is confirmed by prostate needle biopsies. However, PSA and biopsies often fail to distinguish between clinically indolent and aggressive forms, leading to overtreatment such as unnecessary prostatectomies and irradiation that can greatly impair a patient’s quality of life.

Currently, pathologists assess patient biopsies and tissue resections under a microscope, leading to diagnoses that are affected by subjective judgment and by intra- and inter-observer variability. The procedure is time-consuming and hence low-throughput, while hospitals may generate hundreds or even thousands of tissue samples per day. This number is expected to increase sharply, as the World Health Organization predicts that the number of cancer diagnoses will increase by 70% in the next two decades. Moreover, novel staining technologies, such as immunohistochemistry (IHC) and in situ hybridization (ISH), reveal molecular expression patterns through multicolour visualization. Such techniques are commonly used for estimating and monitoring the response to targeted treatment, increasing the need for high-throughput and unbiased digital solutions.

Emerging initiatives to move hospitals into the digital era use bright-field and fluorescence scanners to convert glass slides of tissue specimens and needle biopsies into virtual microscopy images of very high quality. Slide images are huge, with several thousand pixels per axis, turning digital image analysis into a big data problem. Integrating digitized tissue specimens with an image analysis workflow allows objective and high-throughput evaluation of tissue slides, transforming surgical pathology into a truly quantitative and data-driven science.

We have developed a computational framework to extract unique protein signatures from immunohistochemistry (IHC) stained tissue images. IHC assays provide significant information with respect to cellular heterogeneity and disease progression [1], and are thus used to identify patients most likely to respond to targeted therapy. However, tissue morphology may have different molecular signatures (see Figure 1) owing to genetics and other parameters, such as environment and lifestyle. Currently, IHC image analysis focuses on staining intensity and is performed mostly manually, which makes it low-throughput and prone to bias. Emerging computational techniques use metrics such as the H-score or the Aperio metric [2]. Recent studies, however, show that to tailor a patient's treatment and to monitor treatment progression, the staining intensity needs to be correlated to the grade of disease; that is, to the morphological and cellular architecture that define cancer and many other diseases.
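To make the H-score mentioned above concrete: as commonly defined, it weights the percentage of cells at each staining intensity (weak, moderate, strong) by 1, 2 and 3 respectively, giving a value between 0 and 300. The following is a minimal sketch of that arithmetic only; the function name and the example percentages are illustrative assumptions, and it is not the scoring code used in the cited studies.

```python
def h_score(pct_weak, pct_moderate, pct_strong):
    """Return the H-score for one tissue region.

    Each argument is the percentage (0-100) of cells stained at that
    intensity. Unstained cells contribute nothing, so the three
    percentages may sum to less than 100.
    """
    percentages = (pct_weak, pct_moderate, pct_strong)
    if min(percentages) < 0 or sum(percentages) > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    # Weights 1, 2, 3 for weak, moderate and strong staining.
    return 1 * pct_weak + 2 * pct_moderate + 3 * pct_strong


# Hypothetical region: 20% weak, 30% moderate, 10% strong, 40% unstained.
print(h_score(20, 30, 10))  # 110
```

Because the score collapses a whole region into a single intensity-weighted number, it carries no information about where the stained cells lie relative to the tissue architecture, which is the limitation the paragraph above points to.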

Figure 1: IHC prostate images indicating the difficulty of sample analysis. The images highlight the PTEN (top) and ERG (bottom) proteins. Left to right: normal, low, and intermediate grade stages of cancer development. Figure taken from Zerhouni et al. [3].

In the developed framework, we use a pre-trained convolutional network to extract features that capture both morphology and colour from image patches around identified stained cells. We generate a feature dictionary whose size is data dependent. To represent tissue and tumour heterogeneity, we capture spatial layout information using a commute time matrix approach. We then use the commute time matrix to compute a unique signature per protein and disease grade. The goal of this work is to evaluate whether the expression of the proteins can be used as a predictor of pathogenesis, and to quantify the relative importance of each protein at the different disease grades. To this end, we evaluate the signatures in a classification task, both individually and in combination. For the individual evaluation, we use a random forest classifier. To evaluate the collective contribution of the proteins, we use a multiple kernel learning (MKL) approach. We tested the proposed framework on a PC tissue dataset and demonstrated the efficacy of the derived protein signatures for both disease stratification and quantification of the relative importance of each protein [3].
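As an illustration of the commute time step, the sketch below computes a commute time matrix for a small weighted graph using the standard formula based on the pseudo-inverse of the graph Laplacian: the commute time between nodes i and j is the graph volume multiplied by (L+[i, i] + L+[j, j] - 2 L+[i, j]). How the graph is built from stained cells and dictionary features is not described here, so the adjacency matrix, the function name and the toy example are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def commute_time_matrix(adjacency):
    """Commute times between all node pairs of a connected, undirected graph.

    `adjacency` is a symmetric matrix of non-negative edge weights.
    """
    A = np.asarray(adjacency, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("adjacency must be a square matrix")
    if not np.allclose(A, A.T):
        raise ValueError("adjacency must be symmetric")

    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A
    # Moore-Penrose pseudo-inverse of the graph Laplacian.
    laplacian_pinv = np.linalg.pinv(laplacian)
    volume = degrees.sum()  # sum of all node degrees

    diag = np.diag(laplacian_pinv)
    # C[i, j] = vol * (L+[i, i] + L+[j, j] - 2 * L+[i, j])
    return volume * (diag[:, None] + diag[None, :] - 2.0 * laplacian_pinv)


# Toy example: three nodes on a path (0 - 1 - 2) with unit edge weights.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
print(np.round(commute_time_matrix(path_graph), 6))
```

In this toy graph the two end nodes are twice as far apart, in commute time, as either is from the middle node. Commute time takes all paths between two nodes into account rather than only the shortest one, which is one reason it is a common choice for summarising spatial layout.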

The technique already shows promise, and we expect even greater accuracy as we continue to tune it. Owing to the limited number of images in our dataset, training a CNN was not feasible. To address this, we are currently investigating the use of a convolutional auto-encoder for unsupervised feature extraction. Furthermore, we plan to test the framework on larger datasets and more proteins.

Links:
http://www.zurich.ibm.com/mcs/systems_biology
http://www.wildlab.ch/

References:
[1] Q. Zhong et al.: “Computational profiling of heterogeneity reveals high concordance between morphology- and proteomics-based methods in prostate cancer”, DGP 2015.
[2] A. Rizzardi et al.: “Quantitative comparison of immunohistochemical staining measured by digital image analysis versus pathologist visual scoring”, Diagnostic Pathology 2012, 7:42.
[3] E. Zerhouni et al.: “A computational framework for disease grading using protein signatures”, IEEE ISBI 2016 (accepted).

Please contact:
Maria Gabrani
IBM Research Zurich