by Alexander Dür, Peter Filzmoser (TU Wien) and Andreas Rauber (TU Wien and Secure Business Austria)
With the desire and need to be able to trust decision making systems, understanding the inner workings of complex deep learning neural network architectures may soon replace qualitative or quantitative performance as the primary focus of investigation and measure of success. We report on a study investigating a complex deep learning neural network architecture aimed at detecting causality relations between pairs of statements. It demonstrates the need to obtain a better understanding of what actually constitutes sufficient and useful insights into the behaviour of such architectures that go beyond mere transformation into rule-based representations.
Recently there has been increased pressure from legislators, ethics boards and the public at large to ensure that algorithmic decision making systems provide means for inspecting their behaviour and can explain how they arrive at any specific decision. This is well aligned with researchers’ desire to understand the workings of an algorithm, as such understanding usually constitutes the most structured and only viable approach to improving an algorithm’s overall performance.
Naïve approaches, such as untargeted parameter sweeps and architecture variations that lead to higher performance in some specific benchmark settings, turn out to be useless, as they do not provide any rationale for how to repeat such optimisations for specific tasks in different settings. Inspecting deep learning (DL) networks to understand what they are doing, and which parts of the input are most influential in the final decision, provides valuable insights both for understanding a specific model and for guiding the targeted design of improved architectures, data representations and learning routines. Examples such as attention highlighting in image processing have both revealed the characteristics learned by a network and uncovered errors and bias in the resulting systems, for example an unintentional focus on logos embedded in images as a clear separator between positive and negative class images, caused by the way the training data set was constructed.
Image analysis settings lend themselves to inspection because the data processed can conveniently be displayed in visual form, which made them early candidates for identifying areas of attention [1]. Discovering the structure learned by various network layers, such as the focus on edges and their orientation in subsequent layers of a convolutional neural network (NN) architecture [2], provides intuitive insights into their behaviour. Settings that do not lend themselves to direct visual inspection pose much harder challenges. Moreover, tasks that go beyond mere classification of data based on individual, independent attributes make it even harder to devise interpretable representations of the inner workings of such complex DL architectures. Here we review the challenges of devising such an inspection for a neural natural language inference model, where the goal is to correctly classify the logical relationship between a pair of sentences into one of three categories: entailment, neutral or contradiction.
Current attempts at explaining and understanding neural language processing models are primarily based on the visualisation of attention matrices. While this technique provides valuable insights into the inner workings of such models, it focuses only on their attention layers and ignores all other types of layers.
Our approach to understanding complex neural network architectures is based on analysing the interaction patterns of a single input word with all other words. We do this by comparing the network’s activations on the original input with its activations when individual words are removed. The resulting differences in activations show the interaction between different words of the input in different layers. The interactive tool we built allows users to enter a baseline input, directly perturb this input by excluding words, and observe the influence on activations through all layers of the network, including the model’s predictive output.
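The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model or tool used in the study: `embed`, `toy_activations` and `removal_effect` are hypothetical names, and the three-layer toy network (a deterministic embedding lookup, a neighbour-mixing layer and a self-attention-like layer) merely stands in for a trained inference model. The one detail worth noting is the alignment step: after a word is removed, its row is also dropped from the original activations so that the remaining words can be compared position by position.

```python
import zlib
import numpy as np

DIM = 8  # size of the toy activation vectors (arbitrary choice)


def embed(token):
    """Hypothetical embedding lookup: a fixed pseudo-random vector per token."""
    rng = np.random.default_rng(zlib.crc32(token.encode("utf-8")))
    return rng.standard_normal(DIM)


def toy_activations(tokens):
    """Toy stand-in for a real model: one activation matrix per layer,
    with one row per input token."""
    h0 = np.stack([embed(t) for t in tokens])            # layer 0: embeddings
    padded = np.pad(h0, ((1, 1), (0, 0)))                # zero vectors at both ends
    h1 = np.tanh(h0 + 0.5 * (padded[:-2] + padded[2:]))  # layer 1: mix in left/right neighbours
    scores = h1 @ h1.T                                   # layer 2: attention-like mixing
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    h2 = weights @ h1
    return [h0, h1, h2]


def removal_effect(tokens, index, activations_fn=toy_activations):
    """Remove tokens[index] and return, for every layer, how far the
    activation of each remaining token moved (Euclidean distance)."""
    original = activations_fn(tokens)
    kept = tokens[:index] + tokens[index + 1:]
    perturbed = activations_fn(kept)
    effects = []
    for orig_layer, pert_layer in zip(original, perturbed):
        aligned = np.delete(orig_layer, index, axis=0)   # drop the removed word's row
        effects.append(np.linalg.norm(aligned - pert_layer, axis=1))
    return kept, effects


tokens = "a dog sleeps on the sofa".split()
kept, effects = removal_effect(tokens, index=1)          # remove the noun "dog"
for layer, effect in enumerate(effects):
    print(layer, dict(zip(kept, np.round(effect, 3))))
```

In this toy setting the embedding layer shows no change, the neighbour-mixing layer changes only for the words adjacent to the removed one, and the attention-like layer spreads the change to every word. With a real model, `activations_fn` would be replaced by a function that returns the recorded per-layer activations of that model.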
Figure 1 shows how the initial removal of a single noun affects the processing of the activations belonging to all other words. The words most strongly influenced are those that have a linguistic relationship with the removed word, like a preposition referencing a noun.
Figure 1: Activation differences for the first four layers of a neural natural language inference model trained on the SNLI corpus [3]. Layer 0 is an embedding lookup, layers 1 and 2 are bidirectional RNNs, layer 3 is a combination of an attention and a feed forward layer, and layer 4 is a bidirectional RNN layer processing the two prior layers.
Similarly, we analysed (amongst other perturbations) the effect of changing the word order in the source or target sentences, revealing the impact of positional characteristics.
State-of-the-art models for many natural language processing tasks are artificial neural networks, which are widely considered to be black box models. Improvements are often the result of untargeted parameter sweeps and architectural modifications. In order to improve such models efficiently and systematically, a deeper understanding of their inner workings is needed. We argue that interactive exploration through input perturbation is a promising and versatile approach for inspecting neural networks’ decision processes and finding specific target areas for improvement.
[1] M. D. Zeiler, R. Fergus: “Visualizing and understanding convolutional networks”, European Conference on Computer Vision, Springer, Cham, 2014.
[2] J. Yosinski, et al.: “Understanding neural networks through deep visualization”, in International Conference on Machine Learning (ICML) Workshop on Deep Learning, 2015.
[3] S. Bowman, et al.: “A large annotated corpus for learning natural language inference”, in Proc. of EMNLP, 2015.
Alexander Dür, TU Wien, Austria