by Vicenç Torra and Klara Stokes
The release of confidential data to third parties requires a detailed risk analysis. Identity disclosure occurs when a record is linked to the person or, more generally, the entity that has supplied the information in the record. A re-identification method (eg record linkage) is a tool which, given two files, links the records that correspond to the same entity. In the project Consolider-ARES we study re-identification methods, their formalization and their use for measuring disclosure risk.
When a database is released to third parties, the leakage of confidential information about the individuals in the database is an important issue. One common approach for solving this problem is to mask the data in order to destroy possible links between the confidential information and the individuals, at the same time preserving some utility of the data. In this process the anonymity of the individuals should be maximized and the information loss simultaneously minimized. In order to define good masking methods, it is essential to quantify anonymity and information loss. We believe that in most situations anonymity of individuals should be evaluated and ensured before information loss is taken into account, otherwise the anonymity is likely to suffer, as has been illustrated on several occasions. Consequently, our research focuses on quantifying and understanding anonymity in data privacy.
Identity disclosure occurs when a record in the database is linked to the person or organization whose data is in the record. In data privacy, re-identification methods (for example record linkage) are used to evaluate disclosure risk, while data protection methods are developed to avoid, or reduce the chance of, identity disclosure. The idea of re-identification is pervasive in data privacy, and some concepts as k-anonymity can be understood in the light of re-identification methods and record linkage.
Record linkage focuses on the case of two databases with information about the same individuals, and linkage is done at the record level. Re-identification is a more general term which encompasses the linkage of other objects as attributes and includes schema matching for example.
In the Consolider-ARES project we study re-identification methods. We address issues ranging from their formalization to algorithms to make re-identification effective. Our study includes the following topics:
1. Formalization of re-identification algorithms. This approach tries to answer the question of which algorithms are correct re-identification algorithms. Our formalization is based on imprecise probabilities and compatible belief functions. Only algorithms that return probabilities compatible with a true probability can be properly called re-identification algorithms. This construction permits us to revise probabilistic record linkage.
2. Optimal re-identification algorithms. Given two databases consisting of information about the same individuals, when the correct links between the two databases are known, we can formalize the optimal re-identification problem in terms of the number of incorrect links. That is, given a parameterization of the algorithm, we define the objective function as the number of incorrect links. Then the optimal re-identification algorithm is the one that minimizes this objective function – the number of incorrect links. Any solution of this optimization problem can be used as a measure of risk of the worst-case scenario. Machine learning tools and optimization techniques can be used to find this optimal solution.
3. K-confusion. We say that a data protection algorithm satisfies k-confusion if it causes any re-identification method to return at least k possible candidates. This is a generalization of k-anonymity. In k-anonymity it is required that there are at least k records that are equal over the attributes that are assumed to be useful for the adversary's re-identification process. As k-confusion is a generalization of k-anonymity, it allows for more flexibility and can therefore be used in order to achieve masked data with lower information loss, without yielding on the degree of anonymity that is provided.
Figure 1: Record linkage models an adversary's intent to link the data he has available (left) to the correct record in the protected database (right).
Within the Consolider-ARES project, re-identification methods have been applied to different types of databases. These databases include standard numerical and categorical tables, but also databases with time series, series for locations, and graphs to represent online social networks. Re-identification algorithms have been used to measure the disclosure risk of the outcome of data protection methods.
Partial support by the Spanish MEC (projects ARES – CONSOLIDER INGENIO 2010 CSD2007-00004 – and RIPUP– TIN2009-11689) is acknowledged.
 V. Torra, “Privacy in Data Mining”, in Data Mining and Knowledge Discovery Handbook, O. Maimon, L. Rokach, Eds., 2nd Edition, Springer, 2010, pp. 687-716.
 D. Abril, G. Navarro-Arribas, V. Torra, “Improving record linkage with supervised learning for disclosure risk assessment”, Information Fusion, Volume 13, Issue 4, October 2012, Pages 274–284.
 K. Stokes, V. Torra, “Reidentification and k-anonymity: a model for disclosure risk in graphs”, Soft Computing, in press. DOI:10.1007/s00500-012-0850-4
Tel: +34 935809570
Universitat Oberta de Catalunya, Spain
Tel: +34 93 450 54 17