by Stefano Cresci, Marinella Petrocchi, Maurizio Tesconi (IIT-CNR), Roberto Di Pietro (Nokia Bell Labs) and Angelo Spognardi (DTU)
Inspired by biological DNA, we model the behaviour of online users as “Digital DNA” sequences, introducing a strikingly novel, simple, and effective approach to discriminate between genuine and spambot online accounts.
Modelling the behaviour of online users, and analysing their properties, is of primary importance for a broad variety of applications. First, it can be used to mine substantial information about events of public interest. Second, online behavioural analysis can be applied to make predictions: linking past behaviours to some kind of ground truth makes it possible to predict what is likely to happen when similar behaviours occur in the future.
Figure: Modelling Twitter accounts via digital DNA.
Illustration: Stefano Cresci.
Here, we consider online behavioural analysis as a means to detect fictitious and automated accounts, which distribute unsolicited spam, advertise events and products of doubtful legality, sponsor public characters and, ultimately, bias public opinion and harm social relationships. Spambot detection is thus essential for the protection of cyberspace, both against threats to users’ sensitive information and against trolls who may want to cheat and damage them. Unfortunately, new waves of malicious accounts present advanced features, making their detection with existing systems extremely challenging.
Inspired by biological DNA, we propose to model online user behaviour with strings of characters representing the sequence of a user's online actions. Each kind of action (e.g., posting new content, following or replying to a user) can be encoded with a different character, in a similar manner to the bases of DNA sequences. According to this paradigm, a user's online actions represent the bases of their ‘digital DNA’.
Digital DNA is a flexible way of modelling the different kinds of user behaviour that are observed on the internet. Its flexibility lies in the ability to choose which actions will form the sequence. For example, digital DNA sequences on Facebook could include a different base for each user-to-user interaction type: comments, likes, shares and mentions.
Like its biological namesake, digital DNA is a compact representation of information. For example, the timeline of a Twitter user could be encoded as a single string of 3,200 characters (one character per tweet).
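The encoding described above can be illustrated with a minimal sketch. The action names and the three-letter alphabet used here (A for a tweet, C for a reply, T for a retweet) are assumptions chosen for illustration, as is the helper `encode_timeline`; the text only requires that each kind of action maps to a distinct character.

```python
from typing import Iterable, Mapping

# Hypothetical alphabet: one base per action type (an assumption for illustration).
TWITTER_ALPHABET = {"tweet": "A", "reply": "C", "retweet": "T"}


def encode_timeline(
    actions: Iterable[str],
    alphabet: Mapping[str, str] = TWITTER_ALPHABET,
) -> str:
    """Map a chronological list of action types to a digital DNA string."""
    return "".join(alphabet[action] for action in actions)


# A toy timeline of five actions becomes a five-character sequence.
timeline = ["tweet", "tweet", "reply", "retweet", "tweet"]
print(encode_timeline(timeline))  # AACTA

# Flexibility: a different (equally hypothetical) alphabet for another platform.
facebook_alphabet = {"comment": "C", "like": "L", "share": "S", "mention": "M"}
print(encode_timeline(["like", "like", "share"], facebook_alphabet))  # LLS
```

Passing the alphabet as a parameter mirrors the flexibility noted above: the same function serves any platform once the set of actions of interest has been chosen.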
In contrast with the supervised spambot detection approaches largely used in recent years, we have devised an unsupervised way to detect spambots: we compare the behaviour of accounts with the aim of finding similarities between automated ones. We model the behaviour of spambots via their digital DNA and compare it to that of genuine accounts. We exploit digital DNA to study the behaviour of groups of users, following the intuition that, because of their automated nature, spambots are likely to share more similarities in their digital DNA than a group of heterogeneous genuine users.
This process is called digital DNA fingerprinting and encompasses four main steps: (i) acquisition of behavioural data; (ii) extraction of DNA sequences; (iii) comparison of DNA sequences; (iv) evaluation. First, we create datasets of verified spambots and genuine Twitter accounts. Then, we extract the digital DNA of the accounts; that is, we associate each account with a string that encodes its behavioural information.
Next, we study similarities among the DNA sequences of our accounts. We consider similarity as a proxy for automation and, thus, an exceptionally high level of similarity among a large group of accounts serves as a red flag for anomalous behaviour. In particular, we quantify similarity by looking at the Longest Common Substring (LCS) among digital DNA sequences. We show that the similarity, as measured by the LCS, among the DNA sequences of spambots is much higher than that among genuine accounts, and we leverage this distinctive feature to perform our spambot detection. Finally, we compare our spambot detection results with those of other state-of-the-art approaches.
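As an illustration of the similarity measure, the following sketch computes the longest substring shared by every sequence in a group. It is a naive implementation under stated assumptions, not the authors' algorithm: the function names are hypothetical, the toy sequences are invented, and for large groups a generalised suffix tree would be a more efficient choice.

```python
from typing import Sequence, Set


def common_substrings(seqs: Sequence[str], length: int) -> Set[str]:
    """Return the substrings of the given length that occur in every sequence."""
    shared = None
    for s in seqs:
        grams = {s[i:i + length] for i in range(len(s) - length + 1)}
        shared = grams if shared is None else shared & grams
        if not shared:
            return set()
    return shared or set()


def longest_common_substring(seqs: Sequence[str]) -> str:
    """Longest substring present in all sequences, found by binary search on length.

    The search is valid because any common substring of length n contains
    common substrings of every shorter length.
    """
    if not seqs:
        return ""
    lo, hi, best = 1, min(len(s) for s in seqs), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_substrings(seqs, mid)
        if found:
            best = min(found)  # deterministic pick among equally long candidates
            lo = mid + 1
        else:
            hi = mid - 1
    return best


# Invented toy data: three near-identical "bot" sequences and three varied ones.
bots = ["AAACTAAACT", "CAAACTAAAC", "AAACTAAACA"]
humans = ["ACTTACAATC", "TCACCATTAC", "CATATTCCAA"]

print(len(longest_common_substring(bots)))  # 9
print(len(longest_common_substring(bots)) > len(longest_common_substring(humans)))  # True
```

In this toy example the automated group shares a far longer substring than the heterogeneous group, which is the kind of gap the approach relies on as a red flag.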
Results show that our proposed technique outperforms best-of-breed algorithms that are commonly employed for spambot detection. In addition, most of those state-of-the-art approaches require a large number of data-demanding features, as shown by Cresci et al. in “Fame for Sale”. In contrast, our digital DNA fingerprinting technique on Twitter exploits only timeline data to perform spambot detection, making it both effective and efficient. By relying on digital DNA, analysts can leverage a powerful set of tools that have been developed over decades for the analysis of biological DNA to validate their working hypotheses on online user behaviour.
E. Ferrara et al.: “The Rise of Social Bots”, Communications of the ACM, 59(7), 2016.
S. Cresci et al.: “DNA-inspired online behavioral modeling and its application to spambot detection”, IEEE Intelligent Systems, PrePrints, doi:10.1109/MIS.2016.29, 2016.
S. Cresci et al.: “Fame for Sale: Efficient detection of fake Twitter followers”, Decision Support Systems 80(4), pp. 56–71, 2015.
Marinella Petrocchi, IIT-CNR, Italy