by Katerina Papantoniou, Panagiotis Papadakos and Dimitris Plexousakis (ICS-FORTH)

Automatic deception detection is a crucial and challenging task with many critical applications, both in direct physical and in computer-mediated human communication. Automatic detection is needed because humans are notoriously poor at spotting deception. The problem is compounded when cultural differences are involved, in which case differences in social norms may lead to misjudgments and consequently impede fair treatment and justice. Here, we describe our findings on the exploitation of natural language processing (NLP) techniques and tools for the task of automated text-based deception detection, and focus on the relevant cultural and language factors [1].

The vast majority of works in automatic deception detection take a “one-size-fits-all” approach, failing to adapt their techniques to the cultural factor. Our aim is to add a larger-scale computational approach to a series of recent interdisciplinary works that examine the connection between culture and deceptive language. Culture and language are tightly interconnected, since language is a means of expression, embodiment, and symbolization of cultural reality, and as such, differences among cultures are reflected in language usage. This also applies to the expression of deception among people belonging to different cultures.

Figure 1: Incorporation of cultural aspects in the research of deception detection.

Towards the above aim, our research questions and goals are:

  • Can we verify through experiments the prior body of work, which states that some linguistic cues of deception are expressed differently, for example, are milder or stronger, across cultures due to different cultural norms? More specifically, we want to explore how the individualism/collectivism divide shapes the usage of specific linguistic cues. Individualism and collectivism constitute a well-known division of cultures, and concern the degree to which members of a culture value individual goals over group goals and vice versa. Since cultural boundaries are difficult to define precisely when collecting data, we use datasets from different countries, assuming that they reflect at an aggregate level the dominant cultural aspects that relate to deception in each country. In other words, we use countries as proxies for cultures, following Hofstede in that respect [2].
  • Explore which language indicators and cues are more effective for detecting deception in a given piece of text, and identify whether a universal feature set that we could rely on for deception detection tasks exists. On top of that, we investigate the volatility of cues across different domains by keeping the individualism/collectivism and language factors fixed, whenever we have appropriate datasets at our disposal.
  • Create a wide range of binary classifiers for predicting the truthfulness and deceptiveness of text, and evaluate their performance.

To answer our first and second research goals, we performed statistical tests on a set of linguistic cues of deception already proposed in the literature, placing emphasis on those reported to differ across the individualism/collectivism divide. We conducted our analysis on datasets originating from six countries, namely the United States of America, Belgium, India, Russia, Romania, and Mexico, which are seen as proxies of cultural features at an aggregate level. Regarding the third research goal, we created culture/language-aware classifiers by experimenting with a wide range of n-gram features from several levels of linguistic analysis, namely phonology, morphology and syntax, other linguistic cues like word and phoneme counts, pronoun use, etc., and token embeddings. We applied two classification methods, namely logistic regression and fine-tuned BERT models. Regarding BERT, we experimented with both monolingual models and a cross-lingual model (mBERT [L1]).
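As an illustration of what n-gram features look like in practice, the following is a minimal sketch in Python using only the standard library. It is not the pipeline used in the study: the function names (`tokenize`, `word_ngrams`, `char_ngrams`) and the simple regular-expression tokenizer are assumptions made for this example, and phoneme, POS, or syntactic n-grams would additionally require language-specific tools.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Hypothetical helper: a deliberately simple tokenizer, not the one
    used in the study. Apostrophes split tokens ("don't" -> "don", "t").
    """
    return re.findall(r"\w+", text.lower())


def word_ngrams(text, n):
    """Count word n-grams, each represented as a tuple of n tokens."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams over the lowercased raw text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


# Example: bigram counts for one short text.
bigrams = word_ngrams("I was there and I was late", 2)
```

Counts like these would typically be turned into a fixed-length vector per text (one dimension per n-gram seen in the training data) before being passed to a classifier such as logistic regression.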

The results showed that the undertaken task is fairly complex and demanding. In accordance with prior work, our analysis showed that people from individualistic cultures employ more third-person and fewer first-person pronouns to distance themselves from the deceit when they are deceptive, whereas in the collectivist group this trend is milder. Regarding the expression of sentiment in deceptive language across cultures, we observe an increased usage of positive language in deceptive texts for individualistic cultures (mostly in the US datasets), which is not observed in more collectivist cultures.
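A pronoun cue of this kind can be computed as a simple rate per text. The sketch below is a minimal example under stated assumptions: the English pronoun lists `FIRST_PERSON` and `THIRD_PERSON` and the function `pronoun_rates` are hypothetical and chosen for illustration, not the lexicons used in the study, and other languages would need their own lists and tokenization.

```python
import re

# Assumed English pronoun lists, for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
THIRD_PERSON = {"he", "him", "his", "himself",
                "she", "her", "hers", "herself",
                "it", "its", "itself",
                "they", "them", "their", "theirs", "themselves"}


def pronoun_rates(text):
    """Return first- and third-person pronoun counts divided by token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        # Avoid division by zero for empty texts.
        return {"first_person": 0.0, "third_person": 0.0}
    first = sum(t in FIRST_PERSON for t in tokens)
    third = sum(t in THIRD_PERSON for t in tokens)
    return {"first_person": first / total, "third_person": third / total}


rates = pronoun_rates("They told me that she left")
```

Comparing the distribution of such rates between truthful and deceptive texts, separately for each dataset, is one way to check whether a cue behaves differently across cultures.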

With respect to our second goal, our analysis showed the absence of a universal feature set. On top of this, our experiments within the same culture (US) and across different genres revealed how volatile and sensitive the deception cues are.

The experimentation with the logistic regression classifiers demonstrated the superiority of word and phoneme n-grams over all the other n-gram variations (character, POS, and syntactic). The linguistic cues surpass the baselines but lag behind the n-gram settings, with the difference being milder in cross-domain experiments. The fine-tuned BERT models, although costly in terms of hyperparameter tuning, performed rather well, whereas the experimentation with mBERT as a case of zero-shot transfer learning showed promising results that can possibly be improved by incorporating culture-specific knowledge, or by taking advantage of cultural and language similarities for the least-resourced languages.

In a follow-up work [3], we added one more language to our analysis, namely Greek, and a new genre, by introducing a new dataset of April Fools’ Day articles. In line with the above results, and in comparison with an English April Fools’ Day dataset, the analysis showcased the use of emotional language, especially of positive sentiment, in deceptive articles, which is even more prevalent in the individualistic English dataset. Further, the use of less concrete language in deceptive texts is fairly evident in both the Greek and the English datasets.


[1] K. Papantoniou, et al.: “Deception detection in text and its relation to the cultural dimension of individualism/collectivism”, Natural Language Engineering, 1-62, 2021. doi:10.1017/S1351324921000152
[2] G.H. Hofstede: “Culture’s Consequences: Comparing Values, Behaviors, Institutions, and Organizations Across Nations”, 2nd and enlarged edition. Thousand Oaks, CA: Sage 2001.
[3] K. Papantoniou, et al.: “Linguistic Cues of Deception in a Multilingual April Fools’ Day Context”, in Proc. of the Eighth Italian Conference on Computational Linguistics (CLIC-it 2021), January 26-28, 2022, online

Please contact:
Katerina Papantoniou, ICS-FORTH, Greece

Panagiotis Papadakos, ICS-FORTH, Greece