by George Tambouratzis (Athena Research Centre)

Conversational agents and chatbots have developed rapidly in the past year to provide answers to user queries, drawing information from huge collections of data. From the user's side, the usefulness of conversational agents hinges on the accuracy of responses, in addition to user-friendliness and response speed. Here we briefly evaluate one of the most widely used chatbots, ChatGPT, over a set of queries posed in multiple languages, to test its robustness and consistency. The experiment was run at two timepoints to monitor ChatGPT's evolution.

Conversational agents are widely used at a global level and converse with humans in multiple languages. Despite remarkable progress fuelled by advances in Large Language Models (LLMs) [1], several types of problems persist. Categories of errors include hallucinations, syntax errors, and prompt brittleness [2] [3]. One question that arises is whether these agents provide consistent responses in a multilingual environment. Will users X and Y, posing equivalent queries in different languages A and B, get equivalent or fundamentally different responses?

Defining Queries
To measure performance, the questions posed to ChatGPT concern historical events that date back over 80 years and are well documented and widely accepted. The questions are listed in English in Table 1 and were translated by language experts into German, French, Italian, Spanish and Greek.


Q1. What was the fate of Elli?
Q2. How was Elli lost?
Q3. Was the Elli sunk by torpedoes?
Q4. What about the rumour that Elli was struck by a mine?
Q5. How was Elli sunk?

Table 1: Queries posed.

The questions focus on the Greek destroyer Elli, lost in 1940 to enemy action. With the exception of Q3, questions are open-ended. Q2 and Q5 focus on the ship’s end, while Q3 and Q4 suggest a reason for the loss of the ship. The use of related questions allows us to measure prompt brittleness of the underlying LLM via the variation in the system answers.

First experiment
The responses obtained (April 2023) by posing questions Q1 to Q5 to ChatGPT are summarised in Table 2. Responses in all languages place Elli as being harboured at Tinos island during the event (coinciding with the actual events).

| Language | English | Greek | French | Italian | Spanish | German |
|---|---|---|---|---|---|---|
| A. Type of attack | Submarine - 2 torpedoes | Submarine - 2 torpedoes | Submarine - 2 torpedoes | Submarine - 3 torpedoes | Submarine - 2 torpedoes | Aircraft attack |
| B. Date | 15-Aug-1940 | 15-Aug-1940 | 15-Aug-1940 | 15-Aug-1940 | 15-Aug-1940 | 15-Aug-1940 |
| C. Fate | sunk | sunk | sunk | sunk | sunk | survived |
| D. Next events | - | - | Raised in 1952, scrapped in 1953 | - | - | Sunk (Oct-1940) at Salamis |
| E. Answer following Q5 | Lost to mine | A cruiser ship also named Elli survived | - | - | - | - |

Table 2: Summary of ChatGPT responses (April 2023). A dash (-) in rows D and E indicates an unchanged response.

Responses in all languages correctly report enemy action in August 1940, while inconsistencies occur in the number of casualties. However, the major discrepancies are in the German-language responses, where the ship is stated as surviving the August events, only to be lost in another event months later. Interestingly, for the Greek query, following further prompting, ChatGPT retracted its original response to state that a second naval ship (a cruiser) named Elli was also present at Tinos but survived the events and served for years. This hallucination is probably caused by the fact that (i) Elli was termed both a destroyer and a light cruiser and that (ii) as is common practice in navies, other ships have later used the same name. Other hallucinations include the raising of the ship in 1952 (French) and the sinking in late 1940 (German). The main conclusion is that inconsistencies occur across languages, which is not desirable. Furthermore, Q5 causes ChatGPT to revise its responses, indicating uncertainty.
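One simple way to put a number on the cross-language inconsistency described above is the share of language pairs that give the same answer. The sketch below is illustrative only and is not the procedure used in this study: the function name `pairwise_agreement` and the exact-match comparison of normalised answer strings are assumptions, and the values are transcribed from rows A to C of Table 2.

```python
from itertools import combinations

# Rows A-C of Table 2 (April 2023). Each list holds one answer per query
# language, in the order English, Greek, French, Italian, Spanish, German.
TABLE_2 = {
    "attack": ["submarine, 2 torpedoes", "submarine, 2 torpedoes",
               "submarine, 2 torpedoes", "submarine, 3 torpedoes",
               "submarine, 2 torpedoes", "aircraft attack"],
    "date": ["15-Aug-1940"] * 6,
    "fate": ["sunk", "sunk", "sunk", "sunk", "sunk", "survived"],
}

def pairwise_agreement(answers):
    """Fraction of language pairs whose answers are identical (hypothetical metric)."""
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

for row, answers in TABLE_2.items():
    print(f"{row}: {pairwise_agreement(answers):.2f}")
# attack: 0.40
# date: 1.00
# fate: 0.67
```

A score of 1.00 means all six languages agree. Exact string matching is a deliberately crude choice; free-text responses would first need to be reduced to comparable facts, as was done by hand for the table.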

Repeating the experiment with ChatGPT-3.5 (Nov. 2023)
The newer ChatGPT version requires more clarification from the user about the period and the disambiguation of terms before responding, which is probably intended to increase accuracy. Results are shown in Table 3. In this later experiment, more variation occurs across languages. ChatGPT responds in three languages that the Elli was subsequently raised, and in English and German it states that the ship served for years (in fact, Elli has been a sunken memorial since August 1940).

| Language | English | Greek | French | Italian | Spanish | German |
|---|---|---|---|---|---|---|
| A. Type of attack | Submarine - 2 torpedoes | Submarine - explosives | Submarine - 1 torpedo | Submarine - 2 torpedoes | Submarine - 1 torpedo | Submarine - torpedo |
| B. Date | 15-Aug-1940 | Aug-1940 | 26-May-1944 | 15-Aug-1940 | Sweden, Aug-1940 | 15-Aug-1940 |
| C. Fate | damaged | sunk | sunk | sunk | sunk | sunk |
| D. Next events | Repaired; served for years | - | Raised in 1952, scrapped in 1953 | - | - | Repaired; served several years |
| E. Changes following Q5 | Sunk after attack | Survived for years until scrapped | Elli was German; submarine was British | Attack by Marauders with explosives | Loss caused by mine; then torpedo | - |

Table 3: Response summary (November 2023).

The ship’s nationality is inconsistently reported across languages. For queries in French, Elli is a German ship sunk by the British submarine HMS Sportsman. For Spanish, the Elli is a Swedish ship sunk by German action near the Aland isles (Sweden). Both represent hallucinations, where ChatGPT probably combines unconnected events, producing erroneous output. The instability regarding the cause of loss for Spanish prompting (stated as torpedo for Q2, then mine for Q3, then torpedo for Q4) is also a phenomenon not encountered in the first set of experiments.
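The instability across related questions can be expressed as a simple count of how often the stated cause changes from one question to the next. This is a minimal sketch under the assumption that each response has already been reduced to a single label; `count_switches` is a hypothetical helper, not part of the study's tooling.

```python
def count_switches(answers):
    """Number of times the answer changes between consecutive questions."""
    return sum(prev != curr for prev, curr in zip(answers, answers[1:]))

# Cause of loss reported for the Spanish prompts (Q2, Q3, Q4), as described above.
spanish_cause = ["torpedo", "mine", "torpedo"]
print(count_switches(spanish_cause))  # 2
```

A count of 0 would indicate a stable answer across the related prompts; higher counts suggest greater prompt brittleness.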

ChatGPT provides a strong capability to respond to specialised user queries. This article has investigated its use in multilingual settings and indicates that queries in different languages produce fundamentally different responses, even for widely accepted events. The newer ChatGPT version has access to more extensive amounts of information, which likely causes more frequent hallucinations by inadvertently connecting unrelated events. It is essential to improve consistency in multilingual settings to ensure the accuracy of results, and research work is underway to develop such methods on the basis of multilingual cross-checking. In future, the aim is to extend this study to cover the grammaticality of responses.
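To give a rough idea of what multilingual cross-checking could look like, the sketch below accepts an answer only when a sufficient share of languages agree on it. It is an assumption-laden illustration, not the method under development: the function `cross_check` and the threshold `min_share` are hypothetical, and the example values are taken from rows A and C of Table 3.

```python
from collections import Counter

def cross_check(answers_by_language, min_share=0.66):
    """Return the majority answer, or None when agreement is below min_share.

    min_share is an arbitrary illustrative threshold, not a tuned value.
    """
    if not answers_by_language:
        return None
    counts = Counter(answers_by_language.values())
    answer, votes = counts.most_common(1)[0]
    share = votes / len(answers_by_language)
    return answer if share >= min_share else None

# Row C of Table 3 (November 2023): five of six languages agree.
fate = {"English": "damaged", "Greek": "sunk", "French": "sunk",
        "Italian": "sunk", "Spanish": "sunk", "German": "sunk"}
# Row A of Table 3: no answer is shared by more than two languages.
attack = {"English": "submarine, 2 torpedoes", "Greek": "submarine, explosives",
          "French": "submarine, 1 torpedo", "Italian": "submarine, 2 torpedoes",
          "Spanish": "submarine, 1 torpedo", "German": "submarine, torpedo"}

print(cross_check(fate))    # sunk
print(cross_check(attack))  # None
```

Agreement across languages does not guarantee correctness, since all languages can share the same error, so a scheme of this kind can only flag inconsistent answers for further checking.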

[1] W. X. Zhao et al., “A Survey of Large Language Models,” 2023, arXiv:2303.18223v12.
[2] J. Kaddour et al., “Challenges and Applications of Large Language Models,” 2023, arXiv:2307.10169v1.
[3] Y. Bang et al., “A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity,” 2023, arXiv:2302.04023.

Please contact: 
George Tambouratzis, Athena Research Centre, Greece

