Taranis AI: Applying Natural Language Processing for Advanced Open-Source Intelligence Analysis

by Florian Skopik and Benjamin Akhras (Austrian Institute of Technology)

Open-source intelligence (OSINT) provides up-to-date information about new cyber-attack techniques, attacker groups, changes in IT products, policy updates, recent security events and much more. Often dozens of analysts search a multitude of sources and collect, categorise, cluster, and rank news items from the clear and dark web in order to prepare the most relevant information for decision makers. One tool that supports this work is “Taranis NG” from the Slovak CERT. It ingests information from many types of sources, such as websites, RSS feeds, emails and social media channels, and makes it searchable. It also supports the creation of reports and daily summaries. However, the number of sources and news items is growing continuously, making it increasingly difficult to search them purely by hand. These circumstances call for the application of novel natural language processing (NLP) methods to make OSINT analysis more efficient.

Open-source intelligence is the collection and analysis of data gathered from open sources to produce actionable intelligence [1]. The literature distinguishes at least between technical Cyber Threat Intelligence (CTI) on the one hand and tactical and strategic CTI on the other. Technical CTI mainly comprises simple data used to configure security systems, such as indicators that are fed into SIEMs, domain names for blocklists in name servers or proxies, and execution patterns to block malicious code in endpoint detection and response (EDR) solutions. Tactical and strategic CTI is much “softer”. It usually consists of higher-level information, presented in natural language on various news sites and security tickers, and covers quite diverse topics: new threat actors, new security products and product features, breaches, incidents and campaigns, as well as vulnerabilities, patches, mitigations, counter-measures, and exploitation. It also includes policy news, such as political and diplomatic initiatives, new EU policy documents, GDPR-related lawsuits, updates on security standards, mergers, acquisitions, failures, or other company-related news.

Gathering this mostly public information is essential to maintain situational awareness and to take early action in security matters. The typical OSINT workflow [1] foresees five phases: (i) collection, (ii) processing, (iii) analysis, (iv) production and dissemination, and (v) direction and planning. In short, large organisations, national authorities, and analysis centres monitor potentially hundreds of sources that publish thousands of articles daily, and analyse them for relevant content to create so-called products. These are essentially reports for specific constituencies that support decision-making processes. The quality of these reports depends heavily on the level of sophistication of the analysis phase.

However, ingesting, analysing and making use of semantically richer “soft” CTI is much more demanding than ingesting well-structured, machine-readable technical CTI. Soft CTI usually comes as unstructured free-form text containing high-level, often ambiguous strategic information designed for human consumption, and in the course of complex analysis workflows it is indeed usually consumed by human analysts only. This is tedious, resource-intensive, and error-prone work. The human element slows down the analysis process and severely limits scalability. As the number of OSINT sources and the frequency of published articles rise, we need new analysis techniques to keep pace with these developments and to avoid missing critical pieces of information. Fortunately, natural language processing (NLP) and Artificial Intelligence (AI) have made tremendous progress in recent years.

In the course of our research, and together with our stakeholders from national authorities and CERTs, we explore five essential user stories that human OSINT analysts face in their daily work. Supporting these user stories with appropriate technical means is the goal of Taranis AI [L1]:

User Story 1: What was going on in the cyber security domain in the last 24 hours? (“Hot Topics Clustering”)
User Story 2: What do we know about a specific entity (e.g. a vulnerability, malware, company, product, or person)?
User Story 3: I’ve read an interesting article. What further related news items exist?
User Story 4: Which news items are recommended for me based on my recent preferences (collaboratively and AI-assisted)?
User Story 5: I’d like to build a report for certain clients. How can I sum up my findings efficiently?
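
To make User Story 3 more concrete, the following minimal sketch ranks news items by their textual similarity to an article the analyst has just read. It is an illustration only and not the method used in Taranis AI: it assumes a simple bag-of-words representation with cosine similarity, and the function names (tokenize, cosine_similarity, related_items) as well as the sample news items are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_items(query, items, top_k=3):
    """Return up to top_k (title, score) pairs, most similar first.

    Items that share no term with the query are left out.
    """
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(text))), title)
        for title, text in items.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, round(score, 3)) for score, title in scored[:top_k] if score > 0]


# Hypothetical news items, invented for this illustration.
news = {
    "vpn-patch": "Vendor releases patch for critical VPN gateway vulnerability",
    "ransomware": "Ransomware group targets hospitals with phishing emails",
    "policy": "EU publishes new policy document on cyber resilience",
}
article = "Critical vulnerability in VPN gateway exploited before patch"
print(related_items(article, news))
```

In this toy example only the "vpn-patch" item shares terms with the article, so it is the only item returned. A production system would more likely rely on learned text embeddings than on raw term counts, but the ranking step stays the same.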

Project Taranis AI
The CEF project AWAKE and the recently funded research projects NEWSROOM and EUCINF of the European Defence Fund (EDF) assess the application of Taranis NG to improve cyber situational awareness through the analysis of information from the clear and dark web. These projects specifically examine the integration of modern NLP methods into Taranis AI [L1], a fork of Taranis NG. Taranis AI categorises news items using machine learning and extracts relevant entities such as locations, people, company names, products, CVEs, and attacker groups, thereby indexing and labelling the content of each item. This is also the basis for identifying relations and grouping news items about the same event, a crucial step towards creating “stories” [2]. These stories allow human analysts to grasp the most important current “hot topics” more quickly and relieve them of much of the burden of combining or filtering redundant information from different sources. Automatically generated summaries of reports and stories and a collaborative ranking system round off the new features. A screenshot showing how these features are integrated into the Taranis AI user interface is shown in Figure 1. Taranis AI [L1] is open source and free to use under the EUPL.
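
The idea of labelling items with extracted entities and then grouping items that refer to the same event can be sketched as follows. This is a deliberately simplified illustration, not the Taranis AI implementation: it assumes that only CVE identifiers are extracted (with a regular expression instead of a trained model) and that two items belong to the same story whenever they share at least one identifier. The function names and the sample items, including the placeholder CVE numbers, are hypothetical.

```python
import re
from collections import defaultdict

# CVE identifiers follow the pattern CVE-<year>-<four or more digits>.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cves(text):
    """Return the set of CVE identifiers in the text, normalised to upper case."""
    return {match.upper() for match in CVE_PATTERN.findall(text)}


def group_into_stories(items):
    """Group item ids that are linked, directly or transitively, by shared CVEs.

    items maps an item id to its text. The result is a sorted list of
    sorted id lists; an item without any shared CVE forms its own story.
    """
    parent = {item_id: item_id for item_id in items}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # CVE id -> first item that mentioned it
    for item_id, text in items.items():
        for cve in extract_cves(text):
            if cve in first_seen:
                parent[find(item_id)] = find(first_seen[cve])
            else:
                first_seen[cve] = item_id

    stories = defaultdict(list)
    for item_id in items:
        stories[find(item_id)].append(item_id)
    return sorted(sorted(group) for group in stories.values())


# Hypothetical news items with placeholder CVE numbers.
news_items = {
    "a": "Exploit published for CVE-2099-0001 in a popular mail server",
    "b": "Vendor patches cve-2099-0001 and CVE-2099-0002",
    "c": "Attackers chain CVE-2099-0002 with stolen credentials",
    "d": "New EU policy document on incident reporting",
}
print(group_into_stories(news_items))  # [['a', 'b', 'c'], ['d']]
```

Items "a" and "c" share no identifier directly, but both are linked through "b" and therefore end up in the same story. Real story clustering would combine many entity types and text similarity, and would need to avoid merging unrelated events that merely mention the same widely discussed entity.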

Figure 1: The AWAKE UI applying NLP for named entity recognition of news items, title creation for topic clusters, summary creation and story clustering.

Link:
[L1] https://taranis.ai/

References:
[1] Collaborative Cyber Threat Intelligence: Detecting and Responding to Advanced Cyber Attacks at the National Level, F. Skopik, Ed., CRC Press, 2017.
[2] B. Liu, F. X. Han, D. Niu, L. Kong, K. Lai and Y. Xu, “Story forest: Extracting events and telling stories from breaking news,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 14, no. 3, pp. 1–28, 2020.

Please contact:
Florian Skopik, AIT Austrian Institute of Technology, Center for Digital Safety & Security, Austria