In this blog, Dr Sarah James, a data scientist at Fivecast, explores the importance of unstructured data to open-source intelligence investigations and the role that new AI technologies play in uncovering hidden online narratives and empowering intelligence teams through topic analysis.
The AI and OSINT Revolution
With the volume of online data growing constantly, intelligence analysts face the challenging task of analyzing vast amounts of unstructured data (e.g. documents, emails, media articles, and social media posts). Artificial Intelligence (AI) and machine learning have revolutionized this analysis, turning a manual, time-consuming, and expensive process into a largely automated one that is faster, more efficient, and more affordable. As a result, intelligence teams, both government and corporate, have an opportunity to leverage AI-powered text analysis in open-source intelligence to uncover hidden narratives, identify key themes, and gain a deeper understanding of the data.
Beauty and the beast of the data world
We live in a world where global data creation is booming, leading us to a future where data is less structured and more reflective of human behavior and interaction. Because it captures that behavior directly, unstructured data can offer richer insights and support better-informed decisions than structured data alone.
Because structured data is well organized and easily searchable, intelligence analysts can preprocess and analyze it using conventional data techniques and tools. Unstructured data, however, is messy and requires specialized techniques and tools to preprocess it and extract insights, a task that can feel like finding a needle in a haystack.
Figure 1: An example of unstructured data (an AI-generated paragraph about Beauty and the Beast) and examples of its structured form (keyword counts, emotion and sentiment analysis, and general statistics).
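To make the idea in Figure 1 concrete, here is a minimal sketch of turning a short unstructured passage into a structured record of keyword counts and general statistics. The sample text, the tiny stop-word list, and the function name structure_text are illustrative assumptions, not part of any Fivecast tooling, and emotion and sentiment analysis are left out because they need a trained model.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "was", "with", "by"}

def structure_text(text, top_n=2):
    """Turn a free-text passage into a small structured record."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercase word tokens (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    # Count every word that is not a stop word.
    keywords = Counter(w for w in words if w not in STOP_WORDS)
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "top_keywords": keywords.most_common(top_n),
    }

text = "Belle loved books. The Beast loved Belle. The rose wilted in the castle."
record = structure_text(text)
print(record)
```

The resulting dictionary can be stored as a row in a table, which is what makes it searchable with conventional tools.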
Taming the data beast
Advancements in AI and machine learning have provided a wealth of techniques and tools to facilitate the analysis of unstructured data. The choice of technique or tool depends on the type of unstructured data you are dealing with and the information you wish to extract. In this blog, we explore some of the common AI and machine learning tools used for topic analysis, an area of research that focuses on extracting and understanding the topics or themes within textual data and is increasingly important to OSINT investigations.
Uncovering hidden topics
Before neural networks started showing their potential in the mid-2000s, Latent Dirichlet Allocation (LDA) was the most popular and powerful topic discovery technique available. It works on the premise that documents about the same topic will have a lot of words in common. By analyzing word frequencies and co-occurrence patterns, LDA uncovers hidden thematic structures within the documents that may not be obvious from visualization tools such as a word cloud. For example, intelligence analysts could use LDA to analyze a collection of media or news articles relating to a specific region to reveal potential areas of conflict or instability.
Figure 2: A high-level illustration of topic modeling using LDA and the associated outputs.
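For readers who want to see the mechanics, the following is a minimal sketch of LDA fitted with collapsed Gibbs sampling, one common way of estimating the model. It is a toy written with the Python standard library only; the function name lda_gibbs, the hyperparameter values, and the sample documents are assumptions made for illustration, and a real investigation would more likely rely on an established library.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01,
              top_n=3, seed=0):
    """Toy LDA via collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: (topic share in doc) x (word share in topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + len(vocab) * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic mixture for each document (each row sums to 1).
    doc_topics = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Most frequent words assigned to each topic.
    top_words = [
        sorted(vocab, key=lambda w: topic_word[k][w], reverse=True)[:top_n]
        for k in range(num_topics)
    ]
    return doc_topics, top_words

# Hypothetical, pre-tokenized news snippets.
docs = [
    "border troops conflict border shelling".split(),
    "troops shelling conflict ceasefire".split(),
    "election ballot candidate election".split(),
    "candidate ballot campaign vote".split(),
]
doc_topics, top_words = lda_gibbs(docs, num_topics=2)
print(top_words)
print(doc_topics)
```

Each document comes back as a mixture over topics and each topic as a ranked list of words, matching the two outputs sketched in Figure 2. On a corpus this small the result is sensitive to the random seed, so treat the output as a demonstration rather than a finding.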
Grouping similar documents
Topic clustering takes a different approach from LDA: it groups documents from large and diverse collections according to the thematic content they share. This technique leverages embedding models to generate mathematical representations (embedded vectors) of text that capture the context and semantics of documents.
Figure 3: An illustration of the embedding process.
By comparing the embedded vectors of different documents, we can determine their semantic similarity and group the documents into clusters (otherwise known as topics or themes). For example, open-source intelligence analysts could use topic clustering to analyze communications within an organized-crime network to reveal groups within the organization or extract insights into how the network operates.
Figure 4: An example clustering of 10,000 documents into topics based on the semantic similarity of the documents.
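The compare-and-group step can be sketched in a few lines. Note the loud assumption: in place of a learned embedding model, this example uses plain word-count vectors, which capture shared vocabulary but not the context and semantics a real embedding model would. The greedy grouping rule, the 0.5 threshold, and the sample messages are also illustrative choices rather than a description of any particular product.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: a sparse word-count vector (NOT a learned model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster(texts, threshold=0.5):
    """Greedy clustering: join the most similar cluster seed or start a new one."""
    seeds = []   # vector of the first document placed in each cluster
    labels = []  # cluster index assigned to each document
    for text in texts:
        vec = embed(text)
        sims = [cosine(vec, seed) for seed in seeds]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            labels.append(best)
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

# Hypothetical messages for illustration.
messages = [
    "shipment arrives at the port tonight",
    "the shipment leaves the port at dawn",
    "transfer the money to the offshore account",
    "the offshore account received the money",
]
labels = cluster(messages)
print(labels)  # [0, 0, 1, 1]
```

Swapping embed for a call to a sentence-embedding model, and the greedy rule for an established clustering algorithm, would keep the same overall shape while capturing meaning rather than word overlap.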
Classifying documents against topics of interest
While topic modeling techniques like LDA and clustering are great at discovering topics or themes within text, topic classification takes a different approach. This technique involves assigning labels to documents based on their thematic content, which is particularly useful for intelligence analysts who already know which topics they are interested in. It lets analysts prioritize and focus their efforts on the documents most relevant to their investigation. For example, intelligence teams can leverage topic classification to automatically detect documents that mention a narrative or person of interest, thus reducing the burden on intelligence analysts.
Figure 5: An illustration of topic classification: classifying the documents in the middle against an example of eight topics of interest.
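As a rough illustration of assigning labels to documents, here is a small multinomial naive Bayes classifier built from the standard library. Naive Bayes is chosen here only because it is compact enough to show in full; it is not necessarily the method behind Figure 5. The two topics, the four training examples, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids log(0).
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled examples for two topics of interest.
examples = [
    ("payment sent by bank transfer", "finance"),
    ("invoice for the wire transfer", "finance"),
    ("flight booked to the border", "travel"),
    ("passport checked at the border", "travel"),
]
model = train(examples)
predictions = [classify(t, model) for t in ("bank transfer invoice", "border flight")]
print(predictions)  # ['finance', 'travel']
```

In practice the training set would be far larger, and documents that match no topic of interest would need an explicit "other" class or a confidence threshold.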
Expanding the horizons
The emergence of generative AI models, such as Gemini and GPT, has introduced new possibilities for analyzing structured and unstructured data. Given their ability to learn from and process massive amounts of data, generative AI models can capture the nuances of language and identify patterns between words and concepts. Within the field of topic analysis, generative AI models are becoming a promising asset for intelligence analysts to uncover hidden themes in documents or create topic-specific summaries of documents.
Figure 6: An illustration of the process of topic discovery and classification using generative AI.
Additionally, intelligence analysts can leverage generative AI to:
- summarize the documents within topics extracted using LDA,
- generate human-interpretable labels and descriptions of the topics discovered using clustering techniques,
- enrich their understanding of pre-defined topics of interest and refine the boundaries of each topic, and
- extract representative quotes from the documents within each topic.
As covered by my colleague in his blog on the role of AI in open-source intelligence, generative AI comes with its own limitations, meaning that it is not a replacement for human understanding. Common problems include hallucinations (i.e. the model may invent topics or themes that do not exist in the data) and unfair bias or discrimination towards certain demographics or protected attributes. That said, techniques are emerging to alleviate these problems and, hence, deliver more explainable and reliable results.
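One simple safeguard, offered here as an illustrative assumption rather than as one of the emerging techniques mentioned above, is to check that any "representative quote" returned by a generative model actually appears in the source documents. The sketch below does no model call at all; the quotes and documents are hypothetical and the function name verify_quotes is invented for this example.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_quotes(quotes, documents):
    """Map each quote to True if it appears verbatim in any source document."""
    corpus = [normalize(d) for d in documents]
    return {q: any(normalize(q) in d for d in corpus) for q in quotes}

# Hypothetical source documents and model-returned quotes.
documents = ["The convoy left   at dawn.", "Supplies were low."]
quotes = ["left at dawn", "the convoy was attacked"]
checked = verify_quotes(quotes, documents)
print(checked)  # {'left at dawn': True, 'the convoy was attacked': False}
```

A quote flagged False is not proof of hallucination (the model may have paraphrased), but it tells the analyst where to look before relying on the output.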
The Future of OSINT
The field of AI and machine learning is constantly evolving, opening a new world of possibilities for discovering topics or themes within textual data. This delivers significant benefits to open-source intelligence investigations. By embracing these new technologies, intelligence teams can unlock the wealth of information contained within unstructured data and gain a deeper understanding of the changing world around them.
The AI and machine learning capabilities built into Fivecast open-source intelligence solutions are designed to enhance, rather than replace, the decision-making of intelligence analysts and are core to the Fivecast mission of enabling a safer world. We help our customers across national security, defense, law enforcement and corporate organizations collect and analyze masses of structured and unstructured data to uncover emerging narratives, understand networks and detect risks and threats to achieve their intelligence missions.