In this blog, Senior Data Scientist Trent Lewis looks at the foundations of AI and machine learning (ML) and how these are being developed to enhance open-source intelligence with analyst-driven AI and ML.
As the name suggests, Open-Source Intelligence (OSINT) investigations leverage data that is openly available, most commonly from a wide variety of online sources across the Surface, Deep, and Dark Web. In its simplest form, open-source intelligence is a manual task, but it quickly evolves into automated, data-driven analysis as the data grows in both size and complexity across platforms. This is where analyst-driven AI and machine learning come in, helping to filter and risk assess the growing volume, variety, and velocity of data.
Here in the Fivecast Data Science team, we have to work with text across many different threat groups, each with their own specific dialects and contextual use of words. This amounts to thousands of posts across different platforms – way beyond what a single analyst can possibly comprehend in a timely manner. This problem is pushed even further when different languages are considered. Manually analyzing the data by looking for keywords and applying business rules begins to break down at this scale and diversity – and this is before even thinking about putting in more complex or nuanced analyses like extracting the meaning or emotion of the text! Automated, machine-led analysis is required, so we need a way to teach our computers the subtlety and nuances of our online interactions.
From Human Inference to Artificial Intelligence
Imagine a spreadsheet with information about a group of, say, drug traffickers; the columns are facts and the rows are people. These facts or features could help to differentiate between the people or instances. We could include a column where we can record something interesting about each of our persons of interest (POI), e.g., is that POI at risk of going on to traffic a bunch of drugs; yes or no? We can call this special column the person’s class. We could analyze the data by comparing each row and column (instances and features). We might even use a computer to help calculate statistics or examine variations across the people or classes. From this investigation we, the analysts, might generate a set of rules that capture the variation to explain the data and help make a guess or predict the class of a new person. If we can create enough rules, codify those rules as a computer program, and the computer begins to reach the performance of a human, we have created a system that can be deemed as possessing Artificial Intelligence (AI).
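As a rough sketch of what "codifying the rules as a computer program" could look like, the snippet below treats each spreadsheet row as a record and applies a couple of hand-written if-then-else rules to guess the class. The column names (`prior_offences`, `known_associates`, `posts_about_supply`), the thresholds, and the sample rows are all invented for illustration and are not taken from any real data or product.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """One spreadsheet row (an instance); the fields are hypothetical features."""
    name: str
    prior_offences: int
    known_associates: int
    posts_about_supply: bool


def predict_at_risk(p: Person) -> bool:
    """Hand-written analyst rules that guess the class (at risk: yes or no).

    The thresholds are made up for this example only.
    """
    if p.posts_about_supply and p.known_associates >= 3:
        return True
    elif p.prior_offences >= 2:
        return True
    else:
        return False


# Illustrative rows only; not real people or real data.
people = [
    Person("A", prior_offences=0, known_associates=4, posts_about_supply=True),
    Person("B", prior_offences=3, known_associates=0, posts_about_supply=False),
    Person("C", prior_offences=1, known_associates=1, posts_about_supply=True),
]

predictions = {p.name: predict_at_risk(p) for p in people}
print(predictions)
```

Every rule here had to be thought up and typed in by a person, which is exactly the effort that becomes hard to sustain as the number of features and instances grows.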
However, the average person isn’t great at pulling out robust rules from data. This is partly due to our implicit biases and prejudices (good and bad) and partly due to our limited capacity to deal with the immense amount of data now at our fingertips thanks to the Internet and social media, where even separating truth from fake news is a challenge. So, we can go one step further and get the computer to come up with the rules automatically.
A machine can learn robust rules by examining data. This is Machine Learning (ML). A classical machine learning algorithm for deriving rules from examples, as described above, is the Decision Tree. The seminal algorithm in this space is C4.5 by Ross Quinlan, in which each feature is assessed to see how well it alone splits the instances into different classes, or how much information is gained from the decision (Quinlan, 1993). Splitting the data on a feature creates part of a rule. This process continues for each new grouping, splitting on different features until the sub-groups mostly contain examples from a single class. The result is a tree from which we can extract a set of if-then-else rules, akin to what a human might create, but automatically.
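To make "how much information is gained" a little more concrete, here is a minimal sketch of the splitting criterion in plain Python. It computes plain information gain (the reduction in entropy after splitting on one feature); C4.5 itself normalizes this into a gain ratio and adds further steps, such as handling continuous features and pruning, that are left out here. The feature names and the four toy rows are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on a single feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the sub-groups created by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """Pick the feature whose split gives the largest information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Toy data: hypothetical features, with the class "at risk: yes or no".
rows = [
    {"posts_about_supply": True, "uses_vpn": True},
    {"posts_about_supply": True, "uses_vpn": False},
    {"posts_about_supply": False, "uses_vpn": True},
    {"posts_about_supply": False, "uses_vpn": False},
]
labels = ["yes", "yes", "no", "no"]
features = ["posts_about_supply", "uses_vpn"]

for f in features:
    print(f, round(information_gain(rows, labels, f), 3))
print("split on:", best_feature(rows, labels, features))
```

In this toy data, splitting on `posts_about_supply` separates the two classes perfectly, while `uses_vpn` tells us nothing, so the first split of the tree (and the first part of a rule) would use `posts_about_supply`. A full decision tree learner would repeat this choice inside each resulting sub-group.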
People tend to be good at learning what constitutes a new “thing” (species of bird, type of TV, piece of software, genre of fiction, etc.) from a small set of examples and, for the most part, can readily make generalizations and inferences about these new things based on what they know about the world. We’re also very good at interpreting information across different modalities – I can usually pick up sarcasm pretty well both in spoken conversation and in text. However, getting a computer program to do the same thing (e.g., classifying examples into groups or determining the emotion of text) has proven to be not so straightforward.
One class of machine learning algorithms that has made great use of modern computing technologies, such as graphics processing units (GPUs), and of the huge amounts of data the internet offers is Neural Networks (see 3Blue1Brown for an excellent video on how neural networks work). In the image classification space, Convolutional Neural Networks (CNNs) announced themselves when AlexNet trounced its old-school contemporaries at the ImageNet Large Scale Visual Recognition Challenge 2012, beating its closest competitor by more than 10 percentage points in top-5 error rate (Krizhevsky et al., 2017). Now, publicly available models routinely operate at levels above 95% accuracy.
In Natural Language Processing (NLP), neural network-based Transformers are revolutionizing how scientists and technologists approach the processing of text. Instead of recreating the rules of language by hand, billions of pieces of text are used to automatically derive the interesting parts of language, leading to state-of-the-art performance on many NLP tasks that were traditionally the domain of expert linguists. Transformers are also bridging the gap between image and text through multimodal models that can deal directly with the mixed data (text, images, videos, URLs) found on the Internet.
The Future of AI & ML for Open-Source Intelligence
Alongside all of these advances, there are also many caveats and limitations to AI and machine learning. Interpretability and transparency of AI and machine learning decisions, limitations of the domains, and data scarcity (not everything is on the Internet), as well as the ethical issues of data usage and the application of AI and machine learning to everything, are a few of the challenges that face creators and users of the technology. Developing open-source intelligence solutions that take these limitations into account and try to reduce them is integral to the future application of AI & machine learning models in any industry.
Fivecast is constantly innovating and enhancing our AI and machine learning capabilities to provide a risk assessment framework that supports intelligence missions and adapts to the changing threat landscape. Fivecast ONYX, our premier OSINT solution, is designed to enhance rather than replace the vital skills of the analyst by combining the best AI and machine learning models and business rules. In addition, automated, customizable, and user-trainable models are deployed that require little to no data science knowledge, enabling analysts to filter and prioritize “risky” data to defined or pre-defined detector sets.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
Krizhevsky, A., Sutskever, I., Hinton, G. E. (2017). “ImageNet classification with deep convolutional neural networks”. Communications of the ACM, 60 (6): 84–90. doi:10.1145/3065386.