In this blog, Senior Data Scientist Dr Trent Lewis and Product Manager Jarrad Lisman explore how OSINT can harness the volume, velocity and variety of data to help analyze evolving online communities.
In 2020, about 64 zettabytes of data were created: that’s 64 trillion gigabytes of data! By 2025, that’s expected to grow to 180 zettabytes. So how does this volume of data translate into examining online activity? Well, the velocity of that data reflects the fact that internet users spend 147 minutes on social media per day. This data does not come only from well-known social media platforms; it also comes from closed-community social media, and from deep and dark web sources. Because social media is not the only source of data, staying on top of the volume, velocity, and variety of data becomes far more challenging.
Finding multiple needles in multiple haystacks
Some might say that identifying persons of interest or risky online content in OSINT is like “finding a needle in a haystack”. However, we have different types of needles (persons, communities, events, hashtags) in multiple haystacks (different data sources). This “multiple needles in multiple haystacks” problem presents a very difficult challenge for analysts and intelligence teams to solve. How do we choose which needle to look for, how many can we afford to look for, and in which haystack?
The fact that we can’t focus on them all at the same time exacerbates these problems. When an analyst inevitably needs to pivot to the next threat, the context switch carries an organizational cost: time must be spent researching and building a body of knowledge about the new threat, whilst current knowledge of the previous threat becomes stale.
Building Intelligence – External Sources
Fortunately, there are external sources that can assist analysts with the challenge.
Academia: Academics and think tanks spend long periods of time researching and monitoring the narratives and the evolution of threat groups that may be of interest to law enforcement and intelligence. These range from alt-right groups and outlaw motorcycle gangs (OMCGs) through to foreign influence. However, whilst this research provides excellent definitions and characterisations of groups of interest, it is often best used as a seed point: a place to start looking for more specific and tangible threats.
Third-party data sources: Third-party data consists mostly of data collected from a combination of open and closed sources, including the dark web and public leaks, and, on occasion, from closed forums where the providers spend time obtaining access and harvesting content. The challenge here is that these sources often do not cover deep web data, leaving a gap in coverage, and they have limited filtering capability: the user can search for key terms of interest but can’t further filter the results to prioritise and triage them. So, after receiving 1,000 results for a keyword, a user must still sift through that information.
Independently, each external source will help an analyst achieve their objectives, but combining them achieves a much greater outcome. Imagine if you could take the knowledge and expertise from Academia, apply it to Third-party data sources and then use that to filter or search. Not only does that change the way in which you can query the data, it also expands the lines of questioning you can pursue, opening opportunities for strategic insights and patterns of behaviour, and addressing the issues faced when having to pivot to a new threat in a dynamic environment.
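As a rough illustration of applying research-derived knowledge to third-party search results, the sketch below scores each result against a weighted list of terms and returns only the results above a threshold, highest score first. Everything in it is hypothetical: the `LEXICON` terms and weights, the `Post` structure, the sample posts and the `triage` function are invented for this example and do not represent any particular product or data provider.

```python
from dataclasses import dataclass

# Hypothetical lexicon: term weights an analyst might derive from academic
# research on a group of interest. Terms and weights are illustrative only.
LEXICON = {"recruit": 2.0, "rally": 1.0, "manifesto": 3.0}


@dataclass
class Post:
    source: str  # hypothetical data source name
    text: str


def score(post, lexicon):
    """Sum the weights of the lexicon terms that appear in the post text."""
    words = set(post.text.lower().split())
    return sum(weight for term, weight in lexicon.items() if term in words)


def triage(posts, lexicon, min_score=1.0):
    """Return (score, post) pairs at or above min_score, highest score first."""
    scored = [(score(p, lexicon), p) for p in posts]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Invented sample results standing in for a keyword search response.
posts = [
    Post("forum_a", "New manifesto posted ahead of the rally"),
    Post("forum_b", "Weather looks fine this weekend"),
    Post("leak_c", "They want to recruit more members"),
]

ranked = triage(posts, LEXICON)
for s, p in ranked:
    print(f"{s:.1f}  {p.source}  {p.text}")
```

A real system would likely need far richer matching than exact word lookup (multiple languages, images, context), but the same idea applies: expert knowledge becomes a filter that prioritises results instead of leaving the analyst to sift through all of them.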
Growing a Network
If an analyst has an investigation with a specific target or seed developed from internal or external sources, they also need to be aware of the target’s associates: their community. This community could be analyst-curated, painstakingly hand-crafted by reviewing posts and friend lists. However, due to the velocity of the data, the community changes, and its upkeep is another time sink for the time-poor analyst, lest it become stale and outdated, leading to erroneous connections.
This is where the volume of the data becomes an asset. Data-driven tools are available to automatically build out a community from a seed. For example, a tool can automatically analyze a friend or follower list, and then friends of friends or followers of followers, to build out a connected community. This network of accounts, or social graph, can give deeper insight into not only the activities of a particular individual but also how that individual operates within their community. Are they an influencer, a propagator of (mis)information, or simply a consumer?
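The seed expansion described above can be sketched as a breadth-first search with a hop limit. This is a minimal sketch under stated assumptions: the `CONNECTIONS` data, the account names and the `build_community` function are hypothetical, and in practice the friend or follower lists would come from collected platform data.

```python
from collections import deque

# Hypothetical follower data: account -> accounts it is connected to.
CONNECTIONS = {
    "seed": ["a", "b"],
    "a": ["seed", "c"],
    "b": ["seed", "c", "d"],
    "c": ["a", "b", "e"],
    "d": ["b"],
    "e": ["c", "f"],
    "f": ["e"],
}


def build_community(seed, connections, max_hops=2):
    """Breadth-first expansion from the seed; returns {account: hops from seed}."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        account = queue.popleft()
        if hops[account] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in connections.get(account, []):
            if neighbour not in hops:
                hops[neighbour] = hops[account] + 1
                queue.append(neighbour)
    return hops


# Friends (1 hop) and friends of friends (2 hops) of the seed account.
community = build_community("seed", CONNECTIONS, max_hops=2)
print(community)
```

The hop limit matters because each extra hop can grow the community very quickly; two hops corresponds to the "friends of friends" case mentioned above.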
To further support the analyst, graph analytics can be applied to guide discovery within the social network: closeness centrality helps identify influencers; link-based measures (such as Google’s PageRank) show how connected an individual is to those influencers; clique or group detection surfaces groups of users important to the investigation; and betweenness centrality identifies the key links through which information propagates. These tools, combined with an Augmented Intelligence approach to analysis, allow an analyst to quickly gain insights into an investigation not only from a minimal starting seed, but also with minimal effort.
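To make two of these measures concrete, the sketch below computes closeness centrality and a simplified PageRank on a small undirected graph using only the Python standard library. The graph, account names and function names are hypothetical, and the PageRank here is a basic power iteration that treats each connection as a two-way link and assumes every account has at least one connection; it is an illustration, not a description of how any particular tool computes these scores.

```python
from collections import deque

# Hypothetical undirected social graph: account -> set of connected accounts.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}


def closeness(graph, node):
    """(reachable nodes) / (sum of shortest-path distances); assumes a connected graph."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration; every node must have a neighbour."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Each neighbour shares its rank equally among its own connections.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


close = {node: closeness(GRAPH, node) for node in GRAPH}
ranks = pagerank(GRAPH)
print("highest closeness:", max(close, key=close.get))
print("highest PageRank:", max(ranks, key=ranks.get))
```

In this toy graph the same account scores highest on both measures, but on real networks the measures often disagree, which is why they are used together rather than interchangeably.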
Harnessing the volume, variety and velocity of data
Just as the volume of data is now an asset, the variety in the data forms the basis of further augmented intelligence analysis. Fivecast’s signature OSINT solution, Fivecast ONYX, enables intelligence teams to harness the volume, velocity and variety of data to enhance investigations. It combines advanced data collection across open-source platforms, including Surface, Deep and Dark Web sources, with AI-enabled risk analytics delivered through sophisticated, user-configurable detectors. Fivecast ONYX acts as a force multiplier for analyst teams, delivering powerful, multilingual open-source data collection and analysis capabilities that combine to deliver actionable insights from vast quantities of unstructured, multimedia data.