
Data Lakes and Niche Sources: Addressing National Defense Requirements

Alex Ciarniello June 16, 2020 OSINT, National Defense

Reliable data is the linchpin of effective decision making—particularly in matters of national security.

The intelligence community must gather and correlate a variety of data types and sources to drive its missions with actionable information. Unless this data is stored and accessed in a well-maintained data lake, the process can quickly become inefficient and frustrating for data scientists.


This is why a well-maintained data lake is the foundation of an effective application programming interface (API), which allows data scientists and programmers in defense to integrate raw data feeds into customized tooling and interfaces.

What is a data lake, and what are the benefits of a data lake for an effective API solution in government and defense?

What Is a Data Lake?


Organizations in every sector are handling more data than ever before. In fact, the average company’s data repository grows by 50% every year. To analyze data effectively en masse, organizations need to store and access it in a flexible and scalable way. Enter the data lake.

A data lake is a repository that stores all of an organization’s data in any format, raw or structured. Data is transformed only when a specific application requires it (a technique known as “schema on read”). This gives data scientists the flexibility to reconfigure relevant data as needed for different purposes, such as visualizations, analytics, and machine learning, and to get results faster.
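To make “schema on read” more concrete, here is a minimal Python sketch. It assumes a tiny in-memory "lake" of raw JSON lines; the record fields (`source`, `author`, `text`, `body`, `published`, `ts`) and the function names are hypothetical, chosen only for illustration. The raw records are stored untouched, and each application applies its own schema at the moment it reads them.

```python
import json
from collections import Counter
from datetime import datetime

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
# Note that the two sources use different field names for text and timestamps.
RAW_LAKE = [
    '{"source": "forum_a", "author": "user1", "text": "hello", "published": "2020-06-01T10:00:00"}',
    '{"source": "chat_b", "author": "user2", "body": "hi there", "ts": "2020-06-02T12:30:00"}',
    '{"source": "forum_a", "author": "user3", "text": "another post", "published": "2020-06-02T08:15:00"}',
]


def read_for_timeline(raw_line):
    """Apply a timeline schema at read time: (source, publication date)."""
    record = json.loads(raw_line)
    stamp = record.get("published") or record.get("ts")
    return record["source"], datetime.fromisoformat(stamp).date()


def read_for_text_analysis(raw_line):
    """Apply a different schema to the same raw data: (author, text)."""
    record = json.loads(raw_line)
    return record["author"], record.get("text") or record.get("body", "")


# Two applications, two schemas, one unchanged set of raw records.
posts_per_day = Counter(day for _, day in map(read_for_timeline, RAW_LAKE))
texts = [read_for_text_analysis(line) for line in RAW_LAKE]
```

Because nothing is transformed at write time, a third application could later read the same lines with yet another schema without re-ingesting anything.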

This differs from a traditional data warehouse, which contains a defined set of highly structured data to support specific queries and reporting. A warehouse is designed more for business users seeking KPIs and other well-defined criteria than for data scientists, who require an agile approach to data analytics and other applications.

A data lake requires a fine balance of curation and flexibility. The goal isn’t to curate data to the point that it becomes inflexible for users, as with a data warehouse. But the data must still be organized and catalogued as it is added if it is to be useful for applications like machine learning. An unmaintained data lake inevitably becomes a swamp of disorganized clutter.

Benefits of a Data Lake for Military & Defense

Based on this definition, data lakes enable data scientists to do their jobs effectively—and defense applications are no exception. What makes data scientists working in this space unique is the sources they access and their requirements for raw data integrations. Generally, data scientists in defense require: 

  • A wide variety of online data sets from multiple vendors. This includes everything from dark web networks to public social media sites and chat applications.
  • The flexibility to integrate these data feeds seamlessly into other tooling and mission-driven requirements. This typically involves bespoke data feed integrations, often supporting lower-level intelligence analysts on more intuitive interfaces.
  • Artificial intelligence and machine learning capabilities, which have been identified as a major priority for military and defense. Developing machine learning models requires a historical database of catalogued data—in other words, a data lake.

To satisfy these needs, data scientists in defense require APIs that are backed by a well-maintained data lake and that provide access to the data feeds relevant to their missions.

Satisfying Defense Requirements with a Diverse API


There is no silver-bullet API for defense users. Data scientists require multiple raw data feeds from different vendors, and develop tools to cross-reference these feeds to glean more powerful insights. In a counter-terrorism mission, for example, this could mean automatically linking extremist users or political actors across dark web and social platforms to real-life personas.

In use cases like these, a variety of well-known and fringe social, deep, and dark web networks are increasingly relevant. Cross-referencing these sources supports counter-terror initiatives by signalling national security threats. It also enables intelligence analysts to locate emerging threat actors and less-understood terror groups. Understanding how these entities operate, communicate, and even recruit is incredibly valuable, especially when their activity is analyzed alongside other data feeds intelligence teams are already using.


However, many commercial, off-the-shelf APIs don’t offer access to the more hidden, specialized sources in combination with mainstream sources. And many vendors who do offer some of the more obscure feeds in an API prioritize real-time data streams, but do not maintain the data lake required for data science applications in defense.

To meet this need, Echosec Systems has developed a proprietary API that combines well-known data sources like dark web marketplaces and mainstream social networks with obscure social data sources on the deep and dark web. The Echosec Systems API, which is built on a data lake, allows data scientists to integrate unstructured data from these sources into existing tooling or other bespoke solutions. These feeds allow users to get more value from their existing data sets.

Specifically, the Echosec Systems API:

  • Provides access to a number of fringe social media networks that are relevant to national security.
  • Indexes, normalizes, and, in some cases, tags content using machine learning classifiers for specific categories like identity hate and data disclosure.
  • Maintains a data lake by cataloguing post metadata, including source name, author identifiers, crawled and published dates, board names, and more. 
  • Provides access to a large repository of catalogued historical social data ideal for data scientists developing machine learning models and other defense applications.
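As an illustration of what cataloguing post metadata can look like in practice, here is a minimal Python sketch. It is not the Echosec Systems API or its schema: the `CatalogEntry` type, the `catalogue` and `historical_slice` functions, and the raw field names are all hypothetical. It only shows how the metadata named above (source name, author identifier, crawled and published dates, board name) can be normalized on ingest so that historical data can later be pulled by source and date range, for example to assemble a training set.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalogue record for one post."""
    source_name: str
    author_id: str
    published: date
    crawled: date
    board_name: Optional[str] = None


def catalogue(raw_post: dict) -> CatalogEntry:
    """Normalize a raw post's metadata into a consistent catalogue entry."""
    return CatalogEntry(
        source_name=raw_post["source"].strip().lower(),
        author_id=str(raw_post["author"]),
        published=date.fromisoformat(raw_post["published"]),
        crawled=date.fromisoformat(raw_post["crawled"]),
        board_name=raw_post.get("board"),
    )


def historical_slice(entries, source_name, start, end):
    """Return entries from one source published within [start, end]."""
    return [
        e for e in entries
        if e.source_name == source_name and start <= e.published <= end
    ]


# Made-up sample posts with inconsistent source casing and author types.
raw_posts = [
    {"source": " Forum_A", "author": 101, "published": "2020-01-15",
     "crawled": "2020-01-16", "board": "general"},
    {"source": "forum_a", "author": "u202", "published": "2020-05-02",
     "crawled": "2020-05-02"},
    {"source": "chat_b", "author": "u303", "published": "2020-02-10",
     "crawled": "2020-02-11"},
]

catalog = [catalogue(post) for post in raw_posts]
training_window = historical_slice(
    catalog, "forum_a", date(2020, 1, 1), date(2020, 3, 31)
)
```

Normalizing on ingest is what keeps the lake from turning into a swamp: the raw content stays flexible, while the metadata stays consistent enough to query.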

For defense ministries, data lakes and API integrations are not new. What is new is how national security missions are evolving as online data sources and machine learning capabilities become more relevant. Accessing the more obscure social data sources, including networks like Gab, 4chan, and Telegram, is crucial for the intelligence community as the international threat landscape evolves. 

This points to a growing need for social APIs that not only provide comprehensive access to known and emerging online sources, but are also underpinned by well-maintained data lakes. This will only add more context to mission-driven environments for defense, and ultimately enable more informed national security decisions.

Looking for more data feeds or custom integrations? Contact us for more information.

