Threat Hunting Pyramid of Pain

Or should we say an Intel-Driven Data Analysis Pyramid of Pain?

Hello everyone, it seems like a long time since this cyberscout wrote one of his usual stories.

Perhaps it is because the tales of a cyberscout are simply whispers by the fireside in a forest's clearing. Sometimes you will find me there, sometimes you will not.

Find me where, you may ask? What forest? Well, look around you: it is the one you are in right now. Sometimes we are enthralled by the crackling fire, in a hypnosis of sorts, and forget that we are in a forest.

In the wilderness, you need stories. Stories help you notice little things nobody else notices, stories become your inner compass, and they guide you towards meaning.

My stories speak of fantastic creatures and magical artefacts forged by mysterious wizards. This is how lore is passed down through generations. Generative lore I like to call it.

Allow yourself to sit and tune into the right frequency, though, and it will offer many hints about a much deeper history.

It is about vibration.

It cannot be reduced to ML algorithms.

There is no "statistical shortcut" through the sensorial centre of gravity that brings meaning to our lives.

Today I will tell you a story about a Pyramid, an inverted one that is. There is no specific reason for it, other than that, to me, the narrow pointy base represents less effort (less surface, less scaffolding needed to maintain the structure) whereas the wider top represents higher effort (more surface, more scaffolding and upward energy/resources required to maintain the structure). If at any point the flow of resources needed to maintain the inverted pyramid's structure is interrupted, it risks collapsing on itself.

For me, the inverted pyramid represents a struggle.

A fight against increasing entropy and decay.

Therein lies the pain. And the gain?

IDDA: Intel-Driven Data Analysis Pyramid of Pain


I originally wanted to call this a threat hunting pyramid of pain, but then I quickly realized it can be applied to many security data analysis scenarios, from detection engineering to threat intel and alert triage.

There is an underlying principle behind any successful security investigation: the need for intelligent, prioritized data analysis.

To drive smart decisions, data and information need to be analysed, folks; it doesn't matter what you plan to do with them afterwards. I've written about this extensively in RIDE: Intel Driven Research Pipeline, The Problem of Why: Threat-Informed Prioritization and The Threat Hunting Pipeline. That is why I'm generalizing the "hunt" pyramid into an IDDA (intel-driven data analysis) pyramid.

Whatever dude... enough "toying with words", show me the goods!

Well, that would be describing the levels of the pyramid.

But first, remember something: this is just a model. It is not a how-to; I'm not telling you what to do or how you should do it, and it's up to you to decide whether the model is useful or not. As they say in statistics:

all models are wrong, but some are useful

A note on assumptions: this pyramid assumes that you start from a near-zero point of "minimal knowledge" about the threat and your environment.

Step 1: Contextual Entity and Relationship Extraction

As information about threats finds its way to your pipeline of work, the first thing the model asks for is to extract basic situational information to understand its significance. It is about developing a semantic layer that helps orient your analysis.

For this, an analyst would have to obtain information based on the 5 Ws: who, what, when, where, why. This basic data will also help filter out non-significant information.

Hint: journalists have been using this method as far back as 1913 to capture the essence of a "lead" or story.

This is the start of your knowledge graph (this is why I talk about "entity" and "relationship" extraction: these are your nodes and edges; whether you do it formally or informally, this is what you are doing here implicitly).
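To make the node-and-edge idea concrete, here is a minimal sketch of capturing 5-Ws answers as an informal knowledge graph. Everything in it (the actor name, sector, dates, relations) is hypothetical, invented purely for illustration:

```python
# Minimal knowledge-graph sketch: entities become nodes, relationships become
# edges. All names and report details below are hypothetical.

nodes = set()
edges = []  # (source, relation, target) triples

def relate(source: str, relation: str, target: str) -> None:
    """Record one entity-relationship triple."""
    nodes.update([source, target])
    edges.append((source, relation, target))

# Answering the 5 Ws for a hypothetical intel report:
relate("ActorX", "targets", "Financial sector")          # who / whom
relate("ActorX", "uses", "Credential phishing")          # what
relate("Credential phishing", "observed_on", "2024-05")  # when
relate("ActorX", "operates_in", "EU region")             # where
relate("ActorX", "motivated_by", "Financial gain")       # why

for src, rel, dst in edges:
    print(f"{src} --{rel}--> {dst}")
```

Whether you later move this into a graph database or keep it in a notebook, the point is the same: the 5 Ws naturally decompose into triples.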

Step 2: Structural Entity and Relationship Extraction

Once we've gained situational awareness and have applied a first filter on information, we need to inspect it for the low-hanging fruit: what I call atomic indicators and what we normally know as IOCs.

This information might be structured in the form of a STIX schema, but even if it's not, it will be easy for you to recognize it: domains, IPs, URLs, and fingerprints of all kinds (file hashes, PEHash, ImpHash, SSDEEP, JA3, JARM, JA4+, etc.).

Atomic indicators are the first layer of actionable information that you can plug into your alert validation, data analysis or threat hunting workstreams.

I call it Structural ERE (Entity and Relationship Extraction) because it is about inherently structured pieces of data that normally follow specific formats and patterns, making them easier to identify and extract than the unstructured information analyzed in the previous step.
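Because atomic indicators follow recognizable formats, even naive pattern matching goes a long way. A rough sketch (the regexes are deliberately simplistic, and the report text is invented; production CTI extractors are far more robust):

```python
import re

# Rough, illustrative regexes for a few common atomic indicator types.
# These are a sketch, not production-grade patterns.
IOC_PATTERNS = {
    "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "md5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
}

def extract_iocs(text: str) -> dict[str, list[str]]:
    """Return every pattern match found in a free-text intel report."""
    return {kind: pat.findall(text) for kind, pat in IOC_PATTERNS.items()}

# Hypothetical report snippet:
report = ("Beacon to evil-c2.example.com (10.1.2.3), "
          "dropper md5 d41d8cd98f00b204e9800998ecf8427e")
print(extract_iocs(report))
```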

Step 3: Behaviour Extraction

At this stage in the model, we now have situational awareness and we have extracted easily recognizable atomic indicators. What we are missing is the how and the second what that goes with it.

Asking how is focusing on specifics, it is asking about tradecraft, techniques, procedures and operations.

Some of the questions you may ask yourself in this stage are:

  • How does the hypothetical attack or attack chain work?
  • What kind of system evidence would provide information about the attack chain? Are there any artifacts, logs, or indicators that can be used for further analysis or detection?
  • How was the initial access gained? Was it through phishing, exploitation of a vulnerability, or some other method?
  • What tools and techniques were used in each stage of the attack? Were there specific malware families deployed? Any unusual command-and-control mechanisms?
  • What vulnerabilities were exploited? Are these known vulnerabilities? If not, what is the potential impact?
  • What are the techniques used? What are the procedures for each sub-technique?

Unlike atomic indicators, behavioural ones are composite: they normally require more expressive logic to be described than a single datapoint (like an IP). Our intent is to identify these composite behavioural indicators. Your best friend here will be MITRE ATT&CK.
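One way to see why behavioural indicators need "more expressive logic" is to write one down as a rule of co-occurring conditions mapped to an ATT&CK technique. The event fields, sample event and rule below are hypothetical; only the technique ID (T1059.001, Command and Scripting Interpreter: PowerShell) is a real ATT&CK entry:

```python
# A composite behavioural indicator: the rule fires only when several
# conditions hold together, unlike a single-datapoint atomic indicator.

from dataclasses import dataclass, field

@dataclass
class BehaviourRule:
    name: str
    attack_technique: str  # MITRE ATT&CK technique ID
    conditions: list = field(default_factory=list)  # predicates over an event

    def matches(self, event: dict) -> bool:
        return all(cond(event) for cond in self.conditions)

suspicious_powershell = BehaviourRule(
    name="Encoded PowerShell spawned by an Office app",
    attack_technique="T1059.001",
    conditions=[
        lambda e: e.get("process") == "powershell.exe",
        lambda e: "-enc" in e.get("cmdline", "").lower(),
        lambda e: e.get("parent") in {"winword.exe", "excel.exe"},
    ],
)

event = {"process": "powershell.exe",
         "cmdline": "powershell.exe -enc SQBF...",
         "parent": "winword.exe"}
print(suspicious_powershell.matches(event))
```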

Step 4: Discovery Analytics

This is your first data analytics step where you shift your attention not towards hypothetical threat intel but towards your actual telemetry and data lake. Discovery analytics is your first stage of Exploratory Data Analysis (EDA).

Remember, we assume that you don't have a full understanding of your data yet; you have to put in the work to figure out the type, shape, volume and completeness of your telemetry. As you grow more familiar with your data, this step will be less involved:
  • What types of data do I have? Identify the different sources of data you have access to (logs, network traffic, endpoint data, etc.) and the types of fields they contain (timestamps, user IDs, file names, etc.).
  • What is the overall shape of my data? How many rows (observations) and columns (features) does my dataset have? This gives you a basic understanding of the scale and dimensionality of your data. Evaluate how many entries/rows/logs you have during a certain time period, and run deduplication queries.
  • What are the basic statistics of my data? Calculate summary statistics (mean, median, standard deviation, range, etc.) for numerical variables to get a sense of the central tendencies and spread of your data.
  • Are there any time-based patterns or trends? Create time series plots to identify any changes or patterns over time.
  • What are the distributions of my data? Visualize the distributions of your variables (histograms, box plots, density plots) to understand their shape and identify any outliers or unusual patterns.
  • What is the quality of my data? Assess the completeness, accuracy, and consistency of your data. Identify any missing values, duplicates, or inconsistencies that may need to be addressed.
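The questions above can be sketched in a few lines. In practice you would use pandas or your SIEM's query language; this stdlib-only toy (all the log rows are made up) just shows the shape/dedup/missing/stats pass:

```python
import statistics
from collections import Counter

# Hypothetical log rows, used only to illustrate a discovery-analytics pass.
rows = [
    {"ts": "2024-05-01T10:00", "user": "alice", "bytes": 120},
    {"ts": "2024-05-01T10:05", "user": "bob",   "bytes": 90},
    {"ts": "2024-05-01T10:05", "user": "bob",   "bytes": 90},   # duplicate
    {"ts": "2024-05-01T11:00", "user": "alice", "bytes": None}, # missing value
]

# Shape: how many observations and features?
print("rows:", len(rows), "columns:", len(rows[0]))

# Deduplication: how many unique entries?
unique = {tuple(sorted(r.items())) for r in rows}
print("unique rows:", len(unique))

# Quality: missing values per column.
missing = Counter(k for r in rows for k, v in r.items() if v is None)
print("missing:", dict(missing))

# Basic statistics for a numerical field.
values = [r["bytes"] for r in rows if r["bytes"] is not None]
print("mean:", statistics.mean(values),
      "stdev:", round(statistics.stdev(values), 1))
```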

Step 5: Atomic Analytics

So now you have a better understanding of what types of telemetry are available to you and what fields you can use to achieve analysis objectives. Armed with this knowledge, you can easily run analytic queries that search for the presence of the atomic indicators you extracted in Step 2.

Any observed matches shouldn't be immediately considered "bad juju", but rather notable datapoints or events that require further investigation.

The model does not concern itself with the level of granularity here, but I assume any defender out there would want to be detailed and granular to ensure you are not missing logs from remote corners of the network.
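At its core this step is a join between your indicator set and your telemetry. A minimal sketch, with invented indicator values and events (and remembering that a hit is "notable", not proof of compromise):

```python
# Flag telemetry events that reference known atomic indicators.
# Indicator values and events below are hypothetical.

iocs = {
    "ip": {"10.1.2.3", "203.0.113.7"},
    "domain": {"evil-c2.example.com"},
}

telemetry = [
    {"event": "dns_query", "domain": "evil-c2.example.com", "host": "ws-042"},
    {"event": "net_conn", "ip": "198.51.100.9", "host": "ws-017"},
    {"event": "net_conn", "ip": "203.0.113.7", "host": "srv-db-01"},
]

def match_iocs(events, indicators):
    """Yield (event, ioc_type) pairs where an event field hits a known IOC."""
    for ev in events:
        for ioc_type, values in indicators.items():
            if ev.get(ioc_type) in values:
                yield ev, ioc_type

for ev, kind in match_iocs(telemetry, iocs):
    print(f"notable: {kind} hit on {ev['host']}: {ev}")
```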

Step 6: Behavioural and Anomaly Analytics

Searching for behavioural patterns in your data is a complex effort. You need to have done a really good in-depth analysis in Steps 2, 3 and 4 to have a solid understanding of attack behaviours, expected system evidence (logs and other artefacts) and available telemetry. We aim to run analytic queries that will help us identify patterns indicative of suspicious behaviour.

Sometimes data queries won't be enough and you will have to reach out to users, managers and other stakeholders to shed light on observed anomalies.

An anomaly is simply an observation or pattern that significantly deviates from expected behaviour; as such, anomaly detection can be classified as a form of behavioural analytics.

The kinds of actions and questions I would want to answer in this stage of the pyramid are:

  • Establish Baselines: If you haven't already, establish a baseline of normal behaviour for the types of behaviours you are analysing: system processes/handles/DLLs, authentication activity, memory allocation, network traffic, etc. This provides a reference point for identifying anomalies.
  • Are there any unexpected or anomalous events? Look for outliers, spikes, or sudden changes in the data that may indicate unusual activity.
  • Are there any signs of lateral movement or privilege escalation? Look for events that indicate an attacker moving between systems, gaining access to sensitive data, or attempting to elevate their privileges.
  • Are there any correlations between variables? Calculate correlation coefficients or create scatter plots to identify any relationships between pairs of variables.
  • Are there any clusters or groupings in my data? Apply clustering algorithms or visualize the data in lower dimensions (PCA, t-SNE) to identify any natural groupings or segments within your data.
  • How do different variables interact with each other? Explore interactions between variables through techniques like pivot tables, grouped analysis, or multi-dimensional visualizations.
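The baseline-then-deviation idea in the first two bullets can be sketched with a simple sigma rule. The counts are invented, and the 3-sigma threshold is a common but ultimately arbitrary choice you should tune to your own data:

```python
import statistics

def find_anomalies(baseline: list, observed: list, sigma: float = 3.0) -> list:
    """Return observations further than sigma stdevs from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [x for x in observed if abs(x - mean) > sigma * stdev]

# Hypothetical daily authentication-failure counts for one service account.
baseline_failures = [2, 3, 1, 4, 2, 3, 2, 3, 1, 2]
todays_window = [3, 2, 41, 2]  # one spike worth a closer look

print(find_anomalies(baseline_failures, todays_window))
```

Real behavioural analytics will of course layer context on top (seasonality, peer groups, the stakeholder conversations mentioned above); the sketch only captures the deviation-from-baseline core.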

Step 7: Operational Environment Assessment

Up until now we have focused on extracting meaningful information from threat intelligence and have projected that against the backdrop of our data lake. We interrogated our intel and our telemetry data but did not interrogate control data.

Without gathering information from our attack surface, existing vulnerabilities and security control deployments, we have an incomplete picture of the potential impact of threats analysed so far.

When performing discovery work, you may not want to get this far into the pyramid since this stage is not merely about understanding isolated operational controls, it is about correlating these with all the pieces of information you have collated so far to form a better picture of how investigated attack vectors and chains can impact your environment.

We are entering into the territory of threat modelling, attack surface and vulnerability management.

The kinds of questions I would be looking to answer here are:

Attack Surface Assessment:

  • What assets could be potentially impacted? What are our critical systems, data, and infrastructure? What is their value? Can Crown Jewels potentially be impacted?
  • What vulnerabilities relate to the attack vectors investigated so far? What are the known weaknesses in our systems, applications, and network? Are there any relevant unpatched vulnerabilities?
  • What is the level of exposure of assets to the Internet? What is the attack surface for each asset?
  • Which vulnerabilities are most likely to be exploited? Based on threat intelligence analysed so far and extracted TTPs, which vulnerabilities pose the greatest risk?

Control Effectiveness:

  • What security controls do we have in place? You need to get detailed here, not just "yeah, we have firewalls, IDS, endpoint protection, etc." Aim to understand the specifics: What particular ports are blocked? What are the specific process/memory controls applied via EDR or AV? What are the system policies applied to different endpoint families? What are the filtering rules of your email proxy? Where can you find up-to-date information on the cloud policies applied to your tenants? And so on...
  • Are our controls properly configured and up-to-date? Are they effectively mitigating the risks we've identified?
  • Do our controls provide sufficient visibility? Can we detect and respond to potential attacks promptly?

Impact Assessment:

  • How would a successful attack impact our organization? What would be the financial, operational, and reputational consequences?
  • Which attack vectors pose the greatest risk? Based on our attack surface and vulnerabilities, which attack paths are most likely to be successful?
  • What is the potential blast radius of an attack? How far could an attacker move laterally within our network? What other assets could be compromised?
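The correlation this stage asks for (assets, exposure, relevant vulnerabilities) can be caricatured as a toy risk-ranking. The asset inventory, weights and scoring formula here are entirely invented; real prioritization would fold in CVSS, exploitability intel and business context:

```python
# Toy prioritization: correlate asset value, exposure and relevant
# unpatched vulnerabilities into a rough, illustrative risk score.

assets = {
    "crm-web": {"value": 8, "internet_facing": True,  "unpatched_cves": 3},
    "hr-db":   {"value": 9, "internet_facing": False, "unpatched_cves": 1},
    "dev-box": {"value": 3, "internet_facing": True,  "unpatched_cves": 5},
}

def risk_score(asset: dict) -> int:
    exposure = 2 if asset["internet_facing"] else 1  # invented weight
    return asset["value"] * exposure + asset["unpatched_cves"]

ranked = sorted(assets, key=lambda name: risk_score(assets[name]), reverse=True)
for name in ranked:
    print(name, risk_score(assets[name]))
```

Even this crude score surfaces the stage's main lesson: an exposed, moderately valuable asset can outrank a crown jewel that is well insulated.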

Step 8: Operational Environment Attack Path Discovery

You can see so far how we have navigated from hypothetical scenarios towards the turbulent waters of our real and nuanced operational environments. If you want to truly test your acquired knowledge so far, you need to get more real too.

This is where the model implements adversarial emulation by engaging pentesters and red teams, or by utilizing automated attack simulation tools that systematically test existing controls.

This includes cloud, on-prem, hybrid, containers, physical and virtual assets, and everything in between.

The primary goal of this stage is to uncover the most likely and impactful attack paths in your environment.

  1. Infiltrate: Gain unauthorized access to the target environment, simulating the initial stages of a real-world attack. This could involve exploiting the vulnerabilities you've researched so far, social engineering, or other techniques.
  2. Evade: Bypass existing security controls and remain undetected within the environment for as long as possible. This demonstrates the effectiveness (or lack thereof) of the organization's defense mechanisms.
  3. Escalate: Elevate privileges and gain access to sensitive systems, data, or resources. This assesses the resilience of the environment to lateral movement and privilege escalation attacks.
  4. Impact: Demonstrate the potential impact of a successful attack.
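Under the hood, "uncover the most likely attack paths" is a graph problem: model who-can-pivot-where as edges and search for chains from an entry point to a crown jewel. A breadth-first sketch over an entirely hypothetical topology:

```python
from collections import deque

# Hypothetical "can pivot to" edges in a small environment.
edges = {
    "phishing-email": ["ws-042"],
    "ws-042": ["file-share", "jump-host"],
    "jump-host": ["hr-db"],
    "file-share": [],
    "hr-db": [],
}

def shortest_attack_path(start: str, target: str):
    """Breadth-first search for the shortest pivot chain, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_attack_path("phishing-email", "hr-db"))
```

Tools like BloodHound apply the same idea at scale (with edge types for trust relationships, group memberships, sessions and so on); the sketch only shows the skeleton.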

Step 9: Operational Environment Embedded Predictive Analytics

And so fellow cyberscouts, we've climbed the pyramid. From the theoretical heights of threat intelligence to the nitty-gritty trenches of our own systems.

We've poked, prodded, and stress-tested our defences. Now it's time to crystallize all that nuanced and tailored knowledge into something truly powerful: predictive analytics.

I'm not talking about crystal balls.

Though that would be pretty cool.

I'm talking about weaving together everything we've learned—threat actor behaviours, our system's quirks, the strengths and weaknesses of our controls—into a model that can anticipate the next move. We're building an early warning system, one that doesn't just react to attacks, but sniffs them out before they even fully materialize.

Think of it like a seasoned hunter reading the subtle signs in the forest: a broken twig here, a disturbed patch of leaves there. We're training our systems to do the same, but with the digital breadcrumbs left by those lurking in our networks. It's about turning data into foresight, transforming our deep research into a tactical advantage.

This is the pinnacle of intel-driven data analysis, folks. It's where we stop playing catch-up and start setting the pace.

How do we do that?

Well, I'm not an expert data scientist. Are you? If you have delved into these mysteries, please reach out; perhaps you can write about it in the Tales of a Cyberscout ;)


It seems we have now reached the summit of our pyramid, or should I say the base?

The tales whispered around the crackling fire have led to these nine takeaways:

  1. 👂 Gather the whispers: collect and curate threat intelligence, tuning into the murmurs of the threat wilderness.
  2. 🍏 Pluck the low-hanging fruit: Identify and extract atomic indicators (IOCs) like domains, IPs, and hashes, the first layer of actionable information for further investigation.
  3. 🕵️ Decipher the attacker's playbook: understand the "how" of an attack by uncovering clear tactics, techniques and procedures; run a fine-grained pass over the information to reveal hidden patterns, then refocus and expand your intel. You will need this to conduct proper data analysis later.
  4. 🗺️ Illuminate your own forest: explore and understand the contours of your telemetry data, uncovering its structure, patterns and potential outliers.
  5. 🔎 Hunt for atomic traces: scan your telemetry for the presence of known malicious indicators, seeking those flickering embers that signal notable datapoints or events that require further investigation.
  6. 🎭 Embrace the dance of uncertainty: unmask hidden patterns, and analyze telemetry for anomalies and deviations from established baselines, revealing potential malicious behaviour.
  7. 🕸️ Map your attack surface: survey your operational environment and control landscape, identifying weaknesses, gaps and the "as-is" state that pretty diagrams don't show.
  8. 👣 Walk the attacker's path: Simulate attacks, tracing their footsteps to reveal hidden vulnerabilities.
  9. 🔮 Weave a tapestry of foresight: Craft predictive models, transforming raw data into actionable insight.


Nothing I said here is truly magnificent or new; I stand on the shoulders of giants and have simply integrated information from research and models that already exist out there. These are some of my references:


The views and opinions expressed in this newsletter are solely my own and do not reflect those of my employer. Information shared here is only meant for general educational purposes and does not constitute real professional advice of any kind.
