AI-based Threat Detection - Part 1: Bayesian Classification

Image by Tom Huynh (https://www.pexels.com/sv-se/@tom-huynh-1675348979/)

This is the first post in a new series on AI-based threat detection. Let's kick things off with my all-time favourite: Bayesian classification. Specifically, we will build a Naive Bayes model and then evaluate it. The entire exercise took me less than an hour, so I strongly recommend you build it yourself.

So, let's dive in. First, we will need the following:

  1. Script to train and build our AI model.
  2. Script to run and evaluate our AI model.
  3. Labelled dataset to train the AI model.

It's important to select a good training dataset. The basic laws of the universe apply: garbage in, garbage out. The better your data, the better your model.

What is Bayes?

Bayes' theorem is named after Thomas Bayes. It describes how to update the probability of a hypothesis once you have observed some evidence: P(A|B) = P(B|A) × P(A) / P(B). For example, if I give you a fruit that's yellow, elongated and rich in potassium, what's the probability that it's a banana?

Taken from https://towardsdatascience.com/bayes-rule-with-a-simple-and-practical-example-2bce3d0f4ad0/
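To make the banana question concrete, here is a small worked sketch of Bayes' rule in Python. The fruit proportions and likelihoods are made-up numbers chosen purely for illustration, and the posteriors helper is a hypothetical name, not part of any library.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule: P(class | evidence) for every class."""
    # P(evidence) is the sum over classes of P(evidence | class) * P(class).
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}


# Hypothetical fruit bowl: how common each fruit is (the priors).
priors = {"banana": 0.5, "orange": 0.3, "other": 0.2}
# Hypothetical chance that each fruit is "yellow and elongated" (the likelihoods).
likelihoods = {"banana": 0.8, "orange": 0.05, "other": 0.1}

result = posteriors(priors, likelihoods)
for fruit, probability in result.items():
    print(f"P({fruit} | yellow, elongated) = {probability:.3f}")
```

With these invented numbers the banana comes out at roughly 0.92, because it is both the most common fruit and the one most likely to look yellow and elongated.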

How does it help us?

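Well, if we train a Naive Bayes classifier on data labelled with IoCs (indicators of compromise) and IoAs (indicators of attack), we can use the model to detect malicious traffic. The "naive" part is the assumption that the features are independent of one another given the class, which keeps training fast and simple.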
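As a rough idea of what such a classifier does under the hood, here is a minimal from-scratch Gaussian Naive Bayes sketch. It is not the implementation used in the script discussed later; the class name TinyGaussianNB, the two features (packet count and byte count) and the toy flows are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes for numeric features (illustrative only)."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors = {}
        self.stats = {}
        for label, members in grouped.items():
            # Prior: fraction of training rows that belong to this class.
            self.priors[label] = len(members) / len(rows)
            self.stats[label] = []
            for column in zip(*members):
                mean = sum(column) / len(column)
                # Small constant avoids a zero variance for constant columns.
                var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
                self.stats[label].append((mean, var))
        return self

    def log_scores(self, row):
        scores = {}
        for label, prior in self.priors.items():
            score = math.log(prior)
            # "Naive" step: add per-feature log-likelihoods as if independent.
            for x, (mean, var) in zip(row, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var)
                score -= (x - mean) ** 2 / (2 * var)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy flows as (packets, bytes); values are invented for this example.
rows = [(6, 1400), (8, 2100), (5, 1200), (7, 1800),
        (2, 120), (1, 60), (2, 100), (3, 160)]
labels = ["Benign"] * 4 + ["Reconnaissance"] * 4

model = TinyGaussianNB().fit(rows, labels)
print(model.predict((2, 110)))
print(model.predict((6, 1500)))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together, which matters once a dataset has dozens of features.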

How do we do it?

Well, firstly it's important to identify, collect or synthesize labelled datasets to train the model. The following dataset seemed quite useful and fulfils our training requirement:

NF-UNSW-NB15-v3

Let's take a look-see inside the data for a better understanding:

FLOW_START_MILLISECONDS,FLOW_END_MILLISECONDS,IPV4_SRC_ADDR,L4_SRC_PORT,IPV4_DST_ADDR,L4_DST_PORT,PROTOCOL,L7_PROTO,IN_BYTES,IN_PKTS,OUT_BYTES,OUT_PKTS,TCP_FLAGS,CLIENT_TCP_FLAGS,SERVER_TCP_FLAGS,FLOW_DURATION_MILLISECONDS,DURATION_IN,DURATION_OUT,MIN_TTL,MAX_TTL,LONGEST_FLOW_PKT,SHORTEST_FLOW_PKT,MIN_IP_PKT_LEN,MAX_IP_PKT_LEN,SRC_TO_DST_SECOND_BYTES,DST_TO_SRC_SECOND_BYTES,RETRANSMITTED_IN_BYTES,RETRANSMITTED_IN_PKTS,RETRANSMITTED_OUT_BYTES,RETRANSMITTED_OUT_PKTS,SRC_TO_DST_AVG_THROUGHPUT,DST_TO_SRC_AVG_THROUGHPUT,NUM_PKTS_UP_TO_128_BYTES,NUM_PKTS_128_TO_256_BYTES,NUM_PKTS_256_TO_512_BYTES,NUM_PKTS_512_TO_1024_BYTES,NUM_PKTS_1024_TO_1514_BYTES,TCP_WIN_MAX_IN,TCP_WIN_MAX_OUT,ICMP_TYPE,ICMP_IPV4_TYPE,DNS_QUERY_ID,DNS_QUERY_TYPE,DNS_TTL_ANSWER,FTP_COMMAND_RET_CODE,SRC_TO_DST_IAT_MIN,SRC_TO_DST_IAT_MAX,SRC_TO_DST_IAT_AVG,SRC_TO_DST_IAT_STDDEV,DST_TO_SRC_IAT_MIN,DST_TO_SRC_IAT_MAX,DST_TO_SRC_IAT_AVG,DST_TO_SRC_IAT_STDDEV,Label,Attack

It's simply a labelled NetFlow dataset. It has a field that indicates whether a flow is malicious (Label) and a field that indicates the kind of malicious activity (Attack):

1424242193373,1424242193630,175.45.176.0,61489,149.171.126.16,8088,6,0.0,160,4,80,2,18,18,18,256,256,0,254,255,40,40,40,40,0.3125,0.625,40,1,0,0,4980,2490,6,0,0,0,0,39753,7585,10240,40,0,0,0,0,0,256,85,120,0,0,0,0,1,Fuzzers
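As a sketch of how a row like this could be split into numeric features and labels, the snippet below parses a trimmed excerpt of the sample row (only a handful of its columns, with the values copied from above). The split_row helper and the choice of which columns to exclude from the features are assumptions for illustration.

```python
import csv
import io

# Assumption: addresses and the two label fields are not used as numeric features.
NON_FEATURE_COLUMNS = {"IPV4_SRC_ADDR", "IPV4_DST_ADDR", "Label", "Attack"}


def split_row(row):
    """Return (numeric features, binary label, attack class) for one CSV row."""
    features = {
        name: float(value)
        for name, value in row.items()
        if name not in NON_FEATURE_COLUMNS
    }
    return features, int(row["Label"]), row["Attack"]


# Trimmed excerpt of the sample row above: a subset of columns only.
sample = io.StringIO(
    "IPV4_SRC_ADDR,L4_DST_PORT,PROTOCOL,IN_BYTES,IN_PKTS,Label,Attack\n"
    "175.45.176.0,8088,6,160,4,1,Fuzzers\n"
)

for row in csv.DictReader(sample):
    features, is_malicious, attack = split_row(row)
    print(features, is_malicious, attack)
```

For the full dataset, the same loop would read from the downloaded CSV file instead of an in-memory string.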

Here are the various classes of malicious traffic used in the dataset:

Taken from https://www.researchgate.net/figure/NF-UNSW-NB15-distribution_tbl2_350728555

Ok, cool now what?

Well now let's go create our model. You can leverage your choice of AI to build the Naive Bayes classifier. Alternatively, download the script I generated (with Claude):

GitHub - art-of-defence/NBdetect: An quick exercise using Naive Bayes for network threat detection. Paired with a blog entry at: https://art-of-defence.ghost.io/

Here's the help:

nithen@night heatseeker % python3 heatseeker.py -h
usage: heatseeker.py [-h] [--model MODEL] {train,run,live} ...

Bayesian Network Traffic Threat Classifier

positional arguments:
  {train,run,live}
    train           Train the model on labelled CSV datasets
    run             Classify flows in CSV file(s)
    live            Capture and classify live network traffic

options:
  -h, --help        show this help message and exit
  --model MODEL     Path to model file (default: bayes_model.pkl)

Now we need to train the model, like so:

python3 heatseeker.py train NF-UNSW-NB15-v3.csv --label-column "Attack"

This will produce a model (binary file) called: bayes_model.pkl
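The .pkl extension suggests the model is stored with Python's pickle module, though that is an assumption about the script rather than something confirmed here. A minimal sketch of saving and restoring an object that way, using a hypothetical stand-in dictionary instead of a real trained model:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: just some learned class counts.
model = {"classes": ["Benign", "Fuzzers"], "counts": {"Benign": 10, "Fuzzers": 2}}

# Write to a temporary location so the example does not touch real model files.
path = os.path.join(tempfile.mkdtemp(), "example_model.pkl")
with open(path, "wb") as handle:
    pickle.dump(model, handle)

# Load it back and confirm that the round trip preserved the contents.
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == model)
```

Since unpickling can execute arbitrary code, only load model files that you created yourself or obtained from a source you trust.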

Then we can run it and see how it performs:

=== Model Summary ===
Total training samples : 2365424
Classes : Benign, Fuzzers, Exploits, Backdoor, Reconnaissance, Generic, DoS, Shellcode, Analysis, Worms
Benign 2237731 samples
Fuzzers 33816 samples
Exploits 42748 samples
Backdoor 4659 samples
Reconnaissance 17074 samples
Generic 19651 samples
DoS 5980 samples
Shellcode 2381 samples
Analysis 1226 samples
Worms 158 samples
[*] Starting live capture on interface: default
Flow timeout : 30.0s
Press Ctrl+C to stop.
[03:04:17] ⚠ THREAT 192.168.1.1:53077 → 31.255.33.19:443 (TCP) class=Reconnaissance conf=67.71% pkts=2 bytes=120
[03:04:20] benign 192.168.1.1:49228 → 31.120.69.21:443 (TCP) class=Benign conf=98.67% pkts=6 bytes=1400

So, it seems to pick up some reconnaissance activity from the get-go. Interesting, although at 67.71% confidence on a two-packet flow it could well be a false positive:

[03:04:17] ⚠ THREAT 192.168.1.1:53077 → 31.255.33.19:443 (TCP) class=Reconnaissance conf=67.71% pkts=2 bytes=120
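The sample counts in the model summary above also tell us what the class priors look like. As a sketch (assuming the priors are simply each class's share of the training samples, which is the usual Naive Bayes choice but not confirmed for this script), they can be computed like this:

```python
import math

# Sample counts copied from the model summary above.
class_counts = {
    "Benign": 2237731,
    "Fuzzers": 33816,
    "Exploits": 42748,
    "Backdoor": 4659,
    "Reconnaissance": 17074,
    "Generic": 19651,
    "DoS": 5980,
    "Shellcode": 2381,
    "Analysis": 1226,
    "Worms": 158,
}

total = sum(class_counts.values())
# Prior for each class: its share of all training samples.
priors = {name: count / total for name, count in class_counts.items()}

for name, prior in sorted(priors.items(), key=lambda item: -item[1]):
    print(f"{name:<15} prior={prior:.5f} log_prior={math.log(prior):.2f}")
```

Roughly 94.6% of the samples are Benign, so the features of a flow need to provide strong evidence before a rare class such as Worms can win. That imbalance is worth keeping in mind when reading the confidence values.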

How do I train my own model?

Train on a labelled CSV (must have a 'label' column):

python3 heatseeker.py train traffic_labelled.csv

Use a custom label column name:

python3 heatseeker.py train cicids2017.csv --label-column "Label"

Train on multiple datasets at once (each one is added to the model incrementally):

python3 heatseeker.py train train1.csv train2.csv train3.csv

Overwrite the existing model instead of extending it:

python3 heatseeker.py train new_data.csv --overwrite

How do I run the model?

Classify flows:

python3 heatseeker.py run live_capture.csv

Save results to a new CSV file:

python3 heatseeker.py run capture.csv --output results.csv
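Once you have predictions alongside the true labels, you can evaluate the model. With traffic that is overwhelmingly benign, plain accuracy can look good while attacks are missed, so per-class precision and recall are more informative. Below is a minimal sketch; the per_class_metrics helper and the two label lists are hypothetical, and it makes no assumption about the layout of the results file.

```python
from collections import Counter


def per_class_metrics(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this class, but it was wrong
            fn[truth] += 1  # missed the class it really was
    metrics = {}
    for label in set(true_labels) | set(predicted_labels):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Invented labels purely to exercise the function.
truth = ["Benign", "Benign", "Benign", "Benign", "DoS", "DoS"]
guess = ["Benign", "Benign", "Benign", "DoS", "DoS", "Benign"]

metrics = per_class_metrics(truth, guess)
for label, (precision, recall) in sorted(metrics.items()):
    print(f"{label:<8} precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "when the model raises this class, how often is it right?", while recall answers "of the real flows in this class, how many did it catch?".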

Ok, now what?

Go create your own datasets, classifications and models to suit your environment.
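As a starting point for your own dataset, the sketch below writes a few hand-labelled flows to CSV with a 'label' column. The feature columns and values are invented for illustration, and whether the script accepts exactly these columns is an assumption you should check against the repository.

```python
import csv
import io

# Hypothetical flows that you have already triaged and labelled by hand.
flows = [
    {"IN_BYTES": 1400, "IN_PKTS": 6, "L4_DST_PORT": 443, "label": "Benign"},
    {"IN_BYTES": 120, "IN_PKTS": 2, "L4_DST_PORT": 443, "label": "Reconnaissance"},
]

# An in-memory buffer keeps the example self-contained; use open(...) for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flows[0].keys()))
writer.writeheader()
writer.writerows(flows)

print(buffer.getvalue())
```

Keeping the column names identical across all your training files makes it easier to stack several datasets into one model later.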

Conclusion

I'll write an article on the model's performance once I've used it in a live environment. Subscribe to be notified.

As always, if I got anything wrong,...

References

The following websites serve as appropriate references for additional detail:

Bayes’ rule with a simple and practical example | Towards Data Science
We demonstrate simple yet practical examples of the application of the Bayes’ rule with Python code.
CSE-CIC-IDS2018
A cleaned version of CSE-CIC-IDS2018 dataset
Reconnaissance, Tactic TA0043 - Enterprise | MITRE ATT&CK®