Migrate CLX DGA Detection #46

Merged on Jun 5, 2023 (30 commits). The diff below shows changes from 3 of them.

Commits
83892a6 migrate clx dga detection (efajardo-nv, May 2, 2023)
1487b54 style fixes (efajardo-nv, May 2, 2023)
35c1448 more style fixes (efajardo-nv, May 2, 2023)
7821442 convert to numpy docstring format (efajardo-nv, May 3, 2023)
9f4ad25 update notebook (efajardo-nv, May 3, 2023)
dd3abab add training script (efajardo-nv, May 5, 2023)
bfffd88 add readme (efajardo-nv, May 5, 2023)
0cd5b74 style fixes (efajardo-nv, May 9, 2023)
6d6ca19 add onnx export step to notebook (efajardo-nv, May 9, 2023)
f5ef8fe add padding to data preproc (efajardo-nv, May 9, 2023)
a667ad8 add onnx model and triton config (efajardo-nv, May 9, 2023)
aee74bc add morpheus pipeline (efajardo-nv, May 9, 2023)
4098f0a cleanup (efajardo-nv, May 10, 2023)
3ed7062 preproc fix (efajardo-nv, May 10, 2023)
03215bc notebook and model updates (efajardo-nv, May 10, 2023)
d5a6faf morpheus pipeline readme (efajardo-nv, May 10, 2023)
06f6f32 readme fix (efajardo-nv, May 10, 2023)
e0c242a add dga-detection to workspace (efajardo-nv, May 23, 2023)
c53b3ab fix onnx export (efajardo-nv, May 23, 2023)
2dba932 update versions to match morpheus (efajardo-nv, May 25, 2023)
f7a4d3e Merge remote-tracking branch 'origin/update-to-morpheus-versions' int… (efajardo-nv, May 26, 2023)
43f7c57 updates to work with new data and cudf 23.02 (efajardo-nv, May 26, 2023)
143cef8 flake8 fixes (efajardo-nv, May 26, 2023)
60f412f doc update (efajardo-nv, May 26, 2023)
fdb6a05 move triton-model-repo (efajardo-nv, May 31, 2023)
36506ac update stats for new training data (efajardo-nv, May 31, 2023)
cbc1bc4 Merge branch 'branch-23.07' of https://github.com/nv-morpheus/morpheu… (efajardo-nv, Jun 1, 2023)
7d1e327 add more info about training data (efajardo-nv, Jun 1, 2023)
5a4ff1f pr feedback updates (efajardo-nv, Jun 1, 2023)
bd927d1 remove commented code (efajardo-nv, Jun 2, 2023)
21 changes: 16 additions & 5 deletions dga-detection/README.md
@@ -21,13 +21,24 @@ pip install -r requirements.txt

 ### Training

+#### Training data
+
 Training data consists of 116K labelled as DGA domains and 100K labelled as not DGA domains.

-GPU Model: V100
-Epochs = 25
-Training batch size = 10000
-Model precision = 0.997
-Model accuracy = 0.998
+Two types of DGA domains (Banjori, Chinad) were generated based on the implementations on https://github.com/baderj/domain_generation_algorithms. 100000 benign domains were taken from https://www.domcop.com/files/top/top10milliondomains.csv.zip.
+
+#### Training epochs
+25
+
+#### Training batch size
+10000
+
+#### GPU model
+V100
+
+#### Model accuracy
+precision = 0.995
+accuracy = 0.998

 #### Training script

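The README hunk above reports precision and accuracy as separate figures. As a reminder of how the two differ, here is a minimal sketch of the standard definitions, assuming DGA is treated as the positive class. The confusion counts are hypothetical and chosen only to illustrate the formulas; they are not taken from this PR or its training run.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of domains flagged as DGA that really are DGA."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all domains (DGA and benign) that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion counts (assumption, for illustration only).
tp, tn, fp, fn = 90, 95, 5, 10

print(f"precision = {precision(tp, fp):.3f}")
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")
```

Precision ignores missed DGA domains (false negatives), so it can move independently of accuracy when the training data changes.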
27 changes: 0 additions & 27 deletions dga-detection/morpheus-pipeline/messages.py
@@ -16,38 +16,11 @@
 import dataclasses
 import typing

-import cupy as cp
-
-from morpheus.messages import InferenceMemory
 from morpheus.messages import MultiInferenceMessage
-from morpheus.messages.data_class_prop import DataClassProp
 from morpheus.messages.memory.tensor_memory import TensorMemory
 from morpheus.messages.message_meta import MessageMeta


-@dataclasses.dataclass(init=False)
-class InferenceMemoryDGA(InferenceMemory, cpp_class=None):
-    """
-    This is a container class for data that needs to be submitted to the inference server for DGA
-    use cases.
-
-    Parameters
-    ----------
-    domains : cupy.ndarray
-        The token-ids for each string padded with 0s to max_length.
-    seq_lengths : cupy.ndarray
-        Sequence lengths
-
-    """
-    domains: dataclasses.InitVar[cp.ndarray] = DataClassProp(InferenceMemory._get_tensor_prop,
-                                                             InferenceMemory.set_input)
-    seq_ids: dataclasses.InitVar[cp.ndarray] = DataClassProp(InferenceMemory._get_tensor_prop,
-                                                             InferenceMemory.set_input)
-
-    def __init__(self, *, count: int, domains: cp.ndarray, seq_ids: cp.ndarray):
-        super().__init__(count=count, tensors={'domains': domains, 'seq_ids': seq_ids})
-
-
 @dataclasses.dataclass
 class MultiInferenceDGAMessage(MultiInferenceMessage, cpp_class=None):
     """
7 changes: 4 additions & 3 deletions dga-detection/morpheus-pipeline/preprocessing.py
@@ -17,7 +17,7 @@

 import cupy as cp
 import mrc
-from messages import InferenceMemoryDGA
+# from messages import InferenceMemoryDGA
 from messages import MultiInferenceDGAMessage

 import cudf
@@ -26,6 +26,7 @@
 from morpheus.messages import MultiInferenceMessage
 from morpheus.messages import MultiInferenceNLPMessage
 from morpheus.messages import MultiMessage
+from morpheus.messages.memory.tensor_memory import TensorMemory
 from morpheus.stages.preprocess.preprocess_base_stage import PreprocessBaseStage


@@ -75,7 +76,7 @@ def supports_cpp_node(self):
     @staticmethod
     def pre_process_batch(x: MultiMessage, fea_len: int, column: str, truncate_len: int) -> MultiInferenceNLPMessage:

-        df = x.get_meta()[[column]]
+        df = x.get_meta([column])
         df[column] = df[column].str.slice_replace(truncate_len, repl='')

         split_ser = df[column].str.findall(r"[\w\W\d\D\s\S]")
@@ -113,7 +114,7 @@ def pre_process_batch(x: MultiMessage, fea_len: int, column: str, truncate_len:
         seg_ids[:, 2] = fea_len - 1

         # Create the inference memory. Keep in mind count here could be > than input count
-        memory = InferenceMemoryDGA(count=input.shape[0], domains=input, seq_ids=seg_ids)
+        memory = TensorMemory(count=input.shape[0], tensors={'domains': input, 'seq_ids': seg_ids})

         infer_message = MultiInferenceDGAMessage.from_message(x, memory=memory)

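The hunks above show only fragments of `pre_process_batch`. As a rough, CPU-only illustration of the steps they name (truncate each domain to `truncate_len`, split it into characters, pad with 0s to `fea_len`, and build the two tensors passed as `domains` and `seq_ids`), here is a NumPy sketch. The function name `encode_domains`, the use of Unicode code points as token ids, and the row index stored in the first `seq_ids` column are assumptions made for illustration; the PR's actual implementation runs on cuDF/CuPy and is not fully visible in this diff.

```python
from __future__ import annotations

import numpy as np


def encode_domains(domains: list[str], fea_len: int, truncate_len: int) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: turn domain strings into zero-padded token-id rows."""
    tokens = np.zeros((len(domains), fea_len), dtype=np.int64)
    seg_ids = np.zeros((len(domains), 3), dtype=np.uint32)

    for row, domain in enumerate(domains):
        # Truncate first, then make sure the result also fits the feature width.
        chars = domain[:truncate_len][:fea_len]
        # Assumption: one token per character, encoded as its code point.
        tokens[row, :len(chars)] = [ord(c) for c in chars]

    # Assumption: column 0 maps each tensor row back to its input row.
    seg_ids[:, 0] = np.arange(len(domains))
    # Taken from the hunk above: the last column holds fea_len - 1.
    seg_ids[:, 2] = fea_len - 1
    return tokens, seg_ids


tokens, seg_ids = encode_domains(["abc", "example.com"], fea_len=5, truncate_len=4)
print(tokens)
print(seg_ids)
```

In the pipeline, the equivalent arrays are wrapped as `TensorMemory(count=..., tensors={'domains': ..., 'seq_ids': ...})`, which is what allowed the dedicated `InferenceMemoryDGA` class to be dropped.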
4 changes: 4 additions & 0 deletions docker/Dockerfile
@@ -101,7 +101,11 @@ RUN conda clean -afy
COPY "./docker" "./docker"
COPY "./anomalous-auth-detection" "./anomalous-auth-detection"
COPY "./appshield-dga-detection" "./appshield-dga-detection"
COPY "./asset-clustering" "./asset-clustering"
COPY "./dga-detection" "./dga-detection"
COPY "./ids-detection" "./ids-detection"
COPY "./log-sequence-ad" "./log-sequence-ad"
COPY "./operational-technology" "./operational-technology"
COPY "./phishing-url-detection" "./phishing-url-detection"
COPY "./string-resemblance-grouping" "./string-resemblance-grouping"
COPY ["*.md", "LICENSE", "./"]