Malicious URL detector using char-level recurrent neural networks with Keras
The purpose of this project is to study a malicious URL detector that does not rely on any prior knowledge about URLs.
The training data comes from this link and can be found in "demo/data/URL.txt".
The following deep learning models have been implemented and studied:
- LSTM: this approach uses an LSTM recurrent network as the classifier, with a categorical cross-entropy loss function
  - training: demo/lstm_train.py (one-hot encoding)
  - predictor: demo/lstm_predict.py (one-hot encoding)
  - training: demo/lstm_embed_train.py (word embedding)
  - predictor: demo/lstm_embed_predict.py (word embedding)
- CNN + LSTM: this approach uses a CNN followed by an LSTM recurrent network as the classifier, with a categorical cross-entropy loss function
  - training: demo/cnn_lstm_train.py
  - predictor: demo/cnn_lstm_predict.py
- Bidirectional LSTM: this approach uses a bidirectional LSTM recurrent network as the classifier, with a categorical cross-entropy loss function
  - training: demo/bidirectional_lstm_train.py
  - predictor: demo/bidirectional_lstm_predict.py
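The one-hot and embedding variants differ mainly in how a URL is turned into model input. The sketch below shows both character-level encodings in plain Python. The helper names (`build_char_index`, `encode_indices`, `encode_one_hot`), the padding scheme, and `max_len` are illustrative assumptions and are not taken from the project's `extract_text_model`.

```python
def build_char_index(urls):
    """Map every distinct character to an integer id; id 0 is reserved for padding."""
    chars = sorted(set(''.join(urls)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode_indices(url, char_index, max_len):
    """Return a fixed-length list of character ids (the usual input to an embedding layer).

    Characters missing from char_index fall back to id 0 (an assumption of this sketch).
    """
    ids = [char_index.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))  # right-pad with the padding id


def encode_one_hot(url, char_index, max_len):
    """Return a max_len x vocab_size matrix with a single 1 per row."""
    vocab_size = len(char_index) + 1  # +1 for the padding id 0
    rows = []
    for idx in encode_indices(url, char_index, max_len):
        row = [0] * vocab_size
        row[idx] = 1
        rows.append(row)
    return rows


if __name__ == '__main__':
    urls = ['a.com/x', 'b.org']
    char_index = build_char_index(urls)
    print(encode_indices('b.org', char_index, max_len=8))
    print(len(encode_one_hot('b.org', char_index, max_len=8)))
```

The integer form stays compact and leaves the character representation to be learned by the embedding layer, while the one-hot form needs no learned lookup but grows with the vocabulary size.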
To train the bidirectional LSTM model, run:
cd demo
python bidirectional_lstm_train.py
Below is the code in bidirectional_lstm_train.py:
from keras_malicious_url_detector.library.bidirectional_lstm import BidirectionalLstmEmbedPredictor
from keras_malicious_url_detector.library.utility.url_data_loader import load_url_data
import numpy as np
from keras_malicious_url_detector.library.utility.text_model_extractor import extract_text_model
from keras_malicious_url_detector.library.utility.plot_utils import plot_and_save_history


def main():
    # Seed NumPy's random number generator
    random_state = 42
    np.random.seed(random_state)

    data_dir_path = './data'
    model_dir_path = './models'
    report_dir_path = './reports'

    # Load the labelled URLs and build the text model from the URL strings
    url_data = load_url_data(data_dir_path)
    text_model = extract_text_model(url_data['text'])

    batch_size = 64
    epochs = 30

    classifier = BidirectionalLstmEmbedPredictor()

    history = classifier.fit(text_model=text_model,
                             model_dir_path=model_dir_path,
                             url_data=url_data, batch_size=batch_size, epochs=epochs)

    # Save the training history plot to the reports folder
    plot_and_save_history(history, BidirectionalLstmEmbedPredictor.model_name,
                          report_dir_path + '/' + BidirectionalLstmEmbedPredictor.model_name + '-history.png')


if __name__ == '__main__':
    main()
After the training, the trained models are saved in the demo/models folder.
To test the trained model, run:
cd demo
python bidirectional_lstm_predict.py
Below is the code in bidirectional_lstm_predict.py:
from keras_malicious_url_detector.library.bidirectional_lstm import BidirectionalLstmEmbedPredictor
from keras_malicious_url_detector.library.utility.url_data_loader import load_url_data


def main():
    data_dir_path = './data'
    model_dir_path = './models'

    # Restore the model saved by the training script
    predictor = BidirectionalLstmEmbedPredictor()
    predictor.load_model(model_dir_path)

    url_data = load_url_data(data_dir_path)

    # Spot-check predictions on the first URLs in the data set
    count = 0
    for url, label in zip(url_data['text'], url_data['label']):
        predicted_label = predictor.predict(url)
        print('predicted: ' + str(predicted_label) + ' actual: ' + str(label))
        count += 1
        if count > 20:
            break


if __name__ == '__main__':
    main()
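Printing a handful of predictions is only a spot check. Because the data is imbalanced, accuracy is easier to interpret alongside precision and recall. The following is a minimal sketch of such a summary; `binary_metrics` is a hypothetical helper that is not part of the project, and it assumes that predictions are directly comparable to the labels and that label 1 is the class of interest.

```python
def binary_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision and recall for one class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # correctly flagged
        elif p == positive:
            fp += 1  # flagged, but actually the other class
        elif a == positive:
            fn += 1  # missed
        else:
            tn += 1  # correctly left alone
    total = tp + fp + fn + tn
    return {
        'accuracy': (tp + tn) / total if total else 0.0,
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == '__main__':
    actual = [0, 0, 0, 1, 1]
    predicted = [0, 0, 1, 1, 0]
    print(binary_metrics(actual, predicted))
```

In the prediction script, the two lists could be collected inside the loop instead of printing each pair.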
Currently the bidirectional LSTM gives the best performance, with 75% to 80% accuracy after 30 to 40 epochs of training.
Below is the training history (loss and accuracy) for the bidirectional LSTM:
Known issues:
- The URL data set is currently small.
- Class imbalance: URL.txt contains more 0 labels than 1 labels. Ideally the problem would be framed as outlier or anomaly detection. To handle the imbalance, the data is currently resampled so that each class has roughly the same number of samples.
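The resampling step can be illustrated with random oversampling, which duplicates samples from the smaller classes until every class matches the largest one. This is one common technique and an assumption here: the project may resample differently (for example by undersampling), and `oversample` is a hypothetical helper, not a function from the library.

```python
import random
from collections import defaultdict


def oversample(texts, labels, seed=42):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)

    # Group the samples by label
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    if not by_label:
        return [], []

    target = max(len(items) for items in by_label.values())

    pairs = []
    for label, items in by_label.items():
        # Draw extra samples with replacement to reach the target size
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extra)

    rng.shuffle(pairs)  # avoid long runs of a single class
    return [t for t, _ in pairs], [l for _, l in pairs]


if __name__ == '__main__':
    texts = ['a.com', 'b.org', 'c.net', 'd.io', 'bad.example']
    labels = [0, 0, 0, 0, 1]
    new_texts, new_labels = oversample(texts, labels)
    print(new_labels.count(0), new_labels.count(1))
```

Resampling is normally applied to the training split only, so that duplicated URLs do not leak into the data used for evaluation.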