Siamese Networks Effectively Learn Robust Features for Malware Attribution

Our novel deep learning approach enables automated, scalable identification of new malware that falls outside the set of known threat actors.

Presented at the Naval Applications of Machine Learning (NAML) Conference, February 2025

Malware attribution has proven challenging: cyber forensic expertise is in short supply, and agencies struggle to scale the identification of the threat actor behind a cyberattack. We present a novel deep learning approach for automated classification of malware by attribution category, with state-of-the-art results on both pairwise and one-to-many classification tasks that far exceed existing work. Our model, trained solely on the MOTIF benchmark, achieves 80% validation classification accuracy. We achieve this purely with a Siamese model trained on an embedding of the instructions in the malware disassembly.

A persistent challenge for many malware attribution approaches is scaling to classification categories and datasets beyond the limited available training data. While standard regularization techniques, such as dropout, remain invaluable for making predictions more general, we present two major innovations for greater generality:

We model the problem as a similarity metric learning problem using a Siamese architecture to achieve a general attribution embedding space
We apply novel data augmentation to our malware samples to increase the effective diversity of our dataset
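
The similarity metric formulation in the first item can be illustrated with a contrastive loss, one common training objective for Siamese models. The loss actually used is not stated here, so the following is only a minimal sketch under that assumption; the function names, the margin value, and the toy embedding vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_actor, margin=1.0):
    """Contrastive loss for one pair of embeddings (assumed objective).

    Pairs from the same attribution category are pulled together;
    pairs from different categories are pushed at least `margin` apart.
    """
    d = euclidean(emb_a, emb_b)
    if same_actor:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy pair of embeddings standing in for the two Siamese branch outputs.
pair = ([0.0, 0.0], [3.0, 4.0])
loss_if_same = contrastive_loss(*pair, same_actor=True)
loss_if_different = contrastive_loss(*pair, same_actor=False, margin=6.0)
```

In a real Siamese setup both embeddings would come from the same network with shared weights, and the loss would be averaged over a batch of pairs.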

Our unique Siamese architecture solves the open-set problem: it enables automated clustering of attribution categories outside the original dataset, and so scales to new malware that falls outside the set of known threat actors. We further find that the embedding space produced by our model is meaningful; in particular, clusters that lie close together in the embedding space represent closely related threat actors. Our unique data augmentation also improves model performance on the MOTIF set, and the embeddings produced for a malware sample and its augmentations suggest that our model is robust to slight variations in malware features that are meaningless to attribution.
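
One way to picture open-set behavior in an embedding space is a nearest-centroid rule with a distance threshold: a sample is assigned to the closest known category, or flagged as unknown when no category is close enough. This is a hedged sketch of that general idea and not a description of the presented method; the function names, labels, threshold, and toy vectors are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign_open_set(embedding, known, threshold):
    """Return the nearest known label, or None if every centroid is
    farther than `threshold` (treated as a previously unseen category).

    `known` maps a label to the list of embeddings seen for that label.
    """
    best_label, best_dist = None, float("inf")
    for label, vectors in known.items():
        d = math.dist(embedding, centroid(vectors))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


# Hypothetical 2-D embeddings for two known categories.
known = {
    "actor_a": [[0.0, 0.0], [0.0, 2.0]],
    "actor_b": [[10.0, 10.0], [12.0, 10.0]],
}
near_known = assign_open_set([0.0, 1.5], known, threshold=1.0)
far_from_all = assign_open_set([5.0, 5.0], known, threshold=1.0)
```

Samples that come back as `None` could then be clustered among themselves to propose new attribution categories.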
