What's up, everybody!
Today, I want to introduce an interesting paper on code injection. Detecting this notorious attack is a crucial challenge for realizing a safe cyberspace, and many websites have already been attacked with code injection techniques. In this article, I explain one interesting strategy for detecting injection attacks with a machine learning model.
What is Code Injection?
The paper CODDLE: Code-Injection Detection With Deep Learning offers useful knowledge for catching code injection attacks with artificial intelligence. Its main contribution is insight into preprocessing. The authors report that their approach achieves better performance than the same kind of model without preprocessing.
The crucial point of this research is how to make the input string more readable for a machine learning model. To achieve this, the authors apply two preprocessing steps: removing noise and symbolizing.
A common way to improve a neural network's detection ability is to remove randomness from its input. A raw web access log contains a lot of random content, such as digits and names.
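To make the idea concrete, here is a minimal Python sketch of randomness removal. It is not the paper's implementation: the function name `remove_randomness` and the set `KNOWN_KEYWORDS` are my own assumptions (I added `ALL` to the operators so that a typical payload survives), and I assume every number or identifier outside that set is simply replaced with a space.

```python
import re

# Assumed vocabulary for illustration only; the paper's actual list is larger.
KNOWN_KEYWORDS = {"and", "union", "select", "from", "all"}

# Matches identifiers (column1, tablename) and runs of digits (1, 0).
WORD_OR_NUMBER = re.compile(r"[A-Za-z_]\w*|\d+")


def remove_randomness(text: str) -> str:
    """Blank out numbers and identifiers that are not known keywords."""

    def keep_or_blank(match: re.Match) -> str:
        token = match.group(0)
        # Keep known keywords (case-insensitive); replace the rest with a space.
        return token if token.lower() in KNOWN_KEYWORDS else " "

    blanked = WORD_OR_NUMBER.sub(keep_or_blank, text)
    # Collapse the leftover runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", blanked).strip()


print(remove_randomness("` and 1=0) union all"))
# ` and = ) union all
print(remove_randomness("SELECT column1, columns2, column3 FROM tablename"))
# SELECT , , FROM
```

Punctuation such as `=`, `)` and `,` is left untouched, so the structure of the query is preserved while the random parts disappear.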
The basic idea behind this step is that a well-trained neural network should understand the role of a specific symbol or operator. To achieve this, the paper proposes replacing these symbols with codes. The targets of this replacement include symbols, expressions, and programming language operators. What's more, each item is encoded not as a single value but as a pair of values: one identifies the raw string portion, and the other is a code that represents the category of that portion. The respective values are:
|0||Operators||AND, UNION, SELECT, FROM|
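The pair encoding could look roughly like the Python sketch below. Only category 0 ("Operators") and its four members come from the table above. The second category, the numeric token ids, and the names `CATEGORIES`, `VOCAB` and `symbolize` are hypothetical placeholders that I made up for illustration, so please don't read them as the paper's actual code table.

```python
import re

CATEGORIES = {
    0: ["AND", "UNION", "SELECT", "FROM"],  # "Operators", from the table above
    1: ["=", "(", ")", ",", "'", "`"],      # hypothetical punctuation category
}

# Map each known token to a pair: (token id, category id).
# Token ids are assigned in insertion order and are purely illustrative.
VOCAB = {}
for category_id, members in CATEGORIES.items():
    for member in members:
        VOCAB[member] = (len(VOCAB), category_id)

# Split into identifiers, numbers, or single non-space characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def symbolize(text: str) -> list[tuple[int, int]]:
    """Encode each known token as a (token id, category id) pair."""
    pairs = []
    for token in TOKEN_RE.findall(text):
        entry = VOCAB.get(token.upper())
        if entry is not None:  # tokens outside the vocabulary are skipped
            pairs.append(entry)
    return pairs


print(symbolize("SELECT , , FROM"))
# [(2, 0), (7, 1), (7, 1), (3, 0)]
```

The second value of each pair tells the model which role a token plays, so two different operators still share the same category code.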
Here I'd like to show you some encoding samples.
|Raw String||` and 1=0) union all|
|Remove Randomness||` and = ) union all|
|Raw String||SELECT column1, columns2, column3 FROM tablename|
|Remove Randomness||SELECT , , FROM|
The authors conducted experiments on SQL injection and XSS, using the SQL Payload Dataset and the XSS Payload Dataset respectively. The model they used is a convolutional neural network.
|Attack Type||Dataset||Data Source|
|SQL Injection||SQL Payload Dataset||https://github.com/SuperCowPowers/data_hacking/tree/master/sql_injection/data|
|XSS||XSS Payload Dataset||https://github.com/payloadbox/xss-payload-list|
According to the results in the above table, the suggested preprocessing method appears to outperform the approach without preprocessing.
The results do seem to be positive. However, I think this method has some weaknesses. The suggested encoding requires domain-specific knowledge and can only cover known expressions. So if a language adds a new function or feature that can be abused for an injection attack, I imagine the suggested method would fail to detect that kind of attack. In my opinion, the strategy offered by the paper is suitable for detecting well-known attacks but unsuitable for finding unknown ones such as zero-days. Still, the idea of using encoding is interesting.
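A tiny Python sketch shows what I mean by this limitation. It assumes a fixed vocabulary containing only the four operators mentioned earlier, and `new_func` is a made-up function name standing in for some future language feature. Neither the helper `unknown_tokens` nor the payload comes from the paper.

```python
import re

# Fixed vocabulary, assumed for illustration (the operators listed earlier).
KNOWN_OPERATORS = {"AND", "UNION", "SELECT", "FROM"}


def unknown_tokens(payload: str) -> list[str]:
    """Return the alphabetic tokens that the fixed vocabulary does not cover."""
    words = re.findall(r"[A-Za-z_]+", payload)
    return [word for word in words if word.upper() not in KNOWN_OPERATORS]


# `new_func` is hypothetical: a function added to the language after the
# vocabulary was built.
print(unknown_tokens("SELECT new_func(1) FROM t"))
# ['new_func', 't']
```

From the vocabulary's point of view, the dangerous `new_func` looks exactly like the harmless table name `t`, so both would be treated as noise.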
CODDLE: Code-Injection Detection With Deep Learning
Machine Learning Researcher and Developer
Cyber Security Cloud, Inc.