Comparative Analysis of SQLi Detection Models
Abstract
SQL injection (SQLi) remains a common and persistent threat to web applications. Although many SQLi detection techniques have been proposed, most studies evaluate them on a single dataset, so their conclusions cannot be verified across data conditions and performance differences across dataset scales and distributions remain hidden. This study compares and evaluates machine learning (ML) and deep learning (DL) models on two publicly available SQLi datasets that differ in size and composition.
The ML pipelines use a hybrid representation that combines character-level TF-IDF, word-level TF-IDF from a SQL-aware tokenizer, and numeric behavioral indicators. The DL branch applies placeholder-based normalization and token-sequence modeling, covering recurrent networks (LSTM and GRU), attention-based variants, and a Transformer architecture.
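The following sketch illustrates, in scikit-learn, the two ingredients named above: placeholder-based normalization and the hybrid feature representation (character-level TF-IDF, word-level TF-IDF over a SQL-aware tokenizer, and numeric behavioral indicators). It is not the authors' released code; the tokenizer rules, placeholder scheme, and indicator choices are illustrative assumptions.

```python
import re
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC


def normalize_query(query: str) -> str:
    # Placeholder-based normalization (illustrative): replace string and
    # numeric literals with generic placeholders before tokenization.
    query = re.sub(r"'[^']*'", "STRVAL", query)
    query = re.sub(r"\b\d+\b", "NUMVAL", query)
    return query


def sql_tokenize(query: str):
    # Hypothetical SQL-aware tokenizer: keeps keywords, identifiers,
    # operators, and comment markers as separate tokens.
    return re.findall(r"[A-Za-z_]+|\d+|--|/\*|\*/|[=<>!']+|[(),;]", query.lower())


def behavioral_indicators(queries):
    # Illustrative numeric indicators: query length, quote count,
    # comment-marker count, and SQL-keyword density.
    feats = []
    for q in queries:
        toks = sql_tokenize(q)
        kw = sum(t in {"union", "select", "or", "and", "sleep"} for t in toks)
        feats.append([len(q), q.count("'"), q.count("--"), kw / max(len(toks), 1)])
    return np.array(feats)


# Hybrid representation: character n-gram TF-IDF, word-level TF-IDF using the
# SQL-aware tokenizer, and the numeric indicators, concatenated column-wise.
features = FeatureUnion([
    ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("word_tfidf", TfidfVectorizer(tokenizer=sql_tokenize, token_pattern=None)),
    ("indicators", FunctionTransformer(behavioral_indicators)),
])

# LinearSVC is used here because the abstract names it as a strong ML baseline.
model = Pipeline([("features", features), ("clf", LinearSVC())])
# model.fit(train_queries, train_labels)
```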
Empirical results show that dataset scale plays a significant role in the relative performance of DL models. On the smaller corpus, the LSTM model with multi-head attention achieves the best performance among all DL architectures, while several ML models perform at a comparable or higher level. On the larger and more heterogeneous corpus, the Transformer model attains the highest macro F1 score, reaching 0.9946. Linear Support Vector Classification (LinearSVC) serves as a robust ML baseline on both datasets. These results indicate that ML models lead on the smaller dataset but are surpassed by the top-performing DL model once the data become larger and more diverse.
Keywords
SQL injection detection, machine learning, deep learning, LinearSVC, Transformer, TF-IDF, tokenization, web application security