Text-based Protocol Detection and Learning
Many well-known application protocols, such as CIFS, FTP, HTTP, IMAP, IRC, NNTP, POP3, and SMTP, have formal descriptions, most of them published as RFCs. Not surprisingly, a competent programmer can implement such a protocol from its description without much difficulty. However, if we want to know which protocol is being used on a communication channel, writing a dedicated detector for each well-known protocol may work, but it is not a good approach. Moreover, the task becomes more challenging when there is no formal protocol description, only some observed samples belonging to the protocol: how can we identify future samples? Machine learning seems a promising way to handle both tasks.
Our project team built a client/server (C/S) system that collects data samples and analyzes them online. Client software is installed on the Internet gateways, and several analysis servers are placed on the LAN.
In practice, especially during file-transfer bursts, the data collector on a gateway skips some samples to avoid excessive load. This behavior is configurable from the global server.
Training process
1. For each protocol, automatically extract the keywords from its RFC document, and initialize the weight of each keyword according to equation (1).
2. Specify a channel whose protocol is known in advance, and sample the data headers from its data transfer.
3. Tokenize the data headers, and update the weight of each keyword according to equation (2).
4. Build a keyword-weight list for each protocol.
The initial weight of keyword i is defined as follows, where InEntropy(i) is the inner entropy of keyword i. (Inner entropy describes a keyword's frequency across the different protocols.)
(1)
The weight of keyword i is updated according to its occurrence frequency in the current document (data header), where M is a prior constant.
(2)
Every keyword that does not appear in the RFC document has its weight initialized to 0.
Detection process
1. Specify a channel, and sample the data headers from its data transfer.
2. Tokenize the data headers to generate a keyword vector for them.
3. For each pre-trained protocol, take its top 100 keywords by weight to form that protocol's feature vector.
4. Compute the dot product of the keyword vector with each protocol's feature vector to obtain a score for each protocol.
5. Select the protocol with the maximum score as the protocol used on the current channel.
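The detection steps above can be illustrated with a minimal Python sketch. It is not the project's implementation: the tokenizer rule (lowercase, then split into words, numbers, and single punctuation marks) is an assumption inferred from the kinds of tokens in the keyword-weight list, the function names are hypothetical, and the SMTP weights in the usage part are invented purely for illustration. The HTTP weights are rounded from the example list.

```python
import re
from collections import Counter

# Assumed tokenizer: words, numbers, and single punctuation marks, lowercased.
TOKEN_RE = re.compile(r"[a-z]+|\d+|[^\sa-z\d]")

def tokenize(header):
    """Split one data header into lowercase tokens."""
    return TOKEN_RE.findall(header.lower())

def keyword_vector(headers):
    """Count how often each token occurs in the sampled headers."""
    counts = Counter()
    for header in headers:
        counts.update(tokenize(header))
    return counts

def top_keywords(weights, n=100):
    """Keep the n highest-weighted keywords of one trained protocol."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

def detect(headers, trained, n=100):
    """Score every protocol by dot product and return (best, scores)."""
    vec = keyword_vector(headers)
    scores = {}
    for protocol, weights in trained.items():
        features = top_keywords(weights, n)
        # Counter returns 0 for tokens that never occurred in the headers.
        scores[protocol] = sum(vec[k] * w for k, w in features.items())
    return max(scores, key=scores.get), scores

# Usage with toy keyword-weight lists (SMTP values are made up).
trained = {
    "HTTP": {"get": 0.41, ".": 0.40, "1": 0.34, ":": 0.33,
             "connection": 0.27, "-": 0.18, "http": 0.18},
    "SMTP": {"helo": 0.50, "mail": 0.40, "rcpt": 0.40},
}
headers = ["GET /index.html HTTP/1.1", "Connection: keep-alive"]
best, scores = detect(headers, trained)
print(best, scores)
```

Using raw token counts as the keyword vector is one possible choice; a binary (present or absent) vector or a length-normalized one would fit the same description and may behave better when channels produce headers of very different sizes.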
Example keyword-weight list for the HTTP protocol
KEYWORD | WEIGHT
get | 0.406939353189554
. | 0.395007586242046
1 | 0.338233483751328
: | 0.331044000203868
connection | 0.269702470331325
- | 0.184387631485061
http | 0.176468927771919
" | 0.0979632802361732
More details can be found in the Design Document.
Count the data flow per protocol in real time.
Count the data flow per file type in real time.
Construct the topological network connection graph in real time.
Count the number of network attacks in real time.
This simple test tool was developed for stress testing and performance analysis. Samples of each protocol must be prepared before launching a test. The tool then generates a large random data flow from the prepared samples according to its configuration.
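As a rough illustration of how such a generator might work, the following Python sketch draws random payloads from prepared samples. It is not TestKit's code: the function name, the `mix` configuration (relative share of each protocol), and the sample payloads are all hypothetical.

```python
import random

def generate_flow(samples, mix, count, seed=None):
    """Yield (protocol, payload) pairs drawn at random from prepared samples.

    samples: protocol name -> list of prepared sample payloads
    mix:     protocol name -> relative share of the flow (hypothetical config)
    """
    rng = random.Random(seed)
    protocols = list(mix)
    shares = [mix[p] for p in protocols]
    for _ in range(count):
        # Pick a protocol according to the configured mix, then one of its samples.
        protocol = rng.choices(protocols, weights=shares, k=1)[0]
        yield protocol, rng.choice(samples[protocol])

# Usage with made-up samples: roughly three HTTP payloads per SMTP payload.
samples = {
    "HTTP": ["GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"],
    "SMTP": ["HELO example.com\r\n", "MAIL FROM:<a@example.com>\r\n"],
}
flow = list(generate_flow(samples, {"HTTP": 3, "SMTP": 1}, count=1000, seed=0))
print(len(flow))
```

Passing a fixed `seed` makes a stress run repeatable, which helps when comparing performance measurements between runs.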
DataCollector binary program and source code
TestKit binary program and source code