Text-based Protocol Detection and Learning

Many well-known application protocols, such as CIFS, FTP, HTTP, IMAP, IRC, NNTP, POP3, and SMTP, have formal descriptions published as RFCs. Given such a description, a competent programmer can implement the protocol without much difficulty. However, if we want to know which protocol is in use on a communication channel, writing a dedicated detector for each well-known protocol may work but is not a good approach. Moreover, the task becomes harder when there is no formal protocol description at all, only some observed samples known to belong to the protocol: how can we then identify future samples? Machine learning is a promising way to handle both cases.

 

Our project team built a client/server (C/S) system that collects data samples and analyzes them online. The client software is installed on the Internet gateways, and several analysis servers are placed on the LAN.

 

Data collection process

In practice, especially during bursts of file-transfer traffic, the data collector on the gateway skips some samples to avoid being overloaded. This behavior is configurable, and the configuration is controlled by the global server.
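The skipping behavior could be sketched roughly as follows. This is a minimal illustration, not the DataCollector's actual code: the class name `BurstSampler`, the parameters `max_per_window` and `window_seconds`, and the `update_config` method standing in for settings pushed by the global server are all assumptions.

```python
import time


class BurstSampler:
    """Keep at most `max_per_window` samples per time window; skip the rest.

    Hypothetical sketch: names and parameters are illustrative assumptions,
    not taken from the DataCollector source.
    """

    def __init__(self, max_per_window=100, window_seconds=1.0, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._kept = 0
        self.skipped = 0

    def update_config(self, max_per_window, window_seconds):
        """Apply new limits, e.g. ones pushed by the global server (assumed)."""
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds

    def accept(self):
        """Return True if the current sample should be collected."""
        now = self._clock()
        # Start a fresh window once the current one has expired.
        if now - self._window_start >= self.window_seconds:
            self._window_start = now
            self._kept = 0
        if self._kept < self.max_per_window:
            self._kept += 1
            return True
        self.skipped += 1
        return False
```

A collector loop would call `accept()` once per candidate sample and forward only the samples for which it returns True. The injectable `clock` argument is there only to make the window logic easy to test.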

Analysis algorithm

Training process

1.       For each protocol, automatically extract the keywords from its RFC document and initialize the weight of each keyword according to equation (1).

2.       Choose a channel whose protocol is already known and sample the data headers from its traffic.

3.       Tokenize the data headers and update the weight of each keyword according to equation (2).

4.       Build a keyword-weight list for each protocol.

 

The initial weight of keyword i is defined as follows, where InEntropy(i) is the inner entropy of keyword i. (Inner entropy describes a keyword's frequency across the different protocols.)

  (1)

The weight of keyword i is updated according to its occurrence frequency in the current document (data header), where M is a constant chosen in advance.

                    (2) 

Any keyword that does not appear in the RFC document has its weight initialized to 0.
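As a rough illustration of the training flow, the sketch below uses stand-in formulas, because the exact forms of equations (1) and (2) are not given in this text. Everything in it is an assumption: inner entropy is taken to be the normalized Shannon entropy of a keyword's counts across protocols, the initial weight is taken to be one minus that entropy, and the update simply adds the keyword's relative frequency in the header divided by `m`. The function names are hypothetical and are not taken from the Keyword Analyzer source.

```python
import math
from collections import Counter


def inner_entropy(counts_per_protocol):
    """Normalized Shannon entropy of a keyword's counts across protocols.

    Assumed definition: 0.0 means the keyword occurs in a single protocol,
    1.0 means it is spread evenly over all of them.
    """
    total = sum(counts_per_protocol)
    if total == 0 or len(counts_per_protocol) < 2:
        return 0.0
    probs = [c / total for c in counts_per_protocol if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts_per_protocol))


def initial_weight(counts_per_protocol):
    """Stand-in for equation (1): distinctive keywords start with more weight."""
    return 1.0 - inner_entropy(counts_per_protocol)


def update_weights(weights, header_tokens, m=10.0):
    """Stand-in for equation (2): raise each keyword's weight by its frequency.

    Keywords missing from `weights` (not in the RFC document) start at 0,
    as the text specifies. `m` plays the role of the constant M.
    """
    if not header_tokens:
        return dict(weights)
    freq = Counter(header_tokens)
    updated = dict(weights)
    for token, count in freq.items():
        relative = count / len(header_tokens)
        updated[token] = updated.get(token, 0.0) + relative / m
    return updated
```

Under these assumptions, a keyword that appears in only one protocol's RFC starts with weight 1, a keyword spread evenly over all protocols starts with weight 0, and tokens seen only in live traffic enter the list at 0 and gain weight through updates.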

Detection Process

1.       Choose a channel and sample the data headers from its traffic.

2.       Tokenize the data headers to generate a keyword vector for them.

3.       For each pre-trained protocol, take the top 100 keywords by weight to form its feature vector.

4.       Compute the dot product of the keyword vector with the feature vector of each protocol to obtain a score for each protocol.

5.       Select the protocol with the highest score as the protocol in use on the current channel.
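The detection steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the tokenizer rule (lowercase, runs of letters or digits as one token, every other non-space character as its own token) is a guess that is merely consistent with keywords such as "." and "1" appearing in the HTTP list, the keyword vector is assumed to hold raw token counts, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed rule: letter runs and digit runs are single tokens;
# every other non-space character is a token of its own.
TOKEN_RE = re.compile(r"[a-z]+|\d+|[^\sa-z\d]")


def tokenize(header):
    """Split one data header into lowercase tokens."""
    return TOKEN_RE.findall(header.lower())


def top_keywords(weights, k=100):
    """Keep the k highest-weighted keywords of one protocol."""
    best = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(best)


def score(header_tokens, feature):
    """Dot product of the header's keyword counts with a protocol's weights."""
    counts = Counter(header_tokens)
    return sum(counts[kw] * w for kw, w in feature.items())


def detect(headers, protocol_weights, k=100):
    """Return the best-scoring protocol name and the score of every protocol."""
    tokens = [t for h in headers for t in tokenize(h)]
    scores = {name: score(tokens, top_keywords(w, k))
              for name, w in protocol_weights.items()}
    return max(scores, key=scores.get), scores
```

For example, `detect(["GET /index.html HTTP/1.1"], {"http": {...}, "smtp": {...}})` returns the name of the protocol whose top keywords best match the sampled headers, together with all scores so that close calls can be inspected.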

 

Example keyword-weight list for the HTTP protocol

KEYWORD       WEIGHT
get           0.406939353189554
.             0.395007586242046
1             0.338233483751328
:             0.331044000203868
connection    0.269702470331325
-             0.184387631485061
http          0.176468927771919
"             0.0979632802361732

 

More details can be found in the Design Document.

Functionality of Server

Count the data flow for each protocol in real time.

Count the data flow for each file type in real time.

Construct the topological graph of network connections in real time.

Count the number of network attacks in real time.

 

 

Self-made Test Tool

This simple test tool was developed for stress testing and performance analysis. Before launching a test, some samples of each protocol must be prepared. The tool then generates a large volume of random data flow based on the prepared samples and the configuration.
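The generation step might look something like the sketch below. It is not the TestKit implementation: the function name `generate_flow`, the `mix` option (relative share of traffic per protocol), and the `seed` parameter are assumptions introduced for illustration.

```python
import random


def generate_flow(samples, total, mix=None, seed=None):
    """Yield `total` (protocol, payload) pairs drawn from prepared samples.

    samples: dict mapping protocol name -> list of sample payloads.
    mix:     optional dict mapping protocol name -> relative share of traffic
             (hypothetical config option); defaults to an even mix.
    seed:    optional seed so that a test run can be repeated exactly.
    """
    rng = random.Random(seed)
    protocols = sorted(samples)
    shares = [1.0 if mix is None else mix.get(p, 0.0) for p in protocols]
    for _ in range(total):
        # Pick a protocol according to the configured mix, then one of its samples.
        protocol = rng.choices(protocols, weights=shares, k=1)[0]
        yield protocol, rng.choice(samples[protocol])
```

Because the function is a generator, a stress test can request a very large `total` without holding the whole flow in memory, and a fixed `seed` makes a run reproducible when comparing performance figures.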

 

Download

Test Document

Design Document

Class Diagrams

DataCollector binary program and source code

Client software

Server software

ETReporter source code

TestKit binary program and source code

Keyword Analyzer source code