Online Semantic Analysis over Big Network Data

by Chengchen Hu and Yuming Jiang

Mobile applications, web services and social media generate huge volumes of data traffic from many sources. This traffic is described as ‘big network data’ with the ‘4V characteristics’: variety, velocity, volume and veracity. Analysis of such big network data can uncover problems, reveal opportunities and advise service and network providers on fine-grained traffic engineering with close to real-time adjustment or application acceleration [1].

Different applications tend to integrate several functionalities with various data formats. For example, the Twitter application produces network traffic from tweeting, posting pictures and embedding video. Classical methods are limited to protocol or application identification [2]. When semantic information is the target, we need to go beyond packet and application analysis [3]. The first goal is therefore for our method to exhibit fine-grained awareness: it should analyze user behaviour rather than only identify traffic as belonging to a certain application. This implies that we need a general grammar to associate the unstructured data with user behaviour.

The second goal is to develop a flexible and uniform specification of user semantics derived from network traffic. In previous work, heterogeneous and unstructured big network data in different formats were studied separately. It is a challenge to normalize these independent data structures and describe user behaviour in a unified framework, so that a comprehensive analysis can be conducted in a fine-grained semantic manner.

Finally, the third goal is for the method to operate at (or close to) wire speed. Even when real-time analysis is not strictly needed, an off-line method is limited by storage, because analysis capacity and capabilities cannot keep up with the rate at which data is produced. Extracting and storing only useful information is a viable approach that needs to be explored further.

To achieve these goals, we apply Deep Semantic Inspection (DSI), which uses a standard description to unify the various formats of different applications and ultimately obtain user semantics. Our basic idea is to extract a minimal but complete semantic for each user behaviour at wire speed, and then apply data analysis and data mining to the small semantic data instead of the raw traffic data. This process purifies the raw traffic and reduces the data volume by several orders of magnitude. Our preliminary experiment shows that the compression ratio between the raw traffic volume and the output of our approach is about three orders of magnitude. As a result, the data volume passed to further user-defined high-level analyses is significantly reduced, which makes it feasible to handle big, and increasing, network data.
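To give a feeling for what a reduction of three orders of magnitude means for storage, the following back-of-the-envelope sketch computes the daily data volume before and after reduction. The 10 Gbps link rate and the exact factor of 1000 are illustrative assumptions, not measurements from our experiment, and the function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def daily_bytes(link_gbps: float) -> float:
    """Bytes carried in one day by a link fully utilised at link_gbps."""
    return link_gbps * 1e9 / 8 * SECONDS_PER_DAY


def orders_of_magnitude(raw_bytes: float, semantic_bytes: float) -> float:
    """Size reduction expressed as a base-10 logarithm of the ratio."""
    return math.log10(raw_bytes / semantic_bytes)


# Assumption: a hypothetical 10 Gbps link running at full load.
raw = daily_bytes(10.0)
# Assumption: semantic data is exactly 1000 times smaller than raw traffic.
semantic = raw / 1_000

print(f"raw: {raw / 1e12:.0f} TB/day, semantic: {semantic / 1e9:.0f} GB/day")
# prints "raw: 108 TB/day, semantic: 108 GB/day"
```

Under these assumptions, roughly 108 TB of raw traffic per day shrinks to roughly 108 GB of semantic data, which is a volume that ordinary storage and data mining tools can handle.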

Figure 1: An example and processing pipeline of DSI/SOLID.

We have designed and implemented a cross-platform system, named Semantic On-Line Intent Detection (SOLID), to realize our DSI approach. As shown in Figure 1, SOLID builds a processing pipeline in its kernel space, where a semantic parser translates the segments into the application semantic. A matching engine then compares the application semantic with predefined specifications and outputs the user sketches. The kernel design allows simultaneous processing of multiple application specifications and multiple protocol data units (PDUs). Our implementation has demonstrated that, on a real traffic trace, the SOLID system achieves 17.6 Gbps throughput at a cost of 709 MB of memory on an x86 platform. To make the SOLID system flexible enough to support various analyses in the user space, we are now working to improve the interface abstraction between the kernel and user space.
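The two pipeline stages can be illustrated with a minimal Python sketch. It is not the SOLID implementation, which runs in kernel space at wire speed; it only shows the idea of parsing a segment into an application semantic and matching it against predefined specifications. The `Spec` fields, the host `social.example`, the paths and the behaviour names are all hypothetical, and a plain HTTP request stands in for a reassembled segment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Spec:
    """Hypothetical specification mapping request features to a behaviour."""
    app: str
    behaviour: str
    method: str
    host_suffix: str
    path_prefix: str


# Invented specifications for an imaginary social application.
SPECS: List[Spec] = [
    Spec("example-social", "post_text", "POST", "social.example", "/status/update"),
    Spec("example-social", "post_picture", "POST", "social.example", "/media/upload"),
]


def parse_semantic(segment: bytes) -> Optional[Dict[str, str]]:
    """Parser stage: extract method, path and host from an HTTP request."""
    try:
        head = segment.split(b"\r\n\r\n", 1)[0].decode("ascii")
    except UnicodeDecodeError:
        return None
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        return None
    method, path, _version = parts
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return {"method": method, "path": path, "host": headers.get("host", "")}


def match(semantic: Dict[str, str], specs: List[Spec]) -> Optional[Dict[str, str]]:
    """Matching stage: return a compact user sketch for the first matching spec."""
    for spec in specs:
        if (semantic["method"] == spec.method
                and semantic["host"].endswith(spec.host_suffix)
                and semantic["path"].startswith(spec.path_prefix)):
            return {"app": spec.app, "behaviour": spec.behaviour}
    return None


def sketch_for(segment: bytes) -> Optional[Dict[str, str]]:
    """Run both stages; only the small sketch is kept, not the payload."""
    semantic = parse_semantic(segment)
    return match(semantic, SPECS) if semantic is not None else None


segment = (b"POST /media/upload?id=1 HTTP/1.1\r\n"
           b"Host: api.social.example\r\n\r\n<image bytes>")
print(sketch_for(segment))
```

The point of the sketch is that the output is a few dozen bytes describing what the user did, whereas the input segment may carry an entire uploaded image. A linear scan over the specifications is used here only for readability; it says nothing about how the SOLID matching engine is organized.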

A few practical cases have been used to demonstrate the flexibility of SOLID, but more cases need to be investigated. This is necessary in order to acquire a thorough understanding of big network data with real traffic to analyze various factors, such as the application performance, user profiling, CP competition and application correlations.

The authors would like to thank Poul Heegaard of NTNU for his valuable suggestions on this article.

References:
[1] T. Benson et al.: “Fine grained traffic engineering for data centers”, in Proc. of CoNEXT ’11 (New York, NY, USA, 2011), ACM, pp. 8:1–8:12.
[2] F. Yu et al.: “Fast and memory-efficient regular expression matching for deep packet inspection”, in Proc. of ANCS 2006 (Dec. 2006), pp. 93–102.
[3] H. Li, C. Hu: “ROOM: Rule organized optimal matching for fine-grained traffic identification”, in Proc. of INFOCOM 2013, IEEE (2013), pp. 65–69.

Please contact:
Chengchen Hu
ERCIM Fellow at NTNU, Norway

Yuming Jiang
NTNU, Norway