Chinese Spam Filtering Based On Back-Propagation Neural Networks
Peiguo Li1, Yan Ye2
1Department of Mathematics, Jinan University, Guangzhou, China
2Department of Computer Science, Guangzhou College of Commerce, Guangzhou, China
To cite this article:
Peiguo Li, Yan Ye. Chinese Spam Filtering Based On Back-Propagation Neural Networks. Software Engineering. Vol. 4, No. 2, 2016, pp. 9-12. doi: 10.11648/j.se.20160402.11
Received: March 30, 2016; Accepted: April 8, 2016; Published: April 16, 2016
Abstract: As email becomes an important means of communication on the Internet, the volume of spam grows every day. This paper describes a new content-based filtering model built on Back-Propagation Neural Networks (BPNN). For Chinese email, it uses the Natural Language Processing & Information Retrieval Sharing Platform (NLPIR) system to perform Chinese word segmentation. Simulation results show that the model can filter Chinese spam with high precision.
Keywords: Spam, BPNN, NLPIR
Spam, also known as junk email, is a subset of electronic spam involving nearly identical messages sent to numerous recipients by email. Spam consumes substantial Internet resources and users' time. The cost of spam in lost productivity has reached about $20 billion annually, according to the National Technology Readiness Survey.
Two general approaches are used in anti-spam systems today: filter-based and content-based. In the filter-based approach, the user or a filtering system sets up rules, such as white/black lists or specific words in the email address or email title. The user or filter must update these rules frequently to keep up with changing spamming techniques. In the content-based approach, the email server applies a classification algorithm to the email content to decide whether an email is spam. Most content-based methods use an artificial intelligence algorithm as the classifier, so they do not require complex rules to be maintained.
Several algorithms have been used for spam classification, including support vector machines, Bayesian classifiers, boosted decision trees, rough sets, neural networks, and fuzzy logic. In this paper, we use a BPNN for content-based spam filtering and the NLPIR system for Chinese word segmentation. During the training stage of the BPNN, we use a Genetic Algorithm (GA) to optimize the BPNN and obtain a better result.
2. Related Studies
2.1. The Standard BPNN
The back-propagation neural network is a widely used supervised learning algorithm. A typical BPNN (see Figure 1) has an input layer, an output layer, and a hidden layer between them. During the training stage, the current output is compared with the desired output of the training sample, and the resulting error is propagated backward to adjust the weights of the network.
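The training rule described above can be sketched in a few lines of Java. This is a minimal illustration, not the implementation used in the experiments: the class name TinyBpnn, the bias terms, the random initial weights, the squared-error loss, and the learning rate lr are all assumptions made for the example.

```java
import java.util.Random;

/** Minimal three-layer sigmoid network trained by back-propagation (sketch). */
final class TinyBpnn {
    final int nIn, nHid;
    final double[][] wIH;   // [nHid][nIn + 1]; the last column is the bias
    final double[] wHO;     // [nHid + 1]; the last entry is the bias
    final double[] hidden;  // hidden activations of the latest forward pass

    TinyBpnn(int nIn, int nHid, long seed) {
        this.nIn = nIn;
        this.nHid = nHid;
        Random r = new Random(seed);
        wIH = new double[nHid][nIn + 1];
        wHO = new double[nHid + 1];
        hidden = new double[nHid];
        for (double[] row : wIH)
            for (int i = 0; i < row.length; i++) row[i] = r.nextDouble() - 0.5;
        for (int j = 0; j < wHO.length; j++) wHO[j] = r.nextDouble() - 0.5;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Forward pass: input -> hidden -> single output in (0, 1). */
    double predict(double[] x) {
        double out = wHO[nHid];
        for (int j = 0; j < nHid; j++) {
            double z = wIH[j][nIn];
            for (int i = 0; i < nIn; i++) z += wIH[j][i] * x[i];
            hidden[j] = sigmoid(z);
            out += wHO[j] * hidden[j];
        }
        return sigmoid(out);
    }

    /** One gradient step on 0.5 * (y - target)^2; returns the error before the update. */
    double train(double[] x, double target, double lr) {
        double y = predict(x);
        double deltaOut = (y - target) * y * (1 - y);
        for (int j = 0; j < nHid; j++) {
            // Hidden delta uses the output weight before it is updated.
            double deltaHid = deltaOut * wHO[j] * hidden[j] * (1 - hidden[j]);
            wHO[j] -= lr * deltaOut * hidden[j];
            for (int i = 0; i < nIn; i++) wIH[j][i] -= lr * deltaHid * x[i];
            wIH[j][nIn] -= lr * deltaHid;
        }
        wHO[nHid] -= lr * deltaOut;
        double e = y - target;
        return 0.5 * e * e;
    }
}
```

Calling train repeatedly on labeled samples drives the output toward the desired value; each call is one compare-and-adjust cycle of the kind described above.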
The standard BPNN has two main problems: its training is slow, and it easily falls into a local optimum.
To address these problems, we introduce a Genetic Algorithm (GA) to speed up training and avoid local optima. In this study, we combine the GA and the BPNN in five steps:
1. Encode: design an encoding scheme for all the weights of the network, such as:
2. Initialization: initialize a random population of N chromosomes (encoded weights of the network);
3. Fitness: construct a BPNN by decoding every chromosome in the population, and calculate the mean square error of the BPNN as the fitness of the current chromosome;
4. GA loop: evolve the population over successive generations with the genetic operators;
5. Test: if the end condition is satisfied, stop the loop and take the best BPNN found.
Figure 2 shows the whole picture of this process.
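The five steps can be sketched as a small GA in plain Java. This is an illustrative sketch rather than the JGAP-based implementation: the paper does not specify the genetic operators, so the binary tournament selection, single-point crossover, Gaussian mutation (rate 0.05), elitism, bias genes, and the stopping threshold used here are all assumptions, as are the names GaWeightSearch, evolve, and pick.

```java
import java.util.Random;

/** Plain-Java GA over flat weight vectors (illustrative; not the JGAP API). */
final class GaWeightSearch {
    /** Step 1: genes needed for an nIn-nHid-1 network, biases included. */
    static int chromosomeLength(int nIn, int nHid) {
        return nHid * (nIn + 1) + (nHid + 1);
    }

    /** Decodes chromosome w into the network and returns its output for x. */
    static double forward(double[] w, int nIn, int nHid, double[] x) {
        int k = 0;
        double[] h = new double[nHid];
        for (int j = 0; j < nHid; j++) {
            double z = 0;
            for (int i = 0; i < nIn; i++) z += w[k++] * x[i];
            z += w[k++];                                   // hidden bias
            h[j] = 1.0 / (1.0 + Math.exp(-z));
        }
        double out = 0;
        for (int j = 0; j < nHid; j++) out += w[k++] * h[j];
        out += w[k];                                       // output bias
        return 1.0 / (1.0 + Math.exp(-out));
    }

    /** Step 3: fitness is the mean square error (lower is better). */
    static double mse(double[] w, int nIn, int nHid, double[][] xs, double[] ys) {
        double sum = 0;
        for (int n = 0; n < xs.length; n++) {
            double e = forward(w, nIn, nHid, xs[n]) - ys[n];
            sum += e * e;
        }
        return sum / xs.length;
    }

    /** Binary tournament: index of the fitter of two random individuals. */
    static int pick(double[] err, Random r) {
        int a = r.nextInt(err.length), b = r.nextInt(err.length);
        return err[a] <= err[b] ? a : b;
    }

    static double[] evolve(int nIn, int nHid, double[][] xs, double[] ys,
                           int popSize, int generations, long seed) {
        Random r = new Random(seed);
        int len = chromosomeLength(nIn, nHid);
        double[][] pop = new double[popSize][len];
        for (double[] c : pop)                             // step 2: random population
            for (int i = 0; i < len; i++) c[i] = r.nextGaussian();
        double[] best = pop[0].clone();
        double bestErr = Double.MAX_VALUE;
        for (int g = 0; g <= generations; g++) {
            double[] err = new double[popSize];
            for (int p = 0; p < popSize; p++) {            // step 3: evaluate fitness
                err[p] = mse(pop[p], nIn, nHid, xs, ys);
                if (err[p] < bestErr) { bestErr = err[p]; best = pop[p].clone(); }
            }
            if (g == generations || bestErr < 1e-4) break; // step 5: end condition
            double[][] next = new double[popSize][];
            next[0] = best.clone();                        // elitism keeps the best
            for (int p = 1; p < popSize; p++) {            // step 4: GA loop
                double[] a = pop[pick(err, r)], b = pop[pick(err, r)];
                int cut = r.nextInt(len);                  // single-point crossover
                double[] child = new double[len];
                for (int i = 0; i < len; i++) {
                    child[i] = i < cut ? a[i] : b[i];
                    if (r.nextDouble() < 0.05) child[i] += 0.5 * r.nextGaussian();
                }
                next[p] = child;
            }
            pop = next;
        }
        return best;
    }
}
```

Because the GA only needs the fitness value, not a gradient, it can jump between distant regions of weight space, which is the property relied on to escape local optima.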
2.3. NLPIR System
The NLPIR segmentation system was developed from the ICTCLAS system. Its main functions include Chinese word segmentation, POS tagging, named entity recognition, user dictionaries, and keyword identification. The system works with GBK, UTF-8, and Big5 encoded content, and provides APIs for C/C#, Java, and other languages.
In this paper, we use this system to segment the Chinese email content. After segmentation, we obtain a set of keywords from the email content, which form the input values of the BPNN.
3. Chinese Spam Filter Model
Our spam filter consists of four models: a segmentation model, a keyword extraction model, a spam filtering model, and a user feedback model.
3.1. Segmentation Model
The main job of this model is to segment the email content. Before segmentation, the email content is preprocessed:
1. Identify the encoding of the email;
2. Remove the header data of the email;
3. Remove specific characters, such as spaces and slashes; some spammers intentionally insert these characters to interfere with the spam filter.
After this preprocessing, the email content is passed to the segmentation system, which returns the resulting words. Figure 3 shows the workflow of this part.
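Steps 2 and 3 of the preprocessing might look as follows in Java. This is a sketch under stated assumptions: the raw message is assumed to be already decoded into a String (step 1 is not shown), the header is assumed to end at the first empty line, and the set of removed characters (whitespace, the full-width space, slash, backslash) is an illustrative guess, not the paper's list. EmailPreprocessor and its method names are hypothetical.

```java
/** Illustrative preprocessing before word segmentation. */
final class EmailPreprocessor {
    /** Drops the header block: everything up to the first empty line. */
    static String stripHeaders(String raw) {
        String normalized = raw.replace("\r\n", "\n");
        int idx = normalized.indexOf("\n\n");
        return idx < 0 ? normalized : normalized.substring(idx + 2);
    }

    /** Removes whitespace (including U+3000), slashes, and backslashes. */
    static String stripNoise(String body) {
        return body.replaceAll("[\\s\\u3000/\\\\]+", "");
    }

    static String preprocess(String raw) {
        return stripNoise(stripHeaders(raw));
    }
}
```

Removing all spaces is reasonable for Chinese text, where words are not space-delimited and the segmenter restores word boundaries; for mixed Chinese and English content, a gentler rule would be needed.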
3.2. Keyword Extraction Model
Among the large number of words generated by segmentation, only some keywords are helpful for spam filtering, so we need to extract the keywords from the full word library. In this study, we select keywords according to the probability that a word appears in spam, i.e.
At the same time, we must avoid selecting meaningless words, such as "yes" and "no", as keywords. These words appear frequently in all emails, but they do not help in filtering spam.
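One way to realize this selection is sketched below. The scoring formula is an assumption: each word is scored by the fraction of emails containing it that are spam, words on a stop list or seen in too few spam emails are skipped, and the top-scoring words are kept. KeywordSelector, minSpamCount, and topK are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative keyword selection by spam probability. */
final class KeywordSelector {
    /** Convenience helper: builds a word set from literals. */
    static Set<String> words(String... w) {
        return new HashSet<>(Arrays.asList(w));
    }

    /** Counts, for each word, the number of emails that contain it. */
    static Map<String, Integer> docFreq(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs)
            for (String w : doc) df.merge(w, 1, Integer::sum);
        return df;
    }

    /** Returns up to topK words, ranked by spamCount / (spamCount + hamCount). */
    static List<String> select(List<Set<String>> spam, List<Set<String>> ham,
                               Set<String> stopWords, int minSpamCount, int topK) {
        Map<String, Integer> spamDf = docFreq(spam);
        Map<String, Integer> hamDf = docFreq(ham);
        Map<String, Double> score = new HashMap<>();
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : spamDf.entrySet()) {
            String w = e.getKey();
            int s = e.getValue();
            if (stopWords.contains(w) || s < minSpamCount) continue;
            int h = hamDf.getOrDefault(w, 0);
            score.put(w, s / (double) (s + h));
            candidates.add(w);
        }
        // Highest score first; ties are broken alphabetically for determinism.
        candidates.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(topK, candidates.size())));
    }
}
```

The stop list handles words such as "yes" and "no" explicitly; the minimum count keeps rare words from receiving a perfect score on the strength of a single spam email.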
Using the table of keywords, every email is converted to a feature vector. Each element of the vector is 1 or 0, indicating whether the corresponding keyword appears in the email or not; see Table 1 and Table 2.
| word | email 1 | email 2 | email 3 |
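The conversion from a segmented email to a 0/1 feature vector can be sketched as follows. FeatureVectors and encode are hypothetical names; the vector is returned as double[] only so that it can be fed directly to a neural network input layer.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative binary encoding of an email against a keyword table. */
final class FeatureVectors {
    /** Element i is 1.0 if keywords.get(i) occurs in the email, else 0.0. */
    static double[] encode(List<String> keywords, Collection<String> emailWords) {
        Set<String> present = new HashSet<>(emailWords);
        double[] v = new double[keywords.size()];
        for (int i = 0; i < v.length; i++)
            v[i] = present.contains(keywords.get(i)) ? 1.0 : 0.0;
        return v;
    }
}
```

With a table of 30 keywords, each email becomes a vector of length 30, one value per input node of the network.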
3.3. BPNN Filtering Model
As described in 2.2, we use a GA-BPNN as our spam filtering model. It is a three-layer network with 30 input nodes, 10 hidden nodes, and 1 output node.
During the training stage of the BPNN, we use the GA to optimize the weights of the BPNN, as shown in Figure 4.
3.4. User Feedback
In practical applications, a user feedback model should be added so that the user can confirm whether an email is spam or not. The filter can then readjust the weights of the BPNN, i.e., keep learning in practice.
Figure 5 shows the framework of the whole system.
In this section, we implement our GA-BPNN spam filter in Java and evaluate the effectiveness of our proposal.
4.1. Experimental Settings
We use the open source project JOONE (Java Object Oriented Neural Engine) as the implementation of the BPNN. The JOONE parameters are:
1. The input layer is a LinearLayer, with no transfer function; the hidden layer and output layer use the sigmoid transfer function, i.e., SigmoidLayer;
2. The network has three layers, with 30 input nodes, 10 hidden nodes, and 1 output node;
3. The three layers are linked with FullSynapse.
For the Genetic Algorithm, we use another open source project, JGAP (Java Genetic Algorithms Package). The JGAP settings are:
1. The vector of the BPNN's weights is used as the chromosome (see Formula 1);
2. The initial population is generated randomly;
3. The mean square error of the BPNN output is used as the fitness.
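As a rough size check, assuming the chromosome holds one gene per connection weight of the fully connected 30-10-1 network, its length would be:

```latex
N_{\text{weights}} = 30 \times 10 + 10 \times 1 = 310
```

If the 11 bias terms of the hidden and output nodes are also encoded (an assumption; the text does not say), the length becomes 321.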
For segmentation, we use the NLPIR system (ICTCLAS 2013 version). Through the JNI framework, we call the NLPIR APIs to perform the segmentation.
As testing samples, we use the email set collected in 2005 by CCERT, from which we choose about 6,000 emails, including both spam and normal emails.
4.2. Results and Analysis
At first, we used all the samples in the email set, about 30,000 emails, to train our GA-BPNN filter. The segmentation system finished quickly and produced the keyword table, but convergence was very slow: after about 16 hours, the training had still not completed. We therefore reduced the number of training samples from 30,000 to 10,000, after which the training completed in an acceptable time.
We ran the test in two ways: one with the GA-optimized filter and one with the BPNN-only filter. Table 3 shows the testing results with the GA-optimized filter:
| Training samples | Training time | Testing samples | Recall rate | Precision rate |
Table 4 shows the testing results with the BPNN-only filter:
| Training samples | Training time | Testing samples | Recall rate | Precision rate |
As the testing results show, the filter detects spam precisely after being trained on an appropriate sample set. With GA optimization, the filter trained faster and achieved a higher precision rate. However, the recall rate is not ideal in either case; we believe the reason is the limited set of training samples. This limitation narrows the scope of the keywords, which in turn lowers the recall rate of the filter.
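For reference, the two rates can be computed as sketched below, assuming the standard definitions: recall is the share of actual spam that the filter flags, and precision is the share of flagged emails that are actually spam. SpamMetrics is a hypothetical helper, not part of the experimental code.

```java
/** Illustrative recall and precision from labeled predictions. */
final class SpamMetrics {
    final int tp, fp, fn, tn;

    SpamMetrics(boolean[] actualSpam, boolean[] predictedSpam) {
        if (actualSpam.length != predictedSpam.length)
            throw new IllegalArgumentException("length mismatch");
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < actualSpam.length; i++) {
            if (predictedSpam[i] && actualSpam[i]) tp++;   // spam caught
            else if (predictedSpam[i]) fp++;               // normal email flagged
            else if (actualSpam[i]) fn++;                  // spam missed
            else tn++;                                     // normal email passed
        }
        this.tp = tp;
        this.fp = fp;
        this.fn = fn;
        this.tn = tn;
    }

    /** Fraction of actual spam that was flagged. */
    double recall() {
        return tp + fn == 0 ? 0.0 : tp / (double) (tp + fn);
    }

    /** Fraction of flagged emails that are actually spam. */
    double precision() {
        return tp + fp == 0 ? 0.0 : tp / (double) (tp + fp);
    }
}
```

A filter that flags only emails containing a narrow set of keywords tends to show exactly this pattern: few false alarms (high precision) but many missed spam emails (low recall).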
In this study, we propose a GA-BPNN-based spam filter model for Chinese emails and use the NLPIR system for segmentation. We implement the model in Java using open source projects and obtain experimental results. The experiments demonstrate the effectiveness of our model and also reveal aspects that need improvement. In the future, we will optimize the BPNN structure and the Genetic Algorithm to improve the performance of our model.
In practice, spam filtering needs to combine multiple techniques, including content-based and filter-based ones. Only a comprehensive spam filtering system can fight off spammers.