Final Report for CS5611 (Fuzzy Sets: Theory and Applications):

Categorizing Chinese News


陳盛文
(mr854304, first-year M.S. student in Information Science)


黃何書安
(mr854333, first-year M.S. student in Information Science)


楊倍青
(mr854359, first-year M.S. student in Information Science)


Table of Contents

Abstract
Problem Definition
Data Set Description
Approach
Simulation Results
Concluding Remarks and Future Work
Computer Programs
Division of Labor
References


Abstract

In this project we apply four different clustering methods to Chinese news documents. Our main goal is not to compare their performance but to examine the practicability of categorizing Chinese documents. In traditional information retrieval (IR), documents are indexed by some indexing scheme and retrieved by users according to their queries. We do not index them. Instead, we represent them as vectors and try several clustering and classification methods. The results are further compared to human experience -- the categories assigned by the reporters.

Problem Definition

The categorization of text documents is an interesting and challenging problem in information retrieval (IR). Many techniques, e.g. linguistic approaches, may be applied to categorize Chinese news documents. We adopt a statistical approach to tackle this problem. First, we count the frequencies of Chinese characters. Second, we analyze the distribution of the characters. Finally, we categorize documents according to the analysis results. The judging criterion is based on the reporters' categorization, but this criterion is not entirely objective.

Data Set Description

We gathered articles from the on-line Chinese Daily News for one week. A total of 601 documents written by professional reporters were collected from nine categories: Economy, Editorial, Entertainment, Focus, International, Mainland, Social, Sports and Taiwan. Each document contains about 500-2000 Chinese characters. The documents were first transformed into vectors according to the frequency of occurrence of characters. All document vectors were then clustered or classified into several classes, and the experimental results were compared to the original nine categories.

Approach

  1. Preprocessing

Each document is first parsed into Chinese characters. The Chinese character vocabulary contains about 13000 characters, of which fewer than 5000 are commonly used. We keep the granularity down to the character level to avoid having to use linguistic knowledge to segment Chinese text into words. In this approach, no stemming, stop-word list, or thesaurus is needed. Since the character is the basic processing unit and plays the same role as a "term" in traditional IR terminology, we will use "term" to refer to a character in the rest of the paper.
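
To make this step concrete, the following minimal Python sketch counts character frequencies for one document; the Unicode-range filter for CJK characters is an illustrative assumption, not a detail of our actual program.

    from collections import Counter

    def char_counts(text):
        # Keep only CJK ideographs; punctuation and ASCII are dropped,
        # since the character is our only processing unit and no word
        # segmentation is performed.
        return Counter(ch for ch in text if '\u4e00' <= ch <= '\u9fff')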

In the parsing process, both term frequencies and document frequencies are counted. We represent the weight of a term in a given document by adopting Salton's well-tested formula in IR, i.e. the term frequency (tf) multiplied by the logarithm of the inverse document frequency (idf) [Salton 89]. Namely, the weight of a term t in a given document d, denoted w(t, d), is

w(t, d) = tf(t, d) * log(N / df(t))

where N is the total number of documents in the collection, tf(t, d) is the term frequency of term t in document d, and df(t) is the document frequency of term t in the collection.
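
A short Python sketch of this weighting scheme, assuming each document has already been reduced to a Counter of term frequencies as in the preprocessing step:

    import math
    from collections import Counter

    def tfidf_vectors(docs_counts):
        # docs_counts: one Counter per document, mapping term -> tf.
        N = len(docs_counts)
        df = Counter()
        for counts in docs_counts:
            df.update(counts.keys())   # each term counted once per document
        # w(t, d) = tf(t, d) * log(N / df(t))
        return [{t: tf * math.log(N / df[t]) for t, tf in counts.items()}
                for counts in docs_counts]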

A document can then be represented as a vector V with elements v1, v2, ..., vn, where n is the size of the character vocabulary and vi is the weight of term i in the document [Yan 94]. For convenience and by convention, all vectors are normalized. We can then easily calculate the similarity between two documents Di and Dj as the cosine of the angle between their vector representations:

Similarity(Di, Dj) = (Vi * Vj) / (||Vi|| ||Vj||)

where * represents the inner product of two vectors and || || represents the norm of a vector. Based on this formula, two documents with completely disjoint character sets have zero similarity, because the inner product of their document vectors is zero.
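
The normalization and similarity computation can be sketched as follows (sparse dict vectors, as produced by the tf-idf sketch above):

    import math

    def normalize(vec):
        # Scale a sparse vector (term -> weight) to unit length.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()}

    def similarity(di, dj):
        # Cosine similarity; for unit vectors this is just the inner
        # product over the terms the two documents share, so documents
        # with disjoint character sets score exactly zero.
        return sum(w * dj[t] for t, w in di.items() if t in dj)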

  2. SVD

Singular Value Decomposition (SVD) is a reliable tool for matrix factorization [Lang 96]. For any matrix A, A^T A has nonnegative eigenvalues. The nonnegative square roots of the eigenvalues of A^T A are called the singular values of A, and the number of non-zero singular values equals the rank of A, rank(A). If A is an m*n matrix and rank(A) = r, the SVD of A is defined as

A = U W V^T

where the columns of U and V are orthonormal and W is the r*r diagonal matrix of singular values. An approximation A_k of the matrix A, the truncated SVD of A, is obtained by keeping only the k largest singular values:

A_k = U_k W_k V_k^T

Using the SVD technique, a term-by-document matrix A is mapped into a reduced k*n matrix represented by W_k V_k^T, which relates k factors to n documents.
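
As a sketch of this reduction (using numpy to compute the SVD; not necessarily how our program does it):

    import numpy as np

    def reduced_representation(A, k):
        # A: m*n term-by-document matrix. Returns the k*n matrix
        # W_k V_k^T, which relates k factors to the n documents.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U W V^T
        return np.diag(s[:k]) @ Vt[:k, :]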

  3. Fuzzy ARTMAP

Fuzzy ART is similar to ART1, but uses fuzzy operations instead of logical operations for weight updating and category choice. The architecture of Fuzzy ARTMAP consists of two Fuzzy ART modules.

Figure 1. Fuzzy ART module

Fuzzy ART uses three layers (F0, F1, F2) to cluster input data. F0 holds the raw (preprocessed) data, F2 selects the node with the highest choice-function score and sends it to F1, and F1 checks whether that node satisfies the vigilance criterion.

Fuzzy ARTMAP contains two Fuzzy ART modules [Carpenter et al. 92]. In our experiment, however, the ARTb module reduces to single category labels (e.g. 2: Entertainment, 8: Sports).
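
The F2/F1 interaction described above can be sketched as follows; the parameter values and the handling of uncommitted nodes are illustrative simplifications, not the exact settings of our experiment:

    import numpy as np

    def fuzzy_art_step(I, W, alpha=0.001, rho=0.75, beta=1.0):
        # One presentation of input I (values in [0, 1]) to a Fuzzy ART
        # module whose rows of W are category weights. Fuzzy AND is the
        # elementwise minimum; |.| is the L1 norm.
        scores = [np.minimum(I, w).sum() / (alpha + w.sum()) for w in W]
        for j in np.argsort(scores)[::-1]:          # F2: highest score first
            if np.minimum(I, W[j]).sum() / I.sum() >= rho:   # F1: vigilance
                W[j] = beta * np.minimum(I, W[j]) + (1 - beta) * W[j]
                return j                             # resonance: w_j updated
        return None                                  # all committed nodes fail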

  4. KNNR (Please refer to Jang's neuro-fuzzy textbook.)
  5. K-Means (Please refer to Jang's neuro-fuzzy textbook.)
  6. Fuzzy K-Means (Please refer to Jang's neuro-fuzzy textbook; a sketch is given after this list.)
  7. Joining (Tree Clustering): a trivial clustering method based on pairwise similarity comparison.
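
As an example of the listed methods, here is a plain sketch of fuzzy k-means (fuzzy c-means) on the document vectors, following the standard textbook update equations rather than our tuned implementation:

    import numpy as np

    def fuzzy_k_means(X, k, m=2.0, iters=100, seed=0):
        # X: n*p array of (normalized) document vectors; m > 1 is the
        # fuzzifier. Returns cluster centers C and memberships U (k*n).
        rng = np.random.default_rng(seed)
        U = rng.random((k, len(X)))
        U /= U.sum(axis=0)                 # memberships of each doc sum to 1
        for _ in range(iters):
            Um = U ** m
            C = Um @ X / Um.sum(axis=1, keepdims=True)      # update centers
            d = np.linalg.norm(X[None, :, :] - C[:, None, :], axis=2) + 1e-9
            U = d ** (-2.0 / (m - 1.0))    # closer center => larger membership
            U /= U.sum(axis=0)
        return C, U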

Simulation Results

Concluding Remarks and Future Work

Computer Programs

Division of Labor

References