Final Report for CS5611 (Fuzzy Sets: Theory and Applications):

Categorizing Chinese News


陳盛文
(mr854304, first-year M.S. student in Information Science)


黃何書安
(mr854333, first-year M.S. student in Information Science)


楊倍青
(mr854359, first-year M.S. student in Information Science)


Table of Contents

Abstract
Problem Definition
Data Set Description
Approach
Simulation Results
Concluding Remarks and Future Work
Computer Programs
Division of Labor
References


Abstract

In this project we apply four different clustering methods to Chinese news documents. Our main goal is not to compare their performance but to examine the practicability of categorizing Chinese documents. In traditional information retrieval (IR), documents are indexed by some indexing scheme and retrieved by users according to their queries. We do not index them. Instead, we represent them as vectors and try several clustering and classification methods. The results are further compared to human experience -- the categories assigned by the reporters.

Problem Definition

The categorization of text documents is an interesting and challenging problem in information retrieval (IR). Many techniques, e.g. linguistic approaches, may be applied to categorize Chinese news documents. We adopt a statistical approach to tackle this problem. First, we count the frequencies of Chinese characters. Second, we analyze the distribution of the characters. Finally, we categorize documents according to the analysis results. The judging criterion is based on the reporters' categorization, but this criterion is not entirely objective.

Data Set Description

We gathered articles from the on-line Chinese Daily News for one week. A total of 601 documents written by professional reporters were collected from nine categories: Economy, Editorial, Entertainment, Focus, International, Mainland, Social, Sports and Taiwan. Each document contains about 500-2000 Chinese characters. The documents were first transformed into vectors according to the frequency of occurrence of characters. All document vectors were then clustered or classified into several classes, and the experimental results were compared to the original nine categories.

Approach

  1. Preprocessing

Each document is first parsed into Chinese characters. The Chinese character vocabulary contains about 13000 characters, of which fewer than 5000 are commonly used. We keep the granularity down to the character level to avoid having to use linguistic knowledge to segment Chinese text into words. In this approach, no stemming, stop-word list, or thesaurus is needed. Since the character is the basic processing unit and plays the same role as a "term" in traditional IR terminology, we will use "term" to refer to a character in the rest of the paper.
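
To make this step concrete, the following minimal Python sketch counts character frequencies for one document; the Unicode-range filter for CJK characters is an illustrative assumption, not a detail of our actual program.

    from collections import Counter

    def char_counts(text):
        # Keep only CJK ideographs; punctuation and ASCII are dropped,
        # since the character is our only processing unit and no word
        # segmentation is performed.
        return Counter(ch for ch in text if '\u4e00' <= ch <= '\u9fff')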

In the parsing process, both term frequencies and document frequencies are counted. We represent the weight of a term in a given document by adopting Salton's well-tested formula in IR, i.e. the term frequency (tf) multiplied by the logarithm of the inverse document frequency (idf) [Salton 89]. Namely, the weight of a term t in a given document d, denoted w(t, d), is

w(t, d) = tf(t, d) * log(N / df(t))

where N is the total number of documents in the collection, tf(t, d) is the term frequency of term t in document d, and df(t) is the document frequency of term t in the collection.
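
A short Python sketch of this weighting scheme, assuming each document has already been reduced to a Counter of term frequencies as in the preprocessing step:

    import math
    from collections import Counter

    def tfidf_vectors(docs_counts):
        # docs_counts: one Counter per document, mapping term -> tf.
        N = len(docs_counts)
        df = Counter()
        for counts in docs_counts:
            df.update(counts.keys())   # each term counted once per document
        # w(t, d) = tf(t, d) * log(N / df(t))
        return [{t: tf * math.log(N / df[t]) for t, tf in counts.items()}
                for counts in docs_counts]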

A document can then be represented as a vector V with elements v1, v2, ..., vn, where n is the size of the character vocabulary and vi is the weight of term i in the document [Yan 94]. For convenience and by convention, all vectors are normalized. We can then easily calculate the similarity between two documents Di and Dj as the cosine of the angle between their vector representations:

Similarity(Di, Dj) = (Vi * Vj) / (||Vi|| ||Vj||)

where * represents the inner product of two vectors and || || represents the norm of a vector. Based on this formula, two documents with completely disjoint character sets have zero similarity, because the inner product of their document vectors is zero.
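
The normalization and similarity computation can be sketched as follows (sparse dict vectors, as produced by the tf-idf sketch above):

    import math

    def normalize(vec):
        # Scale a sparse vector (term -> weight) to unit length.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()}

    def similarity(di, dj):
        # Cosine similarity; for unit vectors this is just the inner
        # product over the terms the two documents share, so documents
        # with disjoint character sets score exactly zero.
        return sum(w * dj[t] for t, w in di.items() if t in dj)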

  2. SVD

Singular Value Decomposition (SVD) is a reliable tool for matrix factorization [Lang 96]. For any matrix A, A^T A has nonnegative eigenvalues. The nonnegative square roots of the eigenvalues of A^T A are called the singular values of A, and the number of non-zero singular values equals the rank of A, rank(A). If A is an m*n matrix and rank(A) = r, the SVD of A is defined as

A = U W V^T

where the columns of U and V are orthonormal and W is the r*r diagonal matrix of singular values. An approximation A_k of the matrix A, the truncated SVD of A, is obtained by keeping only the k largest singular values:

A_k = U_k W_k V_k^T

Using the SVD technique, a term-by-document matrix A is mapped into a reduced k*n matrix represented by W_k V_k^T, which relates k factors to n documents.
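
As a sketch of this reduction (using numpy to compute the SVD; not necessarily how our program does it):

    import numpy as np

    def reduced_representation(A, k):
        # A: m*n term-by-document matrix. Returns the k*n matrix
        # W_k V_k^T, which relates k factors to the n documents.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U W V^T
        return np.diag(s[:k]) @ Vt[:k, :]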

  3. Fuzzy ARTMAP

Fuzzy ART is similar to ART1, but uses fuzzy operations instead of logical operations for weight updating and category choice. The architecture of Fuzzy ARTMAP consists of two Fuzzy ART modules.

Figure 1. Fuzzy ART module

Fuzzy ART uses three layers (F0, F1, F2) to cluster input data. F0 holds the raw (preprocessed) data, F2 selects the node with the highest choice-function score and sends it to F1, and F1 checks whether that node satisfies the vigilance criterion.

Fuzzy ARTMAP contains two Fuzzy ART modules [Carpenter et al. 92]. In our experiment, however, the ARTb module reduces to single category labels (e.g. 2: Entertainment, 8: Sports).
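
The F2/F1 interaction described above can be sketched as follows; the parameter values and the handling of uncommitted nodes are illustrative simplifications, not the exact settings of our experiment:

    import numpy as np

    def fuzzy_art_step(I, W, alpha=0.001, rho=0.75, beta=1.0):
        # One presentation of input I (values in [0, 1]) to a Fuzzy ART
        # module whose rows of W are category weights. Fuzzy AND is the
        # elementwise minimum; |.| is the L1 norm.
        scores = [np.minimum(I, w).sum() / (alpha + w.sum()) for w in W]
        for j in np.argsort(scores)[::-1]:          # F2: highest score first
            if np.minimum(I, W[j]).sum() / I.sum() >= rho:   # F1: vigilance
                W[j] = beta * np.minimum(I, W[j]) + (1 - beta) * W[j]
                return j                             # resonance: w_j updated
        return None                                  # all committed nodes fail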

  4. KNNR (Please refer to Jang's neuro-fuzzy textbook.)
  5. K-Means (Please refer to Jang's neuro-fuzzy textbook.)
  6. Fuzzy K-Means (Please refer to Jang's neuro-fuzzy textbook; a sketch is given after this list.)
  7. Joining (Tree Clustering): a trivial clustering method based on pairwise similarity comparison.
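
As an example of the listed methods, here is a plain sketch of fuzzy k-means (fuzzy c-means) on the document vectors, following the standard textbook update equations rather than our tuned implementation:

    import numpy as np

    def fuzzy_k_means(X, k, m=2.0, iters=100, seed=0):
        # X: n*p array of (normalized) document vectors; m > 1 is the
        # fuzzifier. Returns cluster centers C and memberships U (k*n).
        rng = np.random.default_rng(seed)
        U = rng.random((k, len(X)))
        U /= U.sum(axis=0)                 # memberships of each doc sum to 1
        for _ in range(iters):
            Um = U ** m
            C = Um @ X / Um.sum(axis=1, keepdims=True)      # update centers
            d = np.linalg.norm(X[None, :, :] - C[:, None, :], axis=2) + 1e-9
            U = d ** (-2.0 / (m - 1.0))    # closer center => larger membership
            U /= U.sum(axis=0)
        return C, U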

Simulation Results

Concluding Remarks and Future Work

Computer Programs

Division of Labor

References