The Analysis, Indexing, and Retrieval of Web Data

Owing to the great popularity of the WWW, a huge number of web pages have spread over various web sites, and the population of web users is growing rapidly. This growth creates an urgent need for information services of high quality and high performance. In this study, we regard the WWW as a very large database and explore research topics for data-intensive applications on the WWW. We classify the data on the WWW into two categories: the contents and metadata of web pages, and the browsing behavior of web users. Our goal is to analyze these data for knowledge discovery and to apply the results to information services such as searching, filtering, and prefetching web pages.

Because searching web pages suffers from a scale-up problem, we consider a filtering approach that keeps performing well regardless of how the number of web pages grows. In this approach, users first describe what they need in the form of user profiles. By comparing web pages against user profiles, the users who are interested in a web page can be identified and notified. Two types of profiles are considered in our study. One contains only a set of keywords that specify the contents of web pages. The other considers the URLs of web pages. To tackle the performance issues, we devise several indexing methods for each type of profile. The dramatic growth of the Web has also increased the possibilities for information sharing. As the population on the Web grows, the analysis of user interests and behaviors provides hints on how to improve the quality of service. In this study, we propose a method for deriving user profiles by data mining techniques. Moreover, we define six types of user profiles and a distance measure to classify users into clusters. Finally, several kinds of recommendation services using the clustering results are realized.
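
To make the keyword-profile filtering concrete, the following is a minimal sketch and not the indexing method devised in this study. It assumes a plain inverted index from keywords to profiles, whitespace tokenization, unique user identifiers, and the matching rule that a page matches a profile when it contains every keyword of that profile. The class name KeywordProfileIndex and its methods are hypothetical.

```python
from collections import defaultdict


class KeywordProfileIndex:
    """Inverted index from keyword to the profiles that contain it.

    Illustrative assumption: a page matches a profile when the page
    contains every keyword of that profile.
    """

    def __init__(self):
        self._postings = defaultdict(set)  # keyword -> set of user ids
        self._sizes = {}                   # user id -> number of keywords

    def add_profile(self, user_id, keywords):
        """Register one keyword profile; user_id is assumed to be unique."""
        terms = {k.lower() for k in keywords}
        if not terms:
            raise ValueError("a profile needs at least one keyword")
        if user_id in self._sizes:
            raise ValueError("profile already registered: %r" % (user_id,))
        self._sizes[user_id] = len(terms)
        for term in terms:
            self._postings[term].add(user_id)

    def match(self, page_text):
        """Return the sorted user ids whose keywords all occur in page_text."""
        hits = defaultdict(int)
        # Only the distinct terms of the page are looked up, so the cost
        # depends on the page and its postings, not on the total number
        # of registered profiles.
        for term in set(page_text.lower().split()):
            for user_id in self._postings.get(term, ()):
                hits[user_id] += 1
        return sorted(u for u, n in hits.items() if n == self._sizes[u])


index = KeywordProfileIndex()
index.add_profile("alice", ["data", "mining"])
index.add_profile("bob", ["web", "index"])
notified = index.match("Data mining on the web")
```

In this sketch an incoming page is compared only against the profiles that share at least one keyword with it, which is the general reason an index helps when the number of profiles is large.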

To analyze web pages, we propose a classification scheme that takes the hyperlink structure and the associated text into consideration. Moreover, we design a rough-set-based method for discovering classification rules. Following the object-oriented concept, the metadata of web pages are organized into a class hierarchy, which can be used for specifying user queries. In addition, a user interface is built to support database-like queries. Both the page contents and the hyperlink structure can be specified in our query language. In keyword search on the Web, it is often difficult for users to specify queries that precisely describe what they need. In fact, such queries can be very complex. It is therefore unrealistic for search engines on the Web to demand precise queries directly from users. In this study, we propose a new method for query refinement, which allows users to specify simple queries and then refines them repeatedly. Our method takes advantage of historical information (user feedback and query term associations) to refine queries. As for the analysis of user behavior, we apply data mining techniques to the log data in order to find popular sequences of user requests. Furthermore, we propose a framework that applies the mining results to predicting user requests. To tackle the performance issues, we devise an index structure that provides a fast prediction process.
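
As an illustration of predicting user requests from mined request sequences, the sketch below shows one simple possibility and is not the framework or index structure proposed in this study. It assumes that sessions are lists of requested URLs, that popular sequences are contiguous runs of at most max_len requests whose continuation occurs at least min_support times, and that a hash table keyed by the sequence serves as the index. The function names build_predictor and predict_next, and both parameters, are hypothetical.

```python
from collections import Counter, defaultdict


def build_predictor(sessions, max_len=2, min_support=2):
    """Map each request sequence (a tuple) to a Counter of next requests.

    Only continuations observed at least min_support times are kept.
    """
    table = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            # Record session[i] as the continuation of each of the
            # preceding runs of length 1..max_len that end at i - 1.
            for n in range(1, max_len + 1):
                if i - n < 0:
                    break
                table[tuple(session[i - n:i])][session[i]] += 1
    pruned = {}
    for prefix, nexts in table.items():
        kept = Counter({p: c for p, c in nexts.items() if c >= min_support})
        if kept:
            pruned[prefix] = kept
    return pruned


def predict_next(table, history, max_len=2):
    """Predict the next request from the longest indexed suffix of history."""
    for n in range(min(max_len, len(history)), 0, -1):
        nexts = table.get(tuple(history[-n:]))
        if nexts:
            return nexts.most_common(1)[0][0]
    return None


sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/b", "/c"],
    ["/x", "/b", "/d"],
    ["/x", "/b", "/d"],
    ["/y", "/b", "/d"],
]
table = build_predictor(sessions)
guess = predict_next(table, ["/a", "/b"])
```

Each prediction costs at most max_len table lookups, so it does not depend on the size of the log. A predicted page could then be prefetched before the user asks for it.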