Ms. N. Kalpana, Dr. S. Appavu Alias Balamurugan
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to use and update discovered patterns effectively is still an open research issue, especially in the domain of text mining. Because most existing text mining methods adopt term-based approaches, they suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based (or phrase-based) approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents a pattern discovery technique that includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Word similarity and information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules, each targeting the extraction of a particular kind of information. A fundamental data-mining problem is to examine data for "similar" items, for example near-duplicate pages in a collection of Web pages. Such pages could be plagiarized, or they could be mirrors that have almost the same content but differ in information about the host and about other mirrors. We introduce a technique called "min hashing," which compresses large sets in such a way that we can still deduce the similarity of the underlying sets from their compressed versions. Finally, we explore notions of "similarity" that are not expressible as the intersection of sets. This study leads us to consider the theory of distance measures in arbitrary spaces.
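The min hashing idea mentioned above can be illustrated with a minimal sketch. It assumes that each document is represented as a set of character shingles and that random linear hash functions stand in for true random permutations. The function names (`shingles`, `minhash_signature`, `estimate_similarity`), the shingle length, and the hash family are illustrative assumptions, not details taken from this paper.

```python
import random
import zlib

# Large Mersenne prime used as the modulus of the (assumed) linear hash family.
_PRIME = (1 << 61) - 1


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Compress a non-empty set of strings into one minimum per hash function."""
    # zlib.crc32 gives a hash that is stable across runs, unlike built-in hash().
    xs = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % _PRIME for x in xs) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical mirror pages that differ only in host information.
    doc_a = "this page is a mirror of the course notes hosted at site one"
    doc_b = "this page is a mirror of the course notes hosted at site two"
    params = make_hash_params(128, seed=42)
    sig_a = minhash_signature(shingles(doc_a), params)
    sig_b = minhash_signature(shingles(doc_b), params)
    print("exact Jaccard:   ", jaccard(shingles(doc_a), shingles(doc_b)))
    print("minhash estimate:", estimate_similarity(sig_a, sig_b))
```

Each signature has a fixed length (128 integers here) regardless of how many shingles the document contains, which is what allows similarity to be estimated from the compressed form alone. More hash functions give a tighter estimate at the cost of a longer signature.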