Identifying Outliers from Web Documents Using Reflective Weighted Correlation
Raheemaa Khan1, Mohammed Saleem Irfan Ahmed2, Husni Hamad Almistarihi3
1Mrs. R. L. Raheemaa Khan, Department of Computer Science, Bharathiar University, Tamil Nadu, South India.
2Dr. M. S. Irfan Ahmed, Department of specialization in Trusted Networks, Taibah University – Saudi Arabia.
3Dr. Husni Hamad Almistarihi, Department of specialization in Grid Computing and Distributed Systems, Taibah University – Saudi Arabia.
Manuscript received on 02 June 2019 | Revised Manuscript received on 10 June 2019 | Manuscript published on 30 June 2019 | PP: 752-759 | Volume-8 Issue-8, June 2019 | Retrieval Number: E3268038519/19©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC-BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: With the enormous use of the Web, its size is growing rapidly every second. Web mining is an important research field in which web documents are mined to extract useful knowledge about web content and its usage. Web content mining is one category of web mining, in which web pages and web documents are mined to eliminate web outliers. Because of heavy Internet usage, web content is becoming redundant, as the same data is stored on several web servers by several users. Accessing the relevant web pages is therefore becoming a very challenging task for search engines. This paper focuses on web content mining, in which the set of web documents retrieved by the search engine is examined to mine the documents of interest to the user by removing redundant and irrelevant documents. The proposed method employs a powerful mathematical concept called correlation analysis. Specifically, reflective weighted correlation analysis is used together with term frequencies to identify outliers; removing them improves the quality of the results. In addition, a score is computed for each document, on the basis of which irrelevant documents are removed and the significant documents are extracted for the user. The method is thus useful for identifying and removing outliers. The proposed method is evaluated experimentally, and the results show that it achieves an accuracy above 90% in predicting and differentiating outliers from significant documents. The results are also compared with other existing methods, with accuracy and execution time as parameters.
Keywords: Web Content Mining, Reflective Weighted Correlation Coefficient, Ranking, Outliers, Term Frequency.
Scope of the Article: Web and Text Mining.
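Illustrative sketch: The abstract does not state the exact formula or threshold the authors use, so the following Python sketch only illustrates the general idea of scoring documents with a reflective (uncentered) weighted correlation over term frequencies. It assumes the common textbook form r = Σ w·x·y / sqrt(Σ w·x² · Σ w·y²), uniform term weights by default, whitespace tokenization, and a cutoff of 0.3; the function names, the weights, and the cutoff are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def term_frequencies(text, vocabulary):
    """Count how often each vocabulary term occurs in text (whitespace tokens)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def reflective_weighted_correlation(x, y, w):
    """Weighted correlation taken about zero instead of about the means.

    Assumed form: sum(w*x*y) / sqrt(sum(w*x*x) * sum(w*y*y)).
    Returns 0.0 when either vector has no weighted mass.
    """
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = math.sqrt(
        sum(wi * xi * xi for wi, xi in zip(w, x))
        * sum(wi * yi * yi for wi, yi in zip(w, y))
    )
    return num / den if den else 0.0


def score_documents(query, docs, weights=None, threshold=0.3):
    """Return (score, is_outlier) for each document relative to the query.

    weights maps term -> weight (hypothetical; defaults to 1.0 per term).
    threshold is an assumed cutoff below which a document is flagged.
    """
    vocabulary = sorted(
        set(query.lower().split()).union(*(d.lower().split() for d in docs))
    )
    weights = weights or {}
    w = [weights.get(term, 1.0) for term in vocabulary]
    q = term_frequencies(query, vocabulary)
    results = []
    for doc in docs:
        score = reflective_weighted_correlation(
            q, term_frequencies(doc, vocabulary), w
        )
        results.append((score, score < threshold))
    return results


if __name__ == "__main__":
    docs = [
        "web content mining extracts knowledge from web documents",
        "cheap flights and hotel deals",
    ]
    for doc, (score, is_outlier) in zip(docs, score_documents("web content mining", docs)):
        print(f"{score:.3f} outlier={is_outlier} :: {doc}")
```

In this toy run the second document shares no terms with the query, so its score is zero and it is flagged as an outlier, while the first document scores well above the assumed cutoff. A real system would need proper tokenization, stop-word handling, and a cutoff tuned on labelled data.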