Monday, 31 March 2014



Extracting webpage data with Boilerpipe


In text mining and text analytics research we often work with crawlers and scrapers. The hardest problem we face is removing the garbage (navigation menus, ads, footers) from a scraped web page.

This is where Boilerpipe comes in: an API that helps extract the real content from a web page. The Boilerpipe library provides algorithms to detect and remove the surplus templates ("boilerplate") around the main textual content of a web page.
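To get a feel for what "detecting surplus templates" can mean, here is a toy sketch in plain Java. It is not Boilerpipe's actual implementation. It only illustrates the kind of shallow text features (word count and link density per block) that this style of boilerplate detection relies on. The class name LinkDensitySketch, the regex-based block splitting, and the thresholds (5 words, 0.33 link density) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: NOT how Boilerpipe is implemented internally.
final class LinkDensitySketch {
    // Block-level tags that end one text block and start the next.
    private static final Pattern BLOCK_BREAK =
            Pattern.compile("(?i)</?(p|div|li|ul|ol|h[1-6]|td|tr|table|br)\\b[^>]*>");
    // Anchor elements; group 1 is the (possibly tagged) link text.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Number of whitespace-separated tokens in already tag-free text.
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Fraction of a block's words that sit inside <a>...</a>.
    static double linkDensity(String blockHtml) {
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(ANY_TAG.matcher(m.group(1)).replaceAll(" "));
        }
        int total = countWords(ANY_TAG.matcher(blockHtml).replaceAll(" "));
        return total == 0 ? 0.0 : (double) linked / total;
    }

    // Keep blocks that are long enough and not dominated by links.
    static List<String> extractContent(String html, int minWords, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            String text = ANY_TAG.matcher(block).replaceAll(" ").trim().replaceAll("\\s+", " ");
            if (countWords(text) >= minWords && linkDensity(block) <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>"
                + "<p>Boilerpipe removes navigation menus and other template text "
                + "so that only the article body remains.</p>"
                + "<div>Copyright 2014</div>";
        // Thresholds are arbitrary values chosen for this sketch.
        for (String block : extractContent(html, 5, 0.33)) {
            System.out.println(block);
        }
    }
}
```

Short blocks and link-heavy blocks (menus, "see also" lists, footers) are dropped, while long paragraphs with few links survive. The real library is considerably more careful than this, so treat the sketch as intuition only.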

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0. The concepts behind the API are explained in this video lecture [http://videolectures.net/wsdm2010_kohlschutter_bdu/].


Boilerpipe provides five extractor strategies:

ArticleExtractor
LargestContentExtractor
DefaultExtractor
CanolaExtractor
KeepEverythingExtractor
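As a minimal sketch of how these extractors are called from Java, assuming the Boilerpipe jar and its dependencies are on the classpath: each extractor in the de.l3s.boilerpipe.extractors package exposes a shared INSTANCE, and getText(...) returns the extracted plain text. The class name BoilerpipeUsageSketch and the sample HTML are made up for this example.

```java
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor;

// Hypothetical wrapper class; only the extractor calls come from Boilerpipe.
final class BoilerpipeUsageSketch {

    // Tuned for news-style articles.
    static String articleText(String html) throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(html);
    }

    // General-purpose extractor for pages that are not articles.
    static String defaultText(String html) throws BoilerpipeProcessingException {
        return DefaultExtractor.INSTANCE.getText(html);
    }

    // Treats every block as content; handy as a baseline for comparison.
    static String allText(String html) throws BoilerpipeProcessingException {
        return KeepEverythingExtractor.INSTANCE.getText(html);
    }

    public static void main(String[] args) throws BoilerpipeProcessingException {
        // Made-up sample page: a menu, one paragraph, and a footer.
        String html = "<html><body>"
                + "<div><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>"
                + "<p>Boilerpipe removes navigation menus and other template text "
                + "so that only the article body remains.</p>"
                + "<div>Copyright 2014</div>"
                + "</body></html>";

        System.out.println("ArticleExtractor:\n" + articleText(html));
        System.out.println("DefaultExtractor:\n" + defaultText(html));
        System.out.println("KeepEverythingExtractor:\n" + allText(html));
    }
}
```

Running the same page through several extractors and comparing the output is a quick way to decide which strategy suits a given site. On a tiny snippet like this one the results may not be representative, so try it on real pages.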

It provides six output modes:

HTML Extract Fragment
HTML highlight
Plain text
JSON
Debug
Image Only

The API is very helpful for automated text extraction with Apache Lucene.








