Main content extraction from HTML, written in Java. It extracts the article text without the surrounding clutter.
Extracting the main article content from HTML pages is still a challenging open problem, and many open source algorithms and implementations are available. The aim of this project is to bring together some of the best content extraction algorithms implemented in Java.
My focus is mainly on the tuning parameters and on customizing/modifying these algorithms to fit my requirements.
readabilityBUNDLE performs on par with the other algorithms and adds the extras listed below.
- Preserves the HTML tags in the extracted content.
- Keeps all available images in the content instead of picking a single best image.
- Keeps all available videos.
- Better extraction of li, ul, and ol tags.
- Normalizes the extracted content.
- Incorporates three popular extraction algorithms; you can choose one based on your requirements.
- Provision to append content extracted from next pages into a consolidated output.
- Many cleaner/formatter measures added.
- Some core changes to the algorithms.
The main challenge I faced was extracting the main content while keeping all the images, videos, HTML tags, and some related div tags, which most algorithms use only for content/non-content identification.
readabilityBUNDLE borrows much of its code and concepts from Project Goose, Snacktory, and Java-Readability. My intention was simply to fine-tune/modify the algorithms to fit my requirements.
Some HTML pages work very well with a particular algorithm and some do not. This is the main reason I put all the available algorithms under one roof: you can choose the algorithm that best suits your needs.
You can find all author citations in each Java file itself.
- StringHelpers
- Network
- [NextPageFinder](https://github.com/srijiths/NextPageFinder)
You need to specify which extraction algorithm to use. The three extraction algorithms are ReadabilitySnack, ReadabilityCore, and ReadabilityGoose. By default it is ReadabilitySnack.
- Without next page finding
Sample Usage
ContentExtractor ce = new ContentExtractor();
HtmlFetcher htmlFetcher = new HtmlFetcher();
// Fetch the raw HTML of the page; the second argument is the timeout.
String html = htmlFetcher.getHtml("http://blogmaverick.com/2012/11/19/what-i-really-think-about-facebook/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Counterparties+%28Counterparties%29", 0);
// Extract the main content with the ReadabilitySnack algorithm.
Article article = ce.extractContent(html, "ReadabilitySnack");
System.out.println("Content : " + article.getCleanedArticleText());
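To compare the bundled algorithms on the same page, swap the algorithm name string; ReadabilityCore and ReadabilityGoose are the other two names listed above. A minimal sketch, reusing the objects from the sample:
// Same extractContent call as above, only the algorithm name differs.
Article coreArticle = ce.extractContent(html, "ReadabilityCore");
Article gooseArticle = ce.extractContent(html, "ReadabilityGoose");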
- With next page HTML sources
If you also need to extract and append content from the next pages, then:
- You can use [NextPageFinder](https://github.com/srijiths/NextPageFinder) to find all the next-page links.
- Get the HTML of each next page as a List of Strings using Network.
- Pass them to the content extractor like
article = ce.extractContent(firstPageHtml, extractionAlgorithm, nextPagesHtmlSources);
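Putting the steps together, a minimal sketch. The next-page URL discovery is only hinted at here: findNextPageLinks is a hypothetical helper standing in for NextPageFinder, whose API is not shown in this README; the fetching reuses the HtmlFetcher call from the sample above.
// Requires java.util.List and java.util.ArrayList.
List<String> nextPageUrls = findNextPageLinks(firstPageUrl); // hypothetical helper, e.g. backed by NextPageFinder
List<String> nextPagesHtmlSources = new ArrayList<String>();
for (String nextPageUrl : nextPageUrls) {
    // Fetch each next page's HTML; second argument is the timeout, as in the sample above.
    nextPagesHtmlSources.add(htmlFetcher.getHtml(nextPageUrl, 0));
}
// Consolidated output: first-page content with the next-page content appended.
Article fullArticle = ce.extractContent(firstPageHtml, "ReadabilitySnack", nextPagesHtmlSources);
System.out.println("Content : " + fullArticle.getCleanedArticleText());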
Build using Maven: mvn clean package
Apache License 2.0 - http://www.apache.org/licenses/LICENSE-2.0.html