justext-java

justext-java is a library for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. This implementation is a Java port of jusText (https://github.com/miso-belica/jusText).

How it works

See what is kept and what is discarded from a typical web page. Read a description of the jusText algorithm.

Releases

TBD

Building from source

% mvn clean install

Code Sample (Groovy)

import nl.wizenoze.justext.JusText
import nl.wizenoze.justext.paragraph.Paragraph
import nl.wizenoze.justext.util.StopWordsUtil

JusText jusText = new JusText()

// Download the raw HTML of the page to process
String rawHtml = "http://www.devx.com/wireless/remote-work-and-the-social-forces-and-technologies-that-enable-it.html".toURL().getText()

// Load the English stop-word list
Set<String> stopWords = StopWordsUtil.getStopWords("en")

// Run jusText on the page using that stop-word list
List<Paragraph> paragraphs = jusText.extract(rawHtml, stopWords)

// Print the text of each returned paragraph
paragraphs.each { println(it.getText()) }
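
The sample passes a stop-word set to extract. One signal the jusText algorithm relies on is stop-word density: running prose tends to contain many function words, while navigation links and similar boilerplate contain few. The sketch below illustrates that idea only. It is not the code used by justext-java; the class StopWordDensitySketch, its method names, the whitespace tokenization, and the 0.30 threshold are all assumptions made for this example, and the real algorithm also weighs other signals such as paragraph length and link density.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of justext-java.
final class StopWordDensitySketch {

    // Illustrative threshold chosen for this sketch only.
    static final double MIN_STOP_WORD_DENSITY = 0.30;

    /** Returns the fraction of whitespace-separated tokens that are stop words. */
    static double stopWordDensity(String text, Set<String> stopWords) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        // Naive tokenization: lower-case, split on whitespace, no punctuation handling.
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        long hits = Arrays.stream(tokens).filter(stopWords::contains).count();
        return (double) hits / tokens.length;
    }

    /** Treats a block as sentence-like when enough of its tokens are stop words. */
    static boolean looksLikeSentenceText(String text, Set<String> stopWords) {
        return stopWordDensity(text, stopWords) >= MIN_STOP_WORD_DENSITY;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "is", "on", "a", "and", "of");
        // A short sentence: 4 of 6 tokens are stop words.
        System.out.println(looksLikeSentenceText("The cat is on the mat", stopWords));
        // A navigation-style line: no stop words at all.
        System.out.println(looksLikeSentenceText("Home Products Contact Login", stopWords));
    }
}
```

Because the tokenizer here does not strip punctuation, a token such as "mat." would not match a stop word; a fuller implementation would normalize tokens first.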

