PreProcessor

PreProcessor is a program that helps users to process and annotate raw textual data. Despite various Part of Speech taggers, Lemmatisers, Stemmers, Named Entities Recognisers, Sentence Splitters, Tokenisers and Stopword Checkers can be used for this purpose, they are independent programs built for only one specifically purpose (e.g. identify the word's stem). Thus, when users want to use more than one or import them in their own programs/ applications, their integration turns to be really complex and time-consuming. As an attempt to fulfil this gap, PreProcessor aims at offering the user with a simple, yet robust and agile variety of morphosyntatic options to process and annotate raw textual data by taking advantage of the best known open-source libraries on the market.

TECHNICAL INFORMATION =========================

This program provides several abstraction methods to perform various NLP tasks, such as: POS Tagging (TreeTagger); Lemmatisation (TreeTagger); Stemming (Snowball); Tokenisation (OpenNLP); Sentence Delimitation (OpenNLP); NER (OpenNLP); and Stopword Checker. Hereafter we describe how these methods can be called.
- NLPManager nlpManager = new NLPManager(Constants.EN); // receives the language
- The NLPManager class wraps all the NLP methods offered by the PreProcessor (http://github.com/hpcosta/PreProcessor). Within this class you will find a demo() that demonstrates how you can use all these methods for various languages. Please have a closer look at the demo() method located at 'src/nlp/NLPManager'

INSTALATION =========================
Import the code to your Java editor.
The folder 'config' and 'internalResources' should be at the same level as the src folder.
- The folder 'internalResources' contains models for the:
  - TreeTagger (English, French, German, Italian, Portuguese and Spanish)
  - OpenNLP (tokeniser, sentence splitter and NER - only for English)
  - Stopword files (German, English, Italian, Portuguese and Spanish)
- The folder 'config' contains a configuration file for the TreeTagger. You will need have the TreeTagger installed in your computer an configure the treetagger.properties file.

3.1 External Libraries

This section is important to let you know what libraries are used in this project, as well as to know how to update the resources or models.

3.1.1 NLP libraries

* TreeTagger
	* provides a POS Tagger for EN, SP, PT, FR, DE, IT and RU
		* due to the github limitations, we need to download the RU model from the following address: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/russian-par-linux-3.2-utf8.bin.gz
	* The following java library allows to use TreeTagger in Java.
		* org.annolab.tt4j-1.0.15

* Stemmer 
	* provides a Stemmer for EN, SP, PT, FR, DE, IT and RU
	* the following java library allows to use Stemmer in Java
		* org.tartarus.snowball

* OpenNLP
	* provides a **sentence splitter** and **tokenization** in EN, but can be used for at least EN, PT and SP
	* also provides **NER** for EN and SP, see models available through http://opennlp.sourceforge.net/models-1.5/
	* the following java library allows to use OpenNLP in Java
		* opennlp-maxent-3.0.3;
		* opennlp-tools-1.5.3; 
		* opennlp-uima-1.5.3

	* you can find these models inside the project folder, more specificaly in the "/resources/opennlpmodels/..." folder.
		* contains the following models for English:
			* Date name finder model.			
			* Location name finder model.		
			* Money name finder model.		
			* Organization name finder model.	
			* Percentage name finder model.	
			* Person name finder model.		
			* Time name finder model.
		* and the following models for Spanish:
			* Location name finder model, trained on conll02 shared task data.
			* Organization name finder model, trained on conll02 shared task data.	
			* Person name finder model, trained on conll02 shared task data.	
			* Misc name finder model, trained on conll02 shared task data.
		* the English and Spanish models are loaded by the 'NEREnModelsLoader' and 'NEREsModelsLoader' classes, respectively.

REQUIREMENTS =========================

Java 6 (JRE 1.6) or higher
Several models and libraries (already included either in the 'internalResources' or in the 'libs' folder)

LICENSE =========================

For more information please contact:

hercos (at) uma (dot) es

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PreProcessor

TABLE OF CONTENTS

3.1 External Libraries

3.1.1 NLP libraries

Follow me on

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
bin		bin
config		config
internalResources		internalResources
libs		libs
src		src
.classpath		.classpath
.gitignore		.gitignore
.project		.project
README.md		README.md

hpcosta/PreProcessor

Folders and files

Latest commit

History

Repository files navigation

PreProcessor

TABLE OF CONTENTS

3.1 External Libraries

3.1.1 NLP libraries

Follow me on

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages