This project implements a number of Java utilities which might be useful in Machine Learning NLP tasks.
Namely it allows for:
- laying out a training/test corpus into the file system
- NLP preprocessing of the documents (tokenization, stemming, POS filtering...)
- extracting features (e.g. TfIdf), and converting them to LibSVM file format
This project is featured in the essay "What is the best method for Automatic Text Classification?".
WARNING: for the time being, this project must not be considered production ready due to the lack of adequate automated testing.
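To make the feature-extraction step more concrete, here is a minimal sketch of how a TfIdf weight can be computed and how one document's features look in LibSVM format. It is not taken from this project's code: the class and method names are hypothetical, and the weighting (raw term frequency times `ln(N / df)`) is only one common TfIdf variant, assumed here for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; hypothetical names, not this project's implementation. */
final class TfIdfLibSvmSketch {

    /**
     * One common TfIdf variant (an assumption, other weightings exist):
     * raw term frequency multiplied by ln(totalDocuments / documentFrequency).
     */
    static double tfIdf(int termFrequency, int documentFrequency, int totalDocuments) {
        if (termFrequency <= 0 || documentFrequency <= 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) totalDocuments / documentFrequency);
    }

    /**
     * Formats one document as a LibSVM line: "<label> <index>:<value> ...".
     * Feature indices are written in ascending order and zero values are
     * omitted, since the format is sparse.
     */
    static String toLibSvmLine(int label, Map<Integer, Double> features) {
        StringBuilder line = new StringBuilder().append(label);
        // TreeMap sorts the feature indices in ascending order.
        for (Map.Entry<Integer, Double> entry : new TreeMap<>(features).entrySet()) {
            if (entry.getValue() == 0.0) {
                continue;
            }
            line.append(' ')
                .append(entry.getKey())
                .append(':')
                .append(String.format(Locale.ROOT, "%.6f", entry.getValue()));
        }
        return line.toString();
    }
}
```

In this sketch each term of the dictionary would map to a 1-based feature index and each category to an integer label; how the project actually assigns them is not shown here.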
- lay out the corpus into the file system
- have a look at this example from another related project
- NLP preprocess
- class
- parameters
--corpusFolderRoot <corpus_root> --preprocessedCorpusFolderRoot <corpus_root>_preprocessed --iso6391Language it
- class
- TfIdf export to LibSVM
- Build Lucene Index
- class
- parameters
--corpusFolderRoot <corpus_root>_preprocessed --luceneIndexFolder <corpus_root>_preprocessed_lucene
- class
- Export "Terms Dictionary" from Lucene Index
- class
- parameters
--luceneIndexFolder <corpus_root>_preprocessed_lucene\ --termsDictionaryOutputJSONFile <corpus_root>_preprocessed_lucene_terms.json --maxTerms 10000
is optional
- class
- Compute TfIdf and export to LibSVM
- class
- parameters
--corpusFolderRoot <corpus_root>_preprocessed\ --libSVMExportFilePrefix <corpus_name> --libSVMExportFolder <libsvm_files_output_folder> --luceneIndexFolder <corpus_root>_preprocessed_lucene\ --termsDictionaryJSONFile <corpus_root>_preprocessed_lucene_terms.json
- class
- Build Lucene Index
You basically just have to implement, for your language, the classes located in the package com.ml_text_utils.nlp.impl.italian
- class
- parameters
--corpusFolderRoot <corpus_root> --googleAutoMlCsvFile <corpus>.csv --googleCloudStorageFolderUri gs://<your bucket>/<your folder path if any>
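As a rough illustration of what the exported CSV could contain, the sketch below builds one row per document, pairing the document's Google Cloud Storage URI with its label. It is not this project's code: the class and method names are hypothetical, and it assumes both that the corpus is laid out as `<corpus_root>/<label>/<file>` and that each row has the form `gs://.../<label>/<file>,<label>`.

```java
/** Illustrative sketch only; hypothetical names and an assumed corpus layout. */
final class AutoMlCsvSketch {

    /**
     * Builds one CSV row of the form "<gcs uri of the document>,<label>".
     * Assumes the documents were uploaded under <gcsFolderUri>/<label>/<fileName>.
     */
    static String csvRow(String gcsFolderUri, String label, String fileName) {
        // Normalize the folder URI so that exactly one slash separates the parts.
        String base = gcsFolderUri.endsWith("/") ? gcsFolderUri : gcsFolderUri + "/";
        return base + label + "/" + fileName + "," + label;
    }
}
```

For example, under these assumptions `AutoMlCsvSketch.csvRow("gs://my-bucket/corpus", "sport", "doc1.txt")` yields `gs://my-bucket/corpus/sport/doc1.txt,sport`, where the bucket, label, and file name are made up for the example.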