GitHub - arqam/TF-IDF-Generator: Generate TF-IDF for terms in a collection of documents.

arqam / TF-IDF-Generator Public

forked from yebrahim/TF-IDF-Generator


Files:
    - .DS_Store
    - .gitignore
    - .tfidf.py.swp
    - 1.txt
    - 1.txt_tfidf
    - 2.txt
    - 2.txt_tfidf
    - README.txt
    - french_lemmas.txt
    - french_stopwords.txt
    - in.txt
    - tfidf.py
    - tfidf_1.txt
    - tfidf_2.txt
    - tmp


This script implements TF-IDF term relevance scoring as described in Wikipedia's article: en.wikipedia.org/wiki/Tf–idf
----

Generate the TF-IDF ratings for a collection of documents.

This script also tokenizes the input files to extract words (it removes punctuation and lowercases
    everything), and it uses the NLTK library to lemmatize words (reduce inflected forms to their base form).
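
The tokenization step described above can be sketched roughly as follows. This is a minimal illustration, not the code in tfidf.py: the function name tokenize and its lemmatize parameter are hypothetical, and the regular expression is just one way to drop punctuation and digits.

```python
import re

def tokenize(text, lemmatize=lambda word: word):
    """Lowercase the text, keep only runs of letters, and lemmatize each word.

    `lemmatize` is a hypothetical hook: by default it returns the word unchanged.
    """
    # [^\W\d_]+ matches runs of Unicode letters (word characters minus digits and "_"),
    # so punctuation and numbers are discarded.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [lemmatize(word) for word in words]

print(tokenize("The cats, the DOGS!"))  # ['the', 'cats', 'the', 'dogs']

# With NLTK and its wordnet corpus installed, a lemmatizer could be plugged in:
#   from nltk.stem import WordNetLemmatizer
#   tokenize("The cats, the DOGS!", WordNetLemmatizer().lemmatize)
```

Passing the lemmatizer in as a parameter keeps the sketch runnable without NLTK; the script itself requires NLTK.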

IMPORTANT:
    NLTK is a REQUIRED library for this script. Please make sure it is installed, along with the WordNet
    corpus, before trying to run this script.

Usage:
    - Create a file to hold the paths and names of all your documents (in the example shown: input_files.txt)
    - List the full path of each document in that file, one path per line
    - For now, documents must be plain text; HTML, XML, RDF, and other formats are not supported
    - Run this script with your input file as its single parameter, for example:
            python tfidf.py input_files.txt
    - The script generates one new file per input file, named with the prefix "tfidf_",
            containing each term and its TF-IDF score on a separate line
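
For readers who want to see what the scores mean, here is a minimal sketch of one common TF-IDF variant from the Wikipedia article: term frequency as count divided by document length, and inverse document frequency as log(N / document frequency). Whether tfidf.py uses exactly this weighting is an assumption; the function name tf_idf and the sample data are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Map each document name to a {term: score} dict.

    `documents` maps a name to its list of tokens. Assumed weighting:
    tf = count / len(tokens), idf = log(n_docs / docs_containing_term).
    """
    n_docs = len(documents)
    # Count, for every term, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

docs = {"1.txt": ["cat", "dog"], "2.txt": ["cat", "fish"]}
for name, terms in tf_idf(docs).items():
    # One term per line, highest score first.
    for term, score in sorted(terms.items(), key=lambda kv: -kv[1]):
        print(name, term, round(score, 4))
```

Under this weighting, a term that appears in every document ("cat" here) scores 0, because log(N / N) = 0.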

This code now supports French (and similar accented European languages), but it needs a lexicon file that maps each word to its lemma. An example file for French is provided under the name oldlexique.txt
To run with a lexicon file, use the -l option to specify it; the script loads two columns from the file, corresponding to the word and its lemma.
Example usage:
        python tfidf.py -l oldlexique.txt input_files.txt
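
As an illustration of how a two-column lexicon could be used, here is a minimal sketch. It assumes the columns are whitespace-separated (word first, lemma second) and that the file is UTF-8; the actual format of oldlexique.txt may differ. The names load_lexicon and lemmatize are hypothetical, not functions from tfidf.py.

```python
def load_lexicon(lines):
    """Build a {word: lemma} dict from an iterable of 'word lemma' lines.

    Assumes whitespace-separated columns; lines with fewer than two
    columns are skipped.
    """
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            lexicon[parts[0].lower()] = parts[1].lower()
    return lexicon

def lemmatize(word, lexicon):
    """Return the lemma for `word`, or the word itself if it is not in the lexicon."""
    return lexicon.get(word, word)

sample = ["chevaux cheval", "mangé manger", ""]
lexicon = load_lexicon(sample)
print(lemmatize("chevaux", lexicon))  # cheval
print(lemmatize("chat", lexicon))     # chat (unknown word is returned unchanged)

# Loading from a file would look like this (encoding is an assumption):
#   with open("oldlexique.txt", encoding="utf-8") as f:
#       lexicon = load_lexicon(f)
```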