Text cleansing cleans the extracted text, and the preprocessing module ensures that the data is ready for analysis. As described in the literature, different preprocessing techniques can be applied at this step. After these techniques are applied, the most interesting terms can be identified in the data. In Promine, the following preprocessing methods are involved in text cleaning.
First, tokenization is applied to the data. It converts a stream of characters into a stream of words, which are our processing units.
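As an illustration of this step, the following minimal sketch splits a character stream into word tokens with a regular expression. The pattern, the lowercasing, and the function name `tokenize` are assumptions made for the example; they are not taken from the Promine implementation.

```python
import re

# Assumption: a token is a run of letters, digits, or apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Split a character stream into lowercase word tokens."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("The clerk checks the order, then ships it.")
print(tokens)
```

Punctuation is dropped because it never matches the pattern, so only words reach the later steps.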
To reduce the dimensionality of the tokenized data, a stop word filter is applied. In this process, the most frequent but unimportant words are removed from the data.
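A stop word filter of this kind can be sketched as a set lookup. The stop word list below is a deliberately tiny, hypothetical one chosen for the example; the list actually used in Promine is not given here and is likely much longer.

```python
# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "then", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["the", "clerk", "checks", "the", "order", "then", "ships", "it"]
filtered = remove_stop_words(tokens)
print(filtered)
```

In this example the eight input tokens shrink to four, which is the dimensionality reduction the filter is meant to achieve.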
Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analysed as a single item. It is similar to stemming, but it takes the context of a word into account and therefore links words with similar meaning to one word. Text preprocessing includes both stemming and lemmatization. The two terms are often confused, and some treat them as the same. Lemmatization is preferred over stemming because it performs a morphological analysis of the words.
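The difference between the two can be illustrated with a toy comparison. Both functions below are simplified stand-ins written for this example: `crude_stem` strips suffixes blindly, and `lookup_lemma` uses a small hypothetical table in place of a real morphological dictionary. Neither reflects the lemmatizer used in Promine.

```python
# Hypothetical lookup table standing in for a real morphological dictionary.
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "mice": "mouse",
}

# Suffixes tried in order by the toy stemmer.
SUFFIXES = ("ing", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix without checking the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if the table knows it, else the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["studies", "studying", "better", "mice"]:
    print(word, crude_stem(word), lookup_lemma(word))
```

The stemmer turns "studies" into the non-word "studi" and leaves "better" and "mice" unchanged, while the lookup maps all four inputs to dictionary forms.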
Term frequency measures the frequency of a word in a document. In information retrieval, tf–idf (also written TFIDF), short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. We calculated it using built-in functions in Python.
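To make the statistic concrete, the sketch below computes one common tf–idf variant with the standard library only: relative term frequency multiplied by the natural logarithm of N / df. This weighting and the function name `tf_idf` are assumptions for illustration. The text does not say which Python functions or which variant were actually used, and libraries often apply smoothing or normalization that gives different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


docs = [["order", "check", "order"], ["order", "ship"], ["invoice", "check"]]
weights = tf_idf(docs)
print(weights[1])
```

In the second document "ship" receives a higher weight than "order", because "order" also occurs in another document. A term that appears in every document gets a weight of zero under this variant.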
At the end of the preprocessing step, a list of keywords is created. This list of words comes from a file that is generated by a process model. A single file cannot provide enough information for generating the knowledge elements that we need for a domain-specific ontology. Therefore, for every keyword, we get a set of synonyms from WordNet and generate a list of words for that keyword.
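The keyword expansion can be sketched as follows. To keep the example self-contained, a small hypothetical table (`SAMPLE_SYNONYMS`) stands in for WordNet, and its entries are invented for illustration rather than taken from WordNet. With NLTK and its WordNet corpus installed, a lookup function could instead be built from `wordnet.synsets(word)` and each synset's `lemma_names()` and passed in as `lookup`.

```python
# Hypothetical stand-in for WordNet: keyword -> synonyms (invented entries).
SAMPLE_SYNONYMS = {
    "order": ["purchase order", "request"],
    "invoice": ["bill", "account"],
}


def expand_keywords(keywords, lookup=SAMPLE_SYNONYMS.get):
    """Map each keyword to a list holding the keyword and its synonyms."""
    expanded = {}
    for keyword in keywords:
        synonyms = lookup(keyword) or []
        # Keep the keyword first, then add each synonym once, in order.
        words = [keyword]
        for synonym in synonyms:
            if synonym not in words:
                words.append(synonym)
        expanded[keyword] = words
    return expanded


expanded = expand_keywords(["order", "invoice", "shipment"])
print(expanded)
```

A keyword with no known synonyms, such as "shipment" here, is kept as a one-element list so that no keyword is lost during expansion.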
Open Directory Clicked
While loading file:
After file loaded:
Applied Tokenization
Stopword Removal applied
Applied POS TAGGER:
After applying Lemmatization
WordNet Applied
Applied TF-IDF
Applied Clear Text:
Applied RESET:
Validation when there is no item in the Analysis text area:
Selecting Corpus File:
Loading File for corpus:
File Loaded in Text Area for Corpus Generation:
File Saved In Corpus:
Corpus File Validation if the file already exists:
Corpus Files in Explorer: