Negapedia TFIDF Analyzer analyzes Wikipedia's dumps and performs statistical analysis on the text of reverts.
The output data can be used to clarify the themes of the conflicts inside a Wikipedia page.
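As a reference for how such word weights are typically computed, here is the textbook TF-IDF formula as a small Go sketch (the analyzer may use a variant; the function name and signature are illustrative):

```go
package tfidf

import "math"

// TFIDF returns the weight of a term in one document:
// tf  = occurrences of the term in the document / total terms in it;
// idf = log(total documents / documents containing the term).
// Assumes docsWithTerm >= 1 and docLen >= 1.
func TFIDF(termCount, docLen, totalDocs, docsWithTerm int) float64 {
	tf := float64(termCount) / float64(docLen)
	idf := math.Log(float64(totalDocs) / float64(docsWithTerm))
	return tf * idf
}
```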
english, arabic, danish, dutch, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, portuguese, romanian, russian, spanish, swedish, turkish, armenian, azerbaijani, basque, bengali, bulgarian, catalan, chinese, croatian, czech, galician, hebrew, hindi, irish, japanese, korean, latvian, lithuanian, marathi, persian, polish, slovak, thai, ukrainian, urdu, simple-english.
This kind of data comes from Negapedia/nltk.
english, arabic, danish, dutch, finnish, french, german, hungarian, italian, portuguese, spanish, swedish, chinese, czech, hindi, japanese, korean, persian, polish, thai, simple-english.
This kind of data comes from Negapedia/badwords.
The output consists of the following files:
- GlobalPagesTFIDF.json: contains, for every page, the list of words associated with their absolute frequency and TF-IDF value;
- GlobalPagesTFIDF_topNwords.json: like GlobalPagesTFIDF.json, but reporting only the N most important words (in terms of TF-IDF value);
- GlobalWords.json: contains all the analyzed wiki's words associated with their absolute frequency;
- GlobalTopic.json: contains all the words in every topic (using Negapedia topics);
- BadWordsReport.json: contains, for every page which has them, a list of bad words associated with their absolute frequency.
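These files are plain JSON, so they can be consumed with any JSON library. Below is a minimal Go sketch that loads GlobalWords.json; the flat word-to-frequency layout is an assumption based on the description above, so adjust the target type if the real schema differs:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Assumption: GlobalWords.json maps each word to its absolute
	// frequency, e.g. {"contrast": 42, ...}.
	f, err := os.Open("GlobalWords.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	words := map[string]int{}
	if err := json.NewDecoder(f).Decode(&words); err != nil {
		panic(err)
	}
	for word, freq := range words {
		fmt.Printf("%s: %d\n", word, freq)
	}
}
```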
The minimum requirements for running the project in a reasonable time are:
- At least a 4-core/8-thread CPU;
- At least 16GB of RAM (required);
- At least 300GB of disk space.
The recommended requirements are:
- 32GB of RAM or more (highly recommended).
Build the Docker image by running:
docker build -t <image_name> .
from the root of the repository directory.
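example (the image name is arbitrary):
docker build -t wikitfidf .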
Then run the container with:
docker run -d -v <path_on_fs_where_to_save_results>:<container_results_path> <image_name>
example:
docker run -d -v /path/2/out/dir:/data my_image
The available flags are:
- -lang: wiki language;
- -d: container results directory;
- -s: starting date of the reverts to consider;
- -e: ending date of the reverts to consider;
- -specialList: special page list to consider;
- -rev: number of reverts to consider;
- -topPages: number of top words per page to consider;
- -topWords: number of top global words to consider;
- -topTopic: number of top words per topic to consider;
- -delete: if true, the results directory is deleted after compression (default: true);
- -test: if true, logs are shown and only a single dump is processed.
example:
docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it
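A run combining several of the flags above might look like this (paths, image name, and flag values are illustrative):
docker run -v /path/2/out/dir:/data wikitfidf dothething -lang en -topPages 30 -topWords 100 -delete false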
The Go package can be installed with:
go get github.com/negapedia/wikitfidf
and the Docker image can be downloaded with:
docker pull negapedia/wikitfidf