waltaskew/Honors-Thesis
QUICK START FOR IR00:
Put the following in your .profile:

    export LD_LIBRARY_PATH=/aut/proj/ir/wsaskew/System/local/lib:$LD_LIBRARY_PATH
    export LD_RUN_PATH=/aut/proj/ir/wsaskew/System/local/lib:$LD_RUN_PATH
    export MANPATH=/aut/proj/ir/wsaskew/System/local/man:$MANPATH
    export PATH=/aut/proj/ir/wsaskew/System/local/bin:$PATH
    export PYTHONPATH=/aut/proj/ir/wsaskew/System/pythonpath:$PYTHONPATH

Then look at the provided example.py files to get started.

REQUIREMENTS:
python2.6: http://www.python.org/download/
libxml: http://xmlsoft.org/downloads.html
libxslt: http://xmlsoft.org/XSLT/downloads.html
lxml: http://pypi.python.org/pypi/lxml/
antlr: http://www.antlr.org/download/Python/
arff package: http://www.mit.edu/~sav/arff/dist/

INSTALLATION:
Copy the code directory to a location on your PYTHONPATH. For example:

    cp -R cooccurrence_similarity ~/python_code/
    touch ~/python_code/__init__.py
    export PYTHONPATH=~/python_code:$PYTHONPATH

PACKAGE DESCRIPTION:
The package allows corpora to be mined in order to calculate relations between a set of target words. It creates ARFF files with four features that measure the contextual relatedness of two words. These ARFF files may be used with the WEKA machine learning toolkit (http://www.cs.waikato.ac.nz/ml/weka/) for a variety of machine learning tasks.

USAGE:
The main interface is the Experimenter class, which lets a user generate ARFF files while varying a number of parameters. The Experimenter class maintains consistency between uses and reuses results from previous experiments when possible to avoid redundant calculation.

An Experimenter instance requires only one argument to be constructed: the path to a directory which will be used to store the results of experiments.

    e = Experimenter('experiment_dir')

experiment_dir should be a path to an empty directory, a path to a non-existent file or directory (which will then be created), or a path to a previously created experiment directory.
If a path to a previously created experiment directory is provided, the Experimenter instance returned will be identical to the instance which last performed work on that directory, maintaining consistency across multiple uses.

Next, one or more corpora must be indexed. The method add_to_index requires two arguments and accepts three optional ones.

    e.add_to_index('corpus_dir', 'corpus_type', stop_file='stop_file', tag_file='tag_file', synch_freq=10000)

corpus_dir and corpus_type are required. corpus_dir should be a directory of html or xml files to be mined. The directory is read recursively, so directory structure does not matter as long as the files to be indexed end in '.htm', 'php', '.html', or '.xml'.

corpus_type must be either 'phpBB' or 'xml'. More corpus types may be supported in the future. If corpus_type is 'phpBB', the files are treated as files generated by the popular phpBB forum software. If corpus_type is 'xml', then tag_file must be provided.

The tag_file argument should be a path to a file which instructs the xml parser how to parse the xml files in corpus_dir. An example tag file looks like this:

    TitleTag: ArticleTitle
    DelimiatorTag: MedlineCitation
    HeadingTag: MeshHeading
    AbstractText

TitleTag specifies the xml tag which contains an article's title.
HeadingTag specifies an xml tag which holds interesting heading or meta-information.
DelimiatorTag specifies a tag which separates documents from each other if a single xml file holds multiple documents.
Tags which are not preceded by a label (such as AbstractText in the above example) specify the location of text to be parsed out. An arbitrary number of such tags may be specified.

The stop_file argument is optional but recommended. stop_file should be a path to a file containing stop words to be removed from the indexed text, separated by newlines.

The synch_freq argument is optional and defaults to a reasonable value.
The argument controls how often the data structures involved in the indexing task are wiped from memory and synced to disk. Synchronization occurs each time synch_freq documents have been processed. High values yield faster indexing and higher memory usage; lower values yield the opposite.

Once corpora have been indexed, experiments may be performed.

    e.perform_experiment(target_file, synonym_file=None, window=50, pmi_threshold=25, relation_threshold=100, truth_DB=truth_file, truth_function='2_way_mild')

target_file should contain a series of newline-separated words between which relations should be calculated.

synonym_file may optionally be provided. A synonym file should be of the format:

    synonym:target

and will cause all occurrences of synonym to be counted as occurrences of target.

window, pmi_threshold and relation_threshold are variables which influence how context similarity metrics are calculated.

window is the cooccurrence window. A word must be within window words of a target, in either direction, to be considered a cooccurrence.

pmi_threshold controls whether a cooccurrence will be included when measuring the relatedness of two target words. A cooccurrence only contributes to the relation metric if it occurs at least pmi_threshold times. This value keeps infrequent cooccurrences from affecting the final similarity measure between two targets.

relation_threshold restricts the number of relation values calculated. Relation values are only calculated between targets which share at least relation_threshold cooccurrence words. This prevents relation values from being calculated from small numbers of shared cooccurrences. relation_threshold is thus a confidence threshold which influences the number of instances that appear in the generated ARFF files: if two targets do not share enough cooccurrence words, their relation will not be represented in the generated ARFF file.
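The interaction of window, pmi_threshold and relation_threshold described above can be illustrated with a small standalone sketch. This is not the package's implementation: the function names count_cooccurrences, shared_context and is_reportable are hypothetical, and the sketch assumes that relation_threshold is applied to the cooccurrence words that survive the pmi_threshold filter for both targets.

```python
from collections import Counter


def count_cooccurrences(tokens, targets, window=50):
    """Count the words found within `window` positions of each target,
    looking in both directions. (Hypothetical helper, not the package API.)"""
    targets = set(targets)
    counts = dict((t, Counter()) for t in targets)
    for i, token in enumerate(tokens):
        if token not in targets:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[token][tokens[j]] += 1
    return counts


def shared_context(counts, a, b, pmi_threshold=25):
    """Words that cooccur at least `pmi_threshold` times with both targets."""
    frequent_a = set(w for w, n in counts[a].items() if n >= pmi_threshold)
    frequent_b = set(w for w, n in counts[b].items() if n >= pmi_threshold)
    return frequent_a & frequent_b


def is_reportable(counts, a, b, pmi_threshold=25, relation_threshold=100):
    """Assumption: a pair is reported only when the two targets share at
    least `relation_threshold` sufficiently frequent cooccurrence words."""
    shared = shared_context(counts, a, b, pmi_threshold)
    return len(shared) >= relation_threshold


if __name__ == "__main__":
    text = "flu causes fever and cough while cold causes cough"
    counts = count_cooccurrences(text.split(), ["flu", "cold"], window=2)
    print(sorted(shared_context(counts, "flu", "cold", pmi_threshold=1)))
    print(is_reportable(counts, "flu", "cold", pmi_threshold=1,
                        relation_threshold=2))
```

With the small thresholds used here the effect is easy to see by hand; the defaults shown in perform_experiment are meant for much larger corpora.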
truth_DB is an optional argument which provides truth values for the relations between the specified targets. It should be a Berkeley DB with keys in the format:

    target_1,target_2

and values which are Pearson correlation values. If no truth_DB is provided, feature files without truth values are generated.

truth_function controls how the Pearson correlation values from the truth_DB are interpreted.

'2_way_mild' causes values > .1 to indicate a positive relation.
'2_way_strong' causes values > .3 to indicate a positive relation.
'3_way_mild' causes values > .1 to indicate a positive relation, values < -.1 to indicate a negative relation, and values in between to indicate independence.
'3_way_strong' causes values > .3 to indicate a positive relation, values < -.3 to indicate a negative relation, and values in between to indicate independence.
'5_way' causes values > .3 to indicate strong positive correlation, values > .1 to indicate mild positive correlation, values < -.3 to indicate strong negative correlation, values < -.1 to indicate mild negative correlation, and values between -.1 and .1 to indicate independence.

The generated ARFF files have five features. One of these features is the disease pair represented by the instance. This feature is useful for humans, but should probably not be used for classification or training.

Using WEKA, the following command filters out all string features for training and classification. Because the only string feature in the generated ARFF files is the name of the instance, this has the desired effect of using only the proper features for training and classification:

    java weka.classifiers.meta.FilteredClassifier -F weka.filters.unsupervised.attribute.RemoveType -W weka.classifiers.trees.J48 -t train.arff -T test.arff -p 5

The option:

    -F weka.filters.unsupervised.attribute.RemoveType

removes all string fields, and the only string field in the ARFF file is the disease names.
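For reference, the truth_function thresholds listed earlier can be written out as a small standalone function. This is only a sketch of the documented cutoffs, not the package's code: the function name interpret_pearson and the label strings are hypothetical, and the label used for the non-positive class in the 2-way schemes is an assumption, since the text above does not name it.

```python
def interpret_pearson(r, truth_function="2_way_mild"):
    """Map a Pearson correlation value to a class label using the cutoffs
    described in this README. Label strings are illustrative only."""
    if truth_function in ("2_way_mild", "2_way_strong"):
        cutoff = 0.1 if truth_function == "2_way_mild" else 0.3
        # Assumption: everything that is not positive falls in one class.
        return "positive" if r > cutoff else "not_positive"
    if truth_function in ("3_way_mild", "3_way_strong"):
        cutoff = 0.1 if truth_function == "3_way_mild" else 0.3
        if r > cutoff:
            return "positive"
        if r < -cutoff:
            return "negative"
        return "independent"
    if truth_function == "5_way":
        if r > 0.3:
            return "strong_positive"
        if r > 0.1:
            return "mild_positive"
        if r < -0.3:
            return "strong_negative"
        if r < -0.1:
            return "mild_negative"
        return "independent"
    raise ValueError("unknown truth_function: %r" % (truth_function,))


if __name__ == "__main__":
    for name in ("2_way_mild", "2_way_strong", "3_way_mild",
                 "3_way_strong", "5_way"):
        print(name, interpret_pearson(0.2, name), interpret_pearson(-0.2, name))
```

The same correlation value can land in different classes depending on the scheme, which is why the choice of truth_function changes the class distribution of the generated ARFF file.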
If you decide to use string fields in the ARFF file, you will need a more selective filter.

BUGS AND 'FEATURES':
none (yet)

SEE ALSO:
A few example files are provided, named example(#).py.
You can generate documentation for any of the python modules with the pydoc command.

Send bugs or complaints to: waltaskew@gmail.com