A machine learning (ML) library for classification using a nearest neighbor algorithm based on Hamming distances.
You can incorporate the VHamMLL functions into your own code, or use the included Command Line Interface app (cli.v).
Link to html documentation for the library functions and structs
You can use VHamMLL with your own datasets, or with a selection of publicly available datasets that are widely used for demonstrating and testing ML classifiers, in the datasets directory. These files are mostly in Orange file format; there are also datasets in ARFF (Attribute-Relation File Format) or in comma-separated values (CSV) format, as used in Kaggle.
This table reports balanced accuracy results for classification of a variety of publicly available datasets.
What, another AI package? Is that necessary? And have a look here for a more complete description and potential use cases.
For interactive descriptions of the two key algorithms used by VHamMLL, download the Numbers app spreadsheets: Description of Ranking Algorithm and Description of Classification Algorithm.
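As a rough illustration of the general idea of nearest-neighbor classification with Hamming distances, here is a minimal sketch in V. It does not use the VHamMLL API and is not the library's actual implementation (see the spreadsheets above for that). The names Case, sample_cases, hamming_distance and classify are hypothetical, and the sketch assumes attribute values have already been converted to small integer bins.

```v
module main

// Case is a hypothetical training instance: binned attribute values plus a class label.
struct Case {
	values []int
	label  string
}

// hamming_distance counts the positions at which two arrays differ.
// Assumption: both arrays have the same length.
fn hamming_distance(a []int, b []int) int {
	mut d := 0
	for i, v in a {
		if v != b[i] {
			d++
		}
	}
	return d
}

// classify returns the label of the training case closest to the query.
// Ties are resolved in favor of the earliest case (a simplification).
fn classify(train []Case, query []int) string {
	mut best_label := train[0].label
	mut best_d := hamming_distance(train[0].values, query)
	for c in train[1..] {
		d := hamming_distance(c.values, query)
		if d < best_d {
			best_label = c.label
			best_d = d
		}
	}
	return best_label
}

// sample_cases returns a tiny made-up training set, for illustration only.
fn sample_cases() []Case {
	return [
		Case{
			values: [0, 0, 1]
			label:  'a'
		},
		Case{
			values: [0, 1, 1]
			label:  'a'
		},
		Case{
			values: [2, 2, 0]
			label:  'b'
		},
		Case{
			values: [2, 1, 0]
			label:  'b'
		},
	]
}

fn main() {
	// The query differs from [2, 2, 0] in one position and from every other case in two.
	println(classify(sample_cases(), [2, 2, 1])) // prints: b
}
```

Because each attribute contributes either 0 or 1 to the distance, no attribute can dominate the result simply by having a larger numeric range.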
v install holder66.vhammll
You may also need to install its dependencies, if not automatically installed:
v install vsl
v install Mewzax.chalk
In your v code, add:
import holder66.vhammll
First, install V, if not already installed. On macOS, Linux, etc., you need git and a C compiler. (For Windows or Android environments, see the V language documentation.)
In a terminal:
git clone https://github.com/vlang/v
cd v
make
sudo ./v symlink # add v to your PATH
v install holder66.vhammll
On older Macs, if the make process fails, you may also need to do:
brew install bdw-gc # you will need to have homebrew installed; when that completes,...
cp /usr/local/Cellar/bdw-gc/8.2.8/lib/libgc.a .thirdparty/tcc/lib/libgc.a # use the version number of the just-installed bdw-gc
Then repeat the make in the v directory.
Finally, export VFLAGS="-d dynamic_boehm"
See above regarding the needed dependencies.
In a folder or directory that you want to use for your project, you will need to create a file with module main, and a function main().
You can do this in the terminal, or with a text editor. The file should contain:
module main
import holder66.vhammll
fn main() {
vhammll.cli()!
}
Assuming you've named the directory or folder vhamml and the file within main.v, in the terminal:
v run .
followed by the command line arguments, e.g.
v run . --help
or v run . analyze <path_to_dataset_file>
Command-specific help is available, like so:
v run . explore --help
or v run . explore -h
Note that the publicly available datasets included with the VHamMLL distribution can be found at ~/.vmodules/holder66/vhammll/datasets.
That's it!
v run . examples go
v up # installs the latest release of V
v update # get the latest version of the libraries, including holder66.vhammll
v . # recompile
The V lang community meets on Discord
For bug reports, feature requests, etc., please raise an issue on GitHub.
Use the -c (--concurrent) argument (in the CLI) to make use of available CPU cores for some vhammll functions; this may speed things up (timings are on a 2019 MacBook Pro):
v main.v
./main explore ~/.vmodules/holder66/vhammll/datasets/iris.tab # 10.157 sec
./main explore -c ~/.vmodules/holder66/vhammll/datasets/iris.tab # 4.910 sec
A huge speedup usually happens if you compile using the -prod (for production) option. The compilation itself takes longer, but the resulting code is highly optimized.
v -prod main.v
./main explore ~/.vmodules/holder66/vhammll/datasets/iris.tab # 3.899 sec
./main explore -c ~/.vmodules/holder66/vhammll/datasets/iris.tab # 4.849 sec!!
Note that in this case, -prod gives essentially no speedup when the -c argument is used (4.849 sec versus 4.910 sec).
Please see examples_of_command_line_usage.md
Health care professionals frequently make use of calculators to inform clinical decision-making. Data are collected for a sample of patients: symptoms, findings on physical examination, laboratory and imaging results, and outcome information such as diagnosis, risk for developing a condition, or response to specific treatments. These data then form the basis of a formula that predicts the outcome of interest for a new patient, based on how that patient's symptoms, findings, etc. compare to those in the dataset.
Please see clinical_calculator_example.md.
Please see a worked example here: noisy_data.md
The mnist_train.tab file is too large to keep in the repository. If you wish to experiment with it, it can be downloaded by right-clicking on this link in a web browser, or downloaded via the command line:
wget https://henry.olders.ca/datasets/mnist_train.tab
The process of development in its early stages is described in this essay written in 1989.
Copyright (c) 2017, 2024: Henry Olders.