This project classifies protein sequences in FASTA format using multiclass classification algorithms in Spark.
- Prepare the data with a Python script
- Convert the raw data to CSV and then to LIBSVM format in R
- Run the Spark ML algorithms in a Jupyter notebook
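
The data preparation steps above could look roughly like the following. This is a minimal sketch, not the project's actual script: the function names (`parse_fasta`, `composition`, `to_libsvm_line`) are hypothetical, the amino-acid composition features are an assumed choice, and the LIBSVM line is written directly from Python here, whereas the project does that conversion in R.

```python
from collections import Counter

# Assumption: features are the relative frequencies of the 20 standard residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def to_libsvm_line(label, features):
    """Format one sample as 'label index:value ...' with 1-based indices."""
    pairs = [f"{i}:{v:.4f}" for i, v in enumerate(features, start=1) if v != 0.0]
    return " ".join([str(label)] + pairs)


if __name__ == "__main__":
    sample = ">p1 example\nAC\nCA\n>p2 example\nGG\n"
    # Assumption: class labels are known separately; here they are made up.
    for label, (header, seq) in enumerate(parse_fasta(sample)):
        print(to_libsvm_line(label, composition(seq)))
```

Zero-valued features are omitted because LIBSVM is a sparse format. Spark's classifiers expect class labels to be integers starting at 0.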
In our evaluation, the best result was an accuracy of 57% with 30 trees. Further work could increase the number of trees and the maximum depth, and tune other hyperparameters.
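
The tuning mentioned above could be sketched with Spark's built-in cross-validation. This assumes the tree-based model is a `RandomForestClassifier`, which the text does not confirm. The function name `tune_random_forest`, the grid values, the 80/20 split, and the file path are all hypothetical.

```python
def tune_random_forest(libsvm_path, num_trees=(30, 60, 100), max_depths=(5, 10, 15)):
    """Grid-search numTrees and maxDepth; return (best_model, test_accuracy)."""
    # Imported here so the module loads even where Spark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.appName("protein-rf-tuning").getOrCreate()

    # The libsvm reader produces a DataFrame with 'label' and 'features' columns.
    data = spark.read.format("libsvm").load(libsvm_path)
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    rf = RandomForestClassifier(labelCol="label", featuresCol="features", seed=42)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )

    # Every combination of numTrees and maxDepth is tried (assumed values).
    grid = (
        ParamGridBuilder()
        .addGrid(rf.numTrees, list(num_trees))
        .addGrid(rf.maxDepth, list(max_depths))
        .build()
    )

    cv = CrossValidator(
        estimator=rf,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=3,
        seed=42,
    )
    cv_model = cv.fit(train)

    # Accuracy of the best parameter combination on the held-out split.
    accuracy = evaluator.evaluate(cv_model.transform(test))
    return cv_model.bestModel, accuracy


if __name__ == "__main__":
    # Hypothetical path to the LIBSVM file produced in the conversion step.
    model, acc = tune_random_forest("proteins.libsvm")
    print(f"Held-out accuracy: {acc:.3f}")
```

Cross-validating nine parameter combinations over three folds trains 27 forests, so the grid is best kept small at first.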