
Suggests recipes and their origin based on given ingredients, and generates visualization results from the data.


PROJECT-2 IN CLASS PHASE 1 -- PART A

As part of the project, we are provided with the YUMMLY data set: https://www.dropbox.com/s/f0tduqyvgfuin3l/yummly.json?dl=0

We are given the master list of all possible dishes, their ingredients, an identifier, and the cuisine for thousands of different dishes.

We have to present a display of clustered ingredients and train a classifier to predict the cuisine type of a new food.

Phase 1 of the project involves two tasks:

TASK1: CLUSTERING THE CUISINES BASED ON THEIR INGREDIENTS

The dataset includes the recipe id, the type of cuisine, and the list of ingredients of each recipe (of variable length). The data is stored in JSON format.

An example of a recipe node in yummly.json:

{ "id": 24717, "cuisine": "indian", "ingredients": [ "tumeric", "vegetable stock", "tomatoes", "garam masala", "naan", "red lentils", "red chili peppers", "onions", "spinach", "sweet potatoes" ] }

STEP 1: I converted the JSON to a countsMatrix[i,j], where each row of the counts matrix represents a cuisine and each column denotes a unique ingredient from the entire list of ingredients.

countsMatrix[i,j] denotes the number of occurrences of ingredient j in cuisine i in the data set.
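A rough sketch of STEP 1 under that per-cuisine interpretation (the variable names here are illustrative, not necessarily the actual script's):

    import json
    import numpy as np

    # Load the yummly recipes
    with open("yummly.json") as f:
        recipes = json.load(f)

    # Index every cuisine and every unique ingredient
    cuisines = sorted({r["cuisine"] for r in recipes})
    ingredients = sorted({i for r in recipes for i in r["ingredients"]})
    cuisine_idx = {c: k for k, c in enumerate(cuisines)}
    ingredient_idx = {i: k for k, i in enumerate(ingredients)}

    # countsMatrix[i, j] = number of times ingredient j occurs in cuisine i
    countsMatrix = np.zeros((len(cuisines), len(ingredients)), dtype=int)
    for r in recipes:
        row = cuisine_idx[r["cuisine"]]
        for ing in r["ingredients"]:
            countsMatrix[row, ingredient_idx[ing]] += 1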

STEP 2: Generate a tf-idf matrix from the counts matrix and reduce it to 2 components using PCA.

For the tf-idf weighting I have used the sklearn package: from sklearn.feature_extraction.text import TfidfTransformer

For PCA I have used the sklearn package: from sklearn.decomposition import PCA

The reduced data obtained from PCA is then used for clustering.

STEP 3: For clustering I have used the KMeans clustering algorithm. I initially clustered the data into three clusters and finally considered five clusters for the project.
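Put together, STEPS 2 and 3 might look roughly like this minimal sketch (assuming the countsMatrix from STEP 1; five clusters as stated above):

    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Tf-idf weighting of the raw ingredient counts
    tfidf = TfidfTransformer().fit_transform(countsMatrix)

    # Reduce to 2 components so the cuisines can be plotted and clustered
    reduced = PCA(n_components=2).fit_transform(tfidf.toarray())

    # KMeans clustering of the cuisines into five clusters
    kmeans = KMeans(n_clusters=5, random_state=0)
    labels = kmeans.fit_predict(reduced)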


TASK2: VISUALIZING THE CLUSTERS

For visualizing the clusters I have used the pylab and matplotlib packages:

    from pylab import *
    from scipy import *
    import matplotlib.pyplot as plt

The size of each cluster bubble is calculated from a JACCARD SIMILARITY (one cuisine vs. the other cuisines in its cluster): we intersect all ingredients of cuisine i with the union of the ingredients of all OTHER cuisines in its CLUSTER (e.g. intersect(filipino_ingredients, other_asian_ingredients)) and divide by the union of all ingredients in the cluster.

Each cuisine is drawn as a bubble whose size is proportional to this within-cluster similarity, i.e. the Jaccard similarity of the cuisine with the other members of its cluster (computed above).
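A sketch of the bubble-size computation, assuming ingredients_by_cuisine is a dict mapping each cuisine to its set of ingredients, and cuisines, reduced and labels come from the sketches above (the scaling factor is arbitrary):

    import matplotlib.pyplot as plt

    def within_cluster_jaccard(cuisine, members, ingredients_by_cuisine):
        """Jaccard similarity of one cuisine vs. the other cuisines in its cluster."""
        own = ingredients_by_cuisine[cuisine]
        others = set().union(*(ingredients_by_cuisine[c] for c in members if c != cuisine))
        # intersection with the other cuisines, divided by the union of all ingredients in the cluster
        return len(own & others) / len(own | others)

    # Bubble size grows with the within-cluster similarity of each cuisine
    sizes = []
    for k, cuisine in enumerate(cuisines):
        members = [c for c, lab in zip(cuisines, labels) if lab == labels[k]]
        sizes.append(2000 * within_cluster_jaccard(cuisine, members, ingredients_by_cuisine))

    plt.scatter(reduced[:, 0], reduced[:, 1], s=sizes, c=labels, alpha=0.6)
    for k, cuisine in enumerate(cuisines):
        plt.annotate(cuisine, (reduced[k, 0], reduced[k, 1]))
    plt.show()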

PROJECT-2 IN CLASS PHASE 2 -- PART B


The script builds lists of the different fields in the data set: the meal ids, the cuisines, the ingredient list of each meal, and a single string (unicode) representation of each meal's ingredients.

Lists created: meal_id, cuisine, ingredients, ing

The script takes the ingredients as input from the user and appends them to the "ing" list.

It then vectorizes the ing list into features using the TfidfVectorizer.

The transformed data is split into a train set and a test set:

    train_set = ing[:len(ing)-1]
    test_set = ing[len(ing)-1]   # the user input of ingredients
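A minimal sketch of this step, assuming ing holds one ingredient string per meal with the user's input appended last (the names mirror the description above and are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Append the user's space-separated ingredients as the last document
    user_input = input("Enter the ingredients that you want to compare : ")
    ing.append(user_input)

    # Turn every meal's ingredient string (plus the user input) into tf-idf features
    features = TfidfVectorizer().fit_transform(ing)

    # The last row is the user input; everything before it is training data
    train_set = features[:-1]
    test_set = features[-1]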

---Train the Model

Take the number n of closest foods as input from the user and train the model using the KNeighborsClassifier:

    close_n = KNeighborsClassifier(n_neighbors=n)
    return close_n.fit(train_set, cuisine)

---Predict the Cuisine and the Top N Closest Foods

For the top N closest foods, the probability predictor of the KNeighborsClassifier is used:

    predicted_cuisine = close_n.predict_proba(test_set)[0]

which gives the probability of each cuisine (in the order of close_n.classes_).

To identify the most common cuisine among the N closest foods:

    predicted_single_cuisine = close_n.predict(test_set)
    most_common_Cuisine = predicted_single_cuisine[0]

For the cuisine types of the top N closest foods:

    # cuisine labels known to the classifier (same order as the probabilities below)
    predicted_class = close_n.classes_
    # probabilities of the cuisines among the top N closest foods
    predicted_cuisine = close_n.predict_proba(test_set)[0]

For the meal ids of the top N closest foods with matching ingredients:

    # kneighbors returns the distances and the indices of the N nearest neighbours
    match_perc, match_id = close_n.kneighbors(test_set)
    for i in range(len(match_id[0])):
        print(meal_id[match_id[0][i]])
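Tying those fragments together, a hedged sketch of a report function (assuming the trained close_n classifier, the single-row test_set, and the meal_id list described above; the function name is illustrative):

    def report_prediction(close_n, test_set, meal_id):
        # Most common cuisine among the n nearest neighbours
        print("The model predicts:", close_n.predict(test_set)[0])

        # Non-zero per-cuisine probabilities, ordered as in close_n.classes_
        probabilities = close_n.predict_proba(test_set)[0]
        for cuisine_name, p in zip(close_n.classes_, probabilities):
            if p > 0:
                print("The ingredients resemble %s with %f percentage" % (cuisine_name, 100 * p))

        # Meal ids of the n nearest neighbours
        distances, indices = close_n.kneighbors(test_set)
        print("Closest meals:", [meal_id[i] for i in indices[0]])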

---Accuracy Checker

I have also included an accuracy checker: training on 95% of the data and testing on the remaining 5%, the accuracy was found to be 72%.
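A rough sketch of such a check (not necessarily the exact Accuracy_checker.py logic; the 95/5 split and 5 neighbours follow the description above, and ing/cuisine are assumed to be one ingredient string and one label per meal in the data set):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    features = TfidfVectorizer().fit_transform(ing)      # one ingredient string per meal
    X_train, X_test, y_train, y_test = train_test_split(
        features, cuisine, test_size=0.05, random_state=42)

    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))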

ASSUMPTIONS:

  1. Accepts user input as a space-separated string of ingredients; input should not be given in any other format.
  2. The user specifies the n closest neighbours every time they run the model.
  3. Since this is a lazy ML approach, the model is retrained on every run (based on the chosen n), leading to a long wait time before the results appear.

DESIGN DECISIONS FOR THE MODEL:

  1. Supervised machine learning is used instead of an unsupervised approach like clustering, since the yummly data set already contains the meal information along with a cuisine tag.
  2. A K Nearest Neighbours classifier is used, and the model is trained with the entire yummly dataset.
  3. Due to memory issues encountered during testing, the model is made lazy: it is trained only after the user gives input.
  4. The data set has 20 cuisines. Testing (Accuracy_checker.py) showed that the model had an accuracy of 72% when the yummly data set is split into a 95% train set and a 5% test set with 5 nearest neighbours. The model therefore shows the predicted cuisine as well as the probabilities of all non-zero cuisines, to account for alpha and beta errors.

FILES NEEDED:

  1. yummly.json is in the data directory (specified in the script; no need to supply it externally).
  2. food_detector.py predicts the cuisine and the n nearest meals.
  3. Accuracy_checker.py (runs with 5 nearest neighbours and a 95% train / 5% test split selected randomly from the yummly data set) uses the functions in the food_detector package and checks its performance. (Warning: the running time of this script is very long.)

EXECUTABLE METHOD:

    python3 AnalyzingFood.py    # PHASE 1: clustering and visualization
    python3 food_detector.py    # PHASE 2


OUTPUT:

    vishnu@vishnu-Inspiron-5537:/Desktop/AnalyzingFood/AnalyzingFood$ python3 AnalyzingFood.py
    vishnu@vishnu-Inspiron-5537:/Desktop/AnalyzingFood/AnalyzingFood$ python3 food_detector.py
    Reading all the data files and creating lists....
    Enter the ingredients that you want to compare : eggs sugar
    Enter the number of closest items you want to find : 6
    Model has been successfully trained..
    Trying to predict the cuisine and n closest meal items...

    The model predicts that the ingredients resembles french

    The ingredients resemble brazilian with 16.666667 percentage
    The ingredients resemble french with 33.333333 percentage
    The ingredients resemble italian with 16.666667 percentage
    The ingredients resemble mexican with 16.666667 percentage
    The ingredients resemble russian with 16.666667 percentage

    The 6 closest meals are listed below : 48995 38948 2399 42427 11897 34440

    --- It took 37.185532569885254 seconds ---

    Enter 1 if you want to search again or 2 if you want to quit..1
    Enter the ingredients that you want to compare : wheat sugar eggs
    Enter the number of closest items you want to find : 5
    Model has been successfully trained..
    Trying to predict the cuisine and n closest meal items...

    The model predicts that the ingredients resembles indian

    The ingredients resemble french with 20.000000 percentage
    The ingredients resemble indian with 60.000000 percentage
    The ingredients resemble italian with 20.000000 percentage

    The 5 closest meals are listed below : 5366 22213 22463 30385 11251

    --- It took 92.15953874588013 seconds ---

    Enter 1 if you want to search again or 2 if you want to quit..2
    vishnu@vishnu-Inspiron-5537:~/Desktop/AnalyzingFood/AnalyzingFood$

REQUIRED:

DEPENDENCIES:

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from scipy.sparse import csr_matrix

    import json
    import codecs
    import pandas as pd
    import numpy as np
    import time

VERSION: Python 3.5+ preferred; any version above Python 3.0 should work.

REFERENCES: YUMMLY DATA SET- https://www.dropbox.com/s/f0tduqyvgfuin3l/yummly.json?dl=0

Sorting List using Itemgetter- http://stackoverflow.com/questions/10695139/sort-a-list-of-tuples-by-2nd-item-integer-value
