
Galina-Blokh/ai_data_minig_aidock


Introductory information

Data Mining NLP Python project:

The goal was to collect the data and build a neural network for binary text classification using only TensorFlow, Pandas,
NumPy, and BeautifulSoup. The implementation is in Python 3.6+. Web scraping is done with asynchronous HTTP
requests (Grequests).
The project contains two parts:

  • Scraping data from the recipe website (data collection)
  • Data Science part (built around a classification problem: for each paragraph, determine the probability that its
    label is ‘ingredients’ or ‘recipe’)

!!!These parts do not run as one pipeline!!! No console output - only the log (except when you run run.sh)!!!


All functions are wrapped with the @profile decorator, which writes execution time and memory usage to the log file
during program execution. The code of this utility is in utils.py (in the project root directory).
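For reference, here is a minimal sketch of what such a decorator could look like, using only the standard library (the actual implementation in utils.py may differ):

    import functools
    import logging
    import time
    import tracemalloc

    logging.basicConfig(filename='recipes_logging.log', level=logging.INFO)

    def profile(func):
        """Log wall-clock time and peak memory usage of every call."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            tracemalloc.start()
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                _, peak = tracemalloc.get_traced_memory()
                tracemalloc.stop()
                logging.info('%s took %.3f s, peak memory %.1f KiB',
                             func.__name__, elapsed, peak / 1024)
        return wrapper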
The project writes to a single log file, recipes_logging.log (in the project root directory). The project has a run.sh
file, which you run from the CLI as bash run.sh http://www.anylinkfromtestlinksfileindatafolder.com. It then prints
to the terminal window a JSON-like text extracted from the entered link's page. The recipes_logging.log file will
contain information about all functions that ran: steps made, classification, probability, and metrics.
requirements.txt lists the applied Python packages.
notebooks_and_drafts contains just notebooks and drafts; the project run doesn't touch it.
All constants (or almost all) are in config.py.

Methodological information

Scraping data from the recipes website

Scraped data {'Recepie': [text], 'INGREDIENTS': text} is stored in data/recipes.pkl.
The scraping part starts with main_scraper.py. It extracts all links from a page listing all recipes, using
BeautifulSoup and asynchronous HTTP requests (Grequests). It collects all data into a defaultdict and saves it into the
pickle file data/new_recipe.pkl. The first ten URLs are saved into data/test_links.txt; use these links to test the
model with the run.sh file.
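A hedged sketch of this step (the real code is in main_scraper.py; the index URL handling and the HTML selectors here are simplified assumptions):

    import pickle
    from collections import defaultdict

    import grequests                   # asynchronous HTTP requests
    from bs4 import BeautifulSoup

    INDEX_URL = 'https://www.loveandlemons.com'    # page listing the recipes

    # Fetch the index page and extract all links from it.
    index_response = grequests.map([grequests.get(INDEX_URL)])[0]
    soup = BeautifulSoup(index_response.text, 'html.parser')
    links = [a['href'] for a in soup.find_all('a', href=True)]

    # Fetch all recipe pages concurrently and collect their text into a defaultdict.
    data = defaultdict(list)
    for response in grequests.map(grequests.get(url) for url in links):
        if response is None:           # failed request
            continue
        page = BeautifulSoup(response.text, 'html.parser')
        # The real code uses site-specific selectors to separate the recipe
        # instructions ('Recepie') from the 'INGREDIENTS' block.
        data['Recepie'].append(page.get_text())

    with open('data/new_recipe.pkl', 'wb') as f:
        pickle.dump(data, f)

    with open('data/test_links.txt', 'w') as f:
        f.write('\n'.join(links[:10]))    # first ten URLs, held out for run.sh tests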
To continue, run preprocess.py.

Preprocessing, modeling, feature engineering

The next stage starts by running main_preprocess(filename) in preprocess.py. This function loads the data from the
data/recipes.pkl file. It calls load_data_transform_to_set(filename) to transform the data into a set with the columns
paragraph and label, then calls utils.stratified_split_data(text, label, TEST_SIZE).
After splitting, it preprocesses the train and test sets separately with preprocess_clean_data(train_dataset.as_numpy_iterator(), f'train').
The preprocess_clean_data function cleans the series of text and creates new columns with additional features. The new
sets are saved into two pkl files: data/train_data_clean.pkl and data/test_data_clean.pkl.
This is the end of the first Data Science part. The next step is in model_train.py.
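A hedged sketch of this flow (the call names and arguments come from the description above; return values and column access are assumptions):

    import utils
    from config import TEST_SIZE
    from preprocess import load_data_transform_to_set, preprocess_clean_data

    def main_preprocess(filename='data/recipes.pkl'):
        # Reshape the pickled recipes into a set with 'paragraph' and 'label' columns.
        data = load_data_transform_to_set(filename)
        text, label = data['paragraph'], data['label']   # assumed column access

        # A stratified split keeps the 80/20 label imbalance in both subsets.
        train_dataset, test_dataset = utils.stratified_split_data(text, label, TEST_SIZE)

        # Clean the text, engineer extra feature columns, and pickle each subset
        # as data/train_data_clean.pkl and data/test_data_clean.pkl.
        preprocess_clean_data(train_dataset.as_numpy_iterator(), 'train')
        preprocess_clean_data(test_dataset.as_numpy_iterator(), 'test')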

Train, tune, evaluation

When you run model_train.py, it first reads data/train_data_clean.pkl and data/test_data_clean.pkl.
It then counts the max sentence/sequence length and the vocabulary size. Next it calls the preprocess.tfidf(texts, vocab_size)
function to transform the data into sequences. It then splits the data into nlp and meta sets for the train and test
sets and calls preprocess.get_model(tf_idf_train, X_meta_train, results, embedding_dimensions=EMBEDDING_DIM) to create,
build, and train the MODEL. The next step is evaluation on the test sets, writing the results into the log file,
plotting loss vs val_loss, and saving the model to config.MODEL_NAME = 'data/my_model.h5'. This is the end of the
modeling and preprocessing.
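An illustrative, self-contained sketch of a model with separate nlp and meta inputs (this is not the repository's get_model; layer sizes, sequence length, and the number of meta features are made up):

    import numpy as np
    import tensorflow as tf

    VOCAB_SIZE = 2474       # from the data description below; varies per scraper run
    MAX_LEN = 100           # hypothetical max sequence length
    EMBEDDING_DIM = 64      # hypothetical value of config.EMBEDDING_DIM
    N_META = 5              # hypothetical number of engineered meta features

    nlp_in = tf.keras.Input(shape=(MAX_LEN,), name='nlp')     # token sequences
    meta_in = tf.keras.Input(shape=(N_META,), name='meta')    # engineered features
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(nlp_in)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.concatenate([x, meta_in])
    out = tf.keras.layers.Dense(1, activation='sigmoid')(x)   # probability of label 1

    model = tf.keras.Model([nlp_in, meta_in], out)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Dummy data, only to show the fit/evaluate/save calls.
    X_nlp = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
    X_meta = np.random.rand(32, N_META).astype('float32')
    y = np.random.randint(0, 2, size=(32, 1))
    model.fit([X_nlp, X_meta], y, epochs=1, verbose=0)
    model.evaluate([X_nlp, X_meta], y, verbose=0)
    model.save('my_model.h5')   # the project saves to config.MODEL_NAME = 'data/my_model.h5'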

The final stage is to run bash run.sh http://www.anylinkfromtestlinksfileindatafolder.com in the terminal.
It calls main_task_run.py, which accepts a URL as an argument. It then calls get_one.py to collect the needed data from
the page. We assume the URL is valid and redirects you to a page where a valid recipe is located.
The next step calls utils.print_json(url_to_get_recipe, json_file), the function which gives you console output in
JSON-like format. Next, a transformation is applied to the JSON file: from list ==> to string, for each element. The
function preprocess.load_data_transform_to_set() is called to transform the defaultdict into a data set. The text is
preprocessed, new features are engineered, and the result is saved to a pickle file using preprocess.preprocess_clean_data().
The data is split into nlp and meta sets, and eval_on_one_page(tfidf_one_page, X_meta_one_page, y_one_page, model, text)
from model_train.py is called.
All information about the metrics, model, table with predictions, and probability values is written into the log file.
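An illustrative sketch of the JSON-like console output and the list ==> string transformation described above (made-up recipe text; this is not the project's utils.print_json):

    import json
    from collections import defaultdict

    page = defaultdict(list)
    page['Recepie'] = ['Preheat the oven.', 'Roast the chickpeas for 20 minutes.']
    page['INGREDIENTS'] = ['2 cups chickpeas', '1 lemon']

    # JSON-like console output, roughly what appears in the terminal window.
    print(json.dumps(page, indent=2))

    # from list ==> to string: join each list into a single string per key.
    page = {key: '\n'.join(value) for key, value in page.items()}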


IMPORTANT: If you want to test the models in the /data folder on the test links, you have to run notebooks_and_drafts/list_dir.py.
The /data folder can contain several pre-trained models. notebooks_and_drafts/list_dir.py runs with the URLs from data/test_links.txt.

Data specific information

The data is collected from 'https://www.loveandlemons.com'.
The data is imbalanced 80/20.
Paragraph = all lines of the ingredients block, labeled 1.
Paragraph = each newline-separated line of the instructions, labeled 0 (see the small example below).
Vocabulary size = 2474 words/lemmas. It can change: it depends heavily on the scraping part, since the links to scrape
arrive in a different order each run, so the first 10 held-out links differ between scraper runs and the vocabulary
size is different every time.
Test split size = 0.2.
Data from the 10 URLs in data/test_links.txt was not included in the model training set.
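An illustrative example of this labeling scheme (hypothetical recipe text; the interpretation follows the description above):

    ingredients = '2 cups chickpeas\n1 lemon'
    instructions = 'Preheat the oven.\nRoast the chickpeas.\nSqueeze the lemon on top.'

    # The ingredients block becomes a paragraph labeled 1; each newline-separated
    # instruction line becomes a paragraph labeled 0.
    paragraphs = [(ingredients, 1)]
    paragraphs += [(line, 0) for line in instructions.split('\n')]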