This is a personal project that builds a data pipeline to prepare message data from major natural disasters around the world. I built an ETL pipeline and a Random Forest classification pipeline to categorize emergency messages based on the needs communicated by the sender.
- Cleaned the data with natural language processing: normalized, tokenized, and lemmatized the text messages (see the tokenizer sketch after this list)
- Built pipelines to train a Random Forest classifier with grid search; applied TF-IDF to weight the words in each message
- Used Random Forest in the final model and displayed the classification results in the web app
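The cleaning step follows a standard NLTK recipe. Below is a minimal sketch of such a tokenizer; the function name and the exact normalization choices are illustrative assumptions, not a copy of the code in this repo.

```python
# Hedged sketch of the NLP cleaning step: normalize, tokenize, lemmatize.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download(["punkt", "stopwords", "wordnet"], quiet=True)


def tokenize(text):
    """Normalize, tokenize, and lemmatize one raw message."""
    # Normalize: lowercase and replace anything that is not alphanumeric
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Tokenize into words and drop common English stop words
    tokens = [t for t in word_tokenize(text) if t not in stopwords.words("english")]
    # Lemmatize each token to its dictionary form
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]
```

This tokenizer is what the TF-IDF step in the training pipeline uses to split each message into terms.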
Data Source: Figure Eight, San Francisco, CA
If you are running this in your local environment, run conda install --file requirements.txt
or pip install -r requirements.txt to install the required Python module dependencies.
app
- template
- master.html # main page of web app
- go.html # classification result page of web app
- run.py # Flask file that runs app
data
- disaster_categories.csv # data to process
- disaster_messages.csv # data to process
- process_data.py # ETL pipeline that cleans the data
- Disaster.db # database the clean data is saved to
models
- train_classifier.py # ML pipeline that trains and saves the classifier
- classifier.pkl # saved model
README.md
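For orientation, here is a minimal sketch of the ETL flow that process_data.py implements; the helper name, table name, and the category-expansion details are assumptions based on the Figure Eight CSV layout.

```python
# Hedged sketch of data/process_data.py: load, merge, clean, and store messages.
import sys

import pandas as pd
from sqlalchemy import create_engine


def run_etl(messages_csv, categories_csv, db_path):
    # Load and merge the two CSV files on their shared id column
    messages = pd.read_csv(messages_csv)
    categories = pd.read_csv(categories_csv)
    df = messages.merge(categories, on="id")

    # Expand the semicolon-separated categories string
    # (e.g. "related-1;request-0;...") into one binary column per category
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0]
    cats = cats.apply(lambda col: col.str.rsplit("-", n=1).str[-1].astype(int))
    df = pd.concat([df.drop(columns="categories"), cats], axis=1).drop_duplicates()

    # Persist the clean table to SQLite for the training script to read
    engine = create_engine(f"sqlite:///{db_path}")
    df.to_sql("messages", engine, index=False, if_exists="replace")


if __name__ == "__main__":
    run_etl(*sys.argv[1:4])
```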
Run the following commands in the project's root directory to set up the database and the model.
• To run the ETL pipeline that cleans the data and stores it in the database: python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/Disaster.db
• To run the ML pipeline that trains the classifier and saves it (see the pipeline sketch below): python models/train_classifier.py data/Disaster.db models/classifier.pkl
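A minimal sketch of the TF-IDF + Random Forest pipeline that train_classifier.py builds; the grid-search values here are illustrative assumptions rather than the tuned settings.

```python
# Hedged sketch of the ML pipeline: TF-IDF features feeding a multi-output
# Random Forest, tuned with grid search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def build_model(tokenize):
    pipeline = Pipeline([
        # Bag-of-words counts using the custom tokenizer sketched above
        ("vect", CountVectorizer(tokenizer=tokenize)),
        # Re-weight the counts with TF-IDF so very common words count less
        ("tfidf", TfidfTransformer()),
        # One Random Forest per output category
        ("clf", MultiOutputClassifier(RandomForestClassifier())),
    ])
    # Small illustrative grid; the real search space may differ
    param_grid = {
        "clf__estimator__n_estimators": [50, 100],
        "clf__estimator__min_samples_split": [2, 4],
    }
    return GridSearchCV(pipeline, param_grid, cv=3)
```

The training script fits this model on the messages table produced by the ETL step and pickles the best estimator as models/classifier.pkl.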
Step 1: Type in the command line: python run.py
Step 2: Open another terminal window and type: env | grep WORK
Step 3: In a new web browser window, go to https://SPACEID-3001.SPACEDOMAIN, where SPACEID and SPACEDOMAIN are the values shown in Step 2.
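For context, here is a minimal sketch of what run.py does; the route names, table name, column layout, and port are assumptions based on the file tree and the steps above.

```python
# Hedged sketch of app/run.py: load the clean data and the trained model,
# then serve the master and go templates.
import pickle

import pandas as pd
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)

# Load the cleaned messages and the pickled classifier at startup
engine = create_engine("sqlite:///../data/Disaster.db")
df = pd.read_sql_table("messages", engine)
with open("../models/classifier.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/")
@app.route("/index")
def index():
    # master.html shows the Plotly visualizations of the training data
    return render_template("master.html")


@app.route("/go")
def go():
    # Classify the user's message and show one label per category;
    # assumes the first four columns are id, message, original, genre
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    results = dict(zip(df.columns[4:], labels))
    return render_template("go.html", query=query, classification_result=results)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3001, debug=True)
```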
Environment - Jupyter Notebook and Python IDE (Atom)
- NumPy
- Pandas
- Scikit-Learn
- Plotly
- SQLAlchemy
- Flask
I would like to thank Figure Eight for providing the data and assistance.