


Pevicsanch/advanced-machine-learning


advanced-machine-learning

IT Academy - Data Science Itinerary: Advanced machine learning.

S12 - T01 : Advanced machine learning: pipelines, gridsearch and textmining

Practice with pipelines, grid search and text mining. Let's start with several basic exercises.

Scikit-learn pipelines are a tool that simplifies preprocessing steps such as feature extraction, feature scaling, and dimensionality reduction by chaining them with a model into a single estimator. Here we are going to learn how pipelines work and how to implement them. In addition, we will learn about text mining (also known as text analysis), which is the process of transforming unstructured text into structured data for easy analysis.
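As a minimal sketch of how such a pipeline is put together, the example below chains feature scaling, dimensionality reduction and a classifier. The iris dataset, the step names (`scale`, `reduce`, `classify`) and the choice of logistic regression are assumptions made for illustration only; they are not taken from this project.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data only: the project itself uses a different dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, estimator) pair; the step names are arbitrary labels.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),                     # feature scaling
    ("reduce", PCA(n_components=2)),                 # dimensionality reduction
    ("classify", LogisticRegression(max_iter=200)),  # final model
])

# fit() runs fit_transform on every preprocessing step, then fits the model.
pipe.fit(X_train, y_train)

# score() applies the already-fitted transforms to the test data before predicting.
accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the preprocessing steps are fitted only on the training split, the pipeline avoids leaking information from the test data into the scaler or the PCA projection.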

Objectives:

This project consists of two parts:

In the first part we are going to build a random forest prediction model. Here we will practice downloading data from a web page and saving it to the workspace. The dataset was obtained from the UCI Machine Learning Repository and consists of votes cast by members of the US House of Representatives. Our goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues.

We are also going to use pipelines, a tool that will help us automate the preprocessing of the data. In addition, we are going to tune the hyperparameters of the model, which should help us improve its performance.
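The combination of a pipeline and hyperparameter tuning could look roughly like the sketch below. It does not download the real dataset: the votes are synthetic stand-ins generated with NumPy, and the imputation strategy, step names and parameter grid are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for voting records (NOT the real UCI data).
rng = np.random.default_rng(42)
n_members, n_issues = 200, 8
party = rng.integers(0, 2, size=n_members)  # 0 = democrat, 1 = republican

# Members of party 1 vote "yes" (1.0) with probability 0.8, party 0 with 0.2.
p_yes = np.where(party[:, None] == 1, 0.8, 0.2)
votes = (rng.random((n_members, n_issues)) < p_yes).astype(float)

# Blank out about 5% of the votes to mimic missing entries.
votes[rng.random(votes.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    votes, party, test_size=0.25, random_state=0, stratify=party
)

pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),  # fill missing votes
    ("forest", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "forest__n_estimators": [50, 100],
    "forest__max_depth": [3, None],
}

# 5-fold cross-validated search over every combination in the grid.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Searching over the whole pipeline rather than the bare model means the imputer is refitted inside each cross-validation fold, so the validation scores are not contaminated by the held-out rows.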

In the second part of the project we are going to start doing web scraping. We are going to pick an article from the New York Times and analyze it. In this part we are going to practice the following. First of all, we are going to use the BeautifulSoup library, one of the most widely used tools for web scraping. We will also practice writing functions, which will help us improve our programming skills, and we will practice text preprocessing. We will then approach natural language processing using libraries like NLTK, spaCy and TextBlob. Finally, we will practice the use of lambda functions, among other things.
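The scraping and preprocessing steps listed above can be sketched as follows. To stay self-contained, the example parses an inline HTML string instead of fetching a real article, and it uses a small hand-written stopword set and a regular expression in place of NLTK, spaCy or TextBlob. The HTML snippet, the `STOPWORDS` set and the function names are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded article.
HTML = """
<html><body>
<h1>Sample headline</h1>
<p>Pipelines make preprocessing repeatable.</p>
<p>Text mining turns raw text into structured data, and structured data is easy to analyze.</p>
<script>var x = 1;</script>
</body></html>
"""

# Tiny illustrative stopword list; a real project would use a library-provided one.
STOPWORDS = {"a", "and", "into", "is", "the", "to"}


def extract_paragraphs(html):
    """Return the text of every <p> element in the document."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(" ", strip=True) for p in soup.find_all("p")]


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return list(filter(lambda word: word not in STOPWORDS, tokens))


paragraphs = extract_paragraphs(HTML)
tokens = preprocess(" ".join(paragraphs))

# Sort by descending frequency, breaking ties alphabetically, using a lambda key.
counts = Counter(tokens)
top_words = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:3]
print(top_words)
```

Selecting only the `<p>` elements keeps the headline and the script contents out of the analysis; a real article page would likely need more specific selectors.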

To access the web version of this project click here