homework-4-at3250_hw4

#Amon Tokoro Timothy Luk

homework-4-at3250_hw4 created by GitHub Classroom

Since reddit_200k_train.csv contains a sufficient number of rows, we did not test our models on reddit_200k_test.csv.

Since the data is imbalanced, we undersampled it and ended up with around 60k rows for each target group. We could not get CountVectorizer to work well inside our pipeline: it takes a list of strings as input, while logistic regression takes a NumPy array or a list of lists. We therefore ran CountVectorizer outside of the pipeline and the grid search. Adding new features was also a problem when we tried to pipeline the entire process. CountVectorizer outputs a sparse matrix, so the new features, which we computed as an ndarray, had to be converted into sparse matrix format before they could be appended to it, and we could not get a ColumnTransformer to do this. As a result, handling unseen data is extremely cumbersome.
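The manual workflow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy texts and labels, the `undersample` and `extra_features` helpers, and the two hand-made features (character count and word count) are all assumptions made for the example. It uses `scipy.sparse.hstack` to append the dense features to the CountVectorizer output, and a small `predict_unseen` helper to show the same steps being repeated for new data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the Reddit comments; texts and labels are made up.
texts = [
    "this post is spam buy now",
    "great explanation thanks for sharing",
    "free stuff click here buy now",
    "interesting study on sleep",
    "the methodology looks sound",
    "thanks for the detailed answer",
]
labels = np.array([1, 0, 1, 0, 0, 0])  # imbalanced: four 0s, two 1s


def undersample(y, rng):
    """Return sorted row indices keeping an equal number of rows per class."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = [
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ]
    return np.sort(np.concatenate(kept))


def extra_features(docs):
    """Hypothetical hand-made features: character count and word count."""
    return np.array([[len(d), len(d.split())] for d in docs], dtype=float)


rng = np.random.default_rng(0)
idx = undersample(labels, rng)
train_texts = [texts[i] for i in idx]
y_train = labels[idx]

# Vectorize outside of any pipeline; the result is a sparse matrix.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_texts)

# Convert the dense features to sparse format, then append them column-wise.
X_extra = csr_matrix(extra_features(train_texts))
X_train = hstack([X_counts, X_extra], format="csr")

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)


def predict_unseen(docs):
    """Repeat the same manual steps for new text before predicting."""
    X_new = hstack(
        [vectorizer.transform(docs), csr_matrix(extra_features(docs))],
        format="csr",
    )
    return clf.predict(X_new)
```

Every prediction on new text has to go through `predict_unseen` (or an equivalent) so that the fitted vectorizer and the feature function are applied in the same order as during training, which is the bookkeeping that a single pipeline would otherwise handle.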

Folders and files:
.ipynb_checkpoints
Data
Homework4.ipynb
Homework4.pdf
README.md
homework4_copy.ipynb
notebook.tex
untitled.txt

aml-spring-19/homework-4-at3250_hw4

Folders and files

Latest commit

History

Repository files navigation

homework-4-at3250_hw4

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages