QMSS GR5069 - APPLIED DATA SCIENCE FOR SOCIAL SCIENTISTS

Instructor: Marco Morales, Columbia University
Co-Instructor: Nana Yaw Essuman, Columbia University

TAs: Ludovico Genovese, Columbia University
Naveen Reddy Dyava, Columbia University

This repository is a companion to the course Applied Data Science for Social Scientists, taught in the Quantitative Methods in the Social Sciences program in Spring 2024.

It contains curated reference materials, slides, and sample code. You can find the most up-to-date version of the course syllabus here. Make sure to check regularly for updates.

Overview

In his now classic Venn diagram, Drew Conway described Data Science as sitting at the intersection between good hacking skills, math / statistics knowledge, and substantive expertise. Standard quantitative training in the social sciences supplies a fluid combination of all three, but tailored to understanding human behavior, and to explaining why things happen the way they do. Social Scientists are, thus, a particular kind of Data Scientist.

This course is a collection of topics that fill very specific gaps identified over the years on what a social scientist should know at minimum when entering Data Science, and the skills and best practices a Data Scientist should know to add immediate value to their teams.

To do that, this course aims to:

teach processes and practices at the intersection of Data Science and (Data and ML) Engineering that are central to the Data Product Cycle. Data Scientists typically first encounter Engineering and MLOps on the job, and there is much to be gained from early exposure to these industry-standard concepts and practices;
sharpen technical skills not only in fitting models, but particularly in building knowledge and generating insights from data. While this may seem obvious for a Data Scientist, it is not always the focus of training;
train students to work effectively in teams to build projects and products. Data Science is collaborative by nature, and its best practices for efficient collaboration are constantly evolving. Collaboration on school projects and assignments is vastly different from the highly structured collaboration that happens in Data Science teams, yet it is not always the focus of training.

All of these are highly valued skills in the Data Science job market, but not always considered explicitly as part of an integral Data Science curriculum.

Prerequisites:

It is assumed that students have basic to intermediate knowledge of an object-oriented programming language (e.g. R or Python), including experience using it for data manipulation, visualization, and model estimation. Some background in mathematics, statistics, and algebra is also assumed.

Course Roadmap

outline\

     --- fundamentals and best practices ---

| -- topic  1 : THE DATA SCIENCE SHOP ROADMAP
| -- topic  2 : VERSION CONTROL & GITHUB
| -- topic  3 : STRUCTURING YOUR WORKSPACE: DS & DE PERSPECTIVES
| -- topic  4 : CODING ETIQUETTE
| -- topic  5 : MANAGING THE PROCESS

     --- the practice of Data Science ---

| -- topic  6 : DATA PIPELINE IN PRACTICE
| -- topic  7 : MISSING DATA & DATA QUALITY
| -- topic  8 : MODEL DEPLOYMENT & VERSIONING,
               WORKING ENVIRONMENTS (DEV, STAGING, PROD)
| -- topic  9 : EXPLANATION v PREDICTION
| -- topic 10 : MODEL EVALUATION
| -- topic 11 : FRONTENDS AND DATA VISUALIZATION
| -- topic 12 : WORKFLOW COLLABORATION

Course Resources

There are no required textbooks for this course. The course will rely on a combination of curated reading materials, in-class workshops and take-home exercises that will leverage the following tools:

Slack
GitHub / GitHub classroom
AWS
Databricks

To actively participate in this course

Make sure to have the latest versions of R/RStudio, and/or Anaconda, as well as git installed on your computer. Sign up for a GitHub account if you don't have one already.
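
A quick way to confirm the tools are available is to check for them from the command line. The following is a minimal sketch, not an official course script: the helper name `check_tool` is hypothetical, and the command names `git`, `R`, and `conda` assume default installations that put those commands on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found"
        return 1
    fi
}

# Assumed command names; adjust to your setup
# (for example, Anaconda normally provides `conda`).
for tool in git R conda; do
    check_tool "$tool" || true
done
```

Any tool reported as not found either needs to be installed or added to your PATH.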

Registered students will receive instructions to access GitHub classroom, Slack, AWS, and Databricks.

Accessing course materials in this repo

install git on your local machine
from the command line, go to the directory where you want to clone this repo
```
$ cd <your chosen directory>
```

clone this repository to get a local copy on your machine
```
$ git clone https://github.com/marco-morales/QMSS-GR5069_Spring2024.git
```

pull every week before class to sync your local copy with the latest changes pushed to the repo
```
$ git pull origin main
```
"Watch" the repository to get notifications each time updates are pushed

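If you edit course files locally, a weekly pull can stop on conflicts with your uncommitted changes. The sketch below is one possible guard, not part of the course instructions: the helper name `sync_repo` is hypothetical. It refuses to pull while tracked files have uncommitted changes, and otherwise runs the same `git pull origin main` shown above.

```shell
#!/bin/sh
# Hypothetical helper: pull only when tracked files have no uncommitted changes.
# Usage: sync_repo <path to your local clone>   (defaults to the current directory)
sync_repo() {
    repo_dir="${1:-.}"
    # `git status --porcelain` prints nothing when there is nothing to report;
    # --untracked-files=no ignores files that git is not tracking (e.g. your own notes).
    if [ -n "$(git -C "$repo_dir" status --porcelain --untracked-files=no)" ]; then
        echo "uncommitted changes in $repo_dir; commit or stash them before pulling"
        return 1
    fi
    git -C "$repo_dir" pull origin main
}
```

For example, `sync_repo QMSS-GR5069_Spring2024` would sync a clone located in the current directory under that folder name.
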