Essentials for Data Science (2023/2024)

A course in the Statistics and Data Science master's programme, Leiden University.

⚠️ ⚠️ ⚠️ Prepare your laptop as described in the installation section below.

Teachers

Overview

The course offers a practical introduction to a few programming languages and tools currently used in data science:

  • Python is a general-purpose, high-level and easy-to-learn programming language. It provides a large number of data science libraries (e.g. machine learning, neural networks, data manipulation, data visualization).
  • SQL is a standard language used to create, query, update and manage relational databases. For example, such databases are used to store large tables with results of experiments.
  • Git is a tool for tracking changes in files during program development. It is the current standard for collaborative code development.

During the course, students will write Python programs of growing complexity (from basic coding examples to fitting a machine learning model). After this course you will be able to program simple, reproducible data analyses (consisting of data reading, cleaning, simple modelling, and reporting steps). State-of-the-art Python libraries for data manipulation and visualization (pandas, Matplotlib) and for data science will be discussed.
Fundamentals of relational databases and of the SQL language will be presented in the context of an example database (SQLite). The database will be accessed through direct SQL statements and through a high-level object-oriented Python library (SQLAlchemy).
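
To give a flavour of the "direct SQL statements" route, here is a minimal sketch that uses the `sqlite3` module from the Python standard library. It is not taken from the course materials: the table `measurements`, its columns and the function name `mean_per_sample` are hypothetical, chosen only to illustrate creating, updating and querying a small table of experimental results.

```python
import sqlite3


def mean_per_sample():
    """Create a tiny in-memory database and return the mean value per sample."""
    # ":memory:" creates a temporary database that disappears on close.
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table holding results of an experiment.
        conn.execute(
            "CREATE TABLE measurements ("
            "id INTEGER PRIMARY KEY, sample TEXT NOT NULL, value REAL NOT NULL)"
        )
        # "?" placeholders let sqlite3 pass the values safely to SQL.
        conn.executemany(
            "INSERT INTO measurements (sample, value) VALUES (?, ?)",
            [("a", 1.5), ("a", 2.5), ("b", 4.0)],
        )
        # Correct one measurement after the fact.
        conn.execute(
            "UPDATE measurements SET value = ? WHERE sample = ?", (5.0, "b")
        )
        conn.commit()
        # Aggregate query: one row per sample with its mean value.
        cursor = conn.execute(
            "SELECT sample, AVG(value) FROM measurements "
            "GROUP BY sample ORDER BY sample"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for sample, mean_value in mean_per_sample():
        print(sample, mean_value)
```

The same table could also be reached through SQLAlchemy, which maps rows to Python objects instead of requiring handwritten SQL; that route is not shown here.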

First, you will work alone and practice code development. Later, you will practice shared code development in groups. You will be asked to use git to track changes in your code and to share your code with other students through GitHub.

Finally, the relevance of data stewardship and FAIR principles (Findable, Accessible, Interoperable, Reusable) will be discussed.

Course Objectives

During the course you will practice writing Python code. After the course you will be able to:

  • ✍️ use Python collections (list, tuple, set, dict)
  • ✍️ use Python flow control statements (if, for, while, exceptions), context managers (with) and define functions
  • 🚫 understand Python classes (instance variables, methods, inheritance)
  • ✍️ use Python standard libraries (reading/writing files in different formats; math, statistics, random)
  • ✍️ use common data science libraries (NumPy, pandas, Matplotlib)
  • 🚫 understand relational databases and use SQL to create, query, update a database
  • 🚫 understand basics of SQLAlchemy for Python object-oriented database access
  • ✍️ understand how to execute several machine learning algorithms
  • 🚫 use git and GitHub for individual and collaborative code development
  • 🚫 explain the relevance of data stewardship and FAIR principles for scientific research

Schedule

The schedule given below might change:

  • The primary source for lecture, exam and retake dates/locations is the schedule of the Essentials for Data Science course (4433EDASCY) at https://rooster.universiteitleiden.nl/. The dates on this page are manually copied and may lag behind.
  • The order/content of future lectures might be adjusted.
  • The dates of the assignments and the group assignment might be adjusted if the order of the lectures changes.

The schedule:

Grading

  • Components of the final grade:
    • Assignments A, B, C (each of weight 1; total weight 3):
      • Assignments A, B and C are separately graded.
      • The grade range is 1-10, but if the primary deadline is not met, the maximum grade is 8.
      • To pass the course, the rounded mean grade of Assignments A, B and C must be greater than 5.5.
      • The rounded mean grade of Assignments A, B and C has weight=3 in the final grade.
    • Group Assignment (weight 3):
      • The grade range is 1-10.
      • To pass the course, the group assignment rounded grade must be greater than 5.5.
      • The group assignment rounded grade has weight=3 in the final grade.
    • Data stewardship quiz:
      • To pass the course, the quiz must be completed with a PASS result.
      • The quiz grade is not part of the final grade formula.
    • Exam/Retake (weight 4):
      • The grade range is 1-10.
      • To pass the course, the exam/retake grade must be greater than 5.5.
      • The exam/retake grade has weight=4 in the final grade.
      • The exam will cover the course objectives marked with ✍️.
      • The exam will not cover the course objectives marked with 🚫 - these objectives are evaluated in the group assignment and the quiz.
  • Final grade:
    • The final grade is calculated as a weighted mean of the component grades.
    • The final grade is rounded to the nearest half integer.
    • To pass the course, the final grade must be greater than or equal to 6.0.
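
The weighted mean and the pass conditions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the official grading script: the function names are hypothetical, ties in the half-integer rounding are assumed to round up, and the component grades are assumed to be passed in already rounded, since the text does not say how they are rounded.

```python
import math

# Weights from the grading rules: assignments mean 3, group assignment 3, exam 4.
WEIGHTS = {"assignments": 3, "group": 3, "exam": 4}


def round_to_half(grade):
    """Round to the nearest half integer (assumption: ties round up)."""
    return math.floor(grade * 2 + 0.5) / 2


def final_grade(assignments, group, exam):
    """Weighted mean of the component grades, rounded to the nearest half."""
    total = (
        WEIGHTS["assignments"] * assignments
        + WEIGHTS["group"] * group
        + WEIGHTS["exam"] * exam
    )
    return round_to_half(total / sum(WEIGHTS.values()))


def passes(assignments, group, exam, quiz_passed):
    """Check every pass condition listed in the grading rules."""
    return (
        assignments > 5.5
        and group > 5.5
        and exam > 5.5
        and quiz_passed
        and final_grade(assignments, group, exam) >= 6.0
    )


if __name__ == "__main__":
    # Example: assignments mean 7.0, group assignment 8.0, exam 6.0.
    print(final_grade(7.0, 8.0, 6.0))
    print(passes(7.0, 8.0, 6.0, quiz_passed=True))
```

With the example grades, the weighted mean is (3·7.0 + 3·8.0 + 4·6.0) / 10 = 6.9, which rounds to 7.0.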

Installation

For the course you will need to bring a laptop with Python and a development environment properly installed.
Install (in the order listed below):

Moreover, you will need:

  • git: Free and open source distributed version control system. Follow the Downloads instructions provided at https://git-scm.com/. Additional GUI (graphical) clients will not be used during the course but might be useful.