Essentials for Data Science (2023/2024)

A course in the Statistics and Data Science master's programme, Leiden University.

⚠️ ⚠️ ⚠️ Prepare your laptop as described in the installation section below.

Teachers

Szymon M. Kiełbasa [LUMC/BDS], coordinator, smkielbasa@lumc.nl
Ramin Monajemi [LUMC/BDS]
Mo Arkani [LUMC/BDS]

Overview

The course offers a practical introduction to a few programming languages and tools currently used in data science:

Python is a general-purpose, high-level, easy-to-learn programming language. It provides a large number of data science libraries (e.g. machine learning, neural networks, data manipulation, data visualization).
SQL is a standard language used to create, query, update and manage relational databases. For example, such databases are used to store large tables with results of experiments.
Git is a tool for tracking changes in files during program development. It is the current standard for collaborative code development.

During the course you will write Python programs of growing complexity (from basic coding examples to fitting a machine learning model). After this course you will be able to program simple reproducible data analyses (consisting of data reading, cleaning, simple modelling, and reporting steps). State-of-the-art Python libraries for data manipulation and visualization (pandas, Matplotlib) and for data science will be discussed.
The fundamentals of relational databases and of the SQL language will be presented in the context of an example database (SQLite). The database will be accessed through direct SQL statements and through a high-level object-oriented Python library (SQLAlchemy).

First, you will work alone and practice code development. Later, you will practice shared code development in groups. You will be asked to use Git to track changes in your code and to share it with other students through GitHub.

Finally, the relevance of data stewardship and FAIR principles (Findable, Accessible, Interoperable, Reusable) will be discussed.

Course Objectives

During the course you will practice writing Python code. After the course you will be able to:

✍️ use Python collections (list, tuple, set, dict)
✍️ use Python flow control statements (if, for, while, exceptions), context managers (with) and define functions
🚫 understand Python classes (instance variables, methods, inheritance)
✍️ use Python standard libraries (reading/writing files in different formats; math, statistics, random)
✍️ use common data science libraries (NumPy, pandas, Matplotlib)
🚫 understand relational databases and use SQL to create, query, update a database
🚫 understand basics of SQLAlchemy for Python object-oriented database access
✍️ understand how to execute several machine learning algorithms
🚫 use git and GitHub for individual and collaborative code development
🚫 explain the relevance of data stewardship and FAIR principles for scientific research

Schedule

The schedule given below might change:

The primary source for lecture, exam and retake dates/locations is Essentials for Data Science course 4433EDASCY schedule at https://rooster.universiteitleiden.nl/. The dates on this page are manually copied and may lag behind.
The order/content of the future lectures might be adjusted.
The dates of the assignments and the group assignment might be adjusted if order of the lectures changes.

The schedule:

(01) Feb. 6th, 2024:
- General course introduction
- Python notebooks
- Python basics
- Python lists and tuples
- Memory organization
- Git/GitHub introduction
(02) Feb. 13th:
- Python sets and dictionaries
- Git/GitHub practice
(03) Feb. 20th:
- Python flow control and user functions
- 📙 Assignment A: start
(04) Feb. 27th:
- Python object oriented programming
- Git/GitHub assignment preparation
(05) Mar. 5th:
- Python standard libraries and scripts
- 📗 Assignment B: start
(06) Mar. 12th:
- Data manipulation: NumPy [Exercises]
- 📙 Assignment A: primary deadline (end-of-day)
(07) Mar. 19th:
- Data manipulation: pandas [Exercises]
(08) Apr. 2nd:
- Data visualisation [Exercises]
- 📗 Assignment B: primary deadline (end-of-day)
- 📘 Assignment C: start
(09) Apr. 9th:
- Relational databases:
- SQL language:
  - Downloading and connecting to the example database
  - Querying and selecting data (SELECT, LIMIT, AS, ORDER, DISTINCT, WHERE, IN, BETWEEN, LIKE) [Exercises]
  - Grouping and summarising (GROUP BY, HAVING, COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT) [Exercises]
(10) Apr. 23rd:
- Relational databases:
- SQL language:
  - Modification statements (UPDATE, INSERT, DELETE) [Exercises]
  - Data definition language (CREATE TABLE, DROP TABLE)
  - Joining tables 1 (INNER JOIN, LEFT JOIN, CREATE TEMP TABLE) [Exercises]
  - Joining tables 2 (UNION, EXCEPT, INTERSECT, self joins, CROSS JOIN, subqueries, EXISTS) [Exercises]
- 📚 Group Assignment: start
(11) Apr. 30th:
- Python SQL Toolkit and Object Relational Mapper (SQLAlchemy)
- 📘 Assignment C: primary deadline (end-of-day)
(12) May 7th:
- Git branching and merging
- General Q&A and group assignment Q&A, programming practice
(13) May 14th:
- Machine learning libraries (examples)
  - scikit-learn
  - Keras
(14) May 21st:
- FAIR [presentation] & data stewardship
- 📝 Data stewardship quiz: start
(--) June 7th:
- 📚 Group Assignment: deadline (end-of-day)
(--) June 11th:
- 🏢 Exam
(--) June 18th:
- 📙 📗 📘 Assignments A, B, C: secondary deadline (end-of-day)
- 📝 Data stewardship quiz: deadline (end-of-day)
(--) July 2nd:
- 🏢 Retake

Grading

Components of the final grade:
- Assignments A, B, C (each of weight 1; total weight 3):
  - Assignments A, B and C are separately graded.
  - The grade range is 1-10; if the primary deadline is missed, the maximum grade is 8.
  - To pass the course, the rounded mean grade of Assignments A, B and C must be greater than 5.5.
  - The rounded mean grade of Assignments A, B and C has weight=3 in the final grade.
- Group Assignment (weight 3):
  - The grade range is 1-10.
  - To pass the course, the group assignment rounded grade must be greater than 5.5.
  - The group assignment rounded grade has weight=3 in the final grade.
- Data stewardship quiz:
  - To pass the course, the quiz must be completed with a PASS result.
  - The quiz grade is not part of the final grade formula.
- Exam/Retake (weight 4):
  - The grade range is 1-10.
  - To pass the course, the exam/retake grade must be greater than 5.5.
  - The exam/retake grade has weight=4 in the final grade.
  - The exam will cover the course objectives marked with ✍️.
  - The exam will not cover the course objectives marked with 🚫 - these objectives are evaluated in the group assignment and the quiz.
Final grade:
- The final grade is calculated as a weighted mean of the component grades.
- The final grade is rounded to the nearest half integer.
- To pass the course, the final grade must be greater than or equal to 6.0.
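
The rules above can be illustrated with a short Python sketch. It is not an official grade calculator: the function names are hypothetical, the `assignments` and `group` inputs are assumed to be already rounded as described above (the rounding precision of these component grades is not stated here), and exact ties in the final rounding are assumed to round upwards.

```python
import math

# Weights of the graded components, as listed in the grading rules.
WEIGHTS = {"assignments": 3, "group": 3, "exam": 4}


def cap_late_grade(grade: float, met_primary_deadline: bool) -> float:
    """Limit an assignment grade to 8 when the primary deadline was missed."""
    return grade if met_primary_deadline else min(grade, 8.0)


def final_grade(assignments: float, group: float, exam: float) -> float:
    """Weighted mean of the component grades, rounded to the nearest half."""
    total = (
        WEIGHTS["assignments"] * assignments
        + WEIGHTS["group"] * group
        + WEIGHTS["exam"] * exam
    )
    mean = total / sum(WEIGHTS.values())
    # Assumption: exact ties (x.25, x.75) are rounded upwards.
    return math.floor(mean * 2 + 0.5) / 2


def passes(assignments: float, group: float, exam: float, quiz_passed: bool) -> bool:
    """Check all pass conditions: each component, the quiz and the final grade."""
    return (
        assignments > 5.5
        and group > 5.5
        and exam > 5.5
        and quiz_passed
        and final_grade(assignments, group, exam) >= 6.0
    )


if __name__ == "__main__":
    # (3 * 7.0 + 3 * 8.0 + 4 * 6.5) / 10 = 7.1, which rounds to 7.0
    print(final_grade(7.0, 8.0, 6.5))
    print(passes(7.0, 8.0, 6.5, quiz_passed=True))
```

Note that a high weighted mean is not sufficient on its own: a component grade of 5.5 or lower, or a quiz without a PASS result, fails the course regardless of the final grade.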

Installation

For the course you will need to bring a laptop with properly installed Python and a development environment.
Install (in the order listed below):

Python (version >= 3.9.?, optimally >= 3.12.?): Follow the download instructions at https://www.python.org/.
pip: The Python Package Installer. It should already be installed during Python installation. If that is not the case, follow https://pip.pypa.io/en/stable/installation/.
Visual Studio Code: A free source-code editor made by Microsoft for Windows, Linux and macOS. Follow the instructions at https://code.visualstudio.com/.

Moreover, you will need:

git: Free and open source distributed version control system. Follow the Downloads instructions provided at https://git-scm.com/. Additional GUI (graphical) clients will not be used during the course but might be useful.
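
After installation, a small script along the following lines might help to confirm that the laptop is ready. This is only a sketch: the function names are hypothetical, the minimum version (3.9) is taken from the list above, and the check for the `code` command assumes that Visual Studio Code has added its command-line launcher to the PATH, which is not always the case (e.g. on macOS).

```python
import importlib.util
import shutil
import sys

# Minimum Python version required for the course (see the list above).
MIN_PYTHON = (3, 9)


def python_ok(version=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is recent enough."""
    return tuple(version[:2]) >= MIN_PYTHON


def pip_available() -> bool:
    """Return True if the pip package can be found by this interpreter."""
    return importlib.util.find_spec("pip") is not None


def tool_on_path(name: str) -> bool:
    """Return True if an executable with this name is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0], "OK" if python_ok() else "too old")
    print("pip available:", pip_available())
    print("git on PATH:", tool_on_path("git"))
    # Assumption: the VS Code launcher is named "code" and is on the PATH.
    print("code on PATH:", tool_on_path("code"))
```

Save the script under any name (for example `check_setup.py`, a hypothetical file name) and run it with `python check_setup.py` from a terminal.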
