The five stages of programming for data science:
- You use the REPL as a sophisticated calculator
- You realize that you are repeating many operations, so you write some functions
- To organize all your functions, you begin scripting
- You want to share your code with others, so you write a package
- Your package is actually used by others, so it should be optimized for good performance
Someone in stages 1, 2, or 3 does not need to learn Julia; whatever language they already use will serve their purposes. However, anyone in stage 4 or 5 should seriously consider Julia as an alternative for their data science computing needs.
In a nutshell, Julia offers two main advantages to data scientists:
- You avoid the two-language problem
- You can easily write performant code
Julia provides a set of tools (like those in a professional kitchen) that make it easy to write efficient code entirely in Julia.
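As a minimal sketch of what "performant code written entirely in Julia" can look like, the plain loop below needs no helper code in a second language. The function name `sum_of_squares` and the test vector are illustrative assumptions, not part of the tutorial material, and the measured time will vary by machine.

```julia
# Sum of squares written as an ordinary loop, entirely in Julia.
# `sum_of_squares` is a hypothetical name chosen for this illustration.
function sum_of_squares(xs::AbstractVector{<:Real})
    # Starting the accumulator at zero of the element type keeps the loop type-stable.
    total = zero(eltype(xs))
    for x in xs
        total += x * x
    end
    return total
end

xs = collect(1.0:1000.0)

# The first call compiles the method for Vector{Float64}.
result = sum_of_squares(xs)

# Later calls reuse the compiled code; @elapsed returns the run time in seconds.
elapsed = @elapsed sum_of_squares(xs)

println("sum of squares = ", result, " (", elapsed, " s)")
```

The loop gives the same answer as the built-in `sum(abs2, xs)`; the point is only that a hand-written loop is a reasonable thing to write in Julia.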
Among the main Julia tools, we will highlight five:
- Data tools:
  - Arrow.jl: memory layout for data frames in binary form. The binary form allows for cross-platform use (Julia, R, Python).
  - Tables.jl: generic idea of a data table; row oriented (vector of named tuples) or column oriented (named tuple of vectors)
  - DataFrames.jl: ideas similar to `tidyverse`; split-apply-combine
- Model fitting:
  - MixedModels.jl: 100% Julia package
- Communications with other systems:
- Package system
  - As of Julia 1.6, precompilation is done when the package is added
  - A local environment can be established and preserved with `Project.toml` and `Manifest.toml` files.
  - Use of `Artifacts.toml` allows for binary dependencies
- Tuning performance
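The two table layouts named under the data tools above can be sketched with Base Julia alone. The column names, the values, and the helpers `to_columns` and `to_rows` are hypothetical and written only for this illustration; Tables.jl itself offers `Tables.rowtable` and `Tables.columntable` for the same conversions.

```julia
# Row-oriented table: a vector of named tuples, one named tuple per row.
rowtable = [
    (id = 1, group = "a", y = 2.5),
    (id = 2, group = "b", y = 3.1),
    (id = 3, group = "a", y = 4.0),
]

# Column-oriented table: a named tuple of vectors, one vector per column.
columntable = (
    id = [1, 2, 3],
    group = ["a", "b", "a"],
    y = [2.5, 3.1, 4.0],
)

# Hypothetical helper: gather each field across all rows into a column vector.
function to_columns(rows)
    colnames = keys(first(rows))
    cols = Tuple([getfield(r, n) for r in rows] for n in colnames)
    return NamedTuple{colnames}(cols)
end

# Hypothetical helper: zip the column vectors back into one named tuple per row.
function to_rows(cols)
    colnames = keys(cols)
    return [NamedTuple{colnames}(vals) for vals in zip(values(cols)...)]
end

println(to_columns(rowtable) == columntable)
println(to_rows(columntable) == rowtable)
```

Both layouts hold the same information; the row form is convenient for iterating over observations, the column form for operating on whole variables.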
Disadvantages of Julia
- Lack of literate programming tools like `knitr`
  - Some packages exist (e.g. Literate.jl and Weave.jl) but they are not as well developed as `knitr`.
  - Notebook systems like Jupyter and Pluto.jl can be used. Pluto differs from Jupyter in that its notebooks are Julia scripts with structured comments, with no heavy metadata and no out-of-sequence evaluation
Other things worth mentioning for more savvy programmers:
- Julia's multiple dispatch: functions are generic and methods are selected by the types of all arguments; this differs from Java, C++, or Python, where methods belong to classes. Multiple dispatch allows for a good implementation of linear algebra
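A small sketch of the multiple dispatch idea, loosely in the spirit of linear algebra: the type `ScaledIdentity` and its field `scale` are hypothetical, invented for this example only. The methods are added to the existing generic function `*` rather than placed inside a class, and Julia picks the method from the types of both arguments.

```julia
# Hypothetical type standing for "scale times the identity matrix".
# The type only holds data; it does not own any methods.
struct ScaledIdentity
    scale::Float64
end

# Extend the generic function `*` with methods for specific argument type pairs.
# Operator applied to a vector: scale every element.
Base.:*(s::ScaledIdentity, v::AbstractVector) = s.scale .* v

# Operator applied to another operator: multiply the scale factors.
Base.:*(s::ScaledIdentity, t::ScaledIdentity) = ScaledIdentity(s.scale * t.scale)

two = ScaledIdentity(2.0)
three = ScaledIdentity(3.0)

scaled_vector = two * [1.0, 2.0]   # dispatches to the (ScaledIdentity, AbstractVector) method
product = two * three              # dispatches to the (ScaledIdentity, ScaledIdentity) method

println(scaled_vector)
println(product.scale)
```

Because the method is chosen from both argument types, a specialized type such as this one can skip building a full matrix while ordinary code that calls `*` keeps working unchanged.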
- Introduction (15 minutes by Claudia)
  - Getting started with Julia; highlight Julia's growth (table in email)
  - High level description of the main advantages
  - Illustration of DrWatson and potential for reproducible research; make attendees start a project to follow along the exercises of the tutorial
- Description of data tools (30 minutes by Doug)
  - Description of `Arrow.jl` and `Tables.jl`
  - Exercise with real data: checking consistency
- Description of `MixedModels` (30 minutes by Doug)
  - Exercise to illustrate main functionalities
- Brief illustration of other tools (30 minutes by Claudia)
  - Communication with other systems
  - Package system
  - Tuning performance
- Conclusions (15 minutes by Claudia)
  - Disadvantages: lack of `knitr`, others?
  - Resources/links for people who want to learn more