Main topics for the Julia for Data Science workshop

The five stages of programming for data science:

1. Use the REPL as a sophisticated calculator
2. Realize that you are repeating many operations, so you decide to write some functions
3. To organize all your functions, you begin scripting
4. You want to share your code with others, and thus you want to write a package
5. Your package is actually used by others, and thus it should be optimized and have good performance
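
The step from stage 1 to stage 2 can be sketched as follows. The data values and the function name `standardize` are hypothetical, chosen only for illustration.

```julia
using Statistics  # standard library: mean, std

# Stage 1: the same calculation typed at the REPL for each new vector
x = [2.0, 4.0, 6.0]
z = (x .- mean(x)) ./ std(x)

# Stage 2: the repeated operation becomes a function (hypothetical name)
standardize(v) = (v .- mean(v)) ./ std(v)

standardize(x) == z            # same result as the REPL calculation
standardize([1.0, 5.0, 9.0])   # reusable on any other vector
```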

Someone in stage 1, 2 or 3 does not need to learn Julia; whatever language they already use will serve their purposes. However, anyone in stage 4 or 5 should seriously look into Julia as an alternative for their data science computing needs.

In a nutshell, Julia offers two main advantages to data scientists:

You avoid the two-language problem
You can easily write performant code

Julia provides a set of tools (like in a professional kitchen) that make it easy to write efficient code entirely in Julia.
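
As a rough illustration of both points, a plain loop written in Julia is compiled to native code, so a hot loop does not have to be moved to a second language such as C. The function name `sumsq` is hypothetical, and this is a sketch rather than a benchmark.

```julia
# Sum of squares with an explicit loop (hypothetical example function).
# zero(eltype(v)) gives the accumulator the same type as the elements,
# so the compiler can specialize the loop for each element type.
function sumsq(v::AbstractVector{<:Number})
    s = zero(eltype(v))
    for x in v
        s += x * x
    end
    return s
end

sumsq([1.0, 2.0, 3.0])   # Float64 input
sumsq([1, 2, 3])         # Int input, same source code
```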

Among the main Julia tools, we will highlight five:

Data tools:
- Arrow.jl: Julia implementation of the Apache Arrow format, which specifies a columnar memory layout for data frames and a binary form. The binary form allows for cross-platform use (Julia, R, Python).
- Tables.jl: a generic interface for data tables, which can be row-oriented (e.g. a vector of named tuples) or column-oriented (e.g. a named tuple of vectors)
- DataFrames.jl: ideas similar to the tidyverse, such as split-apply-combine
Model fitting:
- MixedModels.jl: 100% Julia package
Communications with other systems:
- RCall.jl: 100% Julia package
- PyCall.jl: 100% Julia package
Package system
- As of Julia 1.6, a package is precompiled when it is added
- A local environment can be established and preserved with Project.toml and Manifest.toml files
- An Artifacts.toml file allows a package to declare binary dependencies
Tuning performance
- Profiling
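
The two table layouts mentioned under Tables.jl can be sketched with Base Julia alone, without loading any package. The column names `id` and `score` and the values are made up for illustration.

```julia
# Row-oriented: a vector of named tuples, one element per row
rows = [(id = 1, score = 2.5), (id = 2, score = 3.0), (id = 3, score = 4.5)]

# Column-oriented: a named tuple of vectors, one vector per column
cols = (id = [1, 2, 3], score = [2.5, 3.0, 4.5])

# The same cell, reached in each layout
rows[2].score      # pick the second row, then the field
cols.score[2]      # pick the column, then the second element

# Converting rows to columns by collecting each field
cols_from_rows = (id = [r.id for r in rows], score = [r.score for r in rows])
cols_from_rows == cols
```

Both layouts hold the same information; which one is more convenient depends on whether the code works a row at a time or a column at a time.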

Disadvantages of Julia

Lack of literate programming tools like knitr
- Some packages exist (e.g. Literate.jl and Weave.jl) but they are not as well developed as knitr.
- Notebook systems like Jupyter and Pluto.jl can be used. Pluto differs from Jupyter in that its notebooks are plain Julia scripts with structured comments: no heavy metadata and no out-of-sequence evaluation.

Other things worth mentioning for more savvy programmers:

Julia's multiple dispatch: functions are generic, and the method that runs is chosen from the types of all the arguments. This is different from Java, C++, or Python, where methods are part of classes. Multiple dispatch allows for a good implementation of linear algebra.
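
A minimal sketch of multiple dispatch follows. The types `Shape`, `Circle` and `Square` and the function `overlap_kind` are hypothetical, introduced only to show how the method is selected.

```julia
abstract type Shape end          # hypothetical types for illustration

struct Circle <: Shape
    r::Float64
end

struct Square <: Shape
    side::Float64
end

# `overlap_kind` is one generic function; each definition below is a method.
# The method is chosen from the types of *both* arguments, not just the first.
overlap_kind(a::Circle, b::Circle) = "circle-circle"
overlap_kind(a::Circle, b::Square) = "circle-square"
overlap_kind(a::Square, b::Circle) = overlap_kind(b, a)   # reuse by swapping
overlap_kind(a::Shape,  b::Shape)  = "generic fallback"

overlap_kind(Circle(1.0), Circle(2.0))
overlap_kind(Square(1.0), Circle(2.0))
overlap_kind(Square(1.0), Square(2.0))
```

The methods live outside the types, so new combinations can be added later without editing `Circle` or `Square`. The same mechanism lets an operator such as `*` have specialized methods for particular pairs of matrix types.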

Potential structure

Introduction (15 minutes by Claudia)
- Getting started with Julia; highlight Julia's growth (table in email)
- High level description of the main advantages
- Illustration of DrWatson and its potential for reproducible research; have attendees start a project so they can follow along with the exercises of the tutorial
Description of data tools (30 minutes by Doug)
- Description of Arrow.jl and Tables.jl
- Exercise with real data: checking consistency
Description of MixedModels (30 minutes by Doug)
- Exercise to illustrate main functionalities
Brief illustration of other tools (30 minutes by Claudia)
- Communication with other systems
- Package system
- Tuning performance
Conclusions (15 minutes by Claudia)
- Disadvantages: lack of knitr, others?
- Resources/links for people who want to learn more
