Use repl language tag for sample #1107

Merged: 26 commits, May 19, 2024
Commits (26):
a89faf7  Use repl language tag for sample (abhro, Apr 22, 2024)
8e45385  Update language tags for code samples (abhro, Apr 22, 2024)
fddc289  Follow blue style in docs/src/working_with_categorical_data.md (abhro, Apr 24, 2024)
3d6d15f  Update mlj_cheatsheet.md (abhro, Apr 29, 2024)
ae28151  Consistenly use @example in common_mlj_workflows.md (abhro, Apr 30, 2024)
9f274ad  Fix @example namespace in common workflows (abhro, May 3, 2024)
367db46  Break up predicting transformers into separate @example blocks (abhro, May 3, 2024)
f86b01b  Use @example instead of pre-built repl sample in learning_networks.md (abhro, May 3, 2024)
dc71382  Merge branch 'dev' into patch-1 (abhro, May 3, 2024)
ce4bce2  Merge branch 'dev' into patch-1 (abhro, May 11, 2024)
211bcf9  Do mechanical fixes of spacing, semicolons, and punc (abhro, May 15, 2024)
c7b5d3a  Fix indentation of markdown line (abhro, May 15, 2024)
925ec42  Move hidden example block to setup (abhro, May 15, 2024)
2a1202f  Pull code sample into list (abhro, May 15, 2024)
f8518f4  Use proper markdown lists (abhro, May 15, 2024)
ad9129b  Use example block for workflows (abhro, May 15, 2024)
da2e45a  Remove lambdas (abhro, May 15, 2024)
72f2be2  Use repl blocks for user defined models (abhro, May 15, 2024)
c24a96b  Use bigger fences for cheatsheet code (abhro, May 15, 2024)
331bac8  Promote headers in cheatsheet (abhro, May 15, 2024)
0acc876  Use Clustering.jl instead of ParallelKMeans (abhro, May 15, 2024)
18e9c9f  Remove unsupported use of info() from cheatsheet (abhro, May 15, 2024)
739ca21  Remove comments to have not as wide code lines (abhro, May 15, 2024)
f211322  Add description of data coercion in cheatsheet (abhro, May 15, 2024)
d079644  Update docs/src/mlj_cheatsheet.md (abhro, May 15, 2024)
650ebbd  Remove other occurence of `info` on measure (abhro, May 16, 2024)
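For readers skimming the diffs below, a minimal sketch of the convention these commits apply: plain `julia` fences hold copy-pasteable source, while REPL transcripts (lines with `julia>` prompts and printed output) get the `julia-repl` tag, so syntax highlighters do not try to parse prompts and output as code. The model name here is reused from the diffs purely as an illustration.

````markdown
<!-- copy-pasteable source: plain `julia` fence -->
```julia
using MLJ
Tree = @load DecisionTreeClassifier
```

<!-- REPL transcript with prompts and printed output: `julia-repl` fence -->
```julia-repl
julia> using MLJ

julia> Tree = @load DecisionTreeClassifier
```
````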
24 changes: 11 additions & 13 deletions docs/src/about_mlj.md
100755 → 100644
@@ -1,6 +1,6 @@
# About MLJ

-MLJ (Machine Learning in Julia) is a toolbox written in Julia
+MLJ (Machine Learning in Julia) is a toolbox written in Julia
providing a common interface and meta-algorithms for selecting,
tuning, evaluating, composing and comparing [over 180 machine learning
models](@ref model_list) written in Julia and other languages. In
@@ -22,8 +22,7 @@ The first code snippet below creates a new Julia environment
[Installation](@ref) for more on creating a Julia environment for use
with MLJ.

-Julia installation instructions are
-[here](https://julialang.org/downloads/).
+Julia installation instructions are [here](https://julialang.org/downloads/).

```julia
using Pkg
@@ -44,7 +43,7 @@ Loading and instantiating a gradient tree-boosting model:
using MLJ
Booster = @load EvoTreeRegressor # loads code defining a model type
booster = Booster(max_depth=2) # specify hyper-parameter at construction
-booster.nrounds=50 # or mutate afterwards
+booster.nrounds = 50 # or mutate afterwards
```

This model is an example of an iterative model. As it stands, the
@@ -92,7 +91,7 @@ it "self-tuning":
```julia
self_tuning_pipe = TunedModel(model=pipe,
tuning=RandomSearch(),
-ranges = max_depth_range,
+ranges=max_depth_range,
resampling=CV(nfolds=3, rng=456),
measure=l1,
acceleration=CPUThreads(),
@@ -105,12 +104,12 @@ Loading a selection of features and labels from the Ames
House Price dataset:

```julia
-X, y = @load_reduced_ames;
+X, y = @load_reduced_ames
```
Evaluating the "self-tuning" pipeline model's performance using 5-fold
cross-validation (implies multiple layers of nested resampling):

-```julia
+```julia-repl
julia> evaluate(self_tuning_pipe, X, y,
measures=[l1, l2],
resampling=CV(nfolds=5, rng=123),
@@ -155,8 +154,7 @@ Extract:

* Consistent interface to handle probabilistic predictions.

-* Extensible [tuning
-  interface](https://github.com/JuliaAI/MLJTuning.jl),
+* Extensible [tuning interface](https://github.com/JuliaAI/MLJTuning.jl),
to support a growing number of optimization strategies, and designed
to play well with model composition.

@@ -229,19 +227,19 @@ installed in a new
[environment](https://julialang.github.io/Pkg.jl/v1/environments/) to
avoid package conflicts. You can do this with

-```julia
+```julia-repl
julia> using Pkg; Pkg.activate("my_MLJ_env", shared=true)
```

Installing MLJ is also done with the package manager:

-```julia
+```julia-repl
julia> Pkg.add("MLJ")
```

**Optional:** To test your installation, run

-```julia
+```julia-repl
julia> Pkg.test("MLJ")
```

@@ -252,7 +250,7 @@ environment to make model-specific code available. This
happens automatically when you use MLJ's interactive load command
`@iload`, as in

-```julia
+```julia-repl
julia> Tree = @iload DecisionTreeClassifier # load type
julia> tree = Tree() # instance
```
2 changes: 1 addition & 1 deletion docs/src/adding_models_for_general_use.md
100755 → 100644
@@ -5,4 +5,4 @@ suitable for addition to the MLJ Model Registry, consult the [MLJModelInterface.
documentation](https://juliaai.github.io/MLJModelInterface.jl/dev/).

For quick-and-dirty user-defined models see [Simple User Defined
-Models](simple_user_defined_models.md).
+Models](simple_user_defined_models.md).
Empty file modified docs/src/api.md
100755 → 100644
57 changes: 27 additions & 30 deletions docs/src/common_mlj_workflows.md
@@ -23,31 +23,27 @@ MLJ_VERSION
## Data ingestion

```@setup workflows
-# to avoid RDatasets as a doc dependency:
+# to avoid RDatasets as a doc dependency, generate synthetic data with
+# similar parameters, with the first four rows mimicking the original dataset
+# for display purposes
color_off()
import DataFrames
-channing = (Sex = rand(["Male","Female"], 462),
-            Entry = rand(Int, 462),
-            Exit = rand(Int, 462),
-            Time = rand(Int, 462),
-            Cens = rand(Int, 462)) |> DataFrames.DataFrame
+channing = (Sex = [repeat(["Male"], 4)..., rand(["Male","Female"], 458)...],
+            Entry = Int32[782, 1020, 856, 915, rand(733:1140, 458)...],
+            Exit = Int32[909, 1128, 969, 957, rand(777:1207, 458)...],
+            Time = Int32[127, 108, 113, 42, rand(0:137, 458)...],
+            Cens = Int32[1, 1, 1, 1, rand(0:1, 458)...]) |> DataFrames.DataFrame
coerce!(channing, :Sex => Multiclass)
```


```julia
import RDatasets
channing = RDatasets.dataset("boot", "channing")
```

-julia> first(channing, 4)
-4×5 DataFrame
- Row │ Sex   Entry  Exit   Time   Cens
-     │ Cat…  Int32  Int32  Int32  Int32
-─────┼──────────────────────────────────
-   1 │ Male    782    909    127      1
-   2 │ Male   1020   1128    108      1
-   3 │ Male    856    969    113      1
-   4 │ Male    915    957     42      1
+```@example workflows
+first(channing, 4) |> pretty
+```

Inspecting metadata, including column scientific types:
@@ -61,17 +57,17 @@ Horizontally splitting data and shuffling rows.
Here `y` is the `:Exit` column and `X` a table with everything else:

```@example workflows
-y, X = unpack(channing, ==(:Exit), rng=123);
+y, X = unpack(channing, ==(:Exit), rng=123)
nothing # hide
```

Here `y` is the `:Exit` column and `X` everything else except `:Time`:

```@example workflows
-y, X = unpack(channing,
-              ==(:Exit),
-              !=(:Time);
-              rng=123);
+y, X = unpack(channing,
+              ==(:Exit),
+              !=(:Time);
+              rng=123);
scitype(y)
```

@@ -115,7 +111,7 @@ nothing # hide
Or, if already horizontally split:

```@example workflows
-(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.6, multi=true, rng=123)
+(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.6, multi=true, rng=123)
```


@@ -171,7 +167,7 @@ nothing # hide

## Instantiating a model

-*Reference:* [Getting Started](@ref), [Loading Model Code](@ref)
+*Reference:* [Getting Started](@ref), [Loading Model Code](@ref)

Assumes `MLJDecisionTreeClassifier` is in your environment. Otherwise, try interactive
loading with `@iload`:
@@ -183,7 +179,7 @@ tree = Tree(min_samples_split=5, max_depth=4)

or

-```@julia
+```julia
tree = (@load DecisionTreeClassifier)()
tree.min_samples_split = 5
tree.max_depth = 4
@@ -208,7 +204,7 @@ Do `measures()` to list all losses and scores and their aliases, or refer to the
StatisticalMeasures.jl [docs](https://juliaai.github.io/StatisticalMeasures.jl/dev/).


-## Basic fit/evaluate/predict by hand:
+## Basic fit/evaluate/predict by hand

*Reference:* [Getting Started](index.md), [Machines](machines.md),
[Evaluating Model Performance](evaluating_model_performance.md), [Performance Measures](performance_measures.md)
@@ -251,7 +247,7 @@ Note `LogLoss()` has aliases `log_loss` and `cross_entropy`.
Predict on the new data set:

```@example workflows
-Xnew = (FL = rand(3), RW = rand(3), CL = rand(3), CW = rand(3), BD =rand(3))
+Xnew = (FL = rand(3), RW = rand(3), CL = rand(3), CW = rand(3), BD = rand(3))
predict(mach, Xnew) # a vector of distributions
```

@@ -379,8 +375,8 @@ z = transform(mach, y);

*Reference:* [Tuning Models](tuning_models.md)

-```@example workflows
-X, y = @load_iris; nothing # hide
+```@setup workflows
+X, y = @load_iris
```

Define a model with nested hyperparameters:
@@ -502,7 +498,7 @@ Tree = @load DecisionTreeRegressor pkg=DecisionTree verbosity=0
tree_with_target = TransformedTargetModel(model=Tree(),
transformer=y -> log.(y),
inverse = z -> exp.(z))
-pipe2 = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> tree_with_target;
+pipe2 = (X -> coerce(X, :age=>Continuous)) |> OneHotEncoder() |> tree_with_target
nothing # hide
```

@@ -538,7 +534,8 @@ curve = learning_curve(mach,

```julia
using Plots
-plot(curve.parameter_values, curve.measurements, xlab=curve.parameter_name, xscale=curve.parameter_scale)
+plot(curve.parameter_values, curve.measurements,
+    xlab=curve.parameter_name, xscale=curve.parameter_scale)
```

![](img/workflows_learning_curve.png)
@@ -558,7 +555,7 @@ curve = learning_curve(mach,

```julia
plot(curve.parameter_values, curve.measurements,
-xlab=curve.parameter_name, xscale=curve.parameter_scale)
+xlab=curve.parameter_name, xscale=curve.parameter_scale)
```

![](img/workflows_learning_curves.png)
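Several commits above ("Move hidden example block to setup", "Fix @example namespace") lean on Documenter.jl's named blocks: an `@setup name` block runs at docs build time without being rendered, and every `@example name` (or `@repl name`) block sharing that name runs in the same sandbox module. A minimal sketch of the pattern, reusing the `workflows` name from this file:

````markdown
```@setup workflows
# hidden: executed at build time, but neither code nor output is rendered
X, y = @load_iris
```

```@example workflows
# rendered with its output; shares the `workflows` namespace with the setup block
first(X, 3)
```
````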
17 changes: 8 additions & 9 deletions docs/src/controlling_iterative_models.md
@@ -98,7 +98,7 @@ control | description
[`TimeLimit`](@ref EarlyStopping.TimeLimit)`(t=0.5)` | Stop after `t` hours | yes
[`NumberLimit`](@ref EarlyStopping.NumberLimit)`(n=100)` | Stop after `n` applications of the control | yes
[`NumberSinceBest`](@ref EarlyStopping.NumberSinceBest)`(n=6)` | Stop when best loss occurred `n` control applications ago | yes
-[`InvalidValue`](@ref IterationControl.InvalidValue)() | Stop when `NaN`, `Inf` or `-Inf` loss/training loss encountered | yes
+[`InvalidValue`](@ref IterationControl.InvalidValue)() | Stop when `NaN`, `Inf` or `-Inf` loss/training loss encountered | yes
[`Threshold`](@ref EarlyStopping.Threshold)`(value=0.0)` | Stop when `loss < value` | yes
[`GL`](@ref EarlyStopping.GL)`(alpha=2.0)` | † Stop after the "generalization loss (GL)" exceeds `alpha` | yes
[`PQ`](@ref EarlyStopping.PQ)`(alpha=0.75, k=5)` | † Stop after "progress-modified GL" exceeds `alpha` | yes
@@ -109,15 +109,15 @@ control | description
[`Error`](@ref IterationControl.Error)`(predicate; f="")` | Log to `Error` the value of `f` or `f(mach)`, if `predicate(mach)` holds and then stop | yes
[`Callback`](@ref IterationControl.Callback)`(f=mach->nothing)`| Call `f(mach)` | yes
[`WithNumberDo`](@ref IterationControl.WithNumberDo)`(f=n->@info(n))` | Call `f(n + 1)` where `n` is the number of complete control cycles so far | yes
-[`WithIterationsDo`](@ref MLJIteration.WithIterationsDo)`(f=i->@info("iterations: $i"))`| Call `f(i)`, where `i` is total number of iterations | yes
+[`WithIterationsDo`](@ref MLJIteration.WithIterationsDo)`(f=i->@info("iterations: $i"))` | Call `f(i)`, where `i` is total number of iterations | yes
[`WithLossDo`](@ref IterationControl.WithLossDo)`(f=x->@info("loss: $x"))` | Call `f(loss)` where `loss` is the current loss | yes
-[`WithTrainingLossesDo`](@ref IterationControl.WithTrainingLossesDo)`(f=v->@info(v))` | Call `f(v)` where `v` is the current batch of training losses | yes
-[`WithEvaluationDo`](@ref MLJIteration.WithEvaluationDo)`(f->e->@info("evaluation: $e))`| Call `f(e)` where `e` is the current performance evaluation object | yes
+[`WithTrainingLossesDo`](@ref IterationControl.WithTrainingLossesDo)`(f=v->@info(v))` | Call `f(v)` where `v` is the current batch of training losses | yes
+[`WithEvaluationDo`](@ref MLJIteration.WithEvaluationDo)`(f->e->@info("evaluation: $e))` | Call `f(e)` where `e` is the current performance evaluation object | yes
[`WithFittedParamsDo`](@ref MLJIteration.WithFittedParamsDo)`(f->fp->@info("fitted_params: $fp))`| Call `f(fp)` where `fp` is fitted parameters of training machine | yes
-[`WithReportDo`](@ref MLJIteration.WithReportDo)`(f->e->@info("report: $e))`| Call `f(r)` where `r` is the training machine report | yes
-[`WithModelDo`](@ref MLJIteration.WithModelDo)`(f->m->@info("model: $m))`| Call `f(m)` where `m` is the model, which may be mutated by `f` | yes
-[`WithMachineDo`](@ref MLJIteration.WithMachineDo)`(f->mach->@info("report: $mach))`| Call `f(mach)` wher `mach` is the training machine in its current state | yes
-[`Save`](@ref MLJIteration.Save)`(filename="machine.jls")`|Save current training machine to `machine1.jls`, `machine2.jsl`, etc | yes
+[`WithReportDo`](@ref MLJIteration.WithReportDo)`(f->e->@info("report: $e))`| Call `f(r)` where `r` is the training machine report | yes
+[`WithModelDo`](@ref MLJIteration.WithModelDo)`(f->m->@info("model: $m))`| Call `f(m)` where `m` is the model, which may be mutated by `f` | yes
+[`WithMachineDo`](@ref MLJIteration.WithMachineDo)`(f->mach->@info("report: $mach))`| Call `f(mach)` wher `mach` is the training machine in its current state | yes
+[`Save`](@ref MLJIteration.Save)`(filename="machine.jls")` | Save current training machine to `machine1.jls`, `machine2.jsl`, etc | yes

> Table 1. Atomic controls. Some advanced options are omitted.

@@ -253,7 +253,6 @@ In the code, `wrapper` is an object that wraps the training machine
in this example).

```julia
-
import IterationControl # or MLJ.IterationControl

struct IterateFromList
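The table edits in this file only touch spacing around the `|` separators. Markdown renderers collapse that whitespace, so the columns render identically either way; the alignment is purely for source readability. A trimmed sketch (rows abridged from the table above, third column omitted):

````markdown
control              | description
-------------------- | ------------------------------------
`TimeLimit(t=0.5)`   | Stop after `t` hours
`NumberLimit(n=100)` | Stop after `n` control applications
````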
15 changes: 7 additions & 8 deletions docs/src/evaluating_model_performance.md
@@ -27,7 +27,7 @@ using MLJ
X = (a=rand(12), b=rand(12), c=rand(12));
y = X.a + 2X.b + 0.05*rand(12);
model = (@load RidgeRegressor pkg=MultivariateStats verbosity=0)()
-cv=CV(nfolds=3)
+cv = CV(nfolds=3)
evaluate(model, X, y, resampling=cv, measure=l2, verbosity=0)
```

@@ -51,8 +51,8 @@ Multiple measures are specified as a vector:
evaluate!(
mach,
resampling=cv,
-measures=[l1, rms, rmslp1],
-verbosity=0,
+measures=[l1, rms, rmslp1],
+verbosity=0,
)
```

@@ -70,7 +70,7 @@ evaluate!(
mach,
resampling=CV(nfolds=3),
measure=[l2, rsquared],
-weights=weights,
+weights=weights,
)
```

@@ -91,12 +91,12 @@ fold1 = 1:6; fold2 = 7:12;
evaluate!(
mach,
resampling = [(fold1, fold2), (fold2, fold1)],
-measures=[l1, l2],
-verbosity=0,
+measures=[l1, l2],
+verbosity=0,
)
```

-Or the user can define their own re-usable `ResamplingStrategy` objects, - see [Custom
+Or the user can define their own re-usable `ResamplingStrategy` objects; see [Custom
resampling strategies](@ref) below.


@@ -170,4 +170,3 @@ function train_test_pairs(holdout::Holdout, rows)
return [(train, test),]
end
```
-
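Many of the spacing edits in this file follow the Blue style convention named in the commit list: spaces around `=` in ordinary assignments, but none around `=` in keyword arguments. A sketch of the distinction, echoing the `evaluate` call above:

````markdown
```julia
cv = CV(nfolds=3)          # assignment: spaces around `=`
evaluate(model, X, y,
    resampling=cv,         # keyword arguments: no spaces around `=`
    measure=l2)
```
````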
Empty file modified docs/src/frequently_asked_questions.md
100755 → 100644
12 changes: 6 additions & 6 deletions docs/src/getting_started.md
@@ -5,14 +5,14 @@ For an outline of MLJ's **goals** and **features**, see

This page introduces some MLJ basics, assuming some familiarity with
machine learning. For a complete list of other MLJ learning resources,
-see [Learning MLJ](@ref).
+see [Learning MLJ](@ref).

MLJ collects together the functionality provided by multiple packages. To learn how to
install components separately, run `using MLJ; @doc MLJ`.

This section introduces only the most basic MLJ operations and
concepts. It assumes MLJ has been successfully installed. See
-[Installation](@ref) if this is not the case.
+[Installation](@ref) if this is not the case.


```@setup doda
@@ -31,7 +31,7 @@ column vectors:
```@repl doda
using MLJ
iris = load_iris();
-selectrows(iris, 1:3) |> pretty
+selectrows(iris, 1:3) |> pretty
schema(iris)
```

@@ -114,8 +114,8 @@ computing the mode of each prediction):
```@repl doda
evaluate(tree, X, y,
resampling=CV(shuffle=true),
-measures=[log_loss, accuracy],
-verbosity=0)
+measures=[log_loss, accuracy],
+verbosity=0)
```

Under the hood, `evaluate` calls lower level functions `predict` or
@@ -260,7 +260,7 @@ evaluate!(mach, resampling=Holdout(fraction_train=0.7),
Changing a hyperparameter and re-evaluating:

```@repl doda
-tree.max_depth = 3
+tree.max_depth = 3;
evaluate!(mach, resampling=Holdout(fraction_train=0.7),
measures=[log_loss, accuracy],
verbosity=0)
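The semicolon added in the last hunk is not cosmetic: Documenter's `@repl` blocks render each expression like a REPL interaction, printing its result unless the line ends in `;`. A minimal sketch of the effect, reusing the `doda` block name from this file:

````markdown
```@repl doda
tree.max_depth = 3;    # trailing `;` suppresses printing the new value
tree.max_depth         # no semicolon: the value 3 appears in the built docs
```
````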