<!DOCTYPE HTML>
<html lang="en" class="sidebar-visible no-js light">
<head>
<!-- Book generated using mdBook -->
<meta charset="UTF-8">
<title>distribute-docs</title>
<meta name="robots" content="noindex" />
<!-- Custom HTML head -->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#ffffff" />
<link rel="icon" href="favicon.svg">
<link rel="shortcut icon" href="favicon.png">
<link rel="stylesheet" href="css/variables.css">
<link rel="stylesheet" href="css/general.css">
<link rel="stylesheet" href="css/chrome.css">
<link rel="stylesheet" href="css/print.css" media="print">
<!-- Fonts -->
<link rel="stylesheet" href="FontAwesome/css/font-awesome.css">
<link rel="stylesheet" href="fonts/fonts.css">
<!-- Highlight.js Stylesheets -->
<link rel="stylesheet" href="highlight.css">
<link rel="stylesheet" href="tomorrow-night.css">
<link rel="stylesheet" href="ayu-highlight.css">
<!-- Custom theme stylesheets -->
</head>
<body>
<!-- Provide site root to javascript -->
<script type="text/javascript">
var path_to_root = "";
var default_theme = window.matchMedia("(prefers-color-scheme: dark)").matches ? "navy" : "light";
</script>
<!-- Work around some values being stored in localStorage wrapped in quotes -->
<script type="text/javascript">
try {
var theme = localStorage.getItem('mdbook-theme');
var sidebar = localStorage.getItem('mdbook-sidebar');
if (theme.startsWith('"') && theme.endsWith('"')) {
localStorage.setItem('mdbook-theme', theme.slice(1, theme.length - 1));
}
if (sidebar.startsWith('"') && sidebar.endsWith('"')) {
localStorage.setItem('mdbook-sidebar', sidebar.slice(1, sidebar.length - 1));
}
} catch (e) { }
</script>
<!-- Set the theme before any content is loaded, prevents flash -->
<script type="text/javascript">
var theme;
try { theme = localStorage.getItem('mdbook-theme'); } catch(e) { }
if (theme === null || theme === undefined) { theme = default_theme; }
var html = document.querySelector('html');
html.classList.remove('no-js')
html.classList.remove('light')
html.classList.add(theme);
html.classList.add('js');
</script>
<!-- Hide / unhide sidebar before it is displayed -->
<script type="text/javascript">
var html = document.querySelector('html');
var sidebar = 'hidden';
if (document.body.clientWidth >= 1080) {
try { sidebar = localStorage.getItem('mdbook-sidebar'); } catch(e) { }
sidebar = sidebar || 'visible';
}
html.classList.remove('sidebar-visible');
html.classList.add("sidebar-" + sidebar);
</script>
<nav id="sidebar" class="sidebar" aria-label="Table of contents">
<div class="sidebar-scrollbox">
<ol class="chapter"><li class="chapter-item expanded "><a href="introduction.html"><strong aria-hidden="true">1.</strong> Introduction</a></li><li class="chapter-item expanded "><a href="install.html"><strong aria-hidden="true">2.</strong> Installation</a></li><li class="chapter-item expanded "><a href="commands.html"><strong aria-hidden="true">3.</strong> Command Basics</a></li><li class="chapter-item expanded "><a href="configuration.html"><strong aria-hidden="true">4.</strong> Configuration</a></li><li><ol class="section"><li class="chapter-item expanded "><a href="python.html"><strong aria-hidden="true">4.1.</strong> Python Jobs</a></li><li class="chapter-item expanded "><a href="apptainer.html"><strong aria-hidden="true">4.2.</strong> Apptainer Jobs</a></li></ol></li><li class="chapter-item expanded "><a href="python_api.html"><strong aria-hidden="true">5.</strong> Python Api</a></li><li class="chapter-item expanded "><a href="capabilities.html"><strong aria-hidden="true">6.</strong> Available Capabilities</a></li><li class="chapter-item expanded "><a href="machines.html"><strong aria-hidden="true">7.</strong> Machines</a></li></ol> </div>
<div id="sidebar-resize-handle" class="sidebar-resize-handle"></div>
</nav>
<div id="page-wrapper" class="page-wrapper">
<div class="page">
<div id="menu-bar-hover-placeholder"></div>
<div id="menu-bar" class="menu-bar sticky bordered">
<div class="left-buttons">
<button id="sidebar-toggle" class="icon-button" type="button" title="Toggle Table of Contents" aria-label="Toggle Table of Contents" aria-controls="sidebar">
<i class="fa fa-bars"></i>
</button>
<button id="theme-toggle" class="icon-button" type="button" title="Change theme" aria-label="Change theme" aria-haspopup="true" aria-expanded="false" aria-controls="theme-list">
<i class="fa fa-paint-brush"></i>
</button>
<ul id="theme-list" class="theme-popup" aria-label="Themes" role="menu">
<li role="none"><button role="menuitem" class="theme" id="light">Light (default)</button></li>
<li role="none"><button role="menuitem" class="theme" id="rust">Rust</button></li>
<li role="none"><button role="menuitem" class="theme" id="coal">Coal</button></li>
<li role="none"><button role="menuitem" class="theme" id="navy">Navy</button></li>
<li role="none"><button role="menuitem" class="theme" id="ayu">Ayu</button></li>
</ul>
<button id="search-toggle" class="icon-button" type="button" title="Search. (Shortkey: s)" aria-label="Toggle Searchbar" aria-expanded="false" aria-keyshortcuts="S" aria-controls="searchbar">
<i class="fa fa-search"></i>
</button>
</div>
<h1 class="menu-title">distribute-docs</h1>
<div class="right-buttons">
<a href="print.html" title="Print this book" aria-label="Print this book">
<i id="print-button" class="fa fa-print"></i>
</a>
</div>
</div>
<div id="search-wrapper" class="hidden">
<form id="searchbar-outer" class="searchbar-outer">
<input type="search" id="searchbar" name="searchbar" placeholder="Search this book ..." aria-controls="searchresults-outer" aria-describedby="searchresults-header">
</form>
<div id="searchresults-outer" class="searchresults-outer hidden">
<div id="searchresults-header" class="searchresults-header"></div>
<ul id="searchresults">
</ul>
</div>
</div>
<!-- Apply ARIA attributes after the sidebar and the sidebar toggle button are added to the DOM -->
<script type="text/javascript">
document.getElementById('sidebar-toggle').setAttribute('aria-expanded', sidebar === 'visible');
document.getElementById('sidebar').setAttribute('aria-hidden', sidebar !== 'visible');
Array.from(document.querySelectorAll('#sidebar a')).forEach(function(link) {
link.setAttribute('tabIndex', sidebar === 'visible' ? 0 : -1);
});
</script>
<div id="content" class="content">
<main>
<h1 id="distribute"><a class="header" href="#distribute">distribute</a></h1>
<p><code>distribute</code> is a relatively simple command line utility for distributing compute jobs across the powerful
lab computers. In essence, <code>distribute</code> provides a simple way to automatically schedule dozens of jobs
from different people across the small number of powerful computers in the lab. </p>
<p>Besides its configuration files being easier to use, <code>distribute</code> also contains a mechanism for
scheduling your jobs only on nodes that meet your criteria. If you require OpenFoam to run your simulation,
<code>distribute</code> automatically knows which of the three computers it can run the job on. This lets you
state the requirements of your tasks explicitly, and lets us reserve the gpu machine for jobs that actually
require a gpu, increasing the overall throughput of jobs for all lab
members.</p>
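<p>The capability matching described above amounts to a simple set check. The sketch below is only an
illustration - the node names and capability lists are made up, and <code>distribute</code>'s actual scheduler
is more involved:</p>
<pre><code class="language-python">def eligible_nodes(required, nodes):
    """Return the names of nodes whose capabilities cover every requirement."""
    required = set(required)
    return [name for name, caps in nodes.items() if required.issubset(caps)]

# hypothetical lab machines and their advertised capabilities
nodes = {
    "node-1": {"gfortran", "python3"},
    "node-2": {"gfortran", "python3", "openfoam"},
    "node-3": {"gfortran", "python3", "gpu"},
}

eligible_nodes(["openfoam"], nodes)  # only node-2 qualifies
</code></pre>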
<p>Another cool feature of <code>distribute</code> is that files that are not needed after each compute run are automatically
wiped from the hard drive, preserving limited disk space on the compute machines. Files that you specify to be
saved are automatically archived on a 24 TB storage machine, and can be retrieved (and filtered)
to your personal computer with a single short command.</p>
<p><code>distribute</code> competes in the same space as <a href="https://slurm.schedmd.com/overview.html">slurm</a>, which you would
likely find on an actual compute cluster. The benefit of <code>distribute</code> is that it is an all-in-one solution for running,
archiving, and scheduling jobs with a single streamlined utility, without wading through the complexities
of the (very detailed) slurm documentation. If you are still unconvinced, take a look at the overall architecture
diagram that slurm provides:</p>
<p><img src="https://slurm.schedmd.com/arch.gif" alt="" /></p>
<p>Since the lab computers also function as day-to-day workstations for some lab members, some additional
features are required to ensure that they are functional outside of running jobs. <code>distribute</code> solves this issue
by allowing a user that is sitting at a computer to temporarily pause the currently executing job so that
they may perform some simple work. This allows lab members to still quickly iterate on ideas without waiting
hours for their jobs to reach the front of the queue. Since cluster computers are <em>never</em> used as
day-to-day workstations, popular compute schedulers like slurm don't provision for this.</p>
<h2 id="architecture"><a class="header" href="#architecture">Architecture</a></h2>
<p>Instead of complex scheduling algorithms and job queues, we can distill the overall architecture of the
system to a simple diagram:</p>
<p><img src="https://i.imgur.com/e4YnOQG.png" alt="" /></p>
<p>In summary, there is a very simple flow of information from the server to the nodes, and from the nodes to
the server. The server is charged with sending the nodes any user-specified files (such as initial conditions,
solver input files, or CSVs of information) as well as instructions on how to compile and run the project.
Once the job has finished, the user's script will move any and all files that they wish to archive to
a special directory. All files in the special directory will be transferred to the server and saved
indefinitely. </p>
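<p>The end of a job script typically just moves whatever is worth keeping into that special directory. A
minimal sketch of the idea - the function name and glob patterns here are placeholders, not conventions that
<code>distribute</code> prescribes:</p>
<pre><code class="language-python">import shutil
from pathlib import Path

def archive_results(work_dir, save_dir, patterns):
    """Move files matching any glob pattern into the directory whose
    contents are transferred back to the server."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for pattern in patterns:
        for path in sorted(Path(work_dir).glob(pattern)):
            shutil.move(str(path), str(save_dir / path.name))
            moved.append(path.name)
    return moved
</code></pre>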
<p>The archiving structure of <code>distribute</code> helps free up disk space on your laptop or workstation, instead
keeping large files (that will surely be useful at a later date) stored away on a purpose-built machine.
As long as you are connected to the university network - VPN or otherwise - you can access the
files dumped by your compute job at any time.</p>
<h2 id="specifying-jobs"><a class="header" href="#specifying-jobs">Specifying Jobs</a></h2>
<p>We have thus far talked about all the cool things we can do with <code>distribute</code>, but none of this is free. As
a famous Italian engineer once said, "There's no such thing as a free lunch." The largest complexity of working
with <code>distribute</code> is the configuration file that specifies how to compile and run your project. <code>distribute template python</code>
will generate the following file:</p>
<pre><code class="language-yaml">meta:
batch_name: your_jobset_name
namespace: example_namespace
matrix: ~
capabilities:
- gfortran
- python3
- apptainer
python:
initialize:
build_file: /path/to/build.py
jobs:
- name: job_1
file: execute_job.py
- name: job_2
file: execute_job_2.py
</code></pre>
<p>We will explain all of these fields later, but suffice it to say that configuration files come in 3 main sections.
The <code>meta</code> section will describe things that the head node must do, including what "capabilities" each node is required
to have to run your jobs, a <code>batch_name</code> and <code>namespace</code> so that your compute results do not overwrite someone else's,
and a <code>matrix</code> field so that you can specify an optional matrix username that will be pinged once all your
jobs have finished.</p>
<p>The next section is the <code>initialize</code> section. This section specifies all the files and instructions that are required
to compile your project before it is run. This step is kept separate from the running step so that we can ensure
that your project is compiled only once before being run with different jobs in the third section.</p>
<p>The third section tells <code>distribute</code> <em>how</em> to execute each job. If you are using a python configuration, your
<code>file</code> parameter will likely seek out the compiled binary from the second step and run it using whatever
files you have chosen to make available.</p>
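<p>For a concrete picture, an <code>execute_job.py</code> along these lines might do nothing more than launch the
binary produced by the build step. The paths in the comment are hypothetical - where your binary and inputs
actually live depends entirely on your own build script:</p>
<pre><code class="language-python"># execute_job.py - illustrative sketch; the binary and input paths are
# assumptions, not conventions that distribute prescribes
import subprocess
import sys

def run_solver(binary, args):
    """Launch the compiled binary, propagating a nonzero exit code so a
    failed run is recorded as a failed job."""
    result = subprocess.run([binary, *args])
    if result.returncode != 0:
        sys.exit(result.returncode)
    return result.returncode

# in the job script itself you would call something like:
# run_solver("./build/solver", ["./input/input.json"])
</code></pre>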
<p>The specifics of the configuration file will be discussed in greater detail in a later section.</p>
<div style="break-before: page; page-break-before: always;"></div><h1 id="installation"><a class="header" href="#installation">Installation</a></h1>
<p>In order to install <code>distribute</code> you must have a recent version of <code>rustc</code> and <code>cargo</code>.
Install instructions can be found <a href="https://www.rust-lang.org/tools/install">here</a>. </p>
<p>Once you have it (and running <code>cargo</code> shows some output), you can install the project with </p>
<pre><code>cargo install --git https://github.com/fluid-Dynamics-Group/distribute --force
</code></pre>
<p>and you are good to go! If you run into any trouble installing, let Brooks know.</p>
<h2 id="python-api-install"><a class="header" href="#python-api-install">Python api install</a></h2>
<pre><code>pip3 install distribute_compute_config
</code></pre>
<div style="break-before: page; page-break-before: always;"></div><h1 id="command-basics"><a class="header" href="#command-basics">Command Basics</a></h1>
<p>There are a few commands that you will need to know to effectively work with <code>distribute</code>. Don't worry,
they are not too complex. The full list of commands and their specific parameters can be found by running</p>
<pre><code class="language-bash">$ distribute
</code></pre>
<p>At the time of writing, this yields:</p>
<pre><code>distribute 0.9.4
A utility for scheduling jobs on a cluster
USAGE:
distribute [FLAGS] <SUBCOMMAND>
FLAGS:
-h, --help Prints help information
--save-log
--show-logs
-V, --version Prints version information
SUBCOMMANDS:
add add a job set to the queue
client start this workstation as a node and prepare it for a server connection
help Prints this message or the help of the given subcommand(s)
kill terminate any running jobs of a given batch name and remove the batch from the queue
node-status check the status of all the nodes
pause pause all currently running processes on this node for a specified amount of time
pull Pull files from the server to your machine
run run a apptainer configuration file locally (without sending it off to a server)
server start serving jobs out to nodes using the provied configuration file
server-status check the status of all the nodes
template generate a template file to fill for executing with `distribute add`
</code></pre>
<h2 id="add"><a class="header" href="#add">add</a></h2>
<p><code>distribute add</code> is how you can add jobs to the server queue. There are two main things needed to operate this command:
a configuration file and the IP of the main server node. If you do not specify the name of a configuration
file, it will default to <code>distribute-jobs.yaml</code>. This command can be run (for most cases) as such:</p>
<pre><code class="language-bash">distribute add --ip <server ip address here> my-distribute-jobs-file.yaml
</code></pre>
<p>or, using defaults:</p>
<pre><code class="language-bash">distribute add --ip <server ip address here>
</code></pre>
<p>If no node matches all of your required capabilities, the job will not be run. There is also a <code>--dry</code> flag
if you want to check that your configuration file syntax is correct, and a <code>--show-caps</code> flag to print the capabilities
of each node.</p>
<h2 id="template"><a class="header" href="#template">template</a></h2>
<p><code>distribute template</code> is a simple way to create a <code>distribute-jobs.yaml</code> file that runs with either <code>python</code> or <code>apptainer</code>. The specifics
of each configuration file will be discussed later.</p>
<pre><code class="language-bash">distribute template python
</code></pre>
<pre><code class="language-yaml">---
meta:
batch_name: your_jobset_name
namespace: example_namespace
matrix: ~
capabilities:
- gfortran
- python3
- apptainer
python:
initialize:
build_file: /path/to/build.py
required_files:
- path: /file/always/present/1.txt
alias: optional_alias.txt
- path: /another/file/2.json
alias: ~
- path: /maybe/python/utils_file.py
alias: ~
jobs:
- name: job_1
file: execute_job.py
required_files:
- path: job_configuration_file.json
alias: ~
- path: job_configuration_file_with_alias.json
alias: input.json
</code></pre>
<p>and</p>
<pre><code class="language-bash">distribute template apptainer
</code></pre>
<pre><code class="language-yaml">---
meta:
batch_name: your_jobset_name
namespace: example_namespace
matrix: ~
capabilities:
- gfortran
- python3
- apptainer
apptainer:
initialize:
sif: execute_container.sif
required_files:
- path: /file/always/present/1.txt
alias: optional_alias.txt
- path: /another/file/2.json
alias: ~
- path: /maybe/python/utils_file.py
alias: ~
required_mounts:
- /path/inside/container/to/mount
jobs:
- name: job_1
required_files:
- path: job_configuration_file.json
alias: ~
- path: job_configuration_file_with_alias.json
alias: input.json
</code></pre>
<h2 id="pause"><a class="header" href="#pause">pause</a></h2>
<p>If you use a compute node as a workstation, <code>distribute pause</code> will pause all locally running jobs so that you
can use the workstation normally. It takes a single argument: an upper bound on how long the tasks may be paused. The maximum amount of time that
a job can be paused is four hours (<code>4h</code>), but if this is not enough you can simply rerun the command. This
upper bound is only present to remove any chance of accidentally leaving the jobs paused for an extended
period of time.</p>
<p>If you decide that you no longer need the tasks paused, you can simply <code>Ctrl-C</code> to quit the hanging command
and all processes will be automatically resumed. <strong>Do not close your terminal</strong> before the pausing finishes or
you have canceled it with <code>Ctrl-C</code> as the job on your machine will never resume.</p>
<p>Some examples of this command:</p>
<pre><code class="language-bash">sudo distribute pause --duration 4h
</code></pre>
<pre><code class="language-bash">sudo distribute pause --duration 1h30m10s
</code></pre>
<pre><code class="language-bash">sudo distribute pause --duration 60s
</code></pre>
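<p>The duration argument is built from hour, minute, and second components, as the examples above show. This
little parser is only an illustration of the accepted format, not <code>distribute</code>'s actual implementation:</p>
<pre><code class="language-python">import re

def parse_duration(text):
    """Convert a duration like '1h30m10s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    total = 0
    for amount, unit in re.findall(r"(\d+)([hms])", text):
        total += int(amount) * units[unit]
    return total

parse_duration("1h30m10s")  # 5410 seconds
parse_duration("4h")        # 14400 seconds
</code></pre>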
<h2 id="server-status"><a class="header" href="#server-status">server-status</a></h2>
<p><code>distribute server-status</code> prints out all the running jobs at the head node. It will show you all the job batches
that are currently running, as well as the number of jobs in each batch currently running and the
names of the jobs that have not been run yet. You can use this command to fetch the required parameters
to execute the <code>kill</code> command if needed.</p>
<pre><code class="language-bash">distribute server-status --ip <server ip here>
</code></pre>
<p>If there is no output then there are no jobs currently in the queue or executing on nodes.</p>
<p>An example output:</p>
<pre><code>260sec
:jobs running now: 1
10sec_positive
-unforced_viscous_decay
-unforced_inviscid_decay
-viscous_forcing_no_compensation_eh_first
-viscous_forcing_no_compensation_eh_second
-viscous_forcing_no_compensation_eh_both
:jobs running now: 0
</code></pre>
<h2 id="pull"><a class="header" href="#pull">pull</a></h2>
<p><code>distribute pull</code> takes a <code>distribute-jobs.yaml</code> config file and pulls all the files associated with that batch
to a specified <code>--save-dir</code> (default is the current directory). This is really convenient because the only thing
you need to fetch your files is the original file you used to compute the results in the first place!</p>
<p>Since you often don't want to pull <em>all the files</em> - which might include tens or hundreds of gigabytes of flowfield
files - this command also accepts <code>include</code> or <code>exclude</code> filters, which consist of a list of regular expressions
to apply to the file path. If using an <code>include</code> query, any file matching one of the regexes will be pulled to
your machine. If using an <code>exclude</code> query, any file matching a regex will <em>not</em> be pulled to your computer. </p>
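<p>The filtering logic can be sketched in a few lines (an illustration only - <code>distribute</code>'s real
matching may differ in details):</p>
<pre><code class="language-python">import re

def filter_paths(paths, patterns, mode):
    """Keep paths matching any regex ('include') or drop them ('exclude')."""
    compiled = [re.compile(p) for p in patterns]
    matches = lambda path: any(c.search(path) for c in compiled)
    if mode == "include":
        return [p for p in paths if matches(p)]
    return [p for p in paths if not matches(p)]

paths = ["case1/flowfield.vtk", "case1/statistics.csv", "case2/flowfield.vtk"]
filter_paths(paths, ["vtk"], "exclude")   # only statistics.csv survives
filter_paths(paths, ["case1"], "include") # both case1 files survive
</code></pre>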
<p>The full documentation on regular expressions is found <a href="https://docs.rs/regex/latest/regex/">here</a>, but luckily
most character strings are valid regular expressions (barring characters like <code>+</code>, <code>-</code>, <code>(</code>, <code>)</code>). Let's say your
<code>meta</code> section of the config file looks like this:</p>
<pre><code class="language-yaml">---
meta:
batch_name: incompressible_5second_cases
namespace: brooks_openfoam_cases
capabilities: []
</code></pre>
<p>and your directory tree looks something like this</p>
<pre><code>├── incompressible_5second_cases
├── case1
│ ├── flowfield.vtk
│ └── statistics.csv
├── case2
│ ├── flowfield.vtk
│ └── statistics.csv
└── case3
├── flowfield.vtk
└── statistics.csv
</code></pre>
<p>If you wanted to exclude any file with a <code>vtk</code> extension, you could</p>
<pre><code class="language-bash">distribute pull distribute-jobs.yaml --ip <server ip here> \
exclude \
--exclude "vtk"
</code></pre>
<p>Or, if you wanted to exclude all of the case3 files and all vtk files:</p>
<pre><code class="language-bash">distribute pull distribute-jobs.yaml --ip <server ip here> \
exclude \
--exclude "vtk" \
--exclude "case3"
</code></pre>
<p>Maybe you only want to pull case1 files:</p>
<pre><code class="language-bash">distribute pull distribute-jobs.yaml --ip <server ip here> \
include \
--include "case1"
</code></pre>
<h2 id="run"><a class="header" href="#run">run</a></h2>
<p><code>distribute run</code> will run an apptainer job locally. It is useful for debugging apptainer jobs
since the exact commands that are passed to the container are not always intuitive. </p>
<pre><code>distribute run --help
</code></pre>
<pre><code>distribute-run 0.6.0
run a apptainer configuration file locally (without sending it off to a server)
USAGE:
distribute run [FLAGS] [OPTIONS] [job-file]
FLAGS:
--clean-save allow the save_dir to exist, but remove all the contents of it before executing the code
-h, --help Prints help information
-V, --version Prints version information
OPTIONS:
-s, --save-dir <save-dir> the directory where all the work will be performed [default: ./distribute-run]
ARGS:
<job-file> location of your configuration file [default: distribute-jobs.yaml]
</code></pre>
<p>An example is provided in the apptainer jobs section.</p>
<div style="break-before: page; page-break-before: always;"></div><h1 id="configuration"><a class="header" href="#configuration">Configuration</a></h1>
<p>Configuration files are fundamental to how <code>distribute</code> works. Without a configuration file, the server would not
know which nodes a job could run on, or even what each job contains. Configuration files
are also useful when <code>pull</code>ing the files you want from your compute job to your local machine. Therefore,
they are imperative to understand.</p>
<h2 id="configuration-files"><a class="header" href="#configuration-files">Configuration files</a></h2>
<p>As mentioned in the introduction, configuration files (usually named <code>distribute-jobs.yaml</code>) come in two flavors:
python scripts and apptainer images. </p>
<p>The advantage of python scripts is that they are relatively easy to produce:
you need a single script that specifies how to build your project, and another script (for each job) that specifies
how to run it. The disadvantage of python configurations is that they are brittle - the node's environment may
differ slightly from your own, and jobs can therefore fail in unpredictable ways.
Since all nodes with your capabilities are treated equally, a node failing to execute
your files will quickly chew through your jobs and spit out some errors.</p>
<p>The advantage of apptainer jobs is that you can be sure that <strong>the way the job is run
on <code>distribute</code> nodes is exactly how it would run on your local machine</strong>. This means that, while it may take
slightly longer to make an apptainer job, you can directly ensure that all the dependencies are present and that there won't
be any unexpected differences in the environment to ruin your job execution. <em>The importance of this cannot be
overstated</em>. The other advantage of apptainer jobs is that they can be run directly on other compute clusters (as
well as every lab machine), and they are much easier to debug if you want to hand off the project to another lab
member for help. The disadvantage of apptainer jobs is that <em>the file system is not mutable</em> - you cannot write
to any files in the container. Any attempt to write a file in the apptainer filesystem will result in an error
and the job will fail. Fear not, the fix for this is relatively easy: you simply bind folders from the host file system
(via the configuration file) to your container that <em>will</em> be writeable. All you have to do then is ensure that your
compute job only writes to folders that have been bound to the container from the host filesystem.</p>
<p>Regardless of using a python or apptainer configuration, the three main areas of the configuration file remain the same:</p>
<table>
<tr>
<th>Section</th>
<th>Python Configuration</th>
<th>Apptainer Configuration</th>
</tr>
<tr>
<td>Meta</td>
<td>
<ul>
<li>
Specifies how the files are saved on the head node (<code class="hljs">namespace</code> and <code class="hljs">batch_name</code> fields)
</li>
<li>
Describes all the
"<code class="hljs">capabilities</code>"
that are required to actually run the file. Nodes that do not meet your
<code class="hljs">capabilities</code> will not have the job scheduled on them.
</li>
<li>
Provides an optional field for your matrix username. If specified, you will receive
a message on matrix when all your jobs are completed.
</li>
</ul>
</td>
<td>
The same as python
</td>
</tr>
<tr>
<td>
Building
</td>
<td>
<ul>
<li>specifies a path to a python file </li>
<ul>
<li>Clone all repositories you require</li>
<li>Compile your project and make sure everything is ready for jobs</li>
</ul>
<li>Gives the paths to some files you want to be available on the node when you are compiling</li>
</ul>
</td>
<td>
<ul>
<li> Gives the path to a apptainer image file (compiled on your machine)</li>
</ul>
</td>
</tr>
<tr>
<td>
Running
</td>
<td>
<ul>
<li>
A list of jobs names
<ul>
<li>
Each job specifies a python file and some additional files you want to be present
</li>
<li>
Your python file will drop you in the exact same directory that you built from. You
are responsible for finding and running your previously compiled project with (optionally)
whatever input files you have ensured are present ( in ./input).
</li>
</ul>
</li>
</ul>
</td>
<td>
<ul>
<li>
A list of job names
<ul>
<li>
Similarly, also specify the files you want to be present
</li>
<li>
the /input directory of your container will contain all the files you specify in each job section
</li>
<li>
You are responsible for reading in the input files and running the solver
</li>
</ul>
</li>
<li>
You don't need to specify any runtime scripts
</li>
</ul>
</td>
</tr>
</table>
<h2 id="how-files-are-saved"><a class="header" href="#how-files-are-saved">How files are saved</a></h2>
<p>Files are saved on the server using your <code>namespace</code>, <code>batch_name</code>, and <code>job_name</code>s. Take the following configuration file,
which specifies an apptainer job that does not save any of its own files:</p>
<pre><code class="language-yaml">meta:
batch_name: example_jobset_name
namespace: example_namespace
matrix: "@your-username:matrix.org"
capabilities: []
apptainer:
initialize:
sif: execute_container.sif
required_files: []
required_mounts:
- /path/inside/container/to/mount
jobs:
- name: job_1
required_files: []
- name: job_2
required_files: []
- name: job_3
required_files: []
</code></pre>
<p>The resulting folder structure on the head node will be</p>
<pre><code>.
└── example_namespace
└── example_jobset_name
├── example_jobset_name_build_ouput-node-1.txt
├── example_jobset_name_build_ouput-node-2.txt
├── example_jobset_name_build_ouput-node-3.txt
├── job_1
│ └── stdout.txt
├── job_2
│ └── stdout.txt
└── job_3
└── stdout.txt
</code></pre>
<p>The nice thing about <code>distribute</code> is that you also receive the output that would appear on your terminal
as a text file. Namely, you will have text files for how your project was compiled (<code>example_jobset_name_build_ouput-node-1.txt</code>
is the python build script output for node-1), as well as the output for each job inside each respective folder.</p>
<p>If you were to execute another configuration file using a different batch name, like this:</p>
<pre><code class="language-yaml">meta:
  batch_name: another_jobset
  namespace: example_namespace
  matrix: "@your-username:matrix.org"
  capabilities: []
# -- snip -- #
</code></pre>
<p>the output would look like this:</p>
<pre><code>.
└── example_namespace
├── another_jobset
│ ├── example_jobset_name_build_ouput-node-1.txt
│ ├── example_jobset_name_build_ouput-node-2.txt
│ ├── example_jobset_name_build_ouput-node-3.txt
│ ├── job_1
│ │ └── stdout.txt
│ ├── job_2
│ │ └── stdout.txt
│ └── job_3
│ └── stdout.txt
└── example_jobset_name
├── example_jobset_name_build_ouput-node-1.txt
├── example_jobset_name_build_ouput-node-2.txt
├── example_jobset_name_build_ouput-node-3.txt
├── job_1
│ └── stdout.txt
├── job_2
│ └── stdout.txt
└── job_3
└── stdout.txt
</code></pre>
<p>Therefore, it's important to <strong>ensure that your <code>batch_name</code> fields are unique</strong>. If you don't, the output of
the previous batch will be deleted or combined with the new job's output.</p>
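<p>One way to guarantee uniqueness is to generate the <code>batch_name</code> programmatically before writing the configuration file. The helper below is a sketch (not part of <code>distribute</code> itself) that appends a UTC timestamp to a base name:</p>

```python
from datetime import datetime, timezone

def unique_batch_name(base: str) -> str:
    # append a UTC timestamp so repeated submissions never collide
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{base}-{stamp}"

# e.g. "example_jobset_name-20240101-120000"
print(unique_batch_name("example_jobset_name"))
```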
<h2 id="examples"><a class="header" href="#examples">Examples</a></h2>
<p>Examples of creating each configuration file can be found in this page's subchapters.</p>
<div style="break-before: page; page-break-before: always;"></div><h1 id="python"><a class="header" href="#python">Python</a></h1>
<p>Python configuration file templates can be generated as follows:</p>
<pre><code>distribute template python
</code></pre>
<p>At the time of writing, it outputs something like this:</p>
<pre><code class="language-yaml">---
meta:
batch_name: your_jobset_name
namespace: example_namespace
matrix: ~
capabilities:
- gfortran
- python3
- apptainer
python:
initialize:
build_file: /path/to/build.py
required_files:
- path: /file/always/present/1.txt
alias: optional_alias.txt
- path: /another/file/2.json
alias: ~
- path: /maybe/python/utils_file.py
alias: ~
jobs:
- name: job_1
file: execute_job.py
required_files:
- path: job_configuration_file.json
alias: ~
- path: job_configuration_file_with_alias.json
alias: input.json
</code></pre>
<h2 id="what-you-are-provided"><a class="header" href="#what-you-are-provided">What You Are Provided</a></h2>
<p>You may ask: what do your scripts see when they are executed on a node? While the base folder structure remains the same,
the files you are provided differ. Let's say you are executing the following section of a configuration file:</p>
<pre><code class="language-yaml">python:
initialize:
build_file: /path/to/build.py
required_files:
- path: file1.txt
- path: file999.txt
alias: file2.txt
jobs:
- name: job_1
file: execute_job.py
required_files:
- path: file3.txt
- name: job_2
file: execute_job.py
required_files: []
</code></pre>
<p>When executing the compilation, the folder structure would look like this:</p>
<pre><code>.
├── build.py
├── distribute_save
├── initial_files
│ ├── file1.txt
│ └── file2.txt
└── input
├── file1.txt
    └── file2.txt
</code></pre>
<p>In other words: when building, you only have access to the files from the <code>required_files</code> section in <code>initialize</code>. Another thing
to note is that even though you have specified the path to the <code>file999.txt</code> file on your local computer, the file has <em>actually</em>
been named <code>file2.txt</code> on the node. This is an additional feature to help your job execution scripts work with uniform file names; you
don't need to keep a bunch of solver inputs named <code>solver_input.json</code> in separate folders to prevent name collisions.
You can instead have several inputs <code>solver_input_1.json</code>, <code>solver_input_2.json</code>, <code>solver_input_3.json</code> on your local machine and
then set the <code>alias</code> field to <code>solver_input.json</code> so that your run script can simply read the file at <code>./input/solver_input.json</code>!</p>
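<p>With that alias convention, a run script can load its input from a fixed location no matter which local file was uploaded. The snippet below is a minimal sketch assuming the aliased file is named <code>solver_input.json</code>:</p>

```python
import json
import os

def load_solver_input(input_dir="input"):
    # thanks to the alias, every job can read the same file name here,
    # regardless of which solver_input_N.json was uploaded locally
    path = os.path.join(input_dir, "solver_input.json")
    with open(path) as f:
        return json.load(f)
```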
<p>Let's say your Python build script (which has been renamed to <code>build.py</code> by <code>distribute</code> for uniformity) clones the STREAmS solver
repository and compiles the project. Then, when executing <code>job_1</code>, your folder structure would look something like this:</p>
<pre><code>.
├── job.py
├── distribute_save
├── initial_files
│ ├── file1.txt
│ └── file2.txt
├── input
│ ├── file1.txt
│ ├── file2.txt
│ └── file3.txt
└── STREAmS
├── README.md
└── src
└── main.f90
</code></pre>
<p>Now, the folder structure is <em>exactly</em> as you have left it, plus the addition of a new <code>file3.txt</code> that you specified in your <code>required_files</code>
section under <code>jobs</code>. Since <code>job_2</code> does not specify any additional <code>required_files</code>, the directory structure when running the python
script would look like this:</p>
<pre><code>.
├── job.py
├── distribute_save
├── initial_files
│ ├── file1.txt
│ └── file2.txt
├── input
│ ├── file1.txt
│   └── file2.txt
└── STREAmS
├── README.md
└── src
└── main.f90
</code></pre>
<p>In general, the presence of <code>./initial_files</code> is an implementation detail. The files in this folder are <em>not</em> refreshed
between job executions. You should not rely on the existence of this folder or modify any of its contents. The
contents of the folder are copied to <code>./input</code> with every new job; use those files instead.</p>
<h2 id="saving-results-of-your-compute-jobs"><a class="header" href="#saving-results-of-your-compute-jobs">Saving results of your compute jobs</a></h2>
<p>Archiving jobs to the head node is <em>super</em> easy. All you have to do is ensure that your execution script moves all files
you wish to save to the <code>./distribute_save</code> folder before exiting. <code>distribute</code> will automatically read all the files
in <code>./distribute_save</code> and save them to the corresponding job folder on the head node permanently. <code>distribute</code> will
also clear out the <code>./distribute_save</code> folder for you between jobs so that you don't end up with duplicate files.</p>
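<p>For example, the last step of a run script might sweep everything the solver produced into <code>./distribute_save</code>. This is a sketch that assumes the solver wrote its results into a local <code>output/</code> directory (a hypothetical name, not mandated by <code>distribute</code>):</p>

```python
import os
import shutil

def save_results(output_dir="output", save_dir="distribute_save"):
    # move every produced file into ./distribute_save so that
    # `distribute` archives it on the head node after the job exits
    os.makedirs(save_dir, exist_ok=True)
    for name in os.listdir(output_dir):
        shutil.move(os.path.join(output_dir, name), os.path.join(save_dir, name))
```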
<h2 id="build-scripts"><a class="header" href="#build-scripts">Build Scripts</a></h2>
<p>The build script is specified in the <code>initialize</code> section under the <code>build_file</code> key. The build script is simply responsible
for cloning relevant git repositories and compiling the project. Since private repositories require
a GitHub SSH key, a read-only SSH key is provided on the system so that you can clone any private <code>fluid-dynamics-group</code>
repository. An example build script that I have personally used for working with <code>hit3d</code> looks like this:</p>
<pre><code class="language-python">import subprocess
import os
import sys
import shutil
import traceback

# hit3d_helpers is a python script that I have specified in
# my `required_files` section of `initialize`
from initial_files import hit3d_helpers

HIT3D = "https://github.com/Fluid-Dynamics-Group/hit3d.git"
HIT3D_UTILS = "https://github.com/Fluid-Dynamics-Group/hit3d-utils.git"
VTK = "https://github.com/Fluid-Dynamics-Group/vtk.git"
VTK_ANALYSIS = "https://github.com/Fluid-Dynamics-Group/vtk-analysis.git"
FOURIER = "https://github.com/Fluid-Dynamics-Group/fourier-analysis.git"
GRADIENT = "https://github.com/Fluid-Dynamics-Group/ndarray-gradient.git"
DIST = "https://github.com/Fluid-Dynamics-Group/distribute.git"
NOTIFY = "https://github.com/Fluid-Dynamics-Group/matrix-notify.git"

# executes a command as if you were typing it in a terminal
def run_shell_command(command):
    print(f"running {command}")
    output = subprocess.run(command, shell=True, check=True)
    if output.stdout is not None:
        print(output.stdout)

# construct a `git clone` string to run as a shell command
def make_clone_url(ssh_url, branch=None):
    if branch is not None:
        return f"git clone -b {branch} {ssh_url} --depth 1"
    else:
        return f"git clone {ssh_url} --depth 1"

def main():
    build = hit3d_helpers.Build.load_json("./initial_files")
    print("input files:")

    run_shell_command(make_clone_url(HIT3D, build.hit3d_branch))
    run_shell_command(make_clone_url(HIT3D_UTILS, build.hit3d_utils_branch))
    run_shell_command(make_clone_url(VTK))
    run_shell_command(make_clone_url(VTK_ANALYSIS))
    run_shell_command(make_clone_url(FOURIER))
    run_shell_command(make_clone_url(GRADIENT))
    run_shell_command(make_clone_url(DIST, "cancel-tasks"))
    run_shell_command(make_clone_url(NOTIFY))

    # move the directory for book keeping purposes
    shutil.move("fourier-analysis", "fourier")

    # build hit3d
    os.chdir("hit3d/src")
    run_shell_command("make")
    os.chdir("../../")

    # build hit3d-utils
    os.chdir("hit3d-utils")
    run_shell_command("cargo install --path .")
    os.chdir("../")

    # build vtk-analysis
    os.chdir("vtk-analysis")
    run_shell_command("cargo install --path .")
    os.chdir("../")

    # all the other projects cloned are dependencies of the built projects
    # they don't need to be explicitly built themselves

if __name__ == "__main__":
    main()
</code></pre>
<p>Note that <code>os.chdir</code> is the Python equivalent of the shell's <code>cd</code> command: it simply changes the current working
directory.</p>
<h2 id="job-execution-scripts"><a class="header" href="#job-execution-scripts">Job Execution Scripts</a></h2>
<p>Execution scripts are specified in the <code>file</code> key of each list item in <code>jobs</code>. Execution scripts
can do a lot of things. I have found it productive to write a single <code>generic_run.py</code> script that
reads a configuration file from <code>./input/input.json</code> (which is specified under my <code>required_files</code> for the job)
and then runs the solver from there.</p>
<p>One important thing about execution scripts is that they are run with a command line argument specifying
how many cores you are allowed to use. If you hardcode the number of cores, you will either
oversaturate the processor (thereby slowing down the overall execution speed) or undersaturate
the resources available on the machine. Your script will be executed as if it were a command line
program. If the computer had 16 cores available, this would be the command:</p>
<pre><code class="language-bash">python3 ./job.py 16
</code></pre>
<p>You can parse this value using <code>sys.argv</code> in your script:</p>
<pre><code class="language-python">import sys

allowed_processors = sys.argv[1]
allowed_processors_int = int(allowed_processors)
assert allowed_processors_int == 16
</code></pre>
<p><strong>You must ensure that you use all available cores on the machine</strong>. If your code can only use a reduced number
of cores, make sure you specify this in your <code>capabilities</code> section! <strong>Do not run single threaded
processes on the distributed computing network - they will not go faster</strong>.</p>
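<p>A common pattern is to forward the parsed core count straight to the solver's launch command. The sketch below assumes an MPI-based solver started with <code>mpirun</code>; the executable name <code>./solver</code> is hypothetical:</p>

```python
import subprocess
import sys

def solver_command(nprocs: int) -> str:
    # forward every core that `distribute` allocated to this job
    return f"mpirun -np {nprocs} ./solver"

# when run as a job script, the allowed core count
# arrives as the first command line argument
if __name__ == "__main__" and len(sys.argv) > 1:
    subprocess.run(solver_command(int(sys.argv[1])), shell=True, check=True)
```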
<p>A full working example of a run script that I use is this:</p>
<pre><code class="language-python">import os
import sys
import json
import shutil
import traceback

from input import hit3d_helpers

IC_SPEC_NAME = "initial_condition_espec.pkg"
IC_WRK_NAME = "initial_condition_wrk.pkg"

def load_json():
    path = "./input/input.json"
    with open(path, 'r') as f:
        data = json.load(f)
    print(data)
    return data

# copies some initial condition files from ./input
# to the ./hit3d/src directory so they can be used
# by the solver
def move_wrk_files():
    outfile = "hit3d/src/"
    infile = "input/"
    shutil.copy(infile + IC_SPEC_NAME, outfile + IC_SPEC_NAME)
    shutil.copy(infile + IC_WRK_NAME, outfile + IC_WRK_NAME)

# copy the ./input/input.json file to the output directory
# so that we can see it later when we download the data for viewing
def copy_input_json():
    outfile = "distribute_save/"
    infile = "input/"
    shutil.copy(infile + "input.json", outfile + "input.json")

if __name__ == "__main__":
    try:
        data = load_json()

        # get the number of cores that we are allowed to use from the command line
        nprocs = int(sys.argv[1])
        print(f"running with nprocs = {nprocs}")

        # parse the json data into parameters to run the solver with
        skip_diffusion = data["skip_diffusion"]
        size = data["size"]
        dt = data["dt"]
        steps = data["steps"]
        restarts = data["restarts"]
        reynolds_number = data["reynolds_number"]
        path = data["path"]
        load_initial_data = data["load_initial_data"]
        export_vtk = data["export_vtk"]
        epsilon1 = data["epsilon1"]
        epsilon2 = data["epsilon2"]
        restart_time = data["restart_time"]
        skip_steps = data["skip_steps"]
        scalar_type = data["scalar_type"]
        validate_viscous_compensation = data["validate_viscous_compensation"]
        viscous_compensation = data["viscous_compensation"]
        require_forcing = data["require_forcing"]
        io_steps = data["io_steps"]
        export_divergence = data["export_divergence"]

        # if we need initial condition data then we copy it into ./hit3d/src/
        if not load_initial_data == 1:
            move_wrk_files()

        root = os.getcwd()

        # open the hit3d source folder before running the solver
        os.chdir("hit3d/src")

        # run the solver using the `hit3d_helpers` file that we have
        # ensured is present from `required_files`
        hit3d_helpers.RunCase(
            skip_diffusion,
            size,
            dt,
            steps,
            restarts,
            reynolds_number,
            path,
            load_initial_data,
            nprocs,
            export_vtk,
            epsilon1,
            epsilon2,
            restart_time,
            skip_steps,
            scalar_type,
            validate_viscous_compensation,
            viscous_compensation,
            require_forcing,
            io_steps,
            export_divergence
        ).run(0)

        # go back to the main folder that we started in
        os.chdir(root)
        copy_input_json()

        sys.exit(0)

    # this section will ensure that the exception and traceback
    # is printed to the console (and therefore appears in stdout files saved
    # on the server)
    except Exception as e:
        print("EXCEPTION occurred:\n", e)
        print(e.__cause__)
        print(e.__context__)
        traceback.print_exc()
        sys.exit(1)
</code></pre>
<h2 id="full-example"><a class="header" href="#full-example">Full Example</a></h2>
<p>A simpler example of a python job has been compiled and verified
<a href="https://github.com/Fluid-Dynamics-Group/distribute/tree/cancel-tasks/examples/python">here</a>.</p>
<div style="break-before: page; page-break-before: always;"></div><h1 id="apptainer"><a class="header" href="#apptainer">Apptainer</a></h1>
<p>Apptainer (previously named Singularity) is a container system often used for packaging HPC applications. For us,
Apptainer is useful for distributing your compute jobs since you can specify the exact dependencies required
for running. If your container runs on your machine, it will run on the <code>distribute</code>d cluster!</p>
<p>As mentioned in the introduction, you <strong>must ensure that your container does not write to any directories that are not bound by the host
system</strong>. This will be discussed further below, but suffice it to say that writing to apptainer's immutable filesystem
will crash your compute job.</p>
<h2 id="versus-docker"><a class="header" href="#versus-docker">Versus Docker</a></h2>
<p>There is an official documentation page discussing the differences between Docker
and Apptainer <a href="https://apptainer.org/user-docs/master/singularity_and_docker.html">here</a>. There are a few primary
benefits to using Apptainer from an implementation standpoint in <code>distribute</code>:</p>
<ol>
<li>It is easy to use GPU compute from Apptainer</li>
<li>Apptainer compiles down to a single <code>.sif</code> file that can easily be sent to the <code>distribute</code> server and passed to compute nodes</li>
<li>Once your code has been packaged in apptainer, it is very easy to run it on paid HPC clusters</li>
</ol>
<h2 id="overview-of-apptainer-configuration-files"><a class="header" href="#overview-of-apptainer-configuration-files">Overview of Apptainer configuration files</a></h2>
<p><img src="./figs/apptainer_config_flowchart.png" alt="" /></p>
<h2 id="apptainer-definition-files"><a class="header" href="#apptainer-definition-files">Apptainer definition files</a></h2>
<p>This documentation is not the place to discuss the intricacies of Apptainer. We have tried to make
it as easy as possible for you to build an image that can run on <code>distribute</code>.
The <a href="https://github.com/Fluid-Dynamics-Group/apptainer-common">apptainer-common</a> repository was purpose-built to give you a good
starting place with compilers and runtimes (including Fortran, C++, OpenFOAM, and Python 3). Your definition file
needs to look something like this:</p>
<pre><code>Bootstrap: library
From: library://vanillabrooks/default/fluid-dynamics-common
%files from build
# in here you copy files / directories from your host machine into the
# container so that they may be accessed and compiled.
# the syntax is:
/path/to/host/file /path/to/container/file