This repository has been archived by the owner on Dec 19, 2023. It is now read-only.
forked from Yelp/mrjob
CHANGES.txt
v0.7.4, 2020-09-17 -- Docker, concurrent steps, and pooling
* library requirement changes:
* [emr] requires boto3>=1.10.0, botocore>=1.13.26 (#2193)
* [google] requires google-cloud-dataproc<=1.1.0
* cloud runners (Dataproc, EMR):
* mrjob is now bootstrapped through py_files, not at bootstrap time
* EMR Runner:
* default image_version is now 6.0.0
* support Docker on 6.x AMIs (#2179)
* added docker_client_config, docker_image, docker_mounts opts
* allow concurrent steps on EMR clusters (#2185)
* max_concurrent_steps option
* for multi-step jobs, can add steps to cluster one at a time
* by default, does this if cluster supports concurrent steps
* can be controlled directly with add_steps_in_batch option
* pooling:
* join pooled clusters based on YARN cluster metrics (#2191)
* min_available_mb, min_available_virtual_cores opts
* upgrades to timing and cluster management:
* max_clusters_in_pool option (#2192)
* pool_timeout_minutes (#2199)
* pool_jitter_seconds to prevent race conditions (#2200)
* wait for S3 sync after uploading to S3, not before launching cluster
* don't wait pool_wait_minutes if no clusters to wait for (#2198)
* get_job_steps() is deprecated
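The YARN-metrics check behind min_available_mb and min_available_virtual_cores can be pictured with a small sketch. It assumes the metrics arrive as a dict using the availableMB and availableVirtualCores field names from the YARN ResourceManager's cluster metrics; the function name and the handling of missing keys are illustrative assumptions, not mrjob's implementation.

```python
def cluster_has_capacity(metrics, min_available_mb=0,
                         min_available_virtual_cores=0):
    """Return True if a cluster reports enough free YARN resources.

    `metrics` is assumed to be a dict with 'availableMB' and
    'availableVirtualCores' keys; a missing key counts as zero.
    """
    return (
        metrics.get('availableMB', 0) >= min_available_mb and
        metrics.get('availableVirtualCores', 0) >= min_available_virtual_cores
    )


# Hypothetical metrics for two pooled clusters
idle = {'availableMB': 12288, 'availableVirtualCores': 8}
busy = {'availableMB': 1024, 'availableVirtualCores': 1}

print(cluster_has_capacity(idle, min_available_mb=4096,
                           min_available_virtual_cores=2))
print(cluster_has_capacity(busy, min_available_mb=4096,
                           min_available_virtual_cores=2))
```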
v0.7.3, 2020-06-05 -- API-efficient cluster pooling
* cluster pooling changes:
* cluster locking now uses EMR tags, not S3 objects (#2160)
* cluster locks always expire after one minute (#2162)
* deprecated --max-mins-locked (terminate-idle-clusters), does nothing
* pooling uses API more efficiently
* most cluster pooling info is in job name (#2160)
* don't list pooled clusters' steps (#2159)
* use any matching cluster, not just the "best" one (#2164)
* "best" cluster determined by NormalizedInstanceHours / hours run
* matching rules are slightly more strict:
* mrjob version must always match
* application list must match exactly
* terminate_idle_clusters no longer locks pooled clusters
* spark runner:
* counters work when spark_tmp_dir is a local path (#2176)
* manifest download script correctly handles errors with dash (#2175)
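The "best" cluster ratio named in the v0.7.3 notes (NormalizedInstanceHours / hours run) can be sketched as a sort key. The cluster records, IDs, and field names below are hypothetical, and the guard against a zero run time is an added assumption; this is not mrjob's code.

```python
from datetime import datetime


def capacity_score(normalized_instance_hours, created, now):
    """NormalizedInstanceHours divided by the hours the cluster has run."""
    hours_run = (now - created).total_seconds() / 3600.0
    # Assumed guard: treat anything under one second as one second
    return normalized_instance_hours / max(hours_run, 1.0 / 3600)


now = datetime(2020, 6, 5, 12, 0)

# Hypothetical cluster records; IDs and field names are made up
clusters = [
    {'id': 'j-AAAA', 'normalized_instance_hours': 64,
     'created': datetime(2020, 6, 5, 8, 0)},
    {'id': 'j-BBBB', 'normalized_instance_hours': 48,
     'created': datetime(2020, 6, 5, 10, 0)},
]

# Highest score first
ranked = sorted(
    clusters,
    key=lambda c: capacity_score(
        c['normalized_instance_hours'], c['created'], now),
    reverse=True)

print([c['id'] for c in ranked])
```

The younger cluster ranks first here because it accrued its normalized hours faster, which suggests more capacity.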
v0.7.2, 2020-04-11 -- archives on all Spark platforms
* archives work on non-YARN Spark installations (#1993)
* mrjob.util.file_ext() ignores initial dots
* archives in setup scripts etc. are auto-named without file extension
* bootstrap now recognizes archives with names like *.0.7.tar.gz
* don't copy SSH key to master node when accessing other nodes on EMR (#1209)
* added ssh_add_bin option
* extra_cluster_params merges dict params rather than overwriting them (#2154)
* default python_bin on Python 2 is now 'python2.7' (#2151)
* ensure working PyYAML installs on Python 3.4 (#2149)
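A stand-in for the file_ext() behavior described in v0.7.2: leading dots are ignored, so a dotfile has no extension. Returning everything from the first remaining dot onward is an assumption about the intended semantics; this is a sketch, not a copy of mrjob.util.file_ext().

```python
def file_ext(filename):
    """Return the extension of filename, ignoring any initial dots.

    The extension is taken to start at the first dot that follows
    the leading dots (an assumption for this sketch).
    """
    stripped = filename.lstrip('.')
    dot_index = stripped.find('.')
    if dot_index == -1:
        return ''
    return stripped[dot_index:]


for name in ['foo.tar.gz', '.bashrc', '.hidden.zip', 'README']:
    print(repr(name), '->', repr(file_ext(name)))
```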
v0.7.1, 2019-12-20 -- Improve logging
* enable mrjob to show invoked runner with kwargs (#2129)
* set default value of VisibleToAllUsers to true (#2131)
* added archives to EMR pool hash during bootstrapping (#2136)
v0.7.0, 2019-10-22 -- fall cleaning
* moved support for AWS and Google Cloud to extras_require (#1935)
* use e.g. `pip install mrjob[aws]`
* removed support for non-Python MRJobs (#2087)
* removed interpreter and steps_interpreter options (see below)
* removed the `mrjob run` command
* removed mr_wc.rb from mrjob/examples/
* merged the MRJobLauncher class back into MRJob
* MRJob classes initialized without args read them from sys.argv (#2124)
* use SomeMRJob([]) to simulate running with no args (e.g. for tests)
* revamped and tested mrjob/examples/ (#2122)
* mr_grep.py no longer errors on no matches
* mr_log_sampler.py correctly randomizes lines
* mr_spark_wordcount.py is no longer case sensitive
* same with mr_spark_wordcount_script.py
* mr_text_classifier.py now reads text files directly, no need to encode
* public domain examples are in mrjob/examples/docs-to-classify
* renamed mr_words_containing_u_freq_count.py
* removed some examples that were difficult to test or maintain
* mrjob audit-emr-usage no longer reads pre-v0.6.0 cluster pool names (#1815)
* filesystem methods now have consistent arg naming
* removed the following deprecated code:
* runner options:
* emr_api_params
* interpreter
* max_hours_idle
* mins_to_end_of_hour
* steps_interpreter
* steps_python_bin
* visible_to_all_users
* singular switches (use --archives, etc.):
* --archive
* --dir
* --file
* --hadoop-arg
* --libjar
* --py-file
* --spark-arg
* --steps switch from MRJobs (#2046)
* use --help -v to see help for --mapper etc.
* MRJob:
* optparse simulation:
* add_file_option()
* add_passthrough_option()
* configure_options()
* load_options()
* pass_through_option()
* self.args
* self.OPTION_CLASS
* parse_output_line()
* MRJobRunner:
* file_upload_args kwarg to runner constructor
* stream_output()
* mrjob.util:
* parse_and_save_options()
* read_file()
* read_input()
* filesystems:
* arguments to CompositeFilesystem constructor (use add_fs())
* useless local_tmp_dir arg to GCSFilesystem constructor
* chunk_size arg to GCSFilesystem.put()
v0.6.12, 2019-10-23 -- unbreak Google
* default image_version on Dataproc is now 1.3 (#2110)
* local filesystem can now handle file:// URIs (#1986)
* sim runners accept file:// URIs as input files, upload files/archives
v0.6.11, 2019-10-07 -- Spark log parsing
* Python 3.4 is again supported, except for Google libraries (#2090)
* can intermix positional (input file) args to MRJobs on Python 3.7 (#1701)
* all runners
* can parse logs to find cause of error in Spark (#2056)
* EMR runner
* retrying on transient API errors now works with pagination (#2005)
* default image_version (AMI) is now 5.27.0 (#2105)
* restored m4.large as default instance type for pre-5.13.0 AMIs (#2098)
* can override emr_configurations with !clear or by Classification (#2097)
* Spark runner
* can run scripts with spark-submit without pyspark in $PYTHONPATH (#2091)
v0.6.10, 2019-07-19 -- official PyPy support
* officially support PyPy (#1011)
* when launched in PyPy, defaults python_bin to pypy or pypy3
* Spark runner
* turn off internal protocol with --skip-internal-protocol (#1952)
* Spark harness can run inside EMR (#2070)
* EMR runner
* default instance type is now m5.xlarge (#2071)
* log DNS of master node as soon as we know it (#2074)
* better error when reading YAML conf file without YAML library (#2047)
v0.6.9, 2019-05-29 -- Better emulation
* formally dropped support for Python 3.4
* (still seems to work except for Google libraries)
* jobs:
* deprecated add_*_option() methods can take types as their type arg (#2058)
* all runners
* archives no longer go into working dir mirror (#2059)
* fixes bug in v0.6.8 that could break archives on Hadoop
* sim runners (local, inline)
* simulated mapreduce.map.input.file is now a file:// URL (#2066)
* Spark runner
* added emulate_map_input_file option (#2061)
* can optionally emulate mapreduce.map.input.file in first step's mapper
* increment_counter() emulation now uses correct arg names (#2060)
* warns if spark_tmp_dir and master aren't both local/remote (#2062)
* mrjob spark-submit can take switches to script without using "--" (#2055)
v0.6.8, 2019-04-25 -- Spark runner
* updated library dependencies (#2019, #2025)
* google-cloud-dataproc>=0.3.0
* google-cloud-logging>=1.9.0
* google-cloud-storage>=1.13.1
* PyYAML>=3.10
* jobs:
* MRJobs are now Spark-serializable (without calling sandbox())
* spark() can pass job methods to rdd.map() etc. (#2039)
* all runners:
* inline runner runs Spark jobs through PySpark (#1965)
* local runner runs Spark jobs on local-cluster master (#1361)
* cat_output() now ignores files and subdirs starting with "." too (#1337)
* this includes Spark checksum files (e.g. .part-00000.crc)
* empty *_bin options mean use the default, not a no-args command (#1926)
* affected gcloud_bin, hadoop_bin, sh_bin, ssh_bin
* *python_bin options already worked this way
* improved Spark support
* full support for setup scripts (was just YARN) (#2048)
* fully supports uploading files to Spark working dir (#1922)
* including renaming files (#2017)
* uploading archives/dirs is still unsupported except on YARN
* spark.yarn.appMasterEnv.* now only set on YARN (#1919)
* add_file_arg() works on Spark
* even on local[*] master (#2031)
* uses file:// as appropriate when running locally (#1985)
* won't hang if Hadoop or Spark binary can't be run (#2024)
* spark master/deploy mode can't be overridden by jobconf (#2032)
* can search for spark-submit binary in pyspark installation (#1984)
* (Dataproc runner does not yet support Spark)
* EMR runner:
* fixed fs bug that prevented running with non-default temp bucket (#2015)
* fewer API calls when job retries joining a pooled cluster (#1990)
* extra_cluster_params can set nested sub-params (#1934)
* e.g. Instances.EmrManagedMasterSecurityGroup
* --subnet '' un-sets subnet set in mrjob.conf (#1931)
* added Spark runner (#1940)
* runs jobs entirely on Spark, uses `hadoop fs` for HDFS only
* can use any fs mrjob supports (HDFS, EMR, Dataproc, local)
* can run "classic" MRJobs normally run on Hadoop streaming (#1972)
* supports mappers, combiners, reducers, including _init() and _final()
* makes efficient use of combiners, if available (#1946)
* supports Hadoop input/output format set in job (#1944)
* can run consecutive MRSteps in a single Spark step (#1950)
* respects SORT_VALUES (#1945)
* emulates Hadoop output compression (#1943)
* set the same jobconf variables you would in Hadoop
* can control number of output files
* set Hadoop jobconf variables to control # of reducers (#1953)
* or use --max-output-files (#2040)
* can simulate counters with accumulators (#1955)
* can handle jobs that load file args in their constructor (#2044)
* does not support commands (e.g. mapper_cmd(), mapper_pre_filter())
* (Spark runner does not yet parse logs for probable cause of error)
* Spark harness renamed to mrjob/spark/harness.py, no need to run directly
* `mrjob spark-submit` now defaults to spark runner
* works on emr, hadoop, and local runners as well (#1975)
* runner filesystems:
* added put() method to all filesystems (#1977)
* part size for uploads is now set at fs init time
* CompositeFilesystem can give up on an un-configured filesystem (#1974)
* used by the Spark runner when GCS/S3 aren't set up
* mkdir() can now create buckets (#2014)
* fs-specific methods now accessed through fs.<name>
* e.g. runner.fs.s3.make_s3_client()
* deprecated useless local_tmp_dir arg to GCSFilesystem (#1961)
* missing mrjob.examples support files now installed
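The nested sub-param syntax for extra_cluster_params in v0.6.8 (e.g. Instances.EmrManagedMasterSecurityGroup) amounts to writing a dotted key into a nested dict. A minimal sketch, assuming a hypothetical helper name and a made-up security group ID; mrjob's own handling may differ.

```python
def set_nested_param(params, dotted_key, value):
    """Set params['A']['B'] = value for dotted_key 'A.B', in place.

    Intermediate dicts are created as needed. Existing intermediate
    values are assumed to be dicts.
    """
    keys = dotted_key.split('.')
    target = params
    for key in keys[:-1]:
        target = target.setdefault(key, {})
    target[keys[-1]] = value
    return params


# Existing sibling keys are preserved; 'sg-hypothetical' is a placeholder
params = {'Instances': {'KeepJobFlowAliveWhenNoSteps': True}}
set_nested_param(
    params, 'Instances.EmrManagedMasterSecurityGroup', 'sg-hypothetical')
print(params)
```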
v0.6.7, 2019-01-16 -- mrjob spark-submit
* tools:
* added mrjob spark-submit subcommand (#1382)
* add subcommand to usage in --help for subcommands (#1885)
* added --emr-action-on-failure switch to mrjob create-cluster (#1959)
* jobs:
* added *_pairs() methods to MRJob (#1947)
* jobs pass steps description to runner constructor (#1845)
* --steps is deprecated
* all runners:
* sh_bin defaults to /bin/sh -ex, not just sh -ex (#1924)
* sh_bin may not be empty and should not take more than one argument
* warn about command-line switches for wrong runner (#1898)
* added plural command-line switches (#1882):
* added --applications, --archives, --dirs, --files, --libjars, --py-files
* deprecated --archive, --dir, --file, --libjar, --py-file
* interpreter and steps_interpreter opts are deprecated (#1850)
* steps_python_bin is deprecated (#1851)
* can set separate SPARK_PYTHON and SPARK_DRIVER_PYTHON if need be
* inline runner:
* no longer attempts to run command substeps (#1878)
* inline and local runner:
* no longer attempts to run non-streaming steps (#1915)
* Dataproc and EMR runners:
* fixed SIGKILL import error on Windows (#1892)
* Hadoop and EMR runners:
* setup opt works with Spark scripts on YARN (#1376)
* Hadoop runner:
* removed useless bootstrap_spark opt (#1382)
* EMR runner:
* fail fast if input files are archived in Glacier (#1887)
* default instance type is m4.large (#1932)
* pooling knows about c5 and m5 instance types (#1930, #1936)
* create_bucket() was broken in us-east-1, fixed (#1927)
* idle timeout silently failed on 2.x AMIs, fixed (#1909)
* updated deprecated escape sequences that would break in Python 3.8 (#1920)
* raise ValueError, not AssertionError (#1877)
* added experimental harness to submit basic MRJobs on Spark (#1941)
v0.6.6, 2018-11-05 -- nicer options and switches
* configs:
* boolean jobconf values in mrjob.conf now work correctly (#323)
* added mrjob.conf.combine_jobconfs()
* jobs:
* fixed "usage: usage: " in --help (#1866)
* overriding jobconf() and libjars() can no longer clobber
command-line options (#1453)
* JarSteps use GENERIC_ARGS to interpolate -D/-libjars (#1863)
* add_file_arg() supports explicit type=str (#1858)
* add_file_option() and add_passthrough_option() support type='str' (#1857)
* all runners:
* py_files are always uploaded to HDFS/S3 (#1852)
* options and switches:
* added -D as a synonym for --jobconf (#1839)
* added --local-tmp-dir switch (#1870)
* setting local_tmp_dir to '' uses default temp dir
* added --hadoop-args and --spark-args switches (#1844)
* --hadoop-arg and --spark-arg are now deprecated
* EMR runner:
* can now fetch history log via SSH, eliminating wait for S3 (#1253)
* Hadoop runner:
* added spark_deploy_mode option (#1864)
* sim runners:
* fixed permission error on Windows (#1847)
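One plausible reading of the boolean jobconf fix in v0.6.6: a YAML true in mrjob.conf loads as Python True, whose str() is 'True', while Hadoop properties are conventionally written as lowercase 'true'/'false'. The helper below is a hypothetical illustration of that conversion, not mrjob's code.

```python
def jobconf_value_to_str(value):
    """Render a jobconf value as a string suitable for -D key=value.

    Booleans become lowercase 'true'/'false'; everything else uses str().
    """
    if isinstance(value, bool):
        return 'true' if value else 'false'
    return str(value)


# Values as they might come out of a parsed YAML config
jobconf = {
    'mapreduce.map.speculative': False,
    'mapreduce.job.reduces': 4,
}

rendered = {k: jobconf_value_to_str(v) for k, v in jobconf.items()}
print(rendered)
```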
v0.6.5, 2018-09-07 -- custom AMIs
* all runners:
* can turn off log parsing with --no-read-logs (#1825)
* EMR Runner:
* transient API errors:
* EMR client won't retry faster than check_cluster_every option (#1799)
* all AWS clients will retry on SSL timeouts (#1827)
* RetryWrapper now passes through docstring of wrapped methods
* AMIs:
* default AMI is now 5.16.0 (#1818)
* choose custom AMIs with image_id option (#1805)
* find base AMIs with mrjob.ami.describe_base_emr_images() (#1829)
* added make_ec2_client() method and ec2_endpoint option
* choose EBS root volume size with ebs_root_volume_gb option (#1812)
* new clusters are tagged with __mrjob_label and __mrjob_owner (#1828)
* idle self-termination script (max_hours_idle):
* retries if shutdown fails (#1819)
* logs its output to mrjob-idle-termination.log
* cluster pooling recovery now works on single-node clusters (#1822)
v0.6.4, 2018-07-31 -- FILES, DIRS, ARCHIVES
* drop support for Python 3.3
* unvendored google-cloud-dataproc, version 0.2.0+ required (#1796)
* link static files to MRJobs with FILES, DIRS, ARCHIVES (#1431)
* also added files(), dirs(), archives() methods
* termination protection doesn't make terminate-idle-clusters crash (#1801)
v0.6.3, 2018-05-31 -- Dataproc parity
* jobs:
* use mapper_raw() to read entire file, in any format (#754)
* log interpretation:
* handles "not a valid JAR" error from Hadoop (#1771)
* fewer dependencies on Google libraries (#1746):
* google-cloud-logging 1.5.0+
* google-cloud-storage 1.9.0+
* google-cloud-dataproc is vendored (future releases will require 0.11.0+)
* RetryWrapper now sets __name__ of wrapped functions (#1790)
* runners:
* all runners:
* don't stream output if --output-dir is specified (#1739)
* --no-output switch is now --no-cat-output
* added --cat-output switch
* cloud runners (Dataproc, EMR):
* renamed cloud_upload_part_size option to cloud_part_size_mb (#1774)
* DataprocJobRunner:
* options now supported:
* cloud_part_size_mb: control chunked uploading (#1404)
* {core,master,task}_instance_config (#1681):
* set disk_config, is_preemptible, other instance options
* cluster_config: set properties in Hadoop config files (#1680)
* hadoop_streaming_jar: specify custom Hadoop streaming JAR (#1676)
* network/subnet: specify network/subnetwork (#1683)
* service_account: specify custom IAM service account (#1682)
* service_account_scopes: specify custom permissions for cluster (#1682)
* ssh_tunnel/ssh_tunnel_is_open: access resource manager (#1670)
* fs:
* cat() streams data rather than dumping to a temp file (#1674)
* exists() no longer swallows exceptions (#1675)
* full support for parsing probable cause of job failure (#1672)
* full support for fetching and parsing counters (#1703)
* job progress messages (#1671)
* can now run JAR steps (#1677)
* uses Dataproc's built-in idle timeout, not a script (#1705)
* bootstrap script runs in temp dir, not / (#1601)
v0.6.2, 2018-03-23 -- log parsing at scale
* runners:
* local runners
* added --num-cores option to control parallelism and file splits (#1727)
* cloud runners (EMR and Dataproc):
* idle timeout script has 10-minute grace period (#1694)
* Dataproc:
* replaced google-api-python-client with google-cloud-sdk (#1730)
* works without gcloud util config installed (#1742)
* credentials can be read from $GOOGLE_APPLICATION_CREDENTIALS
* or from gcloud util config (if installed)
* no longer required to set region or zone (#1732)
* auto zone placement (just set region) is enabled
* defaults to auto zone placement in us-west1
* no longer reads zone or region from gcloud GCE configs
* Dataproc Quickstart is now up-to-date (#1589)
* api_client attr has been replaced with cluster_client and job_client
* GCSFilesystem method changes:
* api_client attr has been replaced with client
* create_bucket() no longer takes a project ID
* delete_bucket() is disabled (use get_bucket(...).delete())
* get_bucket() returns a google.cloud.storage.bucket.Bucket
* list_buckets() is disabled (use get_all_bucket_names())
* EMR:
* much faster error log parsing (#1706)
* may have to wait for logs to transfer to S3 on some AMIs
* tools:
* terminate-idle-job-flows is faster and uses fewer API calls
v0.6.1, 2017-11-27 -- mrjob diagnose
* fixed serious error log parsing issue (#1708)
* added mrjob diagnose utility to find why previously run jobs failed (#1707)
* exposed EMRJobRunner.get_job_steps() (#1625)
v0.6.0, 2017-11-01 -- wave of deprecation
* dropped support for Python 2.6
* use boto3 instead of boto (#1304)
* job output is now byte-based, not line-based:
* runner.fs.cat() now yields chunks of bytes, not lines (#1533)
* runner.cat_output() yields chunks of bytes (#1604)
* runner.stream_output() is deprecated
* job.parse_output() translates chunks of bytes to records (#1604)
* job.parse_output_line() is deprecated
* replaced optparse with argparse (#1587)
* renamed attributes/methods (old name is a deprecated alias)
* add_file_option() -> add_file_arg()
* add_passthrough_option() -> add_passthru_arg()
* configure_options() -> configure_args()
* load_options() -> load_args()
* pass_through_option() -> pass_arg_through()
* self.args -> self.options.args
* duplicate file upload args are passed through to job
* runners:
* all runners:
* file_upload_args kwarg is deprecated (#1656)
* pass path dicts to extra_args instead (see mrjob.setup)
* sim runners (inline and local):
* local mode runs one mapper/reducer per CPU (#1083)
* step_output_dir works (#1515)
* only sort by reducer key unless SORT_VALUES is set (#660)
* files in working dir are marked user-executable (#1619)
* don't crash if os.symlink doesn't work on Windows (#1649)
* input passed to jobs as stdin, not as arguments (#567)
* input decompressed by runner, not mrjob.cat utility
* cloud runners (Dataproc and EMR):
* bootstrap can take archives and dirs as well as files (#1530)
* bootstrap files are now only made executable by current user (#1602)
* extra_cluster_params for unsupported API params (#1648)
* max_mins_idle option (#1686)
* default is 10 minutes
* <10 minutes may result in premature cluster shutdown (see #1693)
* max_hours_idle is deprecated
* EMRJobRunner:
* persistent clusters will always idle-time-out
* default AMI is 5.8.0 (#1594)
* instance_fleets option (#1569)
* instance_groups option (use for EBS volumes) (#1357)
* region names are now case-sensitive
* 'EU' alias for EMR region 'eu-west-1' no longer works (#1538)
* __mrjob_pool_hash and __mrjob_pool_name EMR tags on cluster (#1086)
* __mrjob_version tag on cluster (#1600)
* jobs no longer add tags to cluster (#1565)
* enable_emr_debugging now also works on AMI 4.x and later
* SSH filesystem no longer dumps file contents to memory (#1544)
* bootstrapping errors no longer logged as JSON (#1580)
* "latest" AMI alias no longer works (#1595)
* implicitly dropped support for AMI 2.4.2 and earlier (no Python 2.7)
* deprecated visible_to_all_users option
* mins_to_end_of_hour no longer works (EMR now bills by the second)
* pooling changes:
* ensure that extra instance roles (e.g. "task") can run job (#1630)
* only running instances are counted (#1633)
* boto -> boto3 changes:
* added make_emr_client()
* added make_iam_client()
* removed make_*_conn() methods (see below)
* emr_api_params no longer works (#1574) (use extra_cluster_params)
* boto3 reads $AWS_SESSION_TOKEN, not $AWS_SECURITY_TOKEN
* added fs methods:
* added get_all_bucket_names()
* added make_s3_client()
* added make_s3_resource()
* removed methods that return boto 2 objects (see below)
* create_bucket()'s second arg is now named region, not location
* mrjob tools:
* updated billing calculations in audit-emr-usage (#1688)
* mrjob.util
* deprecated read_file() (#1605) (use mrjob.cat.decompress())
* deprecated read_input()
* removed code (mostly due to deprecation)
* (for runner changes, see mrjob.conf, mrjob.options, and mrjob.runner)
* mrjob.aws:
* removed emr_endpoint_for_region()
* removed emr_ssl_host_for_region()
* removed s3_endpoint_for_region()
* removed s3_location_constraint_for_region()
* mrjob.cat:
* this module is no longer executable
* mrjob.cmd:
* removed deprecated mrjob command aliases:
* create-job-flow
* terminate-idle-job-flows
* terminate-job-flow
* mrjob.conf:
* removed OptionStore class and subclasses (#1615)
* mrjob.emr:
* removed s3_key_to_uri()
* EMRJobRunner:
* removed make_emr_conn() (use make_emr_client())
* removed make_iam_conn() (use make_iam_client())
* removed get_ami_version() (use get_image_version())
* removed get_emr_job_flow_id() (use get_cluster_id())
* removed make_persistent_job_flow() (use make_persistent_cluster())
* see also mrjob.fs.s3.S3Filesystem
* mrjob.fs.base:
* base filesystem methods:
* removed path_exists() (use exists())
* removed path_join() (use join())
* mrjob.fs.s3:
* removed wrap_aws_conn()
* S3Filesystem
* removed make_s3_conn() (use make_s3_client() or make_s3_resource())
* removed get_all_buckets() (use get_all_bucket_names())
* removed get_s3_key()
* removed get_s3_keys()
* removed make_s3_key()
* mrjob.fs.ssh:
* SSHFilesystem
* removed ssh_slave_hosts()
* mrjob.job:
* MRJob:
* see mrjob.launch.MRJobLauncher
* removed loose protocols (#1021)
* mrjob.launch
* MRJobLauncher:
* removed generate_file_upload_args()
* removed generate_passthrough_arguments()
* removed is_mapper_or_reducer() (use is_task())
* removed mr() (use mrjob.step.MRStep)
* removed *job_runner_kwargs()
* removed --partitioner switch
* removed OptionGroups (#1611):
* removed all_option_groups()
* removed *_opt_group attributes
* mrjob.options:
* removed deprecated options (#1022):
* bootstrap_cmds
* bootstrap_files
* bootstrap_scripts
* hadoop_home
* hadoop_streaming_jar_on_emr
* num_ec2_instances
* python_archives
* setup_cmds
* setup_scripts
* strict_protocols
* removed deprecated option aliases:
* ami_version
* aws_availability_zone
* aws_region
* aws_security_token
* base_tmp_dir
* check_emr_status_every
* ec2_core_instance_bid_price
* ec2_core_instance_type
* ec2_instance_type
* ec2_master_instance_bid_price
* ec2_master_instance_type
* ec2_slave_instance_type
* ec2_task_instance_bid_price
* ec2_task_instance_type
* emr_job_flow_id
* emr_job_flow_pool_name
* emr_tags
* hdfs_scratch_dir
* num_ec2_core_instances
* num_ec2_task_instances
* pool_emr_job_flows
* s3_log_uri
* s3_scratch_uri
* s3_sync_wait_time
* s3_tmp_dir
* s3_upload_part_size
* ssh_tunnel_to_job_tracker
* mrjob.parse:
* removed is_windows_path()
* removed iso8601_to_timestamp()
* removed iso8601_to_datetime()
* removed parse_key_value_list()
* removed parse_port_range_list()
* mrjob.retry:
* removed RetryGoRound
* mrjob.runner:
* MRJobRunner:
* removed get_job_name()
* removed OPTION_STORE_CLASS attribute
* removed deprecated passthrough to runner.fs
* removed deprecated JOB_FLOW and *SCRATCH cleanup types
* mrjob.setup:
* removed BootstrapWorkingDirManager
* mrjob.step:
* removed INPUT, OUTPUT attributes from JarStep
* mrjob.ssh (removed entirely)
* mrjob.util:
* removed args_for_opt_dest_subset()
* removed bash_wrap()
* removed buffer_iterator_to_line_iterator()
* removed bunzip2_stream() (now in mrjob.cat)
* removed gunzip_stream() (now in mrjob.cat)
* removed populate_option_groups_with_options()
* removed scrape_options_and_index_by_dest()
* removed scrape_options_into_new_groups()
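The byte-based output change in v0.6.0 means consumers receive arbitrary chunks of bytes, so a record can straddle two chunks. A minimal sketch of regrouping chunks into newline-terminated lines follows; the function name is hypothetical and this is not the code behind job.parse_output().

```python
def chunks_to_lines(chunks):
    """Yield complete lines (as bytes) from an iterable of byte chunks.

    A line split across chunk boundaries is reassembled. A final line
    without a trailing newline is yielded as-is.
    """
    leftover = b''
    for chunk in chunks:
        data = leftover + chunk
        lines = data.split(b'\n')
        # The last piece is incomplete until we see its newline
        leftover = lines.pop()
        for line in lines:
            yield line + b'\n'
    if leftover:
        yield leftover


# The second record is split across the first two chunks
chunks = [b'a\t1\nb', b'\t2\n', b'c\t3']
for line in chunks_to_lines(chunks):
    print(line)
```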
v0.5.12, 2018-07-24 -- v0.6.x backport
* dropped support for Python 2.6 and 3.3
* termination protection doesn't make terminate-idle-clusters crash (#1802)
* mrjob.parse.parse_s3_uri() handles s3a:// URIs (#1709)
* mins_to_end_of_hour option defaults to 60.0, disabling it (#1808)
* always use str in environment dictionaries (affects Python 2 on Windows)
v0.5.11, 2017-08-28 -- tweak report-long-jobs
* report-long-jobs tool can exclude jobs based on tag (#1636)
* mrjob won't crash when inspecting instance fleet clusters (#1639)
v0.5.10, 2017-05-12 -- loose ends
* JSON protocols use rapidjson if ujson unavailable (#1579)
* can also explicitly use RapidJSONProtocol, RapidJSONValueProtocol
* EMR runner:
* aws_security_token option renamed to aws_session_token (#1536)
* EMR and Dataproc runners:
* bootstrapping mrjob no longer stalls if mrjob already installed (#1567)
* master bootstrap script has correct extension: .sh, not .py (#1504)
v0.5.9, 2017-03-20 -- Docker hooks
* fixes which affect Docker:
* task_python_bin option, used by tasks but not setup script (#1394)
* local mode references mrjob/cat.py by relative path, not absolute (#1540)
* EMR runner
* re-launch SSH tunnel when cluster pooling auto-recovers (#1549)
* get job progress using `ssh curl` when tunnel is unavailable (#1547)
* work around `sh -e` setup script bug on AMI 5.2.0+ (#1548)
* renamed emr_applications option to "applications" (#1420)
* small fix to terminate-idle-clusters command's S3 "locking" code (#1545)
v0.5.8, 2017-02-01 -- upload_dirs, pre-filters
* automatically tarball and upload directories with --dir, setup hooks (#23)
* specify path for inter-step output with --step-output-dir (#263)
* jobs:
* better --help printout
* deprecated option groups in MRJobs
* deprecated MRJob.get_all_option_groups()
* overriding *_pre_filter() methods in MRJob works again (#1521)
* all step types accept jobconf (#1447)
* quieted warning about SORT_VALUES on Hadoop 2 (#1286)
* all runners:
* wrap tasks that require pipes with sh_bin, not bash (#1330)
* local runner:
* allows non-zero exit status from pre-filters (#1524)
* pre-filters can now handle compressed input (#1061)
* EMR runner:
* fetch logs from task nodes as well as core nodes (#1400)
* use ListInstances rather than dfsadmin to get node list (#1345)
* moved mrjob.util.bunzip2_stream() to mrjob.cat
* moved mrjob.util.gunzip_stream() to mrjob.cat
* mrjob.util.parse_and_save_options() now returns dict, not defaultdict
* deprecated:
* mrjob.util.args_for_opt_dest_subset()
* mrjob.util.bash_wrap()
* mrjob.util.populate_option_groups_with_options()
* mrjob.util.scrape_options_and_index_by_dest()
* mrjob.util.tar_and_gz()
* SSHFilesystem.ssh_slave_hosts()
v0.5.7, 2016-12-19 -- Spark
* EMR and Hadoop runners:
* full support for Spark (#1320)
* includes spark() method in MRJob and SparkStep/SparkScriptStep
* can use environment variables and ~ in hadoop_streaming_jar option
* EMR runner:
* default AMI version is now 4.8.2 (#1486)
* default instance type is m1.large when running Spark jobs (#1465)
* added debug logging for matching available pooled clusters (#1449)
* defaults to cheapest instance type that will work (#1369)
* master bootstrap script always created when pooling
* no longer crashes when trying to use missing ssh binary (#1474)
* pooled clusters may have 1000 steps (#1463)
* failed jobs no longer reported as 100% complete (#793)
* All runners:
* py_files option for Spark and streaming steps (#1375)
* bootstrap mrjob with a .zip rather than a tarball
* options refactor, added missing command-line switches (#1439)
* mrjob terminate-idle-clusters works with all step types (#1363)
* log interpretation
* dropped unnecessary container-to-attempt-ID mapping (#1487)
* more efficient search for task log errors (#1450)
* cleaner error messages when bootstrapped mrjob won't compile
* JarSteps
* now support libjars, jobconf (#1481)
* JarStep.{INPUT,OUTPUT} are deprecated (use mrjob.step.{INPUT,OUTPUT})
* is_uri() now only matches URIs containing "://" (#1455)
* works in Anaconda3 Jupyter Notebook (#1441)
* deprecated mrjob.parse.is_windows_path()
* deprecated mrjob.parse.parse_key_value_list()
* deprecated mrjob.parse.parse_port_range_list()
* deprecated mrjob.util.scrape_options_into_new_groups()
* deprecated non-strict protocols (#1452)
* deprecated python_archives (#1056)
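The is_uri() change in v0.5.7 (#1455) can be illustrated with a minimal sketch. This is not mrjob's own code: looks_like_uri() is a hypothetical name, and the check shown here (require "://" and a parseable scheme) is an assumption about the intended behavior, written against the standard library only.

```python
from urllib.parse import urlparse


def looks_like_uri(path):
    """Return True only for strings that contain "://" and have a scheme.

    Hypothetical helper, not part of mrjob. Requiring "://" keeps
    Windows paths such as C:\\data\\input.txt from being treated as URIs.
    """
    return '://' in path and bool(urlparse(path).scheme)


if __name__ == '__main__':
    for candidate in ('s3://bucket/key', 'hdfs:///user/me', 'C:\\data\\input.txt'):
        print(candidate, looks_like_uri(candidate))
```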
v0.5.6, 2016-09-12 -- dataproc crash fix
* Dataproc runner:
* fix Hadoop version crash on unknown image version (#1428)
* EMR and Hadoop runners:
* prioritize task errors as probable cause of failure (#1429)
* ignore Java stack trace in task stderr logs (#1430)
v0.5.5, 2016-09-05 -- missing ami_version option
* EMR runner:
* ami_version option is deprecated, not removed as it was in v0.5.4 (#1421)
* update memory/CPU stats for EC2 instances for pooling (#1414)
* pooling treats application names as case-insensitive (#1417)
v0.5.4, 2016-08-26 -- pooling auto-recovery
* jobs:
* pass_through_option(), for existing command-line options (#1075)
* MRJob.options.runner now defaults to None, not 'inline' or 'local'
* runners:
* all:
* names of uploaded files now never start with . or _ (#1200)
* Hadoop:
* log parsing:
* handles more log4j patterns (#1405)
* gracefully handles IOError from exists() (#1355)
* fixed crash bug in Hadoop FS on Python 3 (#1396)
* EMR:
* pooling auto-recovers from joining a cluster that self-terminated (#708)
* log fetching uses sudo on 4.3.0+ AMIs (#1244)
* fixed broken --ssh-bind-ports switch (#1402)
* idle termination script now only runs on master node (#1398)
* ssh tunnel connects to internal IP of resource manager (#1397)
* AWS credentials no longer logged in verbose mode (#1353)
* many option names are now more generic (#1247)
* ami_version -> image_version
* accidentally removed ami_version option entirely (fixed in v0.5.5)
* aws_availability_zone -> zone
* aws_region -> region
* check_emr_status_every -> check_cluster_every
* ec2_core_instance_bid_price -> core_instance_bid_price
* ec2_core_instance_type -> core_instance_type
* ec2_instance_type -> instance_type
* ec2_master_instance_bid_price -> master_instance_bid_price
* ec2_master_instance_type -> master_instance_type
* ec2_task_instance_bid_price -> task_instance_bid_price
* ec2_task_instance_type -> task_instance_type
* emr_tags -> tags
* num_ec2_core_instances -> num_core_instances
* num_ec2_task_instances -> num_task_instances
* s3_log_uri -> cloud_log_dir
* s3_sync_wait_time -> cloud_fs_sync_secs
* s3_tmp_dir -> cloud_tmp_dir
* s3_upload_part_size -> cloud_upload_part_size
* num_ec2_instances is deprecated (use num_core_instances)
* ec2_slave_instance_type is deprecated (use core_instance_type)
* hadoop_streaming_jar_on_emr is deprecated (#1405)
* hadoop_streaming_jar handles this instead with file:// URIs
* bootstrap_python does nothing on AMI 4.6.0+, as not needed (#1358)
* mrjob audit-emr-usage should show less/no API throttling warnings (#1091)
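The option renames listed under v0.5.4 (#1247) can be captured as a lookup table. The sketch below is a migration aid only, not mrjob code: RENAMED_OPTS and rename_opts() are hypothetical names, and the rule that a new-style name wins when both spellings are present is an assumption.

```python
# Old option name -> new option name, copied from the v0.5.4 entries above.
RENAMED_OPTS = {
    'ami_version': 'image_version',
    'aws_availability_zone': 'zone',
    'aws_region': 'region',
    'check_emr_status_every': 'check_cluster_every',
    'ec2_core_instance_bid_price': 'core_instance_bid_price',
    'ec2_core_instance_type': 'core_instance_type',
    'ec2_instance_type': 'instance_type',
    'ec2_master_instance_bid_price': 'master_instance_bid_price',
    'ec2_master_instance_type': 'master_instance_type',
    'ec2_task_instance_bid_price': 'task_instance_bid_price',
    'ec2_task_instance_type': 'task_instance_type',
    'emr_tags': 'tags',
    'num_ec2_core_instances': 'num_core_instances',
    'num_ec2_task_instances': 'num_task_instances',
    's3_log_uri': 'cloud_log_dir',
    's3_sync_wait_time': 'cloud_fs_sync_secs',
    's3_tmp_dir': 'cloud_tmp_dir',
    's3_upload_part_size': 'cloud_upload_part_size',
}


def rename_opts(opts):
    """Return a copy of opts with old option names replaced by new ones.

    Hypothetical helper. If both the old and the new name are present,
    the value given under the new name is kept (an assumption).
    """
    result = {}
    # First pass: copy options that are not in the rename table.
    for name, value in opts.items():
        if name not in RENAMED_OPTS:
            result[name] = value
    # Second pass: add renamed options unless the new name is already set.
    for name, value in opts.items():
        if name in RENAMED_OPTS:
            result.setdefault(RENAMED_OPTS[name], value)
    return result
```

For example, rename_opts({'aws_region': 'us-west-2'}) yields {'region': 'us-west-2'}; options that were not renamed pass through unchanged.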
v0.5.3, 2016-07-15 -- libjars
* jobs:
* LIBJARS and libjars method (#1341)
* runners:
* all:
* .cpython-3*.pyc files no longer included when bootstrapping mrjob
* local:
* *PATH envvars combined with local separator (#1321)
* Hadoop and EMR:
* libjars option (#198)
* fixes to ordering of generic and JAR-specific options (#1331, #1332)
* Hadoop:
* more default log dirs (#1339)
* hadoop_tmp_dir handles ~ and envvars (#1322) (broken in v0.5.0)
* EMR:
* determine cause of failure of bootstrap scripts (#370)
* master bootstrap script now redirects stdout to stderr
* emr_configurations option (#1276)
* subnet option (#1323)
* SSH tunnel opened as soon as cluster is ready (#1115)
* SSH tunnel leaves stdin alone (#1161)
* combine_lists() treats dicts as values, not sequences
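The combine_lists() behavior described in v0.5.3 can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: combine_lists_sketch() is a hypothetical name, and treating strings, bytes, dicts, and non-iterable scalars as single values is inferred from the changelog entries.

```python
def combine_lists_sketch(*values):
    """Concatenate sequences into one list.

    Hypothetical stand-in for the behavior described above: None is
    skipped; strings, bytes, and dicts are appended as single values
    rather than iterated over; other non-iterables are appended as-is.
    """
    result = []
    for value in values:
        if value is None:
            continue
        if isinstance(value, (str, bytes, dict)):
            # A dict is one value, not a sequence of its keys.
            result.append(value)
            continue
        try:
            result.extend(value)
        except TypeError:
            # Not iterable, so treat it as a scalar.
            result.append(value)
    return result
```

The dict case is the one that changed: iterating over a dict would add only its keys to the result, silently dropping the values.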
v0.5.2, 2016-05-23 -- initial Cloud Dataproc support
* basic support for Google Cloud Dataproc (#1243)
* lacks log interpretation, JarStep support
* on EMR, wait for steps to complete in correct order (#1316)
* correctly handle ~ in include path in mrjob.conf (#1308)
* new emr_applications option (#1293)
* fix running deprecated tools with python -m (#1312)
* fix ssh tunneling to 2.x AMIs on EMR in VPCs (#1311)
v0.5.1, 2016-04-29 -- post-release bugfixes
* strict_protocols in mrjob.conf is no longer ignored (#1302)
* check_input_paths in mrjob.conf is no longer ignored
* partitioner() is no longer ignored, fixing SORT_VALUES (#1294)
* --partitioner switch is deprecated
* improved probable cause of error from pre-YARN logs (#1288)
* ssh_bind_ports now defaults to (x)range, not list (#1284)
* mrjob terminate-idle-clusters handles debugging jar from boto 2.40.0 (#1306)
v0.5.0, 2016-03-28 -- the future is in the past
* supports Python 3 (#989)
* requires boto 2.35.0 or newer (#980)
* removed many workarounds for S3 and EMR (#980), IAM (#1062)
* jobs:
* is_mapper_or_reducer() is now is_task() (#1072)
* mr() no longer takes positional arguments (#814)
* removed jar() (use mrjob.step.JarStep)
* removed testing methods parse_counters() and parse_output()
* protocols:
* protocols are strict by default (#724)
* JSON protocols use ujson when available, then simplejson (#1002, #1266)
* can explicitly choose Standard, Simple or Ultra JSON protocol
* raw protocols handle bytes or unicode depending on Python version
* can explicitly choose Text or Bytes protocol
* mrjob.step:
* JarStep only takes "args" and "main_class" keyword args
* removed MRJobStep (use MRStep)
* runners:
* All runners:
* totally revamped log handling (#1123)
* runner status/log messages are less noisy (#1044)
* don't bootstrap mrjob if interpreter is set (#1041)
* fs methods path_exists() and path_join() are now exists() and join()
* deprecation warning: use runner.fs explicitly (#1146)
* changes to cleanup options:
* removed IS_SUCCESSFUL (use ALL)
* LOCAL_SCRATCH is now LOCAL_TMP (#318)
* new HADOOP_TMP option handles HDFS cleanup (#1261)
* REMOTE_SCRATCH is now CLOUD_TMP (#1261)
* base_tmp_dir option is now local_tmp_dir (#318)
* non-inline runners raise StepFailedException on step failure (#1219)
* steps_python_bin defaults to current python interpreter (#1038)
* _job_name is now _job_key (#982)
* EMR:
* default AWS region is us-west-2 (#1025)
* default instance type is m1.medium (#992)
* visible_to_all_users defaults to true (#1016)
* matches your minor version of Python 2 on 3.x and 4.x AMIs (#1265)
* 4.x AMIs are supported (#1105)
* added --release-label switch (--ami-version 4.x.y also works)
* can fetch counters and probable cause of failure on 3.x and 4.x AMIs
* SSH tunnel now works on 3.x and 4.x AMIs (#1013)
* ssh_tunnel_to_job_tracker option is now ssh_tunnel
* correctly fetch step logs by step ID (#1117)
* bootstrap_python option
* s3_scratch_uri option is now s3_tmp_dir (#318)
* aws_region is no longer inferred from s3_tmp_dir
* create/select temp bucket in same region as EMR jobs (#687)
* added iam_endpoint option (#1067)
* removed s3_conn args from methods in EMRJobRunner and S3Filesystem
* S3 Filesystem:
* connect to each S3 bucket on appropriate endpoint (#1028)
* fall back to default if we can't get bucket location (#1170)
* removed special treatment of _$folder$ keys
* removed deprecated S3Filesystem method get_s3_folder_keys()
* recurse "subdirectories" even if uri lacks trailing / (#1183)
* removed iam_job_flow_role option (use iam_instance_profile)
* custom hadoop_streaming_jar gets properly uploaded
* job cleanup temporarily disabled (#1241)
* pooling respects key pair (#1230)
* idle cluster self-termination respects non-streaming jobs (#1145)
* deprecated "latest" AMI version not passed through to EMR (#1269)
* emr_job_flow_id option is now cluster_id (#1082)
* emr_job_flow_pool_name is now pool_name (#1082)
* pool_emr_job_flows is now pool_clusters (#1082)
* Hadoop
* works out of the box on most Hadoop setups (#1160)
* works out of the box inside EMR (2.x, 3.x, and 4.x AMIs)
* counters are parsed from Hadoop binary stderr in YARN (#1153)
* can find logs and probable cause of failure in YARN (#1195)
* will search in <output dir>/_logs, to support Cloudera (#565)
* HDFS Filesystem:
* use fs -ls -R and fs -rm -R in YARN (#1152)
* mkdir() now uses -p on YARN (#991)
* fs.du() now works on YARN (#1155)
* fs.du() now returns 0 for nonexistent files instead of erroring
* fs.rm() now uses -skipTrash
* dropped support for Hadoop prior to 0.20.203 (#1208)
* added hadoop_log_dirs option
* hdfs_scratch_dir option is now hadoop_tmp_dir (#318)
* hadoop_home is deprecated
* uses -D and correct property name when step has no reduces (#1213)
* Inline/Local
* runner.fs raises IOError if passed URIs (#1185)
* version-agnostic by default (#735)
* removed ignored hadoop_extra_args and hadoop_streaming_jar opts (#1275)
* inline runner uses multiple splits by default (#1276)
* removed mrjob.compat.get_jobconf_value() (use jobconf_from_env())
* removed mrjob.compat methods to support Hadoop prior to 0.20.203:
* supports_combiners_in_hadoop_streaming()
* supports_new_distributed_cache_options()
* uses_generic_jobconf()
* removed mrjob.conf.combine_cmd_lists()
* removed fetch-logs tool (#1127)
* mrjob subcommands use "cluster" rather than "job-flow" (#1082)
* create-job-flow is now create-cluster
* terminate-idle-job-flows is now terminate-idle-clusters
* terminate-job-flow is now terminate-cluster
* Python-version-specific mrjob-x and mrjob-x.y commands (#1104)
* use followlinks=True with os.walk()
* all internal constants/functions/methods explicitly start with _ (#681)
* mrjob.util:
* file_ext() takes filename, not path
* random_identifier() moved here from mrjob.aws
* buffer_iterator_to_line_iterator() is now to_lines()
* to_lines() no longer appends a newline to data (#819)
* removed extract_dir_for_tar()
* gunzip_stream() now yields chunks, not lines
* removed hash_object()
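The mrjob.util entries in v0.5.0 say that gunzip_stream() yields chunks and that to_lines() turns chunks into lines without appending a trailing newline (#819). A minimal sketch of that chunk-to-line conversion is below. It is not mrjob's code: to_lines_sketch() is a hypothetical name, and it assumes the chunks are bytes.

```python
def to_lines_sketch(chunks):
    """Yield newline-terminated lines from an iterable of bytes chunks.

    Hypothetical helper. The final line is yielded exactly as it
    appears in the data: no newline is added if the data lacks one.
    """
    leftovers = []  # pieces of a line that spans several chunks
    for chunk in chunks:
        start = 0
        while True:
            end = chunk.find(b'\n', start) + 1
            if end == 0:  # no more newlines in this chunk
                break
            if leftovers:
                leftovers.append(chunk[start:end])
                yield b''.join(leftovers)
                leftovers = []
            else:
                yield chunk[start:end]
            start = end
        if start < len(chunk):
            leftovers.append(chunk[start:])
    if leftovers:
        yield b''.join(leftovers)
```

Lines that straddle a chunk boundary are rejoined, so the output does not depend on where the decompressor happened to split the data.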
v0.4.6, 2015-11-09 -- config files
* PyYAML>=3.08 is required
* !clear tag in conf files (#1162)
* combine_lists() and combine_path_lists() can handle scalars (#1172)
* include: paths in conf files are relative to real path of conf file (#1166)
* mrjob.conf.combine_cmd_lists() is deprecated (#1168)
* EMR runner: pool_wait_minutes can now be loaded from mrjob.conf (#1070)
* support for wheel packaging format (#1140)
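The include: path rule from v0.4.6 (#1166) can be sketched with the standard library. This is only an illustration: resolve_include() is a hypothetical name, and expanding ~ before joining is an assumption about the order of operations.

```python
import os


def resolve_include(conf_path, include_path):
    """Resolve an include: path relative to the conf file's real location.

    Hypothetical helper. Symlinks in conf_path are resolved first, so a
    symlinked mrjob.conf finds includes next to the file it points at.
    Absolute include paths are returned unchanged by os.path.join().
    """
    include_path = os.path.expanduser(include_path)
    conf_dir = os.path.dirname(os.path.realpath(conf_path))
    return os.path.join(conf_dir, include_path)
```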
v0.4.5, 2015-07-28 -- DescribeJobFlows begone
* boto>=2.6.0 is required (used to be 2.2.0)
* runners:
* EMR:
* moved off deprecated DescribeJobFlows API (#876)
* time-to-end-of-hour now uses creation time, not "start" time
* aws_security_token for temporary credentials (#1003)
* Use AWS managed policies when creating IAM objects (#1026)
* Fall back to default role/instance profile when no IAM access (#1008)
* added emr_tags option (#1058)
* added get_ami_version() method
* hadoop_version option no longer has any effect (#1017)
* Hadoop:
* --hadoop-home switch now works (#1037)
* EMR tools:
* added switches for AWS connection options etc. (#1087)
* mrboss is available from command line tool: mrjob boss [args]
* terminate_idle_job_flows:
* less prone to race condition (#910)
* prints results to stdout in dry_run mode (#1102)
* job flows stuck in STARTING state no longer considered idle
* report_long_jobs reports job flows stuck in STARTING state
* collect_emr_stats and job_flow_pool are deprecated
* more efficient decoding of bz2 files
* mrjob.retry.RetryWrapper raises exception when out of tries (#1093)
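The RetryWrapper change in v0.4.5 (#1093) concerns what happens once retries are exhausted. The sketch below shows the general pattern only: call_with_retries() and its parameters are hypothetical and do not reflect RetryWrapper's actual signature.

```python
import time


def call_with_retries(func, retry_if, max_tries=3, backoff=0.0):
    """Call func(), retrying on exceptions that retry_if() accepts.

    Hypothetical helper. When the tries run out, the last exception is
    re-raised instead of being swallowed.
    """
    tries = 0
    while True:
        try:
            return func()
        except Exception as ex:
            tries += 1
            if not retry_if(ex) or tries >= max_tries:
                raise
            time.sleep(backoff)
```

Re-raising on the last attempt means callers see the real failure rather than a silent None.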
v0.4.4, 2015-04-21 -- EMRgency!
* runners:
* EMR:
* Create IAM objects as needed (unbreaks mrjob for new accounts) (#999)
* --iam-job-flow-role renamed to --iam-instance-profile (#1001)
* new --iam-service-role option (#1005)
v0.4.3, 2015-04-08 -- SO many bugfixes
* jobs:
* MRStep's constructor treats kwarg=None same as not setting it (#970)
* parse_counters() and parse_output() are deprecated (#829)
* self.mr is deprecated in favor of MRStep (#815)
* runners:
* All runners:
* You can now set strict_protocols from mrjob.conf (#726)
* new --no-strict-protocols command-line option
* streaming output from closed runner shows a warning (#853)
* EMR:
* --check-input-paths and --no-check-input-paths options (#864)
* skip (very slow) validation of s3 buckets if boto < 2.25.0 (#865)
* Fix for max_hours_idle bug that was terminating job flows early (#932)
* --emr-api-param allows users to pass additional parameters to boto's
EMR API (#879)
* unset parameters with --no-emr-api-param
* bootstrap_python_packages (deprecated) now works on 3.x EMR AMIs (#863)
* Use TERMINATE_CLUSTER instead of deprecated TERMINATE_JOB_FLOW (#974)
* updated EC2 instance type data for pooling (#995)
* Hadoop:
* exclude hadoop source jars when looking for streaming jar (#861)
* Fixed mkdir_on_hdfs for Hadoop version 2.x (#923)
* Fixed hadoop_bin on Windows (#843)
* Local
* bootstrap mrjob by default (#984)
* Inline
* fix for add_file_option() (#851)
* cd to job's working directory before instantiating mrjob class (#988)
* Use pytest to run tests (#898)
* collect-emr-active-stats subcommand (#947)
* Using xtrace flag to get more output during bootstrap (#943)
* Fixed log printouts for command line tools (#901)
* Fix to avoid interpreting windows paths as URIs (#880)
* Better error message when ssh keyfile is missing (#858)
* Update EMR tool ISO8601 parsing to be consistent with EMR runner (#869)
* Dropped support for Python 2.5 (#713)
* Dropped support for the 1.x EMR AMI series, which uses Python 2.5
v0.4.2, 2013-11-27 -- that's one small step for a JAR
* jobs:
* can interpolate input and output path(s) into arguments of JarSteps,
so they can be part of multi-step jobs (#773)
* see mrjob/examples/mr_jar_step_example.py
* JarStep now takes keyword arguments only (#769)
* removed useless "name" field; "step_args" is now just "args"
* MRJobStep (usually accessed via MRJob.mr()) is now MRStep
* runners:
* All runners:
* --setup is now fully functional (#206)
* --python-archive, --setup-cmd, and --setup-script are deprecated
* --bootstrap option works and uses sh (#206)
* --bootstrap-cmd, --bootstrap-file, --bootstrap-python-package,