Releases: macrocosm-os/finetuning

Release 2.6.0

05 Dec 03:54
c6dce9d

Announcing Release 2.6.0

This release introduces a second competition, INSTRUCT 8B, starting at block 4,451,695. As we strive for SOTA models, we believe miners should have as much flexibility as possible - hence, this competition allows you to bring your own tokenizer!

Other updates

  • This release also updates the subnet to bittensor 8.4.3. We have found the logs quite noisy after this update, but we wanted to get this release out to kick off the new competition; validator logging will be greatly improved in the next release.
  • Note: one benign log line that shows up frequently is File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv raise EOFError
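
That EOFError comes from Python's standard multiprocessing pipes. A minimal standalone illustration (not subnet code) of how it arises when a subprocess closes its end of a connection:

```python
# Minimal illustration of the benign EOFError: recv() on a connection whose
# peer has closed raises EOFError once the buffered data is drained.
from multiprocessing import Pipe

parent_conn, child_conn = Pipe()
child_conn.send("last message")
child_conn.close()  # peer shuts down, e.g. a finished eval subprocess

print(parent_conn.recv())  # buffered data is still delivered
try:
    parent_conn.recv()     # nothing left and the peer is closed
except EOFError:
    print("EOFError: peer closed the pipe")
```

This is why the log is benign: it simply signals that the other side of a pipe has exited.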

Validators should update as soon as they can. Note that due to requirement version updates you will need to rerun
python -m pip install -e .

Release 2.5.1

21 Nov 16:24
1237391

This is a small release that includes a fix for the upcoming IfEval task and an improvement for validator weight setting.

Subnet

  • IfEval rules regarding word and sentence counts have been adjusted to align better with token generation limits. Thanks @PawKanarek

Validators

  • Create a new subtensor instance for each set weights attempt.

Release 2.5.0

19 Nov 03:22
7f23ad9

This release adds a fourth evaluation task (IfEval) into the current competition starting on block 4,344,030. At this time the weighting of each task will be 85% MMLU, 5% Word Sorting, 5% Fineweb, and 5% IfEval.

Subnet

  • Added a new IfEval (Instruction Following) evaluation task.
  • This evaluation scores models on how well they follow generated rules about their response. To start, the rules cover casing, comma usage, word count, and sentence count.
  • It includes a check that models are generating reasonable output, i.e. that they are not reusing the same response for the same rules when asked different questions.
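
A hypothetical sketch of what rule checks like these could look like; the rule names and logic here are illustrative assumptions, not the subnet's actual implementation:

```python
# Illustrative IfEval-style rule checks (rule names are hypothetical).
def check_response(response: str, rules: dict) -> bool:
    """Return True if the response satisfies every generated rule present."""
    words = response.split()
    sentences = [s for s in response.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    checks = {
        "all_lowercase": lambda: response == response.lower(),
        "no_commas": lambda: "," not in response,
        "max_words": lambda: len(words) <= rules.get("max_words", float("inf")),
        "min_sentences": lambda: len(sentences) >= rules.get("min_sentences", 0),
    }
    # Only rules actually generated for this sample are enforced.
    return all(checks[name]() for name in rules if name in checks)

print(check_response("keep it short and simple.", {"all_lowercase": True, "no_commas": True}))  # True
```

A scorer built this way can mix and match rules per prompt, which is what makes reusing one canned response across different questions detectable.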

Validators

  • The expected time per evaluation cycle has increased due to the new evaluation task.

  • TTLs have been adjusted, and each model is now required to complete all evaluation tasks within 12 minutes.

  • Alpha has also been adjusted. Models should first receive weight after 2 cycles (~360 blocks) and will receive all weight after 17 cycles (~3060 blocks) of consecutive wins.

  • Output width is set explicitly to improve readability of pm2 rich tables in logging. Thanks coldint!
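
As a quick sanity check on the figures above, both block counts imply the same evaluation cycle length:

```python
# The stated schedule: first weight after 2 cycles (~360 blocks), all weight
# after 17 cycles (~3060 blocks). Both imply ~180 blocks per cycle.
blocks_first_weight, cycles_first_weight = 360, 2
blocks_full_weight, cycles_full_weight = 3060, 17

print(blocks_first_weight // cycles_first_weight)  # 180
print(blocks_full_weight // cycles_full_weight)    # 180
```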

Miners

This release requires rerunning pip install -e . to pick up the latest dependencies.

Release 2.4.1

14 Nov 17:37
726fb93

Announcing Release 2.4.1

This release is focused on improving vTrust by adjusting the speed at which models receive (and lose) weight internally for each validator.

Subnet

  • The leaderboard has been updated to better handle old models becoming the top model due to a competition adjustment.

Validators

  • The alpha validators use for their weight moving average has been adjusted from 0.5 to 0.05.

    • This will improve vTrust when a new top model arrives, since validators will no longer shift their weights so rapidly.
  • The minimum internal weight before a validator starts setting weights on the chain for a miner has been adjusted to 0.1.

    • This will help avoid blips where one model gets lucky on a single set of samples.
    • It takes 3 cycles of winning in a row to go from 0 weight to 0.143 weight and cross the 0.1 threshold.
    • It takes 45 cycles of winning in a row to go from 0 weight to 0.901 weight and ensure one model receives all of the weight.
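
The two figures above are consistent with a standard exponential moving average at alpha = 0.05 where the winner's target weight is 1; a small sketch (assuming that update rule) reproduces them:

```python
# Sketch of the moving average implied by the notes: with alpha = 0.05,
# after n consecutive wins a model's internal weight is 1 - (1 - alpha)**n.
ALPHA = 0.05

def weight_after_wins(n: int, alpha: float = ALPHA) -> float:
    w = 0.0
    for _ in range(n):
        w = (1 - alpha) * w + alpha * 1.0  # the winner's target weight is 1
    return w

print(round(weight_after_wins(3), 3))   # 0.143 -> crosses the 0.1 threshold
print(round(weight_after_wins(45), 3))  # 0.901 -> effectively all the weight
```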

Release 2.4.0

07 Nov 03:00
f9c499a

This release incorporates a third evaluation task (Fineweb) into the current competition starting on block 4,250,808. At this time the weighting of each dataset will be 90% MMLU, 5% Word Sorting, and 5% Fineweb.

Subnet

  • Added new Fineweb evaluation task.
    • This evaluation scores models on the average cross-entropy loss computed over samples from Fineweb.
    • It is the same evaluation used in subnet 9; including it helps ensure the finetuned models do not lose too much of their original context.
    • Includes a check to make sure models are generating reasonable output, i.e. that they are not too repetitive within or across responses.
  • Improved definition of the competition schedule to include eval tasks.
    • This makes it easier to add new evaluations to competitions at specific weights and makes it easier to view them as a miner.
    • See COMPETITION_SCHEDULE_BY_BLOCK in constants/__init__.py to view it for yourself.
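
The Fineweb score above is just the mean negative log-likelihood the model assigns to the true tokens. A toy illustration with stand-in probabilities (not the actual evaluation code):

```python
# Illustrative-only cross-entropy scoring: lower is better. The probability
# lists stand in for a model's per-token predictions on Fineweb samples.
import math

def avg_cross_entropy(token_probs: list[float]) -> float:
    """Mean negative log-likelihood over the sampled tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

confident = [0.9, 0.8, 0.95]   # model assigns high probability to true tokens
uncertain = [0.2, 0.1, 0.25]

print(avg_cross_entropy(confident) < avg_cross_entropy(uncertain))  # True
```

A model that has drifted far from its pretraining distribution assigns lower probability to Fineweb text, so its average loss rises and its score suffers.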

Validators

  • Improved the logic around strategy selection for sharing files across subprocess boundaries. This will help avoid overflowing /dev/shm.

Miners

  • The new dataset loader for the fineweb task can be found at https://github.com/macrocosm-os/finetuning/blob/main/finetune/datasets/hugging_face/hugging_face_loader.py.

    • As mentioned, this will be incorporated into the existing competition starting at block 4,250,808, so please take this into consideration for your training.
    • Note that this loader supports general Hugging Face datasets. Constants are currently included for Falcon and Fineweb; the current competition only uses Fineweb data.

Validators should update as soon as they can. Note that due to requirement version updates you will need to rerun
python -m pip install -e .

Release 2.3.0

01 Nov 21:10
9598e92

This release addresses the current wandb sampling issue from SN 1 and adds functionalities to improve v-trust.

V-trust improvements:

  • We've improved the PromptingDatasetLoader to fetch samples more reliably and consistently. Validators will now fetch 700 samples instead of 400.
  • Validators now align to "sync blocks" so they use the same set of eval samples, and to pace how frequently evaluations are performed. This should improve v-trust across the board, particularly in situations where the top model changes.
  • Miner weights are now fully winner-takes-all: exactly one model will receive weight. Previously a second model could receive a small amount of weight (due to soft-maxing of weights) if enough models were evaluated in a batch.
  • Added better retry behavior for set_weights.
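
A hedged sketch of the weighting change (scores and UIDs are made up; softmax_weights and winner_takes_all are illustrative names, not the subnet's functions):

```python
# Before: a softmax over scores could leave a small residual weight on the
# runner-up. After: winner-takes-all puts weight 1.0 on exactly one model.
import math

def softmax_weights(scores: dict[int, float]) -> dict[int, float]:
    z = sum(math.exp(s) for s in scores.values())
    return {uid: math.exp(s) / z for uid, s in scores.items()}

def winner_takes_all(scores: dict[int, float]) -> dict[int, float]:
    best = max(scores, key=scores.get)
    return {uid: 1.0 if uid == best else 0.0 for uid in scores}

scores = {11: 2.0, 42: 2.3, 99: 0.5}
print(winner_takes_all(scores))  # {11: 0.0, 42: 1.0, 99: 0.0}
```

Because every validator now emits the same all-or-nothing vector for a given winner, their on-chain weights agree more closely, which is what improves v-trust.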

Release 2.2.1

28 Oct 05:13
ecd856f

This is a minor release to address the current ongoing issue with SN 1's wandb integration. If there are no samples to use for synthetic MMLU, that evaluation task will be skipped and all weight will be given to the remaining evaluation tasks (currently, just word sorting).

Other fixes

Fixes a model lookup issue that can occur if a hotkey is reregistered.

Release 2.2.0

24 Oct 01:58
ed975ab
  • Adds a new Word Sorting eval task at block 4,139,465. At first, it is worth 2.5% of a miner's score.
  • Fixed grace period check during model cleanup to respect the most recent file instead of the oldest file in a folder.
  • Validators now use a seed generated from the hash of a recent block for the dataset loaders. This will improve vTrust as validators will use the same seed if they evaluate within the same ~30 minute window.
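
A sketch of the assumed mechanics (function name and hash choice are illustrative): deriving the RNG seed from a recent block hash makes every validator in the same window draw the same samples:

```python
# Hypothetical sketch: map a recent block hash to a shared 32-bit seed so
# validators evaluating in the same ~30 minute window sample identically.
import hashlib
import random

def seed_from_block_hash(block_hash: str) -> int:
    """Same input hash -> same seed, on every validator."""
    return int(hashlib.sha256(block_hash.encode()).hexdigest(), 16) % 2**32

block_hash = "0xabc123"  # illustrative value, not a real chain hash
rng_a = random.Random(seed_from_block_hash(block_hash))
rng_b = random.Random(seed_from_block_hash(block_hash))
print(rng_a.sample(range(1000), 5) == rng_b.sample(range(1000), 5))  # True
```

Validators that seed their dataset loaders this way evaluate the same samples without any direct coordination, which is what lifts vTrust.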

Release 2.1.2

08 Oct 02:59
8e2b0cb

VTrust improvements and code cleanups.

  • Use a deterministic generation configuration for model evaluation. This ensures that validators evaluating the same models over the same samples will get the same results.
  • Increase the number of samples from 300 to 400.
  • Cleaned up code relating to the now-deprecated competition. The provided miner is now a shell that needs to be filled in based on your training strategy.
  • Fixed a few example notebooks to work against the refactored codebase.

Release 2.1.1

14 Sep 03:28
656c4a8

Hotfix release to address the wandb issue that causes the main thread to hang indefinitely.

There are currently 7 running runs in the prompting wandb project. One of those runs (hhodrv2s) is poisoned, and all attempts to perform a history scan on it result in a 502 from wandb. Furthermore, the wandb client will retry infinitely, which is ridiculous.

This change addresses the issue in 2 ways:

  1. We reimplement the wandb history client so we can add a sane number of retries (3). We combine this with a reduction in collected samples to 300, making it more likely we'll fulfill the 300 samples from 6 runs should one be poisoned in the future.
  2. We also use a sampled history scan, which additionally filters (server-side) the steps returned to only those that contain the requested keys. The returned steps also only contain the requested metrics. As a result, it now takes a few seconds to load 300 samples rather than the ~1-2 minutes before!
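
A hypothetical sketch of the bounded-retry idea in point 1; the names are illustrative, and the real client wraps wandb's history scan rather than a toy fetch:

```python
# Illustrative bounded retries: give up after 3 attempts instead of retrying
# forever, so one poisoned run cannot hang the main thread indefinitely.
def fetch_with_retries(fetch, max_retries: int = 3):
    last_error = None
    for _ in range(max_retries):
        try:
            return fetch()
        except Exception as e:  # e.g. a 502 from a poisoned run
            last_error = e
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error

calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    raise ConnectionError("502 Bad Gateway")

try:
    fetch_with_retries(flaky_fetch)
except RuntimeError:
    print(f"attempts made: {calls['n']}")  # attempts made: 3
```

With the cap in place, a permanently failing run costs three attempts and is then skipped, while the reduced 300-sample target can still be met from the remaining healthy runs.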