zos bcachefs assessment #2396

Open
iwanbk opened this issue Aug 14, 2024 · 4 comments

@iwanbk
Member

iwanbk commented Aug 14, 2024

Assess how we can use bcachefs on zos

related issues:

Is your feature request related to a problem? Please describe

Why we need to move away from btrfs:

  • ...

Why bcachefs:

  • improve HDD performance by using the SSD as an internal cache
  • improved HDD performance -> more HDD usage -> more economical
  • ....

scope:

  • bcachefs will only be used for the workload, because.....
  • not using LVM

Describe the solution you'd like

The assessment will be done in two phases

  1. backward compatibility check

We do this check because we need to know how btrfs is currently used in zos, for these reasons:

  • seeing how btrfs features are used in the current zos can help us design the bcachefs usage.
    examples: how subvolume limit & usage are used, how the nocow file attribute is currently used
  • I am also new to zos and need to understand the full flow related to btrfs usage

For things that are compatible: good
For incompatible things:

  • check if we really need it
  • if needed, how we work around it

  2. plan/specs to use bcachefs on zos

  • it doesn't need to be backward compatible
  • employ the multi-device filesystem features of bcachefs
  • one idea is to create one partition for each VM. This way we won't have issues with quota, but it may come with another trade-off, as briefly mentioned by Azmy at support bcachefs #2074 (comment)

cc @delandtj

@iwanbk iwanbk self-assigned this Aug 14, 2024
@iwanbk
Member Author

iwanbk commented Aug 14, 2024

backward compatibility check

This check involves porting the current btrfs code to bcachefs; it is WIP in #2375
Deep diving into the code is expected to give more understanding, although the function call flow is not always obvious because of zbus usage. (zbus is a good thing, we only need to be more thorough when tracing the call flow)

No support for subvolume limit/quota

what we really need:

  • set the usage limit of the allocated subvolume/workload

how the subvolume limit is used:
a. setting the limit on the zos cache: no issue here, we will keep it on btrfs

b. when creating a volume for a container

  • created by calling VolumeCreate during container creation
  • the volume will be used as an overlay mount on top of the provided flist
    return mountpoint, f.mountOverlay(ctx, name, ro, &opt)

possible solutions:

  • use a disk image instead of a volume, but it will be slower
  • lvm is not a choice for us -> ..... need explanation ....
  • use one partition for each VM/container: it is quite hard to manage a lot of partitions
  • use stratis to manage it, but we need to add support for bcachefs: possibly a lot of work?
  • does bcachefs have a plan to support subvolume quota? (looks like not)
  • usrquota, prjquota, grpquota in bcachefs

c. on VolumeUpdate

if err := storage.VolumeUpdate(ctx, volumeName, volume.Size); err != nil {

it is used by:

d. on pkg/flist

err = f.storage.VolumeUpdate(ctx, name, limit)

it is used by qsfsd when ....

No support for FS_NOCOW_FL file attribute

what we really need:

  • set the disk image file as nocow, which is supposed to give better performance

possible solution:

  • leave it as cow; it is less performant, but not by much (need proof?)
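As a rough illustration of the "leave it as cow" fallback, the sketch below tries to set the nocow attribute through the generic inode-flags ioctl and simply carries on when the filesystem refuses. The function names (`addNoCOW`, `trySetNoCOW`) are hypothetical and not taken from the zos code base, and the numeric constants are the `linux/fs.h` values for 64-bit Linux, which should be verified on other ABIs.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// Values from linux/fs.h as seen on 64-bit Linux (assumption: verify on other ABIs).
const (
	fsIocGetFlags = 0x80086601 // FS_IOC_GETFLAGS
	fsIocSetFlags = 0x40086602 // FS_IOC_SETFLAGS
	fsNoCowFl     = 0x00800000 // FS_NOCOW_FL
)

// addNoCOW returns the inode flags with the nocow bit set.
func addNoCOW(flags uint32) uint32 {
	return flags | fsNoCowFl
}

// trySetNoCOW reads the inode flags of path, adds FS_NOCOW_FL and writes
// them back. It returns an error when the filesystem rejects either ioctl.
func trySetNoCOW(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	var flags uint32
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		fsIocGetFlags, uintptr(unsafe.Pointer(&flags))); errno != 0 {
		return errno
	}
	flags = addNoCOW(flags)
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(),
		fsIocSetFlags, uintptr(unsafe.Pointer(&flags))); errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Hypothetical disk image file, created empty for the demonstration.
	img, err := os.CreateTemp("", "disk-*.img")
	if err != nil {
		fmt.Println("cannot create image:", err)
		return
	}
	defer os.Remove(img.Name())
	img.Close()

	if err := trySetNoCOW(img.Name()); err != nil {
		// Fallback discussed above: keep the file as cow and continue.
		fmt.Println("nocow not applied, leaving file as cow:", err)
		return
	}
	fmt.Println("nocow set on", img.Name())
}
```

Treating the failure as non-fatal keeps disk image creation working on filesystems that reject the flag, at the cost of the cow overhead that still needs to be measured.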

No subvolume info command

what we really need:

  • to know subvolume disk usage

possible solutions:
we don't really need it. Subvolume disk usage is only needed when there is no limit on the subvolume, and the only occurrence of this is when we create the zdb cache.
zdb cache disk usage is counted using its own method.

current lsblk doesn't have bcachefs support

what we really need:
Get disk label/fstype on startup

solution:
Maxus will upgrade it
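
For reference, reading the label and fstype at startup could look roughly like the sketch below, which parses the JSON that `lsblk --json --output NAME,FSTYPE,LABEL` emits. The type and function names are hypothetical and the sample JSON is illustrative only, not captured from a real node; it is not how zos currently does it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blockDevice holds the three lsblk columns we care about.
type blockDevice struct {
	Name   string `json:"name"`
	FSType string `json:"fstype"`
	Label  string `json:"label"`
}

// lsblkOutput mirrors the top-level object printed by `lsblk --json`.
type lsblkOutput struct {
	BlockDevices []blockDevice `json:"blockdevices"`
}

// parseLsblk decodes lsblk JSON output. JSON null values (unformatted
// disks) leave the corresponding string fields empty.
func parseLsblk(data []byte) ([]blockDevice, error) {
	var out lsblkOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return out.BlockDevices, nil
}

func main() {
	// Illustrative input; on a node this would be the command output.
	sample := []byte(`{"blockdevices":[
		{"name":"sda","fstype":"btrfs","label":"cache"},
		{"name":"sdb","fstype":null,"label":null}
	]}`)

	devs, err := parseLsblk(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, d := range devs {
		fmt.Printf("%s fstype=%q label=%q\n", d.Name, d.FSType, d.Label)
	}
}
```

An lsblk build without bcachefs support would presumably report such a disk with an empty fstype, which is why the upgrade is needed before this kind of detection can be trusted.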

@iwanbk
Member Author

iwanbk commented Aug 15, 2024

Specification

The new bcachefs based storage must provide all the features provided by the btrfs based storage.

Backward compatibility

Because all disks of the old nodes are already formatted with btrfs, we only support new nodes

bcachefs only for the workloads

The root filesystem still uses btrfs with its /var/run/cache

multidevice filesystem strategy

bcachefs supports a real pool, where multiple devices can be formatted into a single filesystem:

  • a filesystem is created from SSD(s) and HDD(s)

caching

writeback caching:

  • write to the fast device (SSD)
  • a background worker periodically moves data from the fast device to the slow device
  • when reading, the data is copied to the fast device if it is not already there
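
The three steps above can be pictured with a toy in-memory model. This is only a sketch of the behaviour as described in this comment, not of how bcachefs implements it; `tieredStore` and its methods are made-up names.

```go
package main

import "fmt"

// tieredStore is a toy model of a fast (SSD) and a slow (HDD) tier.
type tieredStore struct {
	fast map[string][]byte
	slow map[string][]byte
}

func newTieredStore() *tieredStore {
	return &tieredStore{fast: map[string][]byte{}, slow: map[string][]byte{}}
}

// Write lands on the fast tier only.
func (s *tieredStore) Write(key string, val []byte) {
	s.fast[key] = val
}

// Flush plays the role of the background worker: it moves everything
// from the fast tier to the slow tier.
func (s *tieredStore) Flush() {
	for k, v := range s.fast {
		s.slow[k] = v
		delete(s.fast, k)
	}
}

// Read serves from the fast tier when possible; otherwise it reads from
// the slow tier and copies the data to the fast tier.
func (s *tieredStore) Read(key string) ([]byte, bool) {
	if v, ok := s.fast[key]; ok {
		return v, true
	}
	v, ok := s.slow[key]
	if ok {
		s.fast[key] = v
	}
	return v, ok
}

func main() {
	s := newTieredStore()
	s.Write("block-1", []byte("data"))
	s.Flush()
	_, cached := s.fast["block-1"]
	fmt.Println("on fast tier after flush:", cached)

	s.Read("block-1")
	_, cached = s.fast["block-1"]
	fmt.Println("on fast tier after read:", cached)
}
```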

config

--foreground_target=ssd
--background_target=hdd
--promote_target=ssd

quota management

  • no limit/quota for now

the language (Rust or Go)

Rust is the way to go, but the prototype can be built using Go

@iwanbk
Member Author

iwanbk commented Aug 15, 2024

mkfs.bcachefs also has these options, which are worth checking

--usrquota              Enable user quotas
--grpquota              Enable group quotas
--prjquota              Enable project quotas

@ramezsaeed ramezsaeed added this to the 3.13 milestone Aug 19, 2024
@iwanbk
Member Author

iwanbk commented Oct 28, 2024

There was drama on LKML about bcachefs: https://www.phoronix.com/news/Bcachefs-Fixes-Two-Choices

Or "take your toy and go home" effectively alluding to taking it out of the mainline Linux kernel and go back to developing it out-of-tree.

The risk is that bcachefs could be removed from the mainline kernel.
So for now, we wait and observe.
