ZFS option to add raw disks without creating GPT partitions #94
Certainly, not creating the GPT partition tables is possible, but it's still unclear to me exactly why this is a problem. The idea here is that given a full device (LUNs look like full devices) we create a GPT partition table and attempt to automatically align the first partition 1 MiB into the device. We want to keep it 4 kB aligned. Based on your results it does look like something went wrong: you're reporting 1049 kB instead of 1024 KiB, which certainly looks like a bug. Alternately, if when you're setting up the system you know best and want to set your own alignment, you can always manually create the partition tables and use these precreated partitions for your vdev. The code will detect that this is a partition and simply use it without adjusting the partition tables.
Besides the bug that is secondary to the discussion (but needs to be fixed: the 1049 KiB offset as opposed to 1024 KiB), there are two issues.
If, on the other hand, you don't have any partition table, your LUN and ZFS will /ALWAYS/ be aligned with no manual/end-user intervention.
Lastly, I would suggest that the default behavior when creating a new pool should NOT create the GPT, so that the commands function the same way as under Solaris (i.e. use a new flag if the user WANTS a GPT; the default is no flag and no GPT). That way it would be easier for people to move from one OS to another for support and expect the commands to work the same way. In our deployments I have systems that generally have upwards of 100 LUNs/drives per system; having to do the additional steps across those at 2am is not something I would relish.
Thanks for the detailed reply. I believe the Sun/Oracle ZFS team would suggest not using a RAID array to back individual vdevs because of the first issue you mention. ZFS is designed to work best when it's managing the individual devices. The partition tables are created with the assumption that you're just using a directly attached SAS/SATA/SSD/etc device, in which case they just do the right thing, even for your case 2) above.

That said, we're actually considering doing the same thing for some of our systems as a short-term measure. Our concern isn't with ZFS per se, but that we still need to get a better JBOD/device management infrastructure in place for Linux. Solaris has FMA, which is OK but not great, and we still need to build up that sort of tooling for Linux.

I think the first order of business here is to make sure the commands work exactly the same as they do under Solaris. I thought that's what I'd done with the partition table creation, but if that's not the case it should be fixed. Once we have the basic tool behavior the same, we can look at adding additional command line options which add any needed functionality. So my question is: what exactly does OpenSolaris do with a blank drive when you create a new pool, and then what does the Linux port do? If you could provide the raw data, that would help.
Yeah, in large environments where there is already a SAN infrastructure, getting raw disks exported is not something that really happens. In some cases you can export, say, RAID1 mirrors or, as I do here, the smallest RAID5s I can, but even that is pricey from another point of view. If we were all Solaris and had 10 Gbit end-to-end, we would probably come up with what you're doing (ZFS/Lustre), or even just clustered filers exporting to other clients via iSCSI.

From the Sun boxes I have here, I think it's more complex: from the couple I just logged into, I have 5.10/ZFS systems running both SMI and EFI/GPT disk labels. I was surprised by the EFI ones, but those also happen to have been newer media installs, so perhaps Sun/Oracle changed the default at some point? Originally ZFS was SMI-only, and EFI/GPT was optional. This raises a DR question: you may need to bring back a pool that has EFI labels on a box with an older version of Solaris that may only handle SMI. Does the Linux port have code to handle both SMI and EFI? From my tests it appears it only creates EFI, so similar to what I mentioned above, an option to NOT do EFI/GPT on creation would still be useful for porting purposes, but it does not appear to be as bad at first glance.
Yes, I think the EFI scheme may be a late addition to ZFS, as you were saying. The Linux port currently will always create an EFI label if it detects you're using a whole disk for the vdev. If it determines you're using a partition/slice, it will just use the full partition/slice for ZFS and not do anything. It occurs to me that if you want to trick the current tools into doing nothing, you could probably just create a single partition which spans the entire device, and use that. As for accessing old ZFS pools which may have been created with SMI partitioning, I'm not exactly sure how Linux will handle that. If the Linux kernel can properly read the partition information and construct the partitions, everything should be fine. If it can't, you won't be able to import those pools. For pools created with EFI labels on Solaris, I have verified that you can access these pools under Linux.
(1) How does the code currently detect whether we're using a whole disk (as opposed to, say, an LV) for the vdev?
(2) With CONFIG_SUN_PARTITION enabled (as in the RHEL kernel), Linux can detect SMI labels/Solaris slices correctly. dmesg will show something like: xvda: xvda1
(3) Solaris can detect ZFS directly on a whole disk (without partitions) just fine. It will use "p0" (e.g. c1t0d0p0, instead of the usual c1t0d0s0 when using an SMI label).
(4) Solaris by default will create an EFI label when presented with a whole disk for a ZFS pool. The exception is when it's going to be used for boot (rpool), in which case it will use an SMI label.
(5) zfs-fuse doesn't create any partition by default (a whole disk and a regular file are treated the same way), but mostly because of a licensing issue (libparted being GPL, while ZFS is CDDL).
Detecting if you're using a whole disk in Linux is actually surprisingly tricky. The current logic resides in the in_whole_disk() function; it basically just attempts to create an EFI label for the provided device name, which will only succeed if we were given a device which can be partitioned. From what you say, the ZFS on Linux port should then have no trouble accessing either Solaris SMI labels or a pool created without partitions, such as by zfs-fuse. It also sounds like this port is doing the right thing by creating EFI labels by default. Yes, libparted is GPL, so we can't link ZFS against its development headers and use it to create the partitions. The ZFS on Linux port gets around this by not using libparted and instead using a modified version of libefi from Solaris, which is compatibly licensed.
Just an update: the parted offset is a units issue with that command, it seems. If you set your units to bytes, you see exactly 1048576 (1 MiB), so there must be some rounding going on in parted's output which, though screwy, is not a problem with your offset code. So that leaves just the original request for a means to avoid creating partitions at all.
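As a quick sanity check of that rounding explanation (a sketch of the arithmetic, not parted's actual code), converting the 1 MiB offset into decimal kB reproduces the surprising number:

```python
# parted prints sizes in SI units (kB = 1000 bytes) and rounds to the
# nearest unit, so a perfectly 1 MiB-aligned partition start is shown
# as "1049kB" even though the byte offset is exactly 1048576.
offset_bytes = 1024 * 1024           # 1 MiB
displayed_kb = round(offset_bytes / 1000)
print(f"{displayed_kb}kB")           # -> 1049kB
```

Switching parted to byte units ("unit B") avoids the rounding and shows the true offset.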
Hi. I also wanted to ask about this a long time ago. I tried Linux ZFS, and found it very irritating that zpool create tries to be too clever and creates a GPT automatically. Please just use block devices, and do not pretend to know what the block device is at all. It would also make it more similar to the way ZFS works under Solaris, FreeBSD and zfs-fuse. I like raw devices as they are easier to work with in virtualized environments (iSCSI, Xen, KVM), since they are simpler to manage or mount outside of the virtualized system. Or are EFI labels or GPT (or whatever) now the recommended way of giving whole disks to ZFS? I'm also concerned about alignment issues here. Thanks.
The default behavior under Solaris, when presented with a raw disk cXtXdX, is to use SMI labels if the disk is to be booted, or EFI/GPT labels for data, creating the first partition at a 1 MiB offset from the start of the volume. Brian appears to have attempted to copy this behavior, which is correct in that it is the EXPECTED behavior (it matches what happens by default under Solaris). To your point, and the reason why I created the bug report: this does not work correctly in all cases (mainly ones where you have complex storage). In these scenarios the ability to use a raw device without a volume header/partition table is very useful, if not required, for good performance. A workaround, until something gets done to the user-space tools, is to create your arrays with zfs-fuse, which due to its complete lack of such functionality will allow you to use any raw block device to create your zvols/pools. You can then export them and import them into native ZFS. This is far from ideal, but it would allow you to create some items in a pinch.
On newer OpenSolaris (I tested b148), with the "autoexpand=on" zpool property it currently does the "right thing" when the disk becomes larger (e.g. by resizing the LUN from the storage side). The GPT partition was automatically adjusted so the new size was recognized correctly.
Yes, that has been true under Solaris for a while, actually. What it does not do, however, is allow you to set the offset from sector 0 for the beginning of the ZFS partition (item #1 above). This is what causes misalignment to underlying stripe widths, and/or problems if you have a stripe size of >1024 KiB.
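The stripe-width point can be sketched numerically; the stripe sizes below are illustrative examples, not values from the thread:

```python
# A fixed partition offset only avoids read-modify-write penalties if it
# is a whole multiple of the underlying array's stripe width.
def aligned(offset_bytes: int, stripe_bytes: int) -> bool:
    return offset_bytes % stripe_bytes == 0

ONE_MIB = 1024 * 1024  # the default offset chosen by the tools
print(aligned(ONE_MIB, 256 * 1024))   # 256 KiB stripe -> True
print(aligned(ONE_MIB, 768 * 1024))   # 768 KiB stripe (e.g. 3 data
                                      # disks x 256 KiB) -> False
print(aligned(ONE_MIB, 2048 * 1024))  # 2 MiB stripe -> False
```

This is why a fixed 1 MiB offset works for simple power-of-two stripes but can misalign on wider or odd-width RAID sets.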
I'm not opposed to fixing this, but I think we need a concrete proposal for what to do. First, let me suggest that we not change the current default behavior. Not only is it consistent with the Solaris behavior, which is good for interoperability, but it's also what most people probably want. I suggest we add a new property called 'aoffset' which can only be set once, at zpool creation time. By default it will be the current 1 MB offset, but if you have a specific offset requirement for your storage, you can set it here. This could be further extended if needed so that 'aoffset=-1' might indicate that no GPT partition tables should be created at all. Related to this, there is further work being done in issue #195 to allow the ashift to be set via a property as well. I'd very much like to see the same mechanism used to set both these tunings. Is this sufficient to cover everyone's needs?
I also agree that the base utility command set should mimic what's on Solaris. Similar to what we're talking about doing for ashift values, using long options (which do not exist in Oracle ZFS) would provide command similarity across the board, only exposing localizations when explicitly called for. I like your idea of setting a specific size for the disk partition alignment offset, as opposed to just having it there or not; that would give much more flexibility. Something like "--diskpartalignment=xx", where xx would be a value in kibibytes (KiB), since no underlying controllers expose this alignment factor in anything but base-2 numbers. Checks would be that the offset value has to be: comments?
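A hypothetical sketch of such sanity checks (the function name, the power-of-two rule, and the minimum are my assumptions; the actual list from the post did not survive):

```python
# Bytes reserved by GPT at the start of the disk, for 512-byte sectors:
# protective MBR + header + 128-entry partition table (see below).
GPT_RESERVED = 34 * 512

def valid_offset_kib(kib: int, sector: int = 512) -> bool:
    """Hypothetical checks for a user-supplied --diskpartalignment value
    in KiB: leave room for the GPT structures, stay sector-aligned, and
    be a base-2 (power-of-two) number as controllers report them."""
    offset = kib * 1024
    if offset < GPT_RESERVED or offset % sector != 0:
        return False
    return kib > 0 and (kib & (kib - 1)) == 0

print(valid_offset_kib(1024))  # the current 1 MiB default -> True
print(valid_offset_kib(48))    # 48 KiB: not a power of two -> False
```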
Part of my motivation for making this a new zpool property, rather than simply a long option, is that it provides an easy way to preserve the originally requested value. That means if you ever need to replace a disk in the pool, it can automatically be created with a correctly aligned GPT table. Otherwise, you will always need to respecify this offset when replacing a disk, and that's the kind of thing which is easily accidentally forgotten. Your first two checks for the offset look good; we should absolutely check the offset for sanity when setting it. However, I think you're one sector off on the minimum size. To allow space for the GPT table itself, we need at least the following room left at the beginning of the device: the first partition may start at LBA 34, i.e. byte 17408 for 512-byte sectors, or 8x that for 4K sectors.

LBA 0 - Legacy MBR
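The arithmetic behind that minimum, as a sketch (assuming the standard 128-entry GPT partition array; the 8x figure for 4K sectors follows the comment above, even though a 4K-native GPT can technically pack the entry array into fewer sectors):

```python
# Space reserved before the first GPT partition:
#   LBA 0     protective/legacy MBR
#   LBA 1     GPT header
#   LBA 2-33  partition entry array (128 entries x 128 bytes = 32 sectors)
# so the first usable LBA is 34.
FIRST_USABLE_LBA = 34
print(FIRST_USABLE_LBA * 512)   # 512-byte sectors -> 17408 bytes
print(FIRST_USABLE_LBA * 4096)  # 4K sectors, 8x that -> 139264 bytes
```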
Setting it as a zpool option would also allow for having different values on different vdevs, which would allow for mixing of devices (not in the same vdev obviously, but in the same zpool). As for the offset, you are correct; I shouldn't hop around when writing a post. ;)
How exactly would the syntax work for setting it per vdev? I see what you're saying, but without substantially rewriting the parser I don't see how you would specify that level of detail. My suggested syntax would be something like the following, where the ashift and aoffset properties would only be settable at pool creation time:

zpool create -o ashift=12 -o aoffset=4MB pool vdev ...
Sorry, I didn't get an update on this for some reason when you posted. You are correct, you would need to do it at the zpool level. I was thinking it would be something we /could/ do per vdev; however, as you pointed out, there is no parsing for that either at creation time or at import/export time (it's only at the zpool level), so the only way to do this at this point would be there. This would mean you would need to create your zpool with the ashift value you want, and a full destroy/re-create would be required to change it. As for your suggested syntax, I think that would work, but I'd suggest using IEC 60027-2 standards/parsing (2^n) to avoid ambiguity, as opposed to base-10 numbers. Likewise, the ability to parse the KiB abbreviation would be needed at a minimum (MiB or GiB may be useful), since offsets below 1 MiB can be required but are very unlikely to ever be below 1 KiB (I'm hard pressed to think of anything like that in relation to ZFS, unless ZFS has native internal support for fat sectors (520/528-byte), which I doubt).
If the current ZFS on Linux has the same behaviour as the current default on Solaris, and the default alignment is good (1 MB), then this is of low importance to me personally. I'm also slightly against adding custom fields or changing the CLI interface until a similar thing is introduced in FreeBSD or Solaris.
In the example above: What is partition 9 for? I'm using a VM for testing. If I use virtio for the virtual disk, it shows up as /dev/vda. Running
@rlaager: Partition 9 is the EFI System Partition, which should be FAT32. The loader is installed there if you are booting from a GPT disk.
@dajhorn: Just on Solaris, or is it used on Linux as well? I don't see how that fits in with the bios_grub partition in the Linux scheme of things.
What you're seeing as partition 9 is part of the EFI/GPT standard. It is not peculiar to Solaris; it should be created by Linux and Microsoft Windows too. The bios_grub flag is, however, peculiar to the way that GRUB2 is implemented. If you don't install the GRUB loader exactly according to the documentation, then it won't work and you won't get a sensible error message. GRUB on EFI is brittle compared to GRUB on MBR.
@dajhorn: It sounds like you're describing the EFI system partition: http://en.wikipedia.org/wiki/EFI_System_partition However, the GUID for that is supposed to be C12A7328-F81F-11D2-BA4B-00A0C93EC93B (EF00 in gdisk). The partition being created by ZoL (and presumably Solaris, but I haven't verified that) is 6A945A3B-1DD2-11B2-99A6-080020736631 (BF07 in gdisk).
I came across this confusing problem today while janitoring my server. I have three raidz2's in my pool, and each of them was created at a different time (the first and second were on different versions of zfs-fuse, the third was with zfsonlinux, and some drive switching occurred after I started using zfsonlinux too). It was rather confusing to try to decipher all this, because some drives were "true" block devices only, and others had this two-partition scheme going on.

There is a lot of assumption in the zpool import/add commands that expects your block device symlinks to also have partition symlinks in the same directory. In some cases, like "zpool add tank spare a1 b1 c1", I can tell there is a path search going on that uses Solaris-style labels like "a1p1" before erroring out. I suppose that since "whole disk" isn't really "whole disk" after all, and it seems to be this way upstream in Solaris, this behavior is to be expected from here on out. In my case the confusion was compounded by the fact that I created my first two raidz2 vdevs using zfs-fuse, which was using something apparently non-standard.

I was trying to be slick and create my own symlink folder for labeling purposes. I created /zdev/ and put symlinks to each block-level device in my pool (e.g. A1 -> /dev/disk/by-id/scsi-A123456). I know this might be solved with zdev.conf, but that only supports by-path, which I find to be too volatile on my setup (affected by scan order). After I made all these symlinks, nothing really worked right, because the zpool command was unable to properly tack on pathnames to get a handle on the partition. In the end I fixed my symlinks so the drives that used partitions would have their data partition symlinked instead of the block device.
Maybe it is an uncommon case, but I still kind of wish the errors (or even the manual) for zpool add/import would include a note that you might need to point it at a partition symlink in the case of recent-Solaris or zfsonlinux "whole drive" initialization. It's very confusing indeed when zpool swears your device isn't available when the block device is right there and you know you used "whole disk" when creating the vdev. The terminology is what got me.
I'm rather astonished such basic functionality is still broken after 7 years.
No, there is certainly no potential for there to be partition information in the MBR there. The code should check for the presence of such a thing before giving such an obviously broken error message. In this case, outside of the test VM, /dev/xvdk has already had a GPT stripped off, and it was greatly desired for it to stay that way. For the sanity of the next person to run into this breakage, I did manage a workaround:
Yes, that really was successful at bypassing this insanity.
@ehem A couple of things I can think of. This is more of a brain dump than "here's where the problem lies":
Indeed. I didn't bring it up, but I'd ended up writing a small program to explicitly whack all traces of both. There was no possibility of a GPT or other such format on the device beforehand. Also note the error message
Indeed, you're referring to As near as I can tell
This isn't 100% accurate, of course. Notably, the above wouldn't produce the error message for /dev/xvdk0, whereas This is a really awful misfeature. One should include slack for slightly differing device sizes, but this is ridiculous. Using whole media devices as filesystems has been a long-time tradition; I highly dislike having to spend a fair bit of time to get a tool to do the basic operation I desire without adding useless garbage overhead on top.
I am experimenting with using loopback devices to hide the device from ZoL's zpool create command.
As an end result, I have my pool directly on the unpartitioned device.
I came across this issue while searching for something similar. In the first post there is discussion about the "bug" of the 1049 kB offset. I think this happens simply because parted rounds sizes for display in decimal units, so the 1 MiB (1048576-byte) offset shows as 1049 kB.
I'm closing this in favor of #3452, which has more notes about implementation.
Does this link need to be recreated after every reboot, or can it safely disappear once the zpool has been created? EDIT: It seems that exporting the pool, removing the link and then importing it is enough.
Yup. The link gets stored in

I'm sure someone wanted a GPT and had to add it later, but 1000x that many people want it to take the whole device specified. There is no way to know the actual desires of the person running the command, whereas it is easy to add a GPT beforehand if desired, but quite difficult, with this bug, to prevent one from being added.
This should be in the README or some other very prominent place!! The
Thanks for posting this workaround. Large 20 TB (Seagate EXOS X20) GPT-formatted disks on a MacPro 3,1 seem to stall/freeze the firmware during boot (the boot picker will time out finding nothing, or will shut down after x minutes). Formatting these large disks in a "layout/partition-less" fashion with ZFS (or even btrfs) makes boot behavior normal again on the MacPro 3,1: the boot picker instantly shows up and lets you boot any OS again. Also, using this hard link tip works for

Do I understand correctly, @pepa65, that this kind of partition-less formatting introduces issue(s) when replacing a bad disk in a broken pool? Note: these disks were reformatted with openSeaChest to use 4K native vs 512e...
No. That issue relates to someone having a FreeBSD-created ZFS pool which was then migrated/attached to a Linux system, where OpenZFS decided to auto-create partitions, and it resulted in the loss of the last 8 MB of their disk(s). If the pool was quite full (utilisation-wise), this probably caused them data loss, and for a bootable pool it almost certainly destroyed the (preexisting) backup GPT header. That ticket is a good example of why this auto-partitioning stuff should never have been created in the first place (and I said something to this effect 5 years ago). I am a proponent of auto-partitioning being added as a flag to OpenZFS for Linux, but by default it should not be doing this.

To add a bit more detail: FreeBSD does not do any kind of auto-partitioning on ZFS disk members. If you want to use partitions as pool members instead of raw disks, you are expected to do the partitioning yourself with gpart(8). If you want a bootable ZFS pool, you are expected to do the partitioning and bootloader setup yourself (though if installing the OS from scratch, I believe the installer can now do this for you). Solaris behaves identically in this regard. It is only OpenZFS on Linux that behaves unlike the rest.
@walterav1984 What @koitsu said!
Your understanding of a vdev is mostly correct: you must replace a bad disk with one of equal or larger size (i.e. LBA count). This is how ZFS is designed and isn't specific to OpenZFS on Linux. In fact, even HBA RAID controllers operate this same way. If you want to use partitions to try to alleviate this, go right ahead, but it's wasted effort (IMO), since you cannot reliably predict "how small" a replacement disk in the future might be. The general advice we sysadmins will give you is: buy spare disks (of the exact same model/part number) at the time you buy disks for your pool, or go with a vendor (e.g. Dell) that can guarantee replacement disks in the future -- even from other manufacturers -- have the same LBA count.
Thanks for the input. I will try to break the pool and replace and test disks, to see whether it really doesn't introduce forced partitioning and less space on a replacement disk, or whether I have to replace a disk with a hardlink
A similar approach is to use a loop device and set the loop device to be 10% smaller than the disk. That is usually enough to compensate for manufacturer variances, and as suggested by @ttyS4 this also takes care of the GPT issue. It requires autoexpand to be disabled. At this point I've concluded the OpenZFS project is unresponsive and am now asking Linux distributions to patch out this bug. Hopefully having every Linux distribution patch out the bug will get the message across.
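A sketch of the sizing arithmetic for that workaround (the 10% margin and helper name are assumptions taken from the comment, not OpenZFS code); the resulting byte count is the sort of value you would hand to something like losetup's --sizelimit option:

```python
def loop_sizelimit(device_bytes: int, margin: float = 0.10,
                   sector: int = 512) -> int:
    """Shrink the visible size of a disk by a safety margin, rounded
    down to a whole sector, so a slightly smaller replacement disk can
    still hold the pool later. Requires autoexpand=off on the pool."""
    limit = int(device_bytes * (1 - margin))
    return limit - (limit % sector)

# e.g. a disk reporting 1000204886016 bytes (a common nominal 1 TB size)
print(loop_sizelimit(1000204886016))
```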
10% is a huge amount, even on a 1 TB disk. And actually, I haven't come across any variance between vendors during the past 10 years or even more. Yes, SSDs and HDDs have significantly different typical sizes (e.g. 960 vs 1000 GB), but all
@Nable80 this is off the topic of OpenZFS adding GPTs being a Bad Idea. The number wasn't meant to be exact; you simply shave a little bit off every disk to avoid future pain. Indeed, this is rather less of an issue with there now being only 2.5 disk vendors, but when there were 10 it was a significant issue. Another alternative is to simply plan on incrementally growing your array and get larger storage each time (flash cannot yet compete with disks for sheer space, but the only constant is change).
@ehem I believe the OpenZFS project is largely in favor of raw disk support. We just need someone to implement the feature: "We would just need someone willing to put together a PR for this which passes the CI, and then some folks to review and test it."
You're looking to have it done your way (you have to add an extra flag), instead of the way everyone expects it to work (operating similar to …). I'm left extremely confused as to what possible useful purpose this bug serves. People will make mistakes and choose poor configurations, but doing highly unexpected things causes far greater damage: it cannot prevent people from shooting themselves in the foot, and it breaks things for people who expected the program to follow well-established behaviors. Seriously, what useful purpose does this provide? I haven't the faintest clue, unless it lets someone pad their resume.
If it would be way easier to just rip it out (and most likely it is!), then I would be greatly in favour of doing that. If people want partitioning, there are standard, recommended tools for that, which are very easy to use and as precise as you could want. It would simplify the OpenZFS code base, and nothing would be lost.
I see in the change logs that this was added a while ago (2009-11-02): when adding a raw volume to a pool, ZFS will create a GPT partition on the disk and actually add the partition, not the raw volume as the user requested. The partition also starts at a ~1049 kB offset:
Disk /dev/sdq: 251GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End    Size    File system  Name  Flags
 1      1049kB  251GB  251GB                zfs
 9      251GB   251GB  8389kB
This poses problems when the volumes presented to ZFS are not physical drives but LUNs, due to alignment issues. It also creates more of a management issue when dealing with many devices, as seen in large deployments, when trying to remove/replace a device with one of a larger size (having to now not only replace all the devices in the vdev but also modify the GPT tables).
If possible, I would like an option to NOT create a partition table, but to use the device as presented.