
ZFS option to add raw disks without creating GPT partitions #94

Closed
stevecs opened this issue Feb 8, 2011 · 52 comments
Labels
Type: Feature (Feature request or new feature)

Comments

@stevecs

stevecs commented Feb 8, 2011

I see in the change logs that a while ago (2009-11-02) behavior was added so that when adding a raw volume to a pool, ZFS creates a GPT partition table on the disk and actually adds the partition, not the raw volume the user requested. The partition also starts at a ~1049 kB offset:

Disk /dev/sdq: 251GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 1049kB 251GB 251GB zfs
9 251GB 251GB 8389kB

This poses problems when the volumes presented to ZFS are not physical drives but LUNs, due to alignment issues. It also creates a management problem when dealing with many devices, as seen in large deployments, when trying to remove/replace a device with one of a larger size (having to not only replace all the devices in the vdev but also modify the GPT tables).

If possible, I would like an option to NOT create a partition table and instead use the device as presented.

@behlendorf
Contributor

Certainly not creating the GPT partition tables is possible, but it's still unclear to me exactly why this is a problem.

The idea here is that given a full device (LUNs look like full devices) we create a GPT partition table and attempt to automatically align the first partition 1 MiB into the device. We want to keep it 4 kB aligned. Based on your results it does look like something went wrong. You're reporting 1049 kB instead of 1024 kB, which seems wrong, so that certainly looks like a bug.

Alternatively, if you know best when setting up the system and want to set your own alignment, you can always manually create the partition tables and use these pre-created partitions for your vdev. The code will detect that this is a partition and simply use it without adjusting the partition tables.
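
For anyone wanting to take the manual route, a minimal sketch (the device, pool name, and the 3072-sector start are placeholders; 3072 sectors × 512 B = 1536 KiB, picked to suit a hypothetical 384 KiB stripe width):

parted -s /dev/sdq mklabel gpt
parted -s -a none /dev/sdq mkpart zfs 3072s 100%
zpool create tank /dev/sdq1

Because /dev/sdq1 is already a partition, the tools detect that and use it without adjusting the partition table.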

@stevecs
Author

stevecs commented Feb 9, 2011

Besides the bug, which is secondary to the discussion (but still needs to be fixed: the 1049 kB offset as opposed to 1024 KiB), there are two issues.

  1. With any GPT offset there is a problem in proper alignment both to the stripe size of an underlying device and to its stripe width. For example, a 1024 KiB offset will fit stripe sizes of 128, 256, 512, or 1024 KiB (since ZFS itself uses 128 KiB blocks, that would be the smallest). However, you then have the secondary issue of stripe width alignment: say you have a SAN whose LUNs are comprised of 3D+1P (RAID 5 across 4 disks). You effectively have a stripe width of 384 KiB, and 384 KiB will not align with 1024 KiB. To your point, you can manually resize this GPT offset to, say, 768 KiB or 1536 KiB, but then if the back-end SAN changes the array configuration you have the problem of being misaligned again.

If, on the other hand, you don't have any partition table, your LUN and ZFS will /ALWAYS/ be aligned, with no manual/end-user intervention (see the quick arithmetic check after this list).

  2. Since ZFS carves up a block device into discrete atomic units (i.e. it allocates sectors as needed), you can create a vdev based on, say, a couple of 1 TB drives, then remove those drives (one at a time) and replace them with, say, 2 TB drives. When all members are replaced you will have double the capacity available for use. If you also have to play with partition schemes this gets much more involved and further raises the bar for the admin supporting the subsystem. The KISS principle applies. This is similar to many other large deployments of disks that are aggregated into logical volume groups, RAIDs, or other management structures under the filesystem layer.
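
A quick arithmetic check of the stripe-width point in item 1 (values in KiB, shell arithmetic):

echo $(( 1024 % 384 ))   # 256 -> the default 1 MiB offset does not land on a 3D+1P stripe boundary
echo $(( 1536 % 384 ))   # 0   -> a 1536 KiB offset would be aligned
echo $(( 0 % 384 ))      # 0   -> no partition table (offset 0) is always aligned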

Lastly, I would suggest that the default when creating a new pool should be to NOT create the GPT, so that the commands function the same way as under Solaris (i.e. use a new flag if the user WANTS a GPT; the default is no flag and no GPT). That way it would be easier for people to move from one OS to another for support and expect the commands to work the same way.

In our deployments I have systems that generally have upwards of 100 LUNs/drives per system; having to do the additional steps across all of those at 2am is not something I would relish.

@behlendorf
Contributor

Thanks for the detailed reply. I believe the Sun/Oracle ZFS team would suggest not using a RAID array to back individual vdevs because of the first issue you mention. ZFS is designed to work best when it's managing the individual devices. The partition tables are created with the assumption that you're just using a directly attached SAS/SATA/SSD/etc. device, in which case they just do the right thing, even for your case 2) above.

That said, we're actually considering doing the same thing for some of our systems as a short-term measure. Our concern isn't with ZFS per se, but that we still need to get a better JBOD/device management infrastructure in place for Linux. Solaris has FMA, which is OK but not great; we still need to build up that sort of tooling for Linux.

I think the first order of business here is to make sure the commands work exactly the same as they do under Solaris. I thought that's what I'd done with the partition table creation but if that's not the case it should be fixed. Once we have the basic tool behavior the same we can look at adding additional command line options which add any needed functionality.

So my question is... what exactly does OpenSolaris do with a blank drive when you create a new pool, and then what does the Linux port do? If you could provide the raw data that would help.

@stevecs
Author

stevecs commented Feb 15, 2011

Yeah, in large environments where a SAN infrastructure already exists, getting raw disks exported is not something that really happens. In some cases you can export, say, RAID1 mirrors or, as I do here, the smallest RAID5s I can, but even that is pricey from another point of view. If we were all Solaris and had 10 Gbit end-to-end, we would probably come up with something like what you're doing (ZFS/Lustre), or even just clustered filers exporting to other clients via iSCSI.

From the Sun boxes I have here, I think it's more complex: of the couple I just logged into, I have 5.10/ZFS systems running both SMI and EFI/GPT disk labels. I was surprised by the EFI ones, but those also happen to have been newer media installs, so perhaps Sun/Oracle changed the default at some point? Originally ZFS was SMI only, and EFI/GPT was optional.

This raises a question for DR now, when you may need to bring back a pool that has an EFI label on a box with an older version of Solaris that may only handle SMI.

Does the Linux port have code to handle both SMI and EFI? From my tests it appears it only creates EFI, so, similar to what I mentioned above, an option to NOT do EFI/GPT on creation would still be useful for porting purposes, but it does not appear to be as bad as it seemed at first glance.

@behlendorf
Contributor

Yes, I think the EFI scheme may be a late addition to ZFS, as you were saying. The Linux port currently will always create an EFI label if it detects you're using a whole disk for the vdev. If it determines you're using a partition/slice, it will just use the full partition/slice for ZFS and not do anything. It occurs to me that if you want to trick the current tools into doing nothing, you could probably just create a single partition which spans the entire device and use that.

As for accessing old ZFS pools which may have been created with SMI partitioning, I'm not exactly sure how Linux will handle that. If the Linux kernel can properly read the partition information and construct the partitions, everything should be fine. If it can't, you won't be able to import those pools. For pools created with EFI labels on Solaris I have verified that you can access these pools under Linux.

@fajarnugraha
Contributor

(1) How does the code currently detect whether we're using a whole disk (as opposed to, say, an LV) for the vdev?

(2) When CONFIG_SUN_PARTITION is enabled (like in the RHEL kernel), Linux can detect SMI labels/Solaris slices correctly. dmesg will show something like this:

xvda: xvda1
xvda1: <solaris: [s0] xvda5 [s2] xvda6 [s8] xvda7 >

(3) Solaris can detect zfs directly on whole disk (without partition) just fine. It will use "p0" (e.g. c1t0d0p0, instead of the usual c1t0d0s0 when using SMI label).

(4) Solaris by default will create EFI label when presented with whole disk for zfs pool. The exception is when it's going to be used for boot (rpool), on which it will use SMI label.

(5) zfs-fuse doesn't create any partition by default (a whole disk and a regular file are treated the same way), but mostly that was because of a licensing issue (something about libparted being GPL, while ZFS is CDDL).

@behlendorf
Contributor

Detecting whether you're using a whole disk in Linux is actually surprisingly tricky. The current logic resides in the in_whole_disk() function; it basically just attempts to create an EFI label for the provided device name. This will only succeed if we were given a device which can be partitioned.

From what you say the ZFS on Linux port should then have no trouble accessing either the Solaris SMI labels, or a pool created without partitions such as zfs-fuse. It also sounds like this port is doing the right thing by creating EFI labels by default.

Yes, libparted is GPL, so we can't link ZFS with its development headers and use it to create the partitions. The ZFS on Linux port gets around this by not using libparted and instead using a modified version of libefi from Solaris, which is compatibly licensed.

@stevecs
Author

stevecs commented Mar 2, 2011

Just an update: the parted offset turns out to be a units issue with that command:

GNU Parted 2.2
Using /dev/sdo
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Seagate ST2000DL003-9VT1 (scsi)
Disk /dev/sdo: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 1049kB 2000GB 2000GB zfs
9 2000GB 2000GB 8389kB

(parted) unit
Unit? [compact]? B
(parted) print
Model: Seagate ST2000DL003-9VT1 (scsi)
Disk /dev/sdo: 2000398934016B
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
1 1048576B 2000390528511B 2000389479936B zfs
9 2000390528512B 2000398917119B 8388608B

If you set your units to bytes you get exactly 1048576 (1 MiB), so there must be some rounding or something going on with parted which, though screwy, is not a problem with your offset code.

So that leaves just the original request for a means to avoid creating partitions at all.

@baryluk

baryluk commented Apr 25, 2011

Hi. I also wanted to ask about this a long time ago. I tried Linux ZFS and found it very irritating that zpool create tries to be too clever and creates a GPT automatically. Please just use block devices, and do not make any assumptions about what the block device is. That would also make it more similar to the way ZFS works under Solaris, FreeBSD, and zfs-fuse. I like raw devices as they are easier to work with in virtualized environments (iSCSI, Xen, KVM), since they are simpler to manage or mount outside of the virtualized system as well.

Or maybe EFI labels or GPT (or whatever) are now the recommended way of giving whole disks to ZFS? I'm also concerned about alignment issues here.

Thanks.

@stevecs
Author

stevecs commented Apr 25, 2011

The default behavior under Solaris, when presented with a raw disk (cXtXdX), is to use SMI labels if the disk is to be booted, or EFI/GPT labels for data, creating the first partition at a 1 MiB offset from the start of the volume. Brian appears to have attempted to copy this behavior, which is correct in that it is the EXPECTED behavior (it matches what happens by default under Solaris).

To your point, and the reason why I created the bug report: this does not work correctly in all cases (mainly ones where you have complex storage). In these scenarios the ability to use a raw device without a volume header/partition table is very useful, if not required, for good performance.

A work-around, until something gets done to the user-space tools, is to create your arrays with zfs-fuse, which due to its complete lack of such functionality will allow you to use any raw block device to create your pools. You can then export them and import them into native ZFS. This is far from ideal, but it would allow you to create some items in a pinch.

@fajarnugraha
Contributor

On newer OpenSolaris (I tested b148) with the "autoexpand=on" zpool property, it currently does the "right thing" when the disk becomes larger (e.g. by resizing the LUN on the storage side). The GPT partition was automatically adjusted so the new size was recognized correctly.
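
For reference, on current OpenZFS the same knobs exist; a minimal example with placeholder pool/device names:

zpool set autoexpand=on mypool
zpool online -e mypool sdq   # or expand a single device explicitly after the LUN has grown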

@stevecs
Author

stevecs commented Apr 25, 2011

Yes, that has been true under Solaris for a while, actually. What it does not do, however, is allow you to set the offset from sector 0 for the beginning of the ZFS partition (item 1 above). This is what causes misalignment with underlying stripe widths and/or stripe sizes of >1024 KiB.

@stevecs stevecs closed this as completed Apr 25, 2011
@stevecs stevecs reopened this Apr 25, 2011
@behlendorf
Contributor

I'm not opposed to fixing this, but I think we need a concrete proposal for what to do. First let me suggest that we not change the current default behavior. Not only is it consistent with the Solaris behavior, which is good for interoperability, but it's also what most people probably want.

I suggest we add a new property called 'aoffset' which can only be set once at zpool creation time. By default it will be the current 1MB offset, but if you have a specific offset requirement for your storage you can set it here. This could be further extended if needed so that an 'aoffset=-1' might indicate that no GPT partition tables should be created at all.

Related to this, there is further work being done in issue #195 to allow the ashift to be set via a property as well. I'd very much like to see the same mechanism used to set both of these tunings. Is this sufficient to cover everyone's needs?

@stevecs
Author

stevecs commented Apr 26, 2011

I also agree that the base utility command set should mimic what's on Solaris. Similar to what we're talking about doing for ashift values, using long options (which do not exist in Oracle's ZFS) would keep the commands consistent across the board, only exposing local additions when explicitly asked for.

I like your idea of setting a specific size for the disk partition alignment offset, as opposed to just having it there or not. That would give much more flexibility.

Something like "--diskpartalignment=xx", where xx would be specified in kibibytes (KiB), since no underlying controllers expose this alignment factor in anything but base-2 numbers.

Checks would be that the offset value has to be:
1) a multiple of the ZFS filesystem block size
2) a multiple of the sector size implied by the ashift value (2^ashift)
3) at least 33 sectors (33 × 2^ashift). For 2^9 (512-byte sectors) this would be a minimum of 16896 bytes, i.e. room for LBA 0 (the legacy MBR) plus 32 sectors of partition entries (128 entries at 128 bytes each). Note: with ashift values > 9 this would inflate beyond the base minimum of 16896, but I think this is probably the best way to express the offset since it would not hard-code sector sizes (it would follow what is set on the particular drive, so if 4K or larger sector devices come out it should auto-adapt).

comments?

@behlendorf
Contributor

Part of my motivation for setting this as a new zpool property, rather than simply a long option, is that it provides an easy way to preserve the originally requested value. That means if you ever need to replace a disk in the pool it can automatically be created with a correctly aligned GPT table. Otherwise, you will always need to respecify this offset when replacing a disk, and that's the kind of thing which is easily accidentally forgotten.

Your first two checks for the offset look good; we should absolutely check the offset for sanity when setting it. However, I think you're one sector off on the minimum size. To allow space for the GPT itself we need at least the following room left at the beginning of the device. The first partition may start at LBA 34, i.e. byte 17408 for 512-byte sectors, or 8x that for 4K sectors.

LBA 0 - Legacy MBR
LBA 1 - Partition Table Header
LBA 2-33 - Partition Table Entries
LBA 34 - First Partition
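
Folding the corrected LBA 34 minimum into the earlier checks, a purely illustrative shell sketch (no such 'aoffset' option exists; the numbers are examples):

offset=1572864      # proposed offset in bytes (1536 KiB)
ashift=9            # 512-byte sectors
recordsize=131072   # 128 KiB
minimum=$(( 34 * (1 << ashift) ))    # PMBR + GPT header + 32 sectors of partition entries
echo $(( offset % recordsize ))      # 0 -> multiple of the ZFS block size
echo $(( offset % (1 << ashift) ))   # 0 -> aligned to the sector size implied by ashift
[ "$offset" -ge "$minimum" ] && echo "leaves room for the GPT (needs >= $minimum bytes)"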

@stevecs
Author

stevecs commented Apr 26, 2011

Setting it as a zpool option would also allow for having different values on different vdevs, which would allow for mixing devices (not in the same vdev obviously, but in the same zpool).

as for the offset, you are correct, I shouldn't hop-around when writing a post. ;)

@behlendorf
Contributor

How exactly would the syntax work for setting it per vdev? I see what you're saying, but without substantially rewriting the parser I don't see how you would specify that level of detail. My suggested syntax would be something like the following; the ashift and aoffset properties would only be settable at pool creation time.

zpool create -o ashift=12 -o aoffset=4MB pool vdev ...

@stevecs
Author

stevecs commented May 3, 2011

Sorry, I didn't get an update on this for some reason when you posted. You are correct, you would need to do it at the zpool level. I was thinking it would be something we /could/ do per vdev; however, as you pointed out, there is no parsing for that either at creation time or at import/export time (it's only at the zpool level), so the only place to do this at this point would be there. This would mean that you would need to create your zpool with the ashift value that you want, and a full destroy/re-create would be required if you want to change it.

As for your suggested syntax, I think that would work; however, I would suggest using IEC 60027-2 (2^n) units and parsing to avoid ambiguity, as opposed to base-10 numbers. Likewise, the ability to parse at least the KiB abbreviation would be needed (MiB or GiB may be useful too), since offsets of less than 1 MiB can be required, but they are very unlikely to ever be below 1 KiB (I'm hard pressed to think of anything like that in relation to ZFS, unless ZFS has native internal support for fat sectors (520/528 byte), which I doubt).
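
A minimal sketch of the kind of base-2 suffix parsing being proposed (to_bytes is a hypothetical helper; nothing like it exists in the zpool tools today):

to_bytes() {
  case "$1" in
    *KiB) echo $(( ${1%KiB} * 1024 )) ;;
    *MiB) echo $(( ${1%MiB} * 1024 * 1024 )) ;;
    *GiB) echo $(( ${1%GiB} * 1024 * 1024 * 1024 )) ;;
    *)    echo "$1" ;;   # assume plain bytes
  esac
}
to_bytes 1536KiB         # prints 1572864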

@baryluk

baryluk commented May 14, 2011

If the current ZFS on Linux has the same behaviour as the current default on Solaris, and the default alignment is good (1 MB), then this is of low importance to me personally. I'm also slightly against adding custom fields or changing the CLI interface until a similar thing is introduced in FreeBSD or Solaris.
Thanks.

@rlaager
Member

rlaager commented Jan 7, 2012

In the example above:
1 1049kB 2000GB 2000GB zfs
9 2000GB 2000GB 8389kB

What is partition 9 for?

I'm using a VM for testing. If I use virtio for the virtual disk, it shows up as /dev/vda. Running zpool create tank vda does not create a GPT partition label. But if I use SCSI for the virtual disk instead, it shows up as /dev/sda and a zpool create tank sda results in it getting a GPT label. Any ideas why (or how I might debug this)?

@dajhorn
Contributor

dajhorn commented Jan 7, 2012

@rlaager: Partition 9 is the EFI System Partition, which should be FAT32. The loader is installed there if you are booting from a GPT disk.

@rlaager
Member

rlaager commented Jan 7, 2012

@dajhorn: Just on Solaris, or is it used on Linux as well? I don't see how that fits in with the bios_grub partition in the Linux scheme of things.

@dajhorn
Contributor

dajhorn commented Jan 7, 2012

@rlaager:

What you're seeing as partition 9 is part of the EFI/GPT standard. It is not peculiar to Solaris. It should be created by Linux and Microsoft Windows too.

The bios_grub flag is, however, peculiar to the way that GRUB2 is implemented. If you don't install the GRUB loader exactly according to the documentation, then it won't work and you won't get a sensible error message. GRUB on EFI is brittle compared to GRUB on MBR.

@rlaager
Member

rlaager commented Jan 8, 2012

@dajhorn: It sounds like you're describing the EFI System Partition: http://en.wikipedia.org/wiki/EFI_System_partition. However, the GUID for that is supposed to be C12A7328-F81F-11D2-BA4B-00A0C93EC93B (EF00 in gdisk). The partition being created by ZoL (and presumably Solaris, but I haven't verified that) is 6A945A3B-1DD2-11B2-99A6-080020736631 (BF07 in gdisk).
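
For anyone wanting to check this on their own disk, sgdisk (from the gdisk package) can print the type GUID of a given partition; the device and partition numbers below are placeholders:

sgdisk --info=1 /dev/sdo   # "Partition GUID code: ..." for the ZFS data partition
sgdisk --info=9 /dev/sdo   # same for the small partition 9 discussed above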

@putnam

putnam commented Sep 5, 2012

I came across this confusing problem today while janitoring my server.

I have three raidz2's in my pool, and each of them was created at a different time (the first and second were on different versions of zfs-fuse, the third was with zfsonlinux, and some drive switching occurred after I started using zfsonlinux too).

It was rather confusing to try to decipher all this because some drives were "true" block devices only, and others had this two-partition scheme going on. There is a lot of assumption in the zpool import/add commands that expects your block device symlinks to also have partition symlinks in the same directory. In some cases, like "zpool add tank spare a1 b1 c1", I can tell there is a path search going on that uses Solaris-style labels like "a1p1" before erroring out.

I suppose that since "whole disk" isn't really "whole disk" after all, and it seems to be this way upstream in Solaris, then this behavior is to be expected from here on out. In my case the confusion was compounded by the fact that I created my first two raidz2 zvols using zfs-fuse, which was using something apparently non-standard.

In my case I was trying to be slick and create my own symlink folder for labeling purposes. I created /zdev/ and put symlinks to each block-level device in my pool (e.g. A1 -> /dev/disk/by-id/scsi-A123456). I know this might be solved with zdev.conf, but that only supports by-path, which I find too volatile on my setup (it's affected by scan order). So after I made all these symlinks nothing really worked right, because the zpool command was unable to properly tack on pathnames to get a handle on the partition. In the end I fixed my symlinks so that the drives using partitions had their data partition symlinked instead of the block device.

Maybe it is an uncommon case, but I still kind of wish the errors (or even the manual) for zpool add/import would include a note that you might need to point them to a partition symlink in the case of recent-Solaris or zfsonlinux "whole drive" initialization. It's very confusing indeed when zpool swears your device isn't available when the block device is right there and you know you used "whole disk" when creating the vdev. The terminology is what got me.

@ehem

ehem commented Jul 16, 2018

I'm rather astonished such basic functionality is still broken after 7 years. mkfs -t <type> /dev/<anything> has worked since the beginning. Insisting upon creating a GPT on the target is astonishingly broken. Even if something appears at first glance to be a whole disk, it may not be if you can see the bigger picture.

# zpool add test /dev/xvdk
invalid vdev specification
use '-f' to override the following errors:
/dev/xvdk does not contain an EFI label but it may contain partition
information in the MBR.
# dd if=/dev/zero count=1024 of=/dev/xvdk
1024+0 records in
1024+0 records out
524288 bytes (524 kB, 512 KiB) copied, 0.0359066 s, 14.6 MB/s
# zpool create test /dev/xvdk
invalid vdev specification
use '-f' to override the following errors:
/dev/xvdk does not contain an EFI label but it may contain partition
information in the MBR.
# zpool create test -f /dev/xvdk
[100.000000]  xvdk: xvdk1 xvdk9

No, there is certainly not any potential for partition information in the MBR there. The code should check for the presence of such a thing before giving such an obviously broken error message. In this case, outside of the test VM, /dev/xvdk had already had a GPT stripped off, and it was greatly desired for zpool to use the entire block device it was given.

For the sanity of the next person to run into this breakage, I did manage a workaround:

# cp -l /dev/xvdk /dev/xvdk1
# zpool create test -f /dev/xvdk1

Yes, that really was successful at bypassing this insanity.

@koitsu

koitsu commented Jul 24, 2018

@ehem Couple things I can think of. This is more of a brain dump than "here's where the problem lies":

  1. With GPT, there is a primary table (LBAs 1 through 33) and backup table (LBAs {last-LBA-of-disk-minus-33} to {last-LBA}). Your dd only clears the primary. I don't know about Linux, but FreeBSD will actually tell you (in console/dmesg) if the primary is corrupt and will fall back to using data from the backup table at the end of the physical disk. As such, when working with GPT, you have to clear out both tables. LBA 0 is usually the PMBR.

  2. On Linux (at least back in the 2.6.x days; not sure about today), the kernel kept an internal cache of what the MBR/GPT contained, all the way down to partitions. If you manipulated the regions directly on the disk, you had to inform the kernel via a special ioctl() to get the kernel to "re-taste" (re-read) MBR/GPT areas. The only way I know how to do this on Linux is to use fdisk followed by the w command, but as said before, there may be a newer command today that can do it.
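
For reference, a couple of present-day commands cover both points (the device name is a placeholder):

sgdisk --zap-all /dev/xvdk      # destroys the protective MBR plus both the primary and backup GPT
blockdev --rereadpt /dev/xvdk   # asks the kernel to re-read the partition table (partprobe also works)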

@ehem

ehem commented Jul 25, 2018

@ehem Couple things I can think of. This is more of a brain dump than "here's where the problem lies":

  1. With GPT, there is a primary table (LBAs 1 through 33) and backup table (LBAs {last-LBA-of-disk-minus-33} to {last-LBA}). Your dd only clears the primary. I don't know about Linux, but FreeBSD will actually tell you (in console/dmesg) if the primary is corrupt and will fall back to using data from the backup table at the end of the physical disk. As such, when working with GPT, you have to clear out both tables. LBA 0 is usually the PMBR.

Indeed. I didn't bring it up, but I'd ended up writing a small program to explicitly whack all traces of both. There was no possibility of a GPT or other such format on the device beforehand. Also note the error message: /dev/xvdk does not contain an EFI label but it may contain partition information in the MBR. I don't know whether zpool create will look for a backup GPT, but zpool was clearly indicating no GPT ("EFI label") was present, at which point the dd would have nuked all traces of any such construct which had been present. That error is garbage.

  2. On Linux (at least back in the 2.6.x days; not sure about today), the kernel kept an internal cache of what the MBR/GPT contained, all the way down to partitions. If you manipulated the regions directly on the disk, you had to inform the kernel via a special ioctl() to get the kernel to "re-taste" (re-read) MBR/GPT areas. The only way I know how to do this on Linux is to use fdisk followed by the w command, but as said before, there may be a newer command today that can do it.

Indeed, you're referring to ioctl(fd, BLKRRPART). I don't recall the exact sequence of actions I took prior to the above, but the kernel was unaware of any disk slicing before the commands were run (either I'd included the ioctl() during the erase, or perhaps restarted the VM).

As near as I can tell zpool create's logic is roughly:

if(gpt_is_present(passed_in_dev)) {
  /* some action or error message I didn't reproduce */
  return EDIE_FULLDEVICE;
} else if(!isdigit(passed_in_devname[strlen(passed_in_devname)-1])) {
  errormsg("%s does not contain an EFI label but it may contain partition\ninformation in the MBR.\n", passed_in_devname);
  return EDIE_FULLDEVICE;
}

This isn't 100% accurate, of course. Notably, the above wouldn't produce the error message for /dev/xvdk0, whereas zpool create gave the error in that situation (no error for /dev/xvdk1 though). I hope the code checks for GPTs inside devices named "/dev/xvdk1" (yes, in many setups GPTs inside slices will be found and sub-devices created), but I haven't tried this since I was trying to eliminate all traces of any such constructs.

This is a really awful misfeature. One should include slack for slightly differing device sizes, but this is ridiculous. Using whole media devices as filesystems has been a long-time tradition; I highly dislike having to spend a fair bit of time to get a tool to do the basic operation I desire without it adding useless garbage overhead on top.

@ttyS4

ttyS4 commented Oct 5, 2018

I am experimenting with using loopback devices to hide the device from ZoL's zpool create command.
So with just a single-device pool it is:

losetup /dev/loop2 /dev/vdb
zpool create mypool /dev/loop2
zpool export mypool
losetup -d /dev/loop2
zpool import
zpool import mypool

As an end result, I have my pool directly on the unpartitioned device.

@llamafilm

I came across this issue while searching for something similar. In the first post there is discussion about the "bug" of the 1049 kB offset. I think this happens simply because parted uses base 10 instead of base 2 for its display:
512 B * 2048-sector offset = 1048576 B = 1049 kB = 1024 KiB

@rlaager
Member

rlaager commented Dec 14, 2019

I'm closing this in favor of #3452, which has more notes about implementation.

@pepa65

pepa65 commented Jan 22, 2023

For the sanity of the next person to run into this breakage, I did manage a workaround:

# cp -l /dev/xvdk /dev/xvdk1
# zpool create test -f /dev/xvdk1

Yes, that really was successful at bypassing this insanity.

Does this link need to be recreated after every reboot, or can it safely disappear once the zpool has been created?

EDIT: It seems that exporting the pool, removing the link and then importing it is enough.

@ehem

ehem commented Jan 23, 2023

Does this link need to be recreated after every reboot, or can it safely disappear once the zpool has been created?

EDIT: It seems that exporting the pool, removing the link and then importing it is enough.

Yup. The link gets stored in /etc/zfs/zpool.cache (eww! a writable cache file in /etc!) and it simply needs to be updated. Doing anything other than taking the whole of the device provided on the command line is a bug.

I'm sure someone wanted a GPT and had to add it later, but 1000x that many people want it to take the whole device specified. There is no way to know the actual desires of the person running the command. Whereas it is easy to add a GPT beforehand if desired, it is quite difficult with this bug to prevent one from being added.

@pepa65

pepa65 commented Nov 28, 2023

cp -l /dev/sdx /dev/sdx1
zpool create pool /dev/sdx1
zpool export pool
rm /dev/sdx1
zpool import pool

This should be in the README or some other very prominent place!! The -l flag for cp is important. Softlinking does not do the job..!

@walterav1984

walterav1984 commented Jan 27, 2024

Thanks for posting this workaround, since large 20 TB (Seagate EXOS X20) GPT-formatted disks on a MacPro 3,1 seem to stall/freeze the firmware during boot (the boot picker will time out finding nothing, or will shut down after x minutes).
At first I thought this was a ZFS-only issue, but no matter whether the disk is formatted as GPT with just HFS+ in macOS Monterey itself, or in Linux with gdisk (without a partition) or with the GNOME disk utility, it will hang using the onboard 6-port Intel ESB2 SATA2 controller. Maybe certain other exotic UEFI 64 machines are also affected?

Formatting these large disks in a "layout/partition-less" fashion with ZFS (or even btrfs) makes boot behavior normal again on the MacPro 3,1: the boot picker instantly shows up and lets you boot any OS again.

Also using this hard link tip works for /dev/disk/by-id/ata-STxxxxxx-part1 using the -part1 addition.

Do I understand correctly @pepa65 that this kind of partition-less formatting introduces issue(s) when replacing a bad disk in a broken pool?

Note: These disks were re-formatted with openSeaChest for using 4K native vs 512e...

@koitsu

koitsu commented Jan 28, 2024

Do I understand correctly @pepa65 that this kind of partition-less formatting introduces issue(s) when replacing a bad disk in a broken pool?

No. That issue relates to someone having a FreeBSD-created ZFS pool which was then migrated/attached to a Linux system, where OpenZFS decided to auto-create partitions, and it resulted in the loss of the last 8 MB of their disk(s). If the pool was quite full (utilisation-wise), this probably caused them data loss, and for a bootable pool it almost certainly destroyed the (preexisting) backup GPT header. That ticket is a good example of why this auto-partitioning stuff should never have been created in the first place (and I said something to this effect 5 years ago). I am a proponent of auto-partitioning being added as a flag to OpenZFS for Linux, but it should not be doing this by default.

To add a bit more detail: FreeBSD does not do any kind of auto-partitioning on ZFS disk members. If you want to use partitions as pool members instead of raw disks, you are expected to do the partitioning yourself with gpart(8). If you want a bootable ZFS pool, you are expected to do the partitioning and bootloader setup yourself (though if installing the OS from scratch, I believe the installer can now do this for you). Solaris behaves identically in this regard. It is only OpenZFS on Linux that behaves unlike the rest.

@pepa65

pepa65 commented Jan 28, 2024

@walterav1984 What @koitsu said..!
The argument was about slightly unequally sized disks in a RAID: if you then put in a slightly smaller one, it doesn't quite fit size-wise. But if you had made the earlier ones a bit smaller, it would not be a problem. In any case, ZFS is intelligent enough to not let that lead to any casualties.

@koitsu

koitsu commented Jan 28, 2024

@walterav1984 What @koitsu said..! The argument was about slightly unequally sized disks in raid, and if you then put a slightly smaller one in, it doesn't quite fit in size-wise..? But if you had made the earlier ones a bit smaller, it would not be a problem..! In any case, ZFS is intelligent enough to not let that lead to any casualties.

Your understanding of a vdev is mostly correct: you must replace a bad disk with one of equal or larger size (i.e. LBA count). This is how ZFS is designed and isn't OpenZFS on Linux specific. In fact, even HBA RAID controllers operate this same way. If you want to use partitions to try and alleviate this, go right ahead, but it's wasted effort (IMO) since you cannot reliably predict "how small" a replacement disk in the future might be. The general advice we sysadmins will give you is: buy spare disks (of the exact same model/part number) at the time you buy disks for your pool, or go with a vendor (ex. Dell) that can guarantee replacement disks in the future -- even from other manufacturers -- have the same LBA count.

@walterav1984

Thanks for the input. I will try to break the pool and replace and test disks, to see whether it really doesn't introduce a replacement disk with forced partitioning and less space, or whether I have to replace a disk with a hard-linked -part1, before keeping this as a solution.

@ehem

ehem commented Jan 29, 2024

A similar approach is to use a loop device and set the loop device to be 10% smaller than the disk. That is usually enough to compensate for manufacturer variances and as suggested by @ttyS4 this also takes care of the GPT issue. This requires autoexpand to be disabled.
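
A sketch of that approach using losetup's --sizelimit (device names are placeholders; the 10% figure follows the suggestion above):

disksize=$(blockdev --getsize64 /dev/sdb)
losetup --sizelimit $(( disksize * 9 / 10 )) /dev/loop2 /dev/sdb
zpool create -o autoexpand=off mypool /dev/loop2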

At this point I've concluded the OpenZFS project is unresponsive and am now asking Linux distributions to patch out this bug. Hopefully having every Linux distribution patch out the bug might get the message across.

@Nable80

Nable80 commented Jan 29, 2024

10% is a huge amount, even on a 1 TB disk. And actually I haven't come across any variance between vendors during the past 10 years or even more. Yes, SSDs and HDDs have significantly different typical sizes (e.g. 960 vs 1000 GB), but all X TB HDDs that I have used had the same LBA count regardless of the vendor.

@ehem

ehem commented Jan 29, 2024

@Nable80 this is off the topic of OpenZFS adding GPTs being a Bad Idea.

The number wasn't meant to be exact; you simply shave a little bit off every disk to avoid future pain. Indeed this is rather less of an issue now that there are only 2.5 disk vendors, but when there were 10 it was a significant issue. Another alternative is to simply plan on incrementally growing your array and getting larger storage each time (flash cannot yet compete with disks for sheer space, but the only constant is change).

@tonyhutter
Contributor

At this point I've concluded the OpenZFS project is unresponsive and am now asking Linux distributions to patch out this bug. Hopefully having every Linux distribution patch out the bug might get the message across.

@ehem I believe the OpenZFS project is largely in favor of raw disk support. We just need someone to implement the feature:

"We would just need someone will to put together a PR for this which passes the CI, and then some folks to review and test it."
#11408 (comment)

@ehem

ehem commented Jan 30, 2024

@ehem I believe the OpenZFS project is largely in favor of raw disk support. We just need someone to implement the feature:

You're looking to have it done your way (the user has to add an extra flag), instead of the way everyone expects it to work (operating like mkfs). This is a major difference, since the former requires significant familiarity with the source, whereas the latter simply involves ripping the thing out.

I'm left extremely confused as to what possible useful purpose this bug serves. People will make mistakes and choose poor configurations. Doing highly unexpected things causes far greater damage: it won't prevent people from shooting themselves in the foot, and it breaks things for people who expected the program to follow well-established behaviors.

Seriously, what useful purpose does this provide? I haven't the faintest clue, unless it lets someone pad their resume.

@pepa65

pepa65 commented Jan 30, 2024

If it would be way easier to just rip it out (and most likely it is!) then I would be greatly in favour of doing that. If people want partitioning, there are standard recommended tools for that which are very easy to use and as precise as you would want. It would simplify the OpenZFS code base, and nothing would be lost.
