Linux ZFS Notes
Been playing around with ZFS. ZFS on Linux (ZoL) seems to be maturing nicely, and I thought it was worth a day of playing around with it and learning the basics. I don't think I'm ready to take the plunge with anything serious yet, but it's an interesting and very likable FS/LVM. It supposedly supports pools of up to 256 ZiB (zebibytes). These are my notes for Debian (updated somewhat for buster).
Memory requirements for running ZFS
- Have 8GB ECC RAM + 1GB for every additional 1TB of storage.
^ This old rule of thumb is based on FreeNAS requirements, as it runs directly from RAM.
- FreeBSD wiki quote: "To use ZFS, at least 1 GB of memory is recommended (for all architectures) but more is helpful as ZFS needs *lots* of memory. Depending on your workload, it may be possible to use ZFS on systems with less memory, but it requires careful tuning to avoid panics from memory exhaustion in the kernel."
ECC RAM is preferred, but no more so than with any other filesystem. Matt Ahrens, one of the creators of ZFS at Sun Microsystems and now one of the founding members of OpenZFS, said in a forum post (2014):
"There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error. I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS."
Installing ZFS and extras
(If using buster)
# nano /etc/apt/sources.list
^ deb http://ftp.no.debian.org/debian/ buster-backports contrib
# apt update
# apt install linux-headers-$(uname -r)
# apt install -t buster-backports zfs-dkms zfs-zed zfsutils-linux zfs-initramfs parted samba smartmontools
^ Note to self: zfs-dkms builds the kernel modules via DKMS; not needed on Ubuntu, which ships prebuilt modules. zfs-initramfs will now be pulled in by zfsutils-linux as well. -t buster-backports explicitly makes apt use backports, as recommended.
Setting up a relay for outgoing mail
# apt-get remove exim*; apt-get purge exim*
# apt-get install postfix mailutils libsasl2-2 ca-certificates libsasl2-modules postfix-pcre postfix-mysql
# nano /etc/postfix/main.cf
Add to end of main.cf:
# USE GMAIL,ZOHO,etc SMTP
smtp_header_checks = pcre:/etc/postfix/smtp_header_checks
relayhost = [smtp.gmail.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt
smtp_use_tls = yes
Put in /etc/postfix/smtp_header_checks:
/^From:.*/ REPLACE From: ServerName <[email protected]>
Put in /etc/postfix/sasl_passwd:
[smtp.gmail.com]:587 [email protected]:pass
Secure and activate authentication:
# chmod 400 /etc/postfix/sasl_passwd
# postmap /etc/postfix/sasl_passwd
# service postfix restart
# echo "This is a test mail from X" | mail -s "Test mail" [email protected] -a "FROM:[email protected]"
# echo "This is a test mail from X" | mail -s "Test mail" [email protected]
^ Both should work; the last one checks that the default relay works without an explicit From address.
Configure ZFS Event Daemon(ZED) notification e-mail address
# nano /etc/zfs/zed.d/zed.rc ^ ZED_EMAIL_ADDR="[email protected]"
Smartd for monitoring SMART attributes and general drive health.
I'm setting the check interval to 30 min and automatic mail when a SMART error is found:
# apt-get install smartmontools
# nano /etc/smartmontools/run.d/10mail
^ Edit the mail command to include -a "FROM:[email protected]" or whatever the accepted relay sender is. Maybe you don't need it, but I did when using Zoho as a mail relay instead of Gmail.
# nano /etc/default/smartmontools
^ start_smartd=yes
^ smartd_opts="--interval=1800"
# nano /etc/smartd.conf
^ DEVICESCAN -d removable -n standby -m [email protected] -M test -M exec /usr/share/smartmontools/smartd-runner
^ Remove '-M test' after using it to check that mails get through. It's just for testing the SMART error mail sending.
# service smartd stop/start to test etc.
List available drives to use (many ways to do this)
# lsblk -o NAME,SIZE,MOUNTPOINT,MODEL,SERIAL,UUID
There are 4 types of device labels you can use with ZFS/zpool:
/dev/sdX : Best for development/test pools, as names are not persistent.
/dev/disk/by-id/ : Nice for small systems with a single disk controller; allows mixing disks without import problems.
/dev/disk/by-path/ : Good for large pools, as the name describes the PCI bus number, enclosure name and port number.
/dev/disk/by-vdev/ : Best for large pools, but relies on having an /etc/zfs/vdev_id.conf file properly configured for your system.
^ It can be smart to set up aliases in /etc/zfs/vdev_id.conf, e.g. by shelf and slot number:
#     name    fully qualified or base name of device link
alias d1      /dev/disk/by-id/wwn-0x5000c5002de3b9ca
Run "udevadm trigger" afterwards.
Converting drives to GPT if they are bigger than 2TB (ZFS should take care of it, left for reference)
# parted /dev/<drive> mklabel gpt
# wipefs -a /dev/<drive> ^ Can use sd[a-z] style syntax to clean multiple at the same time, just like when creating vdevs.
Creating a simple Raid-0 pool with no redundancy
# zpool create -o ashift=12 mypool sde sdf sdg sdh
(ashift=12 forces 4096-byte sectors instead of the default auto-detection.)
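ashift is the base-2 log of the sector size, so ashift=12 means 2^12 = 4096 bytes. A quick sketch of that relationship in plain POSIX shell arithmetic (check the real sector sizes first with lsblk -o NAME,PHY-SEC,LOG-SEC):

```shell
# ashift is log2 of the sector size: ashift=9 -> 512 bytes, ashift=12 -> 4096 bytes.
ashift_for() {  # ashift_for SECTOR_BYTES -> matching ashift value
    s=$1; a=0
    while [ "$s" -gt 1 ]; do s=$((s / 2)); a=$((a + 1)); done
    echo "$a"
}
ashift_for 512    # -> 9
ashift_for 4096   # -> 12
```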
Replacing a drive
# zpool offline mypool DiskID
(hotswap with # hdparm -Y /dev/sd* or shut down and change the drive)
# zpool replace mypool DiskID NewDiskID
Alternatively, add the new disk as a hot spare:
# zpool add mypool spare NewDiskID
# zpool replace mypool DiskID NewDiskID
^ After resilvering the new disk into the pool, the old one will be taken offline.
One way of recognizing a drive if you got hotswap chassis with caddy leds and a supported controller
# apt update; apt install ledmon
# ledctl locate=/dev/disk/by-id/[drive-id]
or
# ledctl locate=/dev/sda
Turning off:
# ledctl locate_off=/dev/sda
Destroying a pool
# zpool destroy mypool
Checking the status of a pool
# zpool status mypool
# zfs list
# zpool status -x (nice for script-checking general pool health)
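Since zpool status -x prints exactly "all pools are healthy" when everything is fine, a cron-friendly health check can be a few lines of shell. A sketch, assuming the mail relay set up earlier; the string check is split into its own function so it can be tried without a real pool:

```shell
#!/bin/sh
# Sketch of a pool health check for cron. Assumes mail(1) works via the
# postfix relay configured earlier in these notes.

pool_ok() {
    # zpool status -x prints exactly this line when every pool is healthy
    [ "$1" = "all pools are healthy" ]
}

check_pools() {
    status="$(zpool status -x 2>&1)"
    if ! pool_ok "$status"; then
        echo "$status" | mail -s "ZFS pool problem on $(hostname)" [email protected]
    fi
}
```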
Creating a Raid-1+0 (Raid-10) type pool
Mirror (RAID-1) based. Fastest, but only 50% storage capacity:
# zpool create -o ashift=12 mypool mirror sde sdf mirror sdg sdh
RAIDZ based. Slower, but more capacity depending on the number of drives per vdev:
# zpool create -o ashift=12 mypool raidz1 sde sdf sdg raidz1 sdh sdi sdj raidz1 sdk sdl sdm raidz1 sdn sdo sdp
Both these setups stripe across vdevs and will be faster than a single big RAIDZ pool.
RAID-0 (fastest), RAID-1, RAIDZ-1, RAIDZ-2, RAIDZ-3 (slowest)
There are three different RAID-Z modes which distribute parity across the drives:
RAID-Z1 (similar to RAID 5, allows one disk to fail)
RAID-Z2 (similar to RAID 6, allows two disks to fail)
RAID-Z3 (also referred to as RAID 7, allows three disks to fail)
Optimal number of drives per vdev for best performance:
Start a single-parity RAIDZ (raidz) configuration at 3 or 5 disks (2^N+1)
Start a double-parity RAIDZ (raidz2) configuration at 6 or 8 disks (2^N+2)
Start a triple-parity RAIDZ (raidz3) configuration at 9 or 11 disks (2^N+3)
^ NOTE: This is for distributing blocks evenly. But with compression like LZ4 (recommended), block sizes can be odd-sized, so you can no longer expect power-of-two blocks, which makes this rule pointless as a foundation for deciding widths. At that point, focus mainly on how many drive failures you're willing to risk. However, it may still be a good idea to keep record sizing in mind for optimal padding (see below).
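The 2^N+parity rule above is easy to compute; a small helper in POSIX shell arithmetic:

```shell
# Suggested vdev width from the 2^N+parity rule above.
raidz_width() {  # raidz_width PARITY N -> number of disks per vdev
    echo $(( (1 << $2) + $1 ))
}
raidz_width 1 1   # raidz1, N=1 -> 3
raidz_width 1 2   # raidz1, N=2 -> 5
raidz_width 2 2   # raidz2, N=2 -> 6
raidz_width 3 3   # raidz3, N=3 -> 11
```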
Adding another vdev to the pool
# zpool add -o ashift=12 mypool mirror sdg sdh
Checking pool IO activity
# zpool iostat -v
Scrubbing a pool
# zpool scrub mypool
# zpool scrub -s mypool (if you need to stop it)
# zpool clear mypool (remove any no-longer-relevant error messages)
It's recommended to scrub at least once a week.
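The weekly scrub can be automated with cron. Note that Debian's zfsutils-linux should already ship /etc/cron.d/zfsutils-linux with a periodic scrub of all pools, so check that first. A sketch (the file name is my own):

```
# /etc/cron.d/zfs-scrub (hypothetical file) -- scrub every Sunday at 02:00
0 2 * * 0 root /sbin/zpool scrub mypool
```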
Creating a file system / mounting point / dataset
# zfs create mypool/mypool1
# mkdir /mnt/mypoolstorage (optional)
# zfs set mountpoint=/mnt/mypoolstorage mypool/mypool1 (optional)
# zfs list
# zfs destroy mypool/mypool1
You should only store data on these kinds of mountpoints/datasets and not directly on the pool, because datasets have their own sets of attributes that can be set.
Check automatic mounting on startup/shutdown
# nano /etc/default/zfs
ZFS_MOUNT='YES'
ZFS_UNMOUNT='YES'
Check all configuration values
# zpool get all mypool # zfs get all mypool/<dataset>
Enable automatic pool expansion
# zpool set autoexpand=on mypool
autoexpand: Must be set before replacing the first drive in your pool. Controls automatic pool expansion when the underlying LUN is grown. Default is "off". After all drives in the pool have been replaced with larger drives, the pool will automatically grow to the new size. This setting is a boolean: "on" or "off".
Snapshots, Backups and Recovery
Creating a snapshot:
# zfs snapshot mypool/mypool1@mandag (dataset)
# zfs snapshot mypool@mandag (entire pool)
# zfs list -t all (or just -t snapshot)
# cd /mypool/.zfs/snapshot/mandag (viewing and handling snapshotted files)
Deleting a snapshot:
# zfs destroy mypool/mypool1@mandag
# zfs destroy mypool@mandag
Restoring data from snapshots:
# zfs rollback mypool/mypool1@mandag
# zfs rollback mypool@mandag
Making a backup to an image file:
# zfs send mypool/mypool1@mandag > /backup/mypool-mypool1.img
# zfs send mypool@mandag > /backup/mypool.img (normal)
# zfs send mypool@mandag | xz > /backup/mypool.img.xz (compressed)
# zfs send mypool@mandag | xz | openssl enc -aes-256-cbc -a -salt > /backup/mypool.img.xz.enc (with encryption)
Recovering from an image file:
# zfs receive mypool/mypool1 < /backup/mypool-mypool1.img
If compressed and encrypted:
# openssl enc -d -aes-256-cbc -a -in /backup/mypool.img.xz.enc | unxz | zfs receive mypool/mypool1
Send and receive over SSH:
# zfs send mypool/mypool1@mandag | ssh [email protected] "zfs receive mypool/mypool1"
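Dated snapshots like the ones above can be rotated with a small script. A sketch only: the dataset name, the "auto-" prefix and the 7-day retention are my own choices, and head -n -7 is a GNU coreutils extension (fine on Debian). The name helper is split out so it can be tested without a pool:

```shell
#!/bin/sh
# Sketch: take a dated snapshot and keep only the 7 newest "auto-" ones.
# Dataset name and "auto-" prefix are assumptions, not from the notes above.

snapname() {  # snapname DATASET DATE -> DATASET@auto-DATE
    echo "$1@auto-$2"
}

rotate() {
    ds=mypool/mypool1
    zfs snapshot "$(snapname "$ds" "$(date +%F)")"
    # list snapshots oldest-first, keep the newest 7, destroy the rest
    zfs list -H -t snapshot -o name -s creation \
        | grep "^$ds@auto-" \
        | head -n -7 \
        | xargs -r -n1 zfs destroy
}
```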
Record Size (will only affect new files)
# zfs set recordsize=1M mypool
^ Default size is 128K. Consider 1M for storage of only large files.
Record size is the data stripe divided over the number of disks (minus parity level). Consider how the record size divides across the data disks to account for sector-sized padding (with ashift=12, 4K sectors):
E.g. 128K/(5-1) = 32K: good, divisible by 4K sectors. No padding (RAIDZ1 with 5 drives).
E.g. 128K/(8-2) = 21.3K: needs padding to 24K to be divisible by 4K, spending ~11% of storage on padding.
E.g. 1M/(8-2) = 170.7K: needs padding to 172K to be divisible by 4K, spending ~1% of storage on padding.
Meaning larger records spend less storage on padding. Also consider that small record sizes with large files can cause performance penalties on wide vdevs, because more records per drive means more IOPS.
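The padding arithmetic above as a small calculator (sizes in KiB; assumes 4 KiB sectors, i.e. ashift=12):

```shell
# Per-data-disk share of a record, rounded up to the next 4 KiB sector.
padded_share() {  # padded_share RECORDSIZE_KIB DISKS PARITY -> KiB per data disk
    rs=$1; data=$(($2 - $3))
    echo $(( ((rs + data * 4 - 1) / (data * 4)) * 4 ))
}
padded_share 128 5 1    # -> 32  (128/4, no padding)
padded_share 128 8 2    # -> 24  (21.3 rounded up, ~11% padding)
padded_share 1024 8 2   # -> 172 (170.7 rounded up, ~1% padding)
```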
Compression (will only affect new files)
# zfs get compression mypool
# zfs set compression=lz4 mypool
# zfs get compressratio mypool
^ compressratio will stay at 1.00x until data actually starts coming in. The CPU overhead of lz4 is so small that it makes little sense not to use it.
Deduplication (not recommended, left only for reference)
# zfs set dedup=on mypool/folder (or the entire pool)
This saves space on the device but comes at a large cost in memory. Deduplication is almost never worth the performance penalty.
Moving to another system
# zpool upgrade -v; zfs upgrade -v
^ Match systems before migration.
# zpool export -f mypool
^ On the old system.
# zpool import mypool
^ On the new system.
Sharing via Samba
# zfs create -o casesensitivity=mixed mypool/srv (to mimic Windows case-insensitivity)
# zfs set sharesmb=on mypool/srv
# nano /etc/samba/smb.conf
[Mypool1]
   comment = Debian ZFS Pool
   read only = no
   locking = no
   path = /mypool/mypool1
   guest ok = no
   client min protocol = SMB2
   client max protocol = SMB3
# smbpasswd -a <new_samba_user>
# service samba reload (or smbd)
Remember to set suitable user rights on the shared folders.