FreeBSD Developer Summit File Systems Working Group
May 11th 2011 (Wednesday), 15:00-16:30.
Overview
As part of the May 2011 FreeBSD developer summit, this working group will focus on file systems. This is a by-invitation event open to developer summit attendees.
Topics
- Default to journaled soft updates in bsdinstall (mckusick)
- announced that bsdinstall will enable it by default in 9.0, which drew strong objections
- the fixes are turning into one giant diff (~5k lines)
- speak now or forever hold your peace
- Q: soft updates on root? A: should be usable now
- Note: core dumps to SCSI disks broken — any volunteers?
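- Example (a minimal sketch; the device /dev/ada0p2 is illustrative): SU+J can be enabled at newfs time or added to an existing filesystem with tunefs
    newfs -j /dev/ada0p2              # create a new UFS filesystem with journaled soft updates
    tunefs -j enable /dev/ada0p2      # enable SU+J on an existing, unmounted filesystem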
- Raising UFS default block/fragment size to 32K/4K (mckusick)
- last raised in 2001 (to 16k)
- disks now use native 4k sectors; smaller frags are slow
- Q: should we raise inode density (inodes/block)? A: will happen when we raise the block size
- Q: perf improvement? A: yes, always
- Q: might it make sense to only raise default for large filesystems? A: can always change defaults. Q: but shouldn't defaults on small disks be smaller — consider small roots? A: maybe. NOTE: consider making default depend on filesystem size.
- Q: should we make inodes larger, e.g., 64-bit inode numbers? A: that's a UFS3 question, not for now; it also involves a system call interface change
- Q: at Juniper we have huge filesystems but small files, so large movies aren't an issue. A: this is really about matching the software to the hardware.
- Q: if you increase the size, microfiles become more attractive. A: perhaps
- Q: is growfs still working? A: so far as I know, yes, but only offline.
- Q: GEOM providers cannot currently be grown online; that would be useful.
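- Example (sketch; the device name is illustrative): the proposed defaults can already be requested explicitly, and growfs handles offline growth once the provider has been enlarged
    newfs -b 32768 -f 4096 -U /dev/ada0p2    # 32K blocks, 4K fragments, soft updates
    growfs /dev/ada0p2                       # grow the (unmounted) filesystem to fill the provider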
- Use of 4K sectors (ivoras/pjd)
- Mostly covered by use of 4k frags
- Drives lie about block size (claim 512 B when the physical sector is really 4 KB)
- performance suffers badly if you get it wrong
- maybe bad performance is a way to figure out that the drive is lying?
- Note: need to change more than newfs
- Note: compare sector size to stripe size (some disks report both?)
- they get used by different things
- apparently tricky to get this right
- there is an ioctl for the device block size and one for the filesystem block size, but nothing for stripe size
- Justin just wants a way to align fdisk partitions to a given offset (did I get that right?)
- Justin: ZFS allows record size to change dynamically
- needs to be reported to apps, IOCTL not enough
- Q: what can we trust coming from the drive? A: maybe max size, that's all
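- Example (sketch; ada0 is illustrative): checking what the drive reports and working around a lying drive by hand
    diskinfo -v /dev/ada0                 # reports sectorsize and stripesize, which may disagree
    gpart add -t freebsd-ufs -a 4k ada0   # align the new partition to a 4K boundary
    gnop create -S 4096 /dev/ada0         # transparent provider that advertises a 4K sector size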
- 64-bit inode numbers (pjd)
- we're the only ones who don't have them yet (Windows, Linux, Solaris, etc. do); there is current activity on how to get there
- GEOM performance (pjd)
- GEOM doesn't handle very high I/O request rates well (benchmarked with the GEOM ZERO class)
- no hardware involved with ZERO
- 512 sectorsize, 1 exabyte, 4 cores
- kern.geom.collectstats=1
raidtest test -r -d /dev/gzero -n <NPROCS>
- 1 proc: 75k
- 4: 156k
- 8: 190k
- 16: 187k
- kern.geom.collectstats=0
- about 10% faster
- debug.geomperf_direct_io=1
- eliminates the g_up/g_down kernel threads and passes the I/O requests directly in both directions
- 187% faster
- effect diminishes with number of processes → poor scaling
- debug.geomperf_dev_eternal=1
- adds the MAKEDEV_ETERNAL flag to the make_dev_p() call in the GEOM DEV class; avoids contention on the global devmtx mutex
- about 32% better than previous case
- debug.geomperf_percpu_pbufs=1
- uses per-CPU pbufs, avoiding contention on the global pbuf_mtx mutex
- improves about 22% over previous case
- roughly 800k req/sec at this point
- increase MTX_POOL_SLEEP_SIZE from 128 to 1024
- reduces contention on mtxpool
- no difference at 95% confidence
- change the shift used by pa_index(pa) from PDRSHIFT to 17
- reduces contention on vm_page lock
- keeps everything on a single superpage
- about 4.7%
- totals:
- 1: 525k
- 4: 837k
- 8: 814k
- 16: 735k
- around 436% improvement
- Proposals:
- flags to GEOM classes
- "I can handle direct I/O" (separate for up and down)
- Note: on some architectures can use direct map if pages are contiguous
- avoids TLB shootdown, which causes implicit contention
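- A rough sketch of the baseline run described above (raidtest lives in ports under benchmarks/raidtest and needs a request file generated with its genfile subcommand first; the debug.geomperf_* sysctls come from pjd's experimental patches, not a stock kernel)
    kldload geom_zero                     # GEOM ZERO class: provides /dev/gzero, no hardware behind it
    sysctl kern.geom.collectstats=0       # disable GEOM statistics collection (~10% faster per the numbers above)
    raidtest test -r -d /dev/gzero -n 8   # replay the request file against gzero with 8 processes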
- Update on ZFS (pjd)
- v15 in 8-STABLE, v28 in CURRENT
- Q: any work on unifying caches between UFS and ZFS? A: not high on the list because most people don't mix types on a single system
- Q: Macklem has been hearing of problems under heavy write loads, maybe a problem with the log? A: seems to be fundamental; need to use separate SSDs
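- For the heavy-write-load issue, a sketch of dedicating SSDs to a pool (pool and device names are illustrative)
    zpool add tank log ada2     # separate SSD for the ZFS intent log
    zpool add tank cache ada3   # optional second SSD as an L2ARC read cache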
- Default to GPT labels in bsdinstall (pjd)
- as opposed to MBR
- doesn't work for people who mix FreeBSD with Windows
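- Example (a minimal sketch of a GPT boot disk laid out by hand; device and sizes are illustrative)
    gpart create -s GPT ada0
    gpart add -t freebsd-boot -s 64k ada0
    gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada0
    gpart add -t freebsd-swap -s 4g ada0
    gpart add -t freebsd-ufs ada0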
- New RAID scheme for GEOM (mav)
- ATA RAID stuff gone in 9.0
- trying to unify RAID
- Currently handles mirror, concat, RAID 10, ...
- should be able to make RAID 5
- unclear about RAID 3
- issues with metadata support
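- Example (sketch; disks and label are illustrative): the new class is driven with graid(8)
    kldload geom_raid
    graid label Intel data RAID1 ada0 ada1   # RAID1 volume using Intel metadata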
- Update on NFS (macklem)
- Defaults switched in HEAD/CURRENT
- so far, so good
- Peter Holm's testing going well
- Should provide NFSv3 behavior exactly as the old implementation did
- Also supports NFSv4.0 (RFC 3530)
- main advantages: file locking, ACLs
- Includes KRB support, but that came from v3
- Future:
- packrats: local disk aggressively caches data, should be better perf over long-haul links (probably not a big deal on LANs)
- has to be enabled; you give it a cache on the local disk
- "4.1 client stuff"
- Isilon working on 4.1 server side stuff
- Q: performance? A: about the same as v3
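- Example (sketch; host names and paths are illustrative): an NFSv4 mount against the new client, with the matching server-side export
    # on the server, /etc/exports gains a V4: root line alongside the usual export lines
    V4: /export -sec=sys
    /export/home -maproot=root
    # on the client, NFSv4 paths are relative to that V4: root
    mount -t nfs -o nfsv4 server:/home /mnt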
- Update on OpenAFS (kaduk/brashear)
- Cache
- chunk-based fetching and storing
- persistent cache
- memory- or disk-backed
- disk-backed means VFS routines call into the KPI to access cache backing store files
- recursive calls through vnode layer
- locking issues (eek!)
- memory-backed cache ok
- today, we have a port
- works with 8.* and 9-CURRENT
- with a memory cache
- vnode locking is hard
- Q: based on IBM code from c. 2000? A: yes
- Q: CODA? A: different system, not a lot of commonality, not much to learn, clients and servers can't interoperate
- Backing store in FreeBSD for other domains (Justin)
- would like better asynchronous I/O support
- Good for virtualization
- Any other interest? No one else far enough along
- SSD
- FTL — flash translation layer
- Very wide addressing (e.g., 128 bits)
- half to identify the file
- half to index into the file
- eliminates inode mapping
- Block de-duping in flash drives (mckusick)
- please don't
- Filesystem Layout (mckusick)
- Options:
- one big root
- 2 takers
- / + /var + /usr/local
- /usr/local includes home
- 9 takers
- / + /var + /usr + /usr/local
- 4 takers
- Other choices
- good for ZFS
- /
- /usr/local
- /usr/home
- /usr/ports
- /usr/ports/distfiles
- /var/tmp
- /var/log
- /var/audit
- /tmp
- Nice properties:
- can adjust compression
- can adjust setuid, exec
- jails can snapshot / and clone to create a new jail
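- Example (sketch; the pool name zroot and the chosen properties are illustrative): carving this layout out of a ZFS pool with per-dataset tuning, plus the jail snapshot/clone trick from the last point
    zfs create -p -o setuid=off -o exec=off zroot/var/log
    zfs create -p -o compression=lzjb zroot/usr/ports
    zfs create -o compression=off zroot/usr/ports/distfiles
    zfs create -o setuid=off zroot/usr/home
    zfs snapshot zroot/jails/base@clean                # assumes a zroot/jails/base dataset exists
    zfs clone zroot/jails/base@clean zroot/jails/web1  # new jail root from the snapshot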
- Q: why separate /bin and /usr/bin? A: historic
Attendees
Name | Affiliation | Topics of Interest | Other Notes
FreeBSD Project |  | Session chair |
Eric Allman | Sendmail | Filesystem stability |
Derrick Brashear | OpenAFS | OpenAFS |
FreeBSD Project |  |  |
OpenAFS | OpenAFS |  |
ADAM David Alan Martin | FalconStor Software, Inc | deduplication |
 | filesystem and disk subsystem interaction |  |
University of Zagreb | 4K sectors, BIO_FLUSH |  |
 | NFS |  |
Isilon Systems | NFS |  |
Colin Percival |  |  |
Erin Marnell |  |  |
Alan Cox |  |  |
Brad Davis (bd@) |  |  |
Julian Elischer <julian AT elischer DOT org> |  |  |
Ollivier Robert (roberto@) |  |  |
David O'Brien (obrien@) |  |  |
Michael Lucas (mwlucas@) |  |  |
Ed Cronke |  |  |
Kevin Nomura <nomura AT netapp DOT com> |  |  |
Joe Cara Donna (in for Chris Faylor) <joecd AT netapp DOT com> |  |  |
Will Andrews (will@) |  |  |
Justin Gibbs (gibbs@) |  |  |
Artem Belevich (art@) |  |  |
Michael Dexter <dester AT bsdfund DOT org> |  |  |