FreeBSD Developer Summit File Systems Working Group
May 11th 2011 (Wednesday), 15:00-16:30.
Overview
As part of the May 2011 FreeBSD developer summit, this working group will focus on file systems. This is a by-invitation event open to developer summit attendees.
Topics
- Default to journaled soft updates in bsdinstall (mckusick)
- announced that bsdinstall will enable it by default in 9.0, which drew strong objections
- the fixes are turning into one giant diff (~5k lines)
- speak now or forever hold your peace
- Q: soft updates on root? A: should be usable now
- Note: core dumps to SCSI disks broken — any volunteers?
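- Example (a minimal sketch; the device /dev/ada0p2 is illustrative): SU+J can be enabled at newfs time or added to an existing filesystem with tunefs
    newfs -j /dev/ada0p2              # create a new UFS filesystem with journaled soft updates
    tunefs -j enable /dev/ada0p2      # enable SU+J on an existing, unmounted filesystem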
- Raising UFS default block/fragment size to 32K/4K (mckusick)
- last raised in 2001 (to 16k)
- disks now use native 4k sectors; smaller frags are slow
- Q: should we raise inode density (inodes/block)? A: will happen when we raise the block size
- Q: perf improvement? A: yes, always
- Q: might it make sense to only raise default for large filesystems? A: can always change defaults. Q: but shouldn't defaults on small disks be smaller — consider small roots? A: maybe. NOTE: consider making default depend on filesystem size.
- Q: should we make inodes larger, e.g., 64-bit inode numbers? A: that's a UFS3 question, not for now; it also involves a system call interface change
- Q: at Juniper we have huge filesystems but small files, so large movies aren't an issue. A: this is really about matching the software to the hardware.
- Q: if you increase the size, microfiles become more attractive. A: perhaps
- Q: is growfs still working? A: so far as I know, yes, but only offline.
- Q: GEOM providers cannot currently be grown online; that would be useful.
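- Example (sketch; the device name is illustrative): the proposed defaults can already be requested explicitly, and growfs handles offline growth once the provider has been enlarged
    newfs -b 32768 -f 4096 -U /dev/ada0p2    # 32K blocks, 4K fragments, soft updates
    growfs /dev/ada0p2                       # grow the (unmounted) filesystem to fill the provider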
- Use of 4K sectors (ivoras/pjd)
- Mostly covered by use of 4k frags
- Drives lie about block size (claim 512 B when the physical sector is really 4 KB)
- performance suffers badly if you get it wrong
- maybe bad performance is a way to figure out that the drive is lying?
- Note: need to change more than newfs
- Note: compare sector size to stripe size (some disks report both?)
- they get used by different things
- apparently tricky to get this right
- there is an ioctl for the device block size and one for the filesystem block size, but nothing for stripe size
- Justin just wants a way to align fdisk partitions to a given offset (did I get that right?)
- Justin: ZFS allows record size to change dynamically
- needs to be reported to apps, IOCTL not enough
- Q: what can we trust coming from the drive? A: maybe max size, that's all
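- Example (sketch; ada0 is illustrative): checking what the drive reports and working around a lying drive by hand
    diskinfo -v /dev/ada0                 # reports sectorsize and stripesize, which may disagree
    gpart add -t freebsd-ufs -a 4k ada0   # align the new partition to a 4K boundary
    gnop create -S 4096 /dev/ada0         # transparent provider that advertises a 4K sector size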
- 64-bit inode numbers (pjd)
- we're the only ones who don't have them yet (Windows, Linux, Solaris, etc. do); there is current activity on how to get there
- GEOM performance (pjd)
- GEOM doesn't handle very high I/O request rates well (benchmarked with the GEOM ZERO class)
- no hardware involved with ZERO
- 512 sectorsize, 1 exabyte, 4 cores
- kern.geom.collectstats=1
raidtest test -r -d /dev/gzero -n <NPROCS>
- 1 proc: 75k
- 4: 156k
- 8: 190k
- 16: 187k
- kern.geom.collectstats=0
- about 10% faster
- debug.geomperf_direct_io=1
- eliminates the g_up/g_down kernel threads and passes the I/O requests directly in both directions
- 187% faster
- effect diminishes with number of processes → poor scaling
- debug.geomperf_dev_eternal=1
- adds the MAKEDEV_ETERNAL flag to the make_dev_p() call in the GEOM DEV class; avoids contention on the global devmtx mutex
- about 32% better than previous case
- debug.geomperf_percpu_pbufs=1
- uses per-CPU pbufs, avoiding contention on the global pbuf_mtx mutex
- improves about 22% over previous case
- roughly 800k req/sec at this point
- increase MTX_POOL_SLEEP_SIZE from 128 to 1024
- reduces contention on mtxpool
- no difference at 95% confidence
- change the shift used by pa_index(pa) from PDRSHIFT to 17
- reduces contention on vm_page lock
- keeps everything on a single superpage
- about 4.7%
- totals:
- 1: 525k
- 4: 837k
- 8: 814k
- 16: 735k
- around 436% improvement
- Proposals:
- flags to GEOM classes
- "I can handle direct I/O" (separate for up and down)
- Note: on some architectures can use direct map if pages are contiguous
- avoids TLB shootdown, which causes implicit contention
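- A rough sketch of the baseline run described above (raidtest lives in ports under benchmarks/raidtest and needs a request file generated with its genfile subcommand first; the debug.geomperf_* sysctls come from pjd's experimental patches, not a stock kernel)
    kldload geom_zero                     # GEOM ZERO class: provides /dev/gzero, no hardware behind it
    sysctl kern.geom.collectstats=0       # disable GEOM statistics collection (~10% faster per the numbers above)
    raidtest test -r -d /dev/gzero -n 8   # replay the request file against gzero with 8 processes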
- Update on ZFS (pjd)
- v15 in 8-STABLE, v28 in CURRENT
- Q: any work on unifying caches between UFS and ZFS? A: not high on the list because most people don't mix types on a single system
- Q: Macklem has been hearing of problems under heavy write loads, maybe a problem with the log? A: seems to be fundamental; need to use separate SSDs
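- For the heavy-write-load issue, a sketch of dedicating SSDs to a pool (pool and device names are illustrative)
    zpool add tank log ada2     # separate SSD for the ZFS intent log
    zpool add tank cache ada3   # optional second SSD as an L2ARC read cache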
- Default to GPT labels in bsdinstall (pjd)
- as opposed to MBR
- doesn't work for people who mix FreeBSD with Windows
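- Example (a minimal sketch of a GPT boot disk laid out by hand; device and sizes are illustrative)
    gpart create -s GPT ada0
    gpart add -t freebsd-boot -s 64k ada0
    gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada0
    gpart add -t freebsd-swap -s 4g ada0
    gpart add -t freebsd-ufs ada0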
- New RAID scheme for GEOM (mav)
- ATA RAID stuff gone in 9.0
- trying to unify RAID
- Currently handles mirror, concat, RAID 10, ...
- should be able to make RAID 5
- unclear about RAID 3
- issues with metadata support
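- Example (sketch; disks and label are illustrative): the new class is driven with graid(8)
    kldload geom_raid
    graid label Intel data RAID1 ada0 ada1   # RAID1 volume using Intel metadata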
- Update on NFS (macklem)
- Defaults switched in HEAD/CURRENT
- so far, so good
- Peter Holm's testing going well
- Should provide NFSv3 behavior exactly as the old implementation did
- Also supports NFSv4.0 (RFC 3530)
- main advantages: file locking, ACLs
- Includes KRB support, but that came from v3
- Future:
- packrats: local disk aggressively caches data, should be better perf over long-haul links (probably not a big deal on LANs)
- has to be enabled; you give it a cache on the local disk
- "4.1 client stuff"
- Isilon working on 4.1 server side stuff
- Q: performance? A: about the same as v3
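- Example (sketch; host names and paths are illustrative): an NFSv4 mount against the new client, with the matching server-side export
    # on the server, /etc/exports gains a V4: root line alongside the usual export lines
    V4: /export -sec=sys
    /export/home -maproot=root
    # on the client, NFSv4 paths are relative to that V4: root
    mount -t nfs -o nfsv4 server:/home /mnt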
- Update on OpenAFS (kaduk/brashear)
- Cache
- chunk-based fetching and storing
- persistent cache
- memory- or disk-backed
- disk-backed means VFS routines call into the KPI to access cache backing store files
- recursive calls through vnode layer
- locking issues (eek!)
- memory-backed cache ok
- today, we have a port
- works with 8.* and 9-CURRENT
- with a memory cache
- vnode locking is hard
- Q: based on IBM code from c. 2000? A: yes
- Q: CODA? A: different system, not a lot of commonality, not much to learn, clients and servers can't interoperate
- Backing store in FreeBSD for other domains (Justin)
- would like better asynchronous I/O support
- Good for virtualization
- Any other interest? No one else far enough along
- SSD
- FTL — flash translation layer
- Very wide addressing (e.g., 128 bits)
- half to identify the file
- half to index into the file
- eliminates inode mapping
- Block de-duping in flash drives (mckusick)
- please don't
- Filesystem Layout (mckusick)
- Options:
- one big root
- 2 takers
- / + /var + /usr/local
- /usr/local includes home
- 9 takers
- / + /var + /usr + /usr/local
- 4 takers
- Other choices
- good for ZFS
- /
- /usr/local
- /usr/home
- /usr/ports
- /usr/ports/distfiles
- /var/tmp
- /var/log
- /var/audit
- /tmp
- Nice properties:
- can adjust compression
- can adjust setuid, exec
- jails can snapshot / and clone to create a new jail
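- Example (sketch; the pool name zroot and the chosen properties are illustrative): carving this layout out of a ZFS pool with per-dataset tuning, plus the jail snapshot/clone trick from the last point
    zfs create -p -o setuid=off -o exec=off zroot/var/log
    zfs create -p -o compression=lzjb zroot/usr/ports
    zfs create -o compression=off zroot/usr/ports/distfiles
    zfs create -o setuid=off zroot/usr/home
    zfs snapshot zroot/jails/base@clean                # assumes a zroot/jails/base dataset exists
    zfs clone zroot/jails/base@clean zroot/jails/web1  # new jail root from the snapshot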
- Q: why separate /bin and /usr/bin? A: historic
Attendees
Name | Affiliation | Topics of Interest | Other Notes
FreeBSD Project |  | Session chair |
Eric Allman | Sendmail | Filesystem stability |
Derrick Brashear | OpenAFS | OpenAFS |
FreeBSD Project |  |  |
OpenAFS | OpenAFS |  |
ADAM David Alan Martin | FalconStor Software, Inc | deduplication |
 | filesystem and disk subsystem interaction |  |
University of Zagreb | 4K sectors, BIO_FLUSH |  |
 | NFS |  |
Isilon Systems | NFS |  |
Colin Percival |  |  |
Erin Marnell |  |  |
Alan Cox |  |  |
Brad Davis (bd@) |  |  |
Julian Elischer <julian AT elischer DOT org> |  |  |
Ollivier Robert (roberto@) |  |  |
David O'Brien (obrien@) |  |  |
Michael Lucas (mwlucas@) |  |  |
Ed Cronke |  |  |
Kevin Nomura <nomura AT netapp DOT com> |  |  |
Joe Cara Donna (in for Chris Faylor) <joecd AT netapp DOT com> |  |  |
Will Andrews (will@) |  |  |
Justin Gibbs (gibbs@) |  |  |
Artem Belevich (art@) |  |  |
Michael Dexter <dester AT bsdfund DOT org> |  |  |