File System Logging Versus Clustering: A Performance Comparison
Margo Seltzer, Keith A. Smith
Harvard University
Hari Balakrishnan, Jacqueline Chang, Sara McMains, Venkata Padmanabhan
University of California, Berkeley
Abstract
The Log-structured File System (LFS), introduced in
1991 [8], has received much attention for its potential
order-of-magnitude improvement in file system
performance. Early research results [9] showed that
small file performance could scale with processor
speed and that cleaning costs could be kept low,
allowing LFS to write at an effective bandwidth of 62
to 83% of the maximum. Later work showed that the
presence of synchronous disk operations could
degrade performance by as much as 62% and that
cleaning overhead could become prohibitive in
transaction processing workloads, reducing
performance by as much as 40% [10]. The same work
showed that the addition of clustered reads and writes
in the Berkeley Fast File System [6] (FFS) made it
competitive with LFS in large-file handling and
software development environments as approximated
by the Andrew benchmark [4].
These seemingly inconsistent results have caused
confusion in the file system research community. This
paper presents a detailed performance comparison of
the 4.4BSD Log-structured File System and the
4.4BSD Fast File System. Ignoring cleaner overhead,
our results show that the order-of-magnitude
improvement in performance claimed for LFS applies
only to meta-data intensive activities, specifically the
creation of files of one kilobyte or less and the
deletion of files of 64 kilobytes or less.
For small files, both systems provide comparable read
performance, but LFS offers superior performance on
writes. For large files (one megabyte and larger), the
performance of the two file systems is comparable.
When FFS is tuned for writing, its large-file write
performance is approximately 15% better than LFS,
but its read performance is 25% worse. When FFS is
optimized for reading, its large-file read and write
performance is comparable to LFS.
Both LFS and FFS can suffer performance
degradation, due to cleaning and disk fragmentation
respectively. We find that active FFS file systems
function at approximately 85-95% of their maximum
performance after two to three years. We examine
LFS cleaner performance in a transaction processing
environment and find that cleaner overhead reduces
LFS performance by more than 33% when the disk is
50% full.
1 Introduction
The challenge in building high performance file sys-
tems is in using the disk system efficiently. Since
large caches reduce read traffic but do little to reduce
write traffic, the critical issue is write performance.
Achieving efficient writes to disk implies writing data
in large, contiguous units. The central idea in log-
structured file systems is that, by aggressively caching
data and applying database logging techniques, all
disk writes can be made sequential.
The early work in log-structured file systems
focused on how to build such file systems. The key
issues were providing efficient reads in a file system
that was written sequentially and maintaining large
contiguous regions on disk. The seminal work on log-
structured file systems [9] showed how conventional
file system structures could be implemented in an LFS
and how the combination of a segmented log and a
cleaner process (garbage collector) could be used to
maintain large, contiguous regions of disk space. The
work’s main focus was on the design of log-structured
file systems and on the derivation of efficient
algorithms for segment cleaning. The performance
results reported long-term cleaning summaries (e.g.
number of segments cleaned and average utilization
of cleaned segments) and micro-benchmarks that
demonstrated the strengths of LFS.
The paper by Seltzer et al. [10] discussed design
modifications necessary to incorporate LFS into a
BSD framework. The performance analysis presented
there focused on areas not covered by Rosenblum and
Ousterhout, with an emphasis on workloads that
stressed the cleaning capabilities of LFS. It concluded
that the clustering modifications to FFS made it
competitive with LFS in reading and writing large
files and in software development environments (as
characterized by the Andrew benchmark), cleaning
overhead in LFS degraded transaction processing
performance by as much as 40%, and the general
applicability of LFS and its competitiveness with FFS
warranted further investigation. This paper is part of
that further investigation, analyzing the performance
of LFS and FFS, and focusing on the areas that pose
the greatest challenges to each system. We focus on
four main issues:
1. validating the BSD-LFS implementation by
comparing its performance to that of Sprite-LFS,
2. the interaction of file size and performance for
sequential access,
3. the impact of disk fullness on cleaning overhead
in a transaction processing workload, and
4. the impact of free space fragmentation on FFS
performance.
In Section 2 we compare the BSD
implementation of LFS to the Sprite implementation
of LFS to validate that we are measuring a
representative implementation of a log-structured file
system. In Section 3 we examine performance as a
function of file size. In Section 4 we examine the
performance of the two file systems in a transaction
processing environment, with special attention given
to LFS cleaning and its performance as a function of
disk fullness. In Section 5 we discuss the effect of
disk fragmentation on FFS performance. Section 6
summarizes this study.
1.1 Overview of FFS
The BSD Fast File System can be described in terms
of its bitmap, which keeps track of the free space, and
its cylinder groups, which correspond to collections
of cylinders and provide for regions of allocation and
clustering. Information is arranged on disk in terms of
three units: blocks, partial blocks called fragments,
and contiguous ranges of blocks called clusters. In
principle, placing related files and their inodes in the
same cylinder group provides for a high degree of
locality, and allocating blocks contiguously in clusters
provides the opportunity for large, sequential reads
and writes. In practice, there are two potential prob-
lems with FFS. First, operations that affect the file
system meta-data (e.g. creating and deleting files)
require a large number of I/O operations, many of
which are synchronous. For example, it takes poten-
tially six distinct I/O operations to create a new one-
kilobyte file (the inode of the new file, the directory
containing the new file, the directory’s inode, the data
of the new file, the inode bitmap, and the block bit-
map), the first two of which are synchronous. The sec-
ond potential problem is that the FFS block allocator
does not take any special measures to preserve large
free extents. File systems that have been in use for a
long time may become sufficiently fragmented that it
is impossible to find large clusters of blocks. The
challenges facing FFS can be summarized as reducing
the number and synchronous behavior of writes and
avoiding file system fragmentation.
1.2 Overview of LFS
In contrast to FFS, LFS avoids both the multiple-write
and synchronous write problems by batching large
numbers of writes into large, contiguous writes. How-
ever, it must maintain large contiguous regions of free
space on disk, called segments. LFS uses a genera-
tional garbage collector [5], called the cleaner, to
regenerate large free extents. If there is available disk
space, the cleaner can always coalesce that free space
to produce clean segments. The cleaner can be run
during idle periods so as not to interfere with normal
file access; however, during periods of high activity it
may also be necessary to run the cleaner concurrently
with normal file accesses. Depending on the file
access patterns, cleaning can potentially be very
expensive and can degrade system performance. Thus
for LFS the key issue is the cost of cleaning.
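As a rough illustration of the cleaning step, the
following sketch shows how live blocks from partially
empty segments are repacked into full segments. This
is a minimal model of our own, not the BSD-LFS
cleaner; in the real system, segment selection uses a
cost-benefit policy and liveness is determined from
segment summary and inode map information, and
the SEG_BLOCKS constant and is_live() predicate here
are assumptions.

    # Minimal model of one cleaning pass: gather the live blocks from a
    # set of candidate segments and rewrite them as full segments,
    # freeing the originals.
    SEG_BLOCKS = 256   # blocks per segment (assumed)

    def clean(segments, is_live):
        """segments: list of lists of blocks; is_live: block -> bool."""
        live = [blk for seg in segments for blk in seg if is_live(blk)]
        repacked = [live[i:i + SEG_BLOCKS]
                    for i in range(0, len(live), SEG_BLOCKS)]
        return repacked, len(segments) - len(repacked)

Cleaning two half-utilized segments in this model
yields one full segment and one freed segment; the
cost of the pass is the segment reads and writes it
performs, which is the overhead examined in
Section 4.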
2 Validation of 4.4BSD-LFS
The system under study in this work is the 4.4BSD-
Lite operating system with implementations of the
fast file system (FFS) and the BSD log-structured file
system (BSD-LFS), both of which have been
improved since the study by Seltzer [10]. Specifically,
the read-ahead code used by both FFS and LFS has
been largely rewritten. Fragments have been added to
LFS. The algorithm for writing blocks to disk in LFS
has been modified so that files spanning multiple seg-
ments are written with lower-numbered blocks writ-
ten first. (On a new file system, if N is the number of
blocks per segment, blocks zero through N–1 are
written to the first segment, blocks N through 2N–1 to
the next segment, and so on.)
algorithm has been modified so that files are written to
segments in the order in which they entered the cache.
This does not affect write performance, but improves
read performance when files are read in their creation
order.
To show that BSD-LFS is a faithful
implementation of a log-structured file system, we
have run the benchmarks described by Rosenblum
and Ousterhout [9] and compared our results to those
reported for Sprite-LFS. The two hardware
configurations are shown in Table 1. In each
benchmark presented in this section, we scale the
Sprite measurements so that the performance of the
critical components match those of the BSD
configuration. Table 2 shows the relevant parameters
and their scale factors.
In the following discussion, we refer to the two
file systems Rosenblum studied: the Sprite Log-
Structured File System (Sprite-LFS) and the default
Sun file system without clustering in effect (Sun-FFS)
[7], and the three file systems we have studied: the
BSD Log-Structured File System (BSD-LFS), the
BSD Fast File System with maxcontig of one so that
clustering is not in effect (BSD-FFS-m1r2), and the
BSD Fast File System with maxcontig of eight (BSD-
FFS-m8r2). As in Rosenblum’s tests, we use a 4 KB
file system for the LFS implementation and an 8 KB
file system for the FFS implementations. We deduced
that the rotdelay parameter in the Sun-FFS file system
was set to one disk block, indicating that the system
was optimized for writing. Similarly, we have
optimized the BSD-FFS file systems for writing,
setting rotdelay equal to two disk blocks. The tests in
Section 3 will examine the interaction of performance
and rotdelay in more detail.

Benchmark Configurations
                         BSD              Sprite
CPU Parameters
  CPU                    SparcStation II  Sun 4/260
  MHz                    40               25
  SPECint92              21.8             8.7
Disk Parameters
  Disk Type              DSP 3105         Wren IV
  RPM                    5400             3600
  Sector Size            512 bytes        512 bytes
  Sectors per Track      57               45
  Cylinders              2568             1546
  Tracks per Cylinder    14               9
  Track Buffer           256 KB           32 KB
  Average Seek           9.5 ms           17.5 ms
  Maximum Bus Bandwidth  2.3 MB/sec       1.3 MB/sec

Table 1: Benchmark configuration. The Sprite column
describes the benchmark configuration used in the
Rosenblum and Ousterhout study.
Rosenblum used two micro-benchmarks to
evaluate the performance of Sprite-LFS. The first of
these tests measures small-file performance,
specifically, meta-data operations (creating and
deleting files) and reading small files. Ten thousand
one-kilobyte files are created, read in creation order,
and deleted. Figure 1 shows the results of this
benchmark for the five file systems.
According to Rosenblum, the create and delete
tests for LFS are limited by processor speed, while
Sun-FFS is limited by disk bandwidth. Our
measurements of disk and CPU utilization show that
LFS uses 64% of the CPU for creates and 73% for
deletes. FFS uses 5-6% for both. Therefore, the LFS
results for these tests are scaled by 2.5 (the CPU scale
factor) and the Sun-FFS results are scaled by 1.7 (the
average access scale factor).
Both log-structured file systems provide an order
of magnitude performance improvement over the Fast
File Systems for the create phase of the benchmark.
Since the files are small, this aspect of the benchmark
is dominated by the performance of meta-data
operations, and LFS provides superior performance
because it batches such operations.
The read test performance is limited by the disk
bandwidth, so we scale both Sprite-LFS and Sun-FFS
by the bandwidth scale factor (1.8). In both the Sprite
benchmarks and the BSD benchmarks, the read
performance of LFS and FFS are comparable. The
BSD-LFS file system outperforms Sprite-LFS due to
larger track buffers on the BSD disk. Once a block of
inodes has been read from disk, all the files described
by those inodes are resident in the track buffer. Since
Sprite’s track buffers are one-eighth the size of ours,
they cannot cache all the file data.
Parameter                     Sprite  BSD   Scale
CPU (SPECint92)               8.7     21.8  2.5
Disk Bandwidth (MB/sec)       1.4     2.5   1.8
Avg Access (I/Os per second)  39      67    1.7

Table 2: Scale factors for Sprite/BSD comparison.
In order to validate the 4.4BSD implementation of
LFS, we use these scale factors to compare the
measurements on different systems. The average
accesses per second is based on an average seek plus
one-half rotation.
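The average-access entries in Table 2 can be
reproduced from the drive parameters in Table 1. The
short calculation below is our own arithmetic (one
average seek plus one-half rotation per random
access) and is included only to show where the 1.7
scale factor comes from:

    # Random accesses per second = 1000 / (avg seek + half rotation), in ms.
    def accesses_per_sec(seek_ms, rpm):
        half_rotation_ms = 0.5 * 60000.0 / rpm
        return 1000.0 / (seek_ms + half_rotation_ms)

    bsd    = accesses_per_sec(9.5, 5400)    # ~66; Table 2 lists 67
    sprite = accesses_per_sec(17.5, 3600)   # ~39
    print(f"{bsd:.0f} {sprite:.0f} scale {bsd / sprite:.1f}")   # scale ~1.7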
In the delete test, both LFS systems outperform
the FFS systems by an order of magnitude, again due
to the meta-data updates, but BSD-LFS outperforms
Sprite-LFS by a factor of two. The Sprite-LFS delete
performance is baffling. Delete processing in LFS
requires the following four steps.
1. Remove the directory entry.
2. Read the inode.
3. Update the appropriate segment usage
information to reflect that one kilobyte is no
longer in use.
4. Increment the version number in the inode map.
The only disk I/O required is reading the
directories and the file inode blocks. As in the create
case, we expect this test to be limited by the CPU.
However, the CPU processing required for deletes is
greater than that required for creates, and the amount
of data written for deletes is less than that required for
creates. The Sprite-LFS results indicate that the CPU
was saturated during the create test, so we cannot
explain why the Sprite-LFS delete performance
exceeds the create performance. We expect the deletes
to occur at close to memory speed with a single write
at the end of the benchmark to write the empty
directories, the inode map, and segment usage table.
This is consistent with the BSD-LFS results.
[Figure 1: files per second for the create, read, and
delete tests on Sprite-LFS, BSD-LFS, Sun-FFS,
BSD-FFS-m1r2, and BSD-FFS-m8r2.]
Figure 1. Validation of the BSD-LFS small file
performance. The Sprite-LFS results have been scaled
to compensate for the performance of the different
processors and disks on the two systems. The LFS
create and delete tests are scaled by the CPU scale
factor, the FFS create and delete tests by the average
access scale factor (1.7), and the read tests by the
bandwidth scale factor (1.8). The BSD-LFS create
performance is approximately 25% worse than
Sprite-LFS because BSD is not CPU bound, and the
bandwidth scale factor is half the CPU scale factor.
The BSD-LFS read performance dominates
Sprite-LFS's due to larger tracks and track buffers.
We cannot explain Sprite-LFS's delete performance.
Rosenblum’s second benchmark evaluates the
performance of large file I/O. This test consists of the
following five passes through a 100 megabyte test
file.
1. Create the file by sequentially writing 8 KB
units.
2. Read the file sequentially in 8 KB units.
3. Write 100 KB of data randomly in 8 KB units.
4. Read 100 KB of data randomly in 8 KB units.
5. Re-read the file sequentially in 8 KB units.
Figure 2 summarizes the results for the five file
systems. All the sequential benchmarks and the LFS
random write test are scaled by the bandwidth scale
factor (1.8) and the remaining random tests are scaled
by average access time scale factor (1.7). BSD-LFS
and Sprite-LFS display nearly identical performance
for the write tests. On the read tests, BSD-LFS
displays superior performance due to its aggressive
read-ahead and clustered I/O. During a sequential
read, the file is read in 64 KB clusters and read-ahead
is invoked on clusters rather than individual blocks.
The read-ahead algorithm improves the BSD-LFS re-
read performance but degrades the random read
performance. Although the file was written randomly,
all the blocks within a segment are sorted. Since the
eight kilobyte I/O requests consist of two logical
blocks, read-ahead is always invoked on the second of
those blocks.

[Figure 2: bandwidth (in MB/sec) for the seq_write,
seq_read, rand_write, rand_read, and re-read tests on
Sprite-LFS, BSD-LFS, Sun-FFS, BSD-FFS-m1r2, and
BSD-FFS-m8r2.]
Figure 2. Validation of the BSD-LFS large file
performance. The Sprite-LFS sequential read and
write and random write results, and the Sun-FFS
sequential read, write, and re-read results are scaled
by the bandwidth scale factor (1.8). The random read
and Sun-FFS random write results are scaled by the
average access scale factor (1.7). The BSD-LFS
implementation offers write performance equivalent
to that of Sprite-LFS. BSD-LFS outperforms
Sprite-LFS on sequential reads due to aggressive
clustering and read-ahead. The read-ahead policy is
also responsible for the degraded performance of
random reads in BSD-LFS.

Additionally, if logically sequential
eight kilobyte units are written before the buffer cache
is flushed to disk, the two eight kilobyte units (four
blocks) will be allocated contiguously on disk. During
the sequential re-read, BSD-LFS will read maximal-
sized clusters based on disk layout.
In the random read case, the combination of read-
ahead, the eight kilobyte I/O size, and the four
kilobyte block size is a worst-case configuration for
read-ahead. Two random eight kilobyte reads are
actually four block reads to the file system. Assume
that we are reading two random eight kilobyte units
containing blocks 100, 101, 200, and 201. The read of
block 100 is not sequential, so read-ahead is not
triggered. However, when block 101 is read, the file
system detects sequential access and triggers a read
ahead operation on block 102. Unfortunately the next
block read is block 200, requiring a seek. Since this is
not a sequential access, read-ahead is not invoked on
block 201. The end result is that we perform read-
ahead when we do not want it (between eight kilobyte
units) and we do not perform read-ahead when we do
want it (after reading the first four kilobytes of an
eight kilobyte unit). This phenomenon explains the
low random read performance for BSD-LFS. The
Sprite-LFS read numbers are consistent with an
implementation that performs no read-ahead.
Although the BSD-LFS random read performance is
inferior to that of Sprite-LFS, none of the benchmarks
analyzed in the remaining sections of this paper
trigger the phenomenon described. The TPC-B
benchmark discussed in Section 4 is the only
benchmark that performs random I/O and its I/O size
is equal to the file system block size, so the file system
does not identify a pattern of sequential access.
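The behavior described above can be reproduced with
a toy model of the sequential-access heuristic. This is
a simplified sketch of our own, not the 4.4BSD
read-ahead code; it captures only the rule that a read
of block b prefetches block b+1 when the previous
read was block b-1:

    def prefetches(reads):
        """Return the blocks a simple one-block read-ahead would fetch."""
        last, fetched = None, []
        for blk in reads:
            if last is not None and blk == last + 1:
                fetched.append(blk + 1)      # sequential access detected
            last = blk
        return fetched

    # Two random 8 KB requests on a 4 KB file system: blocks 100, 101,
    # then 200, 201.
    print(prefetches([100, 101, 200, 201]))  # [102, 202]

The model prefetches blocks 102 and 202, which
belong to units the benchmark never requests, yet it
never prefetches block 201 while block 200 is being
read, which is the one case where read-ahead would
have helped.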
Since Rosenblum did not set maxcontig high in
Sun-FFS, we expect its performance to be comparable
to BSD-FFS-m1r2 performance. This is consistent
with all the results in Figure 2 except for the
sequential write performance. To understand the
performance for this test, we must deduce the rotdelay
setting for the Rosenblum benchmarks. Typically,
rotdelay is set to optimize write performance so that a
disk revolution is not lost between successive
contiguous blocks [6]. Sun-FFS obtains
approximately 40% of the possible disk bandwidth
when writing sequentially. This is consistent with a
rotdelay of one block. In BSD-FFS, our disks were
sufficiently faster that we had to use a rotdelay of two
blocks to avoid missing a rotation on every write.
This yields write performance of slightly less than
one-third the maximal bandwidth, which is consistent
with our measurements.
The similarity in performance between Sprite-
LFS and BSD-LFS demonstrates that BSD-LFS is an
equivalently performing implementation of a log-
structured file system. Its write performance is as
good as Sprite’s and it generally outperforms Sprite
for reading due to the aggressive read-ahead and
clustered I/O. Sun-FFS is a faithful implementation of
the 4BSD FFS. The clustering modifications in
4.4BSD are a reimplementation of the modifications
described by McVoy [7]. With these points
established, we now compare the performance of
BSD-LFS to that of the 4.4BSD FFS with these
clustering enhancements. For the remainder of the
paper, we use LFS to mean BSD-LFS and FFS to
mean BSD-FFS.
3 Sequential Performance as a Function
of File Size
Our first comparative benchmark examines the
sequential read and write performance of the two file
systems across a range of file sizes. The data set con-
sists of 32 megabytes of data, decomposed into the
appropriate number of files for the file size being mea-
sured. In the case of small files, where directory
lookup time dominates all other processing overhead,
the files are divided into subdirectories, each contain-
ing no more than 100 files. In the case of large files,
we use either 32 MB or ten files, whichever generates
more data. There are four phases to this benchmark:
Create: All the files are created by issuing the
minimal number of writes to create a file of the
appropriate size. For all file sizes of four
megabytes or less, this means that the test
program does one I/O operation. For larger files,
I/Os are issued in four-megabyte portions.
Read: Each file is read in its creation order. The
I/O operations are identical in size to those
during the create test.
Write: Each file is rewritten in its creation order.
Delete: All the files are deleted.
The LFS measurements represent performance when
there is no cleaner present and when the set of tests
for each file size are run on a newly created file sys-
tem. The FFS measurements represent empty file sys-
tem performance; although a new file system is not
created for each file size, all files from previous tests
are deleted before the next data set is created. Both of
these configurations yield the optimal operating con-
ditions for the two file systems.
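For concreteness, the benchmark can be sketched as
follows. This is a simplified driver of our own, not the
actual benchmark program; the path names are
hypothetical and the per-phase timing code is omitted,
but it follows the four phases, the data set size, and the
I/O sizes described above.

    import os

    TOTAL  = 32 * 1024 * 1024        # 32 MB data set
    MAX_IO = 4 * 1024 * 1024         # I/O issued in at most 4 MB units

    def run(file_size, root="bench"):
        nfiles = max(TOTAL // file_size, 10)
        buf = b"x" * min(file_size, MAX_IO)
        paths = [os.path.join(root, "d%d" % (i // 100), "f%d" % i)
                 for i in range(nfiles)]            # <= 100 files per dir
        for p in paths:                             # create phase
            os.makedirs(os.path.dirname(p), exist_ok=True)
            with open(p, "wb") as f:
                for _ in range(0, file_size, len(buf)):
                    f.write(buf)
        for p in paths:                             # read, in creation order
            with open(p, "rb") as f:
                while f.read(MAX_IO):
                    pass
        for p in paths:                             # overwrite phase
            with open(p, "r+b") as f:
                for _ in range(0, file_size, len(buf)):
                    f.write(buf)
        for p in paths:                             # delete phase
            os.remove(p)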
For all test results shown we provide the LFS
performance and two FFS measurements. In both
cases, maxcontig has been set to 64 kilobytes (the
maximum transfer size supported for our controller),
but the rotdelay parameter varies. On our system, a
rotdelay of two blocks provides the optimal write
performance and a rotdelay of zero produces the
optimal read performance. While a second write
cannot be issued before the first write completes, the
track buffer caches read data so that the rotdelay is
unnecessary. We show results for both rotdelay
settings.
All measurements shown have confidence
intervals of plus or minus one percent of the reported
number.
3.1 Create Performance
Figure 3 shows the performance for the create phase
of the sequential benchmarks. Since LFS buffers a
large number of writes before writing any data to disk,
the file creates in this benchmark occur asynchro-
nously. In contrast, each time a creat system call
returns, FFS guarantees that the file has been created
and will exist after a system crash. The journaling file
systems [4] avoid the synchronous writes of FFS by
logging all meta-data operations to an auxiliary data
structure, replacing multiple, random, synchronous
I/Os with a single sequential one. When files are small
the create test measures the ability of the systems to
perform meta-data operations; when files are large,
the create test measures the write performance of the
file systems.

[Figure 3: throughput (in MB/sec) versus file size
(1 KB to 64 MB) for the create phase on LFS,
FFS-m8r0, and FFS-m8r2.]
Figure 3. Create performance. When the speed of
meta-data operations dominates (for small files less
than a few blocks or 64 KB), LFS performance is
anywhere from 4 to 10 times better than FFS. As the
write bandwidth of the system becomes the limiting
factor, the two systems perform comparably.
As expected, the asynchronous creation and the
sequential writes of LFS yield superior performance
to FFS for small files. The order of magnitude
performance improvement advertised for LFS is
demonstrated during the create benchmark when the
file size is one kilobyte. In the two kilobyte to one-
half megabyte range, the superiority of LFS degrades
from a factor of four at two kilobytes to a factor of
two at 32 kilobytes to the same performance at one-
half megabyte. When the created files are large (one-
half megabyte or more) the performance of the two
systems is comparable. The LFS and FFS-m8r0
systems use approximately 67% of the disk
bandwidth, losing one rotation between each 64
kilobyte request. The FFS-m8r2 achieves
approximately 75% of the disk bandwidth, losing
only two blocks or two-thirds of one rotation between
64 kilobyte writes.
3.2 Read Performance
Figure 4 shows the read performance for both LFS
and FFS as a function of the log of the file size. In the
region between one and eight kilobytes both file sys-
tems show low bandwidth, but steady growth as the
transfer size between the disk and memory increases.

[Figure 4: throughput (in MB/sec) versus file size
(1 KB to 64 MB) for the read phase on LFS,
FFS-m8r0, and FFS-m8r2.]
Figure 4. Read performance. For files of less than
64 KB, performance is comparable in all the file
systems. At 64 KB, files are composed of multiple
clusters and seek penalties rise. In the range between
64 KB and 2 MB, LFS performance dominates
because FFS is seeking between cylinder groups to
distribute data evenly.
Although the track buffer can hide disk latency, each
file read results in a separate command to the device.
The dip in performance at sixteen kilobytes is due
to the fact that two I/Os are required and read-ahead
has not yet become operative. In the region between 8
KB and 64 KB, performance on both file systems
improves as each approaches the 64 KB maximum
I/O size. Files larger than 64 KB occupy more than
one cluster and we see a performance dip when the
next I/O operation is added.
FFS performance declines more steeply than
LFS. There are two factors at work. FFS leaves a gap
between clusters, the size of which is controlled by
the rotdelay parameter. When rotdelay is non-zero
and a new file is allocated, its blocks are placed in the
gaps left by previously created files. Therefore, at file
sizes greater than 64 KB, the files for FFS-m8r2
become increasingly fragmented. LFS does not suffer
from this fragmentation since all of the blocks of a file
are allocated contiguously. The ability to achieve the
superior write performance of the FFS-m8r2 in the
create test is precisely the limiting factor in the read
case.
The second factor is the introduction of indirect
blocks at files of 96 KB or larger. This drop is more
pronounced for FFS since FFS begins allocation in a
new cylinder group when the first indirect block is
added. LFS continues allocating blocks sequentially.
There is a third parameter that can adversely
affect FFS read performance in a test that creates
many files. The FFS cpc (cylinders per cycle)
parameter specifies the number of rotationally
equivalent positions that will be considered in disk
allocation. When a block is allocated, a preferred
location is selected. If that location is unavailable,
FFS will attempt to find up to cpc rotationally
equivalent positions. This is exactly correct when
allocating blocks to an existing file. However, when
creating new files, the preferred block is set to the first
block in a cylinder group. After the first file is created,
that block is no longer available. Ideally, the next file
should be allocated contiguously, but if cpc is non-
zero, then the file will be allocated to a block that is
rotationally equivalent to the first block in the
cylinder group. Accessing these two blocks in
succession requires a full rotation. Since the disk
geometry of many of today’s disks (e.g. SCSI) is not
exposed, determining an accurate value for cpc is
virtually impossible. The numbers reported here use a
cpc of zero.
For large files, both file systems approach the 2.3
MB/sec bandwidth achievable over our SCSI bus.
3.3 Overwrite Performance
In this test, the 32 MB of data are written to the files
already created in the create phase of the benchmark.
The results are shown in Figure 5. In this test, FFS
need perform no allocation, since the blocks are
reused, but LFS must mark blocks dead and create
new segments.
For files smaller than a cluster (the maximum I/O
size supported by the controller, typically 64 KB), the
sequential layout of LFS dominates. The performance
drop for both systems at 64 KB is due to the cluster
size. LFS and FFS-m8r0 lose a disk rotation between
each pair of files, because although the files are
allocated contiguously on disk, both systems must
issue two separate I/O requests to write them. While
track buffers can hide this effect during reading, they
cannot diminish the effect while writing. The rotdelay
gap alleviates this problem, but it introduces the
performance penalty of fragmenting files whose
blocks are allocated in these gaps. The 4.4BSD file
system has been enhanced with a new reallocation
algorithm that coalesces such fragmented files when
they are created. However, the long term effects of
this policy have not been thoroughly investigated, so
we did not enable this functionality during the test.
For large files (greater than 256 KB), the one-
third rotation that FFS-m8r2 saves over FFS-m8r0
and LFS accounts for the 15% performance
improvement.
[Figure 5: throughput (in MB/sec) versus file size
(1 KB to 64 MB) for the overwrite phase on LFS,
FFS-m8r0, and FFS-m8r2.]
Figure 5. Overwrite performance. The main
difference between the overwrite test and the create
test is that FFS need not perform synchronous disk
operations and LFS must invalidate dead blocks as
they are overwritten. As a result, the performance of
the two systems is closer, with LFS dominating for
files of up to 256 KB and FFS dominating for larger
file sizes.
3.4 Delete Performance
The final phase of this benchmark consists of the
delete phase. Since this test writes little data, the
results in Figure 6 are expressed in deleted files per
second. Once again, the asynchronous behavior of
LFS meta-data operations provides an order of magni-
tude performance improvement for small files. The
sudden drop in performance occurs when indirect
blocks are required. When a file exceeds its direct
blocks, the indirect block must be retrieved from disk
in order to free blocks (in FFS) or mark them dead (in
LFS). Even for large file sizes, the asynchronous
deletes of LFS provide three to four times the perfor-
mance of FFS.
3.5 Benchmark Summary
To summarize these benchmarks, LFS offers an order
of magnitude performance improvement in perform-
ing meta-data operations (creates and deletes) on
small to medium-sized files. For deletes, LFS main-
tains its performance superiority at large file sizes,
deleting at three to four times the rate of FFS.
The read, write, and create bandwidth for large
files is comparable on both systems. FFS provides
slightly better write performance when the rotdelay
parameter is adjusted to avoid missing an entire
rotation, and LFS provides slightly better
read performance, since there is no effort to distribute
the data across cylinder groups as is done in FFS.

[Figure 6: files deleted per second (log scale) versus
file size (1 KB to 64 MB) for the delete phase on LFS,
FFS-m8r0, and FFS-m8r2.]
Figure 6. Delete performance. As was the case for
creates, the delete performance is a measure of
meta-data update performance, and the asynchronous
operation of LFS gives it an order of magnitude
performance advantage over FFS. As the file size
increases, the synchronous writes become less
significant and LFS provides a factor of 3-4 better
performance.
For files smaller than the cluster size, read
performance is comparable in both systems. LFS
provides better write performance as its clustering of
multiple small writes into large contiguous ones
results in its using 43-65% of the available
bandwidth.
As in all benchmarks, this one has many
shortcomings. Both file systems are presented under
optimal circumstances: all accesses are sequential,
access order is identical to create order, the request
stream is single-user, no cleaning is required for LFS,
and FFS operates on an empty disk. The next section
presents a more demanding workload, the TPC-B
transaction processing benchmark.
4 Transaction Processing Performance
Although LFS was designed for a UNIX time-sharing
workload, there has been speculation that the ability
to convert small, random I/Os into large sequential
ones would make it ideal for transaction processing
[11]. Seltzer et al. measured a modified TPC-B imple-
mentation and found that the cleaning overhead
severely limited its performance. The disk was 80%
full in the benchmark configuration. In this section,
we examine the performance of the same benchmark
across a range of file system utilizations, since the file
system utilization can affect cleaning cost. The bench-
mark configuration is identical to that described in
Section 3, except that the file systems are configured
with a four-kilobyte block size to match the block size
of the database indexing structures.
The TPC-B benchmark simulates a check-
cashing application [13]. The four files in the
benchmark are described in Table 3. For each
transaction, an account record is read randomly, its
balance is updated, the balances in the associated
teller and branch records are updated, and a history
record is written.

File     Size    Description
account  237 MB  1,000,000 100-byte records
branch   44 KB   10 100-byte records
teller   44 KB   100 100-byte records
history  70 KB   Append-only; 50 bytes per transaction

Table 3: File specifications for a 10 TPS TPC-B
database. Although our system is capable of
supporting more than 10 TPS, we scaled the
benchmark database to allow experimentation with
disk utilization.

Although our implementation
supports full concurrency control, we ran a single-
user version of the test to minimize synchronization
overhead and concentrate on disk behavior. All
updates to the database are logged to a non-duplexed
log residing on a separate disk.
The application maintains its own 4 MB block
cache in user virtual memory. As the branch and teller
files are rather small, they are always memory
resident. In contrast, the account file is large, so only
the internal pages of the B-tree index structure are
likely to be memory resident. Therefore, the file
system activity generated for each transaction is a
random read of a block in the account file, followed
by a random write from the user-level cache to the file
system in order to make room for the newly retrieved
account block. The newly retrieved account block is
then dirtied and left in the cache until it is later
evicted.
Three sets of results are shown in Figure 7. The
top line indicates the performance of LFS in the
absence of the cleaner. The performance was
measured on the lowest utilization (48%) and
projected across all utilizations since LFS will quickly
run out of disk space if no cleaner is running. The
second line on the graph shows FFS performance as a
function of file system utilization. As expected, FFS
shows no performance fluctuation as the disk
becomes fuller. With the exception of the history file,
every disk write in the benchmark merely overwrites
an existing disk block, so there is no allocation and
the fullness of the disk is irrelevant. Each data point
represents the average of 100 iterations whose
standard deviation was less than 1%.
In the absence of the cleaner, LFS provides
approximately 50% better performance than FFS. The
50% performance difference can be attributed to
LFS’s ability to perform the random writes as
sequential writes. In the LFS case, as dirty pages are
evicted from the user-level buffer cache, they are
copied into the file system cache. The dirty blocks
remain in the cache until the number of dirty blocks
exceeds a write threshold and LFS triggers a disk
write. With the current system configuration, this
triggering occurs when 115 blocks have accumulated
(representing 115 transactions). These transactions
progress at a rate limited by the time required to
randomly read the account records from disk. To read,
we must copy the page being read from the kernel into
the user cache and must also evict a page from the
user cache, copying it into the kernel. On our system,
these two copies take approximately 1.8 ms. With a
9.5 ms average seek, a 5.5 ms average rotational
delay, and a 1.6 ms transfer time, each random read
requires 18.4 ms for a throughput of 54 transactions
per second. Next, the segment must be flushed. The
115 data blocks are likely to have caused the 58
indirect blocks in the account file to be dirtied, so our
segment contains 115 data blocks, 58 indirect blocks,
one inode block, and one segment summary for a total
of approximately 700 KB. Using the bandwidth
numbers from Section 3.3, we can write the 700 KB at
a rate of 1.3 MB/sec for a total time of 0.5 seconds.
Therefore, processing 115 transactions requires
115 × 18.4 + 500 = 2616 ms, yielding 44 transactions
per second, within 7% of our measurement.
The calculation for FFS is much simpler:
throughput is limited by the performance of the
random reads and writes. Each random I/O requires a
9.5 ms seek, a 5.5 ms rotation, a 0.9 ms copy, and a
1.6 ms transfer for a total of 17.5 ms yielding
throughput of 28.6 transactions per second, within 7%
of our measurement.
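The two estimates can be condensed into a few lines
of arithmetic. This is our restatement of the model
above, using the timing parameters quoted in the text
(18.4 ms per LFS read including copies, roughly half
a second to flush the 700 KB segment, and 17.5 ms
per FFS random I/O with one read and one write per
transaction):

    # LFS: 115 random reads fill the cache, then one ~700 KB segment
    # write at ~1.3 MB/sec (~0.5 s).
    lfs_tps = 115 / (115 * 0.0184 + 0.5)       # ~44 TPS
    # FFS: each transaction is a random read plus a random write,
    # 17.5 ms each.
    ffs_tps = 1 / (2 * 0.0175)                 # ~28.6 TPS
    print(f"LFS ~{lfs_tps:.0f} TPS, FFS ~{ffs_tps:.1f} TPS")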
The introduction of the cleaner changes the
results substantially. At a file system utilization of
48%, LFS performance is comparable to FFS,
yielding a performance degradation of 34% due to the
cleaner. As the disk becomes more full, the impact
increases to approximately 41% degradation, as
observed in [10]. The good news is that this
performance is comparable to FFS, but the
unfortunate result is that any performance advantage
of LFS is already lost at a file system utilization of
only 48%.
[Figure 7: transactions per second versus disk
utilization (40% to 90%) for LFS with cleaner, LFS
without cleaner, and FFS.]
Figure 7. Transaction processing performance. While
LFS can potentially provide dramatically improved
performance, when the cleaner runs, its performance
is comparable to FFS. The performance is largely
independent of the disk utilization as the steady-state
cleaning overhead is dominated by segment read
time.
To understand the LFS performance, we must
examine the operation of the cleaner and its
interaction with the benchmark. In the steady state,
each segment dirtied by LFS requires that the cleaner
produce one clean segment. If we were insensitive to
response time and wished only to clean most
efficiently, we would run the benchmark until the disk
filled, then clean, and then restart the benchmark. This
should produce the best possible throughput in the
presence of the cleaner.
As discussed earlier, LFS fills segments at a rate
of 115 transactions per 700 KB or 168 transactions
per segment. For simplicity, call the database 256 MB
and the disk system 512 MB. This requires 256
segments, 43,000 transactions, or 1000 seconds at the
“no-cleaning” LFS rate. After the 1000 seconds have
elapsed, LFS must clean. If we wish to clean the
entire disk, we must read all 512 segments and write
256 new ones. Let us assume, optimistically, that we
can read segments at the full bus bandwidth (2.3
MB/sec) and write them at two-thirds of the disk
bandwidth (1.7 MB/sec), missing a rotation between
every 64 KB transfer. The cleaning process will take
223 seconds to read and 151 seconds to write for a
total of 374 seconds. Therefore, at 50% utilization our
best case throughput in the presence of the cleaner is
31.3 transactions per second. This is within 15% of
our measured performance.
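The bound works out as follows (our arithmetic,
under the assumptions just stated):

    txns       = 256 * 168               # ~43,000 transactions fill 256 segments
    run_time   = 1000.0                  # seconds at the no-cleaning LFS rate
    clean_time = 512 / 2.3 + 256 / 1.7   # read all 512 segments, write 256 back
    print(f"{txns / (run_time + clean_time):.1f} TPS")   # ~31.3 TPS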
Unfortunately, LFS cannot clean at the optimal
rate described above. First, the transaction response
would be unacceptably slow while the cleaner
stopped for six minutes to clean. Secondly, the
calculations above assumed that the disk is read
sequentially. Since the selection of segments is based
on Rosenblum’s cost-benefit algorithm [9], there is no
guarantee that collections of segments being cleaned
will be contiguous on disk. Thirdly, the history file in
the benchmark grows by 50 bytes per transaction, so
file system utilization increases during the test run.
Finally, cleaning requires that multiple segments be
cached in memory for processing, so we must limit
the number of segments cleaned simultaneously.
Since LFS cannot clean at its maximal rate, it
should clean at a rate that permits it to perform its
segment reads and writes at near-optimal speed. At
50% utilization, we should be able to read two dirty
segments and produce one clean segment. Reading
one megabyte requires a random seek (9.5 ms), one-
half rotation (5.5 ms), and a 1 MB transfer (435 ms)
for a total of 450 ms per segment read. Rewriting the
segment requires the same seek and rotation, but the
transfer requires 588 ms for a total of 603 ms for the
write or 1.5 seconds to clean the segment. In the
steady state, this cleaning must be done for each 168
transactions. Our throughput without the cleaner is 41
transactions per second, so it takes 4.1 seconds to
execute 168 transactions and 1.5 seconds to clean,
yielding 5.6 seconds or 30.0 TPS. This is within 10%
of our measured number.
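Again, the estimate reduces to a short calculation (our
restatement of the model in the text):

    seek_rot   = 0.0095 + 0.0055             # seek + half rotation, seconds
    read_seg   = seek_rot + 1.0 / 2.3        # ~0.45 s to read a 1 MB segment
    write_seg  = seek_rot + 1.0 / 1.7        # ~0.60 s to write a 1 MB segment
    clean_cost = 2 * read_seg + write_seg    # two dirty segments -> one clean
    batch      = 168 / 41.0 + clean_cost     # 168 transactions, then clean
    print(f"{168 / batch:.1f} TPS")          # ~30 TPS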
It can be argued that LFS loses performance
because it writes indirect blocks too frequently
(approximately once every three seconds in our
benchmarking environment). The current BSD-LFS
write policy assumes that when the number of dirty
buffers in the cache exceeds the write threshold (the
point at which the kernel triggers a segment write),
generating clean buffers is essential and it uses an
aggressive policy of writing all the dirty buffers to
disk. If the dirty indirect blocks were cached during
this benchmark, the number of dirty data blocks that
are allowed to accumulate in the cache would be
reduced and segment writes would occur more
frequently. While suboptimal for this benchmark, we
believe that flushing indirect blocks with their data
blocks is the correct default policy.
5 Effects of Free Space Fragmentation
on FFS Performance
Both LFS and FFS rely on the allocation of contigu-
ous disk blocks to achieve high levels of performance.
Because the results in Section 3 were obtained from
newly-created, empty file systems, there was no short-
age of contiguous extents of free space. On real sys-
tems, in use for extended periods of time (months or
years), the file system cannot expect to find such an
ideal arrangement of free space. LFS and FFS deal
with this reality in two different ways, both of which
cause performance overhead for the respective file
systems.
LFS relies on the cleaner process to garbage
collect old segments, creating large regions of free
space—clean segments. The cleaner imposes
overhead on the overall performance of LFS. Section
4 discusses this overhead in the context of a
transaction processing workload.
In contrast, FFS makes no assumptions about the
layout of free space on the file system. FFS uses
whatever free space is available on the disk,
contiguous or not. In fact, the block allocation policy
of FFS remained unchanged when clustering was
added [7]. FFS may not allocate contiguous blocks to
a file, even when contiguous free space is available.
As with the LFS cleaner, this may adversely affect
performance.
The fragmentation of free space in an FFS may
increase with time or with file system utilization. This
fragmentation can degrade performance as an FFS
ages. To assess this risk, we studied a collection of
FFS file systems on the file servers of the Division of
Applied Science at Harvard University over a period
of nine months to examine how the performance of
FFS file systems under real workloads differs from
the performance of the empty FFS file systems
typically used for benchmarking.
5.1 Data Collection
We collected data from fifty-one file systems on four
file servers. All of the file servers were SparcStations
running SunOS 4.1.3. Although the operating system
is substantially different from the 4.4BSD operating
system used elsewhere in this study, the file systems
are nearly identical. (The BSD-FFS clustering
enhancements were based on those originally imple-
mented in SunOS [7].)
Our data collection consisted of daily snapshots
recorded for every file system under study. A snapshot
is a summary of a file system’s meta-data, including
information about the size and configuration of the
file system, the age, size, and location of each file, and
a map of the locations of free blocks on the file
system.
In the interest of brevity, the presentation here is
limited to six representative file systems. The
remaining file systems in the study demonstrated
behavior similar to one or more of these. The
important attributes of these file systems are
summarized in Table 4.
Name   Server     Size (MB)  Age (months)  ncg  bpg   Description
cnews  das-news   565        31            49   1539  News articles and software
glan5  virtual12  1705       24            89   2596  Installed software and sources
staff  das-news   353        30            31   1539  System Administrators' accounts
usr4   speed      953        17            85   1520  User accounts
usr6   endor      953        14            85   1520  Course accounts; 1st-year grad accounts
white  endor      450        14            40   1520  User accounts for theory

Table 4: File system summary for FFS fragmentation
study. All file systems had an eight kilobyte block
size, maxcontig of seven, and rotdelay of 0. The Age
column indicates the file system age as of November
1, 1994. Subtract nine months to obtain the age at the
beginning of the study. The ncg and bpg columns
indicate the number of cylinder groups and the
number of file system blocks per cylinder group,
respectively.
5.2 Data Analysis
A separate study performed extensive analysis of this
snapshot data [12]. Examining this data in conjunc-
tion with the FFS block allocation algorithm provided
a variety of interesting information characterizing the
layout of FFS file systems. Some of the important
results are summarized here.
An evaluation of the FFS block allocation
algorithm showed that when a new file is created, FFS
attempts to allocate space for it starting from the
beginning of its cylinder group. The first free block in
a cylinder group is almost always allocated as the first
block of a new file. FFS attempts to extend the file
according to the constraints of the maxcontig and
rotdelay parameters. In practice, this means that as a
file grows, it uses free blocks in order from the
beginning of the cylinder group. File systems with a
non-zero rotdelay may occasionally skip blocks as
dictated by this parameter. All of the file systems
studied here had a rotdelay of zero.
Because of this allocation pattern, free space
within a cylinder group tends to be unevenly
distributed, with most of the free space located at the
end of the cylinder group. Furthermore, the free space
near the beginning of a cylinder group is more
fragmented than the free space near its end. This
uneven distribution of free space, combined with the
FFS block allocation algorithm, causes small multi-
block files to be more fragmented than larger files.
When a small file is created, FFS allocates space to it
starting from the beginning of a cylinder group. This
space is likely to be highly fragmented, resulting in a
fragmented file. Space for larger files is also allocated
from the beginning of a cylinder group, but a large file
is more likely to be able to take advantage of the well
clustered free space later in the cylinder group.
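A toy model of a cylinder group's free list illustrates
the effect. This is a deliberate simplification of our
own, not the FFS allocator (which also honors
maxcontig, rotdelay, and rotational layout), but it
captures the rule that each new file starts at the first
free block of its cylinder group:

    def allocate(free_blocks, nblocks):
        """Take the first nblocks entries of a sorted free-block list."""
        return free_blocks[:nblocks], free_blocks[nblocks:]

    # Fragmented space at the front of the group, contiguous space at the end.
    free = [3, 7, 12, 18] + list(range(40, 400))
    small, free = allocate(free, 2)    # [3, 7]: a two-block file with no
                                       # two consecutive blocks
    large, free = allocate(free, 32)   # [12, 18, 40, 41, ..., 69]: mostly
                                       # contiguous

In this model the two-block file receives only the
fragmented blocks near the front of the group, while
the 32-block file quickly reaches the well-clustered
free space, mirroring the contiguity figures reported in
the next paragraph.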
The data collected by the snapshots showed that
only 37% of the two block files allocated on twelve
different file systems were allocated to two
consecutive blocks on the file system. In contrast,
80% of the blocks allocated to 256K (32 block) files
were contiguous.
The amount of fragmentation in a file is of critical
importance because it is one of the primary factors
determining file system performance when reading or
writing the file.
5.3 Performance Tests
Because the SunOS file systems we studied were in
active use, it was not feasible to run benchmarks on
them. Instead, the meta-data in the snapshots was
used to reconstruct the file systems on the disk of a
test machine. This not only allowed the file systems to
be analyzed in an otherwise quiescent environment,
but also made it easier to study the performance of
comparable empty file systems. In the following dis-
cussion, the term “original file system” is used to refer
to the actual file systems on the file servers, and “test
file system” or “copied file system” is used for the file
systems reproduced on the test machine. The bench-
marks described in this section were run on a SparcS-
tation I with 32 megabytes of memory, running
4.4BSD-Lite. The details of the system are summa-
rized in Table 5.
In order to approximate the original file systems’
configurations as closely as possible, the test file
systems were created using the same values of
rotdelay and maxcontig. The test file systems were
also configured with the same number of cylinder
groups as the corresponding original file systems.
When different sized cylinders on the original and test
disks made it impossible to precisely duplicate the
size of the cylinder groups, slightly larger cylinder
groups were created on the test file system. The extra
blocks in each cylinder group were marked as
allocated, so the test file system could not use them.

FFS Fragmentation Benchmark Configuration
CPU Parameters
  CPU                  SparcStation I
  MHz                  20
Disk Parameters
  Disk Type            Fujitsu M2694EXA
  RPM                  5400
  Sector Size          512 bytes
  Sectors per Track    68-111
  Cylinders            1818
  Tracks per Cylinder  15
  Track Buffer         512 KB
  Average Seek         9.5 ms

Table 5: Fragmentation benchmark configuration.
Because all of our file system benchmarks rely on
the creation and use of new files, the most important
characteristic of the original file systems to reproduce
is the arrangement of free space. Other details, such as
the precise mapping of allocated blocks to files, and
the directory hierarchy, are less important, because
they have little or no impact on the layout or the
performance of newly created files. Therefore the
only meta-data copied from the snapshots of the
original file systems was the free space bitmap for
each cylinder group. The resulting test file system
contained the root directory and no other data, but
could only utilize the same free blocks as the original
file system.
The benchmarks from Section 3 were used to
compare these reconstructed file systems to
comparable empty file systems. The latter were
created in the same manner described above, but their
free space bitmaps were not modified. The create and
read phases of the sequential I/O benchmark were
used to analyze the performance of the file systems.
One minor modification was made to the benchmark
program for these tests. Instead of creating one
hundred files per directory, the benchmark used here
creates only twenty-five files per directory. This
configuration stresses the file system more and the
CPU less since it reduces the processing overhead in
finding directory entries and causes FFS to create
more directories, which are distributed among the
cylinder groups.
These results presented here show the
performance for three different file sizes (eight
kilobytes, sixty-four kilobytes, and one megabyte)
and four different dates spread over the nine month
measurement period.
5.4 Test Results
For each file size tested, the throughput for reading
and creating the copied file systems was compared to
the throughput for the corresponding empty file sys-
tem. The graphs in Figure 8 show the performance of
each of the test file systems as a percentage of the
throughput of the corresponding empty file system.
The disk utilization (percentage of total blocks that
are allocated) for each file system is also displayed.
Ultimately, the performance of these benchmarks
depends on the layout of the test files. Files that are
laid out contiguously on the disk can be read and
written quickly. In contrast, files whose blocks are
scattered across the disk do not perform as well.

[Figure 8: six panels (cnews, staff, usr6, glan5, usr4,
white), each plotting relative performance (%) for the
read and create tests at 8 KB, 64 KB, and 1024 KB
file sizes, along with file system utilization, across
the Jan, Apr, Jul, and Oct snapshot dates.]
Figure 8. Effect of fragmentation on FFS
performance. The graphs display the performance of
the six test file systems as a percentage of the
performance of a comparable empty FFS on the same
disk. The utilization of each file system for each test
date is also shown. The benchmarks were run using
snapshots of the file systems taken on the 26th day of
the designated months (except for the October
snapshots for cnews and staff, which were taken on
the 20th). Most file systems exhibit little deterioration
over the time period, performing at 85-90% of the
empty file system performance. Where changes
occur, they often correlate to utilization (e.g. cnews,
staff, white, usr4). We do not yet see a correlation
between file system age and performance
degradation.
Achieving optimal file layout requires contiguous free
space on the file system. Thus, anything that causes
free space to become fragmented will degrade the
performance of a file system. There are three factors
that may contribute to the fragmentation of free space
in an FFS: high file turnover, high utilization, and file
system age.
The news partition, cnews, is an example of a file
system with high turnover. As a repository for net-
news articles, most of the files on this file system are
extremely small, and there is a high rate of file
turnover. This should cause its free space to rapidly
become fragmented. This hypothesis is supported by
the fact that approximately two weeks after the
January copy of this file system was made, cnews
reported that it was out of disk space, but an
examination of the file system showed that although
there were no free blocks, there was 92 megabytes of
free space (16% of the file system)—all of it in
fragments.
Not surprisingly, the January tests on cnews
exhibited the greatest performance differences of any
of the file systems. For five out of the six test cases,
performance was less than 80% of that of the empty
cnews file system. The single greatest performance
difference in all of the tests was in the 64 KB write
test on cnews. This test achieved only 71% of the
empty file system performance.
In contrast, glan5 is a file system with very little
file turnover. It is used to store the sources and
binaries for various software packages. Thus, files on
this file system are created or deleted only when new
software is installed or when existing software is
upgraded. Not surprisingly, glan5 performed better
than the other file systems. Fifteen of the twenty-four
tests on copies of glan5 achieved 95-100% of the
performance of the empty glan5 file system.
The second factor that contributes to
fragmentation is high disk utilization. A highly
utilized file system has few free blocks, which are
unlikely to be well-clustered. The effect of disk
utilization on file system performance is demonstrated
by several of the file systems in Figure 8. Nearly all of
the large changes in utilization are accompanied by
inverse changes in performance (see cnews and staff
for particularly noticeable correlation). Although
performance drops as the file systems become full, the
performance can be regained by removing a fraction
of the files.
The third parameter affecting FFS fragmentation
is the age of the file system. Even if a file system has a
light workload with little file turnover, after several
years the cumulative effect of these small changes
should be comparable to a high file turnover on a
younger file system. The data provides some evidence
for this phenomenon, but it is inconclusive. The two
oldest file systems (cnews and staff) are the only file
systems where performance on the copied file systems
was less than 80% of the performance of the empty
file system. As discussed above, however, part of the
cnews performance should be attributed to its usage
pattern rather than its age. Although several test cases performed poorly on the staff file system, other tests run on the same dates performed as well on staff as on the younger file systems. In our department, file systems "turn over" approximately every three years; that is, disks are replaced, so the data are dumped and restored. An informal poll indicated that this three-year turnaround is fairly typical and coincides with IRS regulations concerning equipment depreciation.
5.5 File Size and Performance
The graphs in Figure 8 indicate that there are perfor-
mance differences for the different file sizes tested.
Many of the largest differences between the copied
file systems and empty file systems occur in the 64
KB test. The 8 KB and 1024 KB tests were less sensitive to file system age, utilization, and turnover rate.
For each of the six file systems, we ran read and create tests for each of three file sizes on four different dates, for a total of 144 tests (6 file systems × 2 operations × 3 file sizes × 4 dates). Of the forty-eight tests
that used a 1024 KB file size, only two performed at
less than 85% of the corresponding empty file
system’s bandwidth. These two points were for
January on the cnews partition. Excluding the January tests on cnews, which were generally worse than any other test, only one of the forty-six remaining 8 KB tests achieved less than 85% performance (the April write test on glan5, at 84.2%). Of the forty-eight tests using 64 KB files,
nine performed at less than 85% of the empty file
system’s throughput. Of the remaining thirty-nine,
thirty-five of them were over 90%.
There are a variety of reasons for the
performance differences among the different file
sizes. All of the file sizes suffer from the increased
fragmentation of free space on the copied file
systems. This is most noticeable in the 64 KB files
because of ameliorating circumstances in the 8 KB
and 1024 KB tests.
The large size of the 1024 KB files allows FFS to
perform more read-ahead than for the smaller file
sizes. Read-ahead helps to offset the cost of
fragmentation by initiating a read (and possibly an
accompanying seek) before the data is needed. FFS
benefits from read-ahead within a 64 KB file, but does
not perform predictive reading of different files.
However, on the empty file system, the beginning of
the next file will be in the disk’s track buffer (if the
file is in the same cylinder group and begins on the
same track) and can be accessed rapidly. On the
copied file system, a seek is often required to access
the first block of the next file. This seek is not initiated
until the benchmark program has issued the
corresponding read system call. The 1024 KB file
benchmark does not demonstrate this phenomenon
because FFS changes to a new cylinder group after
allocating the first twelve blocks of a file. In both the
empty and copied file systems, a seek is required to
move from the end of one file to the beginning of the
next.
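The twelve-block behavior described above amounts to a simple placement rule. The following sketch is an illustrative simplification, not the actual FFS allocator: the first twelve direct blocks of a file stay in the cylinder group that holds its inode, and subsequent blocks are steered to another group, which is why even the empty-file-system 1024 KB test pays a seek between files. The pick_another_cg() helper and the group count are hypothetical.

    #include <stdio.h>

    #define NDADDR 12   /* direct blocks per FFS inode */
    #define NCG    32   /* hypothetical number of cylinder groups */

    /* Placeholder for FFS's search for a cylinder group with free space. */
    static int
    pick_another_cg(int current_cg)
    {
        return (current_cg + 1) % NCG;
    }

    /*
     * Simplified placement rule: blocks 0..11 stay near the inode;
     * later blocks (reached via indirect blocks) move to a new group,
     * so a file that grows past 12 blocks (96 KB at 8 KB per block)
     * always incurs at least one seek.
     */
    static int
    preferred_cylinder_group(int inode_cg, long logical_block)
    {
        return (logical_block < NDADDR) ? inode_cg : pick_another_cg(inode_cg);
    }

    int
    main(void)
    {
        long blk;

        for (blk = 10; blk <= 13; blk++)
            printf("block %ld -> cylinder group %d\n",
                blk, preferred_cylinder_group(5, blk));
        return 0;
    }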
The fragmentation of free space has little impact
in the 8 KB test case. The benchmark creates
directories containing twenty-five files. Each
directory is placed in a different cylinder group. Since
the block size is 8 KB for all of the file systems, the 8
KB test case will read or write twenty-five blocks of
data to one cylinder group, then seek to a different
cylinder group to start the next directory. Thus, the 8
KB test spends a larger portion of its time performing
seeks than either of the other test sizes. Increased
fragmentation in the copied file systems means that
the files in a given directory are typically spread out
more than on an empty file system. Because of the
large number of seeks between cylinder groups,
however, the amount of overhead introduced by this
fragmentation has a smaller impact on overall
performance than in the 64 KB test case.
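For concreteness, the small-file phase described here has roughly the following shape. This is not the authors' benchmark source (which is available separately; see Section 7); it is a minimal sketch assuming 8 KB files, twenty-five files per directory, and a hypothetical directory count, intended only to show why the workload alternates twenty-five single-block transfers with an inter-directory, and therefore inter-cylinder-group, seek.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NDIRS          20           /* hypothetical number of directories */
    #define FILES_PER_DIR  25           /* as described in the text */
    #define FILE_SIZE      (8 * 1024)   /* 8 KB: one FFS block per file */

    int
    main(void)
    {
        char path[64], buf[FILE_SIZE];
        int d, f, fd;

        memset(buf, 'x', sizeof(buf));
        for (d = 0; d < NDIRS; d++) {
            snprintf(path, sizeof(path), "dir%02d", d);
            /* FFS places each new directory in a different cylinder
             * group, so crossing directories implies a seek. */
            if (mkdir(path, 0755) < 0 && errno != EEXIST) {
                perror(path);
                exit(1);
            }
            for (f = 0; f < FILES_PER_DIR; f++) {
                snprintf(path, sizeof(path), "dir%02d/file%02d", d, f);
                fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0 || write(fd, buf, sizeof(buf)) != FILE_SIZE) {
                    perror(path);
                    exit(1);
                }
                close(fd);
            }
        }
        return 0;
    }

The read variant would walk the same directory tree with read() in place of write(); in either case only one 8 KB block moves per file before the benchmark steps to the next file or directory.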
5.6 Benchmark Summary
Performance tests using real-world file systems indi-
cate that there is justification for concerns over the
performance impact of fragmentation in FFS. Such
concerns should be mitigated, however, by many of
the results of these tests. Although the greatest perfor-
mance differences between a real file system and an
empty one were almost 30%, most file systems
showed far smaller differences, especially for large
files. According to Baker et al. [1], over half the bytes
transferred to/from disk come from files over 1 MB in
size. The measured performance differences for such files are less than 15%, and in 70% of our tests, large files performed at 95% or better on the copied file systems.
It is worth noting that many of the greatest
performance differences between real and empty file
systems occurred on a file system that demonstrates
worst case behavior in almost every way. The cnews
file system holds small, rapidly replaced files. This
file system suffered an unusual fragmentation failure
right after these large performance differences were
noted. It is difficult to imagine a file system that
would incur a greater fragmentation penalty.
6 Conclusions and Future Work
Our results show that the comparison of FFS and LFS
is not an easy one. FFS performance can be modified
substantially by tweaking parameters such as maxcon-
tig, rotdelay, and cpc.
Unquestionably, when meta-data operations are
the bottleneck, LFS provides superior performance to
FFS. When creating files of one kilobyte or less, or
when deleting files of 64 KB or less, LFS provides an
order-of-magnitude performance improvement. This
improvement comes in part from LFS’s disk layout,
and in part from the asynchronous implementation of
these operations. There are several alternatives for
providing asynchronous meta-data operations,
including journaling file systems [4] and ordering
updates. (Using an ordered update approach, Ganger
reports a factor of five to six improvement in FFS
meta-data update performance [3].)
When LFS cleaner overhead is ignored, and FFS runs on a new, unfragmented file system, each file system has regions of performance dominance:
• LFS is an order of magnitude faster on small file creates and deletes.
• The systems are comparable on creates of large files (one-half megabyte or more).
• The systems are comparable on reads of files less than 64 kilobytes.
• LFS read performance is superior between 64 kilobytes and four megabytes, after which FFS is comparable.
• LFS write performance is superior for files of 256 kilobytes or less.
• FFS write performance is superior for files larger than 256 kilobytes.
Cleaning overhead can degrade LFS performance by
more than 34% in a transaction processing
environment. Fragmentation can degrade FFS
performance, over a two to three year period, by at
most 15% in most environments but by as much as
30% in file systems such as a news partition.
There is more work to be done. The effects of
cleaning on LFS in other environments are still not
fully understood. Trace analysis indicates that in a
network of workstations environment, there may be
sufficient idle time that cleaning can be accomplished
without I/O penalty [2].
7 Availability
The file system and benchmark source code and trace
data are available and freely redistributable. Send
electronic mail to mar[email protected]d.edu.
8 Acknowledgments
Many people helped make this paper possible. We
would like to thank Carl Staelin and Kirk McKusick
who provided considerable help in maintaining our
systems at Berkeley; Janusz Juda who ran all our data
gathering scripts for months on end; John Ousterhout
who originally suggested this work and offered much
advice and criticism; Ken Lutz who produced disk specifications with amazing rapidity; and Diane Tang, Larry McVoy, John Wilkes, and David Patterson who made many useful suggestions about our presentation.
9 References
[1] Baker, M., Hartman, J., Kupfer, M., Shirriff, K.,
Ousterhout, J., “Measurements of a Distributed
File System,” Proceedings of the 13th Sympo-
sium on Operating System Principles, Pacific
Grove, CA, October 1991, 192-212.
[2] Blackwell, T., Harris, J., Seltzer, M., “Heuristic
Cleaning Algorithms in LFS,” Proceedings of the
1995 Usenix Technical Conference, New
Orleans, LA, January 1995.
[3] Ganger, G., Patt, Y., “Metadata Update Perfor-
mance in File Systems,” Proceedings of the First
Usenix Symposium on Operating System Design
and Implementation, Monterey, CA, November,
1994, 49-60.
[4] Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., West, M., "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems 6, 1 (February 1988), 51-81.
[5] Lieberman, H., Hewitt, C., “A real-time garbage
collector based on the lifetimes of objects,” Com-
munications of the ACM, 26, 6, 1983, 419-429.
[6] McKusick, M., Joy, W., Leffler, S., Fabry, R., "A Fast File System for UNIX," ACM Transactions on Computer Systems 2, 3 (August 1984), 181-197.
[7] McVoy, L., Kleiman, S., “Extent-like Perfor-
mance from a UNIX File System,” Proceedings
of the 1990 Summer Usenix, Anaheim, CA, June
1990, 137-144.
[8] Rosenblum, M., Ousterhout, J., “The LFS Stor-
age Manager,” Proceedings of the 1990 Summer
Usenix, Anaheim, CA, June 1990, 315-324.
[9] Rosenblum, M., Ousterhout, J., “The Design and
Implementation of a Log-Structured File Sys-
tem,” ACM Transactions on Computer Systems
10, 1 (February 1992), 26-52.
[10] Seltzer, M., Bostic, K., McKusick, M., Staelin, C., "The Design and Implementation of the 4.4BSD Log-Structured File System," Proceedings of the 1993 Winter Usenix, San Diego, CA, January 1993.
[11] Seltzer, M., “Transaction Support in a Log-Struc-
tured File System,” Proceedings of the Ninth
International Conference on Data Engineering,
Vienna, Austria, April 1993.
[12] Smith, K. A., Seltzer, M., "File Layout and File System Performance," Harvard Division of Applied Sciences Technical Report, 1994.
[13] Transaction Processing Performance Council,
“TPC Benchmark B Standard Specification,”
Waterside Associates, Fremont, CA., August
1990.