
07-02-2006, 02:21 PM
Super Moderator
Join Date: Jul 2005
Location: Estonia
Posts: 3,610
Journaled file system on FreeBSD
Quote:
Hello.
For the last few months I have been working on the gjournal project.
To avoid confusion right away: this project is not related to the
gjournal project that Ivan Voras worked on during the last SoC (2005).
The lack of a journaled file system has been FreeBSD's Achilles' heel
for many years. We do have many file systems, but none with journaling:
- ext2fs (journaling is in ext3fs),
- XFS (read-only),
- ReiserFS (read-only),
- HFS+ (read-write, but without journaling),
- NTFS (read-only).
GJournal was designed to journal GEOM providers, so it actually works
below the file system layer, but it has hooks that allow it to cooperate
with file systems. In other words, gjournal is not file system-dependent;
it can probably work with any file system while knowing very little
about it. I implemented only UFS support.
The patches are here:
http://people.freebsd.org/~pjd/patches/gjournal.patch (for HEAD)
http://people.freebsd.org/~pjd/patches/gjournal6.patch (for RELENG_6)
To patch your sources you need to:
# cd /usr/src
# mkdir sbin/geom/class/journal sys/geom/journal sys/modules/geom/geom_journal
# patch < /path/to/gjournal.patch
Add 'options UFS_GJOURNAL' to your kernel configuration file and
recompile the kernel and world.
How it works (in short). You may define one or two providers for
gjournal to use. If one provider is given, it is used for both data and
journal. If two providers are given, one is used for data and the other
for the journal.
Every few seconds (the interval is configurable) the journal is terminated
and marked as consistent, and gjournal starts copying data from it to the
data provider. At the same time, new data is stored in a new journal.
Let's call the moment at which the journal is terminated a "journal switch".
Journal switch looks as follows:
1. Start the journal switch when the timeout expires or when we run out
of cache. Don't perform a journal switch if there were no write requests.
2. If we have a file system, synchronize it.
3. Mark the file system as clean.
4. Block all write requests to the file system.
5. Terminate the journal.
6. If copying of the previous journal has not finished yet, wait for it
to finish.
7. Send a BIO_FLUSH request (if the given provider supports it).
8. Mark the new journal position on the journal provider.
9. Unblock write requests.
10. Start copying data from the terminated journal to the data provider.
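As a rough illustration of the steps above, here is a toy in-memory model in Python. It is not gjournal's code: the class and method names are made up, file system synchronization, write blocking and BIO_FLUSH (steps 2-4, 7 and 9) are left out, and the crash-recovery rule (replay only a terminated journal, drop the unterminated one) is an assumption about how a consistent journal would be used.

```python
class ToyJournal:
    """Toy model of a journal switch; all names here are hypothetical."""

    def __init__(self):
        self.data = {}        # data provider: offset -> payload
        self.active = []      # journal currently receiving writes
        self.terminated = []  # previous journal, waiting to be copied

    def write(self, offset, payload):
        # Every write lands in the active journal first.
        self.active.append((offset, payload))

    def copy_terminated(self):
        # Step 10 (and the wait in step 6): copy the terminated journal
        # to the data provider, in the order the writes arrived.
        for offset, payload in self.terminated:
            self.data[offset] = payload
        self.terminated = []

    def switch(self):
        # Step 1: no switch if there were no write requests.
        if not self.active:
            return False
        # Step 6: the previous journal must be fully copied first.
        self.copy_terminated()
        # Steps 5 and 8: terminate the journal and start a new one.
        self.terminated, self.active = self.active, []
        return True

    def recover(self):
        # ASSUMPTION: after a crash only the terminated (consistent)
        # journal is replayed; the unterminated one is discarded.
        self.active = []
        self.copy_terminated()


j = ToyJournal()
j.write(0, b"a")
j.switch()
j.write(512, b"b")  # goes to the new active journal
j.recover()         # simulated crash before the next switch
print(j.data)       # {0: b'a'}
```

In this model the write at offset 512 is lost by the simulated crash, but the data provider never ends up in a half-updated state.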
There were a few things I needed to implement outside gjournal to make it
work reliably:
- The BIO_FLUSH request. Currently we have three I/O requests: BIO_READ,
BIO_WRITE and BIO_DELETE. I added BIO_FLUSH, which means "flush your
write cache". The request is always sent with the largest bio_offset set
(the mediasize of the destination provider), so it works properly with
bioq_disksort(). The caller needs to stop further I/O requests until
BIO_FLUSH returns, so there is no starvation effect.
The hard part is that it has to be implemented in every disk driver,
because flushing the cache is a driver-dependent operation. I implemented
it for ata(4) disks and amr(4). The good news is that it's easy.
GJournal can also work with providers that don't support BIO_FLUSH, and
in my power-failure tests it worked well (no problems), but that depends
on the gjournal cache being bigger than the controller cache, so it
is hard to call it reliable.
You can read in the documentation of many journaled file systems that you
should turn off the write cache if you want to use them. This is not the
case for gjournal (especially when your disk driver supports BIO_FLUSH).
- The 'gjournal' mount option. To implement gjournal support in UFS I
needed to change how deleted but still open objects are handled.
Currently, when a file or directory is open and we delete the last
name that references it, it remains usable by those who keep it
open. When the last consumer closes it, the inode and blocks are freed.
On a journal switch I cannot leave such objects around, because after a
crash fsck(8) is not used to check the file system, so the inode and
blocks would never be freed. When the file system is mounted with the
'gjournal' mount option, such objects are not removed while they are open.
When the last name is deleted, the file/directory is moved to the .deleted/
directory and removed from there on last close.
This way, I can just clean the .deleted/ directory after a crash at
mount time.
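The deferred delete described above can be sketched with a toy Python model. This is only an illustration, not the UFS implementation: the class, its fields and its methods are hypothetical, and hard links are ignored (each inode has exactly one name).

```python
class ToyFS:
    """Toy model of deleting open files via a .deleted/ directory."""

    def __init__(self):
        self.names = {}       # visible name -> inode number
        self.deleted = {}     # entries parked in .deleted/: inode -> old name
        self.open_count = {}  # inode -> number of open handles
        self.inodes = set()   # allocated inodes
        self._next = 1

    def create(self, name):
        ino = self._next
        self._next += 1
        self.inodes.add(ino)
        self.names[name] = ino
        return ino

    def open(self, name):
        ino = self.names[name]
        self.open_count[ino] = self.open_count.get(ino, 0) + 1
        return ino

    def unlink(self, name):
        ino = self.names.pop(name)
        if self.open_count.get(ino, 0) > 0:
            self.deleted[ino] = name  # still open: move to .deleted/
        else:
            self.inodes.discard(ino)  # not open: free right away

    def close(self, ino):
        self.open_count[ino] -= 1
        if self.open_count[ino] == 0 and ino in self.deleted:
            del self.deleted[ino]     # last close: remove from .deleted/
            self.inodes.discard(ino)

    def mount_after_crash(self):
        # Nothing is open after a crash, so .deleted/ can simply be emptied.
        self.open_count.clear()
        for ino in list(self.deleted):
            del self.deleted[ino]
            self.inodes.discard(ino)


fs = ToyFS()
ino = fs.create("tmpfile")
fs.open("tmpfile")
fs.unlink("tmpfile")      # last name removed while the file is open
fs.mount_after_crash()    # crash before close: cleanup happens at mount
print(ino in fs.inodes)   # False
```

Because every such object is reachable through .deleted/, the cleanup at mount time only has to walk one directory instead of checking the whole file system.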
Quick start:
# gjournal label /dev/ad0
# gjournal load
# newfs /dev/ad0.journal
# mount -o async,gjournal /dev/ad0.journal /mnt
(yes, with gjournal 'async' is safe)
Now, after a power failure or system crash no fsck is needed (yay!).
There are two hacks in the current implementation that I'd like to
reimplement. The first is how the 'gjournal' mount option is implemented.
There is a garbage collector thread which is responsible for deleting
objects from the .deleted/ directory, and it uses full paths. Because
of this, when your mount point is /foo/bar/baz and you rename 'bar' to
something else, it will stop working. This is not something people do
often, but it definitely should be fixed and I'm working on it. The
second hack is related to communication between gjournal and the file
system. GJournal decides when to make the switch and has to find the file
system that is mounted on it. Looking for this file system is not nice
and should be reimplemented.
There are some additional benefits that come with gjournal. For example,
if gjournal is configured on top of gmirror or graid3, there is no need to
synchronize the mirror/raid3 device even after a power failure or system
crash, because the data will be consistent.
I spent a lot of time working on gjournal optimization. Because I have a
few seconds before the data hits the data provider, I can do things
like combining smaller write requests into larger ones, ignoring data
written twice to the same place, etc.
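A minimal Python sketch of those two optimizations, assuming writes are plain (offset, bytes) pairs and working byte by byte for simplicity. The function name is hypothetical, and a real implementation would more likely work on sector-sized requests.

```python
def optimize_writes(writes):
    """Collapse journaled writes before copying them to the data provider.

    `writes` is a list of (offset, payload) pairs in arrival order.
    Data written twice to the same place keeps only the last version,
    and ranges that touch each other are combined into one request.
    """
    image = {}
    for offset, payload in writes:
        for i, byte in enumerate(payload):
            image[offset + i] = byte  # the last writer wins

    merged = []
    for pos in sorted(image):
        if merged and merged[-1][0] + len(merged[-1][1]) == pos:
            merged[-1][1].append(image[pos])  # extends the previous range
        else:
            merged.append((pos, bytearray([image[pos]])))
    return [(offset, bytes(buf)) for offset, buf in merged]


requests = [(0, b"aaaa"), (4, b"bbbb"), (2, b"XX")]
print(optimize_writes(requests))  # [(0, b'aaXXbbbb')]
```

Here three small requests, one of them overwriting part of another, become a single larger request.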
Thanks to these optimizations, operations on small files are quite fast.
On the other hand, operations on large files are slower, because I need to
write the data twice and there is no room for optimization. Here are some
numbers.
gjournal(1) - the data provider and the journal provider on the same disk
gjournal(2) - the data provider and the journal provider on separate
disks
Copying one large file:
UFS: 8s
UFS+SU: 8s
gjournal(1): 16s
gjournal(2): 14s
Copying eight large files in parallel:
UFS: 120s
UFS+SU: 120s
gjournal(1): 184s
gjournal(2): 165s
Untarring eight src.tgz in parallel:
UFS: 791s
UFS+SU: 650s
gjournal(1): 333s
gjournal(2): 309s
Reading. grep -r on two src/ directories in parallel:
UFS: 84s
UFS+SU: 138s
gjournal(1): 102s
gjournal(2): 89s
As you can see, even on one disk, untarring eight src.tgz is about two
times faster than with UFS+SU. I have no idea why gjournal is faster than
UFS+SU at reading.
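To put the numbers above side by side, here is a short Python snippet that computes each gjournal result relative to UFS+SU. The timings are copied from the lists above; the dictionary layout and the helper name are only for illustration.

```python
# Timings in seconds, copied from the benchmark lists above.
timings = {
    "copy one large file": {
        "UFS": 8, "UFS+SU": 8, "gjournal(1)": 16, "gjournal(2)": 14},
    "copy eight large files": {
        "UFS": 120, "UFS+SU": 120, "gjournal(1)": 184, "gjournal(2)": 165},
    "untar eight src.tgz": {
        "UFS": 791, "UFS+SU": 650, "gjournal(1)": 333, "gjournal(2)": 309},
    "grep -r on two src/": {
        "UFS": 84, "UFS+SU": 138, "gjournal(1)": 102, "gjournal(2)": 89},
}


def speedup(test, baseline, candidate):
    """Return how many times faster `candidate` is than `baseline`.

    Values above 1 mean the candidate is faster, below 1 slower.
    """
    row = timings[test]
    return row[baseline] / row[candidate]


for test in timings:
    for candidate in ("gjournal(1)", "gjournal(2)"):
        ratio = speedup(test, "UFS+SU", candidate)
        print(f"{test:24s} {candidate} vs UFS+SU: {ratio:.2f}x")
```

For the untar test this gives roughly 1.95x on one disk and 2.10x on two disks, while copying one large file comes out at 0.50x and 0.57x, which matches the "written twice" cost described earlier.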
There are a bunch of sysctls to tune gjournal (the kern.geom.journal tree).
When only one provider is given for both data and journal, the journal
part is placed at the end of the provider, so one can still use the file
system without journaling. If you use such a configuration (one disk), it is
better for performance to place the journal before the data, so you may want
to create two partitions (e.g. 2GB for ad0a and the rest for ad0d) and
create the gjournal this way:
# gjournal label ad0d ad0a
Enjoy!
The work was sponsored by home.pl (http://home.pl).
The work was done by Wheel LTD (http://www.wheel.pl).
The work was tested in the netperf cluster.
I want to thank Alexander Kabaev (kan@) for the help with VFS and
Mike Tancsa for test hardware.
--
Pawel Jakub Dawidek http://www.wheel.pl
pjd@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!
__________________
"All parts should go together without forcing. Therefore, if you can't get them together again, there must be a reason. By all means, do not use a hammer." -- IBM maintenance manual, 1975

07-02-2006, 03:48 PM
Senior Member
Join Date: Sep 2005
Location: .co.ZA
Posts: 151
Yay! Finally....

07-02-2006, 05:22 PM
Senior Member
Join Date: Nov 2005
Location: Ga. USofA
Posts: 7,906
Interesting

07-02-2006, 07:03 PM
Senior Member
Join Date: Sep 2005
Location: .co.ZA
Posts: 151
This should be added to the roadmap, since it eliminates Fscking Fscks :P

07-03-2006, 05:26 AM
Member
Join Date: Feb 2006
Posts: 33
I just did a buildworld and kernel build with the patch (on a FreeBSD 6.1-STABLE machine), then followed the instructions, and everything is working as expected. I even did a few "emergency" hard resets and the system boots without needing fsck. However, I haven't yet figured out how to load the provider (gjournal) automatically across reboots. I tried editing /boot/loader.conf, /etc/rc.conf and /etc/fstab with no luck.
Later edit: it was my fault. One must add geom_journal_load="YES" to /boot/loader.conf to enable journaling across reboots.

07-03-2006, 12:42 PM
Senior Member
Join Date: May 2006
Location: Greater State of Northern Kaliforneea
Posts: 2,880
Probably as a shell script in /etc/rc.d with an enable flag in /etc/rc.conf or /boot/loader.conf.

07-03-2006, 04:04 PM
Super Moderator
Join Date: Jul 2005
Location: Estonia
Posts: 3,610
Originally Posted by gammaray
I just did a buildworld and kernel build with the patch (on a FreeBSD 6.1-STABLE machine), then followed the instructions, and everything is working as expected. I even did a few "emergency" hard resets and the system boots without needing fsck. However, I haven't yet figured out how to load the provider (gjournal) automatically across reboots. I tried editing /boot/loader.conf, /etc/rc.conf and /etc/fstab with no luck.
Later edit: it was my fault. One must add geom_journal_load="YES" to /boot/loader.conf to enable journaling across reboots.
Quote:
You can configure gjournal on an existing file system, but, as always,
the last sector will be used for metadata.
For example, say you have your file system on ad0s1d and swap on ad0s1b.
You can try to configure gjournal this way:
Code:
# swapoff /dev/ad0s1b
# umount /dev/ad0s1d
# gjournal label ad0s1d ad0s1b
Your swap partition should be at least 2GB if your file system will be heavily loaded. Be warned that this overwrites the last sector of ad0s1d, which should be safe, but you never know. It is not yet possible to use gjournal for the root file system.

07-03-2006, 11:13 PM
Super Moderator
Join Date: Jul 2005
Location: Estonia
Posts: 3,610
So, I tested journaling to see how this feature is used:
Code:
# swapoff /dev/ad0s2b
# umount /dev/ad0s4
# gjournal label ad0s4 ad0s2b
# mount -o async,gjournal /dev/ad0s4.journal /mnt/ad0s4
# df -h
/dev/ad0s4.journal 56G 24G 27G 47% /mnt/ad0s4
# mount
/dev/ad0s4.journal on /mnt/ad0s4 (ufs, asynchronous, local, gjournal)
Read speed is indeed better than with standard UFS, but writing is about 10% slower. I still have to back up all files from ad0s4 and give the swap partition back to swap; then I can permanently prepare ad0s4 for journaling:
add geom_journal_load="YES" to /boot/loader.conf
Code:
# reboot
# gjournal label /dev/ad0s4
# newfs /dev/ad0s4.journal

07-04-2006, 05:48 AM
Member
Join Date: Feb 2006
Posts: 33
All of those commands are run in single-user mode, I assume?
Otherwise you can't work that way on a busy partition (e.g. /usr).
I tested gjournal on a partition outside the FreeBSD slice and everything was fine.
Yesterday I did something wrong while trying to set up gjournal on /usr, and the whole operation just wiped out the entire label :evil: After that, there it was: a nice, journaled and empty /dev/adxxx.
Fortunately I had a backup clone of that label prepared, as always when doing critical things to the system. :lol:

07-04-2006, 06:18 AM
Super Moderator
Join Date: Jul 2005
Location: Estonia
Posts: 3,610
Originally Posted by gammaray
All of those commands are run in single-user mode, I assume?
Otherwise you can't work that way on a busy partition (e.g. /usr).
I tested gjournal on a partition outside the FreeBSD slice and everything was fine.
Yesterday I did something wrong while trying to set up gjournal on /usr, and the whole operation just wiped out the entire label :evil: After that, there it was: a nice, journaled and empty /dev/adxxx.
Fortunately I had a backup clone of that label prepared, as always when doing critical things to the system. :lol:
I made a Ghost image before I started "ruining" partition labels.
BTW, Ghost recognizes FreeBSD partition labels, aka "a5", but it can't read the file system itself, so if you want smaller compressed images, fill the unused space with zeroes first.
Code:
# dd if=/dev/zero of=/0bits bs=20971520 # bs=20m
# rm /0bits