Thursday, November 29, 2007

How to test and monitor disk performance

In my RAID performance article I've given a lot of performance numbers, but it is indeed a good question how to actually measure disk performance. To do this you need 2 tools: one that does the actual benchmark and one that measures the different system parameters, so you can know the impact of the benchmark.

For the benchmark tool itself I usually use 3 tools. The first one is good old dd. The problem with dd is that it can only do 1 type of benchmark (sequential read and write), but it can do this to and from many sources and that is its strength.
So how do you use dd for a disk benchmark? Well, usually like this:
dd if=/dev/zero of=/dev/sdb bs=64k count=100k
dd of=/dev/null if=/dev/sdb bs=64k count=100k
First a word of warning: using dd like this means that any data on /dev/sdb (including filesystems) will be overwritten. The first dd command does a write test and the second dd command does a read test. The write test reads from /dev/zero and writes to /dev/sdb, using a blocksize of 64k and writing 102400 blocks. The read test reads from /dev/sdb and writes to /dev/null, using the same number and size of blocks as the write test.
So in this case we are testing /dev/sdb, but this can be any blockdevice, like /dev/md1 (a software RAID device) or /dev/VolGroup00/LogVol01 (an LVM logical volume).
The number of blocks you need to read or write to get a valid result is very simple to determine: make sure you process at least twice the amount of memory you have in your system. Why? Because the Linux kernel does caching, you need to request more data than the size of the cache to make sure you are actually testing the disk and not the Linux kernel caching.
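If you don't want to do that math by hand, a small sketch like this derives the count from /proc/meminfo for a 64k blocksize (it assumes /dev/sdb is the device under test, as above):
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)   # installed RAM in kB
COUNT=$(( MEM_KB * 2 / 64 ))                          # twice the RAM, expressed in 64k blocks
dd if=/dev/zero of=/dev/sdb bs=64k count=$COUNT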
That leaves the blocksize. This depends on the type of usage you will have on the disks or RAID arrays, and there are no real fixed rules for it. But generally speaking: when you have a lot of small files take a small blocksize (4 to 16k), for standard fileserving and databases take a medium blocksize (64 to 128k), and for large files or applications that do a lot of sequential I/O use a large blocksize (256k and larger).

One final note on dd, since CentOS 5 dd will, when it has finished a test, tell you the speed if obtained. But on older versions of CentOS you will need to calculate this yourself. Another option there is to use ddrescue, which is a tool similar to dd but also provides the bandwidth used in its output. You can find ddrescue for CentOS in Dag's RPMForge repository.
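On those older versions you can simply wrap the dd command in time and do the division yourself. For example:
time dd if=/dev/zero of=/dev/sdb bs=64k count=100k
This writes 102400 blocks of 64 KiB, or 6400 MiB in total, so dividing 6400 by the "real" time in seconds gives you the throughput in MiB/s.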

As I said, dd only does sequential I/O, but since it does not need a filesystem it is very useful to get a baseline/indication/order of magnitude for the performance of a disk. The next step is to put a filesystem on the device being tested. The second tool I use to test performance is iozone. You can also find prebuilt rpms for CentOS in Dag's RPMForge repository.

I usually use iozone like this:
iozone -e -i0 -i1 -i2 -+n -r 16k -s8g -t1
Iozone has a lot of options (that is part of why I like it). If you do iozone -h you will get the complete list; I suggest you go over them when you start testing so you know all your options. But the example above is enough to get started. The -e option adds the time of the flush operation to the results and -+n disables the retest feature. The different -i options indicate the tests done by iozone: -i0 is the sequential write test (always needed), -i1 is sequential read and -i2 is random read and write. Another interesting test is the mixed read/write workload (-i8). The 3 remaining options are: -r 16k is the record (or block) size used, -s8g is the size of the data tested (8 GB in this case, remember the 2x RAM rule here) and -t1 is the number of threads performing the test (in this case 1).

Depending on the application you are planning to run, you should try to use iozone in a way that mimics the I/O patterns of your real application as closely as possible. This means playing with the -i (the kind of test), -r (record size) and -t (number of threads) options.
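As a purely illustrative sketch (the record size, file size and thread count here are assumptions, not a recommendation), a database doing mostly random 8k I/O from a handful of connections could be approximated with something like:
iozone -e -i0 -i2 -i8 -+n -r 8k -s8g -t4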

Another nice feature of iozone is its automatic mode, where iozone iterates over different tests using different file sizes and record sizes. Iozone can also output its results in an Excel-compatible file, so you can later make nice graphs of the tests.
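For example, something like this iterates over the built-in tests up to an 8 GB file size and writes the results to a spreadsheet-importable file (the filename and size limits are just an example):
iozone -a -n 64m -g 8g -b iozone-results.xls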

The final test that I perform is a bit less scientific, but still a good indication of the performance of a disk or a RAID array. The test is a copy of a directory that contains about 1GB of data. This data is spread over different subdirectories (I think it is about 5 levels deep) and all these subdirs contain a couple of big files plus a lot of small files.
This means that a copy of this directory involves a lot of metadata activity, and a mix of sequential I/O (the big files) and something that resembles random or mixed I/O (the small files). The result is usually a big difference compared to the sequential speeds obtained with dd or iozone.
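In practice this is nothing more than a timed recursive copy; the paths below are only placeholders for wherever your test tree lives:
time cp -a /data/testtree /data/testtree-copy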

Now that we have all these benchmarks we still need a tool to monitor the system itself and see what is going on. The tool I always use for this is dstat. It is written and maintained by Dag of the RPMForge repo (and of course the repo also contains CentOS rpms of dstat). The nice thing about dstat is that it has different monitoring modules (and you can also write your own), its output formatting is really good and it can output the results in CSV files so you can also process them in a spreadsheet. This is an example of how to use dstat:
dstat -c -m -y -n -N eth2,eth3 -d -D sda -i -I 98 3
----total-cpu-usage---- ------memory-usage----- ---system-- --net/eth2- --dsk/sda-- inter
usr sys idl wai hiq siq| used buff cach free| int csw | recv send| read writ| 98
0 0 100 0 0 0| 190M 195M 924M 2644M|1029 92 | 0 0 | 793B 4996B| 14
0 0 99 1 0 0| 190M 195M 924M 2644M|1008 41 | 21B 0 | 0 23k| 1

All the different outputs correspond to the different output modules, and they get displayed in the order you put them on the commandline. With the (for example) -n -N structure you can specify for which devices you would like to see output. In this case -n refers to network-card statistics and with -N I specified that I would like to see the output for eth2 and eth3. A similar system is used for the disk statistics (-d and -D) and the number of interrupts (-i and -I).
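For example, to log CPU, disk and network statistics for sda and eth2 every 5 seconds to a CSV file during a benchmark run, something like this should do (the filename is just an example):
dstat -c -d -D sda -n -N eth2 --output benchmark.csv 5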

To get the list of extra modules available next to the standard ones do this:
dstat -M list
/usr/share/dstat:
app, battery, cpufreq, dbus, freespace, gpfs, gpfsop, nfs3, nfs3op, nfsd3, nfsd3op, postfix, rpc, rpcd, sendmail, thermal, utmp, vmkhba, vmkint, vzcpu, vzubc, wifi,
For more information about all the different dstat options do dstat -h or refer to the dstat man-page.

Monday, November 26, 2007

iSCSI Targets - part 1

I've already talked about the iSCSI Storage Server, so it looked like a good idea to talk a bit about iSCSI targets. First a bit of terminology. Since iSCSI is based on SCSI, people also use the SCSI terms to describe the client and the server part. So when people talk about an iSCSI initiator they are referring to the client side (like the browser you use to surf the web). And when people talk about an iSCSI target they mean the server side (like the Apache HTTPD web server).

The iSCSI initiator code has been present in most Linux distributions for a reasonable time. That code is nice and stable and reasonably well tested; this translates to a lot of hits if you google for it. But here I would like to talk about the target side. There have been different efforts to create an iSCSI target on Linux, but they never seem to have gotten the amount of attention the initiator side did. Anyway, today if you are running a 2.6 kernel distribution there are - to my knowledge - 3 iSCSI target projects that you can use.

The first one is IET, or iSCSI Enterprise Target. It is probably the oldest of the 3 projects and to me the most mature. It is a port of an older iSCSI target that worked for 2.4 kernels but not for 2.6. The goal of the project mentioned on their website basically sums up what they want to create:
The aim of the project is to develop an open source iSCSI target with professional features, that works well in enterprise environment under real workload, and is scalable and versatile enough to meet the challenge of future storage needs and developements.
If I were to suggest an iSCSI target for production use today, IET would be it. It is stable code in my experience, it has a good management interface (a clean configfile and CLI tool) and it performs very well. One particular feature of IET that I really liked is being able to configure what kind of caching is done on the target side. When you define a LUN you have to specify its type, and there you have 3 options. You have "nullio", which is only meant for testing (it's kind of like using /dev/null as the device). You have "fileio", which will access a device or a file using the standard Linux caching mechanisms. And you have "blockio", whereby the device is accessed directly. In my experience, when you use HW RAID controllers there is not much difference between "fileio" and "blockio". So again, test both options for your specific setup and use what works best.
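To give an idea of what that looks like, a LUN definition in /etc/ietd.conf is roughly as follows (the target IQN and the device path are made up for the example):
Target iqn.2007-11.com.example:storage.disk1
        Lun 0 Path=/dev/VolGroup00/LogVol01,Type=fileio
Switching to direct access is then just a matter of changing Type=fileio to Type=blockio.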

When you use the "fileio" type you can set an extra option called "IOMode". You can set this to "ro", which means read-only as you probably guessed. But it has a second mode, and the most interesting one, called "wb". This stands for writeback. When you enable this, the Linux page-cache will also be used for write-caching. To explain it simply: in this mode all the free memory on the iSCSI target machine will be used for read- and write-caching. This is an easy way to have a large cache available without having to put expensive memory on hardware RAID controllers. This of course doesn't matter that much if you use Linux software RAID.
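In the configfile this is just one extra parameter on the Lun line, for example (same made-up path as above):
        Lun 0 Path=/dev/VolGroup00/LogVol01,Type=fileio,IOMode=wb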

Finally, one word of warning: when you enable the writeback mode in IET it is best to have a UPS powering the server, since a sudden power loss can result in the loss of data. So do not use that mode without any precautions.

In my next blog I'll talk about the 2 other iSCSI targets. They are architecturally a bit different from IET as you will see then.

Tuesday, November 20, 2007

3ware Hardware RAID vs. Linux Software RAID

As part of a project we are building an iSCSI storage server. It has 16 500GB SATA disks and to provide redundancy we needed some sort of RAID controller, so we went with the well-known 3ware controllers, more specifically a 9550SXU-16ML. The server itself is running CentOS 5 with 2 dual-core Intel Xeon processors. This post is just to share my experiences with this controller.

First, everybody that uses an ext3 filesystem on top of this controller should upgrade to firmware 9.4.2. It improves the write performance: in our case we went from around 55MB/s to around 75MB/s for sequential I/O. That is not bad for a simple firmware upgrade.

But we were having one issue. In initial tests with an iSCSI-exported logical volume we did a copy of a directory tree (on the same volume) with a total size of 1GB. This test took around 6 minutes, which is even less than 3MB/s. To be fair, this structure contains a lot of smaller files in different directories, so a copy also involves a lot of metadata activity. So we did the same test on the storage server itself and got around 3 minutes there. Hmm, this is supposed to be a top-end RAID controller, yet we are only getting 6MB/s in this test.

So, just for fun, I decided to let the RAID controller export each disk individually and use Linux software RAID instead, to see what performance that would give. Well, for this specific test the time was 2'50". And in all the other tests I did, software RAID outperformed the hardware RAID.
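For reference, creating such a software RAID array is a one-liner with mdadm. The RAID level and device names below are only an illustration, not necessarily the exact setup we used:
mdadm --create /dev/md1 --level=5 --raid-devices=16 /dev/sd[b-q]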

So what did we learn here? First, do not let the numbers of all the different hardware RAID controller vendors fool you. Those are sequential I/O tests, but most day-to-day I/O patterns are different.
Second, always test yourself. Benchmarks found online can be indicators, but always test everything yourself for your specific case. Third, when using RAID always give software RAID a chance. It may save you some money.

As a final note: if we mounted the filesystems as ext2 in this specific test, the copy would only take 1 minute (with both HW and SW RAID). So do not forget about ext2, it still has its advantages.

Friday, November 16, 2007

Xen graphical console and foreign keyboards

Those people using CentOS 5.1 or later (or RHEL for that matter) who do not have a standard US qwerty keyboard layout, which is a large part of Europe I guess, will probably have noticed that with 5.0 the keyboard mapping in the graphical console of the Virtual Machine Manager (virt-manager) did not really match.

Luckily, since 5.1 this has been solved. The newer versions of Xen and libvirt allow you to specify the keyboard mapping to use in the graphical console. Unfortunately there is no way to configure this in virt-manager itself, so you need to resort to some text editing. Open the Xen configfile of your domU with your favorite text editor; it should be located in /etc/xen. You should see something like this:
name = "test2"
uuid = "a7296544-864c-4bf6-401d-d87e02306ba1"
maxmem = 500
memory = 500
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]
disk = [ "phy:/dev/rootvg/test2,xvda,w" ]
vif = [ "mac=00:16:3e:56:a7:ba,bridge=xenbr0" ]
Now if you change the "keymap=..." setting to your keyboard layout and start the domU again, the keyboard mapping should match.
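For example, for a French azerty keyboard the vfb line becomes:
vfb = [ "type=vnc,vncunused=1,keymap=fr" ]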

To see which keyboard mappings are available you can do this:
rpm -ql xen | grep keymap
This shows you a list of available keyboard mappings.

Update: Konrad posted another interesting tip in the comments: "You can also simply set this for all DomUs in the file /etc/xen/xend-config.sxp e.g. for a french keyboard add a line (keymap 'fr') and restart xend". This is very useful, tx Konrad!