Tuesday, December 18, 2007

How to scan the SCSI bus with a 2.6 kernel

If you are playing with SCSI devices (like Fibre Channel, SAS, ..) you sometimes need to rescan the SCSI bus to add devices or to tell the kernel a device is gone. Well, this is the way to do it in CentOS with versions that have a 2.6 kernel. This means CentOS 5 and CentOS 4 (starting from update 3).
  1. Find what's the host number for the HBA:
    ls /sys/class/fc_host/
    (You'll have something like host1 or host2, I'll refer to them as host$NUMBER from now on)

  2. Ask the HBA to issue a LIP signal to rescan the FC bus:
    echo 1 >/sys/class/fc_host/host$NUMBER/issue_lip
  3. Wait around 15 seconds for the LIP command to take effect

  4. Ask Linux to rescan the SCSI devices on that HBA:
    echo - - - >/sys/class/scsi_host/host$NUMBER/scan
    The wildcards "- - -" mean to look at every channel, every target, every lun.
That's it. You can check the "dmesg" output for log messages to see if it's working, and you can look at /proc/scsi/scsi to see if the devices are there. In CentOS 5 there is also the "lsscsi" command, which shows you the known devices plus the device entries that point to those devices (very useful).
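The steps above can be wrapped in a small shell function. This is only a sketch: the function name and the optional SYSFS_ROOT argument (handy for a dry run against a scratch directory) are my own additions, and it assumes a bash-compatible shell running as root.

```shell
# Rescan one FC HBA: issue a LIP, wait, then rescan the SCSI host.
# Usage: rescan_fc_host HOST [WAIT_SECONDS] [SYSFS_ROOT]
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
rescan_fc_host() {
    local host="$1" wait_s="${2:-15}" sysfs="${3:-/sys}"

    # Step 1: make sure the host really is an FC host.
    if [ ! -d "$sysfs/class/fc_host/$host" ]; then
        echo "no such FC host: $host" >&2
        return 1
    fi

    # Step 2: ask the HBA to issue a LIP.
    echo 1 > "$sysfs/class/fc_host/$host/issue_lip" || return 1
    # Step 3: give the LIP time to take effect.
    sleep "$wait_s"
    # Step 4: wildcard rescan of every channel, target and LUN.
    echo "- - -" > "$sysfs/class/scsi_host/$host/scan"
}
```

After sourcing it, "rescan_fc_host host1" runs the whole sequence for host1 with the 15 second wait.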

For more information about how this works, see the upstream release notes for 4.3.

Monday, December 10, 2007

Anaconda, Kickstart & iSCSI

Over the past few days I've been trying to get Anaconda to install a Xen domU with its hard disk being an iSCSI target, all automated using kickstart. It is working now, but Anaconda needed a bit of help.

First a quick overview of the setup. I have a set of servers that will host all the virtual machines. The storage comes from 2 servers that have 12 disks in a RAID array (see some of the other articles on this blog). I use LVM to carve up this storage and export each logical volume as an iSCSI target for the different virtual machines.

Since I would like to be able to use the Xen migration feature to move virtual machines between the servers, I basically have 3 options for shared storage so the migration works. The first one is to use a cluster/distributed filesystem that is present on all the servers and contains the image files of the different virtual machines. The second one is to use CLVM, which makes logical volumes available to a set of machines. And the third one is to let the virtual machine import the disk directly. It's a bit too much to discuss the merits of the 3 choices here. Maybe in another post.
Anyway, the choice was made to use the last option. Our virtual machine would itself act as an iSCSI initiator and get a disk to install on that way.

If you read the CentOS 5 docs about kickstart (http://www.centos.org/docs/5/html/5.1/Installation_Guide/s1-kickstart2-options.html) you can see that there are options for iSCSI. So this was supposed to be easy. A first test indicated that it all worked, so that looked good. Now we had some requirements. To prevent the different virtual machines from touching the wrong iSCSI targets we are using a username and password on each target.

The documentation indicates that Anaconda should support a username and password during the import of the iSCSI targets. But testing showed that this does not work. Then I looked at the code and saw nothing that indicates that it should work.

A second issue that we are seeing is that we will quickly have a lot of targets on our storage server, and Anaconda tries to log in to all the targets that it discovers. This slows down the installation process a lot.

So it was time to refresh my Python skills and do some Anaconda hacking. BTW, a good reference for this is the Anaconda wiki page. Using the updates support you can easily use a modified version of the Anaconda code during an installation. And after some testing both problems are now solved. Anaconda will now use the --user, --password and --target options if they are present.

I've created 2 entries in the upstream bugzilla about this, containing my patches: 402431 and 418781. I have also attached an updates.img to bug 418781. This can be used to start an install with the patches I've made, without the need to recreate all the installation images. See the Anaconda wiki or the docs that come with Anaconda on how to use it (it depends on how you actually do the install).
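To make the options concrete, here is a small sketch that prints the two kickstart lines involved. The function name is mine and every value in the usage example (IQNs, address, credentials) is a made-up placeholder; as described above, the --target, --user and --password options only took effect for me with the patched Anaconda.

```shell
# Print the kickstart lines for one authenticated iSCSI target.
# Usage: ks_iscsi INITIATOR_IQN PORTAL_IP TARGET_IQN USER PASSWORD
# All arguments are placeholders supplied by the caller.
ks_iscsi() {
    # Name the initiator (the domU being installed).
    printf 'iscsiname %s\n' "$1"
    # Log in to exactly one target instead of everything that is discovered.
    printf 'iscsi --ipaddr=%s --target=%s --user=%s --password=%s\n' \
        "$2" "$3" "$4" "$5"
}
```

For example, "ks_iscsi iqn.2007-12.com.example:vm1 192.168.0.10 iqn.2007-12.com.example:storage.vm1 vm1user secret" prints lines that can be pasted into (or generated into) the kickstart file.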

Monday, December 3, 2007

CentOS 5.1 Released

For those of you that have not been behind their computer this weekend: CentOS 5.1 for the i386 and x86_64 platforms has been released into the wild. As part of the QA team I can say that it looks like a solid release (except the autofs issue, but that is already fixed with a new kernel in the updates repo). Especially Xen looks like it is a lot more stable than it was in the 5.0 version.

Anyway you can find the official announcement here : http://lists.centos.org/pipermail/centos-announce/2007-December/014476.html.
And the release notes (considered essential reading) are here : http://wiki.centos.org/Manuals/ReleaseNotes/CentOS5.1/.

Enjoy 5.1 and thanks again to all the people involved to make it happen.

Thursday, November 29, 2007

How to test and monitor disk performance

In my RAID performance article I've given a lot of performance numbers, but it is indeed a good question how exactly to measure disk performance. To do this you need 2 tools: one that does the actual benchmark and one that measures the different system parameters so you can know the impact of the benchmark.

For the benchmark itself I usually use 3 tools. The first one is good old dd. The problem with dd is that it can only do 1 type of benchmark (sequential read and write), but it can do this to and from many sources and that is its strength.
So how do you use dd for a disk benchmark? Well, usually like this :
dd if=/dev/zero of=/dev/sdb bs=64k count=100k
dd of=/dev/null if=/dev/sdb bs=64k count=100k
First a word of warning: using dd like this means that any data on /dev/sdb (including filesystems) will be overwritten. The first dd command does a write test and the second dd command does a read test. The write test reads from /dev/zero and writes to /dev/sdb; it uses a blocksize of 64k and writes 102400 blocks. The read test reads from /dev/sdb and writes to /dev/null; it uses the same number and size of blocks as the write test.
So in this case we are testing /dev/sdb, but this can be any block device, like /dev/md1 (a software RAID device) or /dev/VolGroup00/Logvol01 (an LVM logical volume).
The number of blocks you need to read or write to have a valid test is very simple: make sure you process at least twice the amount of memory you have in your system. Why? Because the Linux kernel does caching, you need to request more data than the size of the cache to make sure you are actually testing the disk and not the Linux kernel caching.
And then the blocksize remains. Well, this depends on the type of usage you will do on the disks or RAID arrays. There are no real fixed rules for this. But generally speaking when you have a lot of small files take a small blocksize (4 to 16k), for standard fileserving and databases take a medium blocksize (64 to 128k) and for large files or applications that do a lot of sequential I/O use a large blocksize (256k and larger).
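The 2x RAM rule can be turned into a dd block count with a little shell arithmetic. A minimal sketch, with function names of my own choosing; it assumes sizes are given in kB, which is the unit /proc/meminfo uses for MemTotal.

```shell
# Print the dd block count needed to move at least twice the RAM size.
# Usage: dd_count MEM_KB BLOCK_KB
dd_count() {
    local mem_kb="$1" bs_kb="$2"
    # Round up so the total never falls below 2x RAM.
    echo $(( (2 * mem_kb + bs_kb - 1) / bs_kb ))
}

# Read the installed memory in kB from /proc/meminfo.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}
```

A read test with a 64k blocksize would then look like: dd of=/dev/null if=/dev/sdb bs=64k count=$(dd_count "$(mem_total_kb)" 64)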

One final note on dd: since CentOS 5, dd will tell you the speed it obtained when it has finished a test. But on older versions of CentOS you will need to calculate this yourself. Another option there is to use ddrescue, which is a tool similar to dd but also provides the bandwidth used in its output. You can find ddrescue for CentOS in Dag's RPMForge repository.
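If your dd does not print the speed, you can time the run (for example with the "time" command) and work it out from the blocksize and count. A sketch, with a function name of my own; it reports MB as 1024 kB, so the figure can differ slightly from what a newer dd prints.

```shell
# Print the average throughput in MB/s of a dd run.
# Usage: dd_rate BLOCK_KB COUNT SECONDS
dd_rate() {
    local bs_kb="$1" count="$2" seconds="$3"
    # kB moved, converted to MB, divided by the elapsed time.
    awk -v bs="$bs_kb" -v n="$count" -v t="$seconds" \
        'BEGIN { printf "%.1f\n", bs * n / 1024 / t }'
}
```

For the example commands above (bs=64k count=100k), a run that took 80 seconds gives "dd_rate 64 102400 80", which prints 80.0.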

As I said, dd only does sequential I/O, but since it does not need a filesystem it is very useful for getting a baseline for the performance of a disk. The next step is to put a filesystem on the device being tested. Then the second tool I use to test performance is iozone. You can also find prebuilt rpms for CentOS in Dag's RPMForge repository.

I usually use iozone like this
iozone -e -i0 -i1 -i2 -+n -r 16k -s8g -t1
Iozone has a lot of options (that is part of why I like it). If you do iozone -h you will get the complete list. I suggest you go over them when you start testing so you know all your options. But the example above shows enough to get started. The -e option adds the time of the flush operation to the results and the -+n disables the retest feature. The different -i options indicate the tests done by iozone: -i0 is the sequential write test (always needed), -i1 is sequential read and -i2 is random read and write. Another interesting test is the mixed read/write (-i8). The 3 remaining options are : -r 16k is the record (or block) size used, -s8g is the size of the data tested (8 GB in this case, remember the 2x RAM rule here) and -t1 is the number of threads performing the test (in this case 1).

Depending on the application you are planning to run, you try to use iozone in a way that mimics the I/O patterns of your real application as closely as possible. This means playing with the -i (the kind of test), the -r (block size) and -t (number of threads) options.
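One way to play with these options systematically is to generate the command lines first and review them before running anything. A small sketch; the function name and the example values are just illustrations, while the iozone options are the ones from the command shown earlier.

```shell
# Print (not run) one iozone command per record size / thread count pair.
# Usage: iozone_matrix FILE_SIZE "RECORD_SIZES" "THREAD_COUNTS"
iozone_matrix() {
    local size="$1" records="$2" threads="$3" r t
    for r in $records; do
        for t in $threads; do
            # Same tests as the example above: write, read and random I/O.
            echo "iozone -e -i0 -i1 -i2 -+n -r $r -s$size -t$t"
        done
    done
}
```

Running iozone_matrix 8g "4k 16k 64k" "1 4" prints six commands; pipe the output to sh once you are happy with the list.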

Another nice feature of iozone is the autotest. Here iozone can iterate over different tests using different parameters for file size and record size. Iozone can also output its results in an Excel file, so you can later make nice graphs of the tests.

The final test that I perform is a bit less scientific, but still a good indication of the performance of a disk or a RAID array. The test is a copy of a directory that contains about 1GB of data. This data is stored in different subdirectories (I think it is about 5 levels deep) and these subdirs contain a couple of big files and, for the rest, a lot of small files.
This means that a copy of this directory involves a lot of metadata activity, and a mix of sequential I/O (the big files) and something that resembles random or mixed I/O (the small files). And the result is usually a big difference compared to the sequential speeds obtained with dd or iozone.

Now that we have all these benchmarks we still need a tool to monitor the system itself and see what is going on. The tool I always use for this is dstat. It is written and maintained by Dag of the RPMForge repo (which of course also contains CentOS rpms of dstat). The nice thing about dstat is that it has different monitoring modules (and you can also write your own), its output formatting is really good and it can output the results to CSV files so you can also process the results in a spreadsheet. This is an example of how to use dstat :
dstat -c -m -y -n -N eth2,eth3 -d -D sda -i -I 98 3
----total-cpu-usage---- ------memory-usage----- ---system-- --net/eth2- --dsk/sda-- inter
usr sys idl wai hiq siq| used buff cach free| int csw | recv send| read writ| 98
0 0 100 0 0 0| 190M 195M 924M 2644M|1029 92 | 0 0 | 793B 4996B| 14
0 0 99 1 0 0| 190M 195M 924M 2644M|1008 41 | 21B 0 | 0 23k| 1

The different columns come from different output modules, and they are displayed in the order you have put them on the command line. With the (for example) -n -N structure you can specify for which devices you would like to see output. In this case the -n refers to network card statistics and with -N I specified that I would like to see the output for eth2 and eth3. A similar system is used for the disk statistics (-d and -D) and the number of interrupts (-i and -I).

To get the list of extra modules available next to the standard ones do this :
dstat -M list
app, battery, cpufreq, dbus, freespace, gpfs, gpfsop, nfs3, nfs3op, nfsd3, nfsd3op, postfix, rpc, rpcd, sendmail, thermal, utmp, vmkhba, vmkint, vzcpu, vzubc, wifi,
For more information about all the different dstat options do dstat -h or refer to the dstat man-page.

Monday, November 26, 2007

iSCSI Targets - part 1

I've already talked about the iSCSI Storage Server, so it looked like a good idea to talk a bit about iSCSI targets. First a bit of terminology. Since iSCSI is based on SCSI, people also use the SCSI terms to describe the client and the server part. So when people talk about an iSCSI initiator they are referring to the client side (like the browser you use to surf the web). And when people talk about an iSCSI target they mean the server side (like the Apache HTTPD web server).

The iSCSI initiator code has been present in most Linux distributions for a reasonable time. That code is nice and stable and reasonably well tested. This translates to a lot of hits if you google for it. But here I would like to talk about the target side. There have been different efforts to create an iSCSI target on Linux, but they never seem to have gotten the same amount of attention as the initiator side did. Anyway, today if you are running a 2.6 kernel distribution there are - to my knowledge - 3 iSCSI target projects that you can use.

The first one is the IET or iSCSI Enterprise Target. It is probably the oldest of the 3 projects and to me the most mature. It is a port of an older iSCSI target that worked for 2.4 kernels but not for 2.6. The goal of the project mentioned on their website basically sums up what they want to create :
The aim of the project is to develop an open source iSCSI target with professional features, that works well in enterprise environment under real workload, and is scalable and versatile enough to meet the challenge of future storage needs and developements.
If I were to suggest an iSCSI target for production use today, the IET would be it. It's stable code in my experience. It has a good management interface (a clean config file and CLI tool) and performs very well. One particular feature of IET that I really like is being able to configure what kind of caching is done on the target side. When you define a LUN you have to mention the type, and there you have 3 options. You have "nullio", which is only meant for testing (it's kind of like using /dev/null for the device). You have "fileio", which will access a device or a file using the standard Linux caching mechanisms. And you have "blockio", whereby the device will be accessed directly. In my experience when you use HW RAID controllers there is not much difference between "fileio" and "blockio". So again, test both options for your specific setup and use what works best.

But when you use the "fileio" type you can set an extra option. It's called "IOMode". You can set this to "ro", which means read-only as you had probably guessed. But it has a second mode, and the most interesting one, that is called "wb". This stands for writeback. When you enable this the Linux page cache will also be used for write caching. To put it simply, in this mode all the free memory on the iSCSI target machine will be used for read and write caching. This is an easy way to have a large cache available without having to put expensive memory on hardware RAID controllers. This of course doesn't matter that much if you use the Linux software RAID.

Finally, one word of warning. When you do enable the writeback mode in IET it is best to have a UPS powering the server, since a sudden power loss can result in the loss of data that was still sitting in the cache. So do not use that mode without any precautions.
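To show where the type and IOMode settings end up, here is a sketch that prints a target definition in the style of the IET config file (usually /etc/ietd.conf, but check your package). The function name, the IQN and the device path in the example are placeholders of my own; following the text above, IOMode is only emitted for "fileio".

```shell
# Print an ietd.conf stanza for one target with a single LUN.
# Usage: iet_target IQN PATH [TYPE] [IOMODE]
iet_target() {
    local iqn="$1" path="$2" lun_type="${3:-fileio}" iomode="$4"
    local lun="Path=$path,Type=$lun_type"
    # IOMode is described above as a fileio option, so skip it otherwise.
    if [ "$lun_type" = "fileio" ] && [ -n "$iomode" ]; then
        lun="$lun,IOMode=$iomode"
    fi
    printf 'Target %s\n    Lun 0 %s\n' "$iqn" "$lun"
}
```

For example, "iet_target iqn.2007-11.com.example:storage.vm1 /dev/vg0/vm1 fileio wb" prints a stanza with writeback caching enabled, and passing "blockio" as the type gives the direct-access variant for comparison.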

In my next blog I'll talk about the 2 other iSCSI targets. They are architecturally a bit different from IET as you will see then.

Tuesday, November 20, 2007

3ware Hardware RAID vs. Linux Software RAID

As part of a project we are building an iSCSI storage server. It has 16 500GB SATA disks and to provide redundancy we needed some sort of RAID controller. So we went with the well-known 3ware controllers, more specifically a 9550SXU 16ML. The server itself is running CentOS 5 with 2 Dual-Core Intel Xeon processors. This post is just to share my experiences with this controller.

First, everybody that uses an ext3 filesystem on top of this controller should upgrade to firmware 9.4.2. It improves the write performance. In our case we went from around 55MB/s to around 75MB/s for sequential I/O. That is not bad for a simple firmware upgrade.

But we were having one issue. On initial tests with an iSCSI exported logical volume we did a copy of a directory tree (on the same volume) with a total size of 1GB. This test took around 6 minutes, which is less than 3MB/s. To be fair, this structure contains a lot of smaller files in different directories, so a copy also involves a lot of metadata activity. So we did the same test on the storage server itself and we got around 3 minutes there. Mmm, this should be a top-end RAID controller, but we are only getting 6MB/s in this test.

So, just for fun I decided to let the RAID controller export each disk individually and use the Linux software RAID and see what performance that would give. Well, for this specific test the time was 2'50". And for all the other tests I did software RAID outperformed the hardware RAID.

So what did we learn here? First, do not let the numbers of all the different hardware RAID controller vendors fool you. They are sequential I/O tests, but most day-to-day I/O patterns are different.
Second, always test yourself. Benchmarks found online can be indicators, but always test everything yourself for your specific case. Third, when using RAID always give software RAID a chance. It may save you some money.

As a final note: if we mounted the filesystems as ext2 in this specific test, the copy would only take 1 minute (with both HW and SW RAID). So do not forget about ext2, it still has its advantages.
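To put the copy times from this post side by side, the elapsed times can be converted to average throughput. A sketch with a function name of my own, assuming the 1GB tree is 1024 MB (the real size was only "around" 1GB, so treat the results as rough).

```shell
# Print the average throughput in MB/s of a copy.
# Usage: copy_rate SIZE_MB MINUTES [SECONDS]
copy_rate() {
    local size_mb="$1" min="$2" sec="${3:-0}"
    # Size divided by the elapsed time in seconds.
    awk -v s="$size_mb" -v m="$min" -v x="$sec" \
        'BEGIN { printf "%.1f\n", s / (m * 60 + x) }'
}
```

With these assumptions "copy_rate 1024 6" (iSCSI, HW RAID) gives 2.8, "copy_rate 1024 3" (local, HW RAID) gives 5.7, "copy_rate 1024 2 50" (local, SW RAID) gives 6.0 and "copy_rate 1024 1" (ext2) gives 17.1.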

Friday, November 16, 2007

Xen graphical console and foreign keyboards

Those of you using CentOS 5.1 or later (or RHEL for that matter) who do not have a standard US qwerty keyboard layout, which is a large part of Europe I guess, will probably have noticed that with 5.0 the keyboard mapping in the graphical console of the Virtual Machine Manager (virt-manager) does not really match.

Luckily this has been solved since 5.1. The newer versions of Xen and libvirt allow you to specify the keyboard mapping to use in the graphical console. Unfortunately there is no way to configure this in virt-manager itself, so you need to resort to some text editing. Open the Xen config file of your domU with your favorite text editor; it should be located in /etc/xen. You should see something like this
name = "test2"
uuid = "a7296544-864c-4bf6-401d-d87e02306ba1"
maxmem = 500
memory = 500
vcpus = 1
bootloader = "/usr/bin/pygrub"
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "restart"
vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]
disk = [ "phy:/dev/rootvg/test2,xvda,w" ]
vif = [ "mac=00:16:3e:56:a7:ba,bridge=xenbr0" ]
Now if you change the "keymap=..." setting to your keyboard layout and start the domU again, the keyboard mapping should match.

To know what keyboard mappings are available you can do this :
rpm -ql xen | grep keymap
This shows you a list of available keyboard mappings.
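If you have several domUs to change, the edit can be scripted. A sketch with a function name of my own; it assumes the config already has a "keymap=..." entry in its vfb line like the example above, and it keeps a .bak copy of the file.

```shell
# Set the VNC keymap in a Xen domU config file.
# Usage: set_domu_keymap CONFIG_FILE KEYMAP
set_domu_keymap() {
    local cfg="$1" keymap="$2"
    # Only touch the file when a keymap= setting is actually present.
    grep -q 'keymap=' "$cfg" || return 1
    # Keep a .bak copy; replace the value up to the next comma or quote.
    sed -i.bak "s/keymap=[^,\"]*/keymap=$keymap/" "$cfg"
}
```

For the example domU above, "set_domu_keymap /etc/xen/test2 fr" switches it to a French layout; restart the domU afterwards.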

Update : Konrad posted another interesting tip in the comments : "You can also simply set this for all DomUs in the file /etc/xen/xend-config.sxp e.g. for a french keyboard add a line (keymap 'fr') and restart xend". This is very useful, thanks Konrad !