Performance

What to Expect

clusterducks has a very straightforward system-requirements profile:

  • Network link is generally the first bottleneck for bare metal
    • 1Gbit holds up long-term for smaller networks; extra local disks can be used where more performance is needed
    • Bonded interfaces won't help much because iSCSI (in most configurations) uses a single connection
      • Some devices do benefit, e.g. Linux workstations with multiple network cards
    • Even with 1Gbit, "boot storms" on bare metal systems aren't as severe as they are with full virtualization
  • Disk bandwidth becomes an issue with heavy random workloads and spinning rust (HDD)
    • Check ZFS ARC statistics; more memory might be helpful (see the ARC sketch after this list)
    • Avoiding parity RAID levels (raidz/2/3) and sticking with mirrors will get the most IOPS
    • Adding more spinning disks (as additional mirrored pairs) can also speed things up
    • Using SSDs instead of magnetic storage absorbs most of these issues
      • Some SSDs (certain Samsung models, for example) slow to a crawl without periodic TRIM, so research this topic before investing in new equipment
    • Large files written sequentially may still come back as random reads
  • Memory & CPU limits on compute nodes are hit by running too many VMs
    • Migrate contentious VMs to less utilized hardware
    • Bring new diskless compute nodes online for more capacity
    • If it's not possible to add more memory or CPUs, try migrating your largest virtual machines to dedicated hardware
  • CPU bottlenecks occur with heavily fragmented datasets and short replication intervals
    • Use a minimum snapshot size of 1MB or greater to cut down on unnecessary replication jobs when only small amounts of data have been written (see the replication sketch after this list)
      • Don't set the minimum threshold too high, or you risk skipping important document changes
    • Try increasing the replication interval to give more time between commands so the system can "settle"
    • Restore your images from backup if more than one server is experiencing this issue; ZFS has no "defrag" command
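
As a rough illustration of the ARC check mentioned in the list above, the sketch below reads /proc/spl/kstat/zfs/arcstats on a ZFS-on-Linux storage server and prints the cache hit ratio alongside the current and maximum ARC size. The kstat path and field names are standard for ZFS on Linux, but the script itself is only an example and is not part of clusterducks.

    #!/usr/bin/env python3
    # Rough ARC health check for a ZFS-on-Linux storage server.
    # A low hit ratio while the ARC is already at its maximum size
    # suggests that more RAM could help.

    ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

    def read_arcstats(path=ARCSTATS):
        stats = {}
        with open(path) as f:
            for line in f.readlines()[2:]:   # first two lines are kstat headers
                parts = line.split()
                if len(parts) == 3:          # "name type data" rows
                    name, _type, value = parts
                    stats[name] = int(value)
        return stats

    if __name__ == "__main__":
        s = read_arcstats()
        hits, misses = s["hits"], s["misses"]
        total = hits + misses
        ratio = 100.0 * hits / total if total else 0.0
        print("ARC size:      %.1f GiB (max %.1f GiB)" % (s["size"] / 2**30, s["c_max"] / 2**30))
        print("ARC hit ratio: %.1f%% (%d hits, %d misses)" % (ratio, hits, misses))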
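
The snapshot-size threshold can be illustrated the same way. The sketch below checks the ZFS "written" property, which reports how much data has changed since the last snapshot, and only goes ahead when that amount exceeds a configurable minimum. The dataset name and the replication step are placeholders; this is not clusterducks' actual implementation.

    #!/usr/bin/env python3
    # Illustrative only: skip a replication run when a dataset has written
    # less than a minimum amount of new data since its last snapshot.
    import subprocess

    MIN_WRITTEN = 1 * 1024 * 1024   # 1MB threshold, as suggested above

    def written_since_last_snapshot(dataset):
        # "written" = space referenced by data written since the previous snapshot
        out = subprocess.check_output(["zfs", "get", "-Hpo", "value", "written", dataset])
        return int(out.strip())

    def maybe_replicate(dataset):
        changed = written_since_last_snapshot(dataset)
        if changed < MIN_WRITTEN:
            print("%s: only %d bytes written, skipping this interval" % (dataset, changed))
            return False
        print("%s: %d bytes written, replicating" % (dataset, changed))
        # placeholder for the real snapshot + zfs send/receive step
        return True

    if __name__ == "__main__":
        maybe_replicate("tank/images/workstation01")   # hypothetical dataset name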

Example

Server:

  • AMD Opteron 2210
  • 16GB RAM
  • 2 x 2TB SATA HDD (WD Black, 64MB cache)
  • 2 x 1Gbit LAN (bonded)

Network:

  • 30 x Windows 7 PCs
  • 1Gbit LAN (CAT6)
  • Samba4 domain
    • Roaming profiles
    • Folder redirection to server
  • MS Office 2013, Firefox, accounting software, and graphic design suites, each running on different workstations

Test:

  • 64KB IO size
  • 80% random, 20% sequential
  • 32 threads
  • 40GB partition on OS volume

Results:

  • Drive is filled at 40MB/s (IOmeter limitation)
  • IOmeter reports 9500 IOPS with one workstation executing the test (154MB/s duplex, 77MB/s each way)
  • Other workstations did not slow down or see IO latency spikes during the 60-minute test
  • Server load stayed below 5.0

This is a rather limited and non-scientific test; it would certainly be possible to overload the 1Gbit switch and cause issues if traffic control is not enabled for each device. Higher throughput levels bring higher CPU utilization and potentially greater latency for competing applications. On AWS 10Gbit xlarge instances it is possible to crash the system by consuming all available network capacity.
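
To put these numbers in context, a nominal 1Gbit/s link carries roughly 125MB/s in each direction before protocol overhead, so 77MB/s each way uses a bit over 60% of a single gigabit port, and a second workstation pushing the same load would saturate it. The back-of-the-envelope sketch below simply restates that arithmetic and is not a clusterducks tool.

    #!/usr/bin/env python3
    # Back-of-the-envelope check: how much of a gigabit link does a given
    # per-direction iSCSI throughput consume?

    LINK_MBPS = 1000                    # nominal 1Gbit/s link
    LINK_MB_PER_S = LINK_MBPS / 8.0     # ~125MB/s per direction, before overhead

    def utilization_percent(measured_mb_per_s, link_mb_per_s=LINK_MB_PER_S):
        return 100.0 * measured_mb_per_s / link_mb_per_s

    if __name__ == "__main__":
        per_direction = 77.0            # MB/s each way, from the results above
        print("Link utilization per direction: %.0f%%" % utilization_percent(per_direction))
        print("Remaining headroom: %.0f MB/s" % (LINK_MB_PER_S - per_direction))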