NML Performance Measures
Introduction
This document describes various methods that have been used to test the performance of NML, the results of some of those tests, and how those results might be used to optimize NML-based applications. It assumes that the reader is familiar with either the NML C++ Interface or the NML Java Interface, and with the options available in the NML Configuration Files.
Factors Affecting Performance
Unfortunately, the performance of NML cannot be effectively characterized by a single number. Instead, we need a series of tests and performance measures that take all of the following factors into account.
- Platform -- CPU, operating system, and compiler. Because NML is portable, there are a large number of platforms that should be tested; in systems where multiple platforms communicate, the number of possible combinations is effectively unbounded.
- Options/Protocols -- the methods of communicating selected within the configuration file. Because NML is flexible, there are a large number of options and protocols that should be considered.
- Application Type/Performance Measure -- different applications use NML in different ways and are likely to be interested in different measures of performance. These measures are defined in the next section.
- Message Size -- for large messages, the times, latencies, and CPU load are generally proportional to the size of the message; for small messages, the message size may have little effect.

Performance Measures
Different types of applications will be interested in different measures of NML performance.
- Maximum Read/Write Time -- the maximum time observed during the test between a call to NML::read() or NML::write() and its return of control to the calling function, regardless of whether the CPU was available to process other tasks during that time. Maximum time measures can be greatly influenced by the system clock resolution; the resolution of the clocks available for reading the time varies from 1 microsecond to 20 milliseconds depending on the platform. Applications that need to run at a deterministic cycle time should pay close attention to this number. It can generally be controlled with the timeout option, although NIST does not guarantee that the value will never be exceeded.
- Average Read/Write Time -- the average time during the test between a call to NML::read() or NML::write() and its return of control to the calling function, again regardless of whether the CPU was available to process other tasks during that time. The effect of poor system clock resolution is smaller here, since the measurement is averaged over many reads. Applications that will be the only process of interest running on their CPU, or applications on platforms where the average CPU time used is not available, should consider this number.
- Average CPU Time Used for a Read/Write -- the average time the CPU spends on the calling process per call to NML::read() or NML::write() during a particular test. Since the operating system may swap out the process during a read or write, or NML may block temporarily on a socket or semaphore, this will be less than the average read/write time. This number is important to applications running on operating systems where time must be reserved for other applications on the same CPU.
- Throughput -- the average number of new messages received per second during a particular test. Since NML::read() can return and indicate "No New Data", this number will be less than the number of calls to NML::read() per second. Since calls to NML::write() may overwrite each other, it may also be less than the number of calls to NML::write() per second. This number matters most to applications that are tightly bound to one input. Since most RCS applications poll multiple inputs and do not care if they miss messages, it is of lesser importance to them.
- Latency -- the average time between the return of a write and the beginning of the read that receives the same message, for a particular test. This number is important to applications with a long chain of modules that pass along a piece of data, applications that send large messages over a network, and applications that check for data only asynchronously. For most applications, however, the latency will be small enough that its exact value is not relevant.

NMLPERF - The NML Performance Testing Program
The program used to perform most of the tests is called nmlperf. The source code for nmlperf consists of three files:

- perftype.hh -- a C++ header file containing declarations for the NML_PERFORMANCE_TEST_MSG class, which is used to create the messages to be exchanged. The class is somewhat complicated because NML may need to convert messages from their native format to a neutral format such as XDR, and XDR may be more efficient at converting certain data types than others. The class essentially contains a variable-length array of a union of all the basic data types. An integer in the message indicates to the update function which type of data the union currently contains, and therefore which conversion is necessary. Rather than report different numbers for each data type, the test rotates the data type constantly through each. The array length is also modified with each message so that the message length stays the same: when the message contains characters, it needs an array length four times larger than when it contains long integers. Special new and delete operators allow the message to be created at any size. The function set_real_test_msg_size is called before new to set the size at which the message should be created.
- perftype.cc -- a C++ source file containing the NML format function, perf_types_format, the update function for NML_PERFORMANCE_TEST_MSG, and the special new and delete operators.
- nmlperf.cc -- a C++ source file containing the main function for the nmlperf program. When started, the program asks the user for the buffer name, process name, and configuration file to use, and the maximum number of messages to read or write (which may be infinite). It then creates one NML object using those parameters and checks that it is valid. If it is valid, it asks which role to play in the performance test: write only, read only, or a combination alternating read and write. For the write or combination modes, it asks what size message to create. It then performs the test by repeatedly reading or writing from the channel until the maximum number of messages has been written or read, or until the user presses Control-C. Finally, it prints out the information collected.

An additional program must be run for some tests. The program perfsvr connects to each buffer with a given host name and provides an NML server for the buffer. No performance measures are currently made within perfsvr; however, it must be running for nmlperf to connect remotely and test the performance of the remote protocols. The source code is available for review in perfsvr.cc.
Platform Details
In the test results summary I have included the hostnames so that readers can look up the exact details of the platform used for each test.
- dopey (sunos5): Sun Ultra 1 Creator, CPU Type: sparc, Number of CPUs: 1, App Architecture: sparc, Kernel Architecture: sun4u, OS Name: SunOS, OS Version: 5.5.1, Kernel Version: SunOS Release 5.5.1 Version Generic_103640-08 [UNIX(R) System V Release 4.0], gcc version: 2.7.2.1, System Clock Interval: 10 ms
- rolle (sunos5): Sun SparcStation 20, CPU Type: sparc, Number of CPUs: 1, App Architecture: sparc, Kernel Architecture: sun4m, OS Name: SunOS, OS Version: 5.5, Kernel Version: SunOS Release 5.5 Version Generic_103093-06 [UNIX(R) System V Release 4.0], gcc version: 2.7.2.1, System Clock Interval: 10 ms
- vx10 (vxworks5.3): Motorola MVME162-22, CPU Type: 25 MHz MC68040, Number of CPUs: 1, OS Name: VxWorks (for Motorola MVME162), OS Version: 5.3.1, Kernel Version: WIND version 2.5, gcc version: cygnus-2.7.2-960126
- feed (win32msc): Windows NT PC, CPU Type: 133 MHz Pentium, Number of CPUs: 1, OS Name: Windows NT, OS Version: 4.0 (Build 1381: Service Pack 3), Visual C++ Version: 5.0, System Clock Interval: 200 microseconds

It is also worth noting that all of the hosts were on the same Class C subnet connected with 10 Mbps Ethernet. Other computers could use the same network; network utilization averaged around 2% without the performance test traffic, although significant bursts seemed to occur every few minutes. These observations are based on watching the Network Utilization chart of the NT Performance Monitor.
Test Results Summary
Test # | Test Platform | Buffer Location | Read/Write | Protocol | Options | Size (bytes) | Max. Time (s) | Avg. Time (s) | CPU Time (s) | Throughput (msgs/s) | Latency (s) |
1 | dopey(sunos5) | dopey(sunos5) | Both | SHMEM | default | 200 | 0.012 | 0.000113 | 0.000112 | 4438 | 0.0 |
2 | dopey(sunos5) | dopey(sunos5) | Both | SHMEM | queue | 200 | 0.020 | 0.000118 | 0.000117 | 4250 | 0.0 |
3 | dopey(sunos5) | dopey(sunos5) | Both | SHMEM | mutex=mao split | 200 | 0.012 | 0.000013 | 0.000012 | 39914 | 0.0 |
4 | dopey(sunos5) | dopey(sunos5) | Both | TCP | default | 200 | 0.097 | 0.000712 | 0.000291 | 702 | 0.0 |
5 | rolle(sunos5) | dopey(sunos5) | Both | TCP | default | 200 | 0.035 | 0.001471 | 0.000671 | 335 | 0.0 |
6 | rolle(sunos5) | dopey(sunos5) | Both | TCP | poll | 200 | 0.218 | 0.000427 | 0.000360 | 184 | 0.002565 |
7 | rolle(sunos5) | dopey(sunos5) | Both | TCP | confirm_write | 200 | 1.532 | 0.003127 | 0.000771 | 159 | 0.0 |
8 | rolle(sunos5) | dopey(sunos5) | Write | TCP | confirm_write | 200 | 0.569 | 0.003097 | 0.000707 | -- | -- |
9 | rolle(sunos5) | dopey(sunos5) | Write | TCP | default | 200 | 0.006 | 0.000521 | 0.000463 | -- | -- |
10 | dopey(sunos5) | dopey(sunos5) | Both | UDP | default | 200 | 0.072 | 0.000537 | 0.000220 | 931 | 0.0 |
11 | dopey(sunos5) | dopey(sunos5) | Both | UDP | poll | 200 | 0.016 | 0.000189 | 0.000140 | 15 | 0.000140 |
12 | rolle(sunos5) | dopey(sunos5) | Both | UDP | default | 200 | 0.058 | 0.001210 | 0.000461 | 410 | 0.0 |
13 | rolle(sunos5) | dopey(sunos5) | Both | UDP | poll | 200 | 0.005 | 0.000527 | 0.000452 | 43 | 0.011061 |
14 | rolle(sunos5) | dopey(sunos5) | Both | TCP | default | 100000 | 0.461 | 0.169032 | 0.030998 | 2.96 | -- |
15 | vx10(vxworks5.3) | vx10(vxworks5.3) | Both | SHMEM | default | 200 | 0.0002 | 0.000107 | -- | 4673 | 0.0 |
16 | vx10(vxworks5.3) | vx10(vxworks5.3) | Both | SHMEM | mutex=no_switching | 200 | 0.0002 | 0.000067 | -- | 7500 | 0.0 |
17 | vx10(vxworks5.3) | vx10/vx11 backplane | Both | GLOBMEM | default | 200 | 0.0004 | 0.000184 | -- | 2713 | 0.0 |
18 | vx10(vxworks5.3) | vx10/vx11 backplane | Both | GLOBMEM | lock_bus | 200 | 0.0002 | 0.000141 | -- | 3553 | 0.0 |
19 | vx10(vxworks5.3) | dopey(sunos5) | Both | TCP | default | 200 | 3.4936 | 0.012991 | -- | 37.8 | 0.0 |
20 | vx10(vxworks5.3) | dopey(sunos5) | Write | TCP | default | 200 | 0.0344 | 0.00143 | -- | -- | -- |
21 | vx10(vxworks5.3) | dopey(sunos5) | Write | TCP | default | 10000 | 3.486 | 0.069465 | -- | -- | -- |
22 | vx10(vxworks5.3) | dopey(sunos5) | Write | UDP | default | 200 | 0.0028 | 0.001099 | -- | -- | -- |
23 | feed(win32msc) | feed(win32msc) | Both | SHMEM | default | 200 | 0.011 | 0.000041 | -- | 12213 | 0.0 |
24 | feed(win32msc) | dopey(sunos5) | Both | TCP | default | 200 | 0.13 | 0.001349 | -- | 370 | -- |
25 | feed(win32msc) | feed(win32msc) | Both | TCP | default | 200 | 0.020 | 0.001220 | -- | 409 | -- |
Interpreting the Results
Here are some of the conclusions I draw from these tests.
- Shared memory was 4 to 40 times faster than TCP or UDP, even when both processes ran on the same host.
- Eliminating the delays caused by semaphores, by using an alternate mutex mechanism, can improve shared-memory speeds for small messages by a factor of roughly 2 to 10.
- The "poll" option significantly improved average and maximum read times but decreased throughput and increased latency.
- The "confirm_write" option is about twice as slow as the default (no confirmation) for small messages (< 1 KB).
- Backplane GLOBMEM is 50% to 100% slower than single-board SHMEM.
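The options compared above are selected per buffer in the NML configuration file. The fragment below is a sketch only: the buffer (B) and process (P) line layout follows the NML configuration file format, but the buffer name, process name, key, and RPC number are invented, and the exact fields should be checked against the NML Configuration Files documentation.

```
# Buffers:
# Name      Type   Host   size  neut?  RPC#        buf#  max_procs  key   [options]
B  perf_buf SHMEM  dopey  2048  0      0x20001001  1     8          1001  mutex=no_switching
# Processes:
# Name     Buffer    Type   Host   ops  server?  timeout  master?  cnum
P  nmlperf  perf_buf  LOCAL  dopey  RW   0        0.5      1        1
```

Replacing the option at the end of the B line (e.g. with "queue", or leaving it off for the default) is how the variants in the table above were produced.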