Sunday, November 25, 2007 7:09:36 PM
A few days ago I read some articles about the new thread scheduler in FreeBSD, named ULE. So what are the new features of the ULE scheduler, and how does it differ from the 4BSD scheduler (FreeBSD's old scheduler)?
OK... let's discuss it!
The current FreeBSD scheduler (4BSD scheduler) has its roots in the 4.3BSD scheduler.
It has excellent interactive performance and efficient algorithms under small loads. It does not, however, take full advantage of multiple CPUs. It has no support for processor affinity or binding. It also has no mechanism for distinguishing between CPUs of varying capability, which is important for simultaneous multithreading (SMT).
FreeBSD inherited the traditional BSD scheduler when it branched off from 4.3BSD. FreeBSD extended the scheduler's functionality, adding scheduling classes and basic SMP support.
Two new classes, real-time and idle, were added early on in FreeBSD. Idle-priority threads are run only when there are no time-sharing or real-time threads to run. Real-time threads are allowed to run until they block or until a higher-priority real-time thread becomes available. When the SMP project was introduced, an interrupt class was added as well. Interrupt-class threads have the same properties as real-time threads except that their priorities are lower, where lower priorities are given preference in BSD. The classes are simply implemented as subdivisions of the available priority space. The time-sharing class is the only subdivision that adjusts priorities based on CPU usage.
Much effort went into tuning the various parameters of the 4BSD scheduler to achieve good interactive performance under heavy load, as was required by BSD's primary user base. It was very important that systems remain responsive while being used as a server. In addition to this, the nice concept was further refined. To facilitate the use of programs that wish to consume only idle CPU slices, processes with a nice setting more than 20 higher than the least nice currently running process are not permitted to run at all. This allows distributed programs such as SETI or the rc4 cracking project to run without impacting the normal workload of a machine.
The ULE scheduler, by contrast, was designed to address the growing needs of FreeBSD on SMP/SMT platforms and under heavy workloads. It supports CPU affinity and has constant execution time regardless of the number of threads. In addition to these primary performance-related goals, it is also careful to identify interactive tasks and give them the lowest latency response possible. The core scheduling components include several queues, two CPU load-balancing algorithms, an interactivity scorer, a CPU usage estimator, a slice calculator, and a priority calculator.
The original FreeBSD scheduler maintains a global list of threads that it traverses once per second to recalculate their priorities. The use of a single list for all threads means that the performance of the scheduler is dependent on the number of tasks in the system, and as the number of tasks grows, more CPU time must be spent in the scheduler maintaining the list. A design goal of the ULE scheduler was to avoid the need to consider all the runnable threads in the system to make a scheduling decision.
The ULE scheduler creates a set of three queues for each CPU in the system. Having per-processor queues makes it possible to implement processor affinity in an SMP system.
One queue is the idle queue, where all idle threads are stored. The other two queues are designated current and next. Threads are picked to run, in priority order, from the current queue until it is empty, at which point the current and next queues are swapped and scheduling is started again. Threads in the idle queue are run only when the other two queues are empty. Real-time and interrupt threads are always inserted into the current queue so that they will have the least possible scheduling latency. Interactive threads are also inserted into the current queue to keep the interactive response of the system acceptable. A thread is considered to be interactive if the ratio of its voluntary sleep time versus its runtime is below a certain threshold. The interactivity threshold is defined in the ULE code and is not configurable. ULE uses two equations to compute the interactivity score of a thread. For threads whose sleep time exceeds their runtime, the following equation is used:

    score = scaling factor / (sleep / run)
When a thread's runtime exceeds its sleep time, the following equation is used instead:

    score = (2 × scaling factor) − scaling factor / (run / sleep)
The scaling factor is the maximum interactivity score divided by two. Threads that score below the interactivity threshold are considered to be interactive; all others are noninteractive. The sched_interact_update() routine is called at several points in a thread’s existence—for example, when the thread is awakened by a wakeup() call—to update the thread’s runtime and sleep time. The sleep-time and runtime values are allowed to grow only to a certain limit. When the sum of the runtime and sleep time passes the limit, the values are reduced to bring them back into range. An interactive thread whose sleep history was not remembered at all would not remain interactive, resulting in a poor user experience. Remembering an interactive thread’s sleep time for too long would allow the thread to have more than its fair share of the CPU. The amount of history that is kept and the interactivity threshold are the two values that most strongly influence a user’s interactive experience on the system.
Noninteractive threads are put into the next queue and are scheduled to run when the queues are switched. Switching the queues guarantees that a thread gets to run at least once every two queue switches regardless of priority, which ensures fair sharing of the processor.
--------
Now I will try to recompile my FreeBSD kernel and enable the ULE scheduler as the thread scheduler on my system...
[blu3c4t@mahardhika ~]$ cd /usr/src/sys/i386/conf
[blu3c4t@mahardhika /usr/src/sys/i386/conf]$ su
Password:
[root@mahardhika /usr/src/sys/i386/conf]# cp GENERIC ULEKERNEL
[root@mahardhika /usr/src/sys/i386/conf]# vi ULEKERNEL
[edit the configuration file]
1. Comment out the line "options SCHED_4BSD" by putting a # in front of it, and add this line:
options SCHED_ULE # ULE scheduler
2. Find the ident line:
ident GENERIC
Change the line to read :
ident ULEKERNEL
[save the configuration file]
It's time to compile the kernel.
[root@mahardhika /usr/src/sys/i386/conf]# cd /usr/src
[root@mahardhika /usr/src]# make buildkernel KERNCONF=ULEKERNEL
[root@mahardhika /usr/src]# make installkernel KERNCONF=ULEKERNEL
It's done; let's reboot now!
[root@mahardhika /usr/src]# shutdown -r now
Then after the reboot...
[root@mahardhika ~]# uname -a
FreeBSD mahardhika.stttelkom.ac.id 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Sun Nov 25 23:02:06 WIT 2007 blu3c4t@mahardhika.stttelkom.ac.id:/usr/obj/usr/src/sys/mahardhika i386
Summary
To this day there is much debate about the ULE scheduler's performance compared to the 4BSD scheduler, but as a FreeBSD fan, I hope the new scheduler will bring significant improvements to FreeBSD.
You can see ULE vs. 4BSD performance benchmark at this link:
http://www.thejemreport.com/mambo/content/view/113/
Word around the campfire has been that the ULE scheduler is in some way “faster” than the 4BSD scheduler in FreeBSD. While conducting a benchmarking project to compare hardware performance, I performed all of my testing with both the ULE and the 4BSD schedulers to show the difference in performance. Read on for the results.
The Hardware
For this article I acquired the following hardware to use for two systems. They shared the same optical drive, hard drive, video card, RAM, cables, power supply, and chassis. Only the motherboard and CPU were changed to switch from the AMD machine to the Intel machine. This was done to prevent variations that could be caused by hardware manufacturing flaws or differences in output due to brand.
- Asus K8V Deluxe
- AMD Athlon64 3200+
- Thermaltake K8 Silent Boost HSF
- Intel D875PBZ (rev. 301)
- Intel Pentium4 3.2E
- Corsair PC3200 TwinX-LL 1024MB kit
- Western Digital 36GB Raptor SATA hard drive
- Sony DDU1621 DVD-ROM
- ATI Radeon 9800Pro All-In-Wonder 128MB
- Antec TrueBlue 480w power supply
- Skyhawk Galaxy case with front, rear, side, and cowl fans
The heatsink/fan (HSF) unit that I used for the Intel processor came from Intel. It’s a modified version of the traditional socket478 fan, except it has a larger copper core than the previous edition and the fins on the heatsink are in a sort of star pattern. The locking mechanism is the same. The heatsink compound I used was already on the bottom of the heatsink in a small gray pad. Intel provided a syringe of extra compound, but I didn’t have cause to use it.
The Thermaltake K8 Silentboost is an excellent solid-copper HSF. It’s a good thing, too — it was my only choice. It seems there aren’t (or weren’t when I bought this unit two months ago) many manufacturers that make HSFs for AMD64 processors. For the Athlon64 processor I used the standard-issue white heatsink compound, which is verified and certified by AMD.
The RAM was sent to me by Corsair for this and other benchmarking projects. It is the same retail box kit that you can buy through any authorized reseller. I could have requested RAM from a number of other manufacturers, but I chose Corsair for its high level of compatibility with motherboards, its low latency and reliable performance.
The WD Raptor is the fastest SATA drive on the market, and I acquired it for this and other benchmarking projects.
The Radeon 9800Pro AIW was sent to me by ATI for a previous review of SciTech’s SNAP Graphics drivers. I chose it for this review because it is a reasonable choice for a high-end single CPU workstation, and because although I didn’t do any graphics testing in this review, I plan on doing several graphic-intensive benchmarking projects in the future and I will be using this card for those tests. It helps to keep things as standard as possible to maintain cross-compatibility with my reviews.
The Antec TrueBlue 480 is both quiet and powerful. I had a long internal debate over whether I should get this power supply or the Vantec Stealth 420. Both are excellent supplies, but in the end I decided that the slightly cheaper Antec would be a better choice for this project because of its automatic fan control (the Vantec has a manual switch on the back) and its higher voltage and amperage ratings. The blue LED is superfluous — you can hardly see it when the case covers are on.
The Skyhawk case was a poor choice, but it’s all I had available to me. It is not FCC approved because of the acrylic window in the side, although I eliminated that variable by leaving the side cover off for all of my testing. This was also to improve ventilation and maintain a more consistent temperature inside the system. This case is totally unsuitable for a system based on the Prescott core because of its high operating temperature; despite all of the fans it has in it, ten seconds of idle operation with the side cover on forced the CPU fan to speeds of nearly 5000RPM.
Each system was assembled with care and all wires and connectors were correctly connected according to the manual. The BIOS was adjusted as necessary and the RAM sticks were in the proper slots for best performance. I decided to conduct my tests in a real computer chassis because, oddly, no other reviewers seem to do that. They bare-board everything, which means that they will never discover problems intrinsic to chassis assembly, such as the trouble I had with the Prescott system’s fan noise. This benchmarking project was designed to closely mimic a real workstation system, not a fictional lab testing environment.
The Software
The operating system I used was FreeBSD 5.2.1-RELEASE. If you'd like to learn more about how I configured the operating system and how I devised my benchmarking methods, or if you'd like to learn how to benchmark hardware using FreeBSD, I've written a separate article about it here.

I used the standard Unix time command to conduct stopwatch tests, stream and ubench for synthetic tests, and OpenSSL, oggenc, and cdparanoia for my real-world tests. I did not conduct any testing in X; that would be a totally separate review, and the research and testing for it have already begun.
I generated statistics for comparing the schedulers using ministat, which is a part of the FreeBSD base system. I didn’t make any graphics to show differences in performance. If you want to see pretty graphs that mislead readers and suggest flawed conclusions, you’ll be disappointed with this review. You shouldn’t need a graph or chart to put this data in perspective anyway — it’s pretty straightforward.
Stopwatch Tests
All time is listed in seconds and each number represents the mean average of the real time (the total elapsed time), user time (the time it takes to execute the utility), and system overhead time of three distinct test iterations.

It's simple: I timed how long it took to compile the base system with varying numbers of concurrent processes. I also compiled Apache version 2.0.48_3 using no concurrent processes. I experimented with doing three buildworld iterations with ULE, then recompiling the kernel with 4BSD and running the same tests. I found that it didn't produce a measurable difference in the results if I ran all nine scheduler tests in a row (with restarts in between, of course) or if I switched every three.
For the Apache2 build test I built the port and let it download and install all of the necessary dependencies. I then uninstalled Apache2 only — leaving the dependencies in place and the downloaded source code in the distfiles directory — and restarted in single-user mode, where Apache2 was rebuilt and timed. The time includes clean time; the exact command was time make install clean.
[Tables: Pentium4 Real/User/System Time; Athlon64/i386 Real/User/System Time; Athlon64/AMD64 Real/User/System Time — the timing data did not survive; only the table titles remain.]
For my next test I compiled Apache2. With the Pentium4 and the Athlon64 in i386 mode there was no measurable difference in any of the three aspects of compile time. With many more test runs I could have probably shown a very small difference in performance, but after three tests I didn’t think it was worth the effort. However the AMD64 edition had different results:
[Tables: Apache 2 Real/User/System Time — the timing data did not survive; only the table titles remain.]
Synthetic Benchmarks
Synthetic tests can reveal information that you might not otherwise be able to obtain, but in general you should not put a lot of stock in them. These numbers are not necessarily useful for comparing between systems, but they do show a significant difference in scheduler performance.

I tested with two synthetic utilities: stream and ubench. Stream showed no difference in memory bandwidth between the two schedulers, but ubench showed a rather noticeable difference in the Pentium4 system and a slight difference in the Athlon64/AMD64 system. The only test case that was inconclusive was the Athlon64 in i386 mode, which produced exactly the same numbers in two out of three test runs.
[Tables: Ubench on Pentium4; Ubench on Athlon64/AMD64 — the data did not survive; only the table titles remain.]
Ubench seems to be quite buggy, as I never once got it to complete its testing procedure. It would run the CPU test and then exit on a signal 6 (in 64-bit mode) or a signal 11 (in i386 mode) during the memory test. Despite its inaccuracy in judging CPU power, the results it shows for the schedulers are consistent with the results of the other tests: again, 4BSD seems to have a slight advantage.
Real-World Tests
This is the most useful of all of the data I collected because it shows how a system will perform in real-world scenarios. I didn't test a lot of different programs here because many of the tests that I think would be best must be performed in X11. I tried ripping a CD with cdparanoia, but the results were too close to say that there was a meaningful difference between schedulers. The point of this project is to show where there are performance differences, and if there are none then I'm not going to spend the time putting inconclusive data into tables and generating statistics.

Just in case you're curious, the CD I tested with is LA Woman by The Doors, and it took roughly 660 seconds to rip it to the hard drive. From there I encoded the tracks with oggenc from the vorbis-tools port. The times below are, as above, listed in seconds, and they represent mean averages from three separate testing runs. The exact command was time oggenc *, and it was run in a directory containing only the ripped WAV files from the cdparanoia test. The only conclusive difference in times was in the Athlon64/i386 test; the Pentium4 and Athlon64/AMD64 times were virtually the same between schedulers.
[Tables: Oggenc Real/User/Sys Time — the timing data did not survive; only the table titles remain.]
Lastly, I used OpenSSL from the FreeBSD base system as a test. The output was piped to a text file for each run. The exact command used was openssl speed > run1.txt, replacing the number in the text file name to correspond with the number of the test run.
There are so many results generated for the OpenSSL benchmark that it would take days to put it all into proper files, generate statistics and then put it all into tables. To make matters worse, 4BSD was faster in some of the areas of the test and ULE was faster in others. The best I can do is make this information available to anyone who wants to download it to compare it for themselves. Please see the FTP information at the end of this article if you’d like to download the raw data for this and other tests that I performed.
When reviewing the data, please note that OpenSSL in FreeBSD has hand-optimized assembler code for i386 and will therefore be more favorable to the Pentium4 in this test scenario; the AMD64 code is all in C, so the code will perform differently. According to FreeBSD developers it is possible to optimize the code for AMD64 in the same way, but it would increase the clutter of the base system and it’s a matter of debate whether or not it should be done.
Conclusions
The true purpose of collecting all of this data was to attempt to determine what the performance difference between the test systems was, with a special emphasis on showing the difference in performance for the Athlon64 in both 64-bit and 32-bit modes. I performed all of my testing with both the ULE and 4BSD schedulers to see if they made any difference between architectures and technologies, and to provide this data for FreeBSD developers with the hope that they can use it to improve the performance of the newer ULE scheduler. I don't know how useful this data will be to them or to anyone else, but here it is nonetheless.

In general I found the 4BSD scheduler to be faster when the system was running concurrent processes. There is one thing that I regrettably could not show: although I don't have numbers to prove it, compiling multiple programs at once (in multiple terminal windows) will slow the ULE scheduler to a crawl, whereas 4BSD will keep right on going at only a slightly slower pace. I found this happening on all three systems, and it was a problem for me because I multitask extensively.
This review will hopefully stand as both a basis for more testing (for myself and others) and as a measuring stick by which all future comparisons of this nature are judged. I look forward to retesting in the future when more changes have been made, and next time I’ll use more real-world tests.