Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  133 Posts | 8 Stories | 368 Comments | 162 Trackbacks

News



Archives

Post Categories

Programming

Thursday, June 18, 2015 #

Some bugs are easy some are hard. If you are dealing with a massive multithreaded distributed application consisting of many different processes talking to each other performance issues are pretty much impossible to detect with standard profilers. But with WPT the word impossible can be replaced with possible although much work is still needed. Recently I had an interesting regression issue where a use case did become over one second slower which was a quite big difference. So it should be easy to spot right? Two things were easy but one fraction ( 60% ) was still hiding in the data. Since the title of this post is already very clear here is a small test application which shows the issue:

    class Program
    {
        static Stopwatch StartTime = Stopwatch.StartNew();

        static TimeSpan WaitTime = TimeSpan.FromMilliseconds(2);

        static bool ReturnTrueAfter2ms()
        {
            if( StartTime.Elapsed.Ticks > WaitTime.Ticks )
            {
                StartTime.Restart();
                return true;
            }

            return false;
        }
        
        static void Main(string[] args)
        {
            while (true)
            {
                SpinWait.SpinUntil(ReturnTrueAfter2ms);
            }
        }
    }
In the main method we are continuously Spinwaiting for an operation which usually takes some ms (e.g. network IO) to complete. The goal of SpinWait is to spare you context switches in favor of burning CPU cycles. We are talking about some microseconds to spare instead of a context switch. Spinwait will therefore by default spin a few microseconds until it gives up and tries to yield to other threads with increasing persistence. If you look at the code of SpinWait

 

image

you find that for every 20th run it will sleep for 1ms. That in itself is not a big deal if your method takes longer to execute. Only by looking at the code all looks sound and reasonable. What could possibly go wrong? Let have a look at the costs for a context switch with Thread.Yield, Thread.Sleep ( 0 ) and Thread.Sleep( 1 ). Thread.Yield is quite cheap if the other thread has not much work to do. The cost looks like 6 microseconds for doing only a context switch and up to 70 microseconds if the other thread is doing some work until it waits again for something.

image

Thread.Sleep( 0 ) on the other hand lets the thread run forever if no other threads want to run on this CPU. The scheduler looks after the thread for ca. every 250ms where a non switching context switch event is logged but apart from that nothing is stopping my thread.

Things become interesting for Thread.Sleep(1) where we forcefully give up our thread quantum.

 

image

Remember the goal of SpinWait is to spare you an expensive context switch. It spins for about 20-80 microseconds. Then it tries from time to time a Thread.Yield, Thread.Sleep ( 0 ) and every 20th time a Thread.Sleep(1). In the comments of SpinWait it is noted that a thread can wait for a long time. In fact a Thread.Sleep(1) can sleep up to 15ms because you give up your current thread quantum and the scheduler looks after your thread only during the next clock tick which is 15,6ms later again to check if your thread is ready to run again. SpinWait is therefore only useful if your operation completes within a few microseconds in 99.9% of all cases. But if your operation frequently or nearly always takes much longer than a typical context switch then this class will not make you any faster. Instead of sparing a few microseconds you are potentially wasting up to 15 milliseconds every time you call SpingWait.SpinUntil!

If you have many call stacks in your application which cause SpinWait to sleep for a thread quantum of 15ms then you are using SpinWait wrongly!

image

This one was very hard to find because I usually search for long pauses when I am after a one second delay. But this time 600ms were scattered across dozens of calls almost evenly distributed throughout the use case because SpinWait was used in a central low level component which was frequently called.

If you looked at the Excel chart above and wondered where difference between with and without Visual Studio for Thread.Sleep(1) comes from I can tell you now. By default the windows timer ticks every 15,6ms which then triggers the thread scheduler to check out its running threads. The strange 15,6ms value comes from 64 ticks/s = 15,6 ms/tick. If someone uses a multimedia resolution timer or changes the system wide timer tick to increase the tick rate your threads will more often be watched after the thread scheduler. That is the reason why Thread.Sleep(1) can wait for "only" 2ms or 15 ms depending on which applications are running which may alter the system wide clock rate. Another great source of many more details can be found at https://randomascii.wordpress.com/2013/07/08/windows-timer-resolution-megawatts-wasted/.

You can use the clockres tool from SysInternals to check out your current timer resolution. The dynamic windows timer tick rate makes it especially tricky to find performance degradations with unit tests on machines where Visual Studio or another WPF GUI is running where the tick rate is 1kHz. But on an installed server the clock tick rate is 64 which will make a big difference. Luckily the performance regression tests did show the issue long before any customer would get this version. It really pays off to performance regression tests. It is even better if you not only execute performance regression tests but you use xperf to record profiling data for all performance tests and store profiling data somewhere. Later when a test breaks you don't need to go to the testing guys and ask them to install the current baseline which has the issue and an old baseline which was good. Then you need to profile both use cases and then you can start analyzing. What a waste of time!

It is much easier to let your central performance tests run with ETW profiling which will affect even heavy use cases < 10%. But you get all the data you could wish for to find the root cause much faster. If you train the people how to use WPA to analyze issues and how to use PerfView, Windbg, VMMap and memory dumps you can create software which has significantly less issues than your competitors because you really understand your system. The usual thinking to treat system libraries and the operating system as black boxes where miracles happen has its place to not be distracted by the many things you need to consider before you write one line of software. But you should not stop at the black boxes. Instead you should develop the ability to look into the black boxes when necessary and write a nice blog post about the found issues so other people can learn how things (not) work.