Alois Kraus

blog

  Home  |   Contact  |   Syndication    |   Login
  133 Posts | 8 Stories | 368 Comments | 162 Trackbacks

News



Archives

Image Galleries

Programming

When you are tracking down handle or other leaks you usually need to resort to Windbg which is a great tool for unmanaged code but not so much for managed code because the usual stuff like !htrace only get the unmanaged call stack.  For managed code this approach breaks because you will not see which managed callers did allocate or free a handle. Other Windbg approaches like putting a breakpoint and then print out the handle and the call stack at the allocate/free methods work, but for heavily called methods they slow down the application too much. A much easier and faster approach is to hook the methods in question and write out an ETW event for each acquire/release call where the ETW infrastructure takes care of walking the stack. To hook specific exported methods I usually use EasyHook which is a great library to hook into pretty much any C style exported method.

Below I present a generic way to use hooking of resource acquire and release methods and tracing them with ETW which enables a generic way to track down any resource leakage as long as the necessary functions are within the reach of an API interception framework. To my knowledge that was not done before in any public library and I take therefore credit to make this approach public. Perhaps I have invented this approach but one never knows for sure.

EasyHook Introduction

EasyHook lets you define managed callbacks to intercept publicly exported methods of any processes. In the managed callback you can extend the functionality of an existing API or trace method arguments which is more often the case. If you have e.g. an executable which exports a method named Func which takes 5 arguments and returns an int like

extern "C" int WINAPI Func(int *a, int *b, int *c, int *d, int *e)
{
    *a = 1;
    *b = 2;
    *c = 3;
    *d = 4;
    *e = 5;
    return 1234;
}

you can write in managed code an interceptor for it which additionally needs a matching delegate and a PInvoke signature of the original method

        [UnmanagedFunctionPointer(CallingConvention.StdCall, CharSet = CharSet.Unicode, SetLastError = true)]
        unsafe delegate int FuncDelegate(int* a, int* b, int* c, int* d, int* e);

        [DllImport("UnmanagedWithExports.exe", CharSet = CharSet.Unicode, SetLastError = true, CallingConvention = CallingConvention.StdCall)]
        unsafe static extern int Func(int* a, int* b, int* c, int* d, int* e);
 
        unsafe static int Func_Hooked(int* a, int* b, int* c, int* d, int* e)
        {
            int lret = Func(a, b, c, d, e);
            return lret;
        }

To override in the target process you need to execute

        FuncHook = LocalHook.Create(LocalHook.GetProcAddress("UnmanagedWithExports.exe", "Func"), new FuncDelegate(Func_Hooked), this);
        FuncHook.ThreadACL.SetExclusiveACL(null);

The Hook creation is self explanatory but the hook is  disabled for all threads by default. You can either specify a list of threads for which the hook should be enabled ( LocalHook.ThreadACL.SetInclusiveACL(...) ) or you can configure a list of threads for which it should be disabled ( LocalHook.ThreadACL.SetExclusiveACL(…) ). By calling SetExclusiveACL with a null list or new int [] { 0 } which is not a valid thread number you are effectively enabling the hook for all threads.

That was the whirlwind tour into EasyHook which makes it really easy to tap into otherwise unreachable parts of code execution. Most Windows APIs are prepared to be hooked so you can intercept pretty much anything. In practice all C style exported methods are usually within your reach. With that new capability you can do previously impossible things like

  • Write a generic resource leak detection tool which tracks every resource acquire and release call.
  • Export a dummy method in your C++ dll which is called from code where previously sporadic bugs did happen. E.g. trace each invocation of the dummy method which passes internal state variables from the sporadically failing method. 
    • Now you can hook the empty method and intercept the method calls with full call stacks during your automated builds only on the machines where it did occur with minimal overhead.
  • Record the passed parameters to any exported method without the need to change the source code (which you probably don't have anyway) of this method.
  • Alter method arguments at runtime (e.g. redirect file open calls to other files).
  • ….

If you try it out you will find that the interception part works pretty well but for some reason the call stack for x64 processes looks broken when you inspect it in the debugger. We only see the managed intercepted call but no stack frames from our callers:

0:000> kc
 # Call Site
00 EasyHook64!LhPatchHookReturnAddress
01 EasyHook64!LhBarrierBeginStackTrace
02 0x0
03 0x0
04 0x0
05 0x0
06 clr!UM2MThunk_WrapperHelper
07 clr!UM2MThunk_Wrapper
08 clr!Thread::DoADCallBack
09 clr!UM2MDoADCallBack
0a clr!UMThunkStub
0b 0x0
0c 0x0
0d 0x0
0e 0x0
0f 0x0
10 0x0
11 0x0
12 0x0
13 0x0
14 0x0
15 0x0
16 0x0

but it should show our callers as well which should look like
0:000> kc
 # Call Site
00 EasyHook64!LhPatchHookReturnAddress
01 EasyHook64!LhBarrierBeginStackTrace
02 0x0
03 0x0
04 0x0
05 0x0
06 clr!UM2MThunk_WrapperHelper
07 clr!UM2MThunk_Wrapper
08 clr!Thread::DoADCallBack
09 clr!UM2MDoADCallBack
0a clr!UMThunkStub
0b UnmanagedWithExports!main
0c UnmanagedWithExports!__tmainCRTStartup
0d KERNEL32!BaseThreadInitThunk
0e ntdll!RtlUserThreadStart

The problem comes from the fact that under 64 bit stack walking works differently than under 32 bit. When the Windows x64 calling convention was defined the engineers wanted to get rid of the exception handling overhead in the happy code path unlike in the Windows x86 execution model. With x86 every method which contains an exception handler needs to set their exception handler  callback even if no exception ever happens.

In contrast to that x64 code has zero overhead in terms of code execution speed no matter how many exception handlers are present in any method! To make that possible in x64 all code must additionally register exception unwind information structures which describe the stack layout and their exception handlers for every part of the function until function exit to enable proper stack unwinding by parsing the additional metadata if an exception is about to be thrown.

x64 Stack Walking and Unwind Information

The exception unwind information is also used by the stackwalker which is part of the reason why our stack is broken. To intercept method calls EasyHook copies assembly code to a memory location which acts as trampoline code for our managed hook callback. This assembler trampoline code has no associated exception unwind information which is a problem if you try to step with a debugger through a hooked method. I tried to add the missing exception unwind information to the dynamically emitted assembler code by patching EasyHook. This led down into the dark corners of assembler ABI in function Prologue and Epilogue.

Here are some links with further information if you are interested:

It took a while to wrap my head around it since it was not clear how to verify that the unwind info registration was correct. Luckily Windbg has a nice command to show you for any address the associated Unwind Information. The command is call .fnent xxx. xxxx is the method address. Below is the output of Windbg for the trampoline assembler code with the now dynamically added Unwind Information:

0:000> .fnent 7ff6`79fa0318
Debugger function entry 0000003a`d75be570 for:

BeginAddress      = 00000000`00000318
EndAddress        = 00000000`000003c9
UnwindInfoAddress = 00000000`00000274

Unwind info at 00007ff6`79fa0274, 22 bytes
  version 1, flags 0, prolog 2d, codes f
  frame reg 5 (rbp), frame offs 00h
  00: offs 26, unwind op 8, op info 3    UWOP_SAVE_XMM128 FrameOffset: 50 reg: xmm3.
  02: offs 21, unwind op 8, op info 2    UWOP_SAVE_XMM128 FrameOffset: 40 reg: xmm2.
  04: offs 1c, unwind op 8, op info 1    UWOP_SAVE_XMM128 FrameOffset: 30 reg: xmm1.
  06: offs 17, unwind op 8, op info 0    UWOP_SAVE_XMM128 FrameOffset: 20 reg: xmm0.
  08: offs 12, unwind op 3, op info 0    UWOP_SET_FPREG.
  09: offs e, unwind op 2, op info c    UWOP_ALLOC_SMALL.
  0a: offs a, unwind op 0, op info 5    UWOP_PUSH_NONVOL reg: rbp.
  0b: offs 9, unwind op 0, op info 9    UWOP_PUSH_NONVOL reg: r9.
  0c: offs 7, unwind op 0, op info 8    UWOP_PUSH_NONVOL reg: r8.
  0d: offs 5, unwind op 0, op info 2    UWOP_PUSH_NONVOL reg: rdx.
  0e: offs 4, unwind op 0, op info 1    UWOP_PUSH_NONVOL reg: rcx.

If you need to do it by yourself you can look at https://github.com/Alois-xx/EasyHook/blob/develop/DriverShared/LocalHook/install.c in method AddUnwindInfos

By using the Windbg stack commands (k and related) I could immediately see if I was on the right track. If you want to see correct x64 method examples you can check out any C/C++ x64 compiler output to see what is possible. The dumpbin tool can also display the exception unwind infos with

dumpbin /unwindinfo dddd.exe/.dll

for any executable or dll.

The version 1 unwind codes are only necessary to describe the function Prologue which is sufficient to reconstruct the stack layout of all saved registers at any time in a method. If you have more exotic Epilogues you need to describe them with specific Unwind codes as well which were introduced with Windows 10 (I guess)? The Unwind Info version 2 is not documented at MSDN anywhere yet but you can have a look at a small snippet of the Windows Kernel stack unwinder which is copied into CoreCLR at: https://github.com/dotnet/coreclr/blob/master/src/unwinder/amd64/unwinder_amd64.cpp. See the methods OOPStackUnwinderAMD64::UnwindEpilogue, OOPStackUnwinderAMD64::VirtualUnwind and related.

After adding correct unwind codes stack walking did work only up to certain point where it still got corrupted because EasyHook repeatedly replaces the return address of the current method with the CLR stub and the backpatch routine to reset the old return address after our managed hook returns. During this time the stack is by definition not walkable because the return address of our caller is not even present on the stack while we are executing our managed callback. To get out of this EasyHook already provides a solution by replacing as direct return address the original hook entry point temporarily to make the stack walkable again. But that works only while we are in a hook handler in  terms of EasyHook speak. That means that we can restore the stack only before we call the hooked method. After that we cannot restore the return location anymore. That is a problem if we want to log the returned arguments of a hooked method (e.g. the actual handle value of CreateWindow, CreateFile, …). Since I know that our current return address must be the hook specific assembler code to restore the original caller address I did extend  NativeAPI.LhBarrierBeginStackTrace and NativeAPI.LhBarrierEndStackTrace of EasyHook to locate the hook stub code and to replace it with the original entry point of the first hooked function of the current thread. After that change EasyHook is able to get you into a state where the stack is fully walkable before and after you call the hooked method for x64 and x86.

Unfortunately the x64 ETW Stackwalker of Windows 7 stops on any method which has no matching loaded module (e.g. trampoline hook code and JITed code) so this is still not working with ETW events in Windows 7. But custom ETW events with proper Unwind Infos will work on Windows 8 or later. I still wish that MS would fix the Windows 7 ETW stackwalker because this also breaks profiling JITed x64 code on Windows 7 machines. 32 bit code works though since Windows 7.

Generic ETW Resource Tracing

At the time of writing you can see the changes on my EasyHook fork at https://github.com/Alois-xx/EasyHook/commits/develop. I hope that the changes will get into the main line of EasyHook soon. I still have left out an important aspect of a generic ETW leak tracer. How should we trace it so it is easy to diagnose with WPA? WPA for Windows 10 can graph custom event data if they are numeric and non hexadecimal Sum columns. We need to trace every resource acquire and release. Usually you get back some handle so this should be included as well. The resource could be something big like a bitmap which has  a specific memory size so it makes sense to trace the allocation size as well. The basic ETW trace calls should therefore have the form

    AcquireResource(ulong Handle, long AllocSize, string Allocator)
    ReleaseResource(ulong Handle, long AllocSize, string Allocator)

With such an API we can trace for CreateFile, CreateWindow, … and such with a simple ETW trace call all relevant information. For simple handle based resources where no metadata is available how big they are it makes sense to use 1 in the Aquire call as allocation size and -1 for the Release call. If all Aquire and Release calls for the Allocator e.g. CreateWindow are balanced out the WPA Sum column for the total allocation size (Field 2) will balance out to zero. If there is a leak we will see a positive excess in WPA for which we can filter.

 

image

Here we see e.g. that our free method which seems to do the wrong thing seems to release sometimes with an allocation size of zero 68 times which is exactly the number of outstanding allocation size which is graphed in Field 2.

As an example for CreateWindow we would issue

   AcquireResource(windowHandle, 1, "CreateWindow")

and for each hooked DestroyWindow call we would trace

   ReleaseResource(handle, -1, "DestroyWindow")

You can use the tool ETWstackwalk as a simple window handle leak detector which is now part of my EasyHook fork. You need to compile ETWStackwalk.exe as Administrator under x64 to register the ETW manifest. You get then something like this as output.

Window Handle Leak Tracing

ETWStackwalk is part of EasyHook which serves as demonstration what can be done with it. Here is its help output

ETWStackWalk.exe v1.0.0.0 by Alois Kraus 2016
Injects an ETW tracer to external processes to find resource leaks. Works for x64 since Win 8 or later. Works for 32 bit processes since Windows 7.
ETWStackWalk [-hook [createfile] [-outfile xxx.etl] [createwindow]] [-debug] [-help] [-pid ddd] [-start executable with optionalargs]
   -help                 This message
   -hook                 Hook specified methods to trace handle leaks.
                         createwindow Intercept Create/DestroyWindow calls.
                         createfile   Intercept CreateFile calls.
   -outfile xxxx.etl     Output file name to which the hook tracing data will be written to. Default is %temp%\ResourceLeaks.etl
   -pid ddd              Hook into specified process
   -start executable     Start executable and then hook after 5s to give it time to initialize.
   -noetw                Disable ETW recording (used for unit testing).
   -debug                Used to debug EasyHook trampoline code. Hook notepad.exe and start Windbg of Windows 10 SDK.
Examples:
Monitor all CreateWindow/DestroyWindow calls to detect window handle leaks in an already running application with process id ddd
 ETWStackWalk.exe -hook createwindow -pid dddd
Monitor all CreateFile calls. There you can log e.g. return code, passed file name, flags and other flags of interest.
 ETWStackWalk.exe -hook createfile -pid dddd
Test hooking a method which writes if hooked an ETW event for every allocation and deallcation event to simulate a handle leak.
 ETWStackWalk.exe -hook funcetw -start UnmanagedWithExports.exe 500000
If something goes wrong you can directly debug it with Windbg if you have the Win10 SDK installed by adding the -debug switch
 ETWStackWalk.exe -hook funcetw -debug -start UnmanagedWithExports.exe 500000

To start window handle leak tracing you need to get the pid of an already running process because EasyHook is not able to inject its code while the process is still starting the CLR which can lead to all sorts of weird errors. Instead you can attach to an already running process and tell ETWStackWalk to enable createwindow/destroywindow tracing. You can also configure and output etl file if you wish our you stick to the default.

D:\Source\git\easyhook\Build\netfx4-Release\x64>ETWStackWalk.exe -hook createwindow -outfile c:\temp\resourceLeak.etl  -pid 14960
Start ETW Recording ...
Hooking into proces 14960 for createwindow 
Press Enter to stop recording
ETWStackwalk has been installed in target 14960.


Stopping ETW Recording. This can take some minutes ...
        Stop WPR Output: To analyze this trace run: wpa "c:\temp\resourceleak.etl"
Start Analysis with wpa c:\temp\resourceleak.etl -profile easyhook.wpaprofile

I have crated a suitable easyhook.wpaprofile which is dedicated to easily analyze resource leaks.  There you get a nice overview per handle and all associated aquire and release operations. The graph shows the duration how long a handle was open until it was closed again. The Start/Stop ETW Opcode Name corresponds to acquire/release calls for this handle. If handles are reused you can still see any excess (too many aquire or release calls) in the AllocSize column which is a sum of all calls  where each CreateWindow counts with +1 and all calls to DestroyWindow count as -1. If all calls cancel out you see for that handle an summed AllocSize (Field 2) of zero. The allocation size shows you that in my test while opening a solution in VS2015 where 6 additional windows were created and not yet closed. This makes it easy to filter for the outstanding window handles which have an AllocSize > 0. Then you can open drill into the call stacks and you have found the root cause of pretty much any window handle leak with practically zero speed impact of your application while this approach is generic and can be applied to any resource leak. This tester serves only as demonstration that is based on the combination of ETW and EasyHook but with this concept we can tackle a wide range of sporadic resource leaks which are now solvable with a systematic approach.

 

image

If we hook other generic methods like CloseHandle we are in a bad position since it is a generic close method which closes many different handle types. But we can (literally) sort things out if we group in WPA by handle (Field 1) and filter away all handles which have a negative allocation size (Field 2). This removes all handles for which CloseHandle was called but no corresponding e.g. CreateFile call was traced. From that on we can continue to search for outstanding file handles by looking at the totals as usual.

Now you know why the handle is always the first argument to our ETW trace calls. If we position the handle at the same method parameter location in our ETW manifest we can group the acquire and release calls by handle which is exactly what we want in WPA got generate the graph and table above.

Hooking and ETW Performance

I was telling you that this approach is fast. So how fast can it get? The y-axis are not ms but us where the average function execution time of a hooked basically empty int Func(int *a, int *b, int *c, int *d, int *e) was measured. It is interesting that the raw function execution speed from a few nanoseconds jumps to ca. 1us with ETW tracing and goes up to 4us (measured on my Haswell i7-4770K CPU @ 3.50GHz under Windows 8.1) if ETW stackwalking is enabled.  That means we can get full ETW events and stack walking at a rate of 250.000 calls/s by using that approach. That is more than most tracing frameworks offer and these do not capture the full call stack. If that is still too slow you can use EasyHook as a pure unmanaged Interceptor which could give you perhaps a factor 2-4 but then you need to take care of a lot more things.

image

What is odd is that x86 ETW tracing seems to be slower for x86 and under x64 but I did not look into that more deeply. In general x86 code should still be faster if we do the same thing because the CPU caches do not grow but the pointer and instruction size of code and data increases under x64 which simply needs more cycles as Rico Mariani explains in A little 64 bit follow-up. As long as you do not need the big address space you should not simply buy the argument that x64 is the future because it is more modern. If you want to do your users a favor you should favor the small is beautiful approach. With small I mean small code, small and efficient data structures and small code complexity. 

WPA Performance Issue (Windows 10 SDK 10.0.10586.15)

When parsing custom ETW events WPA parses the strings in a very inefficient way which can lock up your machine easily for minutes. On my 8 core machine the custom view takes minutes to build up and uses 4 cores to parse the strings from my custom ETW events. The problem is that WPA also uses code from a static helper class which is grossly inefficient. Have a look at this profiling data:

image

WPA uses 4/8 cores where 3 cores are spending CPU cycles spinning to get a lock while only one core is doing actual work (12,5% CPU is one core on a 8 core machine). This effectively single threaded operation is also very inefficient implemented. The problem comes from TDHHelper.GetStringForPropertyAtIndex which parses a string from an ETW event. This is done in a static class under a global lock:

image

When I drill down into one thread I see that EnsureCapacity and EnsureRoomForTokensAndClear are responsible for 43% of the execution time which would go away if the List<IntPtr> instances would be reused. It could also help that the parsed strings are cached which would reduce the time quite a bit if I play around with groupings in WPA. That is a major bottleneck and should be solved as my top performance issue in the current WPA implementation. I have not noticed such long delays with WPA 8.1. A quick check reveals that the code is nearly the same in previous versions. So it was always slow but with more and more custom ETW events this display is becoming much more important than before. If you look at custom CLR events (e.g. Exceptions grouped by type) you will notice the very slow reaction of WPA due to this issue.

Outlook

If you have read this far congrats. You had to digest quite many technical details from many different areas. If you are searching for a way to get rid of Windbg for resource leak analysis and use a simpler tool ETWStackwalk could be it. It is easy to extend for any other API you are interested in and it works for x86 and x64 (Win 8 and later). If you want to intercept another API you only need to extend the method injector which is declared in

easyhook\Test\ETWStackWalk\Injector.cs

In the Injector you see some examples for CreateFile and CreateWindow which contains mostly boilerplate code. The hook generation code could certainly be simplified. A nice addition would it be to compile the interception code on the fly and pass the actual Interceptor class as configuration file which would make hooking new methods truly dynamic without the need to recompile ETWStackwalk every time a new API needs to be hooked. I had used a variation of this approach for my GDI bug in Windows which actually lead to a Windows Hotfix. But this was always bound to x86 code only. Since I have found some free time to greatly extend the usefulness of the hooking approach by making EasyHook playing nice with x64 stack walking. So far every resource leak (GDI handles, File, Semaphores, Desktop Heap, … ) needed a different approach or tool. It is time to come up with an approach which is powerful and generic enough to tackle all of such resource leaks with one tool and approach. I expect that at least the idea should become state of the art how resource problems are tackled in the next years. If you have used this approach already or you had success with the idea please share it here so we all can benefit by learning from each other. It was fun to get my hands dirty with x64 assembly and to find out how the basics of exception handling really work together. Going deep is my first nature. That was the main reason why I have studied particle physics to really understand how the universe works. The discovery of the Higgs boson was really a great thing and I can at least say that I have tested some small parts of the huge ATLAS detector (which was part of the discovery of the Higgs boson) during my time at the university 15 years ago. It is kind of cool to understand how my computer works from software to hardware down to subatomic level.

If you are wondering what on earth is that guy doing when he is not in front of a computer? Apparently this:

image

With my two kids, not my boat and my wife we were enjoying two fantastic weeks at the Caribbean sea on St Thomas this January. The kids had great fun in the warm water and this was my first vacation where I did never freeze. Even the nights during winter time were above 25 Celsius. The water is so full of live I have not seen anywhere else. The Mediterranean sea looks like a desert compared to this. If you go to Coral World at St. Thomas you see many different fishes and corrals in aquariums. But if you go snorkeling you will see all animals from the zoo again which is really amazing. During that time home in Germany the people had quite a lot of snow but we were sweating the whole time which really felt odd. That's all for today. Happy bug hunting!

posted on Saturday, January 30, 2016 1:00 PM