Geeks With Blogs
malloc(); taking up more space on the interweb

My team has been working on and off all week to find out why one of our Windows services was crashing at the same time every day with a very unhelpful event log message.

Like all good errors, this one turns up plenty of search results: people get the same message for a bunch of different reasons, and many of the threads are left unanswered.

This service was first developed in .NET 1.1, way before I joined the company (only 5 months ago). It had been running without crashes on a Windows 2000 machine for years. Now that we are finally throwing out our Win2k systems, we have migrated all our 1.1 code to 2.0 and are running it on Win2k3.

So what does this service do? It is the final step in a data stream processing system. Here is a quick graphic to display the flow:

As you can see, this service does the easy part of the process. It reads from the MSMQ queue and, for each message, hands a work item to the thread pool. The worker thread takes the SQL string value from the message, connects to the DB, executes the SQL, and is then finished. The main thread continues to wait for new MSMQ messages and hands each one off to the pool as it comes in.
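
The dispatch pattern described above can be sketched roughly as follows. This is not the service's actual code: an in-memory Queue(Of String) stands in for MSMQ so the sketch runs anywhere, the names SqlWorker and Dispatcher are made up for illustration, and the worker only records the SQL text instead of executing it. It also assumes .NET 4 or later for CountdownEvent.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Threading

' Hypothetical stand-in for the per-message worker; not the author's class.
Public Class SqlWorker
    Private ReadOnly _sqlText As String
    Private ReadOnly _executed As List(Of String)
    Private ReadOnly _done As CountdownEvent

    Public Sub New(ByVal sqlText As String, ByVal executed As List(Of String), ByVal done As CountdownEvent)
        _sqlText = sqlText
        _executed = executed
        _done = done
    End Sub

    ' Signature matches System.Threading.WaitCallback.
    Public Sub Run(ByVal stateInfo As Object)
        Try
            ' The real service would open a connection and execute the SQL here.
            SyncLock _executed
                _executed.Add(_sqlText)
            End SyncLock
        Finally
            ' Always signal, so the dispatcher never waits forever.
            _done.Signal()
        End Try
    End Sub
End Class

Public Module Dispatcher
    ' Queues one thread-pool work item per pending message, waits for all of
    ' them to finish, and returns the SQL strings that were "executed".
    Public Function DispatchAll(ByVal pending As Queue(Of String)) As List(Of String)
        Dim executed As New List(Of String)()
        Using done As New CountdownEvent(pending.Count)
            While pending.Count > 0
                Dim worker As New SqlWorker(pending.Dequeue(), executed, done)
                ThreadPool.QueueUserWorkItem(AddressOf worker.Run)
            End While
            done.Wait()
        End Using
        Return executed
    End Function
End Module
```

The real service never drains and returns like this; its main thread blocks on the queue indefinitely. The point of the sketch is only the shape: one short-lived work item per message, each with its own state object.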

At first look there was nothing obviously wrong with the code. No "important" code was missing a Try or a Catch, and all of the Try/Catch blocks have logging code to tell us (via the event log) if anything goes wrong. However, after a crash the event log would contain only the '.NET Runtime 2.0 Error Reporting' message and none of our own messages.

After looking closer at all of the code, we found the problem. Here is a snippet of what ended up being the problem code. It is code that runs on the worker threads from the thread pool.

    Public Class DumpInThread
        Public Query As System.String
        Public Message As System.Messaging.Message
        Private objConnection As System.Data.SqlClient.SqlConnection
        '======================================================================================
        Public Sub Dump(ByVal StateInfo As System.Object)
            Do Until GetNewConnection() = True
            Loop
            InsertMessage(Query)
        End Sub
        '======================================================================================
        Public Function GetNewConnection() As System.Boolean
            Try
                objConnection = New System.Data.SqlClient.SqlConnection("SqlServer")
                objConnection.Open()
                Return (True)
            Catch Exception As System.Data.SqlClient.SqlException
                SharedEventLog.WriteEntry("Could not obtain connection " & Connection & ":" & Exception.Message, System.Diagnostics.EventLogEntryType.Error)
                objMessageQueue.Send(Message)
                Return (False)
            End Try
        End Function

The first thing we realized is that SqlConnection.Open() can throw either a SqlException or an InvalidOperationException. To me, it makes more sense for a SqlConnection to throw a SqlException if it cannot connect to a DB, not a generic exception like InvalidOperationException. We changed the code so that it also catches the InvalidOperationException. This stops the worker threads, and therefore the process, from crashing.
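
A minimal sketch of that kind of change, not our exact code: the helper name TryOpen and the lastError parameter are made up for illustration, and the logging and re-queueing from the original snippet are left out. It assumes the project references System.Data.SqlClient.

```vbnet
Imports System
Imports System.Data.SqlClient

Public Module ConnectionHelper
    ' Returns True if the connection opened. Returns False if Open() failed
    ' with either of the two exception types discussed above; lastError then
    ' holds a short description of what went wrong.
    Public Function TryOpen(ByVal connection As SqlConnection, ByRef lastError As String) As Boolean
        Try
            connection.Open()
            lastError = Nothing
            Return True
        Catch ex As SqlException
            ' Failure reported while connecting to SQL Server.
            lastError = "SqlException: " & ex.Message
            Return False
        Catch ex As InvalidOperationException
            ' For example: no connection string set, or the connection is already open.
            lastError = "InvalidOperationException: " & ex.Message
            Return False
        End Try
    End Function
End Module
```

Calling TryOpen on a SqlConnection that has no connection string set is an easy way to see the second Catch fire without needing a server at all.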

The second thing to realize is that this InvalidOperationException is not new to .NET 2.0. The .NET 1.1 API docs say that SqlConnection.Open() throws an InvalidOperationException, so this code has been incorrect since the day it was written. What has changed between 1.1 and 2.0 is the handling of threads. In 1.1 an uncaught exception on a thread pool thread was quietly swallowed; in 2.0 it crashes the entire process.
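
Given the 2.0 behavior, one defensive option is a catch-all at the very top of every work item, so that nothing can escape a pool thread. The sketch below is only an illustration of that idea: the class name GuardedWorkItem and the onError callback are hypothetical, and in the service the callback would write to the event log.

```vbnet
Imports System
Imports System.Threading

' Hypothetical helper: wraps a work item so that no exception escapes
' the thread pool thread it runs on.
Public Class GuardedWorkItem
    Private ReadOnly _work As WaitCallback
    Private ReadOnly _onError As Action(Of Exception)

    Public Sub New(ByVal work As WaitCallback, ByVal onError As Action(Of Exception))
        _work = work
        _onError = onError
    End Sub

    ' Signature matches WaitCallback, so this can be passed to QueueUserWorkItem.
    Public Sub Run(ByVal state As Object)
        Try
            _work.Invoke(state)
        Catch ex As Exception
            ' Last line of defense: report the failure instead of letting
            ' the runtime tear down the process.
            _onError.Invoke(ex)
        End Try
    End Sub
End Class
```

With this in place the main loop would queue AddressOf guarded.Run rather than the worker's own method. The specific catches inside the worker still matter, because they are where the retry and re-queue logic lives; the wrapper only makes sure an exception nobody anticipated gets logged rather than killing the service.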

"But didn't you say the process crashed at the same time every day? How does that factor into the fix?" Yes, this process would crash daily at 2:14pm, and after all of this I still have no idea why :-). We could not find any jobs using the same DB at that time that might have caused a connection to time out. There is no large burst of messages at that time; it is actually one of the quieter times of the day. But I know that the new catch block fixes the crash and makes the code much more reliable than it has been since it was born.



Posted on Thursday, April 5, 2007 9:41 PM .net

Comments on this post: Why I don't mind catching the type Exception
