Monday, September 15, 2008

Visual Studio quirk - it works on my box

I encountered an interesting Visual Studio thing today. Someone sent me a bug with a repro. I ran it on my machine and started stepping through the internals of the framework to see what the issue was. It worked fine. Hmm, that was interesting; I wondered why. So I ran it again, and this time I didn't step through it, and it failed. OK, that's strange: when I just execute the code, it fails, but if I step into it and don't do ANYTHING except step through the code, it succeeds. WTF?

Well, I had forgotten about something: Autos and Locals. When debugging in Visual Studio, the debugger creates watches on local variables, as well as on a few things it watches automatically. To get the values of these, it has to evaluate them, and therein lies the problem: if evaluating any of those variables causes side effects that don't occur during the normal run of the application, it can cause unexpected behavior. Here's an example:

Let's say that I have some state property on my object that is initialized to null. I have a method that depends on this state property being set. That state property is set when you access another property somewhere. Assuming that property is not accessed in the code path that I'm executing, the state property will not be set. HOWEVER, if I trace through the code and evaluate the property that sets the state property, it will end up setting my state, thus changing the way my code executes. Let's look at a concrete example:

class testthing
{
    // State field that the getter below sets as a side effect.
    private string s = null;

    public string PropString
    {
        get
        {
            if (s == null)
                s = "new"; // side effect: merely evaluating the getter changes state
            return s;
        }
        set { s = value; }
    }

    public bool forceit = false;

    public bool DoSomething()
    {
        if (forceit)
            Console.WriteLine(PropString);

        return s == null;
    }
}

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Starting run");
        testthing t = new testthing();
        bool x = t.DoSomething();

        Console.WriteLine("result was: " + (x ? "true" : "false"));
        Console.ReadKey();
    }
}
So the class testthing has a property PropString that doesn't set the private field s until the getter is called. Therefore, if you never call PropString's getter, s never gets set, and DoSomething() will return true because s defaults to null. Run the example code and observe this; it's pretty straightforward.

Now, run it a second time, except this time put a breakpoint on the first line of DoSomething(). When it breaks, hover over Console.WriteLine(PropString) so that the debugger is forced to evaluate PropString. Now execute the rest of the code (F5) and observe that the output is false, because the debugger has executed the getter of PropString, which had a side effect.

So, the next time you debug an application in Visual Studio and it works in the debugger but not when run normally, look at the variables in scope within the method throwing the exception and see if any of them could be changed by evaluation. If so, you may have found the problem.
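If you control the class, you can also keep the debugger from evaluating a side-effecting member in its variable windows using the DebuggerBrowsable attribute from System.Diagnostics (the blunt alternative is unchecking "Enable property evaluation and other implicit function calls" under Tools > Options > Debugging). A minimal sketch, with made-up class and member names:

```csharp
using System;
using System.Diagnostics;

class StatefulThing
{
    private string s = null;

    // Hidden from the Autos/Locals/Watch windows, so the debugger
    // won't silently run this getter and trigger its side effect
    // while expanding the object.
    [DebuggerBrowsable(DebuggerBrowsableState.Never)]
    public string PropString
    {
        get
        {
            if (s == null)
                s = "new"; // the side effect we want the debugger to leave alone
            return s;
        }
    }

    public bool StateIsUnset()
    {
        return s == null;
    }
}

class Demo
{
    static void Main()
    {
        var t = new StatefulThing();
        Console.WriteLine(t.StateIsUnset()); // True: the getter never ran
    }
}
```

Note this only stops evaluation in the debugger's variable windows; hovering over the property in source will still invoke it.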

One final word: if you have a unit test for something like this, the test will fail, since nothing evaluates the property when the test runs. It's far easier to write a failing unit test around the buggy method and then figure out why it fails than to step through the method and hope you can spot where it goes wrong.
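To make that concrete, a pair of tests around the testthing class above might look like this (sketched with NUnit, though any framework works; the second test simulates by hand what the debugger's evaluation does):

```csharp
using NUnit.Framework;

[TestFixture]
public class TestThingTests
{
    [Test]
    public void DoSomething_WithoutTouchingProperty_SeesUnsetState()
    {
        var t = new testthing();

        // No debugger evaluates PropString here, so s stays null and
        // this assertion reflects the real runtime behavior.
        Assert.IsTrue(t.DoSomething());
    }

    [Test]
    public void DoSomething_AfterGetterRuns_SeesStateSet()
    {
        var t = new testthing();
        var unused = t.PropString; // stand-in for the debugger's evaluation

        Assert.IsFalse(t.DoSomething());
    }
}
```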

Thursday, September 11, 2008

Let's just blame Microsoft!

This is a good one. Some guy named Steven J. Vaughan-Nichols is blaming the September 10th London Stock Exchange crash on .Net. Wow, informative! It crashed, it runs on .Net, so that must be the reason. .Net isn't suited for real-time systems! Right? Not so fast, dude.

Full disclosure

Before I start, let me just say that I do work for Microsoft and I work on the .Net framework. Does this make me biased? Probably, but I'm going to attempt to focus on other things besides "Microsoft good, .Net good" here and draw a logical conclusion.

What happens

So, what's the scenario? Well, apparently (according to Steven) the LSE runs some software called TradElect, which is a C# application. It also runs on Windows Server 2003 with Sql Server 2000. Clearly, the weak point here is .Net; nothing else it could possibly be. Right?

You are full of fail


So Steven probably wrote all those "conclusions" down on a mat, placed it on the floor, and "jumped" to them. He clearly has one: something broke, so it's Microsoft's fault, because .Net just sucks for real-time applications, and so do Sql Server 2000 and Windows Server 2003. There's nothing else that could have gone wrong, right?


There's no way it could be human error. No way at all

What he doesn't say is that this could well be programmer error. There are thousands of ways a programmer could mess this up and just write crappy code. For network connections, the Asynchronous Programming Model is not trivial and requires some reasonably deep understanding before you can really make it work for you. I see a lot of people mess this up, and unfortunately it's their own fault and their own problem most of the time: the performance you get from asynchronous programming comes at the price of complexity and multiple threads, which is something a lot of people just don't understand.
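For a taste of why the APM trips people up, here's a sketch (all names are mine, nothing to do with TradElect) of the receive-side discipline it demands with Socket.BeginReceive/EndReceive. Forget the partial-read loop and your message framing silently breaks, but typically only under load:

```csharp
using System;
using System.Net.Sockets;

// A classic APM trap: EndReceive can return fewer bytes than you asked
// for (or 0 when the peer closes), so the callback must keep issuing
// receives until the whole message has arrived.
class AsyncReader
{
    private readonly Socket socket;
    private readonly byte[] buffer;
    private readonly int expected;
    private int received;

    public AsyncReader(Socket socket, int expectedBytes)
    {
        this.socket = socket;
        this.expected = expectedBytes;
        this.buffer = new byte[expectedBytes];
    }

    public int Received { get { return received; } }

    public void Start()
    {
        // Ask only for the bytes still missing, at the right offset.
        socket.BeginReceive(buffer, received, expected - received,
                            SocketFlags.None, OnReceive, null);
    }

    private void OnReceive(IAsyncResult ar)
    {
        int n = socket.EndReceive(ar);
        if (n == 0)
            return; // peer closed the connection

        received += n;
        if (received < expected)
            Start(); // partial read: never assume one callback == one message
        else
            Console.WriteLine("message complete: {0} bytes", received);
    }
}
```

The callback also runs on a thread-pool thread, so anything it touches needs the same synchronization care as any other multithreaded code.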

Additionally, we don't know how they're doing their DB access. Maybe they have some sort of transaction hell that's locking the shit out of their DB. Maybe they don't use stored procs (a BIG performance issue in Sql2k; fixed in Sql2k5, so not a big deal there). Maybe they don't know how to create an index. My point is that we don't know, so we can't say for sure. This, however, is probably where the issue lies.

Finally, the .Net framework itself has some interesting quirks if you don't understand the CLR well. I don't usually recommend books on specific software technologies, but go out and get a copy of CLR via C# by Jeffrey Richter; I learned more about the CLR from that book in a month than I did in two years of using .Net every day. Granted, garbage collection takes away a lot of the complexity of memory management, which can be a big performance issue, but as a developer you STILL need to understand what the CLR is doing. Boxing and unboxing take time, misusing value types and reference types eats performance, and even how you allocate objects matters. For example, if you're using buffers for network traffic and you allocate a new buffer each time, you may trigger garbage collections that randomly hurt performance and are difficult to track down. If instead you allocate a massive pool of buffers up front and just reuse those, they will live on the large object heap and NEVER trigger garbage collection, so your app will behave more consistently.
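The pooling idea can be sketched in a few lines (everything here is illustrative, with made-up names). Arrays of 85,000 bytes or more are allocated on the large object heap, and reusing them means steady-state traffic creates no new garbage for the collector to chase:

```csharp
using System;
using System.Collections.Generic;

// Pre-allocate the big buffers once, then rent and return them instead
// of allocating a fresh byte[] per network operation.
class BufferPool
{
    private readonly Queue<byte[]> free = new Queue<byte[]>();
    private readonly object sync = new object();
    private readonly int bufferSize;

    public BufferPool(int count, int bufferSize)
    {
        this.bufferSize = bufferSize;
        for (int i = 0; i < count; i++)
            free.Enqueue(new byte[bufferSize]); // 85,000+ bytes => large object heap
    }

    public byte[] Rent()
    {
        lock (sync)
        {
            // Growing with a fresh buffer on exhaustion is a policy
            // choice; a stricter pool might block or throw instead.
            return free.Count > 0 ? free.Dequeue() : new byte[bufferSize];
        }
    }

    public void Return(byte[] buffer)
    {
        lock (sync)
        {
            free.Enqueue(buffer);
        }
    }
}
```

A real pool would also need to decide what to do on exhaustion (block, grow, or throw) and guard against double returns, but the payoff is the same: predictable allocation behavior under load.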

Blame Canada . . .um. . . er. . . .Net?

So do we blame .Net? With this little information, we really can't. It's far more likely that Sql 2000 is to blame (if anything), although I've seen shit databases built on open source just as often as on MS Sql, so it's entirely possible the database was simply designed stupidly. It's equally likely that the people who wrote this just screwed up, either in writing the code or in testing it. Again, these things would happen just the same if those programmers had used open source software.

Wow, what a useful solution!!!

What does Steven suggest? Use Linux. Wow, that will fix everything! I'll just go install it right now, with KDE and everything!!! Wait, no.

Next, he suggests Oracle. I've used Oracle, and in some ways I love it way more than MS Sql Server, but in other ways I hate it a lot. Oracle is better than Sql2k, but I have yet to see proof that it's better than Sql2k5, so I won't pass judgment on that yet. Maybe Oracle would be a better DB choice. Not that Oracle is open source or anything. It also works with .Net; I've used it that way.

Next, he recommends Java. Java, with the worst threading model in the history of the world (more on that later), is his recommended fix! I have yet to see a case where a Java application works significantly better than a .Net application doing the same thing. A lot of the tools are similar. The languages are similar.

In conclusion, Steven jumps to the conclusion that open source software (plus Oracle) is better for performance. He has no evidence beyond "it was running .Net and it crashed" to base this on. He is therefore wrong. I have an idea: you take a mat, write various "conclusions" on it, and put it on the floor, so you can "jump" to them. I'll send him one!

And I KNOW it wasn't a .Net networking issue because

I am on the NCL team at Microsoft. We own the System.Net namespace, which handles networking in the .Net framework. It was my turn to handle incoming issues that week. If it had been a .Net networking issue, I would have heard about it. I heard nothing.