Monday, January 19, 2009

Why dispose is necessary and other insights on managed code

Jeff Atwood has once again jumped the shark, this time about managed code. Then there's this little gem from Twitter:

I believe .Dispose() is a form of optimization, which is necessary *sometimes*
but not *always*. Anyone got links with evidence otherwise?

So for my next trick, I am going to show Jeff (and everyone else) how managed code works and when it fails.

First, I always have a saying when introducing managed code:
Managed code isn't.

I say this because most people think that managed code means never having to think about memory management: you allocate whatever you feel like and the garbage collector takes care of it for you. According to Jeff, this is a good thing. It would indeed be a good thing, but unfortunately it doesn't work that way. If you don't know what the CLR is doing to your code and how garbage collection works, you're going to have problems, and when they occur you will have no idea where to look for solutions. The goal of this post is to explain what is going on. I can't go into much depth in one post, but this should get you started and give you enough to explore further when you run into these problems in your own code. Let's start by looking at Dispose().



For those of you who are unsure, Dispose() is an implementation of the disposal pattern. The idea is that a class can declare that it needs explicit cleanup before its objects are deallocated. .NET does this through the IDisposable interface and its Dispose() method, which you should call when you are finished with an object so that it can free its resources. Dispose() should run the same cleanup code as Finalize() (more on finalization below), so that if a type has a finalizer and someone forgets to call Dispose(), the cleanup still runs when the garbage collector eventually gets around to the object. That makes it the right place to release unmanaged resources, such as the handles behind database connections or open streams (e.g. files and network connections). Note the catch, though: the finalizer is only a backstop with no guaranteed timing, and a type without a finalizer gets no automatic cleanup at all when the programmer simply stops using an object and lets it fall out of scope.

Using(disposal pattern)

This brings us to our second useful piece of code, the using() {} block. You put something that implements IDisposable (henceforth "disposable") inside the using statement, and when control leaves the block defined by that statement (i.e. goes outside of the {} by any normal means), Dispose() WILL be run before the jump completes. This means that inside my using block I can return, break, throw, or otherwise jump to another block of code, and before that code executes, Dispose() is called on the object in the using() statement. (Terminating the process outright with Environment.Exit is the exception: finally blocks, and therefore Dispose() calls, are not run.) These can be nested, in which case objects are disposed in FILO order (first in, last out). They also cooperate with finally blocks: if I have a try..finally with a using block nested inside it, Dispose() is called before the finally runs, even when code inside the using block throws an exception. You can test this quite easily by writing a simple app with nested using blocks, throwing from inside one of them, and catching outside of it.
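Here is a minimal sketch of that test app. The Noisy class and the UsingOrderDemo.Run() method are names made up for this illustration; all they do is record the order in which things happen.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: records when it is disposed so the order is visible.
sealed class Noisy : IDisposable
{
    private readonly string _name;
    private readonly List<string> _log;

    public Noisy(string name, List<string> log)
    {
        _name = name;
        _log = log;
    }

    public void Dispose()
    {
        _log.Add("dispose " + _name);
    }
}

static class UsingOrderDemo
{
    // Throws from inside two nested using blocks and returns the event log.
    public static List<string> Run()
    {
        var log = new List<string>();
        try
        {
            try
            {
                using (var outer = new Noisy("outer", log))
                using (var inner = new Noisy("inner", log))
                {
                    throw new InvalidOperationException("boom");
                }
            }
            finally
            {
                log.Add("finally");
            }
        }
        catch (InvalidOperationException)
        {
            log.Add("catch");
        }
        return log;
    }
}
```

Calling UsingOrderDemo.Run() should give you the inner object's disposal first, then the outer object's, then the finally block, and only then the catch handler.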



So why does using() exist?


When I interviewed for Microsoft, one of the questions I got asked was "Is the using() statement syntactical sugar?" My answer was that yes, it is, but it is important. Here's why: I define syntactical sugar as any language keyword that performs a function you could still perform without it, but where the implementation would be uglier, more difficult, or less likely for a programmer to get right. For example, I could still close database connections with a try . . . finally block and explicitly call Dispose(). In fact, if you look at the MSIL code generated for a using() block you'll see that it generates a try . . . finally block. I could even not have IDisposable at all and just have a cleanup method somewhere else. The nifty thing about using() is that I can just wrap it around an object and be guaranteed that the object will be disposed of properly and at a known point: the end of the block, no matter how I leave it. A hand-written try . . . finally gives exactly the same guarantee, but only if I write it correctly every single time.
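To make the "sugar" concrete, here is a sketch of the same method written both ways. The second version is roughly what the compiler emits for the first, not a byte-for-byte copy of the generated code. UsingExpansion and its method names are made up for this example, and MemoryStream simply stands in for any disposable.

```csharp
using System;
using System.IO;

static class UsingExpansion
{
    // Version 1: let the compiler generate the cleanup.
    public static int CountWithUsing(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            return (int)stream.Length;
        }
    }

    // Version 2: roughly what the compiler turns version 1 into.
    public static int CountWithTryFinally(byte[] data)
    {
        MemoryStream stream = new MemoryStream(data);
        try
        {
            return (int)stream.Length;
        }
        finally
        {
            // Runs on return and on exceptions alike.
            if (stream != null)
            {
                ((IDisposable)stream).Dispose();
            }
        }
    }
}
```

Both versions behave the same; the first one is just much harder to get wrong.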


In addition, what if I forget to call Dispose() in the finally block? What if it is far away from where I use the disposable object so I don't see it right away? What if I need to ensure the order in which Dispose() is called for multiple objects and I mess that up? What if exception handling is significant to how I handle disposal? There are a lot of what-ifs here, and the using() statement basically gives me a convenient shortcut while making my code a bit cleaner.


So how does this tie into garbage collection now and why call dispose() at all?
The garbage collector mostly runs whenever it feels like it and what it does is entirely up to the garbage collector. Let's look at Jeff's SqlConnection cleanup example and see what's going on here:

sqlConnection.Close();
sqlConnection.Dispose();
sqlConnection = null;

So first, we call Close(). This has nothing to do with garbage collection; presumably it just closes the connection itself. Then we call Dispose(). Presumably, Dispose() would clean up an open connection, probably by calling Close(), but it's entirely possible that it does other things as well. According to MSDN, Close() and Dispose() are redundant, so it looks like Jeff is violating DRY here, although I think that having Close() and Dispose() do the same thing is ugly and misleading code. Finally, he's setting the object = null. Remember kids, this isn't actually setting the value of the object; he's just setting his current reference to it to null, which tells the garbage collector that he's not using it here anymore. If no other references to the object are still reachable in the rest of the application, the garbage collector is free to clean it up (in unmanaged code, dropping your last reference like this would mean you have a memory leak).
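A small sketch of what "= null" does and doesn't do. NullDemo is a made-up name, and MemoryStream stands in for any disposable object.

```csharp
using System;
using System.IO;

static class NullDemo
{
    public static bool Run()
    {
        var first = new MemoryStream();
        var second = first;   // a second reference to the same object

        first = null;         // clears only this variable, not the object

        // The object is still reachable through 'second' and was never
        // disposed, so it is still perfectly usable.
        second.WriteByte(42);
        bool stillUsable = second.Length == 1;

        second.Dispose();     // this is what actually releases the stream
        return stillUsable;
    }
}
```

Nulling the first variable changed nothing about the stream itself; only the explicit Dispose() call did.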

Let's look at a non-real, contrived example; if I grab a connection to a db and I'm using winsock to connect, I'll get a handle to the socket that I'm using from winsock, which will be wrapped in some sort of SafeHandle object to ensure that it gets cleaned up. If Close() only closes the socket, it may not clean up the handles. If Dispose() calls Close() and then cleans up the handles, then it's entirely possible that the Close() and Dispose() calls are redundant, but that they do different things. Maybe if you call Close() you can re-use the connection object. That might be useful, particularly if that object takes a long time to create or is resource-heavy even when not connected (like if it allocates buffers for example). Ultimately, what you do depends on how you implement your object, so be sure that you have a good design and don't encourage bad code with redundant calls.
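To sketch that contrived design in code: ReusableConnection below is entirely hypothetical (it is not how SqlConnection works, and there is no real socket in it), but it shows Close() and Dispose() overlapping while still doing different jobs.

```csharp
using System;

// Hypothetical connection type, invented for illustration only.
sealed class ReusableConnection : IDisposable
{
    // Stands in for expensive state that survives Close().
    private byte[] _buffer = new byte[4096];
    private bool _disposed;

    public bool IsOpen { get; private set; }

    public void Open()
    {
        if (_disposed)
            throw new ObjectDisposedException(nameof(ReusableConnection));
        IsOpen = true;
    }

    // Close() ends the session but keeps the object (and its buffer) usable.
    public void Close()
    {
        IsOpen = false;
    }

    // Dispose() ends the session AND releases everything; the object is done.
    public void Dispose()
    {
        Close();
        _buffer = null;
        _disposed = true;
    }
}
```

With this design you can Open(), Close(), and Open() again as often as you like, but once Dispose() has run, Open() throws.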

One other problem can exist: what if an object is being disposed while some other reference to it, one you don't know about, is still in use? If nobody ever calls Dispose() and cleanup is left to the finalizer, this can't happen, because the finalizer only runs once no references remain. If an eager developer calls Dispose(), though, you definitely have a problem. I think that if you have this problem, your code is probably not well designed, but there's a cheap way to manage it if you think it's necessary: create a private "bool isDisposed" field and set it immediately at the start of the Dispose() method (and make this thread-safe through correct use of some sort of locking strategy). Then every other method can check the field and react accordingly, typically by throwing ObjectDisposedException, although again I would label this as code smell.
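Here is one way that guard might look. Note the swap: instead of a lock, this sketch uses Interlocked.Exchange on an int flag, which is enough to make sure cleanup runs only once but does NOT stop Dispose() from racing with a DoWork() call that is already in progress. GuardedResource, DoWork, and CleanupCount are names made up for the example.

```csharp
using System;
using System.Threading;

sealed class GuardedResource : IDisposable
{
    private int _disposed;   // 0 = live, 1 = disposed

    // Exposed only so the example can show that cleanup ran exactly once.
    public int CleanupCount { get; private set; }

    public void DoWork()
    {
        if (Volatile.Read(ref _disposed) != 0)
            throw new ObjectDisposedException(nameof(GuardedResource));

        // ... the real work would go here ...
    }

    public void Dispose()
    {
        // Flip the flag first; only the first caller gets past this line.
        if (Interlocked.Exchange(ref _disposed, 1) != 0)
            return;

        CleanupCount++;      // stands in for releasing real resources
    }
}
```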

And then there's the object = null
Jeff mentions the confusing choice of calling Dispose() and setting null, but in reality, if you know what these actually do, there's no confusion at all. If you call Dispose(), as I have mentioned at least six times now, it will run the cleanup of the object when you call it. If you set the object to null, then all you're doing is removing a reference to it, and the garbage collector will then see that there are no more references to this object so it's ok to clean it up. If it wants to. When it does do cleanup, it will not call Dispose() for you; at best it runs the object's finalizer, if there is one, which should do the same cleanup. Either way, it is completely non-deterministic as to when this occurs, so leaving things like database connections, files, or network connections to the garbage collector is a bad idea: it is easy to end up with a bunch of open connections this way, which can cause all kinds of bad problems.

So now, how is disposal "more of an optimization than anything else?"

It's not. Jeff is wrong because Jeff is right (different Jeff though. . .) and Jeff (me) says that calling Dispose() is absolutely necessary so that you can have deterministic behavior in your application in terms of resource deallocation. There are lots of resources that you're using in the .NET framework, even if you are not aware of the fact (things like threads, files, even graphics tend to open handles into unmanaged code). If you are not careful about when you dispose things, it's easy to start leaking these, particularly in high-performance applications. It really isn't an "optimization" to call Dispose() when you want to explicitly free up resources, particularly unmanaged ones.

Also, if you have some significant amount of work that has to be done to free up those resources, then you really should do that work at a time of your choosing. If you don't, then potentially any time you create a new reference type, the garbage collector may run and may collect something that you aren't using anymore, at which point all of that work that needs to be done to free your resources will just happen, which may affect the performance of your app in various ways, none of which will make it go faster. This isn't optimization though, it's simply good coding style.

So what should I do in my code?

First, if you allocate any resources in an object, particularly unmanaged ones, you should clean them up in your Dispose() method (while ensuring that your object implements IDisposable). I find it unusual that your objects will need to allocate external resources, but it's likely that you will consume some without realizing it, which means that you should either dispose of them explicitly inside the object when you're through with them, or that you should call their Dispose() methods from inside your object's own Dispose() method.
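As a sketch of the second option (calling your members' Dispose() methods from your own), here is a made-up ReportWriter class that owns two disposables. The CanWrite property exists only so you can see the effect from outside.

```csharp
using System;
using System.IO;

// Hypothetical class that owns other disposable objects.
sealed class ReportWriter : IDisposable
{
    private readonly MemoryStream _stream;
    private readonly StreamWriter _writer;

    public ReportWriter()
    {
        _stream = new MemoryStream();
        _writer = new StreamWriter(_stream);
    }

    // True while the underlying stream is still open.
    public bool CanWrite
    {
        get { return _stream.CanWrite; }
    }

    public void WriteLine(string text)
    {
        _writer.WriteLine(text);
    }

    public void Dispose()
    {
        // Dispose what we own. The writer flushes and closes its stream;
        // disposing the stream again afterwards is harmless.
        _writer.Dispose();
        _stream.Dispose();
    }
}
```

Wrap a ReportWriter in a using block and both members get cleaned up along with it.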

Next, if you have another method like "Close()" or "Shutdown()" or "DieYouGravySuckingPigDog()" then you should ensure that these methods don't do any disposal for you. These types of methods aren't meant to clean up an object so that it can't be used anymore (that's what Dispose() is for). If you have an object, then at any point in the object's lifecycle it should be usable in some way prior to calling Dispose(), and not usable after calling Dispose(). Also make sure that a disposed object won't try to do anything bad if you accidentally call it after you've disposed it, generally by writing good code where you don't attempt to do this, but some checking in your object won't hurt. If you want to reuse objects a lot, consider using an object pool (a buffer pool for handling Socket reads would be a good idea as an example).
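A buffer pool can be very small. This is a bare-bones sketch, assuming fixed-size byte buffers and no upper limit on how many are kept; BufferPool, Rent, and Return are names made up for this example, and a production pool would need more policy than this.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical fixed-size buffer pool, e.g. for socket reads.
sealed class BufferPool
{
    private readonly ConcurrentBag<byte[]> _buffers = new ConcurrentBag<byte[]>();
    private readonly int _bufferSize;

    public BufferPool(int bufferSize)
    {
        _bufferSize = bufferSize;
    }

    // Hands out a pooled buffer if one is available, else allocates one.
    public byte[] Rent()
    {
        byte[] buffer;
        return _buffers.TryTake(out buffer) ? buffer : new byte[_bufferSize];
    }

    // Puts a buffer back so the next Rent() avoids a fresh allocation.
    public void Return(byte[] buffer)
    {
        if (buffer == null || buffer.Length != _bufferSize)
            throw new ArgumentException("Buffer does not belong to this pool.", "buffer");

        Array.Clear(buffer, 0, buffer.Length);   // don't leak old data
        _buffers.Add(buffer);
    }
}
```

Rent a buffer, use it, and Return it when done; the next Rent() reuses it instead of giving the garbage collector more work.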

Finally, remember that just because you set an object to null doesn't mean that it's going to be cleaned up at any point in the near future. It is just being marked as no longer in use (unless you have another reference to it somewhere else). This is not a strategy for garbage collection.

So what about Finalize()?
It turns out there is another method called Finalize() that the garbage collector calls before reclaiming an object. Its purpose in life, according to MSDN, is to clean up unmanaged resources in the event that Dispose() was never called. If Dispose() has been called, the finalizer should NOT run as well, so your Dispose() method should call GC.SuppressFinalize(this) to stop it. This ensures that you do not do duplicate work that is not needed (generally the code called in Dispose and Finalize should be the same). In C#, you write Finalize() using destructor syntax (~ClassName()), and a class that holds only managed objects should not declare one. Again, this is all according to MSDN, and I would trust MSDN on garbage collection more than a person who used to code and stopped doing that to write about it.
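Put together, the pattern looks roughly like this. It is a sketch only: NativeBuffer is a made-up class, and a block of memory from Marshal.AllocHGlobal stands in for whatever unmanaged resource you actually hold.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around a block of unmanaged memory.
class NativeBuffer : IDisposable
{
    private IntPtr _memory;
    private bool _disposed;

    public NativeBuffer(int size)
    {
        _memory = Marshal.AllocHGlobal(size);
    }

    public bool IsDisposed
    {
        get { return _disposed; }
    }

    // Called by your code: clean up now, then tell the GC that the
    // finalizer is no longer needed.
    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    // Finalizer: the backstop for when nobody called Dispose().
    ~NativeBuffer()
    {
        Dispose(false);
    }

    // Shared cleanup; safe to call more than once.
    protected virtual void Dispose(bool disposing)
    {
        if (_disposed)
            return;

        // If 'disposing' is true, managed members would be disposed here.
        // (There are none in this example.)

        Marshal.FreeHGlobal(_memory);
        _memory = IntPtr.Zero;
        _disposed = true;
    }
}
```

Calling Dispose() twice is harmless here because of the _disposed check, and the finalizer only ever does the work when Dispose() was skipped.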

Wait, wait, I didn't get that. What should I do? Can you sum this all up?
  1. If you are using something that is IDisposable, you should call Dispose() when you are finished with it and when it is a good time in your app to do so performance-wise. Just letting the garbage collector do it for you is not a strategy for anything except crappy coding (if crappy code is your goal, then never call dispose and you'll be about 80% there)
  2. If you are implementing an object that uses unmanaged resources, you should make that object IDisposable and free those unmanaged resources and then call GC.SuppressFinalize(this) in your Dispose() method. Ensure that calling Dispose() multiple times does not throw an exception or duplicate work (setting a private field is a useful idea here)
  3. Setting an object to null is not a strategy for garbage collection. It is a strategy for crap.
  4. If you have an object that is re-usable after calling some sort of closing method (like a connection, file, etc.) then consider pooling that object, particularly if that object is expensive to create or destroy. Remember, every time you allocate a reference type you may trigger garbage collection and you'll never know it.
  5. "Managed code" isn't managed all that well; you need to be absolutely aware of what the CLR is doing to your code and how garbage collection works. It's like walking on a tightrope- managed code gives you a safety net but that doesn't mean you should just jump into it whenever and then climb back up and keep going. That is not a strategy.
  6. Every time codinghorror posts something that jumps the shark, check out http://agilology.blogspot.com/ for clarification and amusement.