Monday, January 19, 2009

Why dispose is necessary and other insights on managed code

Jeff Atwood has once again jumped the shark, this time about managed code. Then there's this little gem from Twitter:

I believe .Dispose() is a form of optimization, which is necessary *sometimes*
but not *always*. Anyone got links with evidence otherwise?

So for my next trick, I am going to show Jeff (and everyone else) how managed code works and when it fails.

First, I always have a saying when introducing managed code:
Managed code isn't.

The reason I say this is that most people think that because your code is managed, you don't have to think about memory management at all: you can allocate whatever you like and the garbage collector will take care of it for you. According to Jeff, this is a good thing. It would indeed be a good thing, but unfortunately it doesn't work that way, and if you don't know precisely what the CLR is doing to your code and how garbage collection works, you're going to have problems, and when they occur you will have no idea where to look for solutions. The goal of this post is to educate you about what is going on. I won't be able to go into a lot of depth in one post, but this should get you started and give you some resources for further exploration when you run into these problems in your own code. Let's start by looking at Dispose().



For those of you who are unsure, Dispose() is an implementation of the disposal pattern. The idea is that you have a way to mark a class as having explicit cleanup that needs to occur before the object is deallocated. .Net accomplishes this by having an object implement the IDisposable interface with its Dispose() method, which you should call when you are finished using the object so that it can free its resources. This method should invoke the same cleanup code as Finalize() (more on finalization below), so that if the class also provides a finalizer and someone forgets to call Dispose(), the cleanup still happens when the garbage collector runs the finalizer. That makes it a good place for cleaning up unmanaged resources, like database connections or open streams (e.g. files and network connections): the cleanup will still occur even if the programmer simply stops using a particular object and lets it fall out of scope.
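
As a minimal sketch of that pattern (the class name, the wrapped StreamWriter, and the log-writing scenario are all hypothetical, just for illustration):

```csharp
using System;
using System.IO;

// Hypothetical example: a class that owns a disposable resource
// and exposes cleanup through IDisposable.
public class LogWriter : IDisposable
{
    private readonly StreamWriter _writer;   // wraps an unmanaged file handle
    private bool _disposed;

    public LogWriter(string path)
    {
        _writer = new StreamWriter(path);
    }

    public void Write(string message)
    {
        if (_disposed) throw new ObjectDisposedException(nameof(LogWriter));
        _writer.WriteLine(message);
    }

    public void Dispose()
    {
        if (_disposed) return;       // safe to call more than once
        _writer.Dispose();           // release the underlying file handle now
        _disposed = true;
        GC.SuppressFinalize(this);   // nothing left for a finalizer to do
    }
}
```

Callers either call Dispose() explicitly when they're done or, better, wrap the object in a using block as described next.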

Using(disposal pattern)

This brings us to our second useful piece of code: the using() {} block. It lets you put something that implements IDisposable (henceforth "disposable") inside the using() statement, and upon leaving the block defined by that statement (i.e. going outside the {} by any means), Dispose() WILL be run before the jump occurs. This means that inside my using block I can return, break, throw, or otherwise jump to another block of code, and before that code executes, Dispose() is called on the object in the using() statement. using blocks can be nested, in which case objects are disposed in first-in, last-out order (the innermost object first). They also compose with finally blocks: if I have a try..finally with a nested using block, Dispose() is called before the finally runs, and even when code inside the using block throws an exception, Dispose() is still called first. You can test this quite easily by writing a simple app with nested using blocks, throwing from inside one of them, and catching outside of it.
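
That test is easy to sketch. Here's a hypothetical tracer class that records when it is disposed, showing both the first-in, last-out order of nested using blocks and that disposal happens before the exception reaches the catch handler:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tracer: logs its name when disposed.
public class Tracer : IDisposable
{
    public static List<string> Log = new List<string>();
    private readonly string _name;
    public Tracer(string name) { _name = name; }
    public void Dispose() { Log.Add(_name + " disposed"); }
}

public static class Demo
{
    public static void Run()
    {
        try
        {
            using (var outer = new Tracer("outer"))
            using (var inner = new Tracer("inner"))
            {
                throw new InvalidOperationException("boom");
            }
        }
        catch (InvalidOperationException)
        {
            Tracer.Log.Add("caught");
        }
        // Log ends up as: "inner disposed", "outer disposed", "caught" -
        // both Dispose() calls run before the catch body executes.
    }
}
```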



So why does using() exist?


When I interviewed at Microsoft, one of the questions I was asked was "Is the using() statement syntactical sugar?" My answer was that yes, it is, but it is important. Here's why: I define syntactical sugar as any language keyword that performs a function you could still perform without that specific keyword, but where the alternative would be uglier, more difficult, or less likely for a programmer to use properly. For example, I could still close database connections with a try . . . finally block and explicitly call Dispose(). In fact, if you look at the MSIL generated for a using() block, you'll see that it produces exactly such a try . . . finally block. I could even skip IDisposable entirely and just have a cleanup method somewhere else. The nifty thing about using() is that I can just wrap an object in it and be guaranteed that the object will be disposed of properly and deterministically, at the moment control leaves the block, without writing the try . . . finally boilerplate by hand (and without the risk of forgetting the finally).
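
To make the equivalence concrete, here is a sketch of the expansion using a file reader (the method names are hypothetical; the actual generated MSIL also scopes the variable more tightly, but the shape is the same):

```csharp
using System;
using System.IO;

public static class UsingExpansion
{
    // What you write:
    public static string ReadWithUsing(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return reader.ReadToEnd();
        }
    }

    // Roughly what the compiler generates for the method above:
    public static string ReadExpanded(string path)
    {
        StreamReader reader = new StreamReader(path);
        try
        {
            return reader.ReadToEnd();
        }
        finally
        {
            if (reader != null)
                reader.Dispose();   // runs even though the try block returns
        }
    }
}
```

Both methods behave identically; using() just guarantees you can't forget the finally.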


In addition, what if I forget to call Dispose() in the finally block? What if it's far away from where I use the disposable object, so I don't see it right away? What if I need to ensure the order in which Dispose() is called on multiple objects, and I get that wrong? What if exception handling is significant to how I handle disposal? There are a lot of what-ifs here, and the using() statement gives me a convenient shortcut while making my code a bit cleaner.


So how does this tie into garbage collection now and why call dispose() at all?
The garbage collector mostly runs whenever it feels like it and what it does is entirely up to the garbage collector. Let's look at Jeff's SqlConnection cleanup example and see what's going on here:

sqlConnection.Close();
sqlConnection.Dispose();
sqlConnection = null;

So first, we call Close(). This has nothing to do with garbage collection; presumably it just closes the connection itself. Then we call Dispose(). Presumably, Dispose() would clean up an open connection, probably by calling Close(), but it's entirely possible that it does other things as well. According to MSDN, Close() and Dispose() on SqlConnection are functionally equivalent, so it looks like Jeff is violating DRY here, although I think that having Close() and Dispose() do the same thing is ugly and misleading design. Finally, he sets the reference to null. Remember kids, this doesn't actually change the object; he's just setting his reference to it to null, which tells the garbage collector that he's not using it here anymore. If there are no other references still in scope in the rest of the application, the garbage collector is free to clean it up (in unmanaged code, dropping the last reference without freeing the object would mean a memory leak).

Let's look at a contrived example: if I grab a connection to a DB and I'm using Winsock to connect, I'll get a handle to the socket from Winsock, which will be wrapped in some sort of SafeHandle object to ensure that it gets cleaned up. If Close() only closes the socket, it may not clean up the handle. If Dispose() calls Close() and then cleans up the handle, then the Close() and Dispose() calls overlap but do different things. Maybe if you call Close() you can reuse the connection object. That might be useful, particularly if that object takes a long time to create or is resource-heavy even when not connected (if it allocates buffers, for example). Ultimately, what you do depends on how you implement your object, so be sure that you have a good design and don't encourage bad code with redundant calls.

One other problem can exist: when an object is being disposed, what if another reference to that object is in use somewhere that you don't know about? If you only ever let the garbage collector clean the object up, this obviously won't be a problem, but if an eager developer calls Dispose() early, you definitely have one. If you have this problem, your code is probably not well designed, but there's a cheap way to manage it if you think it's necessary: keep a private "bool isDisposed" field, set it immediately at the start of the Dispose() method (and make this thread-safe through correct use of some sort of locking strategy), and check it on every other method call so you can react accordingly, although again I would label this as code smell.
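
A sketch of that guard, with a simple lock for thread safety (the class name and the empty work method are hypothetical placeholders):

```csharp
using System;

// Sketch of the "disposed flag" guard described above.
public class GuardedResource : IDisposable
{
    private readonly object _gate = new object();
    private bool _isDisposed;

    public void DoWork()
    {
        lock (_gate)
        {
            if (_isDisposed)
                throw new ObjectDisposedException(nameof(GuardedResource));
            // ... real work would go here ...
        }
    }

    public void Dispose()
    {
        lock (_gate)
        {
            if (_isDisposed) return;   // calling Dispose() twice is a no-op
            _isDisposed = true;
            // ... release resources here ...
        }
    }
}
```

Throwing ObjectDisposedException (rather than failing in some mysterious way later) at least makes the misuse obvious at the call site.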

And then there's the object = null
Jeff mentions the confusing choice of calling Dispose and setting null, but in reality, if you know what these actually do, there's no confusion at all. If you call Dispose(), as I have mentioned at least six times now, it runs the cleanup of the object when you call it. If you set the reference to null, all you're doing is removing a reference; the garbage collector will then see that there are no more references to this object, so it's ok to clean it up. If it wants to. When it does do cleanup, it runs the object's finalizer (not Dispose()) if one exists, and it is completely non-deterministic as to when this occurs. So leaving cleanup to the garbage collector for things like database connections or files or network connections is a bad idea, since it would be easy for you to end up with a bunch of open connections this way, which can cause all kinds of bad problems.

So now, how is disposal "more of an optimization than anything else?"

It's not. Jeff is wrong because Jeff is right (a different Jeff, though. . .), and Jeff (me) says that calling Dispose() is absolutely necessary so that you can have deterministic behavior in your application in terms of resource deallocation. There are lots of resources you're using in the .Net framework even if you are not aware of it (things like threads, files, even graphics tend to open handles into unmanaged code). If you are not careful about when you dispose things, it's easy to start leaking them, particularly in high-performance applications. It really isn't an "optimization" to call Dispose() when you want to explicitly free up resources, particularly unmanaged ones.

Also, if a significant amount of work has to be done to free up those resources, then you really should do that work at a time of your choosing. If you don't, then potentially any time you create a new reference type, the garbage collector may run, collect something you aren't using anymore, and do all of that cleanup work right then, which may affect the performance of your app in various ways, none of which will make it go faster. This isn't optimization, though; it's simply good coding style.

So what should I do in my code?

First, if you allocate any resources in an object, particularly unmanaged ones, you should clean them up in your Dispose() method (while ensuring that your object implements IDisposable). It's fairly unusual for your objects to allocate external resources directly, but it's likely that you will consume some without realizing it, which means that you should either dispose of them explicitly inside the object when you're through with them, or call their Dispose() methods from inside your object's own Dispose() method.

Next, if you have another method like "Close()" or "Shutdown()" or "DieYouGravySuckingPigDog()", then you should ensure that these methods don't do any disposal for you. These types of methods aren't meant to clean up an object so that it can't be used anymore (that's what Dispose() is for). At any point in an object's lifecycle it should be usable in some way before calling Dispose(), and not usable after calling Dispose(). Also make sure that a disposed object won't try to do anything bad if you accidentally call it after you've disposed it, generally by writing good code where you don't attempt to do this, though some checking in your object won't hurt. If you want to reuse objects a lot, consider using an object pool (a buffer pool for handling Socket reads would be a good example).
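
Here's a sketch of that Close()-vs-Dispose() split: Close() returns the object to a reusable state, while Dispose() ends its life for good. The class and its simulated open/closed states are hypothetical:

```csharp
using System;

// Sketch: Close() is reversible, Dispose() is terminal.
public class ReusableConnection : IDisposable
{
    public bool IsOpen { get; private set; }
    private bool _disposed;

    public void Open()
    {
        if (_disposed) throw new ObjectDisposedException(nameof(ReusableConnection));
        IsOpen = true;    // e.g. acquire a socket
    }

    public void Close()
    {
        IsOpen = false;   // release the socket, but keep buffers for reuse
    }

    public void Dispose()
    {
        if (_disposed) return;
        Close();          // make sure the connection is shut down,
        _disposed = true; // then tear everything else down for good
    }
}
```

After Close() you can call Open() again; after Dispose() you cannot. That's the whole contract in two methods.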

Finally, remember that just because you set a reference to null doesn't mean the object is going to be collected at any point in the near future; it is just no longer reachable through that reference (unless you have another reference to it somewhere else). This is not a strategy for garbage collection.

So what about Finalize()?
It turns out there is another method, Finalize(), that the garbage collector uses for cleaning up unmanaged resources. Its purpose in life, according to MSDN, is to clean up unmanaged resources in the event that Dispose() was never called. If you provide a finalizer as that backstop, it should run the same cleanup code as Dispose(), and Dispose() should call GC.SuppressFinalize(this) so that the finalizer is skipped once disposal has already happened; this ensures that you do not do duplicate work that is not needed. There is also a C# destructor syntax, which the compiler turns into a finalizer; don't use it for purely managed cleanup. Again, this is all according to MSDN, and I would trust MSDN on garbage collection more than a person who used to code and stopped doing that to write about it.
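
The standard pattern MSDN describes puts all of this together: a public Dispose(), a protected Dispose(bool) doing the actual work, and a finalizer as the safety net. This is a sketch; the "unmanaged handle" here is just a simulated IntPtr, not a real OS resource:

```csharp
using System;

// Sketch of the standard dispose pattern.
public class NativeWrapper : IDisposable
{
    private IntPtr _handle = new IntPtr(1);   // stand-in for an unmanaged handle
    private bool _disposed;

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);   // the finalizer would only repeat this work
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;
        if (disposing)
        {
            // free *managed* disposables here (only safe on the Dispose() path,
            // because during finalization they may already be collected)
        }
        // free unmanaged resources here, on both paths
        _handle = IntPtr.Zero;
        _disposed = true;
    }

    ~NativeWrapper()
    {
        Dispose(false);   // backstop if Dispose() was never called
    }
}
```

The disposing flag exists because a finalizer must not touch other managed objects, which may have been finalized already by the time it runs.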

Wait, wait, I didn't get that. What should I do? Can you sum this all up?
  1. If you are using something that is IDisposable, call Dispose() when you are finished with it, at a good time in your app to do so performance-wise. Just letting the garbage collector do it for you is not a strategy for anything except crappy coding (if crappy code is your goal, then never call Dispose() and you'll be about 80% of the way there)
  2. If you are implementing an object that uses unmanaged resources, make that object IDisposable, free those unmanaged resources in Dispose(), and call GC.SuppressFinalize(this) there. Ensure that calling Dispose() multiple times does not throw an exception or duplicate work (a private "isDisposed" flag is useful here)
  3. Setting an object to null is not a strategy for garbage collection. It is a strategy for crap.
  4. If you have an object that is re-usable after calling some sort of closing method (like a connection, file, etc.) then consider pooling that object, particularly if that object is expensive to create or destroy. Remember, every time you allocate a reference type you may trigger garbage collection and you'll never know it.
  5. "Managed code" isn't managed all that well; you need to be absolutely aware of what the CLR is doing to your code and how garbage collection works. It's like walking on a tightrope: managed code gives you a safety net, but that doesn't mean you should just jump into it whenever and then climb back up and keep going. That is not a strategy.
  6. Every time codinghorror posts something that jumps the shark, check out http://agilology.blogspot.com/ for clarification and amusement.

7 comments:

Kelly Leahy said...

about Dispose(), you said that "this method is guaranteed to be called prior to garbage collection..."

This is absolutely NOT true. That's why finalizers exist. If you have something you need to have done before GC collects your object, you need a finalizer. However, if your object is not directly owning something that needs to be freed you don't need a finalizer, as Jeff said.

BTW, you can verify what I'm saying with a simple test. Create an object that sets some global variable to true when dispose is called. Put this object in a weak reference (with no other references to it), and then call GC.Collect(). Verify that the weak reference doesn't reference your object anymore, and guarantee that all finalizers have been called by calling GC.WaitForPendingFinalizers(). Then, check your flag. At this point, the GC has collected your object instance, but you'll notice that your dispose method has not been called.
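
The experiment Kelly describes can be sketched like this (note that GC behavior is implementation-dependent: in debug builds the JIT can extend a local's lifetime, so the weak reference is created in a separate, non-inlined method to make sure no strong reference survives the collection):

```csharp
using System;

// Flags a static variable when Dispose() is called.
public class Flagged : IDisposable
{
    public static bool Disposed = false;
    public void Dispose() { Disposed = true; }
}

public static class GcExperiment
{
    // Create the object in a non-inlined method so no live reference
    // keeps it alive when we force a collection.
    [System.Runtime.CompilerServices.MethodImpl(
        System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
    private static WeakReference MakeWeak() => new WeakReference(new Flagged());

    public static void Run()
    {
        WeakReference weak = MakeWeak();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        Console.WriteLine("Still alive: " + weak.IsAlive);      // typically False
        Console.WriteLine("Dispose ran: " + Flagged.Disposed);  // False: the GC never calls Dispose()
    }
}
```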

Now, that said, I completely disagree with Jeff's (Tucker) assertions on whether calling .Dispose() is an optimization or not: it absolutely is, as the only effect it has, unless the pattern was implemented wrong by the object implementer, is to cause cleanup to occur earlier than it would normally have. Sometimes this really does matter, right? Like when you have files that you want to delete or pass to another process, etc., but sometimes it doesn't matter to you.

I believe it is good practice to call Dispose as soon as you can, if it's possible that resource 'hogging' can be a problem. However, if you are in a situation where cleanup costs would be more of a performance burden than resource 'hogging' (sometimes this is the case), it's perfectly acceptable to allow the finalizer to do the work for you. Any object that actually owns unmanaged resources directly should ABSOLUTELY implement both Dispose and a finalizer, and in Dispose should call GC.SuppressFinalize(this), and if it doesn't IT is implemented wrong, not your code that doesn't call Dispose!

Jeff Tucker said...

Dammit, you're right about the finalize vs. dispose thing, I can't believe that I didn't catch that. I originally crammed that all into one long paragraph (like I usually do) so I must have screwed it up when I split it out. I'll fix/clarify

Jeff Tucker said...
This comment has been removed by the author.
Kelly Leahy said...

So, now, let's talk about setting the reference to null.

This is absolutely a valid thing to do and it's the best way to say "I'm done with this object so you can collect it at will" for objects that are both IDisposable and objects that aren't. The difference being if you need to apply the optimization of "early" destruction (i.e. you want to dispose of the resources NOW), you should call Dispose before setting the object ref to null.

For most objects, Dispose will render the object useless, so it's best not to keep it around, and setting the ref to null is a perfectly valid thing to do. Of course, if you have a 'contained' object that has a shorter lifetime than you do, I'd say that might be a code smell in and of itself.

As for the rest of Jeff's (Tucker) post, I'd say the things he says are very good advice. My issue with the post is that they are not absolute rules and that calling Dispose() IS an optimization.

Jeff Tucker said...

I disagree on always setting objects to null. For your typical object that lives and dies in some method, it's not necessary because it goes out of scope quickly enough that it won't matter.

For an object that won't fall out of scope, then it's absolutely necessary, but I find these are few and far between (in good code, anyway). One other time this is useful is if an object won't go out of scope for a long time, however you finish using it quickly.

For example, if I allocate a large object at the start of a method, use it, and then finish using it, and then start some sort of long-running processing in the same method, then the object that I'm finished using will still exist, and worse, it may get sent to the gen2 heap or the large object heap, despite the fact that I'm finished with it, so in this case setting it to null is a good idea. However, I would argue that this is very much the exception, not the rule. There are a few other cases as well where this is true but I'm having a hard time thinking of a good example. I think a blog post with more details on garbage collection and performance may be a good idea in the future.
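
That scenario can be sketched as follows. The names and the 10 MB buffer are hypothetical; also note that in optimized release builds the JIT often notices the last use of a local on its own, so explicit nulling matters mostly in debug builds or when the reference is a field:

```csharp
using System;

// Sketch: a large buffer is finished with early in a long-running
// method, so the reference is nulled to let the GC reclaim it.
public static class LongRunner
{
    public static int Process()
    {
        byte[] big = new byte[10 * 1024 * 1024];   // large object, lands on the LOH
        int checksum = 0;
        foreach (byte b in big) checksum += b;     // last use of the buffer

        big = null;   // drop the reference so the 10 MB is collectible
                      // during the long-running phase below

        // ... imagine a long-running processing phase here ...
        return checksum;
    }
}
```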

So now the question I have is if you have some resources like this that you need to set to null, would Dispose() be a good place to do that? Seems like this would be an exception to not having disposal on managed objects with no unmanaged resources, but I'm not convinced that it's 100% necessary. I think I'll try to come up with a scenario where Dispose() with managed resources being set to null gives a performance boost but this will take some contemplation on my part.

romkyns said...

> So now, how is disposal [NOT] "more of an optimization than anything else?"

I don't feel that you've answered this question. In this paragraph you simply go on and make the claim that it's NOT just an optimization, without explaining why.

C++ lets you free memory explicitly, but in C# if you've allocated a 10MB byte array the ONLY recourse you have is GC.

Why is it then so damn important to dispose of the rather lightweight HPEN resource underlying the Pen IDisposable object, and how is it not a micro-optimization, in comparison to your 10MB byte array?

Why do you advocate disposing of this lightweight resource manually when you have zero control over the disposal of a whopping 10MB of RAM? (which, of course, is as much of a system resource as an HPEN)

Alex said...

@romkyns: The 10 MB are not as much a system resource as an HPEN is. Memory is a resource where you only need some amount: you simply don't care whether you get the same memory or different memory. In other words, if there is a resource for which you can afford non-deterministic disposal, it's memory itself. The problem is that the GC is designed to handle memory only, yet people tend to forget that destructors are made to do other things besides freeing memory. It is these "other things" that are important and must be done in a deterministic way.

I simply don't care if the system frees the 10 MB as long as it grants me another 10 MB of memory when I request it. It might be the same 10 MB or some other; I don't care, because memory is not named.

But I do care if my SQL connection is closed when I want it to be closed, and I want my file to be closed when I want it to be closed.

Why is that? Because memory is an internal resource of the program; it involves no interaction with the rest of the OS. All the other resources do. A program can mess with its internal resources as much as it likes, but its external behavior should be predictable and deterministic.