Things have gone a bit crazy lately: we’ve been under a huge workload and the time left for blogging was virtually nonexistent… but the good thing is that I’ve been working on a few interesting cases I hope I’ll have time to blog about, and this is the first one.
The problem
The application was running on a Windows 2003 cluster, and the well-known solution to avoid session problems in a multi-server environment is to store user sessions out-of-process, using either a State Server or Sql Server; in this case we were using Sql Server. This is also the right approach if you are using Web Garden for your application pool, and that is the situation we were in (Web Garden + state on Sql Server). But some session-wide arrays the customer was using were suddenly empty as soon as the customer increased the number of worker processes for the application pool above 1, which is exactly what is not supposed to happen with <sessionState mode="SqlServer" … />. At first I thought of a configuration problem on the affected servers, but the usual quick check of the metabase showed that was not the case, so the next logical step was to get a repro from the customer and continue from there.
Well, despite my attempts I was not able to reproduce the problem, but the customer confirmed he could, so were we back again to some machine-specific issue? I was still not convinced and decided to have a closer look at the source code of that repro. The essential part consisted of a couple of static arrays the customer was filling in the Page Load event and then binding to a couple of DropDownList controls: changing the selection on those controls (handled by the SelectedIndexChanged event) changed some other UI parts on the page, and this is where the customer was getting an IndexOutOfRangeException, a clear sign the array was unexpectedly empty.
Ok, so this time I decided to have a closer look at the internal behavior of that code within Visual Studio, and while I was inspecting some variables in the debugger I suddenly got the exception the customer reported: a Javascript alert which was created within a catch clause.
This made me think, not so much about the fact that IIS7 (I’m using Windows 2008 on my main workstation) was not respecting the fact that there was a debugger attached to one of its pools (Johan quickly blogged about this issue here), but rather about the fact that, due to its health monitoring mechanism, IIS7 had recycled the application pool I was debugging and, as a consequence, the application threw the exception. But the customer had not reported a process crash, nor did the event log have any relevant entry… Moreover, the repro was not even using the Session() or Cache() objects, so this could not be a “classical” lost-session problem. From my past experience I turned my attention to the static arrays (there is only one “instance” of a static object for the whole application, and if someone changes the array that affects everyone using it), but that was not our case, since the customer and I were able to reproduce the problem on a single machine with a single user. But static objects have another peculiarity: they are scoped to the AppDomain.
If you have the detective instinct you might have guessed where I am going from here… The static arrays are “loaded automatically by the .NET Framework common language runtime (CLR) when the program or namespace that contains the class is loaded” (http://msdn2.microsoft.com/en-us/library/79b3xss3.aspx), and the first time a user requested the page the array was filled with some data coming from a database. To avoid unnecessary round trips to the database, the arrays were filled only if we were not in a postback (IsPostBack=false, so we are in a GET request); when we post the page back (so we’re in a POST) the initialization code does not run because IsPostBack is true, and the code assumes the array is already filled and ready to use. If we’re lucky, everything works fine. This holds as long as IIS decides (for its own internal philosophical reasons) that our POST will be served by the same w3wp.exe instance which already served our first GET. But… what happens if we’re sent to a new w3wp.exe instance? Remember, everything was working fine with 1 process per application pool and the problem appeared only when increasing the number of processes per pool (Web Garden >= 2)…
Well… we will still be issuing a POST, which will be served by a new w3wp.exe instance which will load the CLR (for the first time) and initialize the application (the AppDomain and its static fields); the requested page will run… and IsPostBack will still be true. So we’ll not query the database to fill the array, we’ll simply assume the array is valid (and the array object actually is valid: we’re not getting a NullReferenceException, it’s just empty) and we’ll get the IndexOutOfRangeException when trying to read it.
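To make the sequence above easier to follow, here is a minimal simulation of it. It is not the customer’s code and it is not ASP.NET: it is a plain Python sketch in which each WorkerProcess object stands in for one w3wp.exe instance with its own copy of the “static” array, and all names (WorkerProcess, page_load, selected_index_changed, load_from_database) are hypothetical.

```python
class WorkerProcess:
    """Stands in for one w3wp.exe instance with its own AppDomain."""

    def __init__(self, name):
        self.name = name
        # Plays the role of the static array: it exists as soon as the
        # "AppDomain" is loaded, but it is empty (not null).
        self.static_items = []

    def page_load(self, is_postback, load_from_database):
        # Mirrors the pattern described above: fill the array only on
        # the first GET, and assume it is already filled on a postback.
        if not is_postback:
            self.static_items = load_from_database()

    def selected_index_changed(self, index):
        # Reads the array, trusting that an earlier request filled it.
        return self.static_items[index]


def load_from_database():
    # Hypothetical stand-in for the real database query.
    return ["red", "green", "blue"]


worker_a = WorkerProcess("w3wp-A")
worker_b = WorkerProcess("w3wp-B")

# First request: a GET served by worker A fills A's copy of the array.
worker_a.page_load(is_postback=False, load_from_database=load_from_database)

# Postback served by the same worker: the array is there.
worker_a.page_load(is_postback=True, load_from_database=load_from_database)
same_worker_result = worker_a.selected_index_changed(1)

# Postback routed to the other worker: its copy was never filled.
worker_b.page_load(is_postback=True, load_from_database=load_from_database)
try:
    worker_b.selected_index_changed(1)
    other_worker_failed = False
except IndexError:
    # Python's counterpart of the IndexOutOfRangeException in the repro.
    other_worker_failed = True
```

With a single worker the second scenario can never happen, which matches the fact that the problem only showed up with more than one process in the pool.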
A bit of theory
The main subject here is isolation; this post from Chris Brumme (see also this one for more details on AppDomains) explains the very basics of this matter:
By default, static fields are scoped to AppDomains. In other words, each AppDomain gets its own copy of all the static fields for the types that are loaded into that AppDomain. This is independent of whether the code was loaded as domain-neutral or not. Loading code as domain neutral affects whether we can share the code and certain other runtime structures. It is not supposed to have any effect other than performance
Processes are isolated by definition; it’s just the way the OS architecture works (and not only on Windows, but also on other systems). If you think about how messages are exchanged in the Windows OS, where a complex infrastructure works at kernel level to make communication between processes possible (Windows is a system based on messages, which are sent to windows and processes through low-level system APIs), you can understand that isolation is essential for the security and stability of the entire OS and of the applications running on top of it. If an application could easily read/write data belonging to other processes, it would be a perfect environment for hackers and viruses… we need the opposite (of course with some flexibility left to run our applications). Since an AppDomain is loaded within a given process, and the main purpose of an AppDomain is isolation (i.e. preventing one application from affecting other applications running within the same process, changing their data, etc.), it’s also clear that when you load an AppDomain within a specific process, it cannot be accessed from AppDomains loaded in other processes (unless, of course, you explicitly want that, in which case you can use .NET Remoting to create this kind of communication).
How should memory, objects, memory addresses, resources, threads, etc. be shared between different processes? And if a process throws an exception which affects that shared memory, it would kill all the other processes using that same shared memory region… this is the opposite of isolation and safety, which are what we must have in our operating systems. And think of a cluster/NLB environment: how could we share those static objects (again, memory, threads, resources and so on) across different machines? How could a static variable travel across processes and machines?
So, an AppDomain is specific to a process, static fields are specific to an AppDomain… Sql Server (or any kind of state server) is not the solution for the problem the customer reported; this is rather a design issue. Also, the customer enabled Web Garden to gain performance, and this is actually the idea behind Web Garden, but it only pays off in specific circumstances: in 90% of cases you lose more performance than you gain, and it cannot overcome the system architecture, where again processes are isolated from each other. Web Garden only increases performance in cases where you don’t rely on cache and where your application is not CPU intensive; in essence, it only has a measurable positive impact on performance if you have a site like www.microsoft.com that is mostly static content and completely stateless. In most cases it makes performance worse, because we have to maintain one cache per process and because of the overhead of having multiple processes.
Conclusion
So, the best solution for this problem is not to use Web Garden; alternatively, if the customer likes the idea of having multiple processes serving the application pool, then some re-coding is needed to change those static arrays into “simple” private objects and store them in Session(), which is correctly maintained across processes when using Sql Server, or in Cache(), keeping in mind that the cache is per process and must be refilled when an entry is missing. Bottom line: do not use statics with Web Garden, or be sure you always check that your data objects are correctly initialized, and do not simply assume they are in good shape because you’re in a postback.
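The “always check” advice can be sketched in the same simulated setting. Again, this is a plain Python illustration with hypothetical names (WorkerProcess, get_items, load_from_database), not ASP.NET code: the point is only that the data is loaded whenever the current process does not have it yet, whether or not the request is a postback.

```python
db_calls = 0  # counts simulated database round trips


def load_from_database():
    # Hypothetical stand-in for the real database query.
    global db_calls
    db_calls += 1
    return ["red", "green", "blue"]


class WorkerProcess:
    """Stands in for one w3wp.exe instance with its own AppDomain."""

    def __init__(self, name):
        self.name = name
        self._items = None  # nothing loaded in this process yet

    def get_items(self):
        # Check the data itself instead of trusting IsPostBack:
        # load it the first time this process needs it.
        if self._items is None:
            self._items = load_from_database()
        return self._items

    def selected_index_changed(self, index):
        return self.get_items()[index]


worker_a = WorkerProcess("w3wp-A")
worker_b = WorkerProcess("w3wp-B")

# First GET served by worker A, postback served by worker B.
first = worker_a.selected_index_changed(1)
second = worker_b.selected_index_changed(1)

# A further postback on worker B reuses what it already loaded.
third = worker_b.selected_index_changed(2)
```

The price is one database query per process instead of one per application, which is the same “one cache per process” overhead mentioned earlier for Web Garden.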
Carlo
Quote of the day:
Never explain: your friends do not need it and your enemies will not believe you anyway. – Elbert Hubbard