/ Database

Defeating deadlocks with READ_COMMITTED_SNAPSHOT isolation

I was recently asked by a client to look into an issue they were having with a WCF web service. The application was generating a large number of errors, filling a 5MB log file every 5 minutes, and the performance of the underlying database was so bad that a simple query such as:

SELECT COUNT (*) FROM Address

would take up to a minute and a half to return. Checking the error logs I could see a huge number of exceptions like:

Transaction (Process ID 112) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

First things first I made sure that the server wasn’t lacking any physical resources but memory was sitting at less than 50% usage, processors less than 10% and disk less than 5%. So I fired up SQL Profiler to let me trace exactly which queries were causing the deadlocks and which queries were being sacrificed. If I didn’t have direct access to the server I could have used Trace Flags to log details of the deadlock for me with trace flag 1222 but fortunately I didn’t have to jump through that extra hoop and I could point SQL Profiler directly at the server and select the ‘Locks:Deadlock graph’ event:
Trace Properties
I could also turn on ‘Lock:Deadlock’ and ‘Lock:Deadlock chain’ but the deadlock graph is great because it gives you visual representation of the deadlock and you can also see the queries causing the deadlocks so I just had to wait for a minute and up popped the following:
Deadlock Graph
As you can see the deadlocks were caused by a page lock in this case and I could simply hover my mouse over either transaction (the two ellipses) to see the specific queries that were involved in the deadlock. The victim in this case was a SELECT and the winner an INSERT which were both trying to access the same page of a specific table, each one trying to lock the page because NHibernate was configured to use a ReadComitted IsolationLevel. Now normally that’s exactly the isolation level I would recommend but the problem for this application was that it was doing a large number of read transactions and they were getting blocked by some rather slow write transactions. This was having a knock on effect on other reads/writes and eventually we ended up with locks all over the place and the deadlocks and horrible performance I started this post with.

OK so at this point we can modify the code to use a less strict isolation level for reads and redeploy it and hooray our issue will be fixed. But there’s a couple of reasons I didn’t want to do that, firstly the web service code was fiendishly complicated and had no test coverage (I should point out at this point that I didn’t write it)so any changes to the code were inherently risky. Secondly the magic of READ_COMMITTED_SNAPSHOT means that modifying the code is unnecessary. So what is ‘Snapshot Isolation’ I hear you cry, well the MSDN definition is:

Specifies that data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction. The transaction can only recognize data modifications that were committed before the start of the transaction. Data modifications made by other transactions after the start of the current transaction are not visible to statements executing in the current transaction. The effect is as if the statements in a transaction get a snapshot of the committed data as it existed at the start of the transaction.

Now you have probably realised that this isolation level has potential drawbacks, see this question on stackexchange for an excellent summary of them, but in this case the benefits far outweighed any potential risk and the application was capable of handling those risks anyway. By enabling READ_COMMITTED_SNAPSHOT at the database level we can make sure that although theapplication’s connection to the database is specifying ReadCommitted IsolationLevel it is achieved using a snapshot rather than a lock, we do that with the following SQL:

ALTER DATABASE database_with_deadlocks SET SINGLE_USER WITH ROLLBACK IMMEDIATE 
ALTER DATABASE database_with_deadlocks SET ALLOW_SNAPSHOT_ISOLATION ON 
ALTER DATABASE database_with_deadlocks SET READ_COMMITTED_SNAPSHOT ON 
ALTER DATABASE database_with_deadlocks SET MULTI_USER

N.B. being in single user mode stops the query from being suspended and never completing

After making this one change there has been a dramatic increase in the performance, there are no more deadlock exceptions thrown and simple queries are back to completing in less than a second. Obviously READ_COMMITTED_SNAPSHOT is not a panacea and should be used with caution and with an awareness of it’s possible risks but in the right situation it can be an incredibly powerful tool to have in your belt.