My last project at work involved building a records management system atop SharePoint. (SharePoint is a web collaboration portal and data storage system.) SharePoint doesn't support transactional updates of data — a single item can be updated atomically, but a collection of items cannot. This makes it very difficult to build a system that's both complex and reliable. If a system ever needs to make changes to multiple items in a single operation, a failure in the middle can cause data to get out of sync. And anybody familiar with SharePoint knows that failures are common.
The solution was locks, but locks are tricky. First, you have to make sure to lock everything that's going to be changed, before you start changing it. This requires all operations to have two stages: first, you need to simulate the operation enough to discover everything that will need to be changed, and then you need to lock all the objects, perform the operation, and unlock them. You also have to lock far-flung parts of the system itself to prevent the set of items needing to be locked from changing. Finally, you either have to either establish a global convention for locking all items in the same order, or make locks time out, to prevent deadlocks. Needless to say, the greatly increased complexity and stringent demands require very careful coding, and place an increased burden on developers using the system.
Locks have other problems, too. All that work of locking and unlocking reduces the scalability of the application, especially, as in our case, when locks have to function across processes and machines, entailing communication with a central lock server. Unless you're going to lock on reads too, decreasing performance further, it's still possible for a process to view a half-changed set of items. And it's highly pessimistic; 99% of the time there'll be no conflict, so you're doing a lot of work for an edge case. Since it's rare and intermittent, the edge case is hard to test and debug, too.
Although I knew about these problems, I figured they were all solvable. It would just take careful, skilled programming. But when I thought about how I could improve the system, I hit upon a fundamental problem with locks: lock-based operations can't be composed. Say you've carefully designed two operations, A and B, which each use locks to complete atomically. Now someone wants to create a new operation AB that calls uses both A and B and also completes atomically. How can they prevent someone from observing the intermediate state where A has completed but not B? Except in trivial cases, it's usually impossible. If it's supported at all, it's done by exposing the locking mechanism to the user, which is terrible (but so common that we're used to it).
This quickly changed my opinion about transactions: from something that's just nice to have into something absolutely necessary for good design. I already added transaction support to our object model, but it had a substantial cost in increased complexity, and was still built upon locking, due to the lack of transaction support in the underlying SharePoint API. In the next release, we plan to replace SharePoint with a more reliable custom system, and the desire to implement full transaction support throughout prompted me to find a way to simplify the creation of transactional software.
Enter software transactional memory (STM). Some decades ago, there was interest in the possibility of creating a machine that would support transactions all the way down to the hardware level. That never materialized, but in recent years there has been resurgence in interest and research in software implementations. The basic idea is that a program can run for a while, and eventually ask for all its changes to memory to be either rolled back or committed atomically. Combined with the long-standing support for transactions in databases and the recent support for transactions in the file system and registry, plus the ability to compose transactions using all kinds of resources across multiple machines into distributed transactions that either completely succeed atomically or roll back all changes everywhere, software can now take a huge leap forward in reliability and scalability while simultaneously being easier to write and maintain than ever before.
But that's only true if there exist good libraries to support this way of working. I searched for STM implementations for .NET and found four or five of them. Unfortunately, as is often the case, they all suck, to some degree or another.
- SXM is an implementation of STM for .NET by Microsoft Research. It seems to be a pretty solid system, is very simple to start using, and supports some advanced operations (like conditional retry and 'orElse'). But it has several major flaws, and it's not actively developed, so they'll likely never be fixed. These are forgivable since it's just a research project, but I can't recommend it for real-world use.
- SXM makes it very difficult to build a system that uses transactions in a transparent way. It requires all transactional objects to be allocated by SXM itself, which means that all users of your transactional objects must also be aware of SXM.
- It provides an extremely simple interface for using it — simply add the [Atomic] attribute to a class and it magically becomes transactional. That's really nice. But it provides no other interface, which means it can be hard to write flexible or high-performance transactional classes.
- SXM is unclear about how objects are copied. Reading the source code, I could see that all objects should either be immutable or implement ICloneable, but if you don't adhere to that, you may get silent violations of consistency.
- The core SXM system doesn't provide a real guarantee of progress. It may not deadlock, but it can still livelock. It allows you to provide a contention manager that will be used to resolve conflicts between transactions, but the design of a good contention manager is complex, and they recommend that it should be based on the behavior of the algorithms you're writing. This is another way that SXM can't be used transparently. What if your algorithm requires one kind of contention manager, but is called by a function in a library which wants a different kind of contention manager? Only one manager can be active, and libraries would need to avoid stepping on each other's toes somehow. It also seems highly unbalanced that the interface for using it is so oversimplified, except for the arcane requirement to select or design your own contention manager.
- SXM doesn't integrate with System.Transactions, meaning that it won't participate in other transactions, and other transactions won't participate in SXM. So memory changes may complete while database changes fail, or vice versa, meaning you don't have consistency.
- LibCMT is a C library that implements what might be a good STM system, and it includes a .NET binding. However, the fact that it's an unmanaged library means it faces severe limitations, and the .NET binding is pretty shoddy.
- It's only capable of managing types that implement a special, non-standard interface, so you can't use it with integers or strings or dates or any other built-in type unless you first wrap it in an object that implements an interface for cloning, comparing, and altering the type.
- Its unmanaged memory isn't garbage collected, and the .NET binding fails to use finalizers or even Dispose() methods, so unless you're careful, you can get memory leaks and dangling transactions. The system is not resilient against thread abortion, so I wouldn't recommend using it in any context where thread abortion may happen, like in web contexts.
- There will be overhead related to transitioning back and forth from unmanaged code, and its handling of unmanaged objects seems dubious.
- It doesn't seem to support nested transactions.
- It dumps a bunch of debugging information to the console window, so it can't be used as-is in a console application.
- It doesn't integrate with System.Transactions, and so won't provide system-wide consistency.
- MikroKosmos is a toy reimplementation of Microsoft's Bartok STM, but the point of the Bartok STM was to show how an STM system built into the .NET runtime could take advantage of its privileged position to implement various optimizations, and simulating the system from outside the runtime is rather silly. MikroKosmos is meant to be used as a benchmark for a particular algorithm, and is not a practical STM system.
- NSTM is the best library I found, but nonetheless isn't very good. It was written by Ralf Westphal, who's a pretty smart guy, and somebody I'd trust as an architect, analyst, or project manager, but not as a developer. The good part of NSTM is that it attempts to provide a flexible and full-featured system with both high- and low-level access points that integrates with System.Transactions and can be used transparently. It provides an aspect-oriented approach similar to SXM's, where you can apply an attribute to a class to automagically make it transactional, and also a way to directly access the memory managed by the system. But its flaws are as numerous as its features:
- NSTM uses blocking locks that can be held for significant periods of time (especially with System.Transaction integration). If the locks aren't released, for instance due to a thread abortion or you forgetting to call Dispose(), the system may deadlock.
- NSTM has no facility to allow one transaction to help another transaction commit, nor does it have fancy contention management, and so it will experience a significantly higher number of transaction abortions. It also appears vulnerable to livelock (although I haven't verified this), and offers no guarantee of progress, not even the weak guarantee of obstruction freedom.
- NSTM supports a somewhat limited set of types: only value types, strings, and ICloneable objects. Furthermore, its support for value types is buggy, and can lead to uncommitted changes leaking out of transactions even if they've been rolled back.
- NSTM is easy to use wrongly. For instance, it cheerfully provides an implicit conversion function that lets you destroy the consistency of your transactions by writing code as unassuming as "x = 5". It also provides a large number of configuration options, and it can't be reliably used between different libraries because they may need different configurations, at least one of which exists only as a global setting.
- NSTM has questionable code quality, and the source contains comments indicating the existence of unknown bugs lurking around. It also does a lot of silly things that hurt performance, and takes a lenient approach to user mistakes that covers up bugs.
Because of the apparent lack of a decent STM implementation for .NET, I set out to create my own.
- It offers a strong guarantee of progress — lock freedom — which guarantees that the system as a whole can always make progress (i.e. that no matter how much contention there is between individual threads running their own transactions, at least one transaction can always succeed). (There is one minor caveat: if you integrate with System.Transactions, there is a short time window when the guarantee is suspended, but that's necessary to support the System.Transactions interface.)
- It supports a broad range of types: Immutable reference types, and value types that contain no mutable reference types in their object graphs (i.e. most of them), are handled automatically, so implementing ICloneable is often unnecessary, but types can implement it if they wish to customize the copying process.
- It guarantees consistency even if a thread gets aborted, even if it aborts in the middle of a commit. (Caveat: when used with System.Transactions, a thread abort in the middle of a commit will roll back changes to memory, but what happens to other resources depends on the implementations of their particular transaction managers.)
- Transactions help each other commit when they detect contention between themselves. While it may not always work ideally, the contention management works well enough that you shouldn't need to think about it.
- It transparently integrates with System.Transactions, but this can be disabled if you prefer.
- It fully supports nested transactions. I initially implemented multiple isolation levels, conditional retry, and 'orElse' composition as well, but the most useful alternative isolation level is incompatible with the guarantee of progress, and conditional retry and 'orElse' composition didn't seem to be of great practical use. Since they reduced performance, I removed them, but only a minor change would be needed to add them back.
- Its performance is substantially better than that of NSTM, the nearest competitor.
- It's easy to use correctly and difficult to use incorrectly.
This has been implemented in my Transactions library. It has been used within serious projects and has worked well. It would be nice to have an attribute that can automagically make classes transactional, but it's quite useful as it is, and I believe it to be a quality system that you can build upon. If you want to use it but are unsure about how to structure an object to make it transactional, feel free to email me.
Comments
No comments yet.