Fun With java.util.concurrent
Not much time to post today, so this one is going to be a quickie. ;-)
Today I had quite a lot of fun with FutureTask and ConcurrentHashMap. I attempted to implement a lockless protocol for controlling access to the distributed process servers (which seems to be working, BTW). I had to ensure that a process server would be launched only once per address/port pair. If more than one thread were competing to get a reference to a server, I had to ensure that all but one thread would remain blocked. The thread that didn't block would then be responsible for launching the server, and ultimately for making the reference available to itself and to all other blocked threads. FutureTask made that synchronization a whole lot easier. Communicating failure to the blocked threads was also quite easy.
There were also some other issues like error recovery that I had to deal with, but I have to say - the more I use j.u.c, the more I like it. Some of the code I wrote is based on a few tips that Tim Peierls posted on concurrency-interest a while ago in response to a question posed by Greg Luck of ehcache. I wasn't able to fully grasp the elegance of that code until today.
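For the record, the pattern looks roughly like this - a minimal sketch with hypothetical names (ProcessServerRegistry, launchServer and friends), not the actual code:

```java
import java.net.InetSocketAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class ProcessServerRegistry {

    // One FutureTask per address/port pair; the task is the "launch once" token.
    private final ConcurrentMap<InetSocketAddress, FutureTask<ProcessServer>> servers =
            new ConcurrentHashMap<InetSocketAddress, FutureTask<ProcessServer>>();

    public ProcessServer acquire(final InetSocketAddress address) throws Exception {
        FutureTask<ProcessServer> task = servers.get(address);
        if (task == null) {
            FutureTask<ProcessServer> launch = new FutureTask<ProcessServer>(
                    new Callable<ProcessServer>() {
                        public ProcessServer call() throws Exception {
                            return launchServer(address); // expensive, must happen once
                        }
                    });
            // Only one thread wins the race to install the task...
            task = servers.putIfAbsent(address, launch);
            if (task == null) {
                task = launch;
                task.run(); // ...and that thread actually launches the server.
            }
        }
        try {
            // Every other thread blocks here until the launch completes.
            return task.get();
        } catch (ExecutionException e) {
            // Launch failed: clear the slot so a later call may retry,
            // and let all blocked threads see the failure.
            servers.remove(address, task);
            throw e;
        }
    }

    private ProcessServer launchServer(InetSocketAddress address) throws Exception {
        return new ProcessServer(address); // stand-in for the real launch logic
    }
}

class ProcessServer {
    private final InetSocketAddress address;

    ProcessServer(InetSocketAddress address) {
        this.address = address;
    }
}
```

The nice part is that putIfAbsent decides the race, FutureTask.get() does the blocking, and a failed launch reaches every waiter as an ExecutionException; removing the task afterwards leaves the door open for a retry.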
One of the most displeasing facts about asynchronous systems, however, is that distinguishing real failure from asynchronous drift is so difficult. I say that because my code manipulates lots of proxies - JDI "mirrors", CORBA proxies and RMI proxies - and these proxies might fail at any instant, but sometimes that failure is legal and sometimes it is not. Sometimes I cannot know whether a failure looks illegal because some other event I was expecting hasn't arrived yet, or because that event will never be emitted (in which case failure has indeed happened).
Also, on many occasions you have incoming events that have to trigger updates on other components that have threads running inside them. Once again, ensuring that these threads behave correctly (and don't go accessing dead proxies and sockets, causing spurious exceptions) involves adding tons of state-checking code inside components.
To make this a bit clearer, consider the distributed thread manager. The distributed thread manager responds to DDWP (Distributed Debugging Wire Protocol) events and attempts to access node proxies based on them. When a node dies, it emits an event. Once that death event arrives at the distributed thread manager, it will no longer attempt to access that node proxy. However, a DDWP event may well arrive before the death notification - the distributed thread manager will then try to access the node proxy and, of course, will fail. Maybe communication has been severed (failure). Maybe the death event has simply been delayed (not failure). Who knows - you may keep things pending and fail on timeout, you might sit and wait (maybe forever), or just accept the situation as legal, even though it might not be.
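Just to illustrate the "keep things pending and fail on timeout" option - a toy sketch, with entirely made-up names, of how such a failure might be classified:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class NodeStatusTracker {

    // Latches for node proxies whose failure we are still trying to classify.
    private final ConcurrentMap<String, CountDownLatch> pendingDeaths =
            new ConcurrentHashMap<String, CountDownLatch>();

    // Called when the node death notification finally arrives.
    public void nodeDied(String nodeId) {
        CountDownLatch latch = pendingDeaths.remove(nodeId);
        if (latch != null) {
            latch.countDown(); // wake up whoever is waiting to classify a failure
        }
    }

    // Called after a node proxy access blows up: was the failure "legal"
    // (the node really died and the event is just late), or is it real?
    public boolean failureIsLegal(String nodeId, long timeout, TimeUnit unit)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        CountDownLatch existing = pendingDeaths.putIfAbsent(nodeId, latch);
        if (existing != null) {
            latch = existing;
        }
        // If the death event shows up within the timeout, treat the proxy
        // failure as legal; otherwise give up and report a real failure.
        // (A real implementation would also remember nodes that died before
        // anyone started waiting - this sketch deliberately ignores that.)
        return latch.await(timeout, unit);
    }
}
```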
I'm not saying it is impossible to deal with these situations. I'm just saying that it requires careful reasoning about the order in which things might happen, and about the right way to react when the events you need to get a clear picture of what's really going on fail to show up. Oh, I love it. :-)
Phew. I guess it wasn't a quickie after all.