Deadlock in global? (we see global:random_sleep)

Tue Apr 27 22:41:11 CEST 2004

I've heard that using the global module is dangerous because its 
algorithms do not recover properly from a temporary network partition, 
where a single cloud of nodes splits into several clouds that later 
rejoin.

Today we had a multi-node deadlock after running without problems for 
several months.  The only previous occurrence was caused by accidental 
manual tinkering.  Our distributed application uses global to ensure 
that only a single instance of a particular resource connects to our 
server farm.
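
For readers unfamiliar with the pattern, here is a minimal sketch of 
single-instance registration through global.  It assumes 
global:register_name/2 is the mechanism; the post does not say which 
call the application actually uses, and the module and function names 
are hypothetical.

```erlang
-module(single_instance).
-export([claim/1]).

%% Hypothetical helper: try to become the only cluster-wide owner of Name.
%% global:register_name/2 returns 'yes' if the calling process now holds
%% the name, or 'no' if another process in the cluster already holds it.
claim(Name) ->
    case global:register_name(Name, self()) of
        yes ->
            {ok, self()};
        no ->
            %% whereis_name/1 may return 'undefined' if the current owner
            %% exits between the two calls.
            {error, {already_registered, global:whereis_name(Name)}}
    end.
```

A caller would connect to the server farm only when claim/1 returns 
{ok, Pid}, and back off otherwise.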

All our processes appear to be hung waiting on global:random_sleep.
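
One way to keep a caller from waiting indefinitely on a global lock is 
to bound the number of attempts.  The sketch below assumes the 
application takes locks through global:trans/4, which the post does not 
confirm; the module name, function name, and retry count are 
hypothetical.

```erlang
-module(bounded_lock).
-export([with_lock/2]).

%% Hypothetical helper: run Fun while holding a cluster-wide lock on
%% ResourceId, giving up after a fixed number of attempts.
%% global:trans/4 returns the result of Fun, or 'aborted' if the lock
%% could not be acquired within Retries attempts.
with_lock(ResourceId, Fun) ->
    LockId = {ResourceId, self()},      %% {ResourceId, LockRequesterId}
    Nodes = [node() | nodes()],         %% lock on all currently known nodes
    Retries = 5,                        %% assumed value; the default is infinity
    case global:trans(LockId, Fun, Nodes, Retries) of
        aborted -> {error, lock_not_acquired};
        Result  -> {ok, Result}
    end.
```

With a finite retry count the caller gets an error it can log or act on 
instead of blocking, though this does not address whatever left the 
lock held in the first place.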

Does anyone know whether there is a "known problem" with global?  If 
so, can someone describe what that problem is and perhaps point us in 
the direction of alternatives?

The problem seems quite rare.  So if it is a bug in global, or a 
problem that cannot be solved, then perhaps a workaround for now would 
be some means of resetting the global state?