Deadlocks in Distributed Systems:
Deadlock can occur whenever two or more processes compete for limited resources and are allowed to acquire and hold a resource (obtain a lock), thus preventing others from using that resource while the holder waits for other resources.
Two common places where deadlocks may occur are with processes in an operating system (distributed or centralized) and with transactions in a database.
Deadlocks in distributed systems are similar to deadlocks in single-processor systems, only worse:
- They are harder to avoid, prevent, or even detect.
- They are hard to cure when tracked down, because all of the relevant information is scattered over many machines.
Deadlocks are sometimes classified into the following types:
- Communication deadlocks: processes compete for the buffers used for sending and receiving messages.
- Resource deadlocks: processes compete for exclusive access to I/O devices, files, locks, and other resources.
The four best-known strategies for handling deadlocks are:
- The ostrich algorithm (ignore the problem).
- Detection (let deadlocks occur, detect them, and try to recover).
- Prevention (statically make deadlocks structurally impossible).
- Avoidance (avoid deadlocks by allocating resources carefully).
Deadlock Detection:
Deadlock detection attempts to find and resolve actual deadlocks. These strategies rely on a Wait-For-Graph (WFG) that in some schemes is explicitly built and analyzed for cycles. In the WFG, the nodes represent processes and the edges represent the blockages or dependencies. Thus, if process A is waiting for a resource held by process B, there is an edge in the WFG from the node for process A to the node for process B.
In the AND model (resource model), a cycle in the graph indicates a deadlock. In the OR model, a cycle may not mean a deadlock since any of a set of requested resources may unblock the process. A knot in the WFG is needed to declare a deadlock. A knot exists when all nodes that can be reached from some node in a directed graph can also reach that node.
In a centralized system, a WFG can be constructed fairly easily. The WFG can be checked for cycles periodically or every time a process is blocked, thus potentially adding a new edge to the WFG. When a cycle is found, a victim is selected and aborted.
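As an illustration of cycle checking under the AND model, the following is a minimal sketch of a WFG represented as an adjacency map and searched with depth-first search; the data structure and function name are assumptions, not part of the original notes.

    # Minimal wait-for-graph (WFG) cycle check under the AND model:
    # an edge A -> B means "process A is blocked waiting on process B".

    def has_cycle(wfg):
        """wfg: dict mapping a process id to the set of process ids it waits for."""
        WHITE, GRAY, BLACK = 0, 1, 2             # unvisited / on current DFS path / finished
        color = {p: WHITE for p in wfg}

        def dfs(p):
            color[p] = GRAY
            for q in wfg.get(p, ()):
                if color.get(q, WHITE) == GRAY:  # back edge: a cycle, hence a deadlock
                    return True
                if color.get(q, WHITE) == WHITE and dfs(q):
                    return True
            color[p] = BLACK
            return False

        return any(color[p] == WHITE and dfs(p) for p in list(wfg))

    # Example: A waits for B, B waits for C, C waits for A  =>  deadlock.
    print(has_cycle({"A": {"B"}, "B": {"C"}, "C": {"A"}}))   # True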
Centralized Deadlock Detection:
We use a centralized deadlock detection algorithm and try to imitate the nondistributed algorithm.
- Each machine maintains the resource graph for its own processes and resources.
- A centralized coordinator maintains the resource graph for the entire system.
- When the coordinator detects a cycle, it kills off one process to break the deadlock.
- To keep the coordinator's graph up to date, messages have to be passed (a sketch of Method 1 follows this list):
- Method 1) Whenever an arc is added to or deleted from the resource graph, a message has to be sent to the coordinator.
- Method 2) Periodically, every process sends a list of the arcs added and deleted since the previous update.
- Method 3) The coordinator asks for information when it needs it.
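A sketch of how a coordinator might apply Method 1, reusing the has_cycle helper sketched above; the message format, class name, and victim-selection stub are assumptions.

    # Sketch of Method 1: every machine reports each arc change to the coordinator,
    # which updates its global view and checks for a cycle after every update.

    class Coordinator:
        def __init__(self):
            self.wfg = {}                              # global wait-for graph

        def handle_update(self, op, waiter, holder):
            """op is 'add' or 'delete'; the arc means waiter waits for holder."""
            if op == "add":
                self.wfg.setdefault(waiter, set()).add(holder)
            else:
                self.wfg.get(waiter, set()).discard(holder)
            if has_cycle(self.wfg):                    # helper sketched earlier
                self.kill_victim()

        def kill_victim(self):
            print("deadlock detected: choose a victim and abort it")

    coord = Coordinator()
    coord.handle_update("add", "A", "B")
    coord.handle_update("add", "B", "A")               # this arc closes the cycle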
False Deadlocks:
- One possible way to prevent false deadlocks is to use Lamport's algorithm to provide global time for the distributed system (a sketch follows this list).
- When the coordinator gets a message that leads to a suspected deadlock, it sends everybody a message saying: "I just received a message with timestamp T which leads to deadlock. If anyone has a message for me with an earlier timestamp, please send it immediately."
- When every machine has replied, positively or negatively, the coordinator can tell whether the deadlock has really occurred or not.
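A minimal sketch of the Lamport logical-clock rules that such global timestamps rely on; the class and method names are assumptions.

    # Lamport logical clock: local events tick the counter, and a received
    # timestamp advances the counter to max(local, received) + 1.

    class LamportClock:
        def __init__(self):
            self.time = 0

        def tick(self):                  # a local event
            self.time += 1
            return self.time

        def send(self):                  # timestamp attached to an outgoing message
            return self.tick()

        def receive(self, msg_time):     # merge rule on message receipt
            self.time = max(self.time, msg_time) + 1
            return self.time

    clk = LamportClock()
    t = clk.send()                       # e.g. timestamp a "release resource" message
    print(clk.receive(t + 5))            # a later message advances the local clock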
The Chandy-Misra-Haas algorithm:
- Processes are allowed to request multiple resources at once, so the growing phase of a transaction can be speeded up.
- The consequence of this change is that a process may now wait on two or more resources at the same time.
- When a process has to wait for some resources, a probe message is generated and sent to the process holding the resources (as sketched after this list). The message consists of three numbers: the process being blocked, the process sending the message, and the process receiving the message.
- When the message arrives, the recipient checks to see whether it itself is waiting for any processes. If so, the message is updated: the first field is kept unchanged, and the second and third fields are replaced by the corresponding process numbers.
- The message is then sent to the process holding the needed resources.
- If a message goes all the way around and comes back to the original sender (the process that initiated the probe), a cycle exists and the system is deadlocked.
- There are several ways to break the deadlock:
- The process that initiated the probe commits suicide. This is overkill, because several processes might initiate probes at the same time and all of them would commit suicide, even though killing only one of them is needed.
- Each process appends its ID to the probe; when the probe comes back, the originator can kill the process with the highest number by sending it a message. (Hence, even if several probes circulate, they will all choose the same victim.)
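The following is a minimal single-machine simulation of the edge-chasing probe described above; the dependency table, process ids, and function name are assumptions.

    # Chandy-Misra-Haas style probe, simulated on one machine.
    # waits_for[p] = set of processes p is currently blocked on (empty if running).
    waits_for = {1: {2}, 2: {3}, 3: {1}, 4: set()}

    def deadlocked(initiator):
        """Return True if a probe started by 'initiator' comes back to it."""
        seen = set()
        # probe = (blocked process, sender, receiver)
        pending = [(initiator, initiator, r) for r in waits_for[initiator]]
        while pending:
            blocked, sender, receiver = pending.pop()
            if receiver == blocked:            # probe returned to the initiator: cycle
                return True
            if receiver in seen:
                continue
            seen.add(receiver)
            # a recipient that is not blocked simply discards the probe; a blocked
            # one forwards it, keeping the first field and rewriting the other two
            for nxt in waits_for.get(receiver, ()):
                pending.append((blocked, receiver, nxt))
        return False

    print(deadlocked(1))   # True: 1 -> 2 -> 3 -> 1
    print(deadlocked(4))   # False: process 4 is not blocked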
The Probe Scheme:
The scheme proposed by Chandy, Misra and Haas [1] uses local WFGs to detect local deadlocks and probes to determine the existence of global deadlocks. If the controller at a site suspects that process A, for which it is responsible, is involved in a deadlock, it will send a probe message to each process that A is dependent on. When A requested the remote resource, a process or agent was created at the remote site to act on behalf of process A to obtain the resource. This agent receives the probe message and determines the dependencies at the current site.
The probe message identifies process A and the path it has been sent along. Thus, the message probe(i,j,k) says that it was initiated on behalf of process i and was sent from the controller for agent j (which i is dependent on) to the controller for agent k.
When an agent whose process is not blocked receives a probe, it discards the probe. It is not blocked so there is no deadlock. An agent that is blocked sends a probe to each agent that it is blocked by. If process i ever receives probe(i,j,i), it knows it is deadlocked.
This popular approach, often called "edge-chasing", has been shown correct in that it detects all deadlocks and does not find deadlocks when none exist.
In the OR model, where only one out of several requested resources must be acquired, all of the blocking processes are sent a query message and all of the messages must return to the originator in order for a deadlock to be declared. The query message is similar to a probe, but has an additional identifier so the originator can determine whether or not all the paths were covered.
There are several variations on these algorithms that seek to optimize different properties, such as the number of messages, the length of messages, and the frequency of detection. One method, proposed by Mitchell and Merritt for the single-resource model, sends information in the opposite direction from the WFG edges.
Deadlock Prevention:
Prevention is the name given to schemes that guarantee that deadlocks can never happen because of the way the system is structured. The idea is to make one of the four necessary conditions for deadlock impossible to hold, thus preventing deadlocks.
Collective Requests:
One way to do this is to make each process declare, when it is first started, all of the resources it might eventually need. Only if all the resources are available is the process allowed to continue. All of the resources are acquired together, and the process proceeds, releasing all the resources when it is finished. Thus, hold and wait cannot occur.
The major disadvantage of this scheme is that resources must be acquired because they might be used, not because they will be used. Also, the pre-allocation requirement reduces potential concurrency.
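A sketch of collective requests under the stated assumptions: the process either gets every resource it declared or backs off and retries, so it never holds some resources while waiting for others. The lock table and helper names are illustrative.

    # Collective (one-shot) requests: grab everything up front or nothing at all.
    import threading, time

    locks = {"printer": threading.Lock(), "file": threading.Lock()}

    def acquire_all(names):
        """Grab every named lock, or release what was taken and report failure."""
        got = []
        for name in sorted(names):
            if locks[name].acquire(blocking=False):
                got.append(name)
            else:                                   # one resource is busy: back off completely
                for g in got:
                    locks[g].release()
                return False
        return True

    def run_with_all(names, work):
        while not acquire_all(names):               # hold-and-wait never happens
            time.sleep(0.01)
        try:
            work()
        finally:
            for name in names:
                locks[name].release()

    run_with_all({"printer", "file"}, lambda: print("job done"))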
Ordered Requests:
Another prevention scheme is to impose an order on the resources and require processes to request resources in increasing order. This prevents cyclic wait and thus makes deadlocks impossible.
One advantage of prevention is that process aborts are never required due to deadlocks. While most systems can deal with rollbacks, some systems may not be designed to handle them and thus must use deadlock prevention.
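A minimal sketch of ordered requests; the resource ranking and helper names are assumptions.

    # Ordered requests: every resource has a fixed rank and locks are always taken
    # in increasing rank order, so a cycle of waiting processes cannot form.
    import threading

    RANK = {"disk": 1, "file": 2, "printer": 3}           # global ordering of resources
    locks = {name: threading.Lock() for name in RANK}

    def acquire_in_order(names):
        ordered = sorted(names, key=lambda n: RANK[n])    # never request a lower rank later
        for name in ordered:
            locks[name].acquire()
        return ordered

    def release_all(ordered):
        for name in reversed(ordered):
            locks[name].release()

    held = acquire_in_order({"printer", "disk"})          # acquired as disk, then printer
    print(held)
    release_all(held)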
- A method that might work is to order the resources and require processes to acquire them in strictly increasing order. This approach means that a process can never hold a high resource and ask for a low one, thus making cycles impossible.
- With global timing and transactions in distributed systems, two other methods are possible, both based on the idea of assigning each transaction a global timestamp at the moment it starts.
- When one process is about to block waiting for a resource that another process is using, a check is made to see which has the larger timestamp.
- We can then allow the wait only if the waiting process has a lower timestamp.
- The timestamp is always increasing if we follow any chain of waiting processes, so cycles are impossible (we could use decreasing order if we like; a small illustration follows this list).
- It is wiser to give priority to old processes because:
- they have run longer, so the system has a larger investment in these processes;
- they are likely to hold more resources.
- A young process that is killed off will eventually age until it is the oldest one in the system, which eliminates starvation.
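A small illustration of the timestamp rule above: along any permitted chain of waiting processes the timestamps strictly increase, so the chain can never loop back on itself. The function name and sample values are assumptions.

    def may_wait(waiter_ts, holder_ts):
        # allow the wait only if the waiting process has the lower (older) timestamp
        return waiter_ts < holder_ts

    chain = [3, 7, 12, 20]                    # timestamps along a chain of waits
    print(all(may_wait(a, b) for a, b in zip(chain, chain[1:])))   # True: strictly increasing
    print(may_wait(20, 3))                    # False: this wait would be refused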
Preemption:
Wait-die vs. Wound-wait:
- In both schemes, killing a transaction is relatively harmless because, by definition, it can be restarted safely later.
i. Wait-die:
- If an old process wants a resource held by a young process, the old one will wait.
- If a young process wants a resource held by an old process, the young process will be killed.
- Observation: the young process, after being killed, will start up again and be killed again. This cycle may repeat many times before the old process releases the resource.
- Once we are assuming the existence of transactions, we can do something that had previously been forbidden: take resources away from running processes.
- When a conflict arises, instead of killing the process making the request, we can kill the resource owner. Without transactions, killing a process might have severe consequences. With transactions, these effects vanish magically when the transaction dies.
ii. Wound-wait:
- If an old process wants a resource held by a young process, the old one preempts the young process: the young process is wounded (killed), restarts, and waits.
- If a young process wants a resource held by an old process, the young process will wait. (The two rules are contrasted in the sketch following this list.)
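A compact sketch contrasting the two rules; smaller timestamps stand for older transactions, and the function names and returned strings are assumptions.

    # Wait-die vs. wound-wait; a smaller timestamp means an older transaction.

    def wait_die(requester_ts, holder_ts):
        if requester_ts < holder_ts:     # old wants what young holds -> old waits
            return "requester waits"
        else:                            # young wants what old holds -> young dies
            return "requester is killed and later restarts with its original timestamp"

    def wound_wait(requester_ts, holder_ts):
        if requester_ts < holder_ts:     # old wants what young holds -> young is wounded
            return "holder is killed (wounded); requester takes the resource"
        else:                            # young wants what old holds -> young waits
            return "requester waits"

    print(wait_die(1, 9))    # old requester, young holder: requester waits
    print(wound_wait(1, 9))  # old requester, young holder: holder is wounded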