Global detection of resource leaks in a multi-node computer system

Patent Number: 8537662, issued on 2013/09/17
Application Number: 13/492,634, filed on 2012/06/08
Inventor(s): Eric Barsness, David Darrington, Amanda Randles, John Santosuosso
Assignee: International Business Machines Corporation

Abstract: A process is disclosed for identifying and recovering from resource leaks on compute nodes of a parallel computing system. A resource monitor stores information about system resources available on a compute node in a clean state. After the compute node runs a job, the resource monitor compares the current resource availability to the clean state. If a resource leak is found, the resource monitor contacts a global resource manager to remove the resource leak.
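
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class names, the snapshot fields, and the 64 MB default threshold are all hypothetical, and snapshots are passed in by hand rather than read from a real node. It only shows the idea of storing a clean-state baseline and later flagging a memory drop above a threshold or temporary files that were not present in the clean state.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ResourceSnapshot:
    """Hypothetical record of resource availability on one compute node."""
    free_memory_mb: int
    temp_files: FrozenSet[str]


class ResourceMonitor:
    """Illustrative per-node monitor; all names here are assumptions."""

    def __init__(self, node_id: str, memory_threshold_mb: int = 64) -> None:
        self.node_id = node_id
        self.memory_threshold_mb = memory_threshold_mb
        self.clean_state: Optional[ResourceSnapshot] = None

    def store_clean_state(self, snapshot: ResourceSnapshot) -> None:
        # Baseline taken while the node is known to be free of leaks.
        self.clean_state = snapshot

    def find_leaks(self, current: ResourceSnapshot) -> List[str]:
        """Compare the current snapshot to the stored clean state."""
        if self.clean_state is None:
            raise RuntimeError("clean state has not been stored")
        leaks: List[str] = []
        # A memory leak is reported only if the drop exceeds the threshold.
        drop = self.clean_state.free_memory_mb - current.free_memory_mb
        if drop > self.memory_threshold_mb:
            leaks.append(f"memory: {drop} MB less than clean state")
        # Files absent from the clean state are treated as orphaned.
        for path in sorted(current.temp_files - self.clean_state.temp_files):
            leaks.append(f"orphaned temporary file: {path}")
        return leaks
```

In a real system the monitor would report a non-empty result to the resource manager on the service node; here the caller simply inspects the returned list.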

Claims:
1. A method for correcting resource leaks that occur on a parallel computing system having a service node and a plurality of compute nodes, comprising: by a respective resource monitor executing on each compute node, storing a resource availability level reflecting a clean state of the respective compute node, wherein the clean state of the respective compute node is characterized by an absence of resource leaks on the respective compute node; responsive to a first compute node being programmatically selected, by a resource manager executing on the service node, to be monitored, determining, by the resource monitor executing on the first compute node, whether a resource leak has occurred on the first compute node; and upon determining that a resource leak has occurred on the first compute node and by the resource monitor executing on the first compute node, notifying the resource manager that the resource leak has occurred on the first compute node, whereupon the resource manager is configured to: identify one or more computing jobs completed by the first compute node; and remove the first compute node from a pool of available compute nodes, to prevent any job from being assigned to the first compute node.
2. The method of claim 1, further comprising: identifying at least a second compute node, of the plurality, that also executed the identified one or more computing jobs; and determining whether a resource leak has occurred on the second compute node; and upon determining that a resource leak has occurred on the second compute node, invoking the corrective action to restore a resource availability level of the second compute node to a clean state.
3. The method of claim 1, wherein the operation further comprises: prior to invoking the corrective action, removing the first compute node from a pool of available compute nodes; and after the clean state is restored on the first compute node, returning the first compute node to the pool of available compute nodes.
4. The method of claim 1, wherein the corrective action is selected from at least one of: rebooting the first compute node; and loading a system image of the first compute node captured in the clean state.
5. The method of claim 1, wherein the resource leak includes one or more orphaned temporary files.
6. The method of claim 1, wherein the resource leak comprises a decrease in memory available on the first compute node that exceeds a predetermined threshold.
7. The method of claim 1, further comprising: providing the resource manager executing on the service node and, for each of the plurality of compute nodes, the respective resource monitor executing on the respective compute node, wherein the resource manager is communicably connected to each resource monitor; by the resource manager, determining other compute nodes likely to have resource leaks, by identifying at least a second compute node, of the plurality, that also executed the identified one or more computing jobs; by the resource monitor executing on the second compute node, determining whether a resource leak has occurred on the second compute node; and upon determining that a resource leak has occurred on the second compute node and by the resource monitor executing on the second compute node, notifying the resource manager that the resource leak has occurred on the second compute node.
8. The method of claim 7, wherein the resource manager is configured to, upon being notified by the resource manager that the resource leak has occurred on the second compute node: identify one or more computing jobs completed by the second compute node; remove the second compute node from the pool of available compute nodes, to prevent any job from being assigned to the second compute node; after the second compute node is removed from the pool of available compute nodes, perform a corrective action to restore the second compute node to the clean state of the second compute node; and after the clean state is restored on the second compute node, return the second compute node to the pool of available compute nodes, to once again allow jobs to be assigned to the second compute node.
9. A non-transitory computer-readable medium containing a program which, when executed, performs an operation for correcting resource leaks that occur on a parallel computing system having a service node and a plurality of compute nodes, the operation comprising: by a respective resource monitor executing on each compute node, storing a resource availability level reflecting a clean state of the respective compute node, wherein the clean state of the respective compute node is characterized by an absence of resource leaks on the respective compute node; responsive to a first compute node being programmatically selected, by a resource manager executing on the service node, to be monitored, determining, by the resource monitor executing on the first compute node, whether a resource leak has occurred on the first compute node; and upon determining that a resource leak has occurred on the first compute node and by the resource monitor executing on the first compute node, notifying the resource manager that the resource leak has occurred on the first compute node, whereupon the resource manager is configured to: identify one or more computing jobs completed by the first compute node; and remove the first compute node from a pool of available compute nodes, to prevent any job from being assigned to the first compute node.
10. The computer-readable storage medium of claim 9, wherein the operation further comprises: identifying at least a second compute node, of the plurality, that also executed the identified one or more computing jobs; determining whether a resource leak has occurred on the second compute node; and upon determining that a resource leak has occurred on the second compute node, invoking the corrective action to restore a resource availability level of the second compute node to a clean state.
11. The computer-readable storage medium of claim 9, wherein the operation further comprises: prior to invoking the corrective action, removing the first compute node from a pool of available compute nodes; and after the clean state is restored on the first compute node, returning the first compute node to the pool of available compute nodes.
12. The computer-readable storage medium of claim 9, wherein the corrective action is selected from at least one of: rebooting the first compute node; and loading a system image of the first compute node captured in the clean state.
13. The computer-readable storage medium of claim 9, wherein the resource leak includes one or more orphaned temporary files.
14. The computer-readable storage medium of claim 9, wherein the resource leak comprises a decrease in memory available on the first compute node that exceeds a predetermined threshold.
15. A parallel computing system, comprising: a service node having a computer processor and a memory; a plurality of compute nodes, each having at least a computer processor and a memory; a program which, when executed on the parallel computing system, is configured to: by a respective resource monitor executing on each compute node, store a resource availability level reflecting a clean state of the respective compute node, wherein the clean state of the respective compute node is characterized by an absence of resource leaks on the respective compute node; responsive to a first compute node being programmatically selected, by a resource manager executing on the service node, to be monitored, determine, by the resource monitor executing on the first compute node, whether a resource leak has occurred on the first compute node; and upon determining that a resource leak has occurred on the first compute node and by the resource monitor executing on the first compute node, notify the resource manager that the resource leak has occurred on the first compute node, whereupon the resource manager is configured to: identify one or more computing jobs completed by the first compute node; and remove the first compute node from a pool of available compute nodes, to prevent any job from being assigned to the first compute node.
16. The system of claim 15, wherein the program is further configured to: identify at least a second compute node, of the plurality, that also executed the identified one or more computing jobs; determine whether a resource leak has occurred on the second compute node; and upon determining that a resource leak has occurred on the second compute node, invoke the corrective action to restore a resource availability level of the second compute node to a clean state.
17. The system of claim 15, wherein the program is further configured to: prior to invoking the corrective action, remove the first compute node from a pool of available compute nodes; and after the clean state is restored on the first compute node, return the first compute node to the pool of available compute nodes.
18. The system of claim 15, wherein the corrective action is selected from at least one of: rebooting the first compute node; and loading a system image of the first compute node captured in the clean state.
19. The system of claim 15, wherein the resource leak includes one or more orphaned temporary files.
20. The system of claim 15, wherein the resource leak comprises a decrease in memory available on the first compute node that exceeds a predetermined threshold.
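
The manager-side steps recited in the claims (identify the jobs a leaking node completed, remove the node from the pool, find other nodes that ran the same jobs, apply a corrective action, return the node to the pool) might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `ResourceManager`, `handle_leak_report`, `restore`, and the in-memory `job_history` mapping are hypothetical names, and the corrective action is a caller-supplied callback standing in for a reboot or a clean system-image reload.

```python
from typing import Callable, Dict, Set


class ResourceManager:
    """Illustrative service-node manager; all names are assumptions."""

    def __init__(self, job_history: Dict[str, Set[str]]) -> None:
        # Maps each compute node to the jobs it has completed.
        self.job_history = job_history
        # Every known node starts out available for scheduling.
        self.pool: Set[str] = set(job_history)

    def handle_leak_report(self, node: str) -> Set[str]:
        """React to a leak report and return other nodes worth checking."""
        # Identify the jobs completed by the leaking node.
        jobs = self.job_history.get(node, set())
        # Remove the node so that no new job is assigned to it.
        self.pool.discard(node)
        # Nodes that ran any of the same jobs are likely leak candidates.
        return {
            other
            for other, other_jobs in self.job_history.items()
            if other != node and jobs & other_jobs
        }

    def restore(self, node: str, corrective_action: Callable[[str], object]) -> None:
        """Apply the corrective action, then return the node to the pool."""
        self.pool.discard(node)   # ensure the node is out of the pool first
        corrective_action(node)   # e.g. reboot or reload a clean image
        self.pool.add(node)       # node may receive jobs again
```

For example, with a history in which nodes `n0` and `n1` both ran `jobA`, a leak report for `n0` takes `n0` out of the pool and returns `{"n1"}` as the set of nodes whose monitors should be asked to check for leaks next.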