a cluster where there is an active Generic Script resource, the cluster may
become unresponsive. Cluster Administrator and Cluster.exe appear to stop
responding (hang). The cluster log shows blocked threads inside a
Generic Script resource. For example:
000007c4.000007e4::2002/12/12-19:17:03.781 INFO [FM] FmpRmOnlineResource: called InterlockedIncrement on gdwQuoBlockingResources for resource f37f58fb-03ff-44b3-a4d7-086b0838d73d
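When scanning a large cluster log for entries like the one above, it can help to split each line into its visible fields. The following is a minimal sketch, not an official parser: the function name `parseClusterLogLine` and the field names are hypothetical, and reading the two leading hexadecimal values as process ID and thread ID is an assumption based on the layout of the sample line.

```javascript
// Hypothetical helper: splits one cluster log line into its visible fields.
// The field names are assumptions inferred from the sample line, not an official schema.
function parseClusterLogLine(line) {
  var pattern = /^([0-9a-f]{8})\.([0-9a-f]{8})::(\d{4}\/\d{2}\/\d{2})-(\d{2}:\d{2}:\d{2}\.\d{3}) (\w+) \[(\w+)\] (.*)$/i;
  var m = pattern.exec(line);
  if (!m) {
    return null; // the line does not follow the assumed layout
  }
  // Pick out a resource GUID if the message text contains one.
  var guid = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i.exec(m[7]);
  return {
    processId: parseInt(m[1], 16), // assumed meaning of the first hex field
    threadId: parseInt(m[2], 16),  // assumed meaning of the second hex field
    date: m[3],
    time: m[4],
    level: m[5],
    component: m[6],
    message: m[7],
    resourceId: guid ? guid[0] : null
  };
}

var sampleLine = "000007c4.000007e4::2002/12/12-19:17:03.781 INFO [FM] " +
  "FmpRmOnlineResource: called InterlockedIncrement on gdwQuoBlockingResources " +
  "for resource f37f58fb-03ff-44b3-a4d7-086b0838d73d";
var parsedEntry = parseClusterLogLine(sampleLine);
console.log(parsedEntry.component);  // FM
console.log(parsedEntry.resourceId); // f37f58fb-03ff-44b3-a4d7-086b0838d73d
```

Grouping parsed entries by `resourceId` is one way to see which resource the blocked threads belong to.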
The event log contains a message similar to either of the following:
Event ID: 1232
Event Type: Error
Cluster generic script resource MyScript timed out. Online script
entry point did not complete execution in a timely manner. This could be due to
an infinite loop or a hang in this entry point, or the pending timeout may be
too short for this resource. Please review the Online script entry point to
make sure there's no infinite loop or a hang in the script code, and then
consider increasing the pending timeout value if necessary. In a command shell,
run "cluster res "MyScript" /prop PersistentState=0" to disable this resource,
and then run "net stop clussvc" to stop the cluster service. Ensure that any
problem in the script code is fixed. Then run "net start clussvc" to start the
cluster service. If necessary, ensure that the pending time out is increased
before bringing the resource online again.
Event ID: 1233
Event Type: Error
Event Source: ClusSvc
Cluster generic script resource MyScript: Request to perform the Online
operation will not be processed. This is because of a previous failed attempt
to execute the Online entry point in a timely fashion. Please review the script
code for this entry point to make sure there is no infinite loop or a hang in
it, and then consider increasing the resource pending timeout value if
necessary. In a command shell, run "cluster res "MyScript" /prop
PersistentState=0" to disable this resource, and then run "net stop clussvc" to
stop the cluster service. Ensure that any problem in the script code is fixed.
Then run "net start clussvc" to start the cluster service. If necessary, ensure
that the pending time out is increased before bringing the resource online
again.
A Generic Script resource script can cause the whole cluster to stop responding
if any of the following conditions exist:
- The Generic Script resource script contains an infinite
loop (and therefore never exits).
- The script calls certain cluster application programming interfaces (APIs).
These calls must be avoided from within a resource DLL or resource
script because they can cause a cluster-wide deadlock. The script may
call cluster APIs directly, or it may start Cluster.exe (which may in turn call
cluster APIs that must be avoided) as one of its steps. For
information about APIs that should not be called from a resource DLL or
script, see “Function Calls to Avoid in Resource DLLs” in the Microsoft Platform SDK (PSDK).
- An action the Generic Script resource script is performing
takes longer than the pending timeout value.
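The first and third conditions can often be guarded against by giving every wait loop in the script an explicit time budget that is shorter than the pending timeout. The sketch below illustrates the idea in JScript-compatible JavaScript. It is not taken from any Microsoft sample: `waitUntilReady`, `PENDING_TIMEOUT_MS`, and `SAFETY_MARGIN_MS` are hypothetical names, and the 180000 ms figure is an assumed value that should be replaced with the pending timeout actually configured for the resource. The clock and sleep functions are passed in as parameters because the facilities available for them depend on the script host.

```javascript
// Sketch only: a bounded polling loop for use inside a script entry point.
// Both constants are illustrative assumptions, not values read from the cluster.
var PENDING_TIMEOUT_MS = 180000; // assumed; use the value configured for the resource
var SAFETY_MARGIN_MS = 30000;    // assumed headroom so the entry point returns early

// checkReady: hypothetical function that returns true once the work is complete.
// now: returns the current time in milliseconds.
// sleep: pauses for the given number of milliseconds.
function waitUntilReady(checkReady, now, sleep, budgetMs, pollMs) {
  var deadline = now() + budgetMs;
  while (now() < deadline) {
    if (checkReady()) {
      return true;
    }
    sleep(pollMs);
  }
  return false; // give up so that the entry point can return and report failure
}

// Usage with a simulated clock, so the sketch runs anywhere without real waiting.
var fakeTime = 0;
var becameReady = waitUntilReady(
  function () { return false; },     // a dependency that never becomes ready
  function () { return fakeTime; },
  function (ms) { fakeTime += ms; },
  PENDING_TIMEOUT_MS - SAFETY_MARGIN_MS,
  5000
);
console.log(becameReady); // false: the loop gave up after 150000 simulated ms
```

Returning a failure before the deadline lets the entry point exit on its own terms, instead of leaving a thread running past the pending timeout.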
To avoid an indefinite hang, the Cluster Resource
Monitor refuses to perform any operations (such as Online, Offline, IsAlive,
and LooksAlive) on the script after any operation has exceeded the pending
timeout value. Any additional attempts to perform Generic Script resource
operations on that resource result in the second event log message (Event ID 1233)
that is shown in the "Symptoms" section of this article.
The Cluster Resource Monitor will not perform any additional operations on
a Generic Script resource after any entry point has exceeded the pending
timeout value, but the problematic thread will continue to run.
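The behavior described above can be pictured with a small toy model. This is only an illustration of the rule as stated in this article, not the Resource Monitor's actual code. The names `createScriptResourceModel`, `invoke`, and `restartService` are hypothetical, and the duration passed to `invoke` simply stands in for how long an entry point would run.

```javascript
// Toy model of the behavior described above; not the Resource Monitor's actual code.
function createScriptResourceModel(pendingTimeoutMs) {
  var blocked = false; // set once any entry point exceeds the pending timeout
  return {
    // durationMs stands in for how long the entry point would run.
    invoke: function (entryPoint, durationMs) {
      if (blocked) {
        // Later requests are refused (Event ID 1233).
        return { entryPoint: entryPoint, outcome: "refused", eventId: 1233 };
      }
      if (durationMs > pendingTimeoutMs) {
        blocked = true; // stays set until the Cluster service is restarted
        return { entryPoint: entryPoint, outcome: "timed out", eventId: 1232 };
      }
      return { entryPoint: entryPoint, outcome: "completed", eventId: null };
    },
    // Models stopping and restarting the Cluster service, which ends the stuck thread.
    restartService: function () {
      blocked = false;
    }
  };
}

var model = createScriptResourceModel(180000); // assumed pending timeout in ms
console.log(model.invoke("Online", 600000).eventId);  // 1232
console.log(model.invoke("IsAlive", 10).outcome);     // refused
model.restartService();
console.log(model.invoke("Online", 2000).outcome);    // completed
```

The model makes one point visible: after the first timeout, even a fast entry point such as IsAlive is refused until the service has been restarted.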
To resolve the problem, disable the resource (that is, prevent it from coming
online), stop the Cluster service (this terminates the
problematic thread), fix the script problem, and then restart the Cluster
service. Depending on the cause of this problem, you may want to increase the
online or offline pending timeout value for this resource. For step-by-step instructions, see the "Recover and Restart the Cluster Service" section later in this article.
Changing Pending Timeout Values
Any cluster resource operation should finish well within the pending timeout. For this
reason, do not change the timeout value without a thorough understanding of
why your script entry point exceeds this period of time. Also,
consider all the implications of increasing this value, because when an entry point
stops responding, the cluster remains unresponsive until the timeout value is exceeded.
Recover and Restart the Cluster Service
- Disable the resource (in this example, named MyScript) by typing
the following command:
cluster resource "MyScript" /properties PersistentState=0
- Stop the Cluster service on the node that currently owns
this resource’s group by typing the following
command in a console window:
net stop clussvc
- Fix any problem that you identify in the script that causes it
to stop responding, loop, or exceed the pending timeout value. You may determine that
the appropriate thing to do is to increase the pending timeout value, but make
sure that you carefully consider the implications of doing so.
- Restart the Cluster service by typing the following command:
net start clussvc
- Bring the resource back online manually by using
Cluster Administrator or Cluster.exe. To do so, type the following command:
cluster resource "MyScript" /online
Note that bringing the resource back online automatically sets PersistentState to 1, so there is no need for an additional command to change the value.
Microsoft has confirmed that this is a bug in the Microsoft products that are
listed at the beginning of this article.