In this Document [ID 345090.1]
Symptoms | Changes | Cause | Solution | References
The Workflow Background Process (WFBG) is running for hours without completing.
The wf_item_activity_statuses_h table has over 80 million records, and is growing quickly.
The problem started after a new custom Workflow process was introduced to handle rate change orders.
It was observed that new items are being added at a very fast rate to the WF deferred queue.
Another symptom of this problem may be rapid growth in archive log size while the WFBG process is running.
STEPS
1. Submit the Workflow Background Process with the YNN or YYY options.
BUSINESS IMPACT
Since the Workflow uses a deferred booking mechanism, the orders are not being picked up and booked
by the WFBG as this process never completes normally. As such, orders cannot be picked and shipped.
Another impact is the large amount of data being built up in the Workflow tables, leading to large redo
logs in the database.
A custom Workflow process was introduced into Production recently.
The root cause of this problem is the extremely small relative wait times (such as 0.00001) on the custom WF items.
These small wait times were placed on activities in a processing loop which checks whether a certain condition is
met, and the item gets deferred each time that condition is met. Every time this happens, a row gets inserted in
tables wf_item_activity_statuses and wf_item_activity_statuses_h, and a message is placed in the WF deferred
queue.
If there are a lot of such WAIT activities and the WFBG runs often - for example every 5 minutes - this can place a
great burden on the system CPU and storage, as well as on table and index structures.
Note: A WAIT activity with Wait Mode set to Absolute Date and no Absolute Date specified also translates into a
relative wait time of zero. If this WAIT activity is placed in a loop within the Workflow, then the same symptoms
described above can occur.
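To put rough numbers on the wait times discussed above, the following sketch converts a relative wait time into seconds and estimates a lower bound on the history rows generated per day. It assumes that the Relative Time value is expressed in days, that every WFBG run finds each looping item eligible again, and that each pass inserts one history row. The function names and the item count are hypothetical and are not part of Oracle Workflow.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def relative_wait_seconds(relative_days):
    """Convert a relative wait time to seconds (assumes the value is in days)."""
    return relative_days * SECONDS_PER_DAY


def min_history_rows_per_day(looping_items, run_interval_minutes):
    """Lower bound on history rows per day.

    Assumes each WFBG run processes every looping item at least once
    and that each pass inserts one row into the history table.
    """
    runs_per_day = MINUTES_PER_DAY // run_interval_minutes
    return looping_items * runs_per_day


# A relative wait of 0.00001 days is under one second.
wait = relative_wait_seconds(0.00001)
print(f"wait = {wait:.3f} seconds")

# Hypothetical volume: 10,000 looping items, WFBG submitted every 5 minutes.
print(min_history_rows_per_day(10_000, 5), "rows per day (lower bound)")
```

Because the wait elapses long before the next run starts, the real row count can be far higher than this lower bound whenever a single run revisits the same item several times.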
There is an additional problem, however. If a WAIT activity as described above is processed by WFBG and
then deferred again because the loop condition is met, it gets enqueued in the Workflow deferred queue for the
specified amount of time, which in the above cases is just over a minute, or even mere seconds. Once WFBG is
done processing all other items in the deferred queue, it will AGAIN pick up this activity it has earlier enqueued.
If the specified wait time has already elapsed - which is almost certain with this workflow design - then the item
will be processed and, if the loop condition is still met, enqueued yet again. If there are enough such items,
this can easily create an infinite loop and cause WFBG to never complete.
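The re-enqueue behavior described above can be illustrated with a toy simulation. This is only a sketch of the mechanism, not the actual WFBG implementation: the deferred queue is modeled as a simple FIFO, the loop condition is assumed to remain met, and the per-activity processing time and the pass limit are hypothetical parameters.

```python
from collections import deque


def simulate_wfbg_run(num_items, wait_seconds, seconds_per_activity, max_passes):
    """Simulate one background run over looping WAIT items.

    Returns (completed, activities_processed). The run completes when the
    item at the head of the queue is not yet eligible; it is cut off after
    max_passes activities so that the simulation itself always terminates.
    """
    clock = 0.0
    # Each queue entry is the simulated time at which an item becomes eligible.
    queue = deque([0.0] * num_items)
    processed = 0
    while queue and processed < max_passes:
        if queue[0] > clock:
            # Nothing left that is eligible: the run finishes normally.
            return True, processed
        queue.popleft()
        clock += seconds_per_activity
        processed += 1
        # Loop condition assumed still met: defer the item again.
        queue.append(clock + wait_seconds)
    return (not queue), processed


# 1000 items, 0.864 second wait, 0.01 seconds of work per activity:
# by the time the queue has been drained once, every wait has elapsed.
print(simulate_wfbg_run(1000, 0.864, 0.01, 50_000))

# Same workload with a 4 hour wait: the run stops after one pass per item.
print(simulate_wfbg_run(1000, 4 * 3600, 0.01, 50_000))
```

In this model the run can only finish when the wait time exceeds the time needed to work through the rest of the queue, which is why a longer wait lets the process complete.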
This problem was discussed in bug 4466640.
The following is an excerpt from the bug:
1. Technical Analysis of the problem :
Due to poor custom workflow design, WFBG process for deferred OEOL items was running indefinitely.
This was due to WAIT activities placed in a loop with a condition checking activity and with node attributes
resulting in wait time of zero, or near zero. This caused WFBG to spin in a loop between the WAIT activity
and the condition checking activity for as long as the condition was met.
2. Technical Resolution of the problem :
Just altering the workflow definition using Workflow Builder was not enough. This would create a new process
version, resolving the problem for future WF items. However, the existing process versions would remain
unaffected, so a SQL*Plus data fix script was required to change all process versions. The script sets Wait Mode
to Relative Time and Relative Time to the value given by the operator, and is run one WAIT activity at a time.
3. SQL scripts created for:
(a) Identifying the Problem Transactions :
list_waits_and_defers.sql
(b) Fixing the Problem Transactions :
update_rel_time2.sql
4. Recommendations for customer to avoid re-occurrence in future :
Whenever possible, the standard Wait activity (wf_standard.wait()) should not be placed in a loop with a small
wait time (shorter than several hours). Alternative workflow designs should be found. Wf_standard.wait()
with a wait time of zero or near zero, or the standard Defer activity (wf_standard.defer()), must NEVER be placed
in a loop in a workflow process design diagram.
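The recommendation above can be turned into a simple design check. The sketch below is a hypothetical review aid, not an Oracle tool: it models a process diagram as a plain dictionary of nodes and transitions, and flags any Defer node, or any Wait node with a wait shorter than a chosen threshold, that lies on a loop. The 4 hour threshold, the node names, and the data layout are all assumptions made for illustration.

```python
def risky_nodes(nodes, transitions, min_wait_hours=4.0):
    """Return names of WAIT/DEFER nodes that sit on a loop and wait too briefly.

    nodes:       {name: (kind, wait_hours)}, kind is "WAIT", "DEFER" or other
    transitions: {name: [names of following nodes]}
    """
    def on_cycle(start):
        # Depth-first search: can we get back to the starting node?
        stack = list(transitions.get(start, []))
        seen = set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(transitions.get(node, []))
        return False

    flagged = []
    for name, (kind, wait_hours) in nodes.items():
        if not on_cycle(name):
            continue
        if kind == "DEFER" or (kind == "WAIT" and wait_hours < min_wait_hours):
            flagged.append(name)
    return sorted(flagged)


# Hypothetical diagram: CHECK_RATE loops through WAIT_RETRY until it can book.
sample_nodes = {
    "WAIT_ONCE": ("WAIT", 0.0),        # short wait, but not inside a loop
    "CHECK_RATE": ("FUNCTION", None),
    "WAIT_RETRY": ("WAIT", 0.00024),   # roughly 0.00001 days, inside the loop
    "BOOK": ("FUNCTION", None),
}
sample_transitions = {
    "WAIT_ONCE": ["CHECK_RATE"],
    "CHECK_RATE": ["WAIT_RETRY", "BOOK"],
    "WAIT_RETRY": ["CHECK_RATE"],
}
print(risky_nodes(sample_nodes, sample_transitions))
```

Only the Wait node inside the loop is reported here; a short Wait that runs once before the loop is left alone, in line with the recommendation's focus on loops.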
To resolve the problem on existing WF items, a data fix is necessary. Support should log a bug similar to bug 4466640
and request the same type of fix.
To prevent the problem from occurring on new WF items, please see the recommendations above.