Let’s take a trip to the past…we weren’t actually there, but we can pretend. Let’s look into a conversation between a DBA, developers, and management at some time ‘way back when.’
Management: Our system has been performing poorly for months now.
Developers: We didn’t change anything.
DBA: I didn’t change anything.
Management: We need a fix.
Developers: We don’t have time to go through all our code and tune it.
DBA: *frantically types: alter system set optimizer_index_caching = 35; alter system set optimizer_index_cost_adj = 25;* Fixed!
Queries are not performing as expected. Sometimes a query will run fine; sometimes it runs horribly. Sometimes a combination of queries and users results in massive buffer busy waits or ‘latch: cache buffers chains’ waits. Nothing is working, so finally a consultant is called.
You’ve probably guessed that I was the consultant. Move to the head of the class! In particular, there were three or four main offenders; queries that would sometimes perform well, then all of a sudden ‘turn horrible.’ The core of the problem was truly unstable execution plans.
Using ASH I was able to go through different times of day and compare reports from times where the system performed well with times the system performed horribly. Doing so allowed me to compare SQL_IDs between different times of day to see which queries seemed to be going through dramatic changes. Once a SQL_ID was found that experienced such a dramatic change, I used DBMS_XPLAN.DISPLAY_AWR to find all the different iterations of the query.
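The comparison can be sketched as a query against the AWR’s copy of ASH. This is a hedged illustration, not the exact query I ran — the one-day window and hourly bucketing are assumptions:

```sql
-- Sketch: per-hour sampled activity for each SQL_ID and plan hash,
-- to spot queries whose plan (and footprint) changes between
-- good and bad periods. Window/bucketing are illustrative.
select sql_id,
       sql_plan_hash_value,
       trunc(sample_time, 'HH') sample_hour,
       count(*) samples
from   dba_hist_active_sess_history
where  sample_time > sysdate - 1
and    sql_id is not null
group by sql_id, sql_plan_hash_value, trunc(sample_time, 'HH')
order by sql_id, sample_hour;
```

A SQL_ID that shows heavy sampling under one plan hash during the bad windows and a different plan hash during the good ones is a prime suspect.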
In case you’ve never used it, DBMS_XPLAN.DISPLAY_AWR is very useful. It allows you to input only a SQL_ID, and Oracle will show you the explain plans for that SQL_ID as recorded in the AWR. For instance:
select * from TABLE(dbms_xplan.display_awr('95t7cmj84u4jz'));
The next step was to look through the results to find any inconsistencies large enough to result in huge differences in performance. In this situation, I found that ‘good’ explain plans were using decent indexes, NESTED LOOPs, and low overall bytes. The ‘bad’ explain plans were using index joins, where an index fast full scan would join with an index range scan into a view (e.g. index$_join$_003). These plans resulted in many more block touches and much more CPU used.
Using Grid Control, I looked at the timeline for each plan hash value of the SQL_ID and saw that the poorly performing plans lined up with the periods of bad performance.
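If Grid Control isn’t handy, DBA_HIST_SQLSTAT tells roughly the same story. A hedged sketch (using the example SQL_ID from above), comparing per-execution buffer gets and CPU for each plan:

```sql
-- Sketch: per-plan resource usage for one SQL_ID from AWR.
-- The 'bad' plan hash should stand out with far higher
-- gets-per-exec and CPU-per-exec.
select plan_hash_value,
       sum(executions_delta) execs,
       round(sum(buffer_gets_delta) /
             nullif(sum(executions_delta), 0)) gets_per_exec,
       round(sum(cpu_time_delta) /
             nullif(sum(executions_delta), 0)) cpu_per_exec
from   dba_hist_sqlstat
where  sql_id = '95t7cmj84u4jz'
group by plan_hash_value
order by gets_per_exec;
```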
Of course, now the burning question: what can be done about this?
One solution is to set the _index_join_enabled parameter to FALSE, disabling index joins completely. Based on query analysis, it appears that for now the pros far outweigh the cons. But what about queries where index joins perform more optimally? What if Oracle replaces the index joins with something even worse?
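For reference, the band-aid would look like this. Keep in mind that underscore parameters are unsupported and should normally only be touched with Oracle Support’s guidance; testing at session level first is the safer path:

```sql
-- Unsupported underscore parameter: quote the name.
-- Try it for one session first:
alter session set "_index_join_enabled" = false;

-- Only if agreed on, instance-wide:
alter system set "_index_join_enabled" = false;
```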
The better solution is to negate the need for OPTIMIZER_INDEX_COST_ADJ and OPTIMIZER_INDEX_CACHING to be set to such extreme values (or to be set at all). Though these parameters can be very useful in the hands of a skilled DBA, they can also be very dangerous if improperly used. In this case, the CBO never would have chosen to use index joins if the OPTIMIZER_INDEX_COST_ADJ weren’t set so extremely low.
For testing purposes I took one of the problem queries and ran it with OPTIMIZER_INDEX_COST_ADJ = 100 and OPTIMIZER_INDEX_CACHING = 0. The query performed no index joins, but it did perform two full table scans and overall the results were less than satisfactory. Without those two parameters, the queries remained ‘unfixed’ as far as runtime is concerned.
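This kind of test can be run at session level, leaving the instance settings alone. A general sketch — the display_cursor call simply shows the plan the last statement actually used:

```sql
-- Reset the two parameters to their defaults for this session only:
alter session set optimizer_index_cost_adj = 100;
alter session set optimizer_index_caching = 0;

-- Re-run the problem query here, then check the plan it really got:
select * from table(dbms_xplan.display_cursor(null, null));
```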
So in the end, I struck a compromise with the client. If another period of poor performance occurs, we can attempt setting “_index_join_enabled” to false to try to band-aid the band-aid of their performance problems. However (and that’s a big however), they need to take the time to work on a more permanent solution. For instance, having development test and tune queries with OPTIMIZER_INDEX_CACHING = 0 and OPTIMIZER_INDEX_COST_ADJ = 100 (or at least less extreme values).
Otherwise it’s a band-aid…then another…then another…until the true performance issue is so obfuscated it may never be found.
I’ve worked with another client that does something similar. If they use a resource manager plan, their system does not become overloaded and things seem to work very well. Turn off the plan and the run queue goes through the roof (into the hundreds) and cache buffers chains latches abound. Unfortunately, whenever a problem happens, all one sees consuming DB time is ‘resmgr: cpu quantum’. For them, throttling is a fix. I’d prefer a full tuning expedition that fixes the actual cache buffers chains issue.