CPU Spikes and Top CPU SQL

By Sai S Selvaganesan and Lei Zeng

This article was originally published in the IOUG SELECT Journal, Q2 2011. Thanks to April Sims for editing this article and providing advice.

Supporting mission-critical production databases, an Oracle database administrator constantly faces questions when applications run slow on an otherwise well-optimized and stable environment. Myriad factors can cause this degradation, including CPU spikes on a database server. Above an acceptable range, these spikes impact the response time and throughput of every application that the database server serves. With years of experience, and working under the assumption that the database is guilty until proven innocent, our team has developed a systematic approach. We use data from the Automatic Workload Repository (AWR) and Active Session History (ASH) to identify the SQL statements that are suffering as a result of these spikes, and to determine which of them could be the root cause of the spikes. This article discusses this topic and illuminates several discoveries using two real-life case studies.

Introduction

We closely monitor our production database systems, and CPU spikes are always at the top of the watch list. Although CPU spikes vary in intensity and duration, the impact is obvious. CPU spikes point directly to contention for and starvation of system resources, resulting in reduced database performance. On a database server that is part of a RAC architecture, prolonged CPU spikes can cause the cluster database to become unresponsive, which may result in instance or node eviction.

We have a very systematic and well-documented approach for building the underlying infrastructure for a database. One of the key rules is to build and deploy the hardware stack such that peak baseline CPU utilization never exceeds 50 percent, with no degradation in performance for the defined application workload. Over the years we have found that this has helped us deliver systems that meet or beat the high expectations of our management. From a rudimentary capacity-planning standpoint this makes sense, but it assumes that the application workload is well-defined and the application is optimized. However, problems still arise even on carefully engineered systems. Several times, with absolutely no code or workload change, application performance degraded as a result of sudden CPU spikes.

With this in mind, as a DBA team we have developed a systematic approach to troubleshooting database performance. This approach also provides compelling evidence of whether anomalies were occurring inside the database during the CPU spike window. With Oracle Database 10g's introduction of AWR and ASH, we have formalized a series of steps to analyze their data with the aim of nailing down the problematic SQL statements that could cause CPU spikes. (By now, we should all be aware that AWR collects instance-level performance data, while ASH collects session-level performance data!) By combining these two powerful tools, we have been successful in unfolding CPU spike mysteries. The approach outlined here can be used on both 10g and 11g databases.

Assumptions

In this analysis, the first assumption is that all causes external to the database have been eliminated, and that a SQL statement (or statements) is causing the CPU spike on the database server. Several known processes running on a database server may contribute to CPU spikes: the database backup process, the OEM agent process, or other OS daemon processes. This can easily be confirmed with system monitoring tools, such as Oracle's OS Watcher, which tracks the top CPU-consuming processes at the OS level during the CPU spike.

Secondly, our approach is built on top of the performance data maintained automatically by AWR and ASH. The CPU spike may have happened in the past, or it may be an ongoing issue. Our approach works consistently for both situations, provided there is enough AWR and ASH data available to cover the CPU spike time, past or present. Both AWR and ASH are part of the Oracle Diagnostics Pack, which must be licensed as a separate option. Oracle has a default retention policy for AWR data (seven days), but you can adjust it according to your needs.
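
The retention and snapshot interval can be changed with the documented DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS procedure (both values are in minutes). A minimal sketch; the 30-day/30-minute values below are illustrative, not a recommendation:

SQL> -- check the current AWR settings
SQL> SELECT snap_interval, retention FROM dba_hist_wr_control;

SQL> -- keep 30 days of snapshots, taken every 30 minutes
SQL> EXEC DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(retention => 43200, interval => 30);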

Methodology

Our team's premise is that the performance data in AWR and ASH at the CPU spike time is skewed compared to the data at non-spike times. Our approach is a symptom-driven, bottom-up technique that focuses on understanding the outstanding variations. The challenge of using AWR data, though, lies in the fact that it sums up information at the instance level, where symptoms and problems can both be involved. The first step in tackling any CPU spike issue starts with one symptom: the SQL statement with the highest CPU time (defined as the top CPU SQL). Be aware that the identified top CPU SQL may or may not be the real problem. After you identify the top CPU-consuming SQL, review other relevant AWR data looking for problem SQL statements. Any finding from AWR data is then confirmed with the corresponding ASH data: identify the sessions that ran the top CPU SQL, and examine all aspects of their session-level information, such as wait events, blocking sessions, etc.

When using this approach, follow two basic guidelines:

  • Both AWR and ASH data are time-based data. If the top CPU SQL is not new to the database, one should be able to compare and confirm findings from AWR or ASH data across time.

  • Both AWR and ASH data are dimension-based data. Figure 1 and Figure 2 show the major dimensions in AWR and ASH data for Oracle 10g, which hold true for 11g as well. Supporting evidence should be drawn from multiple dimensions in the AWR data as well as in the ASH data before a conclusion is reached.

To expedite the analysis, our team created a series of SQL scripts that extract data directly from the AWR and ASH data dictionary views, instead of generating full AWR or ASH reports.

To summarize the steps involved:

  • For the CPU spike under investigation, first identify the begin snapshot id and end snapshot id that bound the AWR and ASH data at the CPU spike time. The script awr_snapshots.sql is provided to take the instance_number as an input parameter and output the list of snapshot ids for us to choose from.


  • For the AWR data identified above, extract the top CPU SQL by using the script awr_top_cpu_sql.sql. It takes the instance_number, begin_snapshot_id and end_snapshot_id as input parameters, and displays the 10 SQL statements with the highest CPU time in the view DBA_HIST_SQLSTAT. To provide a rolling window that shows the movement of SQL statements over time, this script works at the base AWR snapshot interval and produces the list for every snapshot between begin_snapshot_id and end_snapshot_id. The goal is to check whether a SQL statement stands consistently at the top of the list during the CPU spike, but does not show up as a top CPU-consuming SQL during periods when there were no spikes. (A sketch of this kind of query appears after this list.)


  • Once the top CPU SQL is identified, retrieve its run-time information from the full AWR data. The script awr_sql_run_history.sql is provided to take the instance_number and sql_id as input parameters, and output all records found in the view DBA_HIST_SQLSTAT. The impact of a SQL statement on the database is demonstrated by the values in the columns CPU_TIME_DELTA, ROWS_PROCESSED_DELTA, BUFFER_GETS_DELTA, DISK_READS_DELTA, EXECUTIONS_DELTA and PARSE_CALLS_DELTA. By comparing the values at the CPU spike time with those at non-spike times, you should be able to see whether the changes are pronounced enough to make this SQL statement the problem. Otherwise, consider it only a symptom.


  • Check other dimensions of the AWR data to confirm the findings drawn from the top CPU SQL identified above. The scripts awr_top_events.sql and awr_load_profile.sql are provided to take the instance_number, begin_snapshot_id and end_snapshot_id as input parameters, and display the instance-level top-five timed events and load profile for review.


  • Further dig into the database sessions that ran the top CPU SQL at the CPU spike time and check their session-level information across multiple dimensions of the ASH data. The goal is to use the Oracle Wait Interface to see whether contention is happening for those sessions and, if so, where those contention points are coming from. The script ash_session_for_sqlid.sql is provided to take the instance_number, begin_snapshot_id, end_snapshot_id and sql_id as input parameters, and display the matching sessions in the view DBA_HIST_ACTIVE_SESS_HISTORY. You can then use the scripts ash_session_for_sid.sql and ash_session_details.sql to move from the session dimension to the wait event, blocking session and application dimensions.


  • Also, sum up the ASH data for all sessions at the CPU spike time to confirm the findings about the top CPU SQL sessions. The script ash_blocking_session.sql is provided to take the instance_number, begin_snapshot_id and end_snapshot_id as input parameters and roll up the session count by the columns EVENT and BLOCKING_SESSION in the view DBA_HIST_ACTIVE_SESS_HISTORY, which makes it easy to detect whether a particular session, wait event, or combination of both is a hot spot. The script ash_top_events.sql is provided to display the top-10 wait events based on the session count in the ASH data, which we then reconcile with the top-five timed events in the AWR data. (A sketch of the ASH-side queries also appears after this list.)
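
The scripts named above are our own and are not reproduced here, but the following sketch shows, under our assumptions about their logic, the kind of queries they run against the standard AWR views DBA_HIST_SNAPSHOT and DBA_HIST_SQLSTAT (time columns in DBA_HIST_SQLSTAT are stored in microseconds):

-- snapshot range around the spike (roughly what awr_snapshots.sql does)
SELECT snap_id, begin_interval_time, end_interval_time, snap_level
  FROM dba_hist_snapshot
 WHERE instance_number = &instance_number
 ORDER BY snap_id;

-- CPU time per SQL and snapshot in the chosen range (roughly what awr_top_cpu_sql.sql does)
SELECT snap_id, sql_id, plan_hash_value,
       ROUND(cpu_time_delta/1e6)     AS cpu_seconds,
       ROUND(elapsed_time_delta/1e6) AS elapsed_seconds,
       rows_processed_delta, buffer_gets_delta, disk_reads_delta,
       executions_delta, parse_calls_delta
  FROM dba_hist_sqlstat
 WHERE instance_number = &instance_number
   AND snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
 ORDER BY snap_id, cpu_time_delta DESC;

The ASH-side scripts can be sketched in the same way against DBA_HIST_ACTIVE_SESS_HISTORY:

-- sessions sampled while running a given SQL (roughly what ash_session_for_sqlid.sql does)
SELECT snap_id, sample_time, session_id, session_state,
       sql_id, event, blocking_session
  FROM dba_hist_active_sess_history
 WHERE instance_number = &instance_number
   AND snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
   AND sql_id = '&sql_id'
 ORDER BY sample_time;

-- roll samples up by wait event and blocker (roughly what ash_blocking_session.sql does)
SELECT event, blocking_session, COUNT(*) AS waiting_session_count
  FROM dba_hist_active_sess_history
 WHERE instance_number = &instance_number
   AND snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
 GROUP BY event, blocking_session
 ORDER BY waiting_session_count DESC;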


Having described the basics, we will elaborate on our analytic technique using the following two case studies.

Case Study 1

The monitoring tool alerted us that a particular database server was experiencing CPU spikes above 90 percent. We notified the on-call DBA and started with the UNIX top command at the OS level to check what was consuming significant CPU resources. Despite the difficulty of having more than 10 Oracle processes constantly at the top of the list, with their process ids and positions shuffling up and down, the DBA eventually traced them down to certain database sessions and the SQL statements they were currently running.
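
For reference, an OS process id reported by top can be mapped to its database session with a query along these lines (a minimal sketch that assumes dedicated server processes; :os_pid is the pid taken from top):

SELECT s.sid, s.serial#, s.sql_id, s.event, s.username
  FROM v$process p, v$session s
 WHERE s.paddr = p.addr    -- each session points to its server process
   AND p.spid  = :os_pid;  -- OS process id shown by top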

Then the real questions started. The SQL statements identified were all legitimately issued from an application server. According to the application support team, these SQLs were not new and the database runs them on an ongoing basis throughout the day. So why did the CPU spikes show up all of a sudden at this particular time? The on-call DBA also verified that there was no change in the SQL execution plans, and the wait event of those database sessions was db file sequential read most of the time, indicating the sessions were doing active I/O. Because the database appeared to be doing what it was supposed to do, we used our top CPU SQL approach to try to solve this puzzle before asking other teams to help diagnose the actual underlying issue.

As shown in Listing 1, our team identified that the CPU spike started at 1:30 p.m. and took 176856 as the begin snapshot id in AWR. As the issue was still ongoing, we used DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT to create a new snapshot, 176860, to record the latest performance data into AWR.

Listing 1: Identify the begin and end snapshot id for the CPU spike time
SQL> @awr_snapshots 1

Instance Number   Snap Id Snap Time                 Snap Level
--------------- --------- ------------------------- ----------
...
              1    176855 07-MAR-2010 13:00                  1
              1    176856 07-MAR-2010 13:30                  1
              1    176857 07-MAR-2010 14:00                  1
              1    176858 07-MAR-2010 14:30                  1
              1    176859 07-MAR-2010 15:00                  1
              1    176860 07-MAR-2010 15:16                  1
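
Creating the extra snapshot on demand is a single documented call:

SQL> -- flush current statistics into a new AWR snapshot immediately
SQL> EXEC DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT;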

As shown in Listing 2, the team took SQL vu7ytmpa3s5ek as the top CPU SQL because it stayed consistently at the top of the list. We also decided to work on the biggest problem first, while bearing in mind to come back and check the SQL statements in second or third place, because they could also be contributing to the CPU spike.

Listing 2: Identify the top CPU SQL as vu7ytmpa3s5ek
SQL> @awr_top_cpu_sql 1 176856 176860

Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
                                    Elapsed       CPU
Sql Id          Plan Hash Value     Seconds     Seconds         Rows    Buffer Gets     Disk Reads  Executions      Parses
-----------------------------------------------------------------------------------------------------------------------------
vu7ytmpa3s5ek     5296468315      11,385      11,013         106,230   3,293,690,084         8,620          24          24
...

Begin snapshot id=176857 End snapshot id=176858 End snapshot time=07-MAR-2010 14:30
                                    Elapsed       CPU
Sql Id          Plan Hash Value     Seconds     Seconds         Rows    Buffer Gets     Disk Reads  Executions      Parses
-----------------------------------------------------------------------------------------------------------------------------
vu7ytmpa3s5ek     5296468315      14,261      13,771          45,350   3,884,282,095         4,906          10          10
...

As shown in Listing 3, we pulled all the run-time information for SQL vu7ytmpa3s5ek out of the AWR data, and the last few lines really popped out. We saw a big surge in the number of Executions of this SQL around the time the CPU spike started, and a huge jump in Buffer Gets thereafter. Given the history of how this SQL ran in the past (Executions < 5 and Buffer Gets < 9 digits per snapshot), we suspected this could be the problem. We confirmed it by checking the other SQL statements, none of which revealed similar issues.

Listing 3: Examine the run-time information for SQL vu7ytmpa3s5ek
SQL> @awr_sql_run_history 1 vu7ytmpa3s5ek

             Instance Snap  Snap                               Elapsed          CPU
Sql Id       Number   Id    Time              Plan Hash Value  Seconds      Seconds       Rows Buffer Gets   Disk Reads Executions Parses
------------ -------- ----- ----------------- --------------- -------- ------------ ---------- ------------- ---------- ---------- ------
...
vu7ytmpa3s5ek   1    176837 07-MAR-2010 04:00      5296468315    1,044        1,030     30,938    461,227,822     1,021          5      5
vu7ytmpa3s5ek   1    176838 07-MAR-2010 04:30      5296468315      696          690     19,985    343,813,929       481          0      0
vu7ytmpa3s5ek   1    176839 07-MAR-2010 05:00      5296468315      266          265      1,882    132,844,756        62          0      0
vu7ytmpa3s5ek   1    176840 07-MAR-2010 05:30      5296468315       44           40      4,756     18,941,601       380          3      3
vu7ytmpa3s5ek   1    176841 07-MAR-2010 06:00      5296468315       60           59        148     29,203,250        77          1      1
vu7ytmpa3s5ek   1    176842 07-MAR-2010 06:30      5296468315      105          104          0     48,946,885        96          1      1
vu7ytmpa3s5ek   1    176843 07-MAR-2010 07:00      5296468315      606          596      5,840    287,736,753       176          1      1
vu7ytmpa3s5ek   1    176844 07-MAR-2010 07:30      5296468315      271          270        142    132,232,291        33          6      6
vu7ytmpa3s5ek   1    176845 07-MAR-2010 08:00      5296468315      757          734     10,152    313,558,597       729          7      7
vu7ytmpa3s5ek   1    176846 07-MAR-2010 08:30      5296468315    1,067        1,061      7,090    443,839,315       221          1      1
vu7ytmpa3s5ek   1    176847 07-MAR-2010 09:00      5296468315      579          574      2,831    248,128,175        48          1      1
vu7ytmpa3s5ek   1    176848 07-MAR-2010 09:30      5296468315      612          609        160    301,446,395         0          0      0
vu7ytmpa3s5ek   1    176849 07-MAR-2010 10:00      5296468315      499          499        130    249,778,972         0          0      0
vu7ytmpa3s5ek   1    176850 07-MAR-2010 10:30      5296468315      610          608        140    300,330,553         0          0      0
vu7ytmpa3s5ek   1    176855 07-MAR-2010 13:00      5296468315      810          746     57,922    320,108,962     4,196         26     26
vu7ytmpa3s5ek   1    176856 07-MAR-2010 13:30      5296468315    3,887        3,768     61,244  1,321,679,994     4,489         14     14
vu7ytmpa3s5ek   1    176857 07-MAR-2010 14:00      5296468315    7,498        7,245     44,986  1,972,010,090     4,131         10     10
vu7ytmpa3s5ek   1    176858 07-MAR-2010 14:30      5296468315    6,763        6,526        364  1,912,272,005       775          0      0
vu7ytmpa3s5ek   1    176859 07-MAR-2010 15:00      5296468315    6,309        6,160        266  2,008,435,124        75          0      0
vu7ytmpa3s5ek   1    176860 07-MAR-2010 15:16      5296468315    4,595        4,512        260  1,670,567,860        13          0      0

As shown in Listing 4, we also cross-checked the top-five timed events in AWR. The events read by other session, db file sequential read and db file scattered read were taken as being in line with the heavy Buffer Gets observed.

Listing 4: Review the top 5 timed events at the CPU spike time in AWR data
SQL> @awr_top_events 1 176856 176860

Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
Instance                                                  Average     % Total
Number               Event Name       Waits    Time (s)   Wait (ms)   Call Time
-------------------------------------------------------------------------------------
1                      CPU time                  7751                  47.9
1         read by other session     108,844      5679        52        35.1
1       db file sequential read     143,598      1295         9           8
1        db file scattered read       7,182       660        92         4.1
1                gc buffer busy      15,508       426        27         2.6
...

To confirm, we further validated the finding with ASH data. Listing 5 displays the sessions found running SQL vu7ytmpa3s5ek at the CPU spike time. Listing 6 shows one top CPU SQL session that we traced down. By looking at the wait event and blocking session information, we observed another problem when multiple sessions ran this SQL simultaneously: contention. The events latch: cache buffers chains and gc buffer busy were taken as signals of contention, again in line with the heavy Buffer Gets. Listings 7 and 8 show the summarized wait event and blocking session information for all sessions at the CPU spike time, which also pointed in the same direction.

Listing 5: Identify sessions running SQL vu7ytmpa3s5ek at the CPU spike time
SQL> @ash_session_for_sqlid 1 176856 176860 vu7ytmpa3s5ek
                                                Session                                                               Blocking
  Snap Id Sample Time                       Sid State        Sql Id          Event                                     Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
   176857 07-MAR-2010 13:43:49             1067 WAITING      vu7ytmpa3s5ek   gc cr block busy
   176857 07-MAR-2010 13:44:40             1074 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:45:10              840 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                  1074
   176857 07-MAR-2010 13:45:30             1074 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:45:50              852 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:46:50              841 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:47:10              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:47:30              852 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:48:00              852 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:48:50              963 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:40:21              874 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   885
   176857 07-MAR-2010 13:51:11              897 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:51:41              852 WAITING      vu7ytmpa3s5ek   db file sequential read
   176857 07-MAR-2010 13:51:51              868 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:51:51              874 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:51:51              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:51:51              897 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:51:51             1045 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:52:01              840 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   874
   176857 07-MAR-2010 13:52:01              841 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   874
   176857 07-MAR-2010 13:52:01             1039 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   874
...

Listing 6: Trace down the top CPU SQL session 885 at the CPU spike time
SQL> @ash_session_for_sid 1 176856 176860 885
                                                Session                                                               Blocking
  Snap Id Sample Time                       Sid State        Sql Id          Event                                     Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
   176857 07-MAR-2010 13:37:10              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:41:51              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:42:11              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   841
   176857 07-MAR-2010 13:45:42              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   897
   176857 07-MAR-2010 13:46:22              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:47:43              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:48:33              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:49:03              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:49:23              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                  1074
   176857 07-MAR-2010 13:49:34              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:51:14              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:51:24              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:51:35              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:56:16              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                  1074
   176857 07-MAR-2010 13:57:16              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains
   176857 07-MAR-2010 13:57:36              885 WAITING      vu7ytmpa3s5ek   gc buffer busy
   176858 07-MAR-2010 14:01:07              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                  1039
   176858 07-MAR-2010 14:01:27              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   963
   176858 07-MAR-2010 14:07:28              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   974
   176858 07-MAR-2010 14:08:29              885 WAITING      vu7ytmpa3s5ek   latch: cache buffers chains                   965
...

Listing 7: Check blocking session information for all sessions at the CPU spike time
SQL>  @ash_blocking_session 1 176856 176860
                                         Blocking Waiting Session
Event                                     Session           Count
---------------------------------------- -------- ---------------
read by other session                                       3,471
db file sequential read                                       642
gc buffer busy                                                566
db file scattered read                                        476
latch: cache buffers chains                                    95
log file sync                                1132              63
control file sequential read                                   62
gc cr block busy                                               35
SQL*Net more data to client                                    30
gc cr grant 2-way                                              24
latch: cache buffers chains                  1074              18
latch: cache buffers chains                   897              18
latch: cache buffers chains                   974              11

Listing 8: Review top wait events at the CPU spike time in ASH data
SQL> @ash_top_events 1 176856 176860

Begin snapshot id=176856 End snapshot id=176857 End snapshot time=07-MAR-2010 14:00
Instance                               Blocking     Waiting Session
Number                          Event   Session     Count
-----------------------------------------------------------------------------------
1               read by other session                1027
1             db file sequential read                 235
1                      gc buffer busy                 201
1              db file scattered read                 121
1         latch: cache buffers chains                  43
1                       log file sync      1132        24
1                    gc cr block busy                  16
1        control file sequential read                  14
1         latch: cache buffers chains      1074        12
1         latch: cache buffers chains       897        11
...
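
When the contention events point at the buffer cache, as they do here, the ASH samples can also suggest which objects the contention centers on. This is not one of the scripts described earlier, just an additional sketch we find useful; it rolls the samples up by the CURRENT_OBJ# column and joins to DBA_OBJECTS:

-- which objects the contended buffers belong to (additional sketch)
SELECT o.owner, o.object_name, ash.event, COUNT(*) AS samples
  FROM dba_hist_active_sess_history ash, dba_objects o
 WHERE ash.current_obj# = o.object_id
   AND ash.instance_number = &instance_number
   AND ash.snap_id BETWEEN &begin_snapshot_id AND &end_snapshot_id
   AND ash.event IN ('latch: cache buffers chains', 'gc buffer busy',
                     'read by other session')
 GROUP BY o.owner, o.object_name, ash.event
 ORDER BY samples DESC;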

After we identified the top CPU SQL as the problem, we determined which part of the application it came from, as shown in Listing 9, and retrieved its SQL text and execution plan, as shown in Listing 10. With all of the detailed data in hand to support our findings, we were confident moving forward to talk to the application team.

Listing 9: Check the application-level information for the top CPU SQL session 885
SQL> @ash_session_details 1 176856 176860 885 vu7ytmpa3s5ek

   Sid Sql Id          Username     Module               Action               Program              Service Name
------ --------------- ------------ -------------------- -------------------- -------------------- --------------------
   885 vu7ytmpa3s5ek   REPORT_USER  JDBC Thin Client                          JDBC Thin Client     adhoc_report

Listing 10: Retrieve the sql text and execution plan for SQL vu7ytmpa3s5ek in AWR data
SQL>  @awr_sql_report 1 176856 176860 vu7ytmpa3s5ek
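
The awr_sql_report.sql script is ours; the same information can also be pulled with standard calls, since the SQL text is kept in DBA_HIST_SQLTEXT and the historical plan can be formatted with DBMS_XPLAN.DISPLAY_AWR:

SQL> SELECT sql_text FROM dba_hist_sqltext WHERE sql_id = 'vu7ytmpa3s5ek';
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('vu7ytmpa3s5ek'));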

Real-time troubleshooting tools are very helpful in revealing what is currently happening. However, they do not provide enough information about what happened in the past, which can blur the interpretation of the results. By combining AWR data and ASH data, this approach provides both historical and real-time answers for troubleshooting issues.

Case Study 2

Another database server was found to have a CPU spike between 1 a.m. and 2 a.m. every morning. There was no monitoring alert because the spike stayed just below the warning threshold. However, it generated a noticeable pattern in the CPU graph. To take preventive action, our team was assigned to find the root cause of this daily occurrence.

We selected one day's CPU spike time and identified SQL 37sgt41qpy0p8 as the top CPU SQL. As shown in Listing 11, its run-time information actually told a different story. The CPU time of this SQL was stable at 10 seconds at non-spike times, while jumping up to 5,000 seconds at the CPU spike time. While the Buffer Gets and Disk Reads did show some increase at the spike time, the numbers were not significant compared to other SQL statements running at the same time. The number of Executions, on the contrary, dropped substantially at the spike time. Considering that this SQL ran more than 2,000 times per AWR snapshot window at non-spike times, and that it processed one row per execution (the value of Rows was always equal to that of Executions), we concluded that this SQL was just a symptom: it simply could not run fast enough at the CPU spike time.

Listing 11: Examine the run-time information for SQL 37sgt41qpy0p8
SQL> @awr_sql_run_history 1 37sgt41qpy0p8

             Instance Snap  Snap                               Elapsed          CPU
Sql Id       Number   Id    Time              Plan Hash Value  Seconds      Seconds       Rows Buffer Gets Disk Reads Executions Parses
------------ -------- ----- ----------------- --------------- -------- ------------ ---------- ----------- ---------- ---------- ------
...
37sgt41qpy0p8   2     70209 17-APR-2010 22:30               0       10           10      2,112           0         0       2,112      0
37sgt41qpy0p8   2     70210 17-APR-2010 23:00               0       10           10      2,070           0         0       2,070      0
37sgt41qpy0p8   2     70211 17-APR-2010 23:30               0       10           10      2,138           0         0       2,138      0
37sgt41qpy0p8   2     70212 18-APR-2010 01:00               0      392          247      2,070       8,111         0       2,070      0
37sgt41qpy0p8   2     70213 18-APR-2010 01:30               0    5,020        4,668        852      35,616         0         852      1
37sgt41qpy0p8   2     70214 18-APR-2010 02:00               0    2,967        2,664        776      21,104         0         776      1
37sgt41qpy0p8   2     70215 18-APR-2010 02:30               0       10           10      1,926           0         0       1,926      0
37sgt41qpy0p8   2     70216 18-APR-2010 03:00               0       10           10      2,096           0         0       2,096      0
37sgt41qpy0p8   2     70217 18-APR-2010 03:30               0       10           10      2,109           0         0       2,109      1
37sgt41qpy0p8   2     70218 18-APR-2010 04:00               0       10           10      2,032           0         0       2,032      0

To get a hint of what else could be happening, we checked the top-five timed events in AWR, shown in Listing 12. From the events row cache lock and library cache lock, we observed that there was contention as well as an object lock issue.

Listing 12: Review the top 5 timed events at the CPU spike time in AWR data
 SQL> @awr_top_events 2 70212 70214

 Begin snapshot id=70212 End snapshot id=70213 End snapshot time=18-APR-2010 01:30
 Instance                                                      Average     % Total
 Number                  Event Name       Waits    Time (s)   Wait (ms)   Call Time
 -----------------------------------------------------------------------------------
 2                         CPU time                  4146                  37.2
 2                   row cache lock     110,290      1277        12        11.4
 2               library cache lock       3,999       882       221         7.9
 2                 direct path read     211,549       817         4         7.3
 2         latch: row cache objects      61,304       715        12         6.4
...

We kept digging into the ASH data and identified one session (SID = 833) that was running SQL 37sgt41qpy0p8 at the time of the CPU spike. As shown in Listing 13, we confirmed from its wait events that it was running into row cache and library cache contention. The blocking session (SID = 837) was also identified, and we traced it down as shown in Listing 14. We could see that this blocking session was at the heart of the problem: it blocked other sessions while itself being blocked. This was confirmed by Listings 15 and 16, in which the wait event and blocking session information for all sessions in the ASH data pointed to session 837 as the big blocker.

Listing 13: Trace down the top CPU SQL session 833 at the CPU spike time
SQL> @ash_session_for_sid 2 70212 70214 833
                                                Session                                                               Blocking
  Snap Id Sample Time                       Sid State        Sql Id          Event                                     Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
    70214 18-APR-2010 01:40:16              833 WAITING      rjjk8gjw4tmqf   library cache lock
    70214 18-APR-2010 01:40:26              833 WAITING      e89tktqg9kv1s   library cache lock
    70214 18-APR-2010 01:40:47              833 WAITING      ydsgq975nmhkd   library cache lock                            837
    70214 18-APR-2010 01:40:57              833 WAITING      jyztj0vtzu45b   library cache lock
    70214 18-APR-2010 01:41:07              833 WAITING      o0464ghfsxzwd   library cache lock
    70214 18-APR-2010 01:41:17              833 WAITING      069g87c64q13e   latch: row cache objects
    70214 18-APR-2010 01:42:08              833 WAITING      mtjy9mk07rcmk   row cache lock
    70214 18-APR-2010 01:42:28              833 WAITING      2hwsq1ma3w29m   latch: row cache objects
    70214 18-APR-2010 01:42:49              833 WAITING      3c73j42t595vn   row cache lock
    70214 18-APR-2010 01:43:19              833 WAITING      9m05zf9a2t4ht   library cache lock
    70214 18-APR-2010 01:44:10              833 WAITING      5xj39u95phcx2   library cache lock
    70214 18-APR-2010 01:44:30              833 WAITING      8k2b99mtds9fx   row cache lock
    70214 18-APR-2010 01:45:32              833 WAITING      4s826ffgzzwvv   library cache lock
    70214 18-APR-2010 01:46:33              833 WAITING      0ybcgnn9jy6jg   library cache lock
    70214 18-APR-2010 01:46:53              833 WAITING      16qj7vcafhpsy   row cache lock
    70214 18-APR-2010 01:47:34              833 WAITING      5u0jd8tppsstw   library cache lock                            837
    70214 18-APR-2010 01:49:06              833 WAITING      hv59t19bjahqi   latch: row cache objects
...

Listing 14: Trace down blocking session 837 at the CPU spike time
SQL>@ash_session_for_sid 2 70212 70214 837
                                                Session                                                               Blocking
  Snap Id Sample Time                       Sid State        Sql Id          Event                                     Session
--------- ------------------------------ ------ ------------ --------------- ---------------------------------------- --------
...
    70214 18-APR-2010 01:40:57              837 WAITING      h9fcdzjnksq78   library cache lock                            813
    70214 18-APR-2010 01:41:07              837 WAITING      h9fcdzjnksq78   library cache lock                            813
    70214 18-APR-2010 01:41:27              837 WAITING      h9fcdzjnksq78   library cache lock                            813
    70214 18-APR-2010 01:41:38              837 WAITING      h9fcdzjnksq78   library cache lock                            805
    70214 18-APR-2010 01:41:48              837 WAITING      h9fcdzjnksq78   library cache lock                            809
    70214 18-APR-2010 01:41:58              837 WAITING      h9fcdzjnksq78   library cache lock                           1111
    70214 18-APR-2010 01:42:08              837 WAITING      h9fcdzjnksq78   library cache lock                            805
    70214 18-APR-2010 01:42:18              837 WAITING      h9fcdzjnksq78   library cache lock                            852
    70214 18-APR-2010 01:42:28              837 WAITING      h9fcdzjnksq78   library cache lock                           1113
    70214 18-APR-2010 01:42:38              837 WAITING      h9fcdzjnksq78   library cache lock                            557
    70214 18-APR-2010 01:42:49              837 WAITING      h9fcdzjnksq78   library cache lock                            833
    70214 18-APR-2010 01:42:59              837 WAITING      h9fcdzjnksq78   library cache lock                           1113
    70214 18-APR-2010 01:43:09              837 WAITING      h9fcdzjnksq78   library cache lock                           1113
    70214 18-APR-2010 01:43:19              837 WAITING      h9fcdzjnksq78   library cache lock                            851
    70214 18-APR-2010 01:43:30              837 WAITING      h9fcdzjnksq78   library cache lock                            833
    70214 18-APR-2010 01:43:40              837 WAITING      h9fcdzjnksq78   library cache lock                           1111
    70214 18-APR-2010 01:43:50              837 WAITING      h9fcdzjnksq78   library cache lock                            813
    70214 18-APR-2010 01:44:00              837 WAITING      h9fcdzjnksq78   library cache lock                            805
...

Listing 15: Check blocking session information for all sessions at the CPU spike time
SQL> @ash_blocking_session 2 70212 70214
                                         Blocking Waiting Session
Event                                     Session           Count
---------------------------------------- -------- ---------------
row cache lock                                                249
latch: row cache objects                                      190
direct path read                                              143
library cache lock                                            124
db file scattered read                                        102
library cache lock                            837              99
db file sequential read                                        39
PX Deq Credit: send blkd                                       32
gc cr multi block request                                      28

Listing 16: Review top wait events at the CPU spike time in ASH data
SQL> @ash_top_events 2 70212 70214

...
Begin snapshot id=70213 End snapshot id=70214 End snapshot time=18-APR-2010 02:00
Instance                            Blocking     Waiting Session
Number                       Event   Session     Count
----------------------------------------------------------------------------------
2                   row cache lock                 203
2         latch: row cache objects                 145
2               library cache lock                 124
2           db file scattered read                  73
2               library cache lock       837        62
2          db file sequential read                  36
2         PX Deq Credit: send blkd                  32
2        gc cr multi block request                  23

By checking the application-level information for this session, as shown in Listing 17, we found it came from a DBMS_SCHEDULER job. After reviewing the job's schedule and run log in the views DBA_SCHEDULER_JOBS and DBA_SCHEDULER_JOB_RUN_DETAILS, we found a perfect match between the run time of this job and the CPU spike time on a daily basis. As this was a daily object statistics collection job, it explained both the CPU spike and the row cache/library cache contention: the statistics collection invalidated cursors, which then had to be reparsed and reloaded into the library cache.

Listing 17: Check the application-level information for the top CPU SQL session 837
SQL>  @ash_session_details 2 70212 70214 837 h9fcdzjnksq78

   Sid Sql Id          Username     Module               Action          Program              	 Service Name
------ --------------- ------------ -------------------- --------------- ----------------------- ----------------
   837 h9fcdzjnksq78   DBA          DBMS_SCHEDULER       DAILY_DB_STATS  oracle@db-box (J001)    dba_maintenance
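
The schedule-to-spike correlation can be checked with standard queries against the scheduler dictionary views. A minimal sketch, assuming the job name matches the Action value shown in Listing 17:

-- the job definition and its schedule
SELECT job_name, repeat_interval, enabled
  FROM dba_scheduler_jobs
 WHERE job_name = 'DAILY_DB_STATS';

-- recent runs, to compare start times and durations with the spike window
SELECT job_name, actual_start_date, run_duration, status
  FROM dba_scheduler_job_run_details
 WHERE job_name = 'DAILY_DB_STATS'
 ORDER BY actual_start_date DESC;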

Our approach has proven equally effective when the CPU spike happened in the past, a situation in which there are far fewer options to explore than when the spike is actually occurring.

Conclusion

AWR and ASH are great tools that Oracle provides for troubleshooting the database and determining and solving performance issues. Combined with ASH data, AWR data becomes far more meaningful for pinpointing issues at both the instance level and the session level. There are many ways to explore the use of AWR and ASH data; our top CPU SQL approach to untangling CPU spike issues is just one good example. Our team continues to learn about performance issues and refines this method as new situations are encountered.
