Solve Locking and Blocking Puzzle
 Introduction

In order to ensure data consistency, Oracle uses different types of lock to manage access to shared resources such as a table definition, a set of data in a table, etc. When a session requests lock on a shared resource and another session has already hold lock on it, it may need to wait. This would lead to a blocking situation where waiter sessions make no progress until the blocker sessions release the lock.

Depending on how long the blocking situation lasts and how many waiter sessions get involved, the impact of a blocking issue varies. Many databases have implemented monitoring mechanism to watch blocking situations to avoid it persisting over an extended period of time. DBA gets alerted about an on-going blocking issue, and takes actions such as killing blocker session to resolve the conflicts. Diagnostic SQL scripts querying real-time database performance data (v$lock and v$session) are very helpful for DBA to pinpoint blocker and waiter sessions.

However, blocking situation sometimes seizes database swiftly. For example, a very busy production database had a blocking issue all of sudden and the issue was self-resolved after 5 minutes. Many customers noticed system was ‘hung’ during that short period of time and after that, things came back to normal. When DBA got called in to investigate the issue one hour later, the real-time database performance data showed no issue and left many question unanswered: "what really happened?", "will it happen again?", etc.

For troubleshooting historical blocking issue, DBA can always resort to AWR and ASH data. Oracle’s ASH report has a 'top blocking session' section, however, the information there is too simple to construct blocker and waiter diagram and help further investigation.

DBspeed is designed to be able to display the blocker and waiter diagram with corresponding wait events and SQL statements on a single page. Supplemented by AWR top wait events and SQL Stats data, DBspeed offers a deeper understanding of the root cause of blocking issue and enable us resort to effective solutions .

 Case Study Summary

ASH data contains information about a session and its blocking sessions if it is being blocked during ASH sampling time. It is possible to use ASH data to construct the blocker and waiter diagram for a blocking situation. The case described here explains how ASH data was used to dig out the blocker session and help to understand what caused blocking. AWR data such as top wait event and SQL stats are also utilized to crosscheck the findings from ASH data.

The list of steps include:

 Step 1 : Check DB Time to determine when performance bottleneck started

The first step is always to identify the time window for performance bottleneck issue. DBspeed has 'AWR AAS' report which uses DB Time to do that job.

 Step 2 : Check AWR top 5 events to see which event was moving up to the top during performance bottleneck

AWR shows the top wait events changed to wait events 'enq: TM - contention' and 'enq: TX - row lock contention'. These two wait events means there were two different types of lock involved.

TM locks are acquired for a base table or partition when Oracle performs operations such as enabling referential constraints (Metalink note 34664.1). TX locks are acquired when a transaction initiates its first change and is held until the transaction does a COMMIT or ROLLBACK (Metalink note 62354.1)

According to these two wait events, we are sure that the database ran into blocking issue.

 Step 3 : Check ASH top wait events and see if there were corresponding blocking sessions

With blocking issue in mind, we turned to ASH data. DBspeed has 'ASH Top Events with Blocking Session' report to rank ASH top wait events together with corresponding blocking session.

As we can see here, there are two outstanding blocker sessions. Let us start scratching blocker and waiter diagram from now on.

  •   SID 659 SERIAL# 29653
  •   SID 695 SERIAL# 61429

 Step 4 : Check ASH top blocking session to see who were the top blockers.

DBspeed has 'Top Blocking Sessions' report to rank blocker sessions according to the number of waiter sessions it blocks.

As we can see here, the top three outstanding blocker sessions are :

  •   SID 659 SERIAL# 29653, there were 855 waiter sessions being blocked by this blocker session
  •   SID 1743 SERIAL# 53855, there were 92 waiter sessions being blocked by this blocker session
  •   SID 695 SERIAL# 61429, there were 91 waiter sessions being blocked by this blocker session
Two blocker sessions are confirmed by the previous ASH top wait event report. (The result is not fully matched because we used different 'Activity %' threshold as the ‘top’ criteria in two different reports. We can adjust 'Activity %' threshold to match results.).

 Step 5 : Relate top blocking sessions to wait events to confirm the blocking symptom

In DBspeed's 'Top Blocking Sessions' report, the blocker session is the base dimension. We can add more dimension in to explore if there are correlational relationship between a blocker session and a specific wait event, or a SQL statement, or waiter sessions.

As we can see here, we added dimension ‘wait event’. We can confirm the blocker sessions and those wait events on which they caused waiter sessions waiting.

 Step 6 : Construct blocker and waiter diagram

Let us add more dimensions in to construct the blockers and waiters diagram:

SID 1743 SERIAL# 53855 ---> SID 659 SERIAL# 29653 ---> around 800 waiter sesions, including SID 695 SERIAL# 61429 ---> around 90 waiter sesions

 Step 7 : Check waiter session and its SQL

Let us check what is the SQL statement the waiter session was running during blocking. From the SQL, we are trying to figure out why a lock was requested by the waiter session.

As we can see here, session SID 695 SERIAL# 61429 was trying to perform 'insert' statement on table US_SALES.

 Step 8 : Check another waiter session and its SQL

Session SID 659 SERIAL# 29653 was trying to perform 'delete' statement on table PRODUCTS

 Step 9 : Check the lock type, mode and object requested

Let us add more details to the blocker and waiter diagram.

SID 1743 SERIAL# 53855  
  ---> SID 659 SERIAL# 29653, SQL='delete from PRODUCTS...'
   ---> SID 695 SERIAL# 61429, SQL='insert into US_SALES...'

The lock type, mode and object requested can also help us understand why blocking was happening. For wait event 'enq: TM - contention', the P1 value points to lock type and mode requested, while P2 value points to the object ID of the object requested.

DBspeed has 'ASH Top Event P1/P2/P3 Values' report to display P values for a specific event. On this report, the dimension ‘SQL’ can also be correlated with the wait event specified. As we can see here, we can identify the waiter sessions, SQL statements they were trying to run, wait events on which they were being blocked, and P values of those wait events.

DBspeed has 'AWR Object Search' report to retrive object name from the P2 value.

In order to further break down P1 value into lock type and mode, script (check_lock.sql) is used.

 Step 10 : Understand blocker and waiter diagram

With all the information we have so far, let us add details to the blocker and waiter diagram.

SID 1743 SERIAL# 53855   
  ---> SID 659 SERIAL# 29653, SQL='delete from PRODUCTS...', requested TM mode 5 on table US_SALES
   ---> SID 695 SERIAL# 61429, SQL='insert into US_SALES...', requested TM mode 3 on table US_SALES

By looking at blocker and waiter diagram, one obvious question jumped out: why a ‘delete from PRODUCTS..’ statement needs to request TM lock mode 3 on table US_SALES? Referential Integrity enforcement. The 'delete' statement on a parent table will incur TM lock mode 3 on child table if the foreign key column on the child table doesn’t get indexed (Metalink notes 38373.1).

By checking the foreign keys and indexes on table US_SALES, we can confirm that foreign key column not being indexes can explain why session SID 659 SERIAL# 29653 requested TM lock mode 5 on table US_SALES.

With a TM lock mode 5 request on table US_SALES happening, all the following DML operations on the same table would request TM lock mode 3 and be blocked. This explains how session SID 695 SERIAL# 61429 with 'insert into US_SALES..' statement got blocked by session SID 659 SERIAL# 29653.

 Step 11 : Check blocker session and its SQL

Then, here comes to one more question: why session SID 659 SERIAL# 29653 couldn’t get TM lock mode 5 on table US_SALES? This would explain how session SID 1743 SERIAL# 53855 blocked it.

However, we searched ASH data and all we could find out about blocker session SID 1743 SERIAL# 53855 is very little. The only SQL statement (‘SELECT .. from US_SALES .. ’) related to this session during problem time window was not believed to cause blocking. We need to keep in mind that the blocker session may issue a sequence of SQL statements including a DML on objects but not all of them will be recorded in ASH (Metalink note 729727.1). It is a reasonable assumption that blocker session SID 1743 SERIAL# 53855 had an un-committed transaction on table US_SALES which blocked the request of TM lock mode 5 on the same table.

 Step 12 : Draw conclusions and provide solutions
SID 1743 SERIAL# 53855 a type of DML on table US_SALES, currently hold a lock on table US_SALES
 ---> SID 659 SERIAL# 29653, SQL='delete from PRODUCTS...',  requested TM mode 5 on table US_SALES
   ---> SID 695 SERIAL# 61429, SQL='insert into US_SALES...', requested TM mode 3 on table US_SALES

By tracing down the blocker and waiter diagram, we concluded that the request TM lock mode 5 on table US_SALES was the most significant factor which caused blocking. It happened due to the fact that the table US_SALES had a foreign key column not being indexed. In order to avoid this type of lock and the same type of blocking issue again, we provided the solution to index the foreign key column.

Copyright © 2011 Actrace. All Rights Reserved.
Home | Product | Case Study | Support | Download | Contact Us