OEM Agent 12c: Common Issues and Resolutions

Blog | Jun 29, 2017

Enterprise Manager (EM) agents are one of the most important components of an OEM monitoring environment. The EM agent is the only component/service that runs on a target machine, and it is solely responsible for collecting and uploading all monitoring metric data to OEM. OEM then processes this data and generates notification alerts according to the defined threshold values.

Introduction:

In this blog, I will review some of the most common issues that we face day to day when working with OEM 12c agents. This information should help any DBA fix these issues and keep Oracle targets well monitored through OEM.

Some common EM agent issues include:

Issue 1:

Java Heap Space: OutOfMemoryError Issue

This is one of the most common issues a DBA faces while working with EM agents on version 12.1.0.1.0 and later.

The 12c OEM agent crashes suddenly and logs java.lang.OutOfMemoryError errors in /agent_inst/sysman/log/gcagent.log.

At times, Java heap space issues are accompanied by a TaskZombieException, as shown below:

++++++++++++++++++++

2014-10-12 07:49:43,917 [153032:GC.Executor.70250] ERROR - Critical error:
oracle.sysman.gcagent.task.TaskZombieException: task declared as a zombie
--
2014-10-12 07:49:57,229 [153098:oracle.dfw.impl.incident.DiagnosticsDataExtractorImpl - Incident Dump Executor (created: Sun Oct 12 07:49:48 AST 2014)] ERROR - Result set exceeded max flood control level
2014-10-12 07:50:47,129 [150730:GC.SysExecutor.1751 (Ping OMS)] ERROR - PingListener "Upload Manager" threw an unchecked exception on notification of ((* PingEvent: subsequent scheduled ping attempt with result=SUCCESS occurred at Sun Oct 12 07:50:47 AST 2014 *))
oracle.sysman.gcagent.task.TaskCancelledError: Aborting task due to Thread.interrupt
--
2014-10-12 07:51:33,212 [153110:GC.SysExecutor.1776 (SchedulerHeartbeat)] ERROR - Critical error:
java.lang.OutOfMemoryError: Java heap space

++++++++++++++++++++ 
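
A quick way to confirm that an agent is hitting this problem is to search its log for the two error signatures in the excerpt above. The following is a minimal sketch, assuming the log path is passed in as an argument; the function name `scan_agent_log` is purely illustrative and is not part of the agent tooling.

```shell
#!/bin/sh
# scan_agent_log FILE
# Hypothetical helper: counts the lines in FILE that mention each of the
# two error signatures seen in the log excerpt.
scan_agent_log() {
    log_file="$1"
    for pattern in 'java.lang.OutOfMemoryError' 'TaskZombieException'; do
        # -F matches the pattern as a fixed string, -c prints the number of matching lines.
        # grep exits non-zero when nothing matches, so "|| true" keeps the loop going.
        count=$(grep -c -F "$pattern" "$log_file" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

It could be called as `scan_agent_log /agent_inst/sysman/log/gcagent.log`, using whatever the full log path is on the target host.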

Cause:

Whenever an EM agent collects monitoring metric data, it uses Java heap memory. The agent may crash if the heap size allowed by its configuration is too small for the data it has to collect.

Resolution:

  1. Stop the agent:

                emctl stop agent

  2. Take a backup of the emd.properties file residing in ../agent_inst/sysman/config.
  3. Edit emd.properties and set the values below. (Applicable only if you are getting TaskZombieException.)

_zombieSuspensions=true
_canceledThreadWait=210

  4. Change the Java heap memory size.

from (the default value):
agentJavaDefines=-Xmx128M -XX:MaxPermSize=96M
to:
agentJavaDefines=-Xmx512M -XX:MaxPermSize=96M

  5. Start the agent:

                emctl start agent

In some cases you may need to raise the -Xmx value beyond 512M, increasing it until the issue is resolved.
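
The backup and heap-size edit from the steps above can also be scripted. The following is a minimal sketch, assuming the properties file already contains an `agentJavaDefines=` line with a `-Xmx` setting; the function name `set_agent_heap` and the `.bak` backup suffix are illustrative choices, not Oracle conventions. The agent should still be stopped before the edit and started afterwards, as in the steps above.

```shell
#!/bin/sh
# set_agent_heap FILE SIZE
# Hypothetical helper: backs up FILE to FILE.bak, then rewrites the -Xmx value
# on the agentJavaDefines line to SIZE (for example 512M).
set_agent_heap() {
    props_file="$1"
    new_size="$2"
    # Keep a copy of the original file before changing anything.
    cp -p "$props_file" "$props_file.bak" || return 1
    # Only the -Xmx value on the agentJavaDefines line is replaced;
    # other options such as -XX:MaxPermSize and all other lines are left as they are.
    sed "/^agentJavaDefines=/ s/-Xmx[0-9][0-9]*[A-Za-z]/-Xmx$new_size/" \
        "$props_file.bak" > "$props_file"
}
```

For example, `set_agent_heap ../agent_inst/sysman/config/emd.properties 512M` would apply the change described in the steps, and the same call with a larger size can be repeated if 512M is not enough.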
-------------------------------------------------------------------------------------------- 

Issue 2:

Leap Second Adjustment on Linux Causing Issues to EM Agents

This is more a bug in the Linux operating system than in OEM; however, it adversely impacts EM agent and OMS functionality.

Whenever this bug is hit, the Oracle Enterprise Manager Management Agent (OMA) or Oracle Management Service (OMS) may start consuming excessive CPU on the server. So far, this issue has generally been seen at the end of June or December.

I found this issue very interesting while researching it, as it involves celestial objects such as the Sun, Earth, and Moon. Let’s briefly discover how.

What is a Leap Second?

As per Red Hat’s definition:

“Leap seconds are a periodic one-second adjustment of Coordinated Universal Time (UTC) in order to keep a system's time of day close to the mean solar time. However, the Earth's rotation speed varies in response to climatic and geological events, and due to this, UTC leap seconds are irregularly spaced and unpredictable.”

“Why this extra second? It exists because the rotation of the Earth on its axis, which determines the passing of days and nights, slows down over a long period, mainly as a consequence of Moon-Sun attraction effects. In addition, the Earth is affected by its internal (core, mantle) and external (atmosphere, oceans) constituents.”

As per Oracle, Linux distributions which may be affected include Oracle Linux, Red Hat Enterprise Linux, Oracle VM and Oracle Unbreakable Enterprise Kernel. Asianux 2 and 3, based on RHEL 4, 5 or 6, may also be affected (RHEL 7 is not affected, according to Red Hat).

How to identify leap second adjustment:

$ dmesg | grep -i leap
[10703552.860274] Clock: inserting leap second 23:59:60 UTC

The output shows whether a leap second has been inserted.
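
When the same check has to be run across many hosts, it can help to wrap it in a small function that gives a yes/no answer. This is only a sketch: the function name `leap_second_seen` and its output strings are made up for illustration, and it simply looks for the phrase "leap second" in whatever kernel log text it receives on standard input.

```shell
#!/bin/sh
# leap_second_seen
# Hypothetical helper: reads kernel log text on standard input and reports
# whether it contains a leap second message.
leap_second_seen() {
    # -q suppresses output, -i makes the match case-insensitive.
    if grep -qi 'leap second'; then
        echo "leap second message found"
    else
        echo "no leap second message"
    fi
}
```

On a host it would be fed from the kernel log, for example `dmesg | leap_second_seen`.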

Resolution:

Oracle came up with a workaround based on feedback from its customers.

Run the commands below as the root user.

# /etc/init.d/ntpd stop
# date -s "`date`"    (reset the system clock)
# /etc/init.d/ntpd start
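
If the workaround has to be applied on several hosts, the three commands can be wrapped in a function with a dry-run switch so the sequence can be reviewed before it touches the clock. This is a minimal sketch, not Oracle's script: the names `run_step`, `reset_clock_workaround`, and the `DRY_RUN` variable are assumptions introduced here, and a real run still has to be made as root.

```shell
#!/bin/sh
# run_step CMD [ARGS...]
# Hypothetical helper: prints the command when DRY_RUN=1, otherwise runs it.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# reset_clock_workaround
# Runs the three workaround commands in order and stops at the first failure.
reset_clock_workaround() {
    run_step /etc/init.d/ntpd stop || return 1
    # Set the system clock to its own current time, as in the workaround above.
    run_step date -s "$(date)" || return 1
    run_step /etc/init.d/ntpd start
}
```

Setting `DRY_RUN=1` before calling `reset_clock_workaround` only prints the three commands; calling it without `DRY_RUN` executes them.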

This issue can also be rectified by restarting the impacted host.

Reference Oracle Document:

Enterprise Manager Management Agent or OMS CPU Use Is Excessive near Leap Second Additions on Linux (Doc ID 1472651.1) 

https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=287033065719959&id=1472651.1&_adf.ctrl-state=qtcxet0s2_85 

Conclusion:

By keeping these issues and their resolutions in mind, you can avoid gaps in the monitoring of your critical production and other environments. Please feel free to reach out to me with any questions on these topics.
Ask Bakshish