Wednesday, March 18, 2009

WebSphere Portal Disaster Troubleshooting Guide

If your WebSphere Portal has crashed, here are a few tips for troubleshooting it.

WEBSPHERE PORTAL DISASTERS

This section covers the kinds of disasters that can occur in the WebSphere Portal component:

* Portal JVM Crash. WebSphere Portal is considered to have crashed if its process identifier (PID) disappeared or changed over time without the server being stopped intentionally, for example by issuing the stopServer command.
* Portal Runtime failure. Any failure during general operation and administration of WebSphere Portal is considered a runtime failure.
* Portal Server Start time failure. This is any failure that prevents the Portal from starting, or that lets it start but leaves it inaccessible.
* Portal Server Operating System failure. This is any failure that is caused at the Operating System level, for example the accidental removal of the WebSphere folder.
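
The crash definition in the first bullet can be expressed as a small check. This is only a sketch: the function name and parameters are hypothetical, and it assumes you already record the server PID at two points in time by whatever means your environment provides.

```python
from typing import Optional


def portal_crashed(previous_pid: Optional[int],
                   current_pid: Optional[int],
                   stop_was_intentional: bool) -> bool:
    """Return True when the PID history matches the crash definition above."""
    if stop_was_intentional:
        # Someone issued stopServer (or similar), so this is not a crash.
        return False
    if previous_pid is None:
        # No earlier observation, so there is nothing to compare against.
        return False
    # Crash: the PID disappeared (None) or changed to a different value.
    return current_pid is None or current_pid != previous_pid
```

For example, `portal_crashed(4321, None, False)` reports a crash, while the same observation after an intentional stop does not.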


PORTAL JVM CRASH

The following diagnostic logs and files are essential to troubleshooting a crash:

* Javacore. This file is generated when the JVM terminates unexpectedly. It is a text file that contains information about the JVM and Java application captured at some point during execution.
* Core Dump or User Dump. This file contains a complete dump of the crashed process's memory, and therefore it can grow large.

The following are the steps to diagnose this issue:

* Search for a Javacore. If one is present, analyze it and look for any Signal 11 (a segmentation fault), since this indicates that the application server crashed.
* If there is no Javacore, the JVM is probably hung, and upgrading the IBM Java SDK to the latest service release may solve the issue.
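
The Javacore search can be automated with a short script. The sketch below makes two assumptions: Javacore files match the pattern `javacore*.txt`, and the signal appears in plain text as "signal 11" or "SIGSEGV". Javacore layouts differ between JVM versions, so adjust the pattern to what your files actually contain.

```python
import re
from pathlib import Path
from typing import Dict, List

# Assumption: the signal is mentioned as "signal 11" or "SIGSEGV".
SIGNAL_11 = re.compile(r"\bsignal\s*11\b|\bSIGSEGV\b", re.IGNORECASE)


def find_signal_11(javacore_text: str) -> List[str]:
    """Return the lines of a Javacore that mention signal 11."""
    return [line for line in javacore_text.splitlines()
            if SIGNAL_11.search(line)]


def scan_directory(directory: str) -> Dict[str, List[str]]:
    """Map each javacore*.txt file in a directory to its matching lines."""
    results = {}
    for path in sorted(Path(directory).glob("javacore*.txt")):
        # errors="replace" keeps the scan going on odd byte sequences.
        results[path.name] = find_signal_11(path.read_text(errors="replace"))
    return results
```

An empty result from `scan_directory` means no Javacore was found, which corresponds to the second step above.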

PORTAL RUNTIME FAILURE

Runtime failures include, but are not limited to, the following:

* Unable to log in
* Administration tab unavailable
* Portlet deployment failures
* Portlet unavailable
* Page does not display

The following are the steps to diagnose a Portal Runtime Failure:

* Review SystemOut.log and SystemErr.log.
* Analyze these two files to see whether any failure is logged, and try to find its cause.
* If this approach does not resolve the issue, move on to more advanced troubleshooting by enabling additional tracing.
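
A first pass over SystemOut.log can be scripted. The sketch below assumes the basic WebSphere log layout, in which each entry starts with a bracketed timestamp followed by a thread ID, a short component name, and a one-letter event type (E for error, F for fatal). Lines that do not fit this layout, such as stack trace frames, are skipped. The function and field names are illustrative.

```python
import re
from typing import Iterable, List, Tuple

# Assumed layout: [timestamp] threadId component L message
ENTRY = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<thread>[0-9a-fA-F]{8})\s+"
    r"(?P<component>\S+)\s+(?P<level>[A-Z])\s+(?P<message>.*)$"
)


def failures(lines: Iterable[str],
             levels: str = "EF") -> List[Tuple[str, str, str]]:
    """Return (timestamp, component, message) for error and fatal entries."""
    found = []
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            found.append((match.group("timestamp"),
                          match.group("component"),
                          match.group("message")))
    return found
```

Passing `levels="EFW"` would also collect warnings, which can help when the error entries alone do not explain the failure.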

PORTAL START TIME FAILURE

The following diagnostic logs and files are essential to troubleshooting a start time issue:

* SystemOut.log. This file shows what activities took place during Portal startup.
* SystemErr.log. This file shows what exceptions and errors occurred during Portal startup.

The following are the steps to diagnose this issue:

* Check for any Portal backend connectivity problem, for example with LDAP or the Portal Database.
* Check whether any WebSphere resource is not working properly, for example JMS, EJB, or JDBC.
* Check whether any WebSphere Portal Service could not be started.
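
The three checks above can be roughed out as a keyword tally over the startup log lines. This is a sketch, not a diagnosis tool: the keyword lists are hypothetical and should be replaced with terms that actually appear in your own logs.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical keywords for each check; tune them to your environment.
CATEGORIES: Dict[str, Tuple[str, ...]] = {
    "backend connectivity": ("ldap", "database"),
    "websphere resource": ("jms", "ejb", "jdbc"),
    "portal service": ("portal service",),
}


def categorize(lines: Iterable[str]) -> Counter:
    """Count log lines that mention each category of start time failure."""
    counts: Counter = Counter()
    for line in lines:
        lowered = line.lower()
        for category, keywords in CATEGORIES.items():
            # A line counts once per category, however many keywords match.
            if any(keyword in lowered for keyword in keywords):
                counts[category] += 1
    return counts
```

The category with the highest count is a reasonable place to start looking, although a single line may be the real cause.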

WEBSPHERE PORTAL RECOVERY STRATEGY

This section covers the strategies to consider if the analyses described above do not point to any meaningful action.

* PORTAL REINSTALLATION. This is the last resort if all the previous steps fail. It means reinstalling the Portal from scratch.
* PORTLET REINSTALLATION. This means reinstalling a Portlet, and applies if the failure only affects a specific Portlet.
* PAGE RECONFIGURATION. This means reconfiguring a Page, and applies if the failure only affects a specific Page.
* PORTAL LATEST WORKING BACKUP RESTORATION. This means restoring the Portal from the latest working backup. This is a major activity and will roll back all changes made to the Portal since that backup.
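
The choice between these strategies can be summarized as a small decision function. The sketch below assumes one ordering that the list does not spell out: when the whole Portal is affected, a backup restoration is attempted before a full reinstallation, since reinstallation is described as the last resort. The names used are illustrative.

```python
from enum import Enum


class Scope(Enum):
    """How much of the Portal is affected by the failure."""
    PAGE = "page"
    PORTLET = "portlet"
    PORTAL = "portal"


def choose_strategy(scope: Scope, working_backup_available: bool) -> str:
    """Pick the least disruptive recovery strategy for a failure scope."""
    if scope is Scope.PAGE:
        return "page reconfiguration"
    if scope is Scope.PORTLET:
        return "portlet reinstallation"
    # Whole-Portal failure: prefer the backup (assumed ordering).
    if working_backup_available:
        return "latest working backup restoration"
    return "portal reinstallation"
```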