Root cause analysis documentation

Root cause analysis (RCA) is the process of identifying the underlying causes of problems in order to propose effective solutions. RCA assumes that consistently preventing and addressing fundamental issues is far more effective than treating symptoms ad hoc and putting out fires. It uses a set of principles, tools, and techniques to uncover the underlying causes of an incident or trend. By looking beyond the obvious cause and effect, RCA can reveal where processes or systems failed or caused the incident in the first place.

Portfolios

RCAs are carried out for multi-customer events. When delivering a root cause analysis, we prioritize accuracy over speed while still aiming to meet communicated time frames.

Incident PF-Nov30th,2024-0700 UTC

Emergency Maintenance Interruption 

Internal Tracking: PF-Nov30th,2024-0700 UTC 

Issue Title: Degradation of hardware 

Product Impacted: Portfolios 

Service Impacted: Production and Sandbox 

Start Time: Nov 30th, 2024 07:00 AM UTC

End Time: Nov 30th, 2024 09:00 AM UTC

Issue Description: Operations will perform necessary maintenance on the infrastructure that serves Portfolios customers. During the maintenance window, application service may be temporarily interrupted. The expected outage is 5-10 minutes and will not exceed 30 minutes.

Root Cause: Our vendor notified Operations that the underlying hardware had been flagged for immediate retirement due to an unrecoverable fault.

Mitigation strategy: During the maintenance, Operations will evacuate client environments to more reliable hardware.

Incident PF-Nov10th,2024-0700 UTC

Emergency Maintenance Interruption 

Internal Tracking: PF-Nov10th,2024-0700 UTC 

Issue Title: Degradation of hardware 

Product Impacted: Portfolios 

Service Impacted: Production and Sandbox 

Start Time: Nov 10th, 2024 07:00 AM UTC

End Time: Nov 10th, 2024 11:00 AM UTC

Issue Description: Operations will perform necessary maintenance on the infrastructure that serves Portfolios customers. During the maintenance window, application service may be temporarily interrupted. The expected outage is 5-10 minutes and will not exceed 30 minutes.

Root Cause: Our vendor notified Operations that the underlying hardware had been flagged for immediate retirement due to an unrecoverable fault.

Mitigation strategy: During the maintenance, Operations will evacuate client environments to more reliable hardware.

Planview Incident Alert Nov 8th, 2024 UTC (final):

Unplanned Service Interruption Report

Internal Tracking:  N/A

Issue Title:  N/A

Product Impacted:  Portfolios

Service Impacted:  Site Availability

First Automated alert:  Nov. 8th, 2024, 08:03 AM UTC 

First Customer alert:  N/A 

Incident Start: Nov. 8th, 2024, 08:03 AM UTC  

Time of Resolution: Customers' production environments came back online between 10:08 AM UTC and 10:48 AM UTC. All sandbox environments were back online at 12:48 PM UTC.

Issue Description: Some clients' access to Portfolios sites was interrupted. All impacted customers were running Portfolios release 2023 or prior.

Resolution Description: Reverted Network Change

Root Cause: A modification to the DHCP option set caused Portfolios instances to fail to resolve the correct domain for Planview's feature flag relays. When DHCP leases expired and instances began resolving the incorrect domain, most Portfolios instances fell back to a local fallback file and continued to function properly. Instances running Portfolios releases older than 2023.10 stopped responding due to a known issue in legacy versions of Portfolios. The issue was fixed by reverting the modification and requiring all instances to refresh DNS.

Mitigation strategy: We recommend that all customers update their Portfolios instances to the most recent release. We are also closing gaps between the testing performed in our staging and production environments.
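
For illustration only, the fallback behavior described in the root cause above can be sketched as follows. This is a minimal sketch under stated assumptions, not Planview's implementation: the function names, the JSON fallback file, the example host name, and the injected fetch and resolver callables are all hypothetical. It shows an instance using the feature flag relay when the relay's host name resolves and reading a local fallback file when it does not.

```python
import json
import socket
import tempfile
from pathlib import Path


def relay_is_resolvable(host, resolver=socket.getaddrinfo):
    """Return True if DNS can resolve the relay host name."""
    try:
        resolver(host, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def load_flags(host, fallback_path, fetch, resolver=socket.getaddrinfo):
    """Use the relay when it resolves; otherwise read the local fallback file.

    `fetch` is a hypothetical callable that retrieves flags from the relay.
    Returns a (flags, source) tuple so callers can log which path was taken.
    """
    if relay_is_resolvable(host, resolver):
        return fetch(host), "relay"
    # Assumed fallback format: a JSON file stored on the instance.
    return json.loads(Path(fallback_path).read_text()), "fallback"


def failing_resolver(host, port):
    """Simulate the incident condition: the relay domain does not resolve."""
    raise socket.gaierror("name or service not known")


def demo():
    """Run the fallback path against a temporary fallback file."""
    with tempfile.TemporaryDirectory() as tmp:
        fallback = Path(tmp) / "flags.json"
        fallback.write_text(json.dumps({"new_ui": False}))
        return load_flags(
            "relay.example.invalid",  # placeholder host, not a real relay
            fallback,
            fetch=lambda h: {"new_ui": True},
            resolver=failing_resolver,
        )
```

Calling demo() exercises the failure path without touching the network, because the resolver is injected. The same injection point makes it possible to test in a staging environment how an instance behaves when the relay domain stops resolving.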