Root cause analysis (RCA) is the process of identifying the underlying causes of problems in order to propose effective solutions. RCA assumes that consistently preventing and addressing fundamental issues is far more effective than treating symptoms ad hoc and putting out fires. It uses a set of principles, tools, and techniques to uncover the underlying reasons for an incident or trend. By looking beyond the obvious cause and effect, RCA can reveal where processes or systems failed and led to the incident in the first place.
RCAs are carried out for events that affect multiple customers. When delivering a root cause analysis, we prioritize accuracy over speed while still aiming to meet communicated time frames.
Emergency Maintenance Interruption
Internal Tracking: PF-Nov30th,2024-0700 UTC
Issue Title: Degradation of hardware
Product Impacted: Portfolios
Service Impacted: Production and Sandbox
Start Time: Nov 30th, 2024 07:00 AM UTC
End Time: Nov 30th, 2024 09:00 AM UTC
Issue Description: Operations will perform necessary maintenance on infrastructure that serves Portfolios customers. During the maintenance window, application service may be temporarily interrupted. The expected outage is 5-10 minutes and will not exceed 30 minutes.
Root Cause: Our vendor notified Operations that the underlying hardware was recently flagged for immediate retirement due to an unrecoverable fault.
Mitigation strategy: During the maintenance, Operations will migrate client environments to more reliable hardware.
Emergency Maintenance Interruption
Internal Tracking: PF-Nov10th,2024-0700 UTC
Issue Title: Degradation of hardware
Product Impacted: Portfolios
Service Impacted: Production and Sandbox
Start Time: Nov 10th, 2024 07:00 AM UTC
End Time: Nov 10th, 2024 11:00 AM UTC
Issue Description: Operations will perform necessary maintenance on infrastructure that serves Portfolios customers. During the maintenance window, application service may be temporarily interrupted. The expected outage is 5-10 minutes and will not exceed 30 minutes.
Root Cause: Our vendor notified Operations that the underlying hardware was recently flagged for immediate retirement due to an unrecoverable fault.
Mitigation strategy: During the maintenance, Operations will migrate client environments to more reliable hardware.
Internal Tracking: N/A
Issue Title: N/A
Product Impacted: Portfolios
Service Impacted: Site Availability
First Automated alert: Nov. 8th, 2024, 08:03 AM UTC
First Customer alert: N/A
Incident Start: Nov. 8th, 2024, 08:03 AM UTC
Time of Resolution: Customer production environments came back online between 10:08 AM UTC and 10:48 AM UTC. All sandbox environments were back online at 12:48 PM UTC.
Issue Description: Some clients' access to Portfolios sites was interrupted. All impacted customers were running Portfolios release 2023 or earlier.
Resolution Description: Reverted the network change.
Root Cause: A modification to the DHCP option set caused Portfolios instances to fail to resolve the correct domain for Planview's feature flag relays. As DHCP leases expired and instances began resolving the incorrect domain, most Portfolios instances reverted to a local fallback file and continued to function properly. Portfolios instances older than 2023.10 failed to respond due to a known issue with legacy versions of Portfolios. The issue was fixed by reverting the modification and forcing all instances to refresh DNS.
Mitigation strategy: We recommend that all customers update their Portfolios instances to the most recent release. We are also closing gaps between testing in our staging and production environments.
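The fallback behavior described in the root cause above can be illustrated with a minimal sketch. This is not Portfolios code: the function names, the relay hostname, the JSON file format, and the `fetch_remote` callback are all hypothetical, chosen only to show the general pattern of using a remote feature flag relay when its name resolves and reading a local fallback file when it does not.

```python
import json
import socket
from pathlib import Path


def relay_is_resolvable(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves via DNS, False otherwise."""
    try:
        resolver(hostname, 443)
        return True
    except socket.gaierror:
        # Name resolution failed, for example because the instance
        # was handed an incorrect DNS domain.
        return False


def load_flags(relay_host, fallback_path, fetch_remote, resolver=socket.getaddrinfo):
    """Use the relay when its name resolves; otherwise read the local fallback file.

    fetch_remote is a hypothetical callable that retrieves flags from the relay.
    Returns a (flags, source) tuple so callers can log which path was taken.
    """
    if relay_is_resolvable(relay_host, resolver):
        return fetch_remote(relay_host), "relay"
    with Path(fallback_path).open(encoding="utf-8") as handle:
        return json.load(handle), "fallback"
```

Returning the source alongside the flags is a design choice for the sketch: it makes it easy to alert when instances silently switch to the fallback file, which is the kind of signal that could surface a DNS misconfiguration before older instances stop responding.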