
GDA-SMDS : RCA for unavailability of V1 Geo Search API

Created by User 7f21c on Oct 17, 2024

· Service Now Incident : INC7311269

· Issue Description : Users were unable to complete registration because the City field did not return any values.

· Affected API : https://opslocsearchapisweprod.azurewebsites.net

· TM URL : https://ops-locsearch-apis-prod.trafficmanager.net

· Apigee Endpoint : https://api.maersk.com/master-data/geography-search

· Github : https://github.com/Maersk-Global/SMDS%5FLocation

As part of regular maintenance activity, unused and unsupported Postgres Single Servers were dropped during resource cleanup. opsprodpg was one of the Postgres Single Servers dropped in that cleanup.

Because all the Single Servers had been migrated to Flexi Servers, it was assumed that this server was no longer in use, and opsprodpg was dropped along with the servers listed below as part of the cleanup.

Once the service was impacted, it took some time to identify the app service behind this API, which was still using the legacy standalone Postgres Single Server.

SMDS Geo released a V2 version of the API one year ago, operating on a new DB instance with a different schema. There were follow-ups with consumers, and most of them had moved from the V1 to the V2 version of the API. The V1 instance of the API was still connected to the legacy DB, which was accidentally dropped during this activity on the assumption that it was no longer in use one year after modernization. After the app service and Git repo were located, it was found that the application had been deployed through an ADO pipeline, which was deprecated.

There was no impact on the CMD application, since it operated independently on cached data during this outage.

Issue Mitigation :

The teams gathered, a new Flexi server was built, and a data restore using EMP triggers was initiated. While the team worked on the EMP triggers, a P1 case (#2410080030009343) was raised with Microsoft in parallel to restore the dropped server.

A Microsoft technician joined the call, and the server restore was initiated and completed successfully. After the restore, the app was restarted and service was restored.

Actions Taken to Avoid This Kind of Incident in Future :

· Before deleting any resource, schedule a scream test of that resource by placing a firewall rule that blocks all incoming and outgoing traffic. The scream test shows whether the resource is still used by any application.

· Implement locks on all resources at the resource group level. Only the group admin can clear a lock, and a resource can be deleted only after its lock is cleared.

· Implement a change record creation process for resource deletion, so that approval from the app owners is obtained before a resource is deleted.
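
The three safeguards above can be read as a single pre-deletion gate. The following is a minimal sketch of that idea, not an existing tool: the `DeletionRequest` fields, the `blockers` function and the 14-day minimum scream-test window are all hypothetical, since this RCA does not define a required duration or a data model.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical minimum scream-test window; the RCA does not specify one.
MIN_SCREAM_TEST_DAYS = 14


@dataclass
class DeletionRequest:
    """Hypothetical record of the checks done before a resource is dropped."""
    resource_name: str
    scream_test_start: Optional[date]  # day the blocking firewall rule was applied
    consumer_reported: bool            # someone "screamed" during the test
    lock_cleared_by_admin: bool        # resource lock removed by the group admin
    change_record_approved: bool       # change record approved by the app owners


def blockers(req: DeletionRequest, today: date) -> List[str]:
    """Return the reasons why the resource must not be deleted yet."""
    reasons = []
    if req.scream_test_start is None:
        reasons.append("no scream test has been scheduled")
    elif today - req.scream_test_start < timedelta(days=MIN_SCREAM_TEST_DAYS):
        reasons.append("scream test has not run for the minimum window")
    elif req.consumer_reported:
        reasons.append("scream test showed the resource is still in use")
    if not req.lock_cleared_by_admin:
        reasons.append("resource lock has not been cleared by the group admin")
    if not req.change_record_approved:
        reasons.append("change record has not been approved by the app owners")
    return reasons


# A request with none of the safeguards in place is blocked on all three.
unchecked = DeletionRequest("opsprodpg", None, False, False, False)
for reason in blockers(unchecked, date(2024, 10, 8)):
    print(f"{unchecked.resource_name}: {reason}")
```

Deletion would proceed only when `blockers` returns an empty list, which keeps each safeguard as an explicit, auditable condition.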

Further Steps :

· Support for Postgres Single Servers ends in March 2025. Hence the DB for the opslocsearchapi needs to be migrated to a new or existing Flexi server.

· GitHub pipelines need to be set up for application deployment.

· As this is a Prod API, it needs to be deployed in multiple regions.

· The V1 service is used by http://Maersk.com, which needs to be migrated to the V2 service. That will make it possible to decommission the legacy V1 service.

Completed Actions :

· GitHub Actions has been set up for this legacy API to deploy to Preprod and Prod.

Below are the Important Time Stamps :

Events | Time Stamp
Server Dropped Time | Tue Oct 08 2024 10:19:00 GMT
Issue Start Time | Tue Oct 08 2024 10:19:00 GMT
Issue Reported Time |
Initial Investigations | Tue Oct 08 2024 11:51:00 GMT
P1 Case Open Time with Microsoft | Tue Oct 08 2024 14:30:00 GMT
Server Restore Start Time | Tue Oct 08 2024 16:00:00 GMT
Server Restore End Time | Tue Oct 08 2024 16:30:00 GMT
Application Restart | Tue Oct 08 2024 16:30:00 GMT
Service Restored Time | Tue Oct 08 2024 16:33:00 GMT
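
The intervals implied by these time stamps can be derived directly. The sketch below parses the values exactly as listed above and computes the elapsed times; the dictionary keys and helper names are illustrative only, and the Issue Reported Time is left out because it is not recorded.

```python
from datetime import datetime, timedelta

# Format of the time stamps as they appear in the table above.
TS_FORMAT = "%a %b %d %Y %H:%M:%S GMT"


def parse_ts(text: str) -> datetime:
    """Parse a time stamp such as 'Tue Oct 08 2024 10:19:00 GMT'."""
    return datetime.strptime(text, TS_FORMAT)


timeline = {
    "server_dropped": parse_ts("Tue Oct 08 2024 10:19:00 GMT"),
    "initial_investigation": parse_ts("Tue Oct 08 2024 11:51:00 GMT"),
    "p1_case_opened": parse_ts("Tue Oct 08 2024 14:30:00 GMT"),
    "restore_started": parse_ts("Tue Oct 08 2024 16:00:00 GMT"),
    "restore_ended": parse_ts("Tue Oct 08 2024 16:30:00 GMT"),
    "service_restored": parse_ts("Tue Oct 08 2024 16:33:00 GMT"),
}


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named events in the timeline."""
    return timeline[end] - timeline[start]


total_outage = elapsed("server_dropped", "service_restored")
time_to_investigation = elapsed("server_dropped", "initial_investigation")
time_to_p1_case = elapsed("server_dropped", "p1_case_opened")
restore_duration = elapsed("restore_started", "restore_ended")

print("Total outage:", total_outage)                    # 6:14:00
print("Drop to investigation:", time_to_investigation)  # 1:32:00
print("Drop to P1 case:", time_to_p1_case)              # 4:11:00
print("Restore duration:", restore_duration)            # 0:30:00
```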

Servers Dropped on Tue Oct 08 2024:

NAME | RESOURCE TYPE | STATUS | HIGH AVAILABILITY | RESOURCE GROUP | LOCATION | SUBSCRIPTION
opspprdpg | Azure Database for PostgreSQL single server | Available | -- | wepprg01 | West Europe | Maersk LS CD Operation Master Data Management Prod 01
opsprodpg | Azure Database for PostgreSQL single server | Available | -- | weprodrg01 | West Europe | Maersk LS CD Operation Master Data Management Prod 01
pg-opsmdm-preprod-westeu | Azure Database for PostgreSQL single server | Available | -- | rg-db-preprod-westeu-001 | West Europe | Maersk LS CD Operation Master Data Management Prod 01
pg-opsmdm-prod-eus2-001 | Azure Database for PostgreSQL single server | Available | -- | rg-db-prod-eus2-001 | East US 2 | Maersk LS CD Operation Master Data Management Prod 01
pg-opsmdm-prod-westeu-001 | Azure Database for PostgreSQL single server | Available | -- | rg-db-prod-westeu-001 | West Europe | Maersk LS CD Operation Master Data Management Prod 01
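
One way to catch a live dependency before a cleanup like this is to search application settings for the host names of the servers about to be dropped. The following is a rough sketch under stated assumptions: the `find_consumers` helper, the setting keys and the connection-string values are made up for illustration and are not the real configuration. Only the server names and the app service name come from this page, and the host suffix is the standard one for Azure Database for PostgreSQL.

```python
from typing import Dict, List

# Host name suffix used by Azure Database for PostgreSQL servers.
PG_HOST_SUFFIX = ".postgres.database.azure.com"

# Servers from the list above.
candidate_servers = [
    "opspprdpg",
    "opsprodpg",
    "pg-opsmdm-preprod-westeu",
    "pg-opsmdm-prod-eus2-001",
    "pg-opsmdm-prod-westeu-001",
]

# Illustrative app settings only: keys and values are assumptions.
app_settings = {
    "opslocsearchapisweprod": {
        "DB_CONNECTION": "Host=opsprodpg.postgres.database.azure.com;Database=example",
    },
    "example-v2-api": {
        "DB_CONNECTION": "Host=example-flexi.postgres.database.azure.com;Database=example",
    },
}


def find_consumers(servers: List[str],
                   settings_by_app: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each server to the apps whose settings mention its host name."""
    hits: Dict[str, List[str]] = {server: [] for server in servers}
    for app, settings in settings_by_app.items():
        for value in settings.values():
            for server in servers:
                host = f"{server}{PG_HOST_SUFFIX}"
                if host in value.lower() and app not in hits[server]:
                    hits[server].append(app)
    return hits


consumers = find_consumers(candidate_servers, app_settings)
for server, apps in consumers.items():
    status = ", ".join(apps) if apps else "no references found"
    print(f"{server}: {status}")
```

A server with no hits is not proven unused, since connection details may live in key vaults or code, so a check like this would complement the scream test rather than replace it.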

Azure P1 Case Info :

https://portal.azure.com/#view/Microsoft%5FAzure%5FSupport/SupportRequestDetails.ReactView/id/%2Fsubscriptions%2F7346298c-5bda-4767-b67a-e25836c6f20e%2Fproviders%2Fmicrosoft.support%2Fsupporttickets%2F06bfd9d3-404eb2fd-81334dd3-9c07-4405-8f89-f3e959932047

Azure Activity Logs :

"C:\Users\KSH146\OneDrive - Maersk Group\Back_up\RCA\Azure_Actiivty_Logs.csv"

Resources Dropped :

"C:\Users\KSH146\OneDrive - Maersk Group\Back_up\RCA\Azure_Actiivty_Logs.csv"
