GDA-SMDS : RCA for unavailability of V1 Geo Search API
Created by User 7f21c on Oct 17, 2024
· Service Now Incident : INC7311269
· Issue Description : Users were unable to complete registration because the City field did not return any values.
· Affected API : https://opslocsearchapisweprod.azurewebsites.net
· TM URL : https://ops-locsearch-apis-prod.trafficmanager.net
· Apigee Endpoint : https://api.maersk.com/master-data/geography-search
· Github : https://github.com/Maersk-Global/SMDS%5FLocation
As part of a regular maintenance activity, unused and unsupported Postgres Single Servers were dropped during resource cleanup. opsprodpg was one of the Postgres Single Servers dropped in this cleanup.
Since all the Single Servers had been migrated to Flexi Servers, it was assumed that this server was no longer in use, and opsprodpg was dropped along with the servers listed below as part of the cleanup.
Once the service was impacted, it took me some time to identify the app service responsible for this API, which was still using the legacy standalone Postgres Single Server.
SMDS Geo introduced the V2 version of the API one year ago, operating on a new DB instance and a different schema. There were follow-ups with consumers, and most of them had moved from the V1 to the V2 version of the API. The V1 instance of the API was still connected to the legacy DB, which was accidentally dropped during this activity on the assumption that it was no longer in use one year after the modernization. After finding the app service and Git repo, it was found that the application had been deployed using an ADO pipeline, which had since been deprecated.
There was no impact to the CMD application, since it operated independently on cached data during this outage.
Issue Mitigation :
The teams gathered, built a new Flexi Server, and initiated a data restore using EMP Triggers. While the team worked on the EMP Triggers, a P1 case (#2410080030009343) was raised with Microsoft in parallel to restore the dropped server.
A Microsoft technician joined the call, and the server restore was initiated and completed successfully. After the restore, the app was restarted and service was restored.
Actions Taken to Avoid Similar Incidents in Future:
· Before deleting any resource, schedule a scream test of that resource by placing a firewall rule that blocks all incoming and outgoing traffic. The scream test reveals whether the resource is still used by any application.
· Implement locks for all resources at the resource group level. Only the group admin can clear a lock, and only then can the resources be deleted.
· Implement a change record creation process for resource deletion, so that deletions require approval from the app owners.
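The three safeguards above can be read as a pre-deletion gate. The following is a minimal sketch of that idea in Python, not an existing tool: the `DeletionRequest` type, its field names, and the 14-day scream test window are all hypothetical and would need to be agreed with the platform team.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical minimum length of a scream test; the RCA does not specify one.
MIN_SCREAM_TEST_DAYS = 14


@dataclass
class DeletionRequest:
    """Hypothetical record of the checks done before dropping a resource."""
    resource_name: str
    scream_test_days: int          # days the firewall block was in place
    complaints_during_test: int    # incidents raised while traffic was blocked
    lock_cleared_by_admin: bool    # resource group lock removed by the group admin
    change_record_approved: bool   # change record approved by the app owners


def blockers(req: DeletionRequest) -> List[str]:
    """Return the reasons why the resource must not be deleted yet."""
    reasons = []
    if req.scream_test_days < MIN_SCREAM_TEST_DAYS:
        reasons.append("scream test not run long enough")
    if req.complaints_during_test > 0:
        reasons.append("resource is still in use")
    if not req.lock_cleared_by_admin:
        reasons.append("resource lock not cleared by group admin")
    if not req.change_record_approved:
        reasons.append("change record not approved by app owners")
    return reasons


def may_delete(req: DeletionRequest) -> bool:
    """A resource may be deleted only when no blocker remains."""
    return not blockers(req)
```

For example, a request for opsprodpg with no scream test, no cleared lock, and no approved change record would be rejected with three blockers, which is the situation this incident started from.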
Further Steps :
· Support for Postgres Single Servers ends in March 2025. Hence the DB for opslocsearchapi needs to be migrated to a new or existing Flexi Server.
· GitHub pipelines need to be set up for application deployment.
· Since it is a Prod API, it needs to be deployed in multiple regions.
· The V1 service is used by http://Maersk.com, which needs to be migrated to the V2 service. This will clear the way to decommission the legacy V1 service.
Completed Actions :
· GitHub Actions is set up for this legacy API to deploy to Preprod and Prod.
Below are the Important Time Stamps :
| Events | Time Stamp |
|---|---|
| Server Dropped Time | Tue Oct 08 2024 10:19:00 GMT |
| Issue Start Time | Tue Oct 08 2024 10:19:00 GMT |
| Issue Reported Time | |
| Initial Investigations | Tue Oct 08 2024 11:51:00 GMT |
| P1 Case open Time with Microsoft | Tue Oct 08 2024 14:30:00 GMT |
| Server Restore Start Time | Tue Oct 08 2024 16:00:00 GMT |
| Server Restore End Time | Tue Oct 08 2024 16:30:00 GMT |
| Application Restart | Tue Oct 08 2024 16:30:00 GMT |
| Service Restored Time | Tue Oct 08 2024 16:33:00 GMT |
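The durations implied by the time stamps above can be derived with a short script. This is a sketch only: the dictionary keys are illustrative names chosen here, and the time stamps are copied from the table (the empty Issue Reported Time is left out).

```python
from datetime import datetime, timezone

# Format of the time stamps in the table, without the trailing "GMT".
FMT = "%a %b %d %Y %H:%M:%S"

# Key names are illustrative; values are copied from the table above (GMT).
EVENTS = {
    "server_dropped": "Tue Oct 08 2024 10:19:00",
    "investigation_started": "Tue Oct 08 2024 11:51:00",
    "p1_case_opened": "Tue Oct 08 2024 14:30:00",
    "restore_started": "Tue Oct 08 2024 16:00:00",
    "restore_ended": "Tue Oct 08 2024 16:30:00",
    "service_restored": "Tue Oct 08 2024 16:33:00",
}


def parse(ts: str) -> datetime:
    """Parse a table time stamp as a timezone-aware UTC datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    delta = parse(EVENTS[end]) - parse(EVENTS[start])
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("Total outage (min):", minutes_between("server_dropped", "service_restored"))
    print("Drop to investigation (min):", minutes_between("server_dropped", "investigation_started"))
    print("Drop to P1 case (min):", minutes_between("server_dropped", "p1_case_opened"))
    print("Restore duration (min):", minutes_between("restore_started", "restore_ended"))
```

On these figures the outage lasted 374 minutes (6 h 14 min), of which the server restore itself took 30 minutes.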
Servers Dropped on Tue Oct 08 2024:
| NAME | RESOURCE TYPE | STATUS | HIGH AVAILABILITY | RESOURCE GROUP | LOCATION | SUBSCRIPTION |
|---|---|---|---|---|---|---|
| opspprdpg | Azure Database for PostgreSQL single server | Available | -- | wepprg01 | West Europe | Maersk LS CD Operation Master Data Management Prod 01 |
| opsprodpg | Azure Database for PostgreSQL single server | Available | -- | weprodrg01 | West Europe | Maersk LS CD Operation Master Data Management Prod 01 |
| pg-opsmdm-preprod-westeu | Azure Database for PostgreSQL single server | Available | -- | rg-db-preprod-westeu-001 | West Europe | Maersk LS CD Operation Master Data Management Prod 01 |
| pg-opsmdm-prod-eus2-001 | Azure Database for PostgreSQL single server | Available | -- | rg-db-prod-eus2-001 | East US 2 | Maersk LS CD Operation Master Data Management Prod 01 |
| pg-opsmdm-prod-westeu-001 | Azure Database for PostgreSQL single server | Available | -- | rg-db-prod-westeu-001 | West Europe | Maersk LS CD Operation Master Data Management Prod 01 |
Azure P1 Case Info :
Azure Activity Logs :
"C:\Users\KSH146\OneDrive - Maersk Group\Back_up\RCA\Azure_Actiivty_Logs.csv"
Resources Dropped :
"C:\Users\KSH146\OneDrive - Maersk Group\Back_up\RCA\Azure_Actiivty_Logs.csv"