Today’s post is a real incident that happened with me this week. I want to share it with you all while details are still fresh in my memory. It’s an interesting case of interrupted scheduled SQL agent jobs after time zone modifications on server running SQL Server. I would like to share the story behind the issue, in case you don’t have spare time or not interested in the story, you may skip it and move directly to the issue and its resolution.
The story behind:
So here it is. I am in database consultant role for all the clients that my organization supports. I am being on boarded to a renowned banking firm which operates worldwide. I was introduced to East Asia region databases for this firm a month back and I have completed my initial evaluation of the environment by now. I have identified various actions items that we need to perform to make sure the databases are maintained in a standard way, following security compliances and performing optimally with minimum downtime. Well, all that’s said in one line is months of work. After a series of approvals working with senior management and clients, we agreed upon actions, related changes and timelines.
My first task was on Backups. The firm uses third party tools to take backups directly to tapes. This tool backups everything including VM Snapshots, Exchange servers, and Organizations data etc. of over 50 TB. To my surprise the backup strategy involved taking incremental/differential backups from Monday to Thursday and then taking one single full backup from Friday to Sunday. The reason I found for this single full backup for three days is that there are only two tape controllers where tapes can be mounted and the devices have lot of work to back up all the organization data within 3 days (Fri-Sun). Well whatsoever may be the cause but the gaps are scary. Just in case there is crash on any one of these days, we may lose more than 24 hours of data. This sounds too big mistake for a banking firm operating world-wide. Immediate action was suggested to enable copy only backups from Fri to Sun while a better backup strategy is planned keeping implementation costs in mind. I worked with stakeholders on deciding time window for backups to take place to avoid existing full backup job schedule, business hours on Friday or any other maintenance job conflicts over weekend. Time window was provided to me in IST which I very quickly converted to Client’s time zone and got approvals.
The mistake:
I was working on server to configure the job schedule which is when I looked the server time and found that the Server’s time zone is configured as IST. At this point I was cancelling the time zone change settings for server and due to network/server lag my cancel click somehow passed onto Ok button. Damn! The server time got changed, there wasn’t any confirmation message, no alert! I quickly reverted the change. I checked the server but couldn’t figure out any impact on server due to this until calls started flowing in after 30 minutes from client team complaining about the issue.
The Issue:
The complaints were about stopped file processing which was configured on server using SQL agent jobs. There were number of SQL Agent jobs configured on the server running every 5-10 minutes which were interconnected. One jobs executes, processes something and calls another one. A complex settings which I am sure is for a complex banking scenario.
The SQL jobs were not running and several job runs were missed delaying the complete process and piling up the unprocessed files on initial folder. Clients found stale data of over 30 minutes which is when they started complaining.
The Cause:
I investigated and found the though the jobs were enabled and scheduled properly and the time zone was also reverted within seconds but the damage was already done. As soon as the time change happened, the Agent job which were scheduled to run that very second was triggered and got completed. The job ran at 1:30 PM IST which is 3:30 PM in client’s time zone. The job triggered when server clock was 3:30 PM, this changed the next scheduled run time to 3:35 PM. Despite the time zone change to IST the next scheduled run time was still showing 3:35 PM for which we still had about 1.5 hours to go.
Resolution:
The issue could have self-resolved after waiting for SQL jobs to run at next scheduled run time i.e. 3:35 PM IST – a gap of 1.5 hours more. Which is definitely a suicide and you understand why!!
I tried opening the job schedule from GUI, clicking OK/Save and I also tried to update the next scheduled run time column in various system tables within MSDB database but nothing worked.
What worked was quite simple. The issue got fixed when I disabled the schedule from jobs and saved it and later re-enabled the schedule and saved the jobs.
Another resolution to this problem would be to add a new schedule within same job to run and make it for the duration between current time and next schedule runtime.
Sometimes we do simple mistakes which carry and cause big issues. Resolutions to these are also pretty simple rather than something complex and deep within system settings. Though I am always careful, it was learning to be a bit more careful while running activities on Production servers.
I shared the story intentionally to make sure this issue resides in your memory for long.
Hope today’s learning was useful as well as fun. Happy Learning folks!!
Comentários