Walkthrough of diagnosing a SQL Server error – Part 2
- by Scott Whigham on June 10, 2009 2:00 PM

In Part 1, I explained the “start” of the problem. I put “start” in quotes because what I described was/is just the beginning… Here’s where I was: I have a job that has hung and it is preventing me from taking any backups on the server. Let’s walk through a few options:
First Action: Stop the job
At this point, I felt that the most logical thing to do next was to stop the job. I loaded up SSMS and went to the Activity Monitor. I located the job that was still executing an xp_cmdshell job and I told it to stop. And it tried and tried and tried… but it failed.
Second Action: Re-scour the logs
Okay – so back to sp_who and a few DMVs. After reviewing the current sessions and what they were doing, I started reviewing the logs a little more intensely. I had already seen errors in the log, but with one exception (Screenshot #1) they were not unusual or unexpected given the situation. Now that I had tried to stop the job, to no real effect, there were new errors:
Screenshot #1: An error from earlier today (prior to attempting to stop the job)
The “Not continuing to wait” part is important for the errors I received after manually stopping the job:
Screenshot #2: An error attempting to stop the job
Okay – so the “Continuing to wait” bit is interesting… Notice that the wait time is 300ms. As of this writing, I continue to receive this error:
Screenshot #3: Just taken – wait time of 25500ms
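For readers who want to do the same kind of session review, here is a minimal sketch of the sort of DMV queries that fit this situation. These are not the exact queries I ran; they are an illustration that assumes you have VIEW SERVER STATE permission. The first lists active user requests along with what they are waiting on, and the second narrows in on tasks waiting on page (buffer) latches.

```sql
-- Sketch only: active user requests and their current waits,
-- longest wait first. Column names are from the documented DMVs.
SELECT r.session_id,
       s.login_name,
       r.status,
       r.command,
       r.wait_type,
       r.wait_time,              -- milliseconds spent in the current wait
       r.wait_resource,
       r.blocking_session_id
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = r.session_id
WHERE s.is_user_process = 1
ORDER BY r.wait_time DESC;

-- Sketch only: tasks currently waiting on page latches.
-- Buffer latch waits surface as PAGELATCH_* or PAGEIOLATCH_* wait types.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description,
       wt.blocking_session_id
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE N'PAGELATCH%'
   OR wt.wait_type LIKE N'PAGEIOLATCH%'
ORDER BY wt.wait_duration_ms DESC;
```

A wait_duration_ms value that keeps climbing for the same session between runs is consistent with the growing wait time shown in the log messages.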
Off I go to search for the error “Error message 844: Time out occurred while waiting for buffer latch -- type 2…Continuing to wait.” Not much was found – I did find that CU5 for SQL Server 2008 RTM featured a fix for this - http://support.microsoft.com/default.aspx?scid=kb;en-us;968543&sd=rss&spid=13165. I also saw evidence that this was an indication of potential disk problems. The machine was not in any way under stress.
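Before deciding whether a cumulative update applies, it helps to know exactly which build the instance is running. A minimal sketch, using the documented SERVERPROPERTY function; compare the result against the build listed in the KB article rather than against any number assumed here:

```sql
-- Sketch only: report the instance's build, service pack level, and edition.
SELECT SERVERPROPERTY('ProductVersion') AS product_version,
       SERVERPROPERTY('ProductLevel')   AS product_level,
       SERVERPROPERTY('Edition')        AS edition;
```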
Third Action: chkdsk
Since an I/O buffer latch timeout is often an indicator of disk problems, I ran chkdsk. chkdsk reported that there were minor errors. However, the start of a weekday is not the time to bring the server down unless there is an error preventing work. Ugh – what to do… I’ve seen chkdsk run for 15 hours straight trying to repair clusters – I didn’t have that kind of time. I can’t take a backup… Ugh…
By this time, I’m starting to get a bit worried about losing data. Currently, our users are still able to use the database yet no SQL Server authentication logins are being allowed in.
Yes, you read that right – no new SQL authentication logins are allowed to log in:
Screenshot #4: Attempting to log in using SQL Server authentication
I’m able to connect using Windows authentication though…
Continued in the next post…



