Walkthrough of diagnosing a SQL Server error – Part 2
- by Scott Whigham on June 10, 2009 2:00 PM
In Part 1, I explained the “start” of the problem. I put “start” in quotes because what I described was/is just the beginning… Here’s where I was: I have a job that has hung and it is preventing me from taking any backups on the server. Let’s walk through a few options:
First Action: Stop the job
At this point, I felt that the most logical thing to do next was to stop the job. I loaded up SSMS and went to the Activity Monitor. I located the job, which was still executing an xp_cmdshell command, and told it to stop. And it tried and tried and tried… but it failed.
Second Action: Re-scour the logs
Okay – so back to sp_who and a few DMVs. After reviewing the current sessions and what they were doing, I started reviewing the logs a little more intensely. I had seen errors in the log already, but they were not unusual or unexpected given the situation, except for one (Screenshot #1). Now that I had tried to stop the job with no real effect, though, there were new errors:
Screenshot #1: An error from earlier today (prior to attempting to stop the job)
The “Not continuing to wait” part is important for the errors I received after manually stopping the job:
Screenshot #2: An error attempting to stop the job
Okay – so the “Continuing to wait” bit is interesting… Notice that the wait time is 300ms. As of this writing, I continue to receive this error:
Screenshot #3: Just taken – wait time of 25500ms
Off I go to search for the error “Error message 844: Time out occurred while waiting for buffer latch -- type 2…Continuing to wait.” Not much was found – I did find that CU5 for SQL Server 2008 RTM featured a fix for this - http://support.microsoft.com/default.aspx?scid=kb;en-us;968543&sd=rss&spid=13165. I also saw evidence that this was an indication of potential disk problems. The machine was not in any way under stress.
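One way to keep an eye on the climbing wait time without re-reading the log by hand is to pull the number out of each repeated 844 message. The following is only a minimal sketch: the sample lines are hypothetical and heavily abbreviated, the `waittime` field name is my assumption about how the message is laid out, and the helper names (`extract_wait_times`, `is_still_growing`) are made up for illustration.

```python
import re

# Matches the wait time field in a buffer latch timeout message.
# ASSUMPTION: the message carries a "waittime N" field; adjust for your log.
WAITTIME_RE = re.compile(r"buffer latch.*?waittime (\d+)", re.IGNORECASE)


def extract_wait_times(log_lines):
    """Return the wait time from each buffer latch timeout line, in log order."""
    times = []
    for line in log_lines:
        match = WAITTIME_RE.search(line)
        if match:
            times.append(int(match.group(1)))
    return times


def is_still_growing(times):
    """True if there are at least two values and each is larger than the last."""
    return len(times) > 1 and all(a < b for a, b in zip(times, times[1:]))


# Hypothetical, abbreviated log lines; the values are illustrative only.
SAMPLE_LOG = [
    "Time out occurred while waiting for buffer latch -- type 2, waittime 300. Continuing to wait.",
    "Login failed for user 'example'.",
    "Time out occurred while waiting for buffer latch -- type 2, waittime 600. Continuing to wait.",
    "Time out occurred while waiting for buffer latch -- type 2, waittime 25500. Continuing to wait.",
]

if __name__ == "__main__":
    times = extract_wait_times(SAMPLE_LOG)
    print(times)
    print(is_still_growing(times))
```

If the values keep increasing from one message to the next, the same request is most likely still blocked on the same latch, rather than new requests timing out one after another.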
Third Action: chkdsk
Since the I/O buffer latch is often an indicator of disk problems, I ran chkdsk. chkdsk reported that there were minor errors. However, the start of a weekday is not the time to bring the server down unless there is an error preventing work. Ugh – what to do… I’ve seen chkdsk run for 15 hours straight trying to repair clusters – I didn’t have that kind of time. I can’t take a backup… Ugh…
By this time, I’m starting to get a bit worried about losing data. Currently, our users are still able to use the database yet no SQL Server authentication logins are being allowed in.
Yes, you read that right – no new SQL authentication logins are allowed to log in:
Screenshot #4: Attempting to login using SQL Server authentication
I’m able to connect using Windows authentication though…
Continued in the next post…



