
Walkthrough of diagnosing a SQL Server error – Part 2

In Part 1, I explained the “start” of the problem. I put “start” in quotes because what I described was/is just the beginning… Here’s where I was: I have a job that has hung and it is preventing me from taking any backups on the server. Let’s walk through a few options:

First Action: Stop the job

At this point, I felt that the most logical thing to do next was to stop the job. I loaded up SSMS and went to the Activity Monitor. I located the job that was still executing an xp_cmdshell job and I told it to stop. And it tried and tried and tried… but it failed.
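The same stop request can also be issued from T-SQL instead of the SSMS user interface. The following is only a rough sketch, not a record of what was run here: the job name `NightlyExport` is a hypothetical placeholder, and it assumes the login has rights on msdb to manage SQL Server Agent jobs.

```sql
-- List SQL Server Agent jobs that are currently executing
-- (@execution_status = 1 means "Executing").
EXEC msdb.dbo.sp_help_job @execution_status = 1;

-- Ask SQL Server Agent to stop the job.
-- N'NightlyExport' is a placeholder; substitute the real job name.
EXEC msdb.dbo.sp_stop_job @job_name = N'NightlyExport';

-- See what the user sessions are doing after the stop request.
SELECT r.session_id, r.status, r.command, r.wait_type, r.wait_time
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
    ON s.session_id = r.session_id
WHERE s.is_user_process = 1;
```

If the session behind the job still shows up in the last query with a growing wait_time, the stop request has been accepted but the work has not actually ended, which matches the "tried and tried" behavior described above.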

Second Action: Re-scour the logs

Okay – so back to sp_who and a few DMVs. After reviewing the current sessions and what they were doing, I started reviewing the logs a little more intensely. I had already seen errors in the log, but with one exception (Screenshot #1) they were not unusual or unexpected given the situation. Now that my attempt to stop the job had had no real effect, though, there were new errors:

NoContinuing

Screenshot #1: An error from earlier today (prior to attempting to stop the job)

The “Not continuing to wait” part is important for the errors I received after manually stopping the job:

Continuing

Screenshot #2: An error attempting to stop the job

Okay – so the “Continuing to wait” bit is interesting… Notice that the wait time is 300ms. As of this writing, I continue to receive this error:

Latest

Screenshot #3: Just taken – wait time of 25500ms

Off I go to search for the error “Error message 844: Time out occurred while waiting for buffer latch -- type 2…Continuing to wait.” Not much was found – I did find that CU5 for SQL Server 2008 RTM featured a fix for this - http://support.microsoft.com/default.aspx?scid=kb;en-us;968543&sd=rss&spid=13165. I also saw evidence that this was an indication of potential disk problems. The machine was not in any way under stress.
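One way to follow up on the "potential disk problems" lead without taking anything offline is to look at latch waits and I/O stalls through the DMVs. This is a minimal sketch under assumptions: it assumes SQL Server 2005 or later and VIEW SERVER STATE permission, and the wait-type filter is simply a pattern chosen to catch both PAGELATCH and PAGEIOLATCH waits. It does not prove or rule out a disk fault on its own.

```sql
-- Tasks currently waiting on page latches, longest waits first.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id,
       wt.resource_description
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE N'PAGE%LATCH%'
ORDER BY wt.wait_duration_ms DESC;

-- Cumulative I/O stall time per database file since the instance started.
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       vfs.num_of_reads,
       vfs.io_stall_read_ms,
       vfs.num_of_writes,
       vfs.io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;
```

A file whose stall totals are far out of line with the others is a reasonable place to look next, though the numbers are cumulative and need to be read against how long the instance has been up.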

Third Action: chkdsk

Since the I/O buffer latch is often an indicator of disk problems, I ran chkdsk. chkdsk reported that there were minor errors. However, the start of a weekday is not the time to bring the server down unless an error is preventing work. Ugh – what to do… I’ve seen chkdsk run for 15 hours straight trying to repair clusters – I didn’t have that kind of time. I can’t take a backup… Ugh…

By this time, I’m starting to get a bit worried about losing data. Currently, our users are still able to use the database yet no SQL Server authentication logins are being allowed in.

Yes, you read that right – no new SQL authentication logins are allowed to log in:

CannotConnect

Screenshot #4: Attempting to log in using SQL Server authentication

I’m able to connect using Windows authentication though…
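From a Windows-authenticated connection, it is possible to check how the existing sessions authenticated, which helps confirm whether any SQL Server authentication sessions are still connected. A minimal sketch, assuming VIEW SERVER STATE permission; the column choice is only illustrative:

```sql
-- Current user sessions and the authentication scheme each one used.
-- auth_scheme is SQL for SQL Server authentication,
-- NTLM or KERBEROS for Windows authentication.
SELECT s.session_id,
       s.login_name,
       s.host_name,
       c.auth_scheme,
       c.connect_time
FROM sys.dm_exec_sessions AS s
JOIN sys.dm_exec_connections AS c
    ON c.session_id = s.session_id
WHERE s.is_user_process = 1
ORDER BY c.connect_time DESC;
```

If rows with auth_scheme = SQL appear only with connect times from before the problem started, that would be consistent with existing users still working while new SQL logins are refused.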

Continued in the next post…
