Windows Server 2003 constantly unresponsive

April 18, 2011 at 11:00:01
Specs: Windows Server 2003 Enterprise x64 SP2, i7-2600K
I put a new Win2k3 server into production yesterday and almost immediately I started experiencing loss of responsiveness by the server. It happens every few minutes, and sometimes every few seconds. When it goes unresponsive, everything on the machine seems to come to a halt for a period of a few seconds to a minute or two. Network utilization drops to zero. In performance monitor, a completely blank space is left in the graph. It's like the system literally doesn't do ANYTHING for the duration of the halt. Then it just resumes operation until it happens again. There are no errors coming up in the Event Viewer.

Prior to putting this system into production I spent 3 full days messing with CPU and memory clock speeds and stress testing the system and never experienced this problem. I only began experiencing it once it had been put into production at the datacenter and I started accessing it remotely via RDP. The system is running SQL Server 2000 SP4 and is also serving some websites in IIS.

System hardware:
i7-2600K CPU
GIGABYTE GA-P67A-D3-B3 Motherboard
4 x 4GB Mushkin Blackline DDR3 1866 Memory Model 996986
Dynatron K199 CPU Cooler
OCZ Vertex 30GB SSD (used for SQL Server and its database files only)
2 x Samsung F4 2TB HDDs, RAID1 using onboard Intel Rapid Storage (Boot drive and all web files)

OS: Windows Server 2003 Enterprise x64 SP2 (fully updated including drivers)

I had initially deployed the system overclocked to 4.5GHz with the memory running at 1866 because it seemed stable at these settings. However, I have since reduced it to 3.2GHz and 1333 just to rule out overclocking as a cause. I have had Core Temp running the whole time, and the system is not reaching unreasonable temperatures.

I'm at a loss as to what to try next to troubleshoot as I've never experienced a problem quite like this and have yet to stumble on any clues as to the reason.



#1
April 18, 2011 at 13:33:54
I would suspect the overclocking damaged mainboard components.
You might want to look at the running processes when the lag happens to see whether something is grabbing the CPU exclusively.
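
One way to catch the halts without watching the screen is a small heartbeat logger: a loop that records a timestamp at a fixed interval, so that any system-wide stall shows up as an oversized gap between two consecutive timestamps. The following is only a minimal sketch, assuming Python is available on the server; the function names `record_heartbeats` and `find_stalls`, the 0.5-second interval, and the 2x tolerance are illustrative choices, not part of the original setup.

```python
import time


def record_heartbeats(duration, interval=0.5):
    """Record a monotonic timestamp roughly every `interval` seconds
    for `duration` seconds and return the list of timestamps."""
    beats = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        beats.append(time.monotonic())
        time.sleep(interval)
    return beats


def find_stalls(timestamps, interval, tolerance=2.0):
    """Return (start, gap) pairs for every pair of consecutive heartbeats
    that are more than `interval * tolerance` seconds apart."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > interval * tolerance:
            stalls.append((prev, gap))
    return stalls
```

Calling `find_stalls(record_heartbeats(600), interval=0.5)` while web traffic is routed to the server would list each pause and its length, which can then be lined up against Task Manager or Performance Monitor data from the same period.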

I would never put a database on an SSD unless it was being replicated to another system. If that SSD dies, you will lose whatever data was not backed up, which could be substantial if it's being used for financial transactions.

Answers are only as good as the information you provide.
How to properly post a question:
Sorry no tech support via PM's



#2
April 18, 2011 at 14:18:38
The database is backed up nightly, and the loss of one day's transactions in this case is not a big deal, as it is simply user-posted web content (e.g. forums) and no financial transactions or anything similar.

There is no measured CPU spike when it lags, and no increase in CPU temp.

The damage due to overclocking theory is one that I've considered but I'm definitely not to the point where I'm ready to give up on troubleshooting and throw it in the trash.



#3
April 18, 2011 at 15:07:13
You might want to run memtest on the system.
I would clear the BIOS, perhaps choose one of the default profiles, and make only minimal changes such as date/time and boot order.




#4
April 18, 2011 at 15:50:16
I too was thinking that a memtest run would be a good idea, just to make sure. For now I'm doing as much testing as possible without driving to the DC and taking the system offline.

About 20 minutes ago I re-routed all web traffic to a different server so now this machine is only processing SQL, and as of yet I have not observed a single "halt" since that change.



#5
April 18, 2011 at 16:59:16
Further testing seems to confirm that the problem only occurs when web traffic is directed to this server. I route it away, and the problem goes away immediately. I route some back onto the server, and immediately it starts halting again. Also, I tried doubling the traffic, diverting all traffic away from the other webserver, and this caused it to become completely unresponsive (RDP trying to reconnect) until I switched the traffic back away from it, after which it again immediately started working.

I did notice that the pagefile had been set to only 2-4GB, and I increased this to 8-16GB. However, this did not fix the problem, nor was any improvement observed.



#6
April 19, 2011 at 11:04:25
The bottleneck would appear to be the NIC. You might consider adapter teaming.




#7
April 19, 2011 at 11:15:12
Thanks for your continued help with this. I don't think it's the NIC, because I'm only pushing 5-10 Mbit/s at most. Everything is GigE LAN right into the DC's backbones. Also, the other server, which is a very similar setup (2600 on an ASUS H67 board), is now handling all web traffic quite easily. The only problem is that it's maxed out at 16GB RAM and presently using 24GB of virtual memory. Surprisingly, though, it's responding very quickly on all hosted sites, so I'm leaving it in this configuration until I can either find a solution or give up and throw another server on the stack.
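
As a rough check on the NIC figure, the traffic can be expressed as a share of the link's nominal rate. This is only a back-of-the-envelope sketch: it assumes the nominal 1000 Mbit/s GigE rate and ignores protocol overhead and bursts, and the function name `link_utilization_percent` is made up for illustration.

```python
def link_utilization_percent(traffic_mbit, link_mbit=1000):
    """Return traffic as a percentage of the link's nominal capacity.

    Both arguments are in Mbit/s; 1000 is the assumed nominal GigE rate.
    """
    return traffic_mbit * 100 / link_mbit


if __name__ == "__main__":
    for traffic in (5, 10):
        print(traffic, "Mbit/s ->", link_utilization_percent(traffic), "% of link")
```

At 5-10 Mbit/s this works out to 0.5-1% of the link, which supports the view that raw bandwidth is not the limiting factor here.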


#8
April 20, 2011 at 08:07:20
24GB of virtual memory? That is really, really wrong. If even half of that pagefile is getting used, you need to add RAM, which may have been the issue all along. Keep an eye on your hard page faults. Those indicate a need for more RAM.

Strange that a similar machine handles the same load this server couldn't.




#9
April 20, 2011 at 10:32:08
Just to be clear, the 24GB of virtual memory includes the 16GB of RAM, so I'm only 8GB over capacity. And this is on the system that's serving the websites. Given the response time of the websites, it doesn't seem to be affecting performance, but I still do not like running a server that much over capacity.

On the other system, the one that's running SQL Server, memory usage is at 4.5GB with no websites running.

It could be a problem with SQL Server and IIS fighting over resources. It could be that together they're overwhelming the desktop heap, but that is purely a guess based on the fact that I've had heap problems with IIS in the past. I'm running about 650 websites, which is why the webserver is using 24GB: each of those sites caches most of its datasets in memory.

One thing that is very interesting to note is that I'm experiencing better performance and far fewer problems on the current setup than I did on the previous servers, which were dual-E5420 machines that were never run beyond their physical RAM capacity.
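
The memory figures quoted in this post can be put into rough numbers. The sketch below is only arithmetic on those figures (24GB in use, 16GB of RAM, about 650 sites); the even split across sites and the 1GB = 1024MB conversion are simplifying assumptions, and both function names are hypothetical.

```python
def overcommit_gb(virtual_in_use_gb, physical_ram_gb):
    """Return how many GB of memory in use exceed physical RAM (0 if none)."""
    return max(0, virtual_in_use_gb - physical_ram_gb)


def average_mb_per_site(total_gb, site_count):
    """Return the average memory per site in MB, assuming an even split."""
    return total_gb * 1024 / site_count


if __name__ == "__main__":
    print("Over physical RAM (GB):", overcommit_gb(24, 16))
    print("Average per site (MB):", round(average_mb_per_site(24, 650), 1))
```

That gives 8GB beyond physical RAM and an average of roughly 38MB per site, so real per-site usage will vary around that depending on how much each site caches.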

