Convergence Diary: SQL Server and Windows Performance Counters that matter

This session is driven by Chad Aberlen, one of the Microsoft Escalation Engineers. Most of his expertise comes from the GP side, but he assured this was not a GP session, but a general Dynamics session. Let's see Smile.

He wanted to give an overview on how to approach performance tuning, key microsoft sql performance counters and detecting three common performance issues. I have seen similar session quite a few times, but this is another agenda for sure.

Why tune? Mostly because of the fact: time is money, so you can't lose user productivity. Also the cost of hardware is a reason for tuning. Furthermore, hardware may only temporarily resolve issues and higher return on investment as reasons to tune.

But where should we start? You should monitor the system. Try to isolate the problems. Is it only one bad query, or is it only one peace of functionality ... or may be a general problem that is going on? Hardware issue or software issue? So, what is causing the high pressure?

What are the hardware components that are restricting performance. We all know that:

  • Disk I/O
  • CPU
  • Memory (please be not be afraid going to x64 because of the fact that more memory can be used)
  • Network

How do we measure the disk subsystem:

  • Use physical disk instead of logical disk.
  • The sec/transfer < 15ms. This is the most important one for Chad.
  • The transfers/sec <120 per disk.
  • The current disk queue length < (2* #disks)
  • Disk bytes/sec < (10 MB/sec per disk)

Then, he went over some RAID configurations. Putting log files on RAID5, we all know that as well, he doesn't like that (and again, there were quite some people). Writing is slow in RAID5. Off course, it depends on the number of disks, but in general, it's too slow. RAID10 is best. Good speed, and very good write speed. Therefor, this is what he recommends:

  

RAID 1

RAID 5

RAID 10

DB files

avoid as generally not enough drives

acceptiable if low percentage of writes (which is seldom the case in NAV, so let's try to avoid!)

best performance

Log file

General Recommendation

avoid because of high cost of write I/O

best performance and use if RAID 1 shows pressure

Tempdb

General Recommendation

avoid because of high cost of write I/O

best performance and use if RAID 1 shows pressure

Master / MSDB

General Recommendation

  

  

Not really new things here, but it's always nice that it's confirmed by someone that hasn't been working on NAV at all Smile.

Now, what could be causing the I/O cost on the disks? This could be caused by memory pressure, or excessive paging (also due to too low memory) or poorly designed queries (scans on large tables, missing key indexes, ...), high write I/O's to a RAID 5 volume, high usage during peak times, ... . So it could be hardware, or software.

He continued with a demo. He showed the performance on his laptop, performing a query that took some time, and generated a scan on a very large table, which caused a huge disk activity.

Next, how do we measure the CPU?

  • % processor time < 80%
  • % privileged time < 10%
  • Processor queue length < 2 (no matter how many CPU's there are).
  • Context switches/sec < 1000 (the number is something one could argue with, but keep an eye on it. "How many times is another process calling the attention of the CPU, so going away from SQL?")

So, what causes CPU bottelnecks? Compiles/recompiles of execution plans, hash joins, aggregate functions, data sorting, disk I/O activity (paging), other applications/services, screen savers, ... .

Time for another demo. He showed the above counters in the performance monitor. Executes a script with a mix of creating a table and inserting data into it. It resulted in a pretty busy CPU. The second time, the script was somewhat changed (with the same result at the end): first creating all tables, then inserting the data into the tables. The performance was remarkably better!

What about the memory?

  • Set to dynamically allocate or raise max
  • Increase physical RAM
  • Evaluate high read count queries
  • /3GB switch in boot.ini
  • /PAE switch in boot.ini + AWE enabled in SQL. If you don't put AWE in SQL, SQL won't use the extra RAM that /PAE makes available. So remember that!

He somewhat skipped the network counters, because it was nog really important to him. Too bad, because I was interested in it.

SQL Server specific counters:

  • Missing indexes: full scans/sec
  • Blocking:
    • Total latch wait time (ms)
    • Lock timeouts/sec
    • Lock wait time (ms) - this is the counter Chad uses the most.
    • Number of deadlocks/sec
  • Miscellaneous
    • User connections
    • Batch requests/sec
    • SQL re-compilations/sec (lower is better)

He really doesn't like deadlocks. When he sees a deadlock, he wants to know what caused, what peace of functionality, when, who, ... . It can result in bad data, so let's try to avoid it.

Next, a demo about SQL blocking. He just simulated someone changing data and therefor locking a set of records, while someone else also tried to change the same record. The "Lock wait time" showed how long someone was bloced as one pike of (in this case) 60 seconds.

These should be the top three performance issues

  • High disk latency
    • Wrong RAID
    • Queries with high I/O costs
  • Execution plan recompiles
    • Query design
  • High blocking lengths
    • Long transactions
    • Excessive locking

To conclude, he talked about an unsupported inhouse tool that multiple teams within microsoft were using to log some of the performance issues ... Which is nice to see as a "benchmark" or tool that can point you in a right direction. Take a look at http://www.codeplex.com/PAL for more information. Use and abuse. Again, this tool is not supported ... It's just a tip.

That's all, folks Wink.

Comment List
Related
Recommended