Sunday, August 3, 2008

Debugging a Sun/Solaris Performance Problem

This won't solve a problem, but finding one is a good start. An even better start is to run these checks from a script while your system is running well, then compare those results with the same checks when there is a problem.
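One way to collect that baseline is sketched below. It is only an outline, not a tested Solaris tool: the function name snapshot, the output directory layout and the .out file names are made up for illustration, and it assumes a ksh- or bash-style shell. Commands that are not found in the PATH are noted and skipped.

```shell
# snapshot DIR [INTERVAL] [COUNT]
# Saves the output of each check into DIR, one file per command.
# (Hypothetical helper; names and layout are assumptions.)
snapshot() {
  dir=$1
  interval=${2:-10}
  count=${3:-3}
  mkdir -p "$dir" || return 1
  # iostat gets -x so the extended r/s and kr/s columns are recorded.
  for cmd in "uptime" "vmstat $interval $count" "mpstat $interval $count" \
             "iostat -x $interval $count" "svcs -xv" "prtdiag -v"; do
    name=${cmd%% *}                      # first word, e.g. "vmstat"
    if command -v "$name" >/dev/null 2>&1; then
      # $cmd is left unquoted on purpose so its arguments are split.
      $cmd > "$dir/$name.out" 2>&1 || echo "failed: $cmd" >> "$dir/$name.out"
    else
      echo "skipped: $name not found" > "$dir/$name.out"
    fi
  done
}
```

Run it as "snapshot /var/tmp/baseline" on a good day and "snapshot /var/tmp/problem" during trouble, then read the matching files side by side.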

# vi /var/adm/messages*
Search for WARNING and/or NOTICE
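If you would rather not page through the logs in vi, the same search can be done with grep. This is a small sketch; scan_messages is a made-up name, and it assumes a grep that accepts -E (on Solaris that is likely /usr/xpg4/bin/grep rather than /usr/bin/grep).

```shell
# scan_messages FILE...
# Prints every line containing WARNING or NOTICE, with line numbers.
# (Hypothetical helper; assumes grep supports -E.)
scan_messages() {
  grep -n -E 'WARNING|NOTICE' "$@"
}
```

For example: scan_messages /var/adm/messages*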

# script /tmp/test.log
Records everything shown on the terminal to the file until you type exit. You may not want to run the tests twice if you are at a terminal and can't scroll back.

# prtdiag -v
Gives a description of the system and any hardware failures.

# uptime
Shows the load average (threads running or waiting to run) over the last 1, 5 and 15 minutes.

# vmstat 10 3 (numbers are interval and count; first report is cumulative from boot)
Check the scan rate (sr) to see if there is a memory shortage (<200 is likely okay), and check the system to user time ratio.
Ignore "free" memory; by design, a Solaris system that has been up for a while may show only about 3% free.
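The scan rate check can be automated with a little awk. The sketch below is an illustration only: check_scan_rate is a made-up name, the 200 default is just the rule of thumb from above, and it assumes the usual vmstat layout where the second header line starts with "r b w" and contains an "sr" column.

```shell
# vmstat 10 3 | check_scan_rate [LIMIT]
# Reports samples whose sr column is above LIMIT (default 200).
# The first sample is skipped because it is the average since boot.
# (Hypothetical helper; assumes the standard vmstat header.)
check_scan_rate() {
  awk -v limit="${1:-200}" '
    # Header line: remember which column holds sr.
    $1 == "r" && !col { for (i = 1; i <= NF; i++) if ($i == "sr") col = i; next }
    # Data lines start with a number.
    col && $1 ~ /^[0-9]+$/ {
      n++
      if (n == 1) next
      if ($col + 0 > limit) { print "sr=" $col " exceeds " limit; bad = 1 }
    }
    END { exit bad + 0 }
  '
}
```

The function exits non-zero when any sample is over the limit, so it can be used in an if statement.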

# mpstat 10 3 (always use an interval and count for any test ending in "stat")
Shows the load on individual CPUs; the other stat commands show averages.

# iostat -x 10 3 (-x gives the extended report with reads per second, r/s, and KB read per second, kr/s)
Disk activity; see if problems are caused by a disk that is too busy. r/s in the 100 range may be a lot, as is kr/s in the 1000 range, but these numbers differ between systems, so check what they are on a system that is running well.
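Here is a rough sketch of picking out the busy disks automatically. It assumes the extended report from iostat -x, whose header line starts with "device" and includes r/s and kr/s columns. busy_disks is a made-up name, and the two limits are whatever you measured on your own healthy system; the 100 and 1000 in the example are only the ballpark figures above. The first report is the since-boot average and is not filtered out here.

```shell
# iostat -x 10 3 | busy_disks RS_LIMIT KRS_LIMIT
# Prints devices whose r/s or kr/s is above the given limits.
# (Hypothetical helper; assumes the "device r/s w/s kr/s ..." header.)
busy_disks() {
  awk -v rmax="$1" -v kmax="$2" '
    # Header line: find the r/s and kr/s columns.
    $1 == "device" {
      for (i = 1; i <= NF; i++) { if ($i == "r/s") r = i; if ($i == "kr/s") k = i }
      next
    }
    # Device lines: compare against the limits.
    r && k && NF >= k && ($r + 0 > rmax || $k + 0 > kmax) {
      print $1, "r/s=" $r, "kr/s=" $k
    }
  '
}
```

For example: iostat -x 10 3 | busy_disks 100 1000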

# svcs -xv
See if any services are struggling.

# ps -elfy
Get the pid of any process that worries you (i.e., one that has a lot of time on the processors).
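Spotting the biggest CPU consumers by eye gets tedious on a busy box. The sketch below sorts processes by accumulated CPU time instead. It is an illustration under a few assumptions: it reads the output of "ps -eo pid,time,comm" rather than "ps -elfy" because those three columns are easier to parse, it expects TIME in the [dd-]hh:mm:ss or mm:ss form, and top_cpu_time is a made-up name.

```shell
# ps -eo pid,time,comm | top_cpu_time [N]
# Prints "seconds pid command" for the N processes (default 5)
# with the most accumulated CPU time.
# (Hypothetical helper; assumes TIME looks like [dd-]hh:mm:ss or mm:ss.)
top_cpu_time() {
  awk '
    NR > 1 {                               # skip the header line
      t = $2; days = 0
      if (index(t, "-")) { split(t, d, "-"); days = d[1]; t = d[2] }
      n = split(t, p, ":"); secs = 0
      for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
      print days * 86400 + secs, $1, $3
    }
  ' | sort -rn | head -n "${1:-5}"
}
```

The second column of the output is the pid to hand to truss.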

# truss -o /tmp/problem.log -f -p (process number you got above)

# vi /tmp/problem.log (or tail /tmp/problem.log)
What is the process doing? If it is stuck, the same error message may come up many times.
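To see at a glance which errors repeat, you can count them. This is a sketch that assumes truss marks failed calls with "Err#<number> <NAME>" at the end of the line, as in the example further down; summarize_truss_errors is a made-up name.

```shell
# summarize_truss_errors FILE
# Counts failed system calls in a truss log, most frequent first.
# Output lines look like: COUNT SYSCALL Err#N NAME
# (Hypothetical helper; assumes the "Err#N NAME" suffix on failed calls.)
summarize_truss_errors() {
  awk '
    /Err#[0-9]+/ {
      call = "?"
      # The system call is the first field that looks like name(
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[a-z_0-9]+\(/) { call = $i; sub(/\(.*/, "", call); break }
      err = ""
      for (i = 1; i <= NF; i++)
        if ($i ~ /^Err#/) err = $i " " $(i + 1)
      count[call " " err]++
    }
    END { for (k in count) print count[k], k }
  ' "$1" | sort -rn
}
```

For example: summarize_truss_errors /tmp/problem.log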


SAQ: Sometimes asked question, not frequent:

"I got an error number 'Err#28 ENOSPC'. What does that mean?"
There are two ways to figure this out. Way number one, read the name: the leading E means "Error" and NOSPC means "no space" (yeah).
Way number two, look it up.

# man -s 2 intro
/28 (search)

28 ENOSPC
No space left on device.
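A third way, sketched here, is to search the C header that defines the error numbers. The default path /usr/include/sys/errno.h is an assumption about where your system keeps them (pass another file as the second argument if it differs), and errno_lookup is a made-up name.

```shell
# errno_lookup NUMBER [HEADER]
# Prints the "#define ENAME NUMBER /* description */" line for NUMBER.
# (Hypothetical helper; the default header path is an assumption.)
errno_lookup() {
  awk -v n="$1" '$1 == "#define" && $3 == n { print }' \
    "${2:-/usr/include/sys/errno.h}"
}
```

For example: errno_lookup 28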
