This won't solve a problem, but finding one is a good start. An even better start is to run a script of these checks while your system is running well, then compare those results to the same checks when there is a problem.
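One way to keep a "good day" baseline is sketched below. The helper name snapshot and the directory names are made up for illustration. It just runs each command it is given and saves the output under one directory, so it is a starting point rather than a finished tool.

```shell
# Hypothetical helper: run each check and save its output in a directory.
# usage: snapshot DIR "command one" "command two" ...
snapshot() {
    dir=$1; shift
    mkdir -p "$dir" || return 1
    for cmd in "$@"; do
        # file name: the command with spaces and slashes turned into underscores
        name=$(printf '%s' "$cmd" | tr ' /' '__')
        sh -c "$cmd" > "$dir/$name.out" 2>&1
    done
}
```

On a good day you might run snapshot /var/tmp/baseline "uptime" "vmstat 10 3" "svcs -xv", run the same list into /var/tmp/problem on a bad day, and then compare the two with diff -r /var/tmp/baseline /var/tmp/problem.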
# vi /var/adm/messages*
Search for WARNING and/or NOTICE
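If you would rather not page through the logs in vi, the search can be sketched as a small filter. The function name alerts is made up, and it assumes a POSIX awk (on Solaris that probably means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk).

```shell
# Hypothetical helper: print only the lines that mention WARNING or NOTICE.
# usage: alerts FILE...
alerts() {
    awk '/WARNING|NOTICE/' "$@"
}
```

For example, alerts /var/adm/messages* prints just the matching lines from every messages file.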
# script /tmp/test.log
Records everything on your terminal to /tmp/test.log (type exit to stop), so you don't have to run tests twice if you are at a terminal and can't scroll.
# prtdiag -v
Gives a description of the system and any hardware failures.
# uptime
Shows the load average (threads running or waiting to run) over the last 1, 5 and 15 minutes.
# vmstat 10 3 (numbers are interval and count; first report is cumulative from boot)
Check the scan rate (sr) to see if there is a memory shortage (<200 is likely okay), and check the system to user time ratio.
Ignore "free" memory: by design, a Solaris system that has been up for a while shows only about 3% free.
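The scan rate check can be sketched as a filter over vmstat output. The function name check_scan_rate is made up, the 200 default is just the rule of thumb above, and the code assumes a POSIX awk and a header row that starts with "r" and names an "sr" column. It skips the first data row because that one is the since-boot figure.

```shell
# Hypothetical helper: flag scan rates at or above a limit (default 200).
# usage: vmstat 10 3 | check_scan_rate [LIMIT]
check_scan_rate() {
    limit=${1:-200}
    awk -v limit="$limit" '
        # header row: remember which column is labelled "sr"
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "sr") col = i; next }
        # data rows start with a number; row 1 is the since-boot summary
        col && $1 ~ /^[0-9]+$/ {
            n++
            if (n > 1 && $col + 0 >= limit + 0) print "high sr: " $col
        }
    '
}
```

No output means every interval stayed under the limit.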
# mpstat 10 3 (always use interval and count for any test ending in "stat")
Shows the load on individual CPUs; the other stat commands show averages.
# iostat -x 10 3 (-x gives the extended report with reads per second, r/s, and KB read per second, kr/s)
Disk activity: see if problems are caused by a disk that is too busy. r/s in the 100 range may be a lot, as is kr/s in the 1000 range, but these numbers differ between systems, so check what they are on a system that is running well.
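Here is a rough sketch of that disk check. It assumes the extended iostat layout, where a header row beginning with "device" names the r/s and kr/s columns, and it assumes a POSIX awk. The function name busy_disks is made up, and the default limits are only the ballpark figures above, so replace them with numbers from your own good-day run. The first report is skipped because it covers the time since boot.

```shell
# Hypothetical helper: list devices whose r/s or kr/s reach a limit.
# usage: iostat -x 10 3 | busy_disks [MAX_RS] [MAX_KRS]
busy_disks() {
    max_rs=${1:-100}     # reads per second considered busy
    max_krs=${2:-1000}   # KB read per second considered busy
    awk -v max_rs="$max_rs" -v max_krs="$max_krs" '
        # header row: count the reports and locate the columns by name
        $1 == "device" {
            hdr++
            for (i = 1; i <= NF; i++) {
                if ($i == "r/s")  r = i
                if ($i == "kr/s") k = i
            }
            next
        }
        # data rows after the first report: print devices over either limit
        hdr > 1 && r && k && NF >= k &&
        ($r + 0 >= max_rs + 0 || $k + 0 >= max_krs + 0) {
            print $1, "r/s=" $r, "kr/s=" $k
        }
    '
}
```

Passing your own limits, for example busy_disks 300 5000, lets the thresholds follow what is normal on your system.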
# svcs -xv
See if any services are struggling.
# ps -elfy
Get the pid of any process that worries you (i.e., one that has a lot of time on the processors).
# truss -o /tmp/problem.log -f -p (process number you got above)
# vi /tmp/problem.log (or tail /tmp/problem.log)
What is the process doing? If it is stuck, the same error message may come up many times.
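To spot an error that keeps repeating without reading the whole log, something like the sketch below can count them. The function name top_errors is made up, and it assumes failed calls are logged with an "Err#NN NAME" tag, like the Err#28 ENOSPC example.

```shell
# Hypothetical helper: count each "Err#NN NAME" tag, most frequent first.
# usage: top_errors /tmp/problem.log
top_errors() {
    sed -n 's/.*\(Err#[0-9][0-9]* [A-Z0-9]*\).*/\1/p' "$@" |
        sort | uniq -c | sort -rn
}
```

An error whose count towers over the rest is a good place to start looking.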
SAQ (a sometimes asked question, not a frequent one):
"I got an error number 'Err#28 ENOSPC'. What does that mean?"
There are two ways to figure this out. Way number one: the first E means "Error" and NOSPC means no space (yeah).
Way number two: look it up.
# man -s 2 intro
/28 (search)
28 ENOSPC
No space left on device.
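A third way, sketched here, is to search the errno header instead of the man page. The function name errno_lookup is made up. The default path /usr/include/sys/errno.h and the "#define NAME NUMBER /* text */" line format are assumptions about your system, so pass a different file if yours keeps the definitions elsewhere. It also assumes a POSIX awk.

```shell
# Hypothetical helper: print the #define line for an error number.
# usage: errno_lookup NUMBER [HEADER]
errno_lookup() {
    num=$1
    hdr=${2:-/usr/include/sys/errno.h}   # assumed location
    awk -v n="$num" '$1 == "#define" && $3 == n' "$hdr"
}
```

errno_lookup 28 should print the ENOSPC line if the header is laid out as assumed.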