|
Troubleshooting Bottlenecks
Disk IO
Bottlenecks. The goal is to evenly spread IO across all available disk
drives. The IO Stall Time screen shows the average rate per second
of IO stalls for categories of files (root, data, RUJ, AIJ). For disk
drives averaging 100 IOs per second, the target rate is 1 (the screen
displays “seconds X 100” so the computation is 1/100*100=1). If the
‘average stall time per second’ is greater than 1, then disk IO
requests are probably being queued (this can be verified using the VMS
MONITOR DISK/ITEM=QUEUE utility). It might be necessary to set the refresh
rate very high (/TIME=-1, or 1/100th of a second), for the hot spots to be
visible within RMU/SHOW STATISTICS (for Rdb) and DBO/SHOW STATISTICS (for
DBMS). The File IO
Overview option of the IO Statistics (by file) screen shows a
brief summary of the IO for all the files. The display can be sorted using
the Configure menu option; sorting by Total IOs will list
the “hottest” files first. Check the details for the individual files
using the IO Statistics (by file) screen to see how much activity
or stalling is occurring. The detailed information displayed for each file
shows total IO as well as the average stall times per IO, indicating
whether there is excessive IO or if the IO is slow (remember the target
stall rate of 1 for disk drives capable of 100 IOs per second).
It is important to understand the application – and how it
accesses the “hot files”. Does the I/O to these areas seem reasonable? The /ALARM and
/NOTIFY parameters can be used to collect more information about stalls.
The /NOTIFY parameter sends a message to the system operators when a stall
exceeds a specified /ALARM interval and also writes an entry in the
operator log. The log can be used to look at the specifics of what’s
been happening over time. Searching the log for stall messages will show
if a particular area, page or record keeps recurring in the stall messages
– then we can determine whether to move data
files, restructure an area or make programming adjustments. For example,
if we see that 50 stalls occurred waiting for record 1:2:3, then we need
to investigate what this record is before determining whether there is an
application problem or an inefficiency in the database design. Changing
the way a program accesses data or doing more frequent commits might free
up record bottlenecks. If it turns out
that a particular data file is experiencing long stalls, we might consider
moving the storage area file to another disk drive, one which is faster or
not as busy. Other strategies include modifying buffers to reduce IO and
placing the file on stripe sets. If a particular disk drive is the
bottleneck, we might consider distributing the files across other disks or
upgrading hardware to faster controllers or solid-state disks. Another cause of disk IO bottlenecks is record fragmentation. This occurs when the record length grows and record fragments get stored on different pages. If a record is stored across several pages, Rdb or DBMS has to fetch all the appropriate pages and reconstruct the record before it is presented to the user. If more than 10% of the records are fragmented, you will probably notice degraded performance. The Record Statistics screen provides information about fragmentation; for fetches and stores. To reduce fragmentation, for DBMS you can use DBO/MODIFY/RESTRUCTURE to restructure the area, or increase the page size then modify all of the records to have DBMS “rejoin” the fragmented records. For Rdb, you can use RMU/UNLOAD and RMU/LOAD.
|