Last edit: 05-03-17 Graham Wideman
Diagnostic Equipment and Software
Article created: 98-07-01
There are some tenets relating to diagnosing and solving technical problems:
1. Technical problems always have concrete causes. (There may, of course, also be overlaying social causes that have permitted the technical causes to go unaddressed.)
2. If you don't understand what the concrete causes are, you have only two choices:
This has often led me to be particularly insistent on diagnostics, either as an integral part of a new system, or as a separate measurement tool to assure a level of service or to collect troubleshooting data. Some of the samples in the Delphi area on this site follow this pursuit. A few earlier examples follow:
Name/Date | Description, Pics |
RS232 |
The RS232 serial communications "standard" has been widely used as a communications channel on lab instruments, printers and computers since the 1970s. But before the late 1980s and the dominance of the PC version, the implementations were very idiosyncratic and inconsistent. This meant that the chances of plug 'n' go success connecting device A to device B were close to zero, and usually the documentation was incomplete or wrong. To address this frequent problem, I developed an RS232 test unit that was part patch bay and part signal investigator -- capable of definitively determining which signals were inputs versus outputs, and what they were doing during the transactions. This later became several chapters in my book on PC interfacing. |
Modem-pool |
My early attempts to telecommute were heavy on the "commute", and light on the "tele" -- largely due to the unreliability of SDSU's centrally-managed modem pool (at the time, 64 modems). Frequent user complaints were doing little to solve the problem. To mount a campaign to get this addressed, I wrote a modem-pool test package. With this software, a PC would spend 24/7 simulating user dial-ups: calling the modem pool and attempting to log on to a system. Along the way it would query the modem server, and gather stats on the level of service that an actual user would experience. Every few hours, its script would log on to a system with an email account, and send a report (including logon-success bargraphs and phone-line-specific diagnostics) to the relevant administrators and techs. Service improved shortly thereafter. This was a lesson in "if you don't measure it, it will be broken". |
Network-card |
Our group was responsible for desktop network installs, which, under DOS, Novell and/or Win 3.1, were ugly and fraught with pitfalls and documentation lapses. To help our installers zero in on the trouble spot, I wrote software for a portable PC to act as a network troubleshooting instrument and network sniffer. |
Oracle/MVS instability diagnosis, 1997-98 |
As part of my data warehousing activities at SDSU, it was necessary to use Oracle's database server (versions 7.2.x, 7.3.x) running on the campus's main IBM mainframe under MVS (later OS/390). Oracle would frequently crash, or provide incongruously slow performance, and occasionally manage to knock out the entire system. Despite complaints from numerous users and honest efforts by the database admin, months would elapse in rounds of piecemeal communication with Oracle support, with little progress on solving the problems. Incredibly, Oracle provides no test suites which might be used to exercise and benchmark the server, nor diagnostics that might be used to get at the root of the problem, so it is very difficult to nail down where a problem might be. Another lesson in "if you don't measure it, it will be broken". This could have been primarily a technical problem, or it could have been a problem of too many parties involved, with no one clearly identified as responsible, engaged and equipped to diagnose and fix it. Either way, the situation had to be moved away from "endless recounting of myths" to some reality-based discourse. In response, I heavily instrumented our warehouse processing code, automating detailed logs and isolating reliable problem cases. We managed to install a duplicate system on a Unix machine, and could then run comparisons. Finally, we published the daily performance summary and MVS-vs-Unix results on the web for all to see. This did not fix the technical problem, but it made the problem measurable, showed it to be a generally poor level of stability of the MVS-Oracle database-and-support environment (an otherwise rather hard-to-believe state of affairs), and communicated it broadly within our organization. The warehouse was moved permanently to another platform. Subsequent decisions were made to avoid using that environment for a major new accounting system.
Ironically, because certain soon-to-be-mission-critical systems could not be moved from the IBM/Oracle platform, these measurements were also a critical factor in my conclusion that my work at SDSU would soon become impossible to advance. |
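The reports mentioned in the examples above (logon-success bargraphs per phone line, daily performance summaries per platform) come down to tallying attempts per target and showing the success rate in a form nobody can argue with. What follows is a minimal Python sketch of that idea, not the original software. The names `Attempt`, `summarize` and `bargraph`, the target labels, and the sample data are all hypothetical, invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    target: str      # hypothetical label, e.g. a phone line or a platform
    ok: bool         # did the attempt (dial-up, query, job) succeed?
    seconds: float   # how long the attempt took


def summarize(attempts):
    """Return {target: (tries, successes, mean_seconds)}."""
    buckets = defaultdict(list)
    for a in attempts:
        buckets[a.target].append(a)
    summary = {}
    for target, items in buckets.items():
        tries = len(items)
        successes = sum(1 for a in items if a.ok)
        mean_seconds = sum(a.seconds for a in items) / tries
        summary[target] = (tries, successes, mean_seconds)
    return summary


def bargraph(summary, width=20):
    """Render one text bar per target, proportional to its success rate."""
    lines = []
    for target in sorted(summary):
        tries, successes, mean_seconds = summary[target]
        filled = round(width * successes / tries)
        bar = "#" * filled + "." * (width - filled)
        lines.append(
            f"{target:<10} [{bar}] {successes}/{tries} ok, {mean_seconds:.1f}s avg"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        Attempt("line-01", True, 12.0),
        Attempt("line-01", True, 14.0),
        Attempt("line-02", False, 30.0),
        Attempt("line-02", True, 16.0),
    ]
    print(bargraph(summarize(sample)))
```

The text bar is a deliberate choice: it survives plain-text email and a bare web page, which is where a report like this has to land if it is to be read by the people who can fix the problem.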