Sometimes a crash occurs in the production environment on an Access Point or a Controller. The consequences might be minor if a rarely used, low-importance module crashed. On the other hand, the controller may crash completely and reboot itself, which may cause downtime for the end users or services.
If we were running an older firmware version and had a crash, the usual response from the vendor's TAC would be to upgrade to the newest firmware version. The main reason is that every newer firmware version contains multiple bug fixes and improvements. Each firmware release is tested across different modules by the vendor's testing and validation team, and some bugs are documented only internally. Issues reported by the customer are reproduced by the TAC engineer in the vendor's lab to determine the possible cause of the crash. Eventually the engineering team provides a fix, which is included in the next firmware version and published in the release notes.
The golden rule is to keep the production environment up to date with the newest firmware version. It might be impossible to run an upgrade for every released firmware version, but in bigger installations the firmware should be upgraded at least every six months.
Another golden rule for wireless network engineers is to always read the release notes before upgrading the system.
What to do when you have a crash?
To check for a crash in an Aruba OS 8 environment, simply log in to the CLI of the Mobility Master or Mobility Controller:
(MM1) *[mynode] #
The asterisk (*) in the prompt means the MM had a crash.
For more details, use the show switchinfo command:
(P1T6-MM-1) *[mynode] #show switchinfo
...
No AP crash information available.
Controller Crash information available.
Reboot Cause: Halt reboot (Intent:cause: 86:50)
Remember to check for crashes on all Mobility Controllers as well. The mdc command logs in directly from the MM to an MC:
(P1T6-MM-1) *[mynode] #cd P1T6-MC-1
(P1T6-MM-1) *[20:4c:03:39:46:dc] #mdc
Redirecting to Managed Device Shell
(P1T6-MC-1) [MDC] #
(P1T6-MC-1) [MDC] #exit
Exiting Managed Device Shell
(P1T6-MM-1) *[20:4c:03:39:46:dc] #cd P1T6-MC-2
(P1T6-MM-1) *[20:4c:03:5f:b9:aa] #mdc
Redirecting to Managed Device Shell
(P1T6-MC-2) [MDC] #
(P1T6-MC-2) [MDC] #exit
The absence of an asterisk in the MC prompts means there was no crash on the MCs.
To see which process crashed, use the show crashinfo command:
(P1T6-MM-1) *[mynode] #show crashinfo

Crash Info Table
----------------
Crash Time     Process Name
----------     ------------
May 15 01:15   nginx
Now tar the crash files, which will save them to the flash memory:
(P1T6-MM-1) *[mynode] #tar crash
There will be a crash.tar file in the flash memory:
(P1T6-MM-1) [mynode] #dir
...
-rw-r--r--    1 root     root       453920 May 17 17:14 crash.tar
...
Now copy the file to your TFTP/FTP/SCP server together with the tech-support dumps and open a case with the TAC team.
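As a rough sketch, the collection and transfer could look like the session below. The server address 192.168.1.50 and the username backup are hypothetical placeholders, and the exact copy syntax may vary between ArubaOS releases, so verify it against the CLI reference for your version:

(P1T6-MM-1) *[mynode] #tar logs tech-support
(P1T6-MM-1) *[mynode] #copy flash: crash.tar scp: 192.168.1.50 backup crash.tar
Password:********
(P1T6-MM-1) *[mynode] #copy flash: logs.tar scp: 192.168.1.50 backup logs.tar
Password:********

The tar logs tech-support command bundles the logs together with the tech-support output into logs.tar, so the TAC engineer gets both the crash dump and the supporting logs in one upload.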
You may also review the release notes of newer firmware versions for a similar problem, which might already have been reported to the vendor and fixed in a newer release.