Posted September 20, 2024 (author: 穆如彤)
Updated Oct 3, 05
rp44x0/rx4640 System Power-Up Troubleshooting / System
Build-up Procedure
When you might need this procedure:
• If the system does not successfully power up and remain powered up.
• If the system will power up and remain powered up, but will not enter or pass POST (Power-On
Self Test) and boot to BCH/EFI, and the cause of the problem is not apparent from the MP’s
SEL/FPL logs.
Assumptions before using this procedure:
• The problem is a SOLID failure event (i.e. it happens every time you attempt to power on the
system or initiate POST).
• There is a functioning console terminal (or a PC with appropriate terminal emulation) available
and attached to the MP’s “Console” port.
Problem Symptoms:
The system will not power up or remain powered up.
Power-Up Troubleshooting Steps:
1. If this is the first time the system has been powered up, check the incoming AC line voltage.
Ensure that 200 to 240 VAC is applied to the bulk DC supplies. The
rp44x0/rx4640 servers require 200 VAC nominal (i.e. they WILL NOT operate on 100/120
VAC). Typical MP SEL error log entries when attempting to run on 100/120 VAC look
like the following:
Alert Level 7: Fatal
Keyword: Type-02 096f02 618242
A/C Failed, disconnected, or out of range
Logged by: Baseboard Management Controller;
Sensor: Power Unit - AC Presence
Data1: 240VA Power Down
0x20430F39B1020040 FFFF026FCF090300
2. Next, check the front panel “Power” LED, visible thru the hole in the flap covering the ON/OFF
switch. If it is flashing Amber, then housekeeping voltages are available (i.e. AC voltages are
applied); go to step 3 below. If it is neither illuminated nor flashing, then housekeeping voltages are not
available. Check the LEDs on the Bulk DC Power Supplies. The Bulk DC Power Supplies have
three indicators: a “Predict Fail” Amber LED to the left, a “Failed” Amber LED in the center
(the symbol is a triangle with an “!” exclamation point) and a Green “Power” LED to the right. If
the “Failed” LED is illuminated then the supply should be replaced – P/N A6961-67125. If the
Green “Power” LED is illuminated or flashing then the supply is OK and has AC voltage applied.
3. Check the “QuickFind diagnostic panel” located on top of the system; it is visible when the system
is pulled out of the rack. Diagnose any failures indicated by illuminated LED(s) using the Service
Manual or other decoding links, e.g.
/ceasst/ia64/server/rx4640/led_.
4. Next, see if the MP is functional. The MP is typically accessed from a terminal attached to the “MP
Local” port on the rear bulkhead by typing a Control-B CR key sequence. If the MP is
functional, check the status of the DC power system via the CM> PC command or the CM>
PS command. If the PC or PS command output shows the “Current System Power State” to be
“OFF”, try to turn the DC power ON via the PC command. If the system will not turn the
DC power ON, or if it will not remain ON, check the System Event Logs for errors per step
6 below. Alternatively, you can press the ON/OFF button (located behind the front bezel flap) to
attempt to enable the DC voltages.
5. If a terminal attached to the MP “Local” port does not respond to a Control-B CR key sequence
(and the terminal is running at 9600 baud, 8 data bits, No Parity, 1 Stop Bit, Xon/Xoff, is
ONLINE, etc.) then it is possible that the MP is hung or non-functional. Check the following
LEDs located inside the system:
- The “MP Heartbeat” LED. This Green LED is physically located between the 3.3 and 5 volt
VRMs on the I/O Baseboard assembly and is visible from the rear of the system by looking thru the
holes in the sheet metal directly above the MP LAN connector. There are two Green LEDs in
this vicinity and the MP Heartbeat LED is the one on the left (closest to the bulk power supplies).
This LED should be flashing at a ~ 1 Hz rate whenever housekeeping voltages are available AND
the MP circuitry is functional. If it IS NOT flashing (or is solid Green) then it is possible that the
MP circuitry is non-functional or is not receiving housekeeping voltages. One way to see if the
housekeeping voltages are available is to look at the BMC Heartbeat LED as described below.
- The “BMC Heartbeat” LED. This LED is located immediately to the right (from rear view) of the
MP Heartbeat LED described above. It should be flashing Green (~ 1 Hz blink rate) whenever AC
power is applied to the system, minimum housekeeping voltages are available, and the BMC
circuitry is functional. If it IS NOT flashing then it is possible that the BMC circuitry is
non-functional or the internal housekeeping voltages are missing.
If the MP Heartbeat LED IS NOT flashing but the BMC Heartbeat IS flashing:
• Replace the I/O Baseboard – P/N A6961-69x01
If the BMC Heartbeat LED IS NOT flashing but the MP Heartbeat IS flashing:
• Replace the I/O Baseboard – P/N A6961-69x01
• Replace the Mid-Plane Board – P/N A6961-67005
If neither heartbeat LED is flashing then it is possible that internal DC housekeeping voltages are
missing. Housekeeping voltages (“12_STBY”) are generated by the bulk AC power supply(ies)
and passed thru the DC Power Distribution board, attached ribbon cable (flat grey cable), and
Mid-Plane board to the I/O Baseboard.
• Replace the DC Power Distribution board – P/N A6961-60015
• Replace the Mid-Plane Board – P/N A6961-67005
• Replace the flat cable between the DC Power Distribution board and Mid-Plane – P/N A6961-63004
• Replace the I/O Baseboard – P/N A6961-69x01
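The heartbeat LED checks in step 5 amount to a small decision table. The following is a minimal Python sketch of that table, not an HP tool: the function name and return format are hypothetical, and the FRU lists are copied from the bullets above in the order they appear. A solid Green LED counts as "not flashing", as the MP Heartbeat description states.

```python
from typing import List


def heartbeat_suspects(mp_flashing: bool, bmc_flashing: bool) -> List[str]:
    """Return the FRUs step 5 lists for a given pair of heartbeat LED states.

    Pass False for an LED that is dark or solid Green (i.e. not flashing).
    """
    if mp_flashing and bmc_flashing:
        # Both heartbeats present: step 5 names no FRU for this case.
        return []
    if bmc_flashing:
        # MP heartbeat missing, BMC heartbeat present.
        return ["I/O Baseboard (P/N A6961-69x01)"]
    if mp_flashing:
        # BMC heartbeat missing, MP heartbeat present.
        return [
            "I/O Baseboard (P/N A6961-69x01)",
            "Mid-Plane Board (P/N A6961-67005)",
        ]
    # Neither heartbeat: housekeeping voltage (12_STBY) path is suspect.
    return [
        "DC Power Distribution board (P/N A6961-60015)",
        "Mid-Plane Board (P/N A6961-67005)",
        "Flat cable, DC Power Distribution to Mid-Plane (P/N A6961-63004)",
        "I/O Baseboard (P/N A6961-69x01)",
    ]


if __name__ == "__main__":
    for fru in heartbeat_suspects(mp_flashing=False, bmc_flashing=False):
        print(fru)
```

The list order only mirrors the order of the bullets in this procedure; check the SEL first (step 6) whenever the MP responds.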
6. Remember that if the MP “Local” terminal port DOES respond to Control-B, then it is
always advisable to FIRST (i.e. before replacing any hardware) examine the MP System Logs
(SL) and look at the System Event Logs (SEL) for any recent alerts that have an alert level of 3 or
greater. Select TEXT mode and set the Alert Level to 3. If there are many entries in the SEL
and you are not sure which ones are associated with the current problem, clear the MP logs
(you may want to save a copy first) and recreate the failure. Then look at the SEL
again. If the SEL does not help determine the root cause of the problem, it may
be necessary to perform the “system build-up” troubleshooting procedure described below.
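When SEL text has been captured from the terminal session, the alert-level filtering in step 6 can also be done offline. The sketch below is an illustration only: it assumes the TEXT-mode layout seen in the excerpts in this document (an "Alert Level N: Severity" line, then a "Keyword:" line, then the description), and real SL output may differ. The function names are hypothetical; the sample text is copied from the two excerpts in this procedure.

```python
import re
from typing import Dict, List

ALERT_RE = re.compile(r"^Alert Level (\d+):\s*(.+)$")

# Two entries copied from the excerpts in this procedure.
SAMPLE = """\
Alert Level 7: Fatal
Keyword: Type-02 096f02 618242
A/C Failed, disconnected, or out of range
Logged by: Baseboard Management Controller;
Sensor: Power Unit - AC Presence
Log Entry 3: 23 Dec 2004 21:50:43
Alert Level 5: Critical
Keyword: Type-02 257100 2453760
Missing FRU device - CPU 0 PIROM
Logged by: Baseboard Management Controller;
Sensor: Entity Presence
"""


def parse_sel_text(text: str) -> List[Dict[str, object]]:
    """Build one record per 'Alert Level' line found in pasted SEL text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries: List[Dict[str, object]] = []
    for i, line in enumerate(lines):
        match = ALERT_RE.match(line)
        if not match:
            continue
        # Assumed layout: the description follows the 'Keyword:' line.
        description = ""
        if i + 2 < len(lines) and lines[i + 1].startswith("Keyword:"):
            description = lines[i + 2]
        entries.append({
            "level": int(match.group(1)),
            "severity": match.group(2),
            "description": description,
        })
    return entries


def at_or_above(entries: List[Dict[str, object]], threshold: int = 3):
    """Keep entries whose alert level is >= threshold (3 per step 6)."""
    return [e for e in entries if e["level"] >= threshold]


if __name__ == "__main__":
    for entry in at_or_above(parse_sel_text(SAMPLE)):
        print(entry["level"], entry["severity"], "-", entry["description"])
```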
Problem Symptoms:
The system will power on and remain powered on – but will not enter into or pass POST and boot to
EFI or BCH menu – and SEL log analysis does not directly point to the problem. In this case, it may
be necessary or desirable to perform the following system “build up” procedure:
System Build-up Troubleshooting Procedure:
- Remove all the AC Power Cords from the Bulk Power Supplies.
- Pull the CPU Carrier, Memory Carrier, Disk Drive(s), and I/O cards (if possible), then plug in the AC
Power Cord(s). The MP should come alive and the CM> “DF” command should list the following FRU
IDs. You might eventually end up with the following Alert Event:
FRU IDs:
--------
0002-Power Converter 0003-Power Supply 0 0004-Power Supply 1
0005-Diagnostic Panel 0006-Front Panel 0000-Motherboard
Log Entry 4: 00:00:09
Alert Level 5: Critical
Keyword: Type-02 257100 2453760
Missing FRU device - Mem Extender
Logged by: Baseboard Management Controller;
Sensor: Entity Presence
0x2050 FF00
- If you DO NOT see all of the above FRU IDs then concentrate on the MISSING FRU ID(s).
NOTE: a defective Mid-Plane board can cause any number of strange power-up and/or DC
voltage problems. If you DO NOT get the above Alert Level 5 event but get another sort of high
level Alert, try replacing the Mid-Plane board.
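Comparing a captured DF listing against the expected FRU IDs can be done by hand, or with a short script such as the sketch below. It is an illustration under assumptions: the parser relies on the "NNNN-Name" layout shown in the DF excerpts in this document, the helper names are hypothetical, and the expected set is the stripped-chassis list above (a system with a single bulk supply would presumably not report 0004).

```python
import re
from typing import Dict

# A FRU entry is a 4-digit ID, a hyphen, and a name that runs until the
# next "NNNN-" entry or the end of the line.
FRU_RE = re.compile(r"(\d{4})-(.+?)(?=\s+\d{4}-|$)")

# Expected listing with CPU Carrier, Memory Carrier, disks and I/O cards pulled.
STRIPPED_CHASSIS: Dict[str, str] = {
    "0000": "Motherboard",
    "0002": "Power Converter",
    "0003": "Power Supply 0",
    "0004": "Power Supply 1",
    "0005": "Diagnostic Panel",
    "0006": "Front Panel",
}

SAMPLE_DF = """\
FRU IDs:
--------
0002-Power Converter 0003-Power Supply 0 0004-Power Supply 1
0005-Diagnostic Panel 0006-Front Panel 0000-Motherboard
"""


def parse_df_fru_ids(text: str) -> Dict[str, str]:
    """Map FRU ID -> FRU name for every entry found in DF output text."""
    frus: Dict[str, str] = {}
    for line in text.splitlines():
        for fru_id, name in FRU_RE.findall(line.strip()):
            frus[fru_id] = name.strip()
    return frus


def missing_frus(df_text: str, expected: Dict[str, str] = STRIPPED_CHASSIS) -> Dict[str, str]:
    """Return the expected FRUs that do not appear in the DF output."""
    seen = parse_df_fru_ids(df_text)
    return {fid: name for fid, name in expected.items() if fid not in seen}


if __name__ == "__main__":
    gaps = missing_frus(SAMPLE_DF)
    print("Missing FRU IDs:", gaps if gaps else "none")
```

The same `missing_frus` call works for the later build-up stages if `expected` is replaced with the FRU list given for that stage.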
If you DO show the above FRU ID entries and get the Alert Level 5 “Missing FRU device - Mem
Extender” event then:
The next step is to add the Memory Carrier FRU (with at least one rank of DIMMs).
Remember to pull the AC power cords before making any of these changes. You should then notice
that the fans START to spin up but spin down about 1 second later. You should observe the
Amber Power LED flashing and the Red FAULT System LED flashing. Here is the output of the
DF command you should expect at this point (this example had two ranks of memory DIMMs
installed):
FRU IDs:
--------
0152-DIMM0D 0001-Mem Extender 0002-Power Converter
0003-Power Supply 0 0004-Power Supply 1 0005-Diagnostic Panel
0006-Front Panel 0128-DIMM0A 0136-DIMM0B
0144-DIMM0C 0160-DIMM1A 0168-DIMM1B
0176-DIMM1C 0184-DIMM1D 0000-Motherboard
Here are the high level Alerts generated in the SEL Logs (Alert Level 5 or greater - MP logs
cleared before test):
Log Entry 3: 23 Dec 2004 21:50:43
Alert Level 5: Critical
Keyword: Type-02 257100 2453760
Missing FRU device - CPU 0 PIROM
Logged by: Baseboard Management Controller;
Sensor: Entity Presence
0x2041CB3DB3020040 FF20
- If you DO NOT show all of the above FRU IDs then concentrate on the MISSING FRU ID. If
you DO show the above FRU ID entries and the Alert Level 5 “Missing FRU device - CPU 0
PIROM” event then:
The next step is to insert the CPU Carrier. Note that for this test step I DID NOT pull the
processors from the CPU Carrier board first. Also, this example shows the expected results for an
rp4440 system with 2 processor modules installed. I recommend having at least one processor
module installed (Module 0); otherwise, I imagine you would get slightly different Alert
Messages. You may also get different results if you have an Intel Itanium2 rx4640
processor assembly installed.
When you add the CPU Carrier + processor(s), the system fans should come on and STAY on.
The DF command output should then look something like this:
FRU IDs:
--------
0001-Mem Extender 0002-Power Converter 0003-Power Supply 0
0004-Power Supply 1 0005-Diagnostic Panel 0006-Front Panel
0007-Disk Management 0008-Disk Backplane 0010-Processor Board
0012-Power Pod 0 0013-Power Pod 1 0032-CPU 0 PIROM
0033-CPU 1 PIROM 0036-Processor 0 RAM 0037-Processor 1 RAM
0128-DIMM0A 0136-DIMM0B 0144-DIMM0C
0152-DIMM0D 0160-DIMM1A 0168-DIMM1B
0176-DIMM1C 0184-DIMM1D 0000-Motherboard
- At this point, if the installed H/W is all functional the system should initiate POST. It is
then recommended to go immediately into the SEL "Live Logs" or the VFP to ensure that POST is
initiated and proceeds without error all the way to BCH or EFI. Remember that for the rp4440
system you will not normally see any POST forward progress messages on the console unless you
are in SL live mode.
- If POST does not start after 5 to 10 seconds, then suspect some sort of problem with the CPU
Carrier board and/or the processor(s) mounted on it. A typical symptom of this sort of problem
is an “FRB2 hang” alert eventually showing up in the System Event Log. In this case it
may be necessary to reduce the processor complement to one module (put the module in position
0) and retest. Then replace the CPU module (or swap the CPU module with one previously
removed) and then try replacing the CPU Carrier board. Note that there is a switch on the CPU
Carrier board that determines whether it runs Itanium or PA-RISC code, so be sure to check this
switch position if you get an FRB2 hang and have previously replaced the CPU Carrier board as
part of the troubleshooting procedure. For rp4440 this switch (switch block S5103 - lowest-most
switch block, lowest-most switch when viewed with extractor handles toward you) should be set
to the right (PA-RISC). Set it to the left for Itanium2 processors. This switch position setting is
normally stamped on the sheet metal cover for convenience.
If you get any other sort of error at this point, you should probably re-examine the SEL events
and see if they point to the root cause. If the SEL logs do not assist in pointing to the root cause,
then you should acquire the assistance of system experts.
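The build-up sequence above keys each next action off the "Missing FRU device" alert seen at the current stage. As a compact summary, the sketch below maps the two alerts quoted in this procedure to the step that follows them. The names are hypothetical and the table covers only the alerts shown in this document; any other alert falls through to a reminder to go back to the SEL.

```python
PREFIX = "Missing FRU device - "

# Alert device name -> next build-up action, as described in this procedure.
NEXT_STEP = {
    "Mem Extender": "Pull AC cords, add the Memory Carrier (at least one rank of DIMMs), re-apply AC.",
    "CPU 0 PIROM": "Pull AC cords, insert the CPU Carrier (processor module 0 installed), re-apply AC.",
}

FALLBACK = "Not an expected build-up alert: re-examine the SEL before changing hardware."


def next_build_up_step(alert_description: str) -> str:
    """Return the next build-up action for a SEL alert description line."""
    text = alert_description.strip()
    if text.startswith(PREFIX):
        device = text[len(PREFIX):].strip()
        return NEXT_STEP.get(device, FALLBACK)
    return FALLBACK


if __name__ == "__main__":
    print(next_build_up_step("Missing FRU device - Mem Extender"))
```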