Drive - Loss of Path Redundancy

What Caused the Problem?

A communication path with a drive has been lost. The Recovery Guru Details area provides specific information you will need as you follow the recovery steps.

Caution: Electrostatic discharge can damage sensitive components. Always use proper antistatic protection when handling components. Touching components without using a proper ground may damage the equipment.

Recovery Steps

1

Fix any other problems reported by the Recovery Guru before attempting to fix this problem.

2

If...

Then...

The affected enclosure listed in the Recovery Guru Details area contains both controllers and drives

Go to step 7.

The affected enclosure listed in the Recovery Guru Details area contains only drives

Go to step 3.

3

To determine the non-working channel, start at the drive port on the controller enclosure corresponding to the working channel (refer to the labels on the back of the controller enclosure if needed). Trace the cable from the working channel to the ESM canister in the affected drive enclosure reported in the details area.

Caution: Possible loss of data accessibility. Do not disconnect any cables on the working channel. Doing so may cause a loss of data accessibility.

4

Locate the other ESM canister in the affected drive enclosure (this is the canister on the non-working channel).

5

Replace the ESM canister on the non-working channel using the following steps:

a

Label the interface transceivers (GBICs or SFPs). The labels will help you correctly reconnect the cables to the new ESM canister.

While the cables are still connected, remove the interface transceivers from the ESM canister you are replacing.

b

Remove the ESM canister.

Note: The Service Action Allowed status in the Details area is always NO for this problem because the component is not failed. In this situation, it is acceptable to remove the ESM canister even though the Service Action Allowed status is NO.

c

Set all switches on the new ESM canister to the same values as the old ESM canister.

d

Insert the new ESM canister into the drive enclosure.

e

Using the labels created in step a, reconnect the cables to the replaced canister. Wait 40 seconds, then go to step 6.

6

Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area.

If...

Then...

The problem has been fixed

You are finished with this procedure. Do NOT go to step 7.

The problem has not been fixed

Go to step 7.

7

You must replace the drive. Which procedure you use depends on the RAID level of the array associated with the affected drive. To determine the associated array, highlight the affected drive in the Physical View of the Subsystem Management Window and select View >> Associated Elements. Next, highlight the associated array in the Logical View of the Subsystem Management Window to check its RAID level.

If...

Then...

The array is RAID 0

Go to "Recovery Steps for Replacing a Drive in a RAID 0 Array."

The array is RAID 1, 3, or 5

Go to "Recovery Steps for Replacing a Drive in a RAID 1, 3, or 5 Array."
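The table above amounts to a two-way lookup on the RAID level. The following is a minimal sketch, assuming the RAID level has been read by hand from the Logical View. The function name is hypothetical and is not part of any storage management interface.

```python
RAID_0_PROCEDURE = "Recovery Steps for Replacing a Drive in a RAID 0 Array"
REDUNDANT_PROCEDURE = "Recovery Steps for Replacing a Drive in a RAID 1, 3, or 5 Array"


def drive_replacement_procedure(raid_level: int) -> str:
    """Return the title of the procedure to follow for the given RAID level.

    Hypothetical helper for illustration only.
    """
    if raid_level == 0:
        # No redundancy: failing the drive fails the logical drives.
        return RAID_0_PROCEDURE
    if raid_level in (1, 3, 5):
        # Redundant levels: failing the drive leaves the logical drives Degraded.
        return REDUNDANT_PROCEDURE
    # Only the levels listed in the table are covered here.
    raise ValueError(f"RAID level {raid_level} is not covered by this procedure")
```

The practical difference between the two branches is that the RAID 0 procedure destroys the data on the affected logical drives, so it requires a backup beforehand and a restore afterward.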

Recovery Steps for Replacing a Drive in a RAID 0 Array

Use the following procedure if the affected array is RAID 0.

Fix any other problems reported by the Recovery Guru before continuing with this procedure. Note that all logical drives in the Logical View of the Subsystem Management Window must be Optimal.

1

Stop all I/O to the affected logical drives.

2

Reseating the drive may clear up the path redundancy problem. Remove the drive and then re-insert it.

Note: The Service Action Allowed status in the Details area is always NO for this problem because the component is not failed. In this situation, it is acceptable to remove the drive even though the Service Action Allowed status is NO.

3

Wait 40 seconds, and then click the Recheck button to rerun the Recovery Guru to ensure that the problem has been fixed.

If...

Then...

The problem has been fixed

You are finished with this procedure. Do NOT go to step 4.

The problem has not been fixed

Go to step 4.

4

Back up all data on the affected logical drives. (Step 7 will destroy all data on the affected logical drives.)

Note: To the operating system (OS), a failed logical drive is the same as a failed non-RAID drive. Refer to the OS documentation for requirements concerning failed drives and apply them where necessary.

5

If any of the affected logical drives are also source or target logical drives in a copy operation that is either Pending or In Progress, you must stop the copy operation before continuing.

Go to the Copy Manager by selecting Logical Drive >> Copy >> Copy Manager, then highlight each copy pair that contains an affected logical drive and select Copy >> Stop.
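The selection rule in step 5 can be sketched as a simple filter. This is an illustration only, assuming the copy pairs have been written down by hand from the Copy Manager. The CopyPair type, its field names, and the function name are hypothetical and do not correspond to any Storage Manager interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyPair:
    # Hypothetical record of one copy pair as listed in the Copy Manager.
    source: str   # name of the source logical drive
    target: str   # name of the target logical drive
    status: str   # status text, such as "Pending" or "In Progress"


def pairs_to_stop(pairs, affected_logical_drives):
    """Return the copy pairs that must be stopped before continuing.

    A pair qualifies when its copy operation is Pending or In Progress
    and either its source or its target is an affected logical drive.
    """
    affected = set(affected_logical_drives)
    active_statuses = {"Pending", "In Progress"}
    return [
        pair for pair in pairs
        if pair.status in active_statuses
        and (pair.source in affected or pair.target in affected)
    ]
```

Pairs in any other status are left out of the result, as are pairs that do not involve an affected logical drive.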

6

If you have flashcopy logical drives associated with the affected logical drives, these flashcopy logical drives will no longer be valid once you fail the drive in step 7.

If necessary, perform any operations on the flashcopy logical drives and then delete them.

7

Caution: Possible loss of data accessibility. Transitioning logical drives to failed may cause the loss of accessibility to data on the logical drives. Make sure that you back up all data on the affected logical drives before starting this step.

Highlight the affected drive in the Physical View of the Subsystem Management Window and select Advanced >> Recovery >> Fail Drive. The affected logical drives become Failed.

8

Remove the failed drive (its fault indicator light should be on).

Note: Make sure the replacement drive has a capacity equal to or greater than the failed drive.

9

Wait 30 seconds, then insert the new drive. Its fault indicator light may be lit for a short time (one minute or less).

Note: Wait until the replaced drive is ready (its fault indicator light must be off) before attempting to initialize the logical drives in step 10.

10

Highlight the array associated with the replaced drive in the Logical View of the Subsystem Management Window and select Advanced >> Recovery >> Initialize >> Array.

  • The logical drives in the array are initialized, one at a time.
  • To monitor initialization progress for a logical drive, highlight the logical drive in the Logical View of the Subsystem Management Window and select Logical Drive >> Properties. Note that when the initialization is completed, the progress bar is no longer displayed.
  • When initialization is completed, all logical drives in the array are Optimal.

Important: Make sure you save this procedure by selecting Save As. Once you fix the failure, you will not be able to access the information from Recovery Guru.

11

Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area.

If...

Then...

The problem has been fixed.

a

If desired, create any flashcopy logical drives that you deleted in step 6.

b

If desired, re-create any copies you stopped by highlighting the copy pairs in the Copy Manager and selecting Copy >> Re-Copy.

c

Add the affected logical drives back to the operating system. You may need to reboot the system to see the re-initialized logical drives.

Note: Do not start I/O to these logical drives until you have restored data from backup.

d

Restore the data for the affected logical drives from backup.

e

You are finished with this procedure.

The problem has not been fixed.

There is a problem with the controller. Go to "Recovery Steps for Replacing a Controller."

Recovery Steps for Replacing a Drive in a RAID 1, 3, or 5 Array

Use the following procedure if the affected array is RAID 1, 3, or 5.

1

You should stop all I/O to all logical drives in the array associated with the affected drive to reduce the possibility of data loss. If another drive fails in this array while you are performing this procedure, you will lose data.

2

Reseating the drive may clear up the path redundancy problem. Remove the drive and then re-insert it.

Note: The Service Action Allowed status in the Details area is always NO for this problem because the component is not failed. In this situation, it is acceptable to remove the drive even though the Service Action Allowed status is NO.

3

Wait 40 seconds, and then click the Recheck button to rerun the Recovery Guru to ensure that the problem has been fixed.

If...

Then...

The problem has been fixed

You are finished with this procedure. Do NOT go to step 4.

The problem has not been fixed

Go to step 4.

4

Although not required, you should back up all data on all logical drives associated with the affected drive.

5

Highlight the affected drive in the Physical View of the Subsystem Management Window and select Advanced >> Recovery >> Fail Drive. The associated logical drives become Degraded.

6

Remove the failed drive (its fault indicator light should be on).

Note: Make sure the replacement drive has a capacity equal to or greater than the failed drive.

7

Wait 30 seconds, then insert the new drive. Its fault indicator light may be lit for a short time (one minute or less).

8

Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area.

If...

Then...

The problem has been fixed.

You are finished with this procedure.

The problem has not been fixed.

There is a problem with the controller. Go to "Recovery Steps for Replacing a Controller."

Recovery Steps for Replacing a Controller

Important: The controller replacement recovery steps should only be attempted after ALL other options have been exhausted.

Use the following procedure to replace a controller to resolve a loss of path redundancy condition.

If...

Then...

Your storage subsystem has one controller

Go to "Replacing a Controller in a Single-Controller Storage Subsystem."

Your storage subsystem has two controllers

Go to "Replacing a Controller in a Dual-Controller Storage Subsystem."

Replacing a Controller in a Single-Controller Storage Subsystem

1

Ensure that your replacement controller matches the controller in the storage subsystem. If you do not have a controller with the appropriate replacement part number, contact your technical support representative.

2

Stop all I/O to this storage subsystem.

3

Turn off power to the affected enclosure.

4

Remove the affected controller. Refer to the Enterprise Management Window (EMW) to view which management method you are using to manage this storage subsystem.

If...

Then...

You are using In-Band management for ALL hosts attached to this storage subsystem

Go to step 5.

You are using Out-of-Band management for ANY host attached to this storage subsystem

Before you insert a new controller canister into the storage subsystem, you must update the DHCP/BOOTP server so that it will associate the new controller's hardware Ethernet (MAC) address with the DNS/network name and IP address previously assigned to the removed controller.

To update the DHCP/BOOTP server, find the entry associated with the removed controller and replace its Ethernet (MAC) address with the new controller's Ethernet (MAC) address. The controller's Ethernet (MAC) address is located on an Ethernet ID label on the controller canister in the form xx.xx.xx.xx.xx.xx.

When you are finished, go to step 5.
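The Ethernet ID label prints the address with periods (xx.xx.xx.xx.xx.xx), but some DHCP/BOOTP server configurations expect a different separator, such as colons. The sketch below converts the label form to lowercase colon-separated form and rejects malformed input. Both the function name and the choice of output format are assumptions. Check your server's documentation for the format it requires.

```python
import re

# Six two-digit hexadecimal groups separated by periods, as printed on the label.
_LABEL_FORM = re.compile(r"[0-9A-Fa-f]{2}(?:\.[0-9A-Fa-f]{2}){5}")


def label_mac_to_colon_form(label_mac: str) -> str:
    """Convert 'xx.xx.xx.xx.xx.xx' to 'xx:xx:xx:xx:xx:xx' in lowercase.

    Hypothetical helper; the colon-separated output is an assumption
    about what the DHCP/BOOTP server entry expects.
    """
    text = label_mac.strip()
    if _LABEL_FORM.fullmatch(text) is None:
        raise ValueError(f"not in xx.xx.xx.xx.xx.xx form: {label_mac!r}")
    return text.replace(".", ":").lower()
```

Validating the address before editing the server entry helps catch transcription errors from the label, which would otherwise leave the new controller without its previously assigned IP address.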

5

If...

Then...

The controller for this storage subsystem is located in an enclosure containing both controllers and drives

Check to see if the new controller canister contains a battery.

  • If your model of storage subsystem does not contain batteries, go to step 6.
  • If your model of storage subsystem is supposed to contain batteries and...
    • there is not a battery installed in the new controller canister, then install the battery from the old canister, and go to step 6.
    • there is a battery installed in the new controller canister, then go to step 6.

The controller for this storage subsystem is located in an enclosure containing only controllers

Go to step 6.

6

a

Make sure at least one minute has elapsed. Then, insert the new controller canister firmly in place.

b

Turn on power to the affected enclosure.

c

Note the controller slot (A or B) of the affected controller listed in the Recovery Guru Details area. Highlight this controller slot in the Physical View of the Subsystem Management Window (SMW).

d

If...

Then...

The controller indicates that it is Online

Go to step e.

The controller indicates that it is Offline

Select Advanced >> Recovery >> Place Controller >> Online and then go to step e.

e

If...

Then...

The controller for this storage subsystem is located in an enclosure containing both controllers and drives

Determine whether you need to reset the battery age.

  • If your model of storage subsystem does not contain batteries, go to step 7.
  • If your model of storage subsystem is supposed to contain batteries and...
    • you installed the battery from the old controller canister, then you do not need to reset the battery age. Go to step 7.
    • there was already a battery in the replacement controller canister, then you must reset the battery age using the following procedure:

      Select the Components button on the enclosure containing the controllers in the Physical View of the Subsystem Management Window. Highlight the batteries option and select the Reset button associated with the new controller canister (A or B). Then, go to step 7.

The controller for this storage subsystem is located in an enclosure containing only controllers

Go to step 7.
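The battery decisions in steps 5 and 6e depend on the same three facts, so they can be sketched together. This is a minimal illustration, assuming you have already determined the enclosure type, whether your model uses batteries, and whether the replacement canister arrived with a battery. The BatteryPlan type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatteryPlan:
    install_old_battery: bool  # step 5: move the battery from the old canister
    reset_battery_age: bool    # step 6e: reset the battery age once the controller is online


def plan_battery_actions(enclosure_has_drives: bool,
                         model_uses_batteries: bool,
                         new_canister_has_battery: bool) -> BatteryPlan:
    """Return the battery actions called for in steps 5 and 6e."""
    # Controller-only enclosures and models without batteries need no battery action.
    if not enclosure_has_drives or not model_uses_batteries:
        return BatteryPlan(install_old_battery=False, reset_battery_age=False)
    if new_canister_has_battery:
        # A battery that came with the replacement canister needs its age reset.
        return BatteryPlan(install_old_battery=False, reset_battery_age=True)
    # Reusing the old battery keeps its existing age, so no reset is needed.
    return BatteryPlan(install_old_battery=True, reset_battery_age=False)
```

The two actions are mutually exclusive: you either carry the old battery over or reset the age of the one that came with the replacement canister, never both.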

7

If you have logical drives mapped to hosts that have Automatic Logical Drive Transfer (ADT) disabled, it may be necessary to redistribute the logical drives to their preferred controller. Use the following steps to determine the ADT status of the hosts connected to your storage subsystem:

a

Open the Storage Subsystem Profile by selecting the Storage Subsystem >> View Profile menu option from the Subsystem Management Window. Then, select the profile's Mappings tab.

b

Scroll to the NVSRAM Host Type Internal Definitions section.

If... Then...
There are hosts mapped to the logical drives on this storage subsystem that have an ADT status of disabled

OR

There are hosts mapped to the logical drives on this storage subsystem that are not running a host-based, multi-path failover driver

It may be necessary to redistribute the logical drives to their preferred controller. If the Subsystem Management Window's Advanced >> Recovery >> Redistribute Logical Drives menu option is available, select the option.

Note: If you have a mix of hosts with ADT enabled and ADT disabled, all logical drives will be immediately assigned back to their preferred path. However, until the host-based multi-path failover driver detects the valid preferred path (which may take several minutes), the logical drives mapped to the ADT-enabled hosts may be temporarily returned to the non-preferred path.

If the menu option is not available (grayed out), the logical drives are already associated with their preferred controllers and no action is needed.

Go to step 8.

There are NO hosts mapped to the logical drives on this storage subsystem with an ADT status of disabled

OR

All hosts mapped to logical drives on this storage subsystem are running a host-based, multi-path failover driver

No action is required.

If logical drives need to be redistributed to their preferred controller, the host-based, multi-path failover driver will automatically initiate the transfer.

Note that detection of a restored preferred path by the multi-path failover driver can take several minutes.

Go to step 8.
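The two conditions in the table above can be sketched as a single check over the mapped hosts. Read literally, the two rows overlap (for example, when a host has ADT disabled but every host runs a failover driver), so this sketch assumes the first row takes precedence. The Host type, its fields, and the function name are hypothetical; the values must be gathered by hand from the Storage Subsystem Profile and from each host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # Hypothetical record of one host mapped to logical drives on the subsystem.
    name: str
    adt_enabled: bool          # ADT status from the profile's Mappings tab
    has_failover_driver: bool  # host runs a host-based, multi-path failover driver


def may_need_manual_redistribution(hosts) -> bool:
    """Return True if step 7 calls for checking Redistribute Logical Drives.

    True when any mapped host has ADT disabled or lacks a multi-path
    failover driver (first row of the table, assumed to take precedence).
    """
    return any(
        (not host.adt_enabled) or (not host.has_failover_driver)
        for host in hosts
    )
```

When the function returns False, the failover drivers are expected to move the logical drives back on their own, which can take several minutes.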

8

Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area. If the failure appears again, contact your technical support representative.

Replacing a Controller in a Dual-Controller Storage Subsystem

1

Determine which is the affected controller by locating the non-working channel. Refer to step 3 at the beginning of this recovery procedure for details on how to locate the non-working channel.

2

Place the affected controller offline.

a

Highlight the affected controller in the Physical View of the Subsystem Management Window.

b

Select Advanced >> Recovery >> Place Controller >> Offline.

c

Select Yes in the Place Offline confirmation window.

d

Go to step 3.

3

Read all of the following steps before taking any action.

a

Click the Recheck button to rerun the Recovery Guru.

b

Select the Offline Controller problem that is being reported in the Summary area.

c

Complete the Recovery Steps in the Offline Controller procedure to replace the controller.

4

Click the Recheck button to rerun the Recovery Guru. The failure should no longer appear in the Summary area. If the failure appears again, contact your technical support representative.