Ferhat's Blog

There will be only one database

What happens when exadata has lost a disk?

Posted by fsengonul on February 5, 2011

We experienced a disk failure today and replaced the disk without any problems or manual commands.
This morning we lost a disk in our Exadata. We received an alert and an email stating “Hard disk status changed to predictive failure: critical”. The email also included a drawing showing the location of the failed disk.
The cell and ASM logs clearly show that the grid disks were dropped and a rebalance operation was started to ensure that all data still has two copies.
We did not wait for the Oracle/Sun engineer to come and replace the disk. Our system admins replaced it, and Exadata automatically recognized the new disk and started a new rebalance operation without any manual commands.


/* cell  triggers the drop operation */
Sat Feb 05 11:50:31 2011
Received subopcode 6 in publish ASM Query on 3 guids.
NOTE: Initiating ASM Instance operation: ASM DROP critical disk on 3 disks
DATA_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
RECO_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
SYSTEMDG_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]


/* the corrupt disk has been replaced with the spare one */
Sat Feb 05 16:40:44 2011
Drop celldisk CD_08_cel11 (options: force, from memory only) - begin
Drop celldisk CD_08_cel11 - end
Sat Feb 05 16:40:44 2011
Open received invalid device name SYSTEMDG_CD_08_cel11
Sat Feb 05 16:40:44 2011
Open received invalid device name SYSTEMDG_CD_08_cel11
Sat Feb 05 16:42:44 2011
create CELLDISK CD_08_cel11 on device /dev/sdi
Sat Feb 05 16:42:44 2011
create GRIDDISK DATA_CD_08_cel11 on CELLDISK CD_08_cel11
Griddisk DATA_CD_08_cel11  - number is (248)
NOTE: Initiating ASM instance operation:
Operation: DROP and ADD of ASM disk for Grid disk guid=00000xxxx-yyyy-zzzz-0000-000000000000
Received subopcode 4 in publish ASM Query on 1 guids.
NOTE: Initiating ASM Instance operation: ASM DROP ADD disk on 1 disks
DATA_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]

Storage Index Allocation for GridDisk DATA_CD_08_cel11 successful

Sat Feb 05 16:42:44 2011
create GRIDDISK RECO_CD_08_cel11 on CELLDISK CD_08_cel11
Griddisk RECO_CD_08_cel11  - number is (252)
NOTE: Initiating ASM instance operation:
Operation: DROP and ADD of ASM disk for Grid disk guid=00000xxxx-yyyy-zzzz-0000-000000000000
Received subopcode 4 in publish ASM Query on 1 guids.
NOTE: Initiating ASM Instance operation: ASM DROP ADD disk on 1 disks
RECO_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]

Storage Index Allocation for GridDisk RECO_CD_08_cel11 successful

Sat Feb 05 16:42:44 2011
create GRIDDISK SYSTEMDG_CD_08_cel11 on CELLDISK CD_08_cel11
Griddisk SYSTEMDG_CD_08_cel11  - number is (256)
NOTE: Initiating ASM instance operation:
Operation: DROP and ADD of ASM disk for Grid disk guid=00000xxxx-yyyy-zzzz-0000-000000000000
Received subopcode 4 in publish ASM Query on 1 guids.
NOTE: Initiating ASM Instance operation: ASM DROP ADD disk on 1 disks
SYSTEMDG_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
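
For anyone who wants to pull a timeline out of a log like this, here is a minimal Python sketch. It assumes the layout seen in the excerpt above (a timestamp line followed by one or more message lines). The embedded `LOG` text is an abbreviated copy of the lines quoted in this post, and the names `parse_events` and `CREATE_GRIDDISK` are hypothetical helpers of my own, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Abbreviated copy of the cell alert log lines quoted in this post.
LOG = """\
Sat Feb 05 11:50:31 2011
NOTE: Initiating ASM Instance operation: ASM DROP critical disk on 3 disks
Sat Feb 05 16:40:44 2011
Drop celldisk CD_08_cel11 (options: force, from memory only) - begin
Sat Feb 05 16:42:44 2011
create CELLDISK CD_08_cel11 on device /dev/sdi
Sat Feb 05 16:42:44 2011
create GRIDDISK DATA_CD_08_cel11 on CELLDISK CD_08_cel11
Sat Feb 05 16:42:44 2011
create GRIDDISK RECO_CD_08_cel11 on CELLDISK CD_08_cel11
Sat Feb 05 16:42:44 2011
create GRIDDISK SYSTEMDG_CD_08_cel11 on CELLDISK CD_08_cel11
"""

# Assumed timestamp layout, e.g. "Sat Feb 05 11:50:31 2011".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"
CREATE_GRIDDISK = re.compile(r"^create GRIDDISK (\S+) on CELLDISK (\S+)$")


def parse_events(text):
    """Return (timestamp, message) pairs; a message gets the last timestamp seen."""
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            current = datetime.strptime(line, TIMESTAMP_FORMAT)
        except ValueError:
            # Not a timestamp line, so treat it as a message.
            if current is not None:
                events.append((current, line))
    return events


events = parse_events(LOG)

# Time at which the cell asked ASM to drop the failing disk's grid disks.
first_drop = next(ts for ts, msg in events if "ASM DROP critical disk" in msg)

# Grid disks re-created on the replacement disk, in log order.
recreated = []
for ts, msg in events:
    match = CREATE_GRIDDISK.match(msg)
    if match:
        recreated.append((ts, match.group(1)))

for ts, name in recreated:
    print(f"{name} re-created {ts - first_drop} after the drop request")
```

For the excerpt above this reports the three grid disks (DATA, RECO, SYSTEMDG) as re-created 4:52:13 after the drop request. That interval measures the gap between the two log entries, not how long the physical swap itself took.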

7 Responses to “What happens when exadata has lost a disk?”

  1. laotsao said

    Did you set up ASR?
    It is my understanding that ASR should automatically send errors such as disk failures to the Oracle/Sun service center.

    • fsengonul said

      Yes, you’re right. It’s not configured for the existing environment at the moment. All of our systems are monitored 24/7 by HP Openview and grid.

  2. laotsao said

    ASR and HP Openview can coexist.

  3. Nice post, Ferhat,
    Actually, I’m somewhat doubtful that the partner information is used for this at all. To test this I recently created a disk group made up of 3 disks, each in a different failure group (3 fail groups). The disk group redundancy level was set to normal. I didn’t create *any* tablespaces on this disk group. I went to the storage cell and set one of the grid disks to ‘inactive’ (cellsrv’s way of offlining a grid disk). As expected, the disk group stayed online. Then I set the status of another grid disk to ‘inactive’ in another storage cell. ASM immediately dismounted the disk group. Now, this wasn’t a very thorough test of course. I can think of a few more tests that would be more conclusive, but since there was no tablespace (no user data) on the disk group at all, I would have expected the disk group to remain mounted when I ‘lost’ 2 of the three disks. Any thoughts as to why it worked this way when there was no PST data to be lost?

  4. […] one of our Customer Exadata references) has published the Logfiles from such an incident with this posting. Thank you for that and also for your fine presentation about the Exadata […]

  5. Hello Ferhat,

    We ran into the same problem. One of the disks in the FR Exadata lit its amber light. CELLCLI was reporting a predictive failure, and the error count kept increasing. There is a Metalink note that deals directly with this, and the problem can be fixed without re-creating the celldisk, griddisk and ASM disk. After dropping the affected disk from ASM, you physically remove it and insert the new one. Meanwhile, the higher the power limit, the faster the rebalancing goes. Once the disk has been dropped, you insert the clean one. The cell sees it automatically and records the insertion date for the new physical disk. Then you present it to ASM. Under normal conditions it goes like this, but sometimes you may also need to define the celldisk and griddisk yourself. There is no harm in that anyway.

    Thanks for sharing, and well done.

    Ogan
