What happens when Exadata loses a disk?
Posted by fsengonul on February 5, 2011
We experienced a disk failure today and replaced the disk without any problems or manual commands.
This morning we lost a disk in our Exadata. We received an alert and an email stating “Hard disk status changed to predictive failure: critical”. The email also included a diagram showing the location of the failing disk.
The cell and ASM logs show clearly that the grid disks were dropped and a rebalance operation was started to ensure that all data still has 2 copies.
We did not wait for the Oracle/Sun engineer to come and replace the disk. Our system admins replaced it, and Exadata automatically recognized the new disk and started a new rebalance operation without any manual commands.
/* cell triggers the drop operation */
Sat Feb 05 11:50:31 2011
Received subopcode 6 in publish ASM Query on 3 guids.
NOTE: Initiating ASM Instance operation: ASM DROP critical disk on 3 disks
DATA_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
RECO_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
SYSTEMDG_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]

/* the corrupt disk has been replaced with the spare one */
Sat Feb 05 16:40:44 2011
Drop celldisk CD_08_cel11 (options: force, from memory only) - begin
Drop celldisk CD_08_cel11 - end
Sat Feb 05 16:40:44 2011
Open received invalid device name SYSTEMDG_CD_08_cel11
Sat Feb 05 16:40:44 2011
Open received invalid device name SYSTEMDG_CD_08_cel11
Sat Feb 05 16:42:44 2011
create CELLDISK CD_08_cel11 on device /dev/sdi
Sat Feb 05 16:42:44 2011
create GRIDDISK DATA_CD_08_cel11 on CELLDISK CD_08_cel11
Griddisk DATA_CD_08_cel11 - number is (248)
NOTE: Initiating ASM instance operation:
Operation: DROP and ADD of ASM disk for Grid disk guid=00000xxxx-yyyy-zzzz-0000-000000000000
Received subopcode 4 in publish ASM Query on 1 guids.
NOTE: Initiating ASM Instance operation: ASM DROP ADD disk on 1 disks
DATA_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
Storage Index Allocation for GridDisk DATA_CD_08_cel11 successful
Sat Feb 05 16:42:44 2011
create GRIDDISK RECO_CD_08_cel11 on CELLDISK CD_08_cel11
Griddisk RECO_CD_08_cel11 - number is (252)
NOTE: Initiating ASM instance operation:
Operation: DROP and ADD of ASM disk for Grid disk guid=00000xxxx-yyyy-zzzz-0000-000000000000
Received subopcode 4 in publish ASM Query on 1 guids.
NOTE: Initiating ASM Instance operation: ASM DROP ADD disk on 1 disks
RECO_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
Storage Index Allocation for GridDisk RECO_CD_08_cel11 successful
Sat Feb 05 16:42:44 2011
create GRIDDISK SYSTEMDG_CD_08_cel11 on CELLDISK CD_08_cel11
Griddisk SYSTEMDG_CD_08_cel11 - number is (256)
NOTE: Initiating ASM instance operation:
Operation: DROP and ADD of ASM disk for Grid disk guid=00000xxxx-yyyy-zzzz-0000-000000000000
Received subopcode 4 in publish ASM Query on 1 guids.
NOTE: Initiating ASM Instance operation: ASM DROP ADD disk on 1 disks
SYSTEMDG_CD_08_cel11 [00000xxxx-yyyy-zzzz-0000-000000000000]
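To read the timeline out of a log like the one above without scanning it by eye, a small script can help. The following is only a sketch in Python, not an Oracle tool: the embedded excerpt is a trimmed copy of the log lines above, the helper names (`parse_events`, `first_time`) are made up for illustration, and it assumes each timestamp sits on its own line and that the process runs with an English locale so that `strptime` understands `Sat` and `Feb`.

```python
import re
from datetime import datetime

# Trimmed copy of the cell alert log above: timestamps and key event lines only.
LOG = """\
Sat Feb 05 11:50:31 2011
NOTE: Initiating ASM Instance operation: ASM DROP critical disk on 3 disks
Sat Feb 05 16:40:44 2011
Drop celldisk CD_08_cel11 (options: force, from memory only) - begin
Sat Feb 05 16:42:44 2011
create CELLDISK CD_08_cel11 on device /dev/sdi
Sat Feb 05 16:42:44 2011
create GRIDDISK DATA_CD_08_cel11 on CELLDISK CD_08_cel11
Sat Feb 05 16:42:44 2011
create GRIDDISK RECO_CD_08_cel11 on CELLDISK CD_08_cel11
Sat Feb 05 16:42:44 2011
create GRIDDISK SYSTEMDG_CD_08_cel11 on CELLDISK CD_08_cel11
"""

# A timestamp line looks like "Sat Feb 05 11:50:31 2011".
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_events(text):
    """Return (timestamp, message) pairs; a message inherits the last timestamp seen."""
    events = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TS_RE.match(line):
            current = datetime.strptime(line, TS_FORMAT)
        elif current is not None:
            events.append((current, line))
    return events


def first_time(events, needle):
    """Timestamp of the first message containing `needle`, or None if absent."""
    for ts, msg in events:
        if needle in msg:
            return ts
    return None


events = parse_events(LOG)
dropped_at = first_time(events, "ASM DROP critical disk")
recreated_at = first_time(events, "create CELLDISK")

# Time between the automatic drop and the cell disk being recreated on the new drive.
gap = recreated_at - dropped_at

# Grid disk names are the third token of each "create GRIDDISK <name> on ..." line.
grid_disks = [msg.split()[2] for _, msg in events if msg.startswith("create GRIDDISK")]

print("drop requested:     ", dropped_at)
print("celldisk recreated: ", recreated_at)
print("elapsed:            ", gap)
print("grid disks recreated:", ", ".join(grid_disks))
```

Run against the full alert log instead of the excerpt, the same approach shows how long the cell ran without the failed disk and confirms that every grid disk on it was recreated.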

laotsao said
Did you set up ASR?
It is my understanding that ASR should automatically send errors such as disk failures to the Oracle/Sun service center.
fsengonul said
Yes, you’re right. It’s not configured for the existing environment at the moment. All of our systems are monitored 24/7 by HP Openview and grid.
laotsao said
ASR and HP Openview can coexist.
fsengonul said
The openview agents are not installed on the storage cells. It only checks whether the cells are up and accessible.
Randy Johnson said
Nice post Ferhat,
Actually, I’m somewhat doubtful that the partner information is used for this at all. To test this I recently created a disk group made up of 3 disks, each in a different failure group (3 fail groups). The disk group redundancy level was set to normal. I didn’t create *any* tablespaces on this disk group. I went to the storage cell and set one of the grid disks to ‘inactive’ (cellsrv’s way of offlining a grid disk). As expected, the disk group stayed online. Then I set the status of another grid disk to ‘inactive’ in another storage cell. ASM immediately dismounted the disk group. Now, this wasn’t a very thorough test of course. I can think of a few more tests that would be more conclusive, but since there was no tablespace (no user data) on the disk group at all, I would have expected the disk group to remain mounted when I ‘lost’ 2 of the three disks. Any thoughts as to why it worked this way when there was no PST data to be lost?
Replacing a damaged Hard Disk on Exadata Cells « The Oracle Instructor said
[...] one of our Customer Exadata references) has published the Logfiles from such an incident with this posting. Thank you for that and also for your fine presentation about the Exadata [...]
Ogan Özdoğan said
Hello Ferhat,
We ran into the same problem. One disk of the FR Exadata showed an amber light. CELLCLI was reporting a predictive failure, and the error count was increasing as well. There is a Metalink note that covers this case directly, and the problem can be fixed without re-creating the celldisk, the griddisks and the ASM disks. After dropping the affected disk from ASM, you physically pull it out and insert the new one. The higher the power limit, the faster the rebalance runs. Once the disk has been dropped, you put in the clean one. The cell sees it automatically and records the insertion date for the new physical disk. Then you present it to ASM. Under normal conditions it goes like this, but sometimes you may also have to define the celldisk and griddisks yourself. There is nothing wrong with that either.
Thanks for sharing, and well done.
Ogan