Avoiding data loss with built-in data protection in Solaris

It is often really important to have the assurance that comes from the built-in capabilities an operating system provides out of the box, and in the case of Oracle Solaris, more of this end-user value is constantly being engineered in (and innovated on top of).

One of the older 32-bit Intel Pentium-based systems running an earlier version of Solaris (acting as a test bed for various storage data services) recently experienced a set of questionable disk conditions. (A few years ago, Google conducted an interesting study on the frequency of disk failures.) Yes, there are cases where an older generation of computers still has plenty of useful life remaining.

Looking closer, I noticed that in September of last year there was, in fact, an event registering a number of possible disk-failure symptoms. This resulted in an alert from the Solaris availability subsystem (the Fault Management Architecture, or FMA) about one of the disks having experienced errors. Because FMA is engineered alongside the ZFS architecture (disk failures are noticed by ZFS and communicated to FMA directly), an accurate picture of the system’s state gets communicated faster. And since the disks in this particular storage pool are unified into a simple mirror, a single disk’s questionable health did not impact access to the data; it continued to be available.

isaac@hp162:~# zpool status data
  pool: data
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 11h57m with 0 errors on Sat Sep 14 21:52:01 2013
config:

    NAME        STATE     READ WRITE CKSUM
    data        DEGRADED     0     0     0
      mirror-0  DEGRADED     0     0     0
        c8d0    ONLINE       0     0     0
        c9d0    DEGRADED     0     0     0  too many errors

errors: No known data errors
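
The 'scan:' line above is a reminder that periodic scrubs are worth scheduling: a scrub walks every allocated block in the pool and verifies its checksum, catching latent corruption that normal reads might never touch. Kicking one off by hand is trivial (a sketch):

isaac@hp162:~# zpool scrub data    # read and verify every allocated block in the pool
isaac@hp162:~# zpool status data   # the 'scan:' line reports scrub progress and results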

If you follow the sun.com URL provided in the output above, you’ll be redirected to the proper My Oracle Support page (after validating your Oracle.com single sign-on) that describes the error and the corrective actions you can take.

Taking a look at FMA reveals the specifics of the observed failure event. Note the EVENT-ID column:

isaac@hp162:~# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Sep 12 22:30:36 150e64e3-a340-cfbc-bca3-e8acefaa62a8  ZFS-8000-GH    Major     

Host        : hp162
Platform    : PY197AV-ABA-a1150y        Chassis_id  : MXG54003B1-NA540
Product_sn  : 

Fault class : fault.fs.zfs.vdev.checksum
Affects     : zfs://pool=data/vdev=fed9a5cb08d6c467
                  faulted but still in service
Problem in  : zfs://pool=data/vdev=fed9a5cb08d6c467
                  faulted but still in service

Description : The number of checksum errors associated with a ZFS device
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/ZFS-8000-GH for more information.

Response    : The device has been marked as degraded.  An attempt
              will be made to activate a hot spare if available.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.
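
To see the raw history behind this diagnosis, the FMA logs can be replayed with fmdump: by default the fault log, and with -e the underlying error telemetry (a sketch; the UUID is the EVENT-ID from above):

isaac@hp162:~# fmdump -v -u 150e64e3-a340-cfbc-bca3-e8acefaa62a8   # verbose detail for this specific case
isaac@hp162:~# fmdump -e                                           # raw error reports (ereports) that feed diagnoses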

Replacing the questionable disk is, naturally, the recommended course of action.
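
With a spare drive bay (or a hot spare) available, a single zpool replace would swap the disk and start the resilver in one step (a sketch; c10d0 is a hypothetical replacement device):

isaac@hp162:~# zpool replace data c9d0 c10d0   # swap in the new disk and resilver onto it

Here the replacement disk went into the same slot, so instead the failing disk is first taken offline and then detached from the mirror: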

isaac@hp162:~# zpool offline data c9d0
isaac@hp162:~# zpool status -x
  pool: data
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scan: scrub repaired 0 in 11h57m with 0 errors on Sat Sep 14 21:52:01 2013
config:

        NAME        STATE     READ WRITE CKSUM
        data        DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            c8d0    ONLINE       0     0     0
            c9d0    OFFLINE      0     0     0

errors: No known data errors

isaac@hp162:~# zpool detach  data c9d0
isaac@hp162:~# zpool status data
  pool: data
 state: ONLINE
 scan: resilvered 201K in 0h0m with 0 errors on Wed Feb 12 13:45:46 2014
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          c8d0      ONLINE       0     0     0

errors: No known data errors

Having the disk detached introduces an element of risk, in that the pool runs on a single disk for the time being. Ideally, one would have a hot spare configured that would automatically take over for the failed disk the moment a failure is detected.
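
Configuring one is a single command (a sketch; c10d0 is a hypothetical unused disk):

isaac@hp162:~# zpool add data spare c10d0   # dedicate c10d0 as a hot spare for this pool
isaac@hp162:~# zpool status data            # the spare now appears under a 'spares' section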

Since the objective here is to preserve the data and have the two sides of the mirror provide the resiliency again, we can attach the replacement disk to the surviving side and let resilvering take place.

isaac@hp162:~# zpool attach  data c8d0 c9d0

isaac@hp162:~# zpool status -x
  pool: data
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Wed Feb 12 13:53:24 2014
    12.4M scanned out of 1.48T at 671K/s, 659h11m to go
    12.4M resilvered, 0.00% done
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c8d0    ONLINE       0     0     0
            c9d0    ONLINE       0     0     0  (resilvering)

errors: No known data errors
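
While the resilver runs, zpool iostat offers a view of the rebuild activity (a sketch; the trailing 5 is a sampling interval in seconds):

isaac@hp162:~# zpool iostat -v data 5   # per-device bandwidth and IOPS, sampled every 5 seconds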

And then, sometime later:

isaac@hp162:~$ zpool status data 
  pool: data
 state: ONLINE
 scan: resilvered 1.48T in 53h9m with 0 errors on Fri Feb 14 19:02:41 2014
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c8d0    ONLINE       0     0     0
            c9d0    ONLINE       0     0     0

errors: No known data errors

If this happens to you, remember to update FMA on the activity you’ve performed. Until you do, the jury is still out on the specific event that took place; once your own “investigation” has cleared the hardware, you can issue an acquittal of the EVENT-ID.

isaac@hp162:~# fmadm acquit 150e64e3-a340-cfbc-bca3-e8acefaa62a8
fmadm: recorded acquittal of 150e64e3-a340-cfbc-bca3-e8acefaa62a8
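
A quick re-check confirms that FMA now considers the case closed; with nothing outstanding, the faulty list should come back empty:

isaac@hp162:~# fmadm faulty   # no output means no open fault cases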

Hopefully, this doesn’t happen again. But if and when it does (*wink wink*), it’s better to have the proper proactive processes and notification frameworks already in place.
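
For instance, on Solaris 11 the Fault Manager can email you whenever a new problem is diagnosed, via SMF notification parameters (a sketch; the address is hypothetical):

isaac@hp162:~# svccfg setnotify problem-diagnosed mailto:admin@example.com   # email on each new FMA diagnosis
isaac@hp162:~# svccfg listnotify problem-diagnosed                           # verify the notification binding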

Happy Valentine’s Day, ZFS! 😎
