Initially we had set up replication V2.1 (Cheesecake) to NOT
do fail-back, at least in the initial version.
It turns out that fail-back in the Cinder code is rather easy:
we just allow calling failover-host on a host that's already
failed-over and use the *special* keyword "default" as the
backend_id argument, which signifies we want to switch back to
whatever is configured as the default in the cinder.conf file.
To do this we just add some logic that checks the secondary_backend_id
param in volume.manager:failover_host and sets the service fields
appropriately. Note that we send the call to the driver
first, giving it a chance to raise an exception if it can't
satisfy the request at the current time.
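As a rough illustration of the flow described above, here is a minimal,
self-contained sketch. It is not the actual Cinder code: Service,
DemoDriver, UnableToFailOver and the exact field values are stand-ins
assumed for this example. It only shows the ordering (driver first,
service fields second) and the handling of the "default" keyword.

```python
from dataclasses import dataclass


class UnableToFailOver(Exception):
    """Stand-in for the error a driver raises when it cannot comply."""


@dataclass
class Service:
    # Hypothetical, simplified view of the service record.
    replication_status: str = "enabled"
    active_backend_id: str = ""
    disabled: bool = False
    disabled_reason: str = ""


class DemoDriver:
    """Hypothetical driver that may refuse a fail-back request."""

    def __init__(self, can_fail_back=True):
        self.can_fail_back = can_fail_back

    def failover_host(self, volumes, secondary_id=None):
        if secondary_id == "default" and not self.can_fail_back:
            raise UnableToFailOver("cannot fail back right now")
        # Return the now-active backend and an (empty) list of volume updates.
        return secondary_id, []


def failover_host(service, driver, volumes, secondary_backend_id=None):
    # Ask the driver first; if it raises, the service is left untouched.
    active_backend_id, volume_updates = driver.failover_host(
        volumes, secondary_id=secondary_backend_id)

    if secondary_backend_id == "default":
        # Fail-back: return to whatever is configured as the default backend.
        service.replication_status = "enabled"
        service.active_backend_id = ""
        service.disabled = False
        service.disabled_reason = ""
    else:
        # Regular failover to the requested secondary backend.
        service.replication_status = "failed-over"
        service.active_backend_id = active_backend_id
        service.disabled = True
        service.disabled_reason = "failed-over"
    return volume_updates
```

Because the driver call comes before any field is changed, a refused
fail-back leaves the service in its failed-over state.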
We also needed to modify volume.api:failover_host to allow
failed-over as a valid transition state, and again update the
Service query to include disabled services, since a failed-over
service is marked disabled.
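The API-side check can be pictured roughly as follows. This is only a
sketch: the dict-based service records, ALLOWED_REPLICATION_STATES and
find_failover_candidate are hypothetical names for this example, not
the Cinder API.

```python
# States from which a failover-host request is accepted in this sketch.
# "failed-over" is what makes fail-back possible.
ALLOWED_REPLICATION_STATES = ("enabled", "failed-over")


def find_failover_candidate(services, host):
    """Return the service for host, considering disabled services too."""
    for svc in services:
        if svc["host"] != host:
            continue
        # Deliberately no filtering on svc["disabled"] here.
        if svc["replication_status"] not in ALLOWED_REPLICATION_STATES:
            raise ValueError(
                "replication status %r does not allow failover"
                % svc["replication_status"])
        return svc
    raise LookupError("no service found for host %r" % host)
```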
It's up to drivers to decide whether they want to require some
extra admin steps, and to document exactly how this works. It's also
possible that during an initial failover a driver might want to
return a status update for all volumes NOT replicated and mark
their volume status as "error".
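A driver that wants to flag non-replicated volumes could build its
update list along these lines. This is a sketch under assumptions: the
"replicated" key on each volume dict and the build_failover_updates
helper are made up for this example, and the shape of each update
entry (a volume_id plus an updates dict) is assumed here rather than
taken from this change.

```python
def build_failover_updates(volumes):
    """Build per-volume status updates to return from an initial failover."""
    updates = []
    for vol in volumes:
        if vol.get("replicated"):
            # Replicated volumes are now served from the secondary backend.
            change = {"replication_status": "failed-over"}
        else:
            # Non-replicated volumes are unreachable after the failover.
            change = {"status": "error"}
        updates.append({"volume_id": vol["id"], "updates": change})
    return updates
```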
Expected behavior is depicted in the service output here:
http://paste.openstack.org/show/488294/