How to recover mongodb on a Juju server stuck in RECOVERING mode

A Juju controller in HA runs a mongodb replica set whose members are in PRIMARY or SECONDARY state.

It may happen (for instance, if one server is shut down) that a member can no longer synchronize with the others and gets stuck in RECOVERING state.

The symptoms are log messages like these in /var/log/syslog:

Sep 10 06:28:28 ct1-VM-juju-prod-01 mongod.37017[1920]: [ReplicationExecutor] syncing from: 10.4.1.139:37017
Sep 10 06:28:29 ct1-VM-juju-prod-01 mongod.37017[1920]: [rsBackgroundSync] we are too stale to use 10.4.1.139:37017 as a sync source
Sep 10 06:28:29 ct1-VM-juju-prod-01 mongod.37017[1920]: [ReplicationExecutor] could not find member to sync from

where 10.4.1.139 is the IP address of the PRIMARY mongodb member.
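
A quick way to check for this symptom is to search the log for the "too stale" message. The following is a minimal sketch, not part of the original procedure: the function name `is_too_stale` is hypothetical, and it assumes the message text matches the sample above and that you can read the log file (on many systems this needs sudo or membership of the adm group).

```shell
#!/bin/sh
# is_too_stale [LOGFILE]: succeed if LOGFILE contains the "too stale" message.
# LOGFILE defaults to /var/log/syslog. The function name is hypothetical.
is_too_stale() {
    grep -q 'we are too stale to use .* as a sync source' "${1:-/var/log/syslog}"
}

# Example use (commented out so that sourcing this file has no side effects):
# if is_too_stale; then echo "mongod reports a stale sync source"; fi
```

The pattern matches any sync source address, so it does not need to be edited for each environment.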

Logging in to the mongodb shell (TODO: add instructions here) and issuing rs.status() will show an entry like the following for the stale member:

{
        "_id" : 13,
        "name" : "10.3.1.30:37017",
        "health" : 1,
        "state" : 3,
        "stateStr" : "RECOVERING",
        "uptime" : 1141,
        "optime" : {
                "ts" : Timestamp(1519741743, 80),
                "t" : NumberLong(554)
        },
        "optimeDate" : ISODate("2018-02-27T14:29:03Z"),
        "lastHeartbeat" : ISODate("2018-09-10T09:26:25.735Z"),
        "lastHeartbeatRecv" : ISODate("2018-09-10T09:26:24.218Z"),
        "pingMs" : NumberLong(0),
        "configVersion" : 72
},
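
With several members, it can help to reduce the rs.status() output to one line per member. The sketch below assumes the output has been saved as text in the format shown above (quoted keys, one field per line, "name" appearing before "stateStr" within each member); other shell versions may print it differently. The function name `list_member_states` is hypothetical.

```shell
#!/bin/sh
# list_member_states: read rs.status() text on stdin and print
# "<name> <stateStr>" for every member found.
# Splitting on double quotes turns a line such as
#     "name" : "10.3.1.30:37017",
# into fields where $2 is the key and $4 is the value.
list_member_states() {
    awk -F'"' '
        $2 == "name"     { name = $4 }
        $2 == "stateStr" { print name, $4 }
    '
}

# Example use, with rs_status.txt as a hypothetical saved copy of the output:
# list_member_states < rs_status.txt | grep RECOVERING
```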

Here is the procedure to recover the stale mongodb server.

  1. Make sure that you have a complete backup of your PRIMARY controller:

https://docs.jujucharms.com/2.3/en/controllers-backup

  2. Take note of the IP addresses of all controllers (PRIMARY and SECONDARY).

  3. Stop the juju agents on ALL controllers (repeat for each controller machine):

    $ juju ssh $MACHINE_NUMBER sudo systemctl stop jujud-machine-$MACHINE_NUMBER.service
    
  4. SSH to the RECOVERING node:

    $ ssh -i .local/share/juju/ssh/id_rsa $RECOVERING_MACHINE_IP
    
  5. Gracefully shut down the mongodb server:

Ubuntu-18 or older:

    $ sudo systemctl stop juju-db

Ubuntu-20 or newer:

    $ sudo snap stop juju-db

  6. Move the "dbPath" directory aside as a backup and create an empty one in its place, so that mongod performs a full initial sync when it starts:

Ubuntu-18 or older:

    $ sudo mv /var/lib/juju/db /var/lib/juju/db.orig
    $ sudo mkdir /var/lib/juju/db
    $ sudo chmod 700 /var/lib/juju/db

Ubuntu-20 or newer:

    $ sudo mv /var/snap/juju-db/common/db /var/snap/juju-db/common/db.orig
    $ sudo mkdir /var/snap/juju-db/common/db
    $ sudo chmod 700 /var/snap/juju-db/common/db

  7. Start the mongod server:

Ubuntu-18 or older:

    $ sudo systemctl start juju-db

Ubuntu-20 or newer:

    $ sudo snap start juju-db
    
  8. Wait until the sync completes (verify with rs.status(): the member's stateStr changes to SECONDARY).

  9. When all the RECOVERING controllers have returned to SECONDARY state, start all juju agents, beginning with the PRIMARY:

    $ ssh -i .local/share/juju/ssh/id_rsa $PRIMARY_MACHINE_IP
    $ sudo systemctl start jujud-machine-$MACHINE_NUMBER.service
    $ ssh -i .local/share/juju/ssh/id_rsa $SECONDARY_MACHINE_IP
    $ sudo systemctl start jujud-machine-$MACHINE_NUMBER.service
    $ ssh -i .local/share/juju/ssh/id_rsa $SECONDARY_MACHINE_IP
    $ sudo systemctl start jujud-machine-$MACHINE_NUMBER.service
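
The step in the procedure above that moves the "dbPath" directory aside can be wrapped in a small helper that refuses to overwrite an earlier backup. This is only a sketch under assumptions: the function name `reset_dbpath` is hypothetical, it must be run as root on the real paths, and mongod must already be stopped.

```shell
#!/bin/sh
# reset_dbpath DIR: move DIR to DIR.orig and recreate DIR empty with mode 700.
# It refuses to run if DIR is missing or if DIR.orig already exists; without
# that guard, mv would place DIR inside an existing DIR.orig directory.
# The function name is hypothetical.
reset_dbpath() {
    db=$1
    if [ ! -d "$db" ]; then
        echo "no such directory: $db" >&2
        return 1
    fi
    if [ -e "$db.orig" ]; then
        echo "backup already exists: $db.orig" >&2
        return 1
    fi
    mv "$db" "$db.orig" && mkdir "$db" && chmod 700 "$db"
}

# Example use (as root, with mongod stopped):
# reset_dbpath /var/lib/juju/db
```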
    

NOTE 1: the initial sync may increase the load on the PRIMARY node, so it should be performed during a maintenance window at a time of low usage.

NOTE 2: the initial sync should be performed on one SECONDARY at a time. Make sure that the reset SECONDARY is in sync with the PRIMARY before resetting the next SECONDARY server.
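
Waiting for a reset member to catch up before touching the next one can be scripted as a polling loop. The sketch below is an illustration only: because this document does not yet describe how to query the mongodb shell, the command that prints the member's stateStr is passed in as an argument, and `wait_for_secondary`, `POLL_INTERVAL` and `MAX_TRIES` are hypothetical names.

```shell
#!/bin/sh
# wait_for_secondary CMD [ARGS...]: run CMD repeatedly until it prints
# exactly SECONDARY. CMD stands for whatever you use to obtain the
# member's stateStr. Returns 0 on success, 1 if MAX_TRIES is exhausted.
# POLL_INTERVAL is the pause between tries in seconds (default 10);
# MAX_TRIES is the number of attempts (default 360).
wait_for_secondary() {
    tries=0
    while [ "$tries" -lt "${MAX_TRIES:-360}" ]; do
        # A failing CMD is treated as an unknown state, not as success.
        state=$("$@") || state=UNKNOWN
        if [ "$state" = "SECONDARY" ]; then
            return 0
        fi
        echo "state is $state, waiting..." >&2
        tries=$((tries + 1))
        sleep "${POLL_INTERVAL:-10}"
    done
    return 1
}
```

Only when the function returns 0 for the member that was just reset would the next SECONDARY be taken through the procedure.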