Kafka pods go into CrashLoopBackOff

Problem

In MOSIP v1.1.4 we introduced Kafka for processing packets, but the Kafka pods go into the CrashLoopBackOff state and keep crashing. Even after restarts, the pods do not come up.

Root cause

So far, this CrashLoopBackOff issue has been seen with the following errors:

  1. Epoch mismatch. This happens when the clocks are not in sync across all the machines. (error: current epoch is not same as expected epoch)

  2. Pod in CrashLoopBackOff because it is unable to lock the disk. This happens when there is an issue with NFS. (error: unable to lock /var/*** )

  3. Kafka pod is waiting for the other ZooKeeper pods to start, or is still replicating data from the other Kafka pods. (error: Kafka server is still starting up, cannot shut down!)
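
To tell which of the three cases applies, the log of the crashed container can be matched against the error texts listed above. The following is a minimal sketch, not a definitive tool: the function name classify_crash_log and the grep patterns are assumptions derived from the error strings quoted in this page, and the exact wording in your logs may differ.

```shell
#!/bin/sh
# Sketch: map a Kafka/ZooKeeper pod log to one of the three causes above.
# The patterns are assumptions based on the error text quoted in this page.

classify_crash_log() {
  # Read the whole log from stdin.
  _log="$(cat)"
  if printf '%s\n' "$_log" | grep -Eqi 'current epoch|accepted epoch'; then
    echo "epoch-mismatch"
  elif printf '%s\n' "$_log" | grep -qi 'unable to lock'; then
    echo "disk-lock"
  elif printf '%s\n' "$_log" | grep -qi 'still starting up'; then
    echo "slow-startup"
  else
    echo "unknown"
  fi
}

# Usage (assumes kubectl access to the cluster; kafka-0 is an example pod name):
#   kubectl logs kafka-0 --previous | classify_crash_log
```

The --previous flag asks for the log of the last terminated container, which is usually the one that holds the error for a pod in CrashLoopBackOff.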

Solution

To resolve the issues above, follow the steps below that correspond to the error observed.

  1. For the epoch mismatch issue, identify the pod that reports the mismatch. Describe the pod to find the volume attached to it, go to that ZooKeeper/Kafka volume under the NFS folder, and copy currentEpoch to acceptedEpoch.
    e.g. cd /srv/nfs/mosip/default-data-kafka-zookeeper-0-pvc-2dadadf4-6983-4a5e-897b-f7e7c7c4ca7e/data/version-2/
    cp currentEpoch acceptedEpoch
    This happens when the clocks of the mz worker nodes are out of sync with each other.

  2. For the pod unable to lock the disk:
    Check whether the node is healthy and whether there is any mounting issue on that node. Restart the pod once the issue is resolved.

  3. For the third case, where the pod is still starting up when it gets restarted, increase the probe timing in the statefulset:

    Run kc1 edit statefulset kafka and increase the initial delay seconds (initialDelaySeconds) of the probes to 240 or 300 seconds. Save the statefulset YAML; the changes are rolled out as per the pod disruption policy. Edit the ZooKeeper statefulset in the same way: kc1 edit statefulset kafka-zookeeper
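
The epoch fix from step 1 can be wrapped in a small helper that keeps a backup before overwriting anything. This is a sketch under assumptions: the function name sync_accepted_epoch and the .bak backup file are illustrative, and the file names currentEpoch and acceptedEpoch are the ones ZooKeeper writes in its version-2 directory.

```shell
#!/bin/sh
# Sketch: back up acceptedEpoch, then overwrite it with currentEpoch.
# Takes the ZooKeeper version-2 directory as its only argument.

sync_accepted_epoch() {
  _dir="$1"
  if [ ! -f "$_dir/currentEpoch" ]; then
    echo "currentEpoch not found in $_dir" >&2
    return 1
  fi
  # Keep the old value so the change can be reverted (backup name is an assumption).
  if [ -f "$_dir/acceptedEpoch" ]; then
    cp "$_dir/acceptedEpoch" "$_dir/acceptedEpoch.bak"
  fi
  cp "$_dir/currentEpoch" "$_dir/acceptedEpoch"
  echo "acceptedEpoch set to $(cat "$_dir/acceptedEpoch")"
}

# Usage on the NFS server, with the example path from step 1:
#   sync_accepted_epoch /srv/nfs/mosip/default-data-kafka-zookeeper-0-pvc-2dadadf4-6983-4a5e-897b-f7e7c7c4ca7e/data/version-2
```

Because the underlying cause is clock drift, the mismatch is likely to return unless time synchronization on the nodes is fixed as well.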

To completely redeploy the Kafka pods at any time, follow these steps:

  1. Remove the Kafka release from Helm

  2. Remove the PVC (persistent volume claim) of Kafka

  3. Remove the PV (persistent volume) of Kafka

  4. Remove the persisted Kafka folders (the volume directories on NFS)

  5. Run the Kafka playbook
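
The five steps can be sketched as a script that prints the commands by default, so they can be reviewed before anything is deleted. Every name here is an assumption to adjust for your setup: the release name kafka, the namespace default and the PVC names are inferred from the volume folder name shown earlier, the PV names must be filled in from kubectl get pv, and the playbook path kafka.yml is a hypothetical placeholder. helm delete is used as written; on Helm 2 the release record is kept unless --purge is added.

```shell
#!/bin/sh
# Sketch of the five redeploy steps. Prints the commands by default;
# set DRY_RUN=0 to execute them. All names below are assumptions.

RELEASE="${RELEASE:-kafka}"                          # Helm release name (assumed)
NAMESPACE="${NAMESPACE:-default}"                    # inferred from the volume folder name
NFS_ROOT="${NFS_ROOT:-/srv/nfs/mosip}"               # NFS folder from the epoch example
PVCS="${PVCS:-data-kafka-0 data-kafka-zookeeper-0}"  # assumed PVC names
PVS="${PVS:-}"                                       # fill in from: kubectl get pv
PLAYBOOK="${PLAYBOOK:-kafka.yml}"                    # hypothetical playbook path
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

redeploy_kafka() {
  # Step 1: remove the release from Helm.
  run helm delete "$RELEASE"
  # Step 2: remove the claims. $PVCS is unquoted on purpose: one argument per name.
  run kubectl -n "$NAMESPACE" delete pvc $PVCS
  # Step 3: remove the volumes, if any names were supplied.
  if [ -n "$PVS" ]; then
    run kubectl delete pv $PVS
  fi
  # Step 4: remove the persisted folders (run this where the NFS folder is mounted).
  run find "$NFS_ROOT" -maxdepth 1 -type d -name "$NAMESPACE-data-$RELEASE-*" -exec rm -rf {} +
  # Step 5: run the playbook (add the inventory options your setup needs).
  run ansible-playbook "$PLAYBOOK"
}

# Usage: source this file, then call redeploy_kafka to review the plan,
# and DRY_RUN=0 redeploy_kafka to apply it.
```

Keeping dry-run as the default is a deliberate choice, since steps 2 to 4 destroy data that cannot be recovered afterwards.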