Packet manager is unable to rejoin the Hazelcast cache cluster

Problem

We have observed that after a substantial amount of packet processing, or after any restart of the packet manager pods, the packet manager is unable to rejoin the Hazelcast cache cluster. This causes packet processing failures in multiple registration processor stages (uploader, validator, OSI, UIN, demo).

This generally happens when the packet manager has been running continuously for more than a week, which results in one of the replicas restarting automatically.

Solution

We are working on a fix for this issue. In the meantime, the following workaround is available.

Restart the packet manager at regular intervals (once a week) so that it does not become overloaded and remains in the cluster. The steps are as follows.

  1. Stop the reprocessor stage so that packets do not get reprocessed.

  2. Check that no packets are flowing between the stages.

  3. Redeploy the packet manager using Helm.

  4. Once the packet manager service is up and running, start the reprocessor and monitor for 10-15 minutes to confirm that packets are being processed without any issues.

Before redeploying the packet manager using Helm, make sure that no packets are flowing in the registration processor; otherwise, those packets will fail.
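
The steps above can be sketched as a small dry-run script. This is only an illustration under stated assumptions, not the supported procedure: the namespaces and deployment names (regproc, regproc-reprocess, packetmanager) and the replica count are hypothetical and must be replaced with the values from your own installation. The sketch also uses "kubectl rollout restart" as a stand-in for the Helm redeploy in step 3; substitute the Helm command you normally use for the packet manager. Step 2 remains a manual check.

```python
import subprocess
from typing import List, Optional, Tuple

# A step is a description plus an optional command (None = manual check).
Step = Tuple[str, Optional[List[str]]]


def kubectl(namespace: str, *args: str) -> List[str]:
    """Build a kubectl command line scoped to one namespace."""
    return ["kubectl", "-n", namespace, *args]


def restart_plan(
    reproc_ns: str = "regproc",              # assumed namespace
    reproc_deploy: str = "regproc-reprocess",  # assumed deployment name
    pm_ns: str = "packetmanager",            # assumed namespace
    pm_deploy: str = "packetmanager",        # assumed deployment name
    reproc_replicas: int = 1,                # assumed replica count
) -> List[Step]:
    """Return the weekly restart procedure as an ordered list of steps."""
    return [
        ("Stop the reprocessor stage",
         kubectl(reproc_ns, "scale", f"deployment/{reproc_deploy}", "--replicas=0")),
        ("Confirm that no packets are flowing between stages (manual check)",
         None),
        # Stand-in for the Helm redeploy described in step 3.
        ("Restart the packet manager",
         kubectl(pm_ns, "rollout", "restart", f"deployment/{pm_deploy}")),
        ("Wait until the packet manager is up and running",
         kubectl(pm_ns, "rollout", "status", f"deployment/{pm_deploy}", "--timeout=300s")),
        ("Start the reprocessor stage",
         kubectl(reproc_ns, "scale", f"deployment/{reproc_deploy}",
                 f"--replicas={reproc_replicas}")),
    ]


def run(plan: List[Step], dry_run: bool = True) -> None:
    """Print each step; execute the commands only when dry_run is False."""
    for description, cmd in plan:
        print(f"# {description}")
        if cmd is None:
            if not dry_run:
                input("Press Enter once this has been verified... ")
            continue
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only prints the commands, nothing is executed.
    run(restart_plan(), dry_run=True)
```

Run it with dry_run=True first and compare the printed commands against your environment. After the final step, continue to monitor packet processing for 10-15 minutes as described above.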