Building Idempotent Async Job Processing

High level overview

We need this because ?

We want to process tasks/jobs only once. It is hard to deliver exactly once semantics in distributed system, but we can build resiliency and fault tolerance in our system in such a way that the downstream effects happen only once.

We do this by idempotency, avoiding duplicates, setting checkpoints to recover from and retry-ability.

We want to process jobs only once because

A job can be picked up more than once but to make sure we only process the job once and correctly. We need to add persistence to store the state of job processing and make sure the job is not concurrently processed by multiple workers.

We also use a key to manage idempotency, called idempotency key. This key is unique in a limited window of time (we don’t need it to be fully unique)

The different components of the system

Now coming to Jobs, here is where it gets tricky

Jobs will have multiple stages in its lifetime, generally: PENDING, INITIATED, STEP_1_DONE, STEP_2_DONE, FAILED_RETRYABLE, FAILED_PERMANENT

For each job we will use following fields

The status field denotes states of the JOB and will be a Directed acyclic graph.

We will use states as recovery points of the JOB, that means we will have to make each state change atomic.

Each job phase can be

We need to make sure each job state change is atomic and each stage acts as recovery point for the job to continue from in case of retry.

For example

Job Stages

Let’s look into this diagram

In case of failure