Flink enables the seamless recovery of stream processing applications from failures by implementing periodic checkpoints. These checkpoints capture the current state of operators and the consumption progress of streams, allowing applications to resume processing from the last successful point in case of an interruption. This is achieved using an algorithm similar to the Chandy-Lamport algorithm for distributed snapshots, called Asynchronous Barrier Snapshotting (ABS). ABS ensures that all nodes in the system agree on a common snapshot, even in the presence of asynchronous updates and failures, allowing for consistent and efficient recovery.
How does it works?
- The Job Manager inserts control records at regular intervals, known as stage barriers, into the stream. These barriers segment the stream into distinct stages, and at the conclusion of each stage, the collection of operator states captures the entire execution history up to that point. This allows for the creation of a snapshot, which can be used to restore the system’s state in the event of a failure.
- A source task takes a snapshot of its current state when it receives a barrier, and then shares that snapshot with all its output tasks.
- When a non-source task receives a barrier from one of its inputs, it blocks that input until it has received a barrier from all the inputs. It then takes a snapshot of its current state and broadcasts the barrier to its outputs. Finally, it unblocks its inputs. This blocking guarantees that the checkpoint contains the state after processing all the elements before the barrier and no elements after the barrier.
The snapshot taken while the inputs are blocked are logical , where the actual, physical snapshot is happening asynchronously in the background. One way to achieve this is through copy-on-write techniques. This is done to reduce the duration of this blocking phase to start processing data again as quickly as possible.