
Checkpoint Storage

Introduction

Checkpoint is a fault-tolerance recovery mechanism. It ensures that if a running program suddenly encounters an exception, it can recover itself from the last saved state.

Checkpoint Storage

Checkpoint Storage is the storage mechanism used to persist checkpoint data.

SeaTunnel Engine supports the following checkpoint storage types:

  • HDFS (S3, HDFS, LocalFile)
  • LocalFile (native) (deprecated: use HDFS (LocalFile) instead)

We use the microkernel design pattern to separate the checkpoint storage module from the engine, which allows users to implement their own checkpoint storage modules.

checkpoint-storage-api is the checkpoint storage module API, which defines the interface of the checkpoint storage module.

If you want to implement your own checkpoint storage module, you need to implement CheckpointStorage and provide the corresponding CheckpointStorageFactory implementation.
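As a minimal sketch of how such a module is wired in (assuming my-storage is the hypothetical plugin name registered by your CheckpointStorageFactory), it would then be selected in seatunnel.yaml the same way as the built-in plugins:

```yaml
seatunnel:
  engine:
    checkpoint:
      storage:
        type: my-storage # hypothetical plugin name registered by your CheckpointStorageFactory
        plugin-config:
          K1: V1 # options passed to your CheckpointStorage implementation
```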

Checkpoint Storage Configuration

The configuration of the seatunnel-server module is in the seatunnel.yaml file.


```yaml
seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs # plugin name of checkpoint storage; we support hdfs (S3, local, hdfs); localfile (native local file) is the default, but this plugin is deprecated
        # plugin configuration
        plugin-config:
          namespace: # checkpoint storage parent path, the default value is /seatunnel/checkpoint
          K1: V1 # other plugin configuration
          K2: V2 # other plugin configuration
```

S3

S3 is based on hdfs-file, so you can refer to the Hadoop docs to configure S3.

Except when interacting with public S3 buckets, the S3A client needs credentials to interact with buckets. The client supports multiple authentication mechanisms and can be configured as to which mechanisms to use, and their order of use. Custom implementations of com.amazonaws.auth.AWSCredentialsProvider may also be used. If you use SimpleAWSCredentialsProvider, the credentials consist of an access key and a secret key, and you can configure them like this:

```yaml
seatunnel:
  engine:
    checkpoint:
      interval: 6000
      timeout: 7000
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage-type: s3
          s3.bucket: your-bucket
          fs.s3a.endpoint: your-endpoint
          fs.s3a.access-key: your-access-key
          fs.s3a.secret-key: your-secret-key
          fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
```


If you use InstanceProfileCredentialsProvider, which supports the use of instance profile credentials when running in an EC2 VM, you can check IAM Roles for Amazon EC2. You can configure it like this:


```yaml
seatunnel:
  engine:
    checkpoint:
      interval: 6000
      timeout: 7000
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage-type: s3
          s3.bucket: your-bucket
          fs.s3a.endpoint: your-endpoint
          fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.InstanceProfileCredentialsProvider
```

For additional reading on the Hadoop Credential Provider API see: Credential Provider API.
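If you prefer not to keep the access key and secret key in seatunnel.yaml as plaintext, one option is a Hadoop credential store referenced through the standard hadoop.security.credential.provider.path property. This is a hedged sketch: it assumes plugin-config entries are passed through to the underlying Hadoop configuration, and the keystore path is a placeholder:

```yaml
seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage-type: s3
          s3.bucket: your-bucket
          fs.s3a.endpoint: your-endpoint
          # standard Hadoop property; the JCEKS keystore would hold the
          # fs.s3a.access.key and fs.s3a.secret.key entries
          hadoop.security.credential.provider.path: jceks://hdfs@localhost:9000/user/seatunnel/s3.jceks
```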

HDFS

If you use HDFS, you can configure it like this:

```yaml
seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage-type: hdfs
          fs.defaultFS: hdfs://localhost:9000
          # if you use Kerberos, you can configure it like this:
          kerberosPrincipal: your-kerberos-principal
          kerberosKeytab: your-kerberos-keytab
```
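If your HDFS cluster runs in high-availability mode, the standard Hadoop client HA properties should apply here as well. This is a hedged sketch under the same assumption that plugin-config entries are handed to the Hadoop configuration; mycluster, namenode1, and namenode2 are placeholder names:

```yaml
seatunnel:
  engine:
    checkpoint:
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          storage-type: hdfs
          fs.defaultFS: hdfs://mycluster # nameservice ID instead of a single NameNode address
          dfs.nameservices: mycluster
          dfs.ha.namenodes.mycluster: nn1,nn2
          dfs.namenode.rpc-address.mycluster.nn1: namenode1:8020
          dfs.namenode.rpc-address.mycluster.nn2: namenode2:8020
          dfs.client.failover.proxy.provider.mycluster: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
```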

LocalFile

If you use a local file, you can configure it like this:

```yaml
seatunnel:
  engine:
    checkpoint:
      interval: 6000
      timeout: 7000
      max-concurrent: 5
      tolerable-failure: 2
      storage:
        type: hdfs
        max-retained: 3
        plugin-config:
          namespace: /tmp/seatunnel/checkpoint_snapshot
          storage-type: hdfs
```

Notice

Because the original binary package lacks the Hadoop class files, if you use hdfs as the checkpoint storage type, you need to download the following package and put it in the $SEATUNNEL_HOME/lib directory. This problem will be fixed in the next version.

https://repo1.maven.org/maven2/org/apache/seatunnel/seatunnel-hadoop3-3.1.4-uber/2.3.0/seatunnel-hadoop3-3.1.4-uber-2.3.0-optional.jar