Version: Next

Paimon

Paimon sink connector

Description

Sink connector for Apache Paimon. It can support cdc mode 、auto create table.

Supported DataSource Info

Datasource	Dependent	Maven
Paimon	hive-exec	Download
Paimon	libfb303	Download

Database Dependency

In order to be compatible with different versions of Hadoop and Hive, the scope of hive-exec in the project pom file are provided, so if you use the Flink engine, first you may need to add the following Jar packages to <FLINK_HOME>/lib directory, if you are using the Spark engine and integrated with Hadoop, then you do not need to add the following Jar packages.

hive-exec-xxx.jar
libfb303-xxx.jar

Some versions of the hive-exec package do not have libfb303-xxx.jar, so you also need to manually import the Jar package.

Key features

Options

name	type	required	default value	Description
warehouse	String	Yes	-	Paimon warehouse path
catalog_type	String	No	filesystem	Catalog type of Paimon, support filesystem and hive
catalog_uri	String	No	-	Catalog uri of Paimon, only needed when catalog_type is hive
database	String	Yes	-	The database you want to access
table	String	Yes	-	The table you want to access
hdfs_site_path	String	No	-	The path of hdfs-site.xml
schema_save_mode	Enum	No	CREATE_SCHEMA_WHEN_NOT_EXIST	The schema save mode
data_save_mode	Enum	No	APPEND_DATA	The data save mode
paimon.table.primary-keys	String	No	-	Default comma-separated list of columns (primary key) that identify a row in tables.(Notice: The partition field needs to be included in the primary key fields)
paimon.table.partition-keys	String	No	-	Default comma-separated list of partition fields to use when creating tables.
paimon.table.write-props	Map	No	-	Properties passed through to paimon table initialization, reference.
paimon.hadoop.conf	Map	No	-	Properties in hadoop conf
paimon.hadoop.conf-path	String	No	-	The specified loading path for the 'core-site.xml', 'hdfs-site.xml', 'hive-site.xml' files
paimon.table.non-primary-key	Boolean	false	-	Switch to create `table with PK` or `table without PK`. true : `table without PK`, false : `table with PK`

Checkpoint in batch mode

When you set checkpoint.interval to a value greater than 0 in batch mode, the paimon connector will commit the data to the paimon table when the checkpoint triggers after a certain number of records have been written. At this moment, the written data in paimon that is visible. However, if you do not set checkpoint.interval in batch mode, the paimon sink connector will commit the data after all records are written. The written data in paimon that is not visible until the batch task completes.

Changelog

You must configure the changelog-producer=input option to enable the changelog producer mode of the paimon table. If you use the auto-create table function of paimon sink, you can configure this property in paimon.table.write-props.

The changelog producer mode of the paimon table has four mode which is none、input、lookup and full-compaction.

All changelog-producer modes are currently supported. The default is none.

none
input
lookup
full-compaction
note： When you use a streaming mode to read paimon table，different mode will produce different results。

Filesystems

The Paimon connector supports writing data to multiple file systems. Currently, the supported file systems are hdfs and s3. If you use the s3 filesystem. You can configure the fs.s3a.access-key、fs.s3a.secret-key、fs.s3a.endpoint、fs.s3a.path.style.access、fs.s3a.aws.credentials.provider properties in the paimon.hadoop.conf option. Besides, the warehouse should start with s3a://.

Schema Evolution

Cdc Ingestion supports a limited number of schema changes. Currently supported schema changes includes:

Adding columns.
Modify column. More specifically, If you modify the column type, the following changes are supported:
- altering from a string type (char, varchar, text) to another string type with longer length,
- altering from a binary type (binary, varbinary, blob) to another binary type with longer length,
- altering from an integer type (tinyint, smallint, int, bigint) to another integer type with wider range,
- altering from a floating-point type (float, double) to another floating-point type with wider range,
are supported.
Note:
If {oldType} and {newType} belongs to the same type family, but old type has higher precision than new type. Ignore this convert.
Drop columns.
Change columns.

Examples

Schema evolution

env {
  # You can set engine configuration here
  parallelism = 5
  job.mode = "STREAMING"
  checkpoint.interval = 5000
  read_limit.bytes_per_second=7000000
  read_limit.rows_per_second=400
}

source {
  MySQL-CDC {
    server-id = 5652-5657
    username = "st_user_source"
    password = "mysqlpw"
    table-names = ["shop.products"]
    base-url = "jdbc:mysql://mysql_cdc_e2e:3306/shop"
    
    schema-changes.enabled = true
  }
}

sink {
  Paimon {
    warehouse = "file:///tmp/paimon"
    database = "mysql_to_paimon"
    table = "products"
  }
}

Single table

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    table-names = ["seatunnel.role"]
  }
}

transform {
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="file:///tmp/seatunnel/paimon/hadoop-sink/"
    database="seatunnel"
    table="role"
  }
}

Single table with s3 filesystem

env {
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    schema = {
      fields {
        c_map = "map<string, string>"
        c_array = "array<int>"
        c_string = string
        c_boolean = boolean
        c_tinyint = tinyint
        c_smallint = smallint
        c_int = int
        c_bigint = bigint
        c_float = float
        c_double = double
        c_bytes = bytes
        c_date = date
        c_decimal = "decimal(38, 18)"
        c_timestamp = timestamp
      }
    }
  }
}

sink {
  Paimon {
    warehouse = "s3a://test/"
    database = "seatunnel_namespace11"
    table = "st_test"
    paimon.hadoop.conf = {
        fs.s3a.access-key=G52pnxg67819khOZ9ezX
        fs.s3a.secret-key=SHJuAQqHsLrgZWikvMa3lJf5T0NfM5LMFliJh9HF
        fs.s3a.endpoint="http://minio4:9000"
        fs.s3a.path.style.access=true
        fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
    }
  }
}

Single table(Specify hadoop HA config and kerberos config)

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    table-names = ["seatunnel.role"]
  }
}

transform {
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="hdfs:///tmp/seatunnel/paimon/hadoop-sink/"
    database="seatunnel"
    table="role"
    paimon.hadoop.conf = {
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
      security.kerberos.login.principal = "your-kerberos-principal"
      security.kerberos.login.keytab = "your-kerberos-keytab-path"
    }
  }
}

Single table(Specify hadoop HA config with hadoop_user_name)

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    table-names = ["seatunnel.role"]
  }
}

transform {
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="hdfs:///tmp/seatunnel/paimon/hadoop-sink/"
    database="seatunnel"
    table="role"
    paimon.hadoop.conf = {
      hadoop_user_name = "hdfs"
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
      security.kerberos.login.principal = "your-kerberos-principal"
      security.kerberos.login.keytab = "your-kerberos-keytab-path"
    }
  }
}

Single table(Hive catalog)

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    schema = {
      fields {
        pk_id = bigint
        name = string
        score = int
      }
      primaryKey {
        name = "pk_id"
        columnNames = [pk_id]
      }
    }
    rows = [
      {
        kind = INSERT
        fields = [1, "A", 100]
      },
      {
        kind = INSERT
        fields = [2, "B", 100]
      },
      {
        kind = INSERT
        fields = [3, "C", 100]
      },
      {
        kind = INSERT
        fields = [3, "C", 100]
      },
      {
        kind = INSERT
        fields = [3, "C", 100]
      },
      {
        kind = INSERT
        fields = [3, "C", 100]
      }
      {
        kind = UPDATE_BEFORE
        fields = [1, "A", 100]
      },
      {
        kind = UPDATE_AFTER
        fields = [1, "A_1", 100]
      },
      {
        kind = DELETE
        fields = [2, "B", 100]
      }
    ]
  }
}

sink {
  Paimon {
    schema_save_mode = "RECREATE_SCHEMA"
    catalog_name="seatunnel_test"
    catalog_type="hive"
    catalog_uri="thrift://hadoop04:9083"
    warehouse="hdfs:///tmp/seatunnel"
    database="seatunnel_test"
    table="st_test3"
    paimon.hadoop.conf = {
      fs.defaultFS = "hdfs://nameservice1"
      dfs.nameservices = "nameservice1"
      dfs.ha.namenodes.nameservice1 = "nn1,nn2"
      dfs.namenode.rpc-address.nameservice1.nn1 = "hadoop03:8020"
      dfs.namenode.rpc-address.nameservice1.nn2 = "hadoop04:8020"
      dfs.client.failover.proxy.provider.nameservice1 = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
      dfs.client.use.datanode.hostname = "true"
    }
  }
}

Single table with write props of paimon

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    table-names = ["seatunnel.role"]
  }
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="file:///tmp/seatunnel/paimon/hadoop-sink/"
    database="seatunnel"
    table="role"
    paimon.table.write-props = {
        bucket = 2
        file.format = "parquet"
    }
    paimon.table.partition-keys = "dt"
    paimon.table.primary-keys = "pk_id,dt"
  }
}

Write with the `changelog-producer` attribute

env {
 parallelism = 1
 job.mode = "STREAMING"
 checkpoint.interval = 5000
}

source {
 Mysql-CDC {
  base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
  username = "root"
  password = "******"
  table-names = ["seatunnel.role"]
 }
}

sink {
 Paimon {
  catalog_name = "seatunnel_test"
  warehouse = "file:///tmp/seatunnel/paimon/hadoop-sink/"
  database = "seatunnel"
  table = "role"
  paimon.table.write-props = {
   changelog-producer = full-compaction
   changelog-tmp-path = /tmp/paimon/changelog
  }
 }
}

Write to dynamic bucket table

Single dynamic bucket table with write props of paimon，operates on the primary key table and bucket is -1.

core options

Please reference

name	type	required	default values	Description
dynamic-bucket.target-row-num	long	yes	2000000L	controls the target row number for one bucket.
dynamic-bucket.initial-buckets	int	no		controls the number of initialized bucket.

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    table-names = ["seatunnel.role"]
  }
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="file:///tmp/seatunnel/paimon/hadoop-sink/"
    database="seatunnel"
    table="role"
    paimon.table.write-props = {
        bucket = -1
        dynamic-bucket.target-row-num = 50000
    }
    paimon.table.partition-keys = "dt"
    paimon.table.primary-keys = "pk_id,dt"
  }
}

Multiple table

example1

env {
  parallelism = 1
  job.mode = "STREAMING"
  checkpoint.interval = 5000
}

source {
  Mysql-CDC {
    base-url = "jdbc:mysql://127.0.0.1:3306/seatunnel"
    username = "root"
    password = "******"
    
    table-names = ["seatunnel.role","seatunnel.user","galileo.Bucket"]
  }
}

transform {
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="file:///tmp/seatunnel/paimon/hadoop-sink/"
    database="${database_name}_test"
    table="${table_name}_test"
  }
}

example2

env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  Jdbc {
    driver = oracle.jdbc.driver.OracleDriver
    url = "jdbc:oracle:thin:@localhost:1521/XE"
    user = testUser
    password = testPassword

    table_list = [
      {
        table_path = "TESTSCHEMA.TABLE_1"
      },
      {
        table_path = "TESTSCHEMA.TABLE_2"
      }
    ]
  }
}

transform {
}

sink {
  Paimon {
    catalog_name="seatunnel_test"
    warehouse="file:///tmp/seatunnel/paimon/hadoop-sink/"
    database="${schema_name}_test"
    table="${table_name}_test"
  }
}

Changelog

Change Log

Change	Commit	Version
[feature][connectors-v2] Support in predicate pushdown in paimon (#9379)	https://github.com/apache/seatunnel/commit/1ec43755d5	dev
[Improve][Connector-V2] Fix the word misspellings for paimon connector (#9332)	https://github.com/apache/seatunnel/commit/ba7f5c9e30	2.3.11
[Feature][Transform] Support define sink column type (#9114)	https://github.com/apache/seatunnel/commit/ab7119e507	2.3.11
[improve] paimon options (#9167)	https://github.com/apache/seatunnel/commit/b0889305c2	2.3.11
[Fix][Paimon] nullable and comment attribute was lost during automatic table creation (#9020)	https://github.com/apache/seatunnel/commit/eb54fdd52c	2.3.11
[Feature][Connector-V2] Support between predicate pushdown in paimon (#8962)	https://github.com/apache/seatunnel/commit/3b141cf621	2.3.10
[Feature][Connector-V2] Suppor Time type in paimon connector (#8880)	https://github.com/apache/seatunnel/commit/9f1e590091	2.3.10
[Feature][Paimon] Customize the hadoop user (#8888)	https://github.com/apache/seatunnel/commit/2657626f93	2.3.10
[Improve][Connector-v2][Paimon]PaimonCatalog close error message update (#8640)	https://github.com/apache/seatunnel/commit/48253da8d6	2.3.10
[Improve] restruct connector common options (#8634)	https://github.com/apache/seatunnel/commit/f3499a6eeb	2.3.10
[Improve][Connector-v2] Support checkpoint in batch mode for paimon sink (#8333)	https://github.com/apache/seatunnel/commit/f22d4ebd4d	2.3.9
[Feature][Connector-v2] Support schema evolution for paimon sink (#8211)	https://github.com/apache/seatunnel/commit/57190e2a3b	2.3.9
[Improve][dist]add shade check rule (#8136)	https://github.com/apache/seatunnel/commit/51ef800016	2.3.9
[Feature][Connector-v2] Support S3 filesystem of paimon connector (#8036)	https://github.com/apache/seatunnel/commit/e2a4772933	2.3.9
[Feature][transform] transform support explode (#7928)	https://github.com/apache/seatunnel/commit/132278c06a	2.3.9
[Feature][Connector-V2] Piamon Sink supports changelog-procuder is lookup and full-compaction mode (#7834)	https://github.com/apache/seatunnel/commit/c0f27c2f76	2.3.9
[Fix][connector-v2]Fix Paimon table connector Error log information. (#7873)	https://github.com/apache/seatunnel/commit/a3b49e6354	2.3.9
[Improve][Connector-v2] Use checkpointId as the commit's identifier instead of the hash for streaming write of paimon sink (#7835)	https://github.com/apache/seatunnel/commit/c7a384af2b	2.3.9
[Feature][Restapi] Allow metrics information to be associated to logical plan nodes (#7786)	https://github.com/apache/seatunnel/commit/6b7c53d03c	2.3.9
[Fix][Connecotr-V2] Fix paimon dynamic bucket tale in primary key is not first (#7728)	https://github.com/apache/seatunnel/commit/dc7f695537	2.3.8
[Improve][Connector-v2] Remove useless code and add changelog doc for paimon sink (#7748)	https://github.com/apache/seatunnel/commit/846d876dc2	2.3.8
[Hotfix][Connector-V2] Release resources even the task is crashed for paimon sink (#7726)	https://github.com/apache/seatunnel/commit/5ddf8d461e	2.3.8
[Fix][Connector-V2] Fix paimon e2e error (#7721)	https://github.com/apache/seatunnel/commit/61d1964361	2.3.8
[Feature][Connector-Paimon] Support dynamic bucket splitting improves Paimon writing efficiency (#7335)	https://github.com/apache/seatunnel/commit/bc0326cba8	2.3.8
[Feature][Connector-v2] Support streaming read for paimon (#7681)	https://github.com/apache/seatunnel/commit/4a2e27291c	2.3.8
[Hotfix][Seatunnel-common] Fix the CommonError msg for paimon sink (#7591)	https://github.com/apache/seatunnel/commit/d1f5db9257	2.3.8
[Feature][CONNECTORS-V2-Paimon] Paimon Sink supported truncate table (#7560)	https://github.com/apache/seatunnel/commit/4f3df22124	2.3.8
[Improve][Connector-v2] Improve the exception msg in case-sensitive case for paimon sink (#7549)	https://github.com/apache/seatunnel/commit/7d31e5668c	2.3.8
[Hotfix][Connector-V2] Fixed lost data precision for decimal data types (#7527)	https://github.com/apache/seatunnel/commit/df210ea73d	2.3.8
[Improve][API] Move catalog open to SaveModeHandler (#7439)	https://github.com/apache/seatunnel/commit/8c2c5c79a1	2.3.8
[Improve][Connector] Add multi-table sink option check (#7360)	https://github.com/apache/seatunnel/commit/2489f6446b	2.3.7
The isNullable attribute is true when the primary key field in the Paimon table converts the Column object. #7231 (#7242)	https://github.com/apache/seatunnel/commit/b0fe432e99	2.3.6
[Feature][Core] Support using upstream table placeholders in sink options and auto replacement (#7131)	https://github.com/apache/seatunnel/commit/c4ca74122c	2.3.6
[Paimon]support projection for paimon source (#6343)	https://github.com/apache/seatunnel/commit/6c1577267f	2.3.6
[Improve][Paimon] Add check for the base type between source and sink before write. (#6953)	https://github.com/apache/seatunnel/commit/d56d64fc04	2.3.6
[Improve][Connector-V2] Improve the paimon source (#6887)	https://github.com/apache/seatunnel/commit/658643ae53	2.3.6
[Hotfix][Connector-V2] Close the tableWrite when task is close (#6897)	https://github.com/apache/seatunnel/commit/23a744b9b2	2.3.6
[Fix][Connector-V2] Field information lost during Paimon DataType and SeaTunnel Column conversion (#6767)	https://github.com/apache/seatunnel/commit/6cf6e41da7	2.3.6
[Improve][Connector-V2] Support hive catalog for paimon sink (#6833)	https://github.com/apache/seatunnel/commit/4969c91dc4	2.3.6
[Hotfix][Connector-V2] Fix the batch write with paimon (#6865)	https://github.com/apache/seatunnel/commit/9ec971d942	2.3.6
[Feature][Doris] Add Doris type converter (#6354)	https://github.com/apache/seatunnel/commit/5189991843	2.3.6
[Improve][Connector-V2] Support hadoop ha and kerberos for paimon sink (#6585)	https://github.com/apache/seatunnel/commit/20b62f3bf3	2.3.5
[Feature][Paimon] Support specify paimon table write properties, partition keys and primary keys (#6535)	https://github.com/apache/seatunnel/commit/2b1234c7ae	2.3.5
[Feature][Connector-V2] Support multi-table sink feature for paimon #5652 (#6449)	https://github.com/apache/seatunnel/commit/b0abbd2d89	2.3.5
[Feature][Connectors-v2-Paimon] Adaptation Paimon 0.6 Version (#6061)	https://github.com/apache/seatunnel/commit/b32df930e9	2.3.4
[Fix][Connectors-v2-Paimon] Flink table store failed to prepare commit (#6057)	https://github.com/apache/seatunnel/commit/c8dcefc3be	2.3.4
[Improve][Common] Introduce new error define rule (#5793)	https://github.com/apache/seatunnel/commit/9d1b2582b2	2.3.4
[Improve] Remove use `SeaTunnelSink::getConsumedType` method and mark it as deprecated (#5755)	https://github.com/apache/seatunnel/commit/8de7408100	2.3.4
[Hotfix][Connector-V2][Paimon] Bump paimon-bundle version to 0.4.0-incubating (#5219)	https://github.com/apache/seatunnel/commit/2917542bfa	2.3.3
[Improve] Documentation and partial word optimization. (#4936)	https://github.com/apache/seatunnel/commit/6e8de0e2a6	2.3.3
[Connector-V2][Paimon] Introduce paimon connector (#4178)	https://github.com/apache/seatunnel/commit/da507bbe0e	2.3.2

Paimon

Description​

Supported DataSource Info​

Database Dependency​

Key features​

Options​

Checkpoint in batch mode​

Changelog​

Filesystems​

Schema Evolution​

Examples​

Schema evolution​

Single table​

Single table with s3 filesystem​

Single table(Specify hadoop HA config and kerberos config)​

Single table(Specify hadoop HA config with hadoop_user_name)​

Single table(Hive catalog)​

Single table with write props of paimon​

Write with the changelog-producer attribute​

Write to dynamic bucket table​

core options​

Multiple table​

example1​

example2​

Changelog​

Description

Supported DataSource Info

Database Dependency

Key features

Options

Checkpoint in batch mode

Changelog

Filesystems

Schema Evolution

Examples

Schema evolution

Single table

Single table with s3 filesystem

Single table(Specify hadoop HA config and kerberos config)

Single table(Specify hadoop HA config with hadoop_user_name)

Single table(Hive catalog)

Single table with write props of paimon

Write with the `changelog-producer` attribute

Write to dynamic bucket table

core options

Multiple table

example1

example2

Changelog