Version: 2.2.0-beta

Apache Iceberg

Apache Iceberg source connector

Description

Source connector for Apache Iceberg. It supports both batch and stream mode.

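Key features

Options

| name                     | type    | required | default value        |
|--------------------------|---------|----------|----------------------|
| catalog_name             | string  | yes      | -                    |
| catalog_type             | string  | yes      | -                    |
| uri                      | string  | no       | -                    |
| warehouse                | string  | yes      | -                    |
| namespace                | string  | yes      | -                    |
| table                    | string  | yes      | -                    |
| case_sensitive           | boolean | no       | false                |
| start_snapshot_timestamp | long    | no       | -                    |
| start_snapshot_id        | long    | no       | -                    |
| end_snapshot_id          | long    | no       | -                    |
| use_snapshot_id          | long    | no       | -                    |
| use_snapshot_timestamp   | long    | no       | -                    |
| stream_scan_strategy     | enum    | no       | FROM_LATEST_SNAPSHOT |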

catalog_name [string]

User-specified catalog name.

catalog_type [string]

The supported values are:

  • hive: The Hive Metastore catalog.
  • hadoop: The Hadoop catalog.

uri [string]

The Thrift URI of the Hive Metastore, for example thrift://localhost:9083.

warehouse [string]

The location to store metadata files and data files.

namespace [string]

The Iceberg database name in the backend catalog.

table [string]

The Iceberg table name in the backend catalog.

case_sensitive [boolean]

If data columns were selected via fields, controls whether the column names are matched against the table schema case-sensitively.

fields [array]

Uses projection to select the data columns to read and the order in which they are returned.

start_snapshot_id [long]

Instructs this scan to look for changes starting from a particular snapshot (exclusive).

start_snapshot_timestamp [long]

Instructs this scan to look for changes starting from the most recent snapshot for the table as of the given timestamp. The timestamp is in milliseconds since the Unix epoch.

end_snapshot_id [long]

Instructs this scan to look for changes up to a particular snapshot (inclusive).
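The two snapshot bounds can be combined to read only the changes between two snapshots. The following is a minimal sketch, assuming a Hadoop catalog like the one used in the examples on this page. The snapshot IDs are hypothetical placeholders and must be replaced with IDs from your own table's history.

```
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Hypothetical snapshot IDs, for illustration only.
    # Lower bound (exclusive): changes made after this snapshot are read.
    start_snapshot_id = 1111111111111111111
    # Upper bound (inclusive): changes up to and including this snapshot are read.
    end_snapshot_id = 2222222222222222222
  }
}
```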

use_snapshot_id [long]

Instructs this scan to use the given snapshot ID.

use_snapshot_timestamp [long]

Instructs this scan to use the most recent snapshot as of the given time. The timestamp is in milliseconds since the Unix epoch.
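To read the table as it existed at a point in time, a timestamp can be supplied in place of a snapshot ID. The following is a sketch only, assuming a Hadoop catalog like the one used in the examples on this page. The timestamp value is an arbitrary illustration (1672531200000 is 2023-01-01T00:00:00Z expressed in milliseconds).

```
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Illustrative value: 2023-01-01T00:00:00Z in milliseconds since the Unix epoch.
    # The scan uses the most recent snapshot created at or before this time.
    use_snapshot_timestamp = 1672531200000

    # Alternative (hypothetical ID): pin the scan to one snapshot instead.
    # use_snapshot_id = 1111111111111111111
  }
}
```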

stream_scan_strategy [enum]

Starting strategy for stream mode execution. Defaults to FROM_LATEST_SNAPSHOT if no value is specified. The supported values are:

  • TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan, then switch to incremental mode.
  • FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot (inclusive).
  • FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot (inclusive).
  • FROM_SNAPSHOT_ID: Start incremental mode from the snapshot with a specific ID (inclusive).
  • FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from the snapshot with a specific timestamp (inclusive).
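As an illustration of choosing a strategy, the sketch below reads the full table first and then follows new snapshots incrementally. It assumes a Hadoop catalog like the one used in the examples on this page and assumes the job is already configured to run in stream mode elsewhere in the job file. The commented alternative pairs FROM_SNAPSHOT_ID with start_snapshot_id, which is an assumption about how the starting snapshot is supplied and is not stated in this document.

```
source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    # Scan the existing table contents once, then switch to incremental reads.
    stream_scan_strategy = "TABLE_SCAN_THEN_INCREMENTAL"

    # Alternative (assumption: the starting snapshot is taken from start_snapshot_id;
    # the ID shown is a hypothetical placeholder):
    # stream_scan_strategy = "FROM_SNAPSHOT_ID"
    # start_snapshot_id = 1111111111111111111
  }
}
```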

Example

simple

source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"
  }
}

Or

source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hive"
    uri = "thrift://localhost:9083"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"
  }
}

schema projection

source {
  Iceberg {
    catalog_name = "seatunnel"
    catalog_type = "hadoop"
    warehouse = "hdfs://your_cluster/tmp/seatunnel/iceberg/"
    namespace = "your_iceberg_database"
    table = "your_iceberg_table"

    fields {
      f2 = "boolean"
      f1 = "bigint"
      f3 = "int"
      f4 = "bigint"
    }
  }
}