Version: 2.3.13

Hbase

Hbase source connector

Description

Reads data from Apache HBase.

Key features

Options

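| Name                | Type    | Required | Default |
|---------------------|---------|----------|---------|
| zookeeper_quorum    | string  | Yes      | -       |
| table               | string  | Yes      | -       |
| schema              | config  | Yes      | -       |
| hbase_extra_config  | config  | No       | -       |
| caching             | int     | No       | -1      |
| batch               | int     | No       | -1      |
| cache_blocks        | boolean | No       | false   |
| is_binary_rowkey    | boolean | No       | false   |
| start_rowkey        | string  | No       | -       |
| end_rowkey          | string  | No       | -       |
| start_row_inclusive | boolean | No       | true    |
| end_row_inclusive   | boolean | No       | false   |
| start_timestamp     | long    | No       | -       |
| end_timestamp       | long    | No       | -       |
| common-options      |         | No       | -       |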

zookeeper_quorum [string]

The ZooKeeper quorum hosts of the HBase cluster, for example: "hadoop001:2181,hadoop002:2181,hadoop003:2181".

table [string]

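The name of the table to read, for example: "seatunnel". If the table lives in a custom namespace, use the namespace:table form (for example ns1:seatunnel_test). When no namespace is given, SeaTunnel uses HBase's default namespace, default.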
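As a minimal sketch of the namespace:table form described above, the fragment below reads a table from a custom namespace. The namespace ns1 and table seatunnel_test come from the text; the ZooKeeper hosts and the single-column schema are placeholders chosen only to keep the block complete.

```hocon
source {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    # namespace:table form: "ns1" is the namespace, "seatunnel_test" the table.
    # Writing just "seatunnel_test" would resolve to the "default" namespace.
    table = "ns1:seatunnel_test"
    schema = {
      columns = [
        { name = "rowkey", type = string }
      ]
    }
  }
}
```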

schema [config]

HBase stores data as byte arrays, so you need to configure a data type for every column in the table. For more information, see: guide

hbase_extra_config [config]

Extra configuration for HBase.
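A hedged sketch of how extra settings might be passed: the keys are written as quoted strings, as in the Kerberos example on this page. The two properties shown (hbase.rpc.timeout and hbase.client.scanner.timeout.period) are standard HBase client settings, but choosing them here and the values given are assumptions for illustration, not recommendations from the connector documentation.

```hocon
source {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    # Illustrative client-side settings; values are in milliseconds.
    hbase_extra_config = {
      "hbase.rpc.timeout" = "60000"
      "hbase.client.scanner.timeout.period" = "120000"
    }
    schema = {
      columns = [
        { name = "rowkey", type = string }
      ]
    }
  }
}
```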

caching

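The caching parameter sets the number of rows fetched from the server in one round trip during a scan. Fetching more rows per request reduces the number of round trips between client and server and improves scan efficiency. Default: -1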

batch

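The batch parameter sets the maximum number of columns returned per call during a scan. It is especially useful for rows with many columns: it avoids returning too much data at once, which saves memory and improves performance.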

cache_blocks

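The cache_blocks parameter controls whether data blocks are cached during a scan. By default, HBase caches data blocks in the block cache while scanning. When set to false, data blocks are not cached during the scan, which reduces memory usage. In SeaTunnel the default is false.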
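The three scan-tuning options above (caching, batch, cache_blocks) can be set together. The sketch below assumes a wide table scanned once in bulk. The values 1000 and 100 are the ones used in the example on this page and are starting points to tune, not measured optima.

```hocon
source {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    caching = 1000        # rows fetched per round trip to the server
    batch = 100           # at most 100 columns returned per call
    cache_blocks = false  # do not fill the block cache during the scan
    schema = {
      columns = [
        { name = "rowkey", type = string },
        { name = "columnFamily1:column1", type = boolean }
      ]
    }
  }
}
```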

is_binary_rowkey

An HBase row key can be either a text string or binary data. In SeaTunnel, row keys are treated as text strings by default (that is, is_binary_rowkey defaults to false).

start_rowkey

The row key at which the scan starts.

end_rowkey

The row key at which the scan ends.

start_row_inclusive

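Sets whether the scan range includes the start row. When set to true, the scan results include the start row. Default: true (inclusive).

Note: In most cases, keep the default (true). Change this parameter only when you specifically need to exclude the start row.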

end_row_inclusive

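Sets whether the scan range includes the end row. When set to false, the scan results exclude the end row, following the half-open interval convention [start, end). Default: false (exclusive).

Note: In most cases, keep the default (false), which follows HBase's standard half-open interval convention. Change this parameter only when you need the end row in the scan results.

Important: When reading in parallel with multiple splits, the combination of these two parameters is critical to data integrity:

  • Default configuration (start_row_inclusive=true, end_row_inclusive=false): This is the recommended configuration. It ensures that no data is lost or duplicated across splits. Each split follows the half-open interval convention [start, end).
  • Both set to false (start_row_inclusive=false, end_row_inclusive=false): This may cause data loss, because boundary rows are excluded by every split.
  • Both set to true (start_row_inclusive=true, end_row_inclusive=true): This may cause duplicate data, because boundary rows are included by multiple adjacent splits.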
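The recommended combination can be sketched as the following configuration, which scans the row key range [B, C). The two inclusive flags are written out only for clarity, since they equal the documented defaults. The row keys "B" and "C" are taken from the example on this page, and the schema is a shortened placeholder.

```hocon
source {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    # Half-open range [B, C): row "B" is read, row "C" is not.
    start_rowkey = "B"
    end_rowkey = "C"
    start_row_inclusive = true   # default
    end_row_inclusive = false    # default
    schema = {
      columns = [
        { name = "rowkey", type = string },
        { name = "columnFamily1:column1", type = boolean }
      ]
    }
  }
}
```

To read row "C" as well, set end_row_inclusive = true, keeping in mind the duplication risk with multiple splits described above.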

start_timestamp

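The start timestamp of the time-range scan (inclusive), in milliseconds since the epoch. The time range follows the half-open convention [start, end). If only start_timestamp is set, the upper bound is treated as unbounded.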

end_timestamp

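The end timestamp of the time-range scan (exclusive), in milliseconds since the epoch. The time range follows the half-open convention [start, end). If only end_timestamp is set, the lower bound is treated as unbounded.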

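Notes:

  • start_timestamp / end_timestamp must be greater than or equal to 0. If both are configured, start_timestamp < end_timestamp must hold (under the [start, end) convention, start_timestamp == end_timestamp would result in an empty scan).
  • When start_rowkey / end_rowkey and start_timestamp / end_timestamp are configured together, both the row key range and the time range restrictions apply, and the result is the intersection of the two.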
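The time-range rules above can be illustrated with a one-sided range. This sketch sets only start_timestamp, so the scan has no upper time bound. The timestamp value is the one used in the example on this page, and the schema is a shortened placeholder.

```hocon
source {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    # Time range [1700000000000, +infinity), in epoch milliseconds.
    start_timestamp = 1700000000000
    # Adding end_timestamp = 1700003600000 would narrow this to a
    # one-hour window (3,600,000 ms); the end value itself is excluded.
    schema = {
      columns = [
        { name = "rowkey", type = string },
        { name = "columnFamily1:column1", type = boolean }
      ]
    }
  }
}
```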

Common options

Common parameters of source plugins. For details, see Source Common Options.

Example

source {
  Hbase {
    zookeeper_quorum = "hadoop001:2181,hadoop002:2181,hadoop003:2181"
    table = "seatunnel_test"
    caching = 1000
    batch = 100
    cache_blocks = false
    is_binary_rowkey = false
    start_rowkey = "B"
    end_rowkey = "C"
    start_timestamp = 1700000000000
    end_timestamp = 1700003600000
    schema = {
      columns = [
        {
          name = "rowkey"
          type = string
        },
        {
          name = "columnFamily1:column1"
          type = boolean
        },
        {
          name = "columnFamily1:column2"
          type = double
        },
        {
          name = "columnFamily2:column1"
          type = bigint
        }
      ]
    }
  }
}

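Kerberos example

Notes:

  • connector-hbase does not parse krb5_path / kerberos_principal / kerberos_keytab_path.
  • Complete the Kerberos login in the runtime environment beforehand and make sure krb5.conf is accessible to the JVM (for example via kinit -kt ... or the JVM option -Djava.security.krb5.conf=...), and put the HBase/Hadoop security settings into hbase_extra_config.
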
source {
  Hbase {
    zookeeper_quorum = "zk1:2181,zk2:2181,zk3:2181"
    table = "source_table"
    caching = 1000
    batch = 200
    cache_blocks = false
    is_binary_rowkey = false

    # HBase security configuration
    hbase_extra_config = {
      "hbase.security.authentication" = "kerberos"
      "hadoop.security.authentication" = "kerberos"
      "hbase.master.kerberos.principal" = "hbase/_HOST@REALM"
      "hbase.regionserver.kerberos.principal" = "hbase/_HOST@REALM"
      "hbase.rpc.protection" = "authentication"
      "hbase.zookeeper.useSasl" = "false"
    }

    schema = {
      columns = [
        { name = "rowkey", type = string },
        { name = "info:name", type = string },
        { name = "info:score", type = string }
      ]
    }
  }
}

Changelog
| Change | Commit | Version |
|--------|--------|---------|
| [Fix][connector-hbase] Fix namespace handling for HBase source (#10295) | https://github.com/apache/seatunnel/commit/d722474bc3 | 2.3.13 |
| [Feature][Connector-V2][HBase] Support time-range scan with min/max timestamp in HBaseSource (#10318) | https://github.com/apache/seatunnel/commit/402291d359 | 2.3.13 |
| [Fix][Connector-V2][Hbase] Fix ERROR_WHEN_DATA_EXISTS NPE on empty table (#10336) | https://github.com/apache/seatunnel/commit/9d58bc01ac | 2.3.13 |
| [Improve][Connector-V2][HBase] Support DATE/TIME/TIMESTAMP/DECIMAL in sink and fix DECIMAL deserialization (#10291) | https://github.com/apache/seatunnel/commit/2cc680fe65 | 2.3.13 |
| [Fix[Connector-V2][Hbase] Avoid duplicate split assignment on restore (#10310) | https://github.com/apache/seatunnel/commit/75bc71beb8 | 2.3.13 |
| [Fix][Connector-V2][Hbase] Fix HBase sink binary rowkey handling (#10300) | https://github.com/apache/seatunnel/commit/84b039d4fa | 2.3.13 |
| [Fix][Connector-V2][Hbase] Fix source reader only scanning first split (#10287) | https://github.com/apache/seatunnel/commit/d393d2a82f | 2.3.13 |
| [Fix][Connector-V2][HBase] Ensure fully qualified table name is used in tableExists method and add unit tests (#10126) | https://github.com/apache/seatunnel/commit/53c50f3944 | 2.3.13 |
| [Improve][Connector-V2][HBase] Support configurable range scan boundary inclusion policies (#10011) | https://github.com/apache/seatunnel/commit/40bf6560f5 | 2.3.13 |
| [Feature][Connector-V2] Support row range boundaries for HBaseSource (#9983) | https://github.com/apache/seatunnel/commit/d7b8f37b41 | 2.3.13 |
| [Fix][Core] Add shade module for apache commons lang3 (#9895) | https://github.com/apache/seatunnel/commit/abb9124b05 | 2.3.13 |
| [Feature][Checkpoint] Add check script for source/sink state class serialVersionUID missing (#9118) | https://github.com/apache/seatunnel/commit/4f5adeb1c7 | 2.3.11 |
| [Improve] hbase options (#8923) | https://github.com/apache/seatunnel/commit/b6a702b58f | 2.3.10 |
| [Improve] restruct connector common options (#8634) | https://github.com/apache/seatunnel/commit/f3499a6eeb | 2.3.10 |
| [Improve][dist]add shade check rule (#8136) | https://github.com/apache/seatunnel/commit/51ef800016 | 2.3.9 |
| [Feature][Restapi] Allow metrics information to be associated to logical plan nodes (#7786) | https://github.com/apache/seatunnel/commit/6b7c53d03c | 2.3.9 |
| [Fix][Connector-V2] Fix known directory create and delete ignore issues (#7700) | https://github.com/apache/seatunnel/commit/e2fb679577 | 2.3.8 |
| [Feature][Connector-V2][Hbase] implement hbase catalog (#7516) | https://github.com/apache/seatunnel/commit/b978792cb1 | 2.3.8 |
| [Feature][Connector-V2] Support multi-table sink feature for HBase (#7169) | https://github.com/apache/seatunnel/commit/025fa3bb88 | 2.3.8 |
| [hotfix][connector-v2-hbase]fix and optimize hbase source problem (#7148) | https://github.com/apache/seatunnel/commit/34a6b8e9f6 | 2.3.7 |
| [Improve][hbase] The specified column is written to the specified column family (#5234) | https://github.com/apache/seatunnel/commit/49d397c61d | 2.3.6 |
| [feature][connector-v2-hbase-sink] Support Connector v2 HBase sink TTL data writing (#7116) | https://github.com/apache/seatunnel/commit/adafd80255 | 2.3.6 |
| [E2E][HBase]Refactor hbase e2e (#6859) | https://github.com/apache/seatunnel/commit/1da9bd6ce4 | 2.3.6 |
| [Connector]Add hbase source connector (#6348) | https://github.com/apache/seatunnel/commit/f108a5e658 | 2.3.6 |
| [Feature][HbaseSink]support array data. (#6100) | https://github.com/apache/seatunnel/commit/b592014766 | 2.3.4 |
| [Improve][Common] Introduce new error define rule (#5793) | https://github.com/apache/seatunnel/commit/9d1b2582b2 | 2.3.4 |
| [Improve] Remove use SeaTunnelSink::getConsumedType method and mark it as deprecated (#5755) | https://github.com/apache/seatunnel/commit/8de7408100 | 2.3.4 |
| [Hotfix][Connector-v2][HbaseSink]Fix default timestamp (#4958) | https://github.com/apache/seatunnel/commit/3d8f3bf902 | 2.3.3 |
| [Improve][build] Give the maven module a human readable name (#4114) | https://github.com/apache/seatunnel/commit/d7cd601051 | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. (#4101) | https://github.com/apache/seatunnel/commit/a2ab166561 | 2.3.1 |
| [Feature][Connector-V2][Hbase] Introduce hbase sink connector (#4049) | https://github.com/apache/seatunnel/commit/68bda94a4c | 2.3.1 |