Apache Pulsar
Apache Pulsar source connector
Description
Source connector for Apache Pulsar.
Key features
Options
| Name | Type | Required | Default Value | Description |
|---|---|---|---|---|
| topic | String | No | - | Topic name(s) to read. Supports comma-separated list. Note: only one of topic, topic-pattern, tables_configs |
| topic-pattern | String | No | - | Regular expression for topic names. Note: only one of topic, topic-pattern, tables_configs |
| table_path | String | No | - | Logical table identifier for multi-table mode |
| tables_configs | Array | No | - | Multi-table configuration. Each item can override global defaults. Note: only one of topic, topic-pattern, tables_configs |
| topic-discovery.interval | Long | No | -1 | Interval (ms) to discover new partitions. Non-positive disables discovery. Only works with topic-pattern |
| subscription.name | String | No | - | Consumer subscription name. Can be defined globally or per item in multi-table mode |
| client.service-url | String | Yes | - | Pulsar client service URL, e.g., pulsar://localhost:6650 |
| admin.service-url | String | Yes | - | Pulsar admin HTTP URL, e.g., http://localhost:8080 |
| auth.plugin-class | String | No | - | Pulsar client authentication plugin class name |
| auth.params | String | No | - | Pulsar client authentication parameters |
| poll.timeout | Integer | No | 100 | Timeout (ms) for polling messages from Pulsar |
| poll.interval | Long | No | 50 | Interval (ms) between two polls |
| poll.batch.size | Integer | No | 500 | Maximum number of messages to poll in a single batch |
| cursor.startup.mode | Enum | No | LATEST | Startup position mode. Options: EARLIEST, LATEST, SUBSCRIPTION, TIMESTAMP |
| cursor.startup.timestamp | Long | No | - | Start timestamp (ms) when cursor.startup.mode=TIMESTAMP |
| cursor.reset.mode | Enum | No | LATEST | Reset mode when cursor.startup.mode=SUBSCRIPTION. Options: EARLIEST, LATEST |
| cursor.stop.mode | Enum | No | NEVER | Stop position mode. Options: NEVER (streaming), LATEST (batch), TIMESTAMP (batch) |
| cursor.stop.timestamp | Long | No | - | Stop timestamp (ms) when cursor.stop.mode=TIMESTAMP |
| schema | Config | No | - | Data structure including field names and types |
| format | String | No | json | Data format. Default is json. Multi-table mode only supports JSON and CANAL_JSON |
| common-options | No | - | Source plugin common parameters. See Source Common Options for details |
topic [String]
Topic name(s) to read data from when the table is used as source. It also supports topic lists by separating topics with commas like 'topic-1,topic-2'.
Note, only one of topic, topic-pattern and tables_configs can be specified for sources.
topic-pattern [String]
The regular expression for a pattern of topic names to read from. All topics with names that match the specified regular expression will be subscribed by the consumer when the job starts running.
Note, only one of topic, topic-pattern and tables_configs can be specified for sources.
table_path [String]
Logical table identifier for one tables_configs item. This option is mainly used in multi-table mode.
tables_configs [Array]
Multi-table source configuration. Each item can override global defaults such as format, cursor options and subscription.name.
Each item must configure exactly one of:
topictopic-pattern
Additional rules:
table_pathis required whentopic-patternis used.subscription.namemust exist either globally or inside the item.- Only
JSONandCANAL_JSONare supported in multi-table mode. - Explicit
topicentries must not overlap with anytopic-patternentry. - If multiple
topic-patternitems can match the same topic, the first matching item intables_configswins. Put more specific patterns before broader ones. - In batch mode, multi-table configurations must be bounded. If more than one table is configured and any table uses
cursor.stop.mode = NEVER, the source is unbounded and batch jobs are rejected. Single-table mode and single-entrytables_configskeep backward-compatible batch behavior.
topic-discovery.interval [Long]
The interval (in ms) for the Pulsar source to discover the new topic partitions. A non-positive value disables the topic partition discovery.
Note, This option only works if the 'topic-pattern' option is used.
subscription.name [String]
Specify the subscription name for this consumer. This is required for each effective table configuration, but in multi-table mode it can be defined globally or overridden per tables_configs item.
client.service-url [String]
Service URL provider for Pulsar service. To connect to Pulsar using client libraries, you need to specify a Pulsar protocol URL. You can assign Pulsar protocol URLs to specific clusters and use the Pulsar scheme.
For example, localhost: pulsar://localhost:6650,localhost:6651.
admin.service-url [String]
The Pulsar service HTTP URL for the admin endpoint.
For example, http://my-broker.example.com:8080, or https://my-broker.example.com:8443 for TLS.
auth.plugin-class [String]
Name of the authentication plugin.
auth.params [String]
Parameters for the authentication plugin.
For example, key1:val1,key2:val2
poll.timeout [Integer]
The maximum time (in ms) to wait when fetching records. A longer time increases throughput but also latency.
poll.interval [Long]
The interval time(in ms) when fetcing records. A shorter time increases throughput, but also increases CPU load.
poll.batch.size [Integer]
The maximum number of records to fetch to wait when polling. A longer time increases throughput but also latency.
cursor.startup.mode [Enum]
Startup mode for Pulsar consumer, valid values are 'EARLIEST', 'LATEST', 'SUBSCRIPTION', 'TIMESTAMP'.
cursor.startup.timestamp [Long]
Start from the specified epoch timestamp (in milliseconds).
Note, This option is required when the "cursor.startup.mode" option used 'TIMESTAMP'.
cursor.reset.mode [Enum]
Cursor reset strategy for Pulsar consumer valid values are 'EARLIEST', 'LATEST'.
Note, This option only works if the "cursor.startup.mode" option used 'SUBSCRIPTION'.
cursor.stop.mode [String]
Stop mode for Pulsar consumer, valid values are 'NEVER', 'LATEST'and 'TIMESTAMP'.
Note, When 'NEVER' is specified, it is a real-time job, and other mode are off-line jobs.
cursor.stop.timestamp [Long]
Stop from the specified epoch timestamp (in milliseconds).
Note, This option is required when the "cursor.stop.mode" option used 'TIMESTAMP'.
schema [Config]
The structure of the data, including field names and field types. reference to Schema-Feature
format [String]
Data format. The default format is json, reference formats.
common options
Source plugin common parameters, please refer to Source Common Options for details.
Example
source {
Pulsar {
topic = "example"
subscription.name = "seatunnel"
client.service-url = "pulsar://localhost:6650"
admin.service-url = "http://my-broker.example.com:8080"
plugin_output = "test"
}
}
Multi-table Example
source {
Pulsar {
subscription.name = "seatunnel-sub"
client.service-url = "pulsar://localhost:6650"
admin.service-url = "http://localhost:8080"
cursor.startup.mode = "EARLIEST"
cursor.stop.mode = "NEVER"
format = "json"
tables_configs = [
{
table_path = "default.orders"
topic = "persistent://public/default/orders"
schema = {
fields {
order_id = "bigint"
user_id = "int"
}
}
},
{
table_path = "default.users"
topic-pattern = "persistent://public/default/users-.*"
subscription.name = "users-sub"
format = "canal_json"
schema = {
fields {
user_id = "int"
name = "string"
}
}
}
]
}
}
In batch mode, replace cursor.stop.mode = "NEVER" with a bounded mode such as LATEST or TIMESTAMP.
Changelog
Change Log
| Change | Commit | Version |
|---|---|---|
| [Improve][API] Optimize the enumerator API semantics and reduce lock calls at the connector level (#9671) | https://github.com/apache/seatunnel/commit/9212a77140 | 2.3.12 |
| [improve] pulsar options (#9180) | https://github.com/apache/seatunnel/commit/26a2160c80 | 2.3.12 |
| [Feature][Checkpoint] Add check script for source/sink state class serialVersionUID missing (#9118) | https://github.com/apache/seatunnel/commit/4f5adeb1c7 | 2.3.11 |
| [Improve] restruct connector common options (#8634) | https://github.com/apache/seatunnel/commit/f3499a6eeb | 2.3.10 |
| [Improve][dist]add shade check rule (#8136) | https://github.com/apache/seatunnel/commit/51ef800016 | 2.3.9 |
| [Feature][Restapi] Allow metrics information to be associated to logical plan nodes (#7786) | https://github.com/apache/seatunnel/commit/6b7c53d03c | 2.3.9 |
| [Improve][API] Make sure the table name in TablePath not be null (#7252) | https://github.com/apache/seatunnel/commit/764d8b0bc8 | 2.3.7 |
| [Feature][Kafka] Support multi-table source read (#5992) | https://github.com/apache/seatunnel/commit/60104602d1 | 2.3.6 |
| [PulsarSource]Improve pulsar throughput performance. (#6234) | https://github.com/apache/seatunnel/commit/37461f4f3e | 2.3.4 |
| [Feature][Connector-v2][PulsarSink]Add Pulsar Sink Connector. (#4382) | https://github.com/apache/seatunnel/commit/543d2c5086 | 2.3.4 |
| [Chore] Remove useless DeserializationFormatFactory and its implement (#5880) | https://github.com/apache/seatunnel/commit/f0511544ff | 2.3.4 |
| fix: update IDENTIFIER = Pulsar for pulsar-datasource on project:seatunnel-web (#5852) | https://github.com/apache/seatunnel/commit/3b6de3743e | 2.3.4 |
| [Improve][Common] Introduce new error define rule (#5793) | https://github.com/apache/seatunnel/commit/9d1b2582b2 | 2.3.4 |
| Support config column/primaryKey/constraintKey in schema (#5564) | https://github.com/apache/seatunnel/commit/eac76b4e50 | 2.3.4 |
| [Improve][CheckStyle] Remove useless 'SuppressWarnings' annotation of checkstyle. (#5260) | https://github.com/apache/seatunnel/commit/51c0d709ba | 2.3.4 |
| [Hotfix] Fix com.google.common.base.Preconditions to seatunnel shade one (#5284) | https://github.com/apache/seatunnel/commit/ed5eadcf73 | 2.3.3 |
| [Feature][Json-format] support read format for pulsar (#4111) | https://github.com/apache/seatunnel/commit/7d61ae93e7 | 2.3.2 |
| [hotfix][pulsar] Fix the bug that can't consume messages all the time. (#4125) | https://github.com/apache/seatunnel/commit/a6705cc5bf | 2.3.2 |
| [Feature] add cdc multiple table support & fix zeta bug | https://github.com/apache/seatunnel/commit/533ff2c2fa | 2.3.1 |
| [hotfix][pulsar] PulsarSource consumer ack exception. (#4237) | https://github.com/apache/seatunnel/commit/9725d675da | 2.3.1 |
| Merge branch 'dev' into merge/cdc | https://github.com/apache/seatunnel/commit/4324ee1912 | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. | https://github.com/apache/seatunnel/commit/423b583038 | 2.3.1 |
| [Improve][Connector-v2][Pulsar] Set the name of the pulsar consumption thread. (#4182) | https://github.com/apache/seatunnel/commit/e567203f7d | 2.3.1 |
| [improve][api] Refactoring schema parse (#4157) | https://github.com/apache/seatunnel/commit/b2f573a13e | 2.3.1 |
| [Improve][build] Give the maven module a human readable name (#4114) | https://github.com/apache/seatunnel/commit/d7cd601051 | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. (#4101) | https://github.com/apache/seatunnel/commit/a2ab166561 | 2.3.1 |
| [Bug][Connector-v2][PulsarSource]Fix pulsar option topic-pattern bug. (#3989) | https://github.com/apache/seatunnel/commit/aee2c580ea | 2.3.1 |
| [Feature][Connector] add get source method to all source connector (#3846) | https://github.com/apache/seatunnel/commit/417178fb84 | 2.3.1 |
| [Feature][API & Connector & Doc] add parallelism and column projection interface (#3829) | https://github.com/apache/seatunnel/commit/b9164b8ba1 | 2.3.1 |
| [Improve][Connector-V2][Pulsar] Unified exception for Pulsar source &… (#3590) | https://github.com/apache/seatunnel/commit/4fe9323419 | 2.3.0 |
| [Hotfix][OptionRule] Fix option rule about all connectors (#3592) | https://github.com/apache/seatunnel/commit/226dc6a119 | 2.3.0 |
| [Hotfix][Connector-V2][Pulsar] fix conditional options (#3504) | https://github.com/apache/seatunnel/commit/0066affacf | 2.3.0 |
| [Feature][Connector][pulsar] expose configurable options in Pulsar (#3341) | https://github.com/apache/seatunnel/commit/200faa7c29 | 2.3.0 |
| [Connector][Dependency] Add Miss Dependency Cassandra And Change Kudu Plugin Name (#3432) | https://github.com/apache/seatunnel/commit/6ac6a0a0cd | 2.3.0 |
| [chore] fix pulsar consumer comment error (#3356) | https://github.com/apache/seatunnel/commit/91e632c526 | 2.3.0 |
| [Connector-V2][ElasticSearch] Add ElasticSearch Source/Sink Factory (#3325) | https://github.com/apache/seatunnel/commit/38254e3f26 | 2.3.0 |
| [hotfix][connector][pulsar] Fix not being able to mark #noMoreNewSplits when restoring (#2945) | https://github.com/apache/seatunnel/commit/5ad69076b3 | 2.3.0-beta |
| Move Handover to common module (#2877) | https://github.com/apache/seatunnel/commit/d94a874bcb | 2.3.0-beta |
| [hotfix][connector-v2] fix pulsar source exceptions (#2820) | https://github.com/apache/seatunnel/commit/8ff0ba7015 | 2.2.0-beta |
| [#2606]Dependency management split (#2630) | https://github.com/apache/seatunnel/commit/fc047be69b | 2.2.0-beta |
| [SeaTunnel]Simply seatunnel package pipeline. (#2563) | https://github.com/apache/seatunnel/commit/9d88b6221a | 2.2.0-beta |
| [Improve][Connector-V2] Pulsar support user-defined schema (#2436) | https://github.com/apache/seatunnel/commit/16cabe6a35 | 2.2.0-beta |
| [improve][UT] Upgrade junit to 5.+ (#2305) | https://github.com/apache/seatunnel/commit/362319ff3e | 2.2.0-beta |
StateT of SeaTunnelSource should extend Serializable (#2214) | https://github.com/apache/seatunnel/commit/8c426ef850 | 2.2.0-beta |
| [doc][connector-v2] pulsar source options doc (#2128) | https://github.com/apache/seatunnel/commit/59ce8a2b32 | 2.2.0-beta |
| [api-draft][Optimize] Optimize module name (#2062) | https://github.com/apache/seatunnel/commit/f79e3112b1 | 2.2.0-beta |