Transform Plugin System
Why This Page Exists
SeaTunnel already has a generated transform catalog and a page for transform common options. What is still missing is a system-level explanation of how transforms fit into the pipeline, what contracts they share, and how contributors should think about them.
This page fills that gap.
Where Transforms Sit In A Job
Transforms sit between source and sink and operate on SeaTunnel's own row and table model:
Source -> Transform Chain -> Sink
In practice, the transform block is optional, but it becomes the main place to express pipeline logic when:
- source fields do not match sink fields directly
- rows must be filtered, enriched, or reshaped
- CDC metadata needs to be converted into a downstream-friendly form
- one job needs to route or reshape multiple logical tables
SeaTunnel uses plugin_output to register an intermediate dataset and plugin_input to consume one or more previously produced datasets. This lets transforms form a logical graph instead of a single rigid linear chain.
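As an illustration, the wiring might look like the following job fragment. The connector and transform names (FakeSource, Sql, Console), the dataset names, and the query are placeholders chosen for this sketch, not a prescribed pipeline:

```hocon
# Illustrative only: "users_raw" and "users_clean" are intermediate dataset names.
source {
  FakeSource {
    plugin_output = "users_raw"
  }
}

transform {
  Sql {
    plugin_input  = "users_raw"
    plugin_output = "users_clean"
    query = "SELECT id, UPPER(name) AS name FROM users_raw"
  }
}

sink {
  Console {
    plugin_input = "users_clean"
  }
}
```

Because each transform names its inputs and its output, several transforms can consume the same upstream dataset, which is what allows the graph shape rather than a single chain.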
What Transforms Are Responsible For
At a system level, transforms do more than field-level mapping. They are responsible for:
- reshaping rows without binding the job to an engine-specific record type
- preserving or updating schema information when columns are added, removed, or renamed
- exposing metadata such as row kind or event time as normal fields for downstream logic
- routing, merging, or filtering logical tables in multi-table jobs
- keeping job logic declarative so the same pipeline can run on different engines
This is why the transform layer matters in both batch pipelines and CDC pipelines.
Core Contracts
The transform system is built around a small set of contracts:
- SeaTunnelTransform: the base runtime contract
- SeaTunnelMapTransform: one-input to one-output row transformation
- SeaTunnelFlatMapTransform: one-input to zero-or-more output rows
- TableTransform: wrapper that creates a runtime transform instance
- TableTransformFactory: SPI entry point used for discovery and creation
- TableTransformFactoryContext: factory context carrying ReadonlyConfig, the class loader, and upstream CatalogTable metadata
This contract split matters because SeaTunnel wants transform plugins to stay:
- declarative from the user's point of view
- engine-independent from the contributor's point of view
- metadata-aware from the planner's point of view
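A minimal sketch of the map versus flat-map split, using simplified stand-in interfaces rather than the real seatunnel-api types (which also carry schema access and lifecycle methods):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in for the SeaTunnelMapTransform shape: one input row maps to exactly one output row.
interface SimpleMapTransform<T> {
    T map(T row);
}

// Stand-in for the SeaTunnelFlatMapTransform shape: one input row yields zero or more output rows.
interface SimpleFlatMapTransform<T> {
    List<T> flatMap(T row);
}

public class ContractSketch {
    public static void main(String[] args) {
        // A projection-style transform: 1:1, only reshapes the row.
        SimpleMapTransform<String[]> upper =
            row -> Arrays.stream(row).map(String::toUpperCase).toArray(String[]::new);

        // A filter-style transform: may emit nothing for a given input row.
        SimpleFlatMapTransform<String[]> dropEmpty =
            row -> row.length == 0 ? Collections.emptyList()
                                   : Collections.singletonList(row);

        System.out.println(Arrays.toString(upper.map(new String[]{"a", "b"}))); // prints [A, B]
        System.out.println(dropEmpty.flatMap(new String[0]).size());            // prints 0
    }
}
```

The split matters because an engine adapter can treat a 1:1 transform and a 0..n transform differently without inspecting the plugin's internals.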
How A Transform Is Prepared And Executed
At a high level, transform preparation works like this:
- the job config defines a transform block and its options
- SeaTunnel discovers the matching TableTransformFactory through the factory and SPI mechanism
- options are validated before the runtime transform is created
- upstream CatalogTable metadata is passed into the transform factory context
- the runtime transform is inserted into the logical pipeline and later adapted to the chosen engine
The key design choice is that the transform plugin works on SeaTunnel contracts first. Translation to Flink, Spark, or native Zeta execution happens later.
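The discover-then-create step can be sketched with plain java.util.ServiceLoader, which is the standard SPI mechanism underlying this kind of factory lookup. The factory interface here is a simplified stand-in, not SeaTunnel's TableTransformFactory:

```java
import java.util.Optional;
import java.util.ServiceLoader;
import java.util.function.UnaryOperator;

// Simplified stand-in: a real factory would also validate options and
// receive upstream table metadata through a factory context.
interface SimpleTransformFactory {
    String factoryIdentifier();                // matched against the transform block name
    UnaryOperator<String[]> createTransform(); // the runtime transform instance
}

public class DiscoverySketch {
    static Optional<SimpleTransformFactory> discover(String name) {
        // ServiceLoader scans META-INF/services registrations on the classpath.
        for (SimpleTransformFactory factory : ServiceLoader.load(SimpleTransformFactory.class)) {
            if (factory.factoryIdentifier().equals(name)) {
                return Optional.of(factory);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // No provider is registered in this sketch, so nothing is found.
        System.out.println(discover("Sql").isPresent()); // prints false
    }
}
```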
Common Transform Categories
The current transform ecosystem is broad, but most plugins fall into a few categories:
Row Projection And Mapping
These plugins are used when the main task is to align source fields with downstream schema expectations.
Filtering And Routing
These plugins decide which records or tables continue through the pipeline.
SQL And Expression-Oriented Processing
These plugins are useful when the transformation logic is easier to express declaratively than with custom code.
Metadata And CDC Adaptation
These plugins are especially important in CDC pipelines because they help preserve or reshape change semantics for downstream systems.
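One common CDC adaptation is exposing the change kind as an ordinary column so that a plain downstream sink can see it. The sketch below is hedged: the Kind enum only mirrors the usual insert/update/delete row kinds and is not SeaTunnel's RowKind type:

```java
import java.util.Arrays;

public class RowKindSketch {
    // Illustrative change kinds; real CDC row kinds also distinguish update-before.
    enum Kind { INSERT, UPDATE_AFTER, DELETE }

    // Append the change kind as a trailing column on the row.
    static String[] withKindColumn(String[] row, Kind kind) {
        String[] out = Arrays.copyOf(row, row.length + 1);
        out[row.length] = kind.name();
        return out;
    }

    public static void main(String[] args) {
        String[] row = {"42", "alice"};
        System.out.println(Arrays.toString(withKindColumn(row, Kind.DELETE)));
        // prints [42, alice, DELETE]
    }
}
```

Note that a transform doing this must also extend the declared schema by one column, which is exactly why schema awareness is part of the transform contract.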
Programmable Or AI-Oriented Processing
These plugins are used when row processing needs external models, richer computation, or custom business logic.
Design Guidelines For Contributors
When adding or reviewing a transform plugin, check these points first:
- keep the transform contract engine-independent
- define options through stable Option and OptionRule contracts
- make schema changes explicit instead of leaving downstream ambiguity
- handle multi-table inputs and outputs deliberately when the plugin can be used in that mode
- avoid leaking source-specific or sink-specific responsibilities into the transform layer
In general, transforms should own row and schema shaping logic, not external commit semantics or engine runtime behavior.
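To make the option-validation guideline concrete, here is a simplified stand-in for the Option and OptionRule idea. It mirrors the declare-then-validate pattern, not SeaTunnel's actual builder API:

```java
import java.util.List;
import java.util.Map;

public class OptionRuleSketch {
    // Stand-in for an Option declaration: a key plus whether it is required.
    record Opt(String key, boolean required) {}

    // Stand-in for OptionRule checking: fail fast before any transform is built.
    static void validate(Map<String, String> config, List<Opt> rule) {
        for (Opt option : rule) {
            if (option.required() && !config.containsKey(option.key())) {
                throw new IllegalArgumentException("missing required option: " + option.key());
            }
        }
    }

    public static void main(String[] args) {
        List<Opt> rule = List.of(new Opt("query", true), new Opt("plugin_output", false));
        validate(Map.of("query", "SELECT 1"), rule); // passes: required option present
        System.out.println("config valid");
    }
}
```

Declaring options once and validating them before instantiation is what lets the planner reject a bad job config without ever constructing a runtime transform.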
Common Misunderstandings
"Transforms are only optional decoration"
Not really. In many jobs the transform layer is where the actual business mapping, schema alignment, and CDC adaptation happens.
"Transform logic is always row-only"
Also not true. Many transforms need to preserve or reshape schema and metadata, especially in multi-table and change-event scenarios.
"If a transform works on one engine, portability is automatic"
Portability is a design goal, not a free side effect. Contributors still need to avoid engine-specific assumptions and follow SeaTunnel's API contracts.
Recommended Reading Path
- this page for the system view
- Transform Common Options
- Core API Design
- CDC Pipeline Architecture
- Plugin Discovery and Class Loading
- Transforms Catalog