Data Format Handling
Why This Page Exists
SeaTunnel already documents many concrete formats such as Avro, Protobuf, Debezium JSON, Canal JSON, and Maxwell JSON. What is still missing is the system-level explanation of how formats fit into SeaTunnel's pipeline model.
This page explains that role.
The Core Idea
SeaTunnel separates three concerns that are easy to mix up:
- external storage or transport system
- connector runtime
- record serialization format
For example, Kafka is the transport system, Kafka connector is the runtime bridge, and Avro or Debezium JSON is the format.
This separation matters because the same connector may support multiple formats, and the same format may appear in different connectors.
Format Position in the Pipeline
A format usually sits at the boundary between SeaTunnel's internal row model and an external byte or message representation.
Typical path:
External bytes / messages
|
v
Connector reader
|
v
Format decoder
|
v
SeaTunnelRow + schema metadata
|
v
Transforms and engine runtime
|
v
SeaTunnelRow
|
v
Format encoder
|
v
External bytes / messages
This is why format handling is not an afterthought. It directly affects schema, type conversion, CDC semantics, and compatibility with external ecosystems.
What the Internal Model Looks Like
Inside SeaTunnel, data is generally represented as:
SeaTunnelRowCatalogTableTableSchemaSeaTunnelDataType
Formats are responsible for bridging between this internal model and external representations such as:
- JSON documents
- Avro records
- Protobuf messages
- CDC envelope formats
Related docs:
Common Format Categories
Plain Row Formats
These formats mainly encode field values and structure:
- JSON
- Avro
- Protobuf
- text and CSV-style formats where supported through file connectors
Typical concerns:
- field order and naming
- nullability
- numeric and temporal type mapping
- schema presence or schema-less mode
CDC Envelope Formats
These formats carry change semantics in addition to row values:
- Debezium JSON
- CDC-compatible Debezium JSON
- Canal JSON
- Maxwell JSON
- OGG JSON
Typical concerns:
- row kind mapping
- before / after images
- source metadata
- schema change compatibility
Related docs:
Where Format Handling Usually Happens
Source Side
On the source side, format handling typically decodes external payloads into SeaTunnel's internal row model.
Examples:
- reading Avro or JSON from Kafka
- reading JSON or text-based records from file connectors
- decoding CDC envelopes from message systems
Sink Side
On the sink side, format handling usually serializes rows into the target system's expected payload shape.
Examples:
- writing Debezium-compatible messages to Kafka
- writing Avro or Protobuf to messaging systems
- writing text or JSON files
Why Format Handling Is Not Just Serialization
In SeaTunnel, format handling often also decides:
- how schema is inferred or validated
- how external field types map into
SeaTunnelDataType - whether CDC metadata is preserved
- whether the target ecosystem can consume the output without custom adapters
That means format design is tightly connected to both type system design and connector usability.
Schema and Format
Formats interact with schema in several different ways.
Schema-Driven Formats
Some formats work best when schema is explicit, either from the connector, an external registry, or a user-defined schema block.
Typical examples:
- Avro
- Protobuf
- strongly typed JSON pipelines
Schema-Lite or Schema-Less Formats
Some formats can be consumed with weaker schema assumptions, but that shifts more responsibility to connector configuration or runtime inference.
Typical examples:
- plain JSON
- text-based payloads
Related docs:
CDC Format Handling
CDC formats are especially important because they are often used to integrate SeaTunnel with existing ecosystems.
Common reasons to use a CDC envelope format:
- keep compatibility with Debezium consumers
- publish changelog events to Kafka
- preserve before / after semantics for downstream consumers
- bridge database CDC into messaging systems
When documenting or reviewing a CDC format flow, check:
- whether row kind is preserved correctly
- whether source metadata is preserved
- whether topic, table, and key conventions remain compatible downstream
Type Mapping Risk Areas
The most error-prone areas in format handling are usually:
- decimal precision and scale
- timestamp and time zone semantics
- nested row, map, and array types
- binary encoding
- null handling in schema-less JSON-style payloads
These issues often look like connector bugs, but the real boundary problem is in format conversion.
Operational Concerns
When a pipeline fails around formats, the symptoms often show up as:
- decode failures
- incompatible schema errors
- downstream consumers rejecting messages
- silent precision loss or unexpected type coercion
The first questions to ask are:
- Is the connector using the expected format?
- Does the source and sink agree on schema expectations?
- Is this a plain row format or a CDC envelope format?
- Are nested and temporal fields mapped consistently?
Recommended Reading Path
- this page for the system view
- CatalogTable and Metadata Management
- Schema Feature
- a concrete format page under
connectors/formats/ - CDC Pipeline Architecture if the format is changelog-oriented