Version: Next

Data Format Handling

Why This Page Exists

SeaTunnel already documents many concrete formats such as Avro, Protobuf, Debezium JSON, Canal JSON, and Maxwell JSON. What is still missing is the system-level explanation of how formats fit into SeaTunnel's pipeline model.

This page explains that role.

The Core Idea

SeaTunnel separates three concerns that are easy to mix up:

external storage or transport system
connector runtime
record serialization format

For example, Kafka is the transport system, Kafka connector is the runtime bridge, and Avro or Debezium JSON is the format.

This separation matters because the same connector may support multiple formats, and the same format may appear in different connectors.

Format Position in the Pipeline

A format usually sits at the boundary between SeaTunnel's internal row model and an external byte or message representation.

Typical path:

External bytes / messages
        |
        v
Connector reader
        |
        v
Format decoder
        |
        v
SeaTunnelRow + schema metadata
        |
        v
Transforms and engine runtime
        |
        v
SeaTunnelRow
        |
        v
Format encoder
        |
        v
External bytes / messages

This is why format handling is not an afterthought. It directly affects schema, type conversion, CDC semantics, and compatibility with external ecosystems.

What the Internal Model Looks Like

Inside SeaTunnel, data is generally represented as:

SeaTunnelRow
CatalogTable
TableSchema
SeaTunnelDataType

Formats are responsible for bridging between this internal model and external representations such as:

JSON documents
Avro records
Protobuf messages
CDC envelope formats

Related docs:

Common Format Categories

Plain Row Formats

These formats mainly encode field values and structure:

JSON
Avro
Protobuf
text and CSV-style formats where supported through file connectors

Typical concerns:

field order and naming
nullability
numeric and temporal type mapping
schema presence or schema-less mode

CDC Envelope Formats

These formats carry change semantics in addition to row values:

Debezium JSON
CDC-compatible Debezium JSON
Canal JSON
Maxwell JSON
OGG JSON

Typical concerns:

row kind mapping
before / after images
source metadata
schema change compatibility

Related docs:

CDC Pipeline Architecture

Where Format Handling Usually Happens

Source Side

On the source side, format handling typically decodes external payloads into SeaTunnel's internal row model.

Examples:

reading Avro or JSON from Kafka
reading JSON or text-based records from file connectors
decoding CDC envelopes from message systems

Sink Side

On the sink side, format handling usually serializes rows into the target system's expected payload shape.

Examples:

writing Debezium-compatible messages to Kafka
writing Avro or Protobuf to messaging systems
writing text or JSON files

Why Format Handling Is Not Just Serialization

In SeaTunnel, format handling often also decides:

how schema is inferred or validated
how external field types map into SeaTunnelDataType
whether CDC metadata is preserved
whether the target ecosystem can consume the output without custom adapters

That means format design is tightly connected to both type system design and connector usability.

Schema and Format

Formats interact with schema in several different ways.

Schema-Driven Formats

Some formats work best when schema is explicit, either from the connector, an external registry, or a user-defined schema block.

Typical examples:

Avro
Protobuf
strongly typed JSON pipelines

Schema-Lite or Schema-Less Formats

Some formats can be consumed with weaker schema assumptions, but that shifts more responsibility to connector configuration or runtime inference.

Typical examples:

plain JSON
text-based payloads

Related docs:

Schema Feature

CDC Format Handling

CDC formats are especially important because they are often used to integrate SeaTunnel with existing ecosystems.

Common reasons to use a CDC envelope format:

keep compatibility with Debezium consumers
publish changelog events to Kafka
preserve before / after semantics for downstream consumers
bridge database CDC into messaging systems

When documenting or reviewing a CDC format flow, check:

whether row kind is preserved correctly
whether source metadata is preserved
whether topic, table, and key conventions remain compatible downstream

Type Mapping Risk Areas

The most error-prone areas in format handling are usually:

decimal precision and scale
timestamp and time zone semantics
nested row, map, and array types
binary encoding
null handling in schema-less JSON-style payloads

These issues often look like connector bugs, but the real boundary problem is in format conversion.

Operational Concerns

When a pipeline fails around formats, the symptoms often show up as:

decode failures
incompatible schema errors
downstream consumers rejecting messages
silent precision loss or unexpected type coercion

The first questions to ask are:

Is the connector using the expected format?
Does the source and sink agree on schema expectations?
Is this a plain row format or a CDC envelope format?
Are nested and temporal fields mapped consistently?

Data Format Handling

Why This Page Exists​

The Core Idea​

Format Position in the Pipeline​

What the Internal Model Looks Like​

Common Format Categories​

Plain Row Formats​

CDC Envelope Formats​

Where Format Handling Usually Happens​

Source Side​

Sink Side​

Why Format Handling Is Not Just Serialization​

Schema and Format​

Schema-Driven Formats​

Schema-Lite or Schema-Less Formats​

CDC Format Handling​

Type Mapping Risk Areas​

Operational Concerns​

Recommended Reading Path​

Why This Page Exists

The Core Idea

Format Position in the Pipeline

What the Internal Model Looks Like

Common Format Categories

Plain Row Formats

CDC Envelope Formats

Where Format Handling Usually Happens

Source Side

Sink Side

Why Format Handling Is Not Just Serialization

Schema and Format

Schema-Driven Formats

Schema-Lite or Schema-Less Formats

CDC Format Handling

Type Mapping Risk Areas

Operational Concerns

Recommended Reading Path