Embedding
Embedding Transform Plugin
Description
The Embedding transform plugin leverages embedding models to convert text and multimodal data into vectorized representations. This
transformation can be applied to various fields including text, images, and videos. The plugin supports multiple model providers and can be integrated with
different API endpoints.
Important Note: The current embedding precision only supports float32 format.
Options
| Name | Type | Required | Default Value | Description |
|---|---|---|---|---|
| model_provider | enum | yes | - | The model provider for embedding. Options may include AMAZON, QIANFAN, OPENAI, etc. |
| api_key | string | yes | - | The API key required to authenticate with the embedding service. |
| secret_key | string | yes | - | The secret key required for additional authentication with the embedding service. |
| aws_region | string | no | - | AWS Region. Required when using an Amazon Bedrock model. |
| single_vectorized_input_number | int | no | 1 | The number of inputs vectorized in one request. Default is 1. |
| vectorization_fields | map | yes | - | A mapping between input fields and their corresponding output vector fields. |
| model | string | yes | - | The specific model to use for embedding (e.g., text-embedding-3-small for OPENAI). |
| api_path | string | no | - | The API endpoint for the embedding service. Typically provided by the model provider. |
| dimension | int | no | - | The vector dimension defaults to 2048. The Embedding-3 model supports custom vector dimensions, and it is recommended to choose dimensions of 256, 512, 1024, or 2048. |
| oauth_path | string | no | - | The API endpoint for the OAuth service. |
| custom_config | map | no | - | Custom configurations for the model. |
| custom_response_parse | string | no | - | Specifies how to parse the response from the model using JsonPath. Example: $.choices[*].message.content. |
| custom_request_headers | map | no | - | Custom headers for the request to the model. |
| custom_request_body | map | no | - | Custom body for the request. Supports placeholders like ${model}, ${input}. |
| model_retry_max_attempts | int | no | 1 | Maximum attempts for one remote model request. The default value 1 keeps the previous no-retry behavior. |
| model_retry_backoff_ms | long | no | 1000 | Initial backoff in milliseconds before retrying a remote model request. |
| model_retry_max_backoff_ms | long | no | 10000 | Maximum backoff in milliseconds before retrying a remote model request. |
| model_request_timeout_ms | int | no | 20000 | Request timeout in milliseconds for remote model calls. |
Precision Support
Important: The current version of the Embedding plugin only supports float32 precision for vector data.
- All generated embedding vectors will be stored in float32 format
- If your model or API returns other precision formats (such as float64), the plugin will automatically convert them to float32
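The effect of the float32 conversion can be illustrated with a minimal Python sketch. It is not the plugin's implementation; the helper name `to_float32` is hypothetical, and the standard-library `array` module is used only to show how a float64 value is rounded when stored as float32.

```python
from array import array


def to_float32(values):
    """Round each value to float32 precision (hypothetical helper)."""
    # array("f", ...) stores C floats, which are 32-bit on mainstream platforms.
    # tolist() returns Python floats that carry the rounded float32 values.
    return array("f", values).tolist()


vec64 = [-0.006929283495992422, -0.005336422007530928]
vec32 = to_float32(vec64)
# The rounded values differ slightly from the float64 originals.
print(vec32)
```

Values that are not exactly representable in float32 change slightly, so downstream comparisons against float64 vectors should use a tolerance rather than strict equality.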
model_provider
The providers for generating embeddings include common options such as AMAZON, DOUBAO, QIANFAN, and OPENAI. Additionally,
you can choose CUSTOM to implement requests and retrievals for custom embedding models.
api_key
The API key for authenticating requests to the embedding service. This is typically provided by the model provider when you register for their service.
secret_key
The secret key used for additional authentication. Some providers may require this for secure API requests.
single_vectorized_input_number
Specifies how many model inputs are included in one remote vectorization request. The default is 1. Adjust based on your processing capacity and the model provider's API limitations.
This is request-level batching inside one row's vectorization inputs. It is not row-level transform micro-batching, and it does not mean that the transform will collect multiple SeaTunnel rows before calling the provider.
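As a rough illustration of this request-level batching, the following Python sketch splits the vectorization inputs of a single row into request-sized groups. The function name `split_into_requests` is hypothetical and only mirrors the behavior described above; it is not the plugin's code.

```python
def split_into_requests(row_inputs, single_vectorized_input_number=1):
    """Group one row's inputs into lists of at most N inputs per request."""
    n = single_vectorized_input_number
    if n < 1:
        raise ValueError("single_vectorized_input_number must be >= 1")
    return [row_inputs[i:i + n] for i in range(0, len(row_inputs), n)]


# One row with two vectorization fields (book_intro, author_biography).
row_inputs = ["intro text", "biography text"]
print(split_into_requests(row_inputs, 1))  # two requests, one input each
print(split_into_requests(row_inputs, 2))  # one request carrying both inputs
```

Under this reading, a larger value reduces the number of remote calls per row but never merges inputs from different rows.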
Model invocation reliability
Embedding providers use a common model invocation runtime for remote calls. The provider is still responsible for provider-specific request body, headers, authentication, response parsing, and provider error conversion. The common runtime handles timeout propagation, retry, error classification, response count validation, safe logs, metrics hooks, and the cache boundary.
The default retry behavior is compatible with previous jobs: model_retry_max_attempts = 1 means each request is tried
once. When you configure a value greater than 1, retryable failures such as rate limiting, timeout, and temporary remote
service errors can be retried with backoff. Authentication failures, configuration errors, response parse failures, and
response count mismatches are not retried.
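The retry behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not the runtime's code: the exception classes and `call_with_retry` are hypothetical names, and the doubling of the backoff between attempts is an assumption, since the options only define an initial and a maximum backoff.

```python
import time


class RetryableError(Exception):
    """Hypothetical: rate limiting, timeout, temporary remote service error."""


class NonRetryableError(Exception):
    """Hypothetical: authentication, configuration, parse, or count errors."""


def call_with_retry(request, max_attempts=1, backoff_ms=1000,
                    max_backoff_ms=10000, sleep=time.sleep):
    """Call request() up to max_attempts times, backing off between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    delay_ms = backoff_ms
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RetryableError:
            if attempt == max_attempts:
                raise  # attempts exhausted, surface the last failure
            sleep(min(delay_ms, max_backoff_ms) / 1000.0)
            delay_ms *= 2  # assumption: exponential growth, capped above
        # NonRetryableError is not caught, so it propagates immediately.
```

With the default `max_attempts=1` the loop runs once and any failure is raised directly, which matches the no-retry default described above.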
For every request batch, the number of returned vectors must match the number of inputs. If the provider returns fewer or more vectors than requested, the request fails and the transform does not emit possibly misaligned vectors.
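A minimal Python sketch of this count check is shown below. The function name `pair_inputs_with_vectors` and the use of `ValueError` are assumptions made for illustration; the point is that a mismatch fails the whole batch instead of pairing inputs with the wrong vectors.

```python
def pair_inputs_with_vectors(inputs, vectors):
    """Pair each input with its vector, failing on any count mismatch."""
    if len(vectors) != len(inputs):
        raise ValueError(
            f"expected {len(inputs)} vectors but provider returned {len(vectors)}"
        )
    return list(zip(inputs, vectors))


pairs = pair_inputs_with_vectors(["a", "b"], [[0.1, 0.2], [0.3, 0.4]])
print(pairs)
```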
Retries are performed for the same remote request payload. The transform only emits vectors after a successful response, but providers can still charge or apply side effects per attempt. Downstream sink idempotency is not changed by this option.
The runtime records safe diagnostic context such as provider, model, batch size, attempt number, error category, retryable flag, and elapsed time. It does not log API keys, secret keys, full source text chunks, binary payloads, or full provider response bodies.
Bedrock now uses the same common runtime path as the other embedding providers, so retry, timeout, response parsing, and response-count validation behave consistently across providers.
The runtime also has a cache boundary. When a cache implementation is wired in, keys are built from provider, model,
output configuration, modality, format, normalized metadata, and a SHA-256 digest of normalized input content. The
default production wiring still uses ModelInvocationCache.NOOP, so existing jobs keep the previous behavior unless an
integration layer enables caching. Existing binary multimodal cache state is unchanged and still only reassembles file
chunks before vectorization.
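To make the key composition concrete, here is a hedged Python sketch that assembles a key from the components listed above. The function name `build_cache_key`, the `|` separator, the JSON serialization of the configuration and metadata, and the whitespace-stripping normalization are all assumptions; only the list of components and the use of SHA-256 come from the description.

```python
import hashlib
import json


def build_cache_key(provider, model, output_config, modality, fmt,
                    metadata, content):
    """Build an illustrative cache key; the real key layout may differ."""
    # Assumption: text is normalized by stripping surrounding whitespace.
    if isinstance(content, str):
        normalized = content.strip().encode("utf-8")
    else:
        normalized = bytes(content)
    digest = hashlib.sha256(normalized).hexdigest()
    parts = [
        provider,
        model,
        json.dumps(output_config, sort_keys=True),  # stable ordering
        modality,
        fmt,
        json.dumps(metadata, sort_keys=True),
        digest,
    ]
    return "|".join(parts)


key = build_cache_key("OPENAI", "text-embedding-3-small",
                      {"dimension": 1024}, "text", "text", {}, " hello ")
print(key)
```

Hashing the content keeps the key bounded in size and avoids placing source text in the key itself.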
Compatibility notes:
- The default values for model_retry_max_attempts, model_retry_backoff_ms, and model_request_timeout_ms remain unchanged.
- No user-facing config names were renamed or removed in this update.
- The cache integration is additive and does not change the default execution path.
vectorization_fields
A mapping between input fields and their respective output vector fields. This allows the plugin to understand which fields to vectorize and how to store the resulting vectors. The plugin supports multimodal data by allowing you to specify the modality type for each field.
Basic Text Vectorization:
vectorization_fields {
book_intro_vector = book_intro
author_biography_vector = author_biography
}
Multimodal Vectorization:
vectorization_fields {
# Basic text field
text_vector = text_field
# Explicit modality type configuration
product_image_vector = {
field = product_image_url
modality = jpeg
format = url
}
# Auto-detect modality type (based on file suffix)
thumbnail_vector = {
field = thumbnail_image # If value is "image.png", auto-detects as PNG modality
format = url
}
# Video field configuration
demo_video_vector = {
field = product_video_url
modality = mp4
format = url
}
# Binary data configuration
binary_image_vector = {
field = image_data
modality = jpeg
format = binary
}
}
Field Specification Formats:
Supported Modality Types:
- Images: jpeg (jpg, jpeg), png (png, apng), gif, webp, bmp (bmp, dib), tiff (tiff, tif), ico, icns, sgi, jpeg2000 (j2c, j2k, jp2, jpc, jpf, jpx)
- Videos: mp4, avi, mov
- Text: text (default)
Payload Formats:
- text: Text format (default)
- url: URL format
- binary: Binary data format
Automatic Modality Detection:
When modality is not explicitly specified and format is not binary, the system automatically detects the modality type based on the file suffix of the field value.
Important: When using multimodal fields (image or video), ensure your model provider supports multimodal embedding. Image and video fields must contain valid URLs or binary data. Currently, the DOUBAO provider supports multimodal data processing.
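The suffix-based detection can be sketched in Python as follows. The suffix table is taken from the supported modality types listed above; the function name `detect_modality`, the fallback to `text` for unknown suffixes, and returning `None` for binary payloads without an explicit modality are assumptions made for illustration.

```python
SUFFIX_TO_MODALITY = {
    "jpg": "jpeg", "jpeg": "jpeg",
    "png": "png", "apng": "png",
    "gif": "gif", "webp": "webp",
    "bmp": "bmp", "dib": "bmp",
    "tiff": "tiff", "tif": "tiff",
    "ico": "ico", "icns": "icns", "sgi": "sgi",
    "j2c": "jpeg2000", "j2k": "jpeg2000", "jp2": "jpeg2000",
    "jpc": "jpeg2000", "jpf": "jpeg2000", "jpx": "jpeg2000",
    "mp4": "mp4", "avi": "avi", "mov": "mov",
}


def detect_modality(value, modality=None, fmt="text"):
    """Return the explicit modality, or infer one from the file suffix."""
    if modality:
        return modality  # explicit configuration always wins
    if fmt == "binary":
        return None  # assumption: binary payloads have no suffix to inspect
    # Drop any query string or fragment before looking at the suffix.
    path = value.split("?", 1)[0].split("#", 1)[0]
    suffix = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    # Assumption: values with an unknown suffix are treated as text.
    return SUFFIX_TO_MODALITY.get(suffix, "text")


print(detect_modality("image.png", fmt="url"))
print(detect_modality("https://example.com/videos/promo.mov", fmt="url"))
```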
model
The specific embedding model to use. This depends on the model_provider. For example, if using OPENAI, you
might specify text-embedding-3-small.
api_path
The API endpoint to use for making requests to the embedding service. This might vary based on the provider and model used. Generally, this is provided by the model provider.
oauth_path
The API endpoint for the OAuth service, used to obtain authentication information. This might vary based on the provider and model used. Generally, this is provided by the model provider.
custom_config
The custom_config option allows you to provide additional custom configurations for the model. This is a map where you
can define various settings that might be required by the specific model you're using.
custom_response_parse
The custom_response_parse option allows you to specify how to parse the model's response. You can use JsonPath to
extract the specific data you need from the response. For example, by using $.data[*].embedding, you can extract
the embedding field values from the following JSON and obtain a List of nested List results. For more details on
using JsonPath, please refer to
the JsonPath Getting Started guide.
{
"object": "list",
"data": [
{
"object": "embedding",
"index": 0,
"embedding": [
-0.006929283495992422,
-0.005336422007530928,
-0.00004547132266452536,
-0.024047505110502243
]
}
],
"model": "text-embedding-3-small",
"usage": {
"prompt_tokens": 5,
"total_tokens": 5
}
}
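For readers unfamiliar with JsonPath, the following Python sketch shows what the expression `$.data[*].embedding` selects from the response above. It uses plain dictionary and list access rather than a JsonPath library, so it illustrates the result shape only; the function name `extract_embeddings` is hypothetical.

```python
import json

response_text = """
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        -0.00004547132266452536,
        -0.024047505110502243
      ]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
"""


def extract_embeddings(response):
    """Plain-Python equivalent of the JsonPath $.data[*].embedding."""
    return [item["embedding"] for item in response["data"]]


vectors = extract_embeddings(json.loads(response_text))
# One inner list per element of "data", i.e. one vector per input.
print(len(vectors), len(vectors[0]))
```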
custom_request_headers
The custom_request_headers option allows you to define custom headers that should be included in the request sent to
the model's API. This is useful if the API requires additional headers beyond the standard ones, such as authorization
tokens, content types, etc.
custom_request_body
The custom_request_body option supports placeholders:
- ${model}: Placeholder for the model name.
- ${input}: Placeholder to determine input value and define request body request type based on the type of body value. Example: ["${input}"] -> ["input"] (list)
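One possible reading of the placeholder substitution is sketched below in Python. This is an interpretation, not the plugin's implementation: the function name `render_request_body` is hypothetical, and the rule that a bare `"${input}"` string accepts exactly one input is an assumption. The list form `["${input}"]` expanding to a list of inputs follows the example above.

```python
def render_request_body(template, model, inputs):
    """Fill ${model} and ${input} placeholders in a request body template."""

    def render(value):
        if isinstance(value, dict):
            return {key: render(item) for key, item in value.items()}
        if isinstance(value, list):
            if value == ["${input}"]:
                return list(inputs)  # list form: send all inputs as a list
            return [render(item) for item in value]
        if value == "${model}":
            return model
        if value == "${input}":
            # Assumption: a bare string placeholder carries a single input.
            if len(inputs) != 1:
                raise ValueError("string-typed ${input} expects one input")
            return inputs[0]
        return value  # literal values pass through unchanged

    return render(template)


body = render_request_body(
    {"modelx": "${model}", "inputx": ["${input}"]},
    "text-embedding-3-small",
    ["first text", "second text"],
)
print(body)
```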
common options
For the common parameters of transform plugins, refer to Transform Plugin.
Example Configurations
Basic Text Embedding
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
book_id = "int"
book_name = "string"
book_intro = "string"
author_biography = "string"
}
}
rows = [
{fields = [1, "To Kill a Mockingbird",
"Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.",
"Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel."
], kind = INSERT}
{fields = [2, "1984",
"1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.",
"George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society."
], kind = INSERT}
{fields = [3, "Pride and Prejudice",
"Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.",
"Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language."
], kind = INSERT}
{fields = [4, "The Great Gatsby",
"The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.",
"F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s."
], kind = INSERT}
{fields = [5, "Moby-Dick",
"Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.",
"Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors."
], kind = INSERT}
]
plugin_output = "fake"
}
}
transform {
Embedding {
plugin_input = "fake"
model_provider = QIANFAN
model = bge_large_en
api_key = xxxxxxxxxx
secret_key = xxxxxxxxxx
api_path = xxxxxxxxxx
vectorization_fields {
book_intro_vector = book_intro
author_biography_vector = author_biography
}
plugin_output = "embedding_output"
}
}
sink {
Assert {
plugin_input = "embedding_output"
rules =
{
field_rules = [
{
field_name = book_id
field_type = int
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_name
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
Multimodal Embedding (Volcengine Doubao)
Multimodal embedding accepts input either as an accessible URL or as binary data.
URL
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
id = "int"
product_name = "string"
description = "string"
product_image_url = "string"
product_video_url = "string"
thumbnail_image = "string"
promotional_video = "string"
category = "string"
price = "decimal(10,2)"
created_at = "timestamp"
}
}
rows = [
{
fields = [
1,
"iPhone 15 Pro",
"Latest iPhone with advanced camera system and A17 Pro chip",
"https://example.com/images/iphone15pro.jpg",
"https://example.com/videos/iphone15pro_demo.mp4",
"https://example.com/thumbnails/iphone15pro_thumb.png",
"https://example.com/videos/iphone15pro_promo.mov",
"Electronics",
999.99,
"2024-01-15T10:30:00"
],
kind = INSERT
},
{
fields = [
2,
"MacBook Air M3",
"Ultra-thin laptop with M3 chip for incredible performance",
"https://example.com/images/macbook_air_m3.jpeg",
"https://example.com/videos/macbook_air_review.avi",
"https://example.com/thumbnails/macbook_thumb.webp",
"https://example.com/videos/macbook_commercial.mp4",
"Computers",
1299.99,
"2024-02-20T14:15:00"
],
kind = INSERT
}
]
plugin_output = "fake"
}
}
transform {
Embedding {
plugin_input = "fake"
model_provider = DOUBAO
model = "doubao-embedding-vision"
api_key = "your-api-key"
api_path = "https://ark.cn-beijing.volces.com/api/v3/embeddings/multimodal"
single_vectorized_input_number = 1
vectorization_fields {
# Text field - defaults to text modality
description_vector = description
product_image_vector = {
field = product_image_url
modality = jpeg
format = url
}
thumbnail_vector = {
field = thumbnail_image # If value is "thumb.png", auto-detects as PNG
format = url
}
demo_video_vector = {
field = product_video_url
modality = mp4
format = url
}
promo_video_vector = {
field = promotional_video # If value is "promo.mov", auto-detects as MOV
format = url
}
# Mixed content - product name
product_name_vector = product_name
}
plugin_output = "multimodal_embedding_output"
}
}
sink {
Assert {
plugin_input = "multimodal_embedding_output"
rules = {
field_rules = [
{
field_name = id
field_type = int
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = description_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = product_image_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = thumbnail_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = demo_video_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
Binary
env {
job.mode = "BATCH"
}
source {
LocalFile {
path = "/seatunnel/read/binary/"
file_format_type = "binary"
binary_complete_file_mode = false
binary_chunk_size = 1024
plugin_output = "binary_source"
}
}
transform {
Embedding {
plugin_input = "binary_source"
model_provider = DOUBAO
model = "doubao-embedding-vision-250615"
api_key = "test-api-key"
api_path = "http://mockserver:1080/api/v3/embeddings/multimodal"
single_vectorized_input_number = 1
vectorization_fields = {
image_embedding = {
field = "data"
modality = "jpeg"
format = "binary"
}
}
plugin_output = "binary_embedding_output"
}
}
sink {
Assert {
plugin_input = "binary_embedding_output"
rules = {
row_rules = [
{
rule_type = MAX_ROW
rule_value = 1
}
],
field_rules = [
{
field_name = image_embedding
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = relativePath
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}
Customize the embedding model
env {
job.mode = "BATCH"
}
source {
FakeSource {
row.num = 5
schema = {
fields {
book_id = "int"
book_name = "string"
book_intro = "string"
author_biography = "string"
}
}
rows = [
{fields = [1, "To Kill a Mockingbird",
"Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.",
"Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel."
], kind = INSERT}
{fields = [2, "1984",
"1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.",
"George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society."
], kind = INSERT}
{fields = [3, "Pride and Prejudice",
"Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.",
"Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language."
], kind = INSERT}
{fields = [4, "The Great Gatsby",
"The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.",
"F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s."
], kind = INSERT}
{fields = [5, "Moby-Dick",
"Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.",
"Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors."
], kind = INSERT}
]
plugin_output = "fake"
}
}
transform {
Embedding {
plugin_input = "fake"
model_provider = CUSTOM
model = text-embedding-3-small
api_key = xxxxxxxx
api_path = "http://mockserver:1080/v1/doubao/embedding"
single_vectorized_input_number = 2
vectorization_fields {
book_intro_vector = book_intro
author_biography_vector = author_biography
}
custom_config={
custom_response_parse = "$.data[*].embedding"
custom_request_headers = {
"Content-Type"= "application/json"
"Authorization"= "Bearer xxxxxxx"
}
custom_request_body ={
modelx = "${model}"
inputx = ["${input}"]
}
}
plugin_output = "embedding_output_1"
}
}
sink {
Assert {
plugin_input = "embedding_output_1"
rules =
{
field_rules = [
{
field_name = book_id
field_type = int
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_name
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography
field_type = string
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = book_intro_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
},
{
field_name = author_biography_vector
field_type = float_vector
field_value = [
{
rule_type = NOT_NULL
}
]
}
]
}
}
}