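Embedding

Embedding: convert content such as text, images, or video into vector representations

Description

The Embedding transform plugin uses embedding models to convert text and multimodal data into vector representations. The transformation can be applied to a variety of fields, including text, images, and video. The plugin supports multiple model providers and can integrate with different APIs.

Important: embedding precision currently supports float32 only.

Configuration options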

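| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| model_provider | enum | | - | The embedding model provider. Options include AMAZON, QIANFAN, OPENAI, and others. |
| api_key | string | | - | API key used to authenticate with the embedding service. |
| secret_key | string | | - | Secret key used for additional authentication. Some providers may require it for secure API requests. |
| aws_region | string | | | Region for model requests. Required when using Amazon Bedrock models. |
| single_vectorized_input_number | int | | 1 | Number of inputs vectorized in a single request. Defaults to 1. |
| vectorization_fields | map | | - | Mapping between input fields and the corresponding output vector fields. |
| model | string | | - | The specific embedding model to use. For example, with the OPENAI provider you can specify text-embedding-3-small. |
| api_path | string | | - | API endpoint of the embedding service. Usually supplied by the model provider. |
| dimension | int | | 2048 | Vector dimension, 2048 by default. Embedding-3 models support custom dimensions; 256, 512, 1024, or 2048 is recommended. |
| oauth_path | string | | - | API endpoint of the OAuth service. |
| custom_config | map | | | Custom configuration for the model. |
| custom_response_parse | string | | | How to parse the model response, expressed as a JsonPath. Example: $.choices[*].message.content |
| custom_request_headers | map | | | Custom headers for requests sent to the model. |
| custom_request_body | map | | | Custom configuration of the request body. Supports placeholders such as ${model} and ${input}. |
| model_retry_max_attempts | int | | 1 | Maximum number of attempts for a single remote model request. The default of 1 keeps the original behavior of no automatic retry. |
| model_retry_backoff_ms | long | | 1000 | Initial backoff before retrying a remote model request, in milliseconds. |
| model_retry_max_backoff_ms | long | | 10000 | Maximum backoff before retrying a remote model request, in milliseconds. |
| model_request_timeout_ms | int | | 20000 | Request timeout for remote model calls, in milliseconds. |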

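Precision support

Important: the current version of the Embedding plugin supports only float32 vector data.

  • All generated embedding vectors are stored in float32 format.
  • If your model or API returns another precision (such as float64), the plugin converts it to float32 automatically.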
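
To make the effect of the float64 to float32 conversion concrete, here is a minimal Python sketch. It only illustrates the precision loss; how the plugin performs the conversion internally is not described here, and the helper names `to_float32` and `to_float32_vector` are hypothetical.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (a float64) to the nearest float32 value."""
    # Packing into a 4-byte IEEE 754 float and unpacking again discards
    # the extra precision, which is what a float64 -> float32 cast does.
    return struct.unpack("<f", struct.pack("<f", value))[0]


def to_float32_vector(vector):
    """Apply the float32 rounding to every component of a vector."""
    return [to_float32(v) for v in vector]


vec64 = [-0.006929283495992422, -0.005336422007530928]
vec32 = to_float32_vector(vec64)
```

Values that are exactly representable in float32 (such as 0.5) are unchanged, while most others shift slightly, so downstream comparisons should allow for a small tolerance.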

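model_provider

The model provider used to generate embeddings. Common options include AMAZON, DOUBAO, QIANFAN, and OPENAI. You can also choose CUSTOM to define the request and the response handling for your own embedding model.

api_key

The API key used to authenticate requests to the embedding service. It is usually issued by the model provider when you register for the service. For AMAZON models, it corresponds to the IAM access key.

secret_key

A secret key used for additional authentication. Some providers may require it to secure API requests.

single_vectorized_input_number

The number of model inputs included in one remote vectorization request. The default is 1. Adjust it according to your processing capacity and the API limits of the model provider.

This is request-level batching: it applies only to the multiple inputs to be vectorized within a single row. It is not row-level transform micro-batching, and it does not mean that the Transform collects multiple SeaTunnel rows before calling the model provider.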
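
The request-level semantics can be sketched in Python as follows. This is an illustration under assumptions, not the plugin's code: the function name `request_batches` is hypothetical, and grouping leftover inputs into a smaller final request is assumed.

```python
def request_batches(inputs, single_vectorized_input_number=1):
    """Split the inputs of ONE row into per-request batches."""
    if single_vectorized_input_number < 1:
        raise ValueError("single_vectorized_input_number must be >= 1")
    n = single_vectorized_input_number
    # Only this row's inputs are grouped; other rows are never mixed in.
    return [inputs[i:i + n] for i in range(0, len(inputs), n)]


# Three fields of a single row are configured for vectorization.
row_inputs = ["book_intro text", "author_biography text", "summary text"]
batches = request_batches(row_inputs, 2)
```

With a value of 2, the three inputs of this row produce two remote requests; with the default of 1, each input gets its own request.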

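Model invocation reliability

Embedding providers make remote calls through a common model invocation runtime. Each provider remains responsible for the provider-specific request body, request headers, authentication, response parsing, and provider error translation. The common runtime handles timeout propagation, retries, error classification, response count validation, safe logging, metrics hooks, and the cache boundary.

The default retry behavior is compatible with existing jobs: model_retry_max_attempts = 1 means each request is attempted only once. When it is set above 1, retryable failures such as rate limiting, timeouts, and transient remote service errors are retried according to the backoff policy. Authentication failures, configuration errors, response parsing failures, and mismatched vector counts are not retried.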
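
A rough Python sketch of this retry policy is shown below. Several details are assumptions made for illustration: the error category names, the `ModelCallError` type, and in particular the doubling of the delay between retries (the text above only specifies an initial and a maximum backoff).

```python
import time

# Hypothetical category names; the text lists kinds of failure, not identifiers.
RETRYABLE = {"RATE_LIMIT", "TIMEOUT", "TRANSIENT_REMOTE_ERROR"}


class ModelCallError(Exception):
    """Illustrative error type that carries an error category."""

    def __init__(self, category):
        super().__init__(category)
        self.category = category


def backoff_ms(retry_index, initial_ms=1000, max_ms=10000):
    # Assumption: the delay doubles on each retry and is capped at max_ms.
    return min(initial_ms * (2 ** retry_index), max_ms)


def invoke_with_retry(call, max_attempts=1, initial_ms=1000, max_ms=10000,
                      sleep=time.sleep):
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ModelCallError as err:
            # Non-retryable categories and the final attempt surface the error.
            if err.category not in RETRYABLE or attempt == max_attempts:
                raise
            sleep(backoff_ms(attempt - 1, initial_ms, max_ms) / 1000.0)
```

With the defaults (`max_attempts=1`), the call is made once and any error propagates immediately, which matches the backward-compatible behavior described above.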

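Every request batch must return exactly one vector per input. If the provider returns fewer or more vectors than there are inputs, the request fails and the Transform does not emit vectors that might be misaligned.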
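
The count check can be pictured with a short Python sketch. The function name `check_vector_count` and the use of `ValueError` are illustrative assumptions; the point is that a mismatch fails the whole request rather than pairing inputs with the wrong vectors.

```python
def check_vector_count(inputs, vectors):
    """Pair each input with its vector, or fail if the counts differ."""
    if len(vectors) != len(inputs):
        raise ValueError(
            f"expected {len(inputs)} vectors, got {len(vectors)}"
        )
    # Safe to pair positionally only once the counts are known to match.
    return list(zip(inputs, vectors))


pairs = check_vector_count(["a", "b"], [[0.1, 0.2], [0.3, 0.4]])
```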

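A retry sends the same remote request payload again. The Transform emits vectors only after it receives a successful response, but the provider may still bill for each attempt or produce provider-side side effects. This setting does not change the idempotency semantics of downstream Sinks.

The runtime records safe diagnostic context such as provider, model, batch size, attempt number, error category, retryable flag, and elapsed time. Logs never include the API key, the secret key, full source text chunks, binary payloads, or the full provider response body.

Bedrock now also goes through the unified common runtime path, so retry, timeout, response parsing, and returned-count validation behave consistently across providers.

The runtime also provides a cache boundary. When a cache implementation is plugged in, the key is composed of the provider, model, output configuration, modality, format, normalized metadata, and the SHA-256 digest of the normalized input content. The default production wiring still uses ModelInvocationCache.NOOP, so the behavior of existing jobs stays unchanged until caching is explicitly enabled in the integration layer. The existing binary multimodal cache behavior is unchanged and is still used only to reassemble file chunks before vectorization.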
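
As an illustration of how such a key could be assembled, here is a Python sketch. The function `cache_key`, the JSON serialization, and the normalization steps (trimming text, sorting keys) are assumptions; the text above only names the components of the key.

```python
import hashlib
import json


def cache_key(provider, model, output_config, modality, fmt, metadata, content):
    # Assumption: text is normalized by trimming whitespace and UTF-8 encoding.
    if isinstance(content, str):
        content = content.strip().encode("utf-8")
    parts = {
        "provider": provider,
        "model": model,
        "output": output_config,
        "modality": modality,
        "format": fmt,
        "metadata": metadata,
        # Only the digest enters the key, never the raw input content.
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }
    # sort_keys yields one canonical form regardless of dict ordering.
    return json.dumps(parts, sort_keys=True, separators=(",", ":"))


key = cache_key("OPENAI", "text-embedding-3-small", {"dimension": 1024},
                "text", "text", {"a": 1, "b": 2}, "hello ")
```

Hashing the content keeps keys short and avoids storing source text in the key, which is in line with the logging rules described above.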

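Compatibility notes:

  • The defaults of model_retry_max_attempts, model_retry_backoff_ms, and model_request_timeout_ms remain unchanged.
  • This update does not rename or remove any user-visible configuration option.
  • Cache integration is an incremental capability and does not change the default execution path.

vectorization_fields

The mapping between input fields and the corresponding output vector fields. It tells the plugin which fields to vectorize and where to store the resulting vectors. The plugin supports multimodal data by letting you specify a modality type for each field.

Basic text vectorization: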

vectorization_fields {
  book_intro_vector = book_intro
  author_biography_vector = author_biography
}

Multimodal vectorization:

vectorization_fields {
  # Plain text field
  text_vector = text_field

  # Modality type specified explicitly
  product_image_vector = {
    field = product_image_url
    modality = jpeg
    format = url
  }

  # Modality type detected automatically (from the file suffix)
  thumbnail_vector = {
    field = thumbnail_image # a value of "image.png" is detected as the PNG modality
    format = url
  }

  # Video field
  demo_video_vector = {
    field = product_video_url
    modality = mp4
    format = url
  }

  # Binary data
  binary_image_vector = {
    field = image_data
    modality = jpeg
    format = binary
  }
}

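Field specification format:

Supported modality types:

  • Image: jpeg (jpg, jpeg), png (png, apng), gif, webp, bmp (bmp, dib), tiff (tiff, tif), ico, icns, sgi, jpeg2000 (j2c, j2k, jp2, jpc, jpf, jpx)
  • Video: mp4, avi, mov
  • Text: text (default)

Data formats:

  • text - text format (default)
  • url - URL format
  • binary - binary data format

Automatic modality detection: when modality is not specified explicitly and format is not binary, the system detects the modality type from the file suffix of the field value.

Important: when using multimodal fields (images or video), make sure your model provider supports multimodal embeddings. Image and video fields must contain a valid URL or binary data. Currently, the DOUBAO provider supports multimodal data processing.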
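
The suffix-based detection described above can be approximated with the following Python sketch. The suffix table is taken from the list of supported modality types; the function name `detect_modality`, the fallback to `text` for unknown suffixes, and the error raised for binary data without an explicit modality are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Modality -> file suffixes, copied from the supported modality list.
EXTENSIONS = {
    "jpeg": ["jpg", "jpeg"], "png": ["png", "apng"], "gif": ["gif"],
    "webp": ["webp"], "bmp": ["bmp", "dib"], "tiff": ["tiff", "tif"],
    "ico": ["ico"], "icns": ["icns"], "sgi": ["sgi"],
    "jpeg2000": ["j2c", "j2k", "jp2", "jpc", "jpf", "jpx"],
    "mp4": ["mp4"], "avi": ["avi"], "mov": ["mov"],
}
SUFFIX_TO_MODALITY = {s: m for m, suffixes in EXTENSIONS.items() for s in suffixes}


def detect_modality(value, modality=None, fmt="text"):
    if modality is not None:
        return modality  # an explicit setting always wins
    if fmt == "binary":
        # Assumption: binary payloads carry no suffix, so modality is required.
        raise ValueError("modality must be set explicitly for binary format")
    # Use only the URL path so query strings do not hide the suffix.
    suffix = PurePosixPath(urlparse(value).path).suffix.lstrip(".").lower()
    # Assumption: unknown or missing suffixes fall back to the default, text.
    return SUFFIX_TO_MODALITY.get(suffix, "text")


thumbnail = detect_modality("image.png", fmt="url")
```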

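model

The specific embedding model to use. It depends on model_provider. For example, with OPENAI you can specify text-embedding-3-small.

api_path

The API endpoint used to send requests to the embedding service. It may vary by provider and model, and is usually supplied by the model provider.

oauth_path

The API endpoint used to send requests to the OAuth service and obtain the authentication credentials. It may vary by provider and model, and is usually supplied by the model provider.

custom_config

The custom_config option lets you supply additional custom configuration for the model. It is a map in which you can define the settings that a specific model may require.

custom_response_parse

The custom_response_parse option lets you specify how to parse the model's response. You can use JsonPath to extract the specific data you need. For example, $.data[*].embedding extracts the values of the embedding field from the JSON below, yielding a list of lists. For JsonPath usage, see the JsonPath getting-started guide.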

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        -0.00004547132266452536,
        -0.024047505110502243
      ]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
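
To show what $.data[*].embedding selects from the response above, here is a small Python sketch. It does not use a JsonPath library; `extract_embeddings` is a hypothetical hand-written equivalent of that one path, included only to make the "list of lists" result tangible.

```python
import json

response_text = """
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [-0.006929283495992422, -0.005336422007530928,
                    -0.00004547132266452536, -0.024047505110502243]
    }
  ],
  "model": "text-embedding-3-small",
  "usage": {"prompt_tokens": 5, "total_tokens": 5}
}
"""


def extract_embeddings(response):
    # Equivalent of $.data[*].embedding: take "embedding" from every
    # element of the top-level "data" array.
    return [item["embedding"] for item in response["data"]]


embeddings = extract_embeddings(json.loads(response_text))
```

The result has one inner list per input, so a request carrying several inputs yields several vectors in the same order as the `data` array.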

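custom_request_headers

The custom_request_headers option lets you define custom headers to include in requests sent to the model API. It is useful when the API requires headers beyond the standard ones, such as authorization tokens or content types.

custom_request_body

The custom_request_body option supports placeholders:

  • ${model}: placeholder for the model name.
  • ${input}: placeholder for the input value. The type of the body value that contains it determines the type used in the request body. For example: ["${input}"] -> ["input"] (a list).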
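
The placeholder substitution can be sketched in Python as below. This is an assumption-laden illustration rather than the plugin's implementation: `render_body` is a hypothetical name, a list containing "${input}" is assumed to expand to all inputs of the request, and a bare "${input}" string is assumed to carry a single input.

```python
def render_body(template, model, inputs):
    """Fill ${model} and ${input} placeholders in a request body template."""

    def render(value):
        if isinstance(value, dict):
            return {key: render(item) for key, item in value.items()}
        if isinstance(value, list):
            rendered = []
            for item in value:
                if item == "${input}":
                    rendered.extend(inputs)  # list form: one entry per input
                else:
                    rendered.append(render(item))
            return rendered
        if value == "${model}":
            return model
        if value == "${input}":
            # Assumption: a bare string placeholder carries a single input.
            return inputs[0]
        return value

    return render(template)


body = render_body(
    {"modelx": "${model}", "inputx": ["${input}"]},
    "text-embedding-3-small",
    ["first text", "second text"],
)
```

Here the field names `modelx` and `inputx` stand for whatever names the target API expects; the template controls the shape of the body while the placeholders supply the values.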

common options

Common parameters of transform plugins. See Transform Plugin for details.

Example configuration

Basic text embedding

env {
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 5
    schema = {
      fields {
        book_id = "int"
        book_name = "string"
        book_intro = "string"
        author_biography = "string"
      }
    }
    rows = [
      {fields = [1, "To Kill a Mockingbird",
        "Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.",
        "Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel."
      ], kind = INSERT}
      {fields = [2, "1984",
        "1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.",
        "George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society."
      ], kind = INSERT}
      {fields = [3, "Pride and Prejudice",
        "Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.",
        "Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language."
      ], kind = INSERT}
      {fields = [4, "The Great Gatsby",
        "The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.",
        "F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s."
      ], kind = INSERT}
      {fields = [5, "Moby-Dick",
        "Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.",
        "Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors."
      ], kind = INSERT}
    ]
    plugin_output = "fake"
  }
}

transform {
  Embedding {
    plugin_input = "fake"
    model_provider = QIANFAN
    model = bge_large_en
    api_key = xxxxxxxxxx
    secret_key = xxxxxxxxxx
    api_path = xxxxxxxxxx
    vectorization_fields {
      book_intro_vector = book_intro
      author_biography_vector = author_biography
    }
    plugin_output = "embedding_output"
  }
}

sink {
  Assert {
    plugin_input = "embedding_output"

    rules = {
      field_rules = [
        {
          field_name = book_id
          field_type = int
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = book_name
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = book_intro
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = author_biography
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = book_intro_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = author_biography_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}

Multimodal embedding (Volcengine Doubao)

Multimodal embedding accepts multimodal data either as an accessible URL or as binary data.

Accessible URL

env {
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 5
    schema = {
      fields {
        id = "int"
        product_name = "string"
        description = "string"
        product_image_url = "string"
        product_video_url = "string"
        thumbnail_image = "string"
        promotional_video = "string"
        category = "string"
        price = "decimal(10,2)"
        created_at = "timestamp"
      }
    }
    rows = [
      {
        fields = [
          1,
          "iPhone 15 Pro",
          "Latest iPhone with advanced camera system and A17 Pro chip",
          "https://example.com/images/iphone15pro.jpg",
          "https://example.com/videos/iphone15pro_demo.mp4",
          "https://example.com/thumbnails/iphone15pro_thumb.png",
          "https://example.com/videos/iphone15pro_promo.mov",
          "Electronics",
          999.99,
          "2024-01-15T10:30:00"
        ],
        kind = INSERT
      },
      {
        fields = [
          2,
          "MacBook Air M3",
          "Ultra-thin laptop with M3 chip for incredible performance",
          "https://example.com/images/macbook_air_m3.jpeg",
          "https://example.com/videos/macbook_air_review.avi",
          "https://example.com/thumbnails/macbook_thumb.webp",
          "https://example.com/videos/macbook_commercial.mp4",
          "Computers",
          1299.99,
          "2024-02-20T14:15:00"
        ],
        kind = INSERT
      }
    ]
    plugin_output = "fake"
  }
}

transform {
  Embedding {
    plugin_input = "fake"
    model_provider = DOUBAO
    model = "doubao-embedding-vision"
    api_key = "your-api-key"
    api_path = "https://ark.cn-beijing.volces.com/api/v3/embeddings/multimodal"
    single_vectorized_input_number = 1

    vectorization_fields {
      # Text field, default text modality
      description_vector = description

      # Image modality specified explicitly
      product_image_vector = {
        field = product_image_url
        modality = jpeg
        format = url
      }

      # Modality detected from the file suffix
      thumbnail_vector = {
        field = thumbnail_image
        format = url
      }

      # Video field
      demo_video_vector = {
        field = product_video_url
        modality = mp4
        format = url
      }

      promo_video_vector = {
        field = promotional_video # a value of "promo.mov" is detected as MOV
        format = url
      }

      product_name_vector = product_name
    }

    plugin_output = "multimodal_embedding_output"
  }
}

sink {
  Assert {
    plugin_input = "multimodal_embedding_output"
    rules = {
      field_rules = [
        {
          field_name = id
          field_type = int
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = description_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = product_image_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = thumbnail_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = demo_video_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}

Binary format

env {
  job.mode = "BATCH"
}

source {
  LocalFile {
    path = "/seatunnel/read/binary/"
    file_format_type = "binary"
    binary_complete_file_mode = false
    binary_chunk_size = 1024
    plugin_output = "binary_source"
  }
}

transform {
  Embedding {
    plugin_input = "binary_source"
    model_provider = DOUBAO
    model = "doubao-embedding-vision-250615"
    api_key = "test-api-key"
    api_path = "http://mockserver:1080/api/v3/embeddings/multimodal"
    single_vectorized_input_number = 1

    vectorization_fields = {
      image_embedding = {
        field = "data"
        modality = "jpeg"
        format = "binary"
      }
    }

    plugin_output = "binary_embedding_output"
  }
}

sink {
  Assert {
    plugin_input = "binary_embedding_output"
    rules = {
      row_rules = [
        {
          rule_type = MAX_ROW
          rule_value = 1
        }
      ],
      field_rules = [
        {
          field_name = image_embedding
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = relativePath
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}

Customize the embedding model

env {
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 5
    schema = {
      fields {
        book_id = "int"
        book_name = "string"
        book_intro = "string"
        author_biography = "string"
      }
    }
    rows = [
      {fields = [1, "To Kill a Mockingbird",
        "Set in the American South during the 1930s, To Kill a Mockingbird tells the story of young Scout Finch and her brother, Jem, who are growing up in a world of racial inequality and injustice. Their father, Atticus Finch, is a lawyer who defends a black man falsely accused of raping a white woman, teaching his children valuable lessons about morality, courage, and empathy.",
        "Harper Lee (1926–2016) was an American novelist best known for To Kill a Mockingbird, which won the Pulitzer Prize in 1961. Lee was born in Monroeville, Alabama, and the town served as inspiration for the fictional Maycomb in her novel. Despite the success of her book, Lee remained a private person and published only one other novel, Go Set a Watchman, which was written before To Kill a Mockingbird but released in 2015 as a sequel."
      ], kind = INSERT}
      {fields = [2, "1984",
        "1984 is a dystopian novel set in a totalitarian society governed by Big Brother. The story follows Winston Smith, a man who works for the Party rewriting history. Winston begins to question the Party’s control and seeks truth and freedom in a society where individuality is crushed. The novel explores themes of surveillance, propaganda, and the loss of personal autonomy.",
        "George Orwell (1903–1950) was the pen name of Eric Arthur Blair, an English novelist, essayist, journalist, and critic. Orwell is best known for his works 1984 and Animal Farm, both of which are critiques of totalitarian regimes. His writing is characterized by lucid prose, awareness of social injustice, opposition to totalitarianism, and support of democratic socialism. Orwell’s work remains influential, and his ideas have shaped contemporary discussions on politics and society."
      ], kind = INSERT}
      {fields = [3, "Pride and Prejudice",
        "Pride and Prejudice is a romantic novel that explores the complex relationships between different social classes in early 19th century England. The story centers on Elizabeth Bennet, a young woman with strong opinions, and Mr. Darcy, a wealthy but reserved gentleman. The novel deals with themes of love, marriage, and societal expectations, offering keen insights into human behavior.",
        "Jane Austen (1775–1817) was an English novelist known for her sharp social commentary and keen observations of the British landed gentry. Her works, including Sense and Sensibility, Emma, and Pride and Prejudice, are celebrated for their wit, realism, and biting critique of the social class structure of her time. Despite her relatively modest life, Austen’s novels have gained immense popularity, and she is considered one of the greatest novelists in the English language."
      ], kind = INSERT}
      {fields = [4, "The Great Gatsby",
        "The Great Gatsby is a novel about the American Dream and the disillusionment that can come with it. Set in the 1920s, the story follows Nick Carraway as he becomes entangled in the lives of his mysterious neighbor, Jay Gatsby, and the wealthy elite of Long Island. Gatsby's obsession with the beautiful Daisy Buchanan drives the narrative, exploring themes of wealth, love, and the decay of the American Dream.",
        "F. Scott Fitzgerald (1896–1940) was an American novelist and short story writer, widely regarded as one of the greatest American writers of the 20th century. Born in St. Paul, Minnesota, Fitzgerald is best known for his novel The Great Gatsby, which is often considered the quintessential work of the Jazz Age. His works often explore themes of youth, wealth, and the American Dream, reflecting the turbulence and excesses of the 1920s."
      ], kind = INSERT}
      {fields = [5, "Moby-Dick",
        "Moby-Dick is an epic tale of obsession and revenge. The novel follows the journey of Captain Ahab, who is on a relentless quest to kill the white whale, Moby Dick, that once maimed him. Narrated by Ishmael, a sailor aboard Ahab’s ship, the story delves into themes of fate, humanity, and the struggle between man and nature. The novel is also rich with symbolism and philosophical musings.",
        "Herman Melville (1819–1891) was an American novelist, short story writer, and poet of the American Renaissance period. Born in New York City, Melville gained initial fame with novels such as Typee and Omoo, but it was Moby-Dick, published in 1851, that would later be recognized as his masterpiece. Melville’s work is known for its complexity, symbolism, and exploration of themes such as man’s place in the universe, the nature of evil, and the quest for meaning. Despite facing financial difficulties and critical neglect during his lifetime, Melville’s reputation soared posthumously, and he is now considered one of the great American authors."
      ], kind = INSERT}
    ]
    plugin_output = "fake"
  }
}

transform {
  Embedding {
    plugin_input = "fake"
    model_provider = CUSTOM
    model = text-embedding-3-small
    api_key = xxxxxxxx
    api_path = "http://mockserver:1080/v1/doubao/embedding"
    single_vectorized_input_number = 2
    vectorization_fields {
      book_intro_vector = book_intro
      author_biography_vector = author_biography
    }
    custom_config = {
      custom_response_parse = "$.data[*].embedding"
      custom_request_headers = {
        "Content-Type" = "application/json"
        "Authorization" = "Bearer xxxxxxx"
      }
      custom_request_body = {
        modelx = "${model}"
        inputx = ["${input}"]
      }
    }
    plugin_output = "embedding_output_1"
  }
}

sink {
  Assert {
    plugin_input = "embedding_output_1"
    rules = {
      field_rules = [
        {
          field_name = book_id
          field_type = int
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = book_name
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = book_intro
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = author_biography
          field_type = string
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = book_intro_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        },
        {
          field_name = author_biography_vector
          field_type = float_vector
          field_value = [
            {
              rule_type = NOT_NULL
            }
          ]
        }
      ]
    }
  }
}