# OssFile

> Oss file sink connector
## Support Those Engines

- Spark
- Flink
- SeaTunnel Zeta
## Usage Dependency

### For Spark/Flink Engine

- You must ensure your spark/flink cluster is already integrated with hadoop. The tested hadoop version is 2.x.
- You must ensure `hadoop-aliyun-xx.jar`, `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` are in the `${SEATUNNEL_HOME}/plugins/` dir. The version of the `hadoop-aliyun` jar must equal the hadoop version used in spark/flink, and the versions of `aliyun-sdk-oss-xx.jar` and `jdom-xx.jar` must correspond to the `hadoop-aliyun` version. E.g. `hadoop-aliyun-3.1.4.jar` depends on `aliyun-sdk-oss-3.4.1.jar` and `jdom-1.1.jar`.

### For SeaTunnel Zeta Engine

- You must ensure `seatunnel-hadoop3-3.1.4-uber.jar`, `aliyun-sdk-oss-3.4.1.jar`, `hadoop-aliyun-3.1.4.jar` and `jdom-1.1.jar` are in the `${SEATUNNEL_HOME}/lib/` dir.
## Key features

By default, we use 2PC commit to ensure exactly-once.

- file format type
  - text
  - csv
  - parquet
  - orc
  - json
  - excel
  - xml
  - binary
## Data Type Mapping

If you write to a `csv` or `text` file type, all columns will be strings.
### Orc File Type

| SeaTunnel Data Type | Orc Data Type         |
|---------------------|-----------------------|
| STRING              | STRING                |
| BOOLEAN             | BOOLEAN               |
| TINYINT             | BYTE                  |
| SMALLINT            | SHORT                 |
| INT                 | INT                   |
| BIGINT              | LONG                  |
| FLOAT               | FLOAT                 |
| DOUBLE              | DOUBLE                |
| DECIMAL             | DECIMAL               |
| BYTES               | BINARY                |
| DATE                | DATE                  |
| TIME<br/>TIMESTAMP  | TIMESTAMP             |
| ROW                 | STRUCT                |
| NULL                | UNSUPPORTED DATA TYPE |
| ARRAY               | LIST                  |
| MAP                 | MAP                   |
### Parquet File Type

| SeaTunnel Data Type | Parquet Data Type     |
|---------------------|-----------------------|
| STRING              | STRING                |
| BOOLEAN             | BOOLEAN               |
| TINYINT             | INT_8                 |
| SMALLINT            | INT_16                |
| INT                 | INT32                 |
| BIGINT              | INT64                 |
| FLOAT               | FLOAT                 |
| DOUBLE              | DOUBLE                |
| DECIMAL             | DECIMAL               |
| BYTES               | BINARY                |
| DATE                | DATE                  |
| TIME<br/>TIMESTAMP  | TIMESTAMP_MILLIS      |
| ROW                 | GroupType             |
| NULL                | UNSUPPORTED DATA TYPE |
| ARRAY               | LIST                  |
| MAP                 | MAP                   |
## Options

| Name                                  | Type    | Required | Default                                    | Description                                                                                                                                                             |
|---------------------------------------|---------|----------|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| path                                  | string  | yes      | -                                          | The oss path to write file in.                                                                                                                                          |
| tmp_path                              | string  | no       | /tmp/seatunnel                             | The result file will be written to a tmp path first and then use `mv` to submit the tmp dir to the target dir. Needs an OSS dir.                                        |
| bucket                                | string  | yes      | -                                          |                                                                                                                                                                         |
| access_key                            | string  | yes      | -                                          |                                                                                                                                                                         |
| access_secret                         | string  | yes      | -                                          |                                                                                                                                                                         |
| endpoint                              | string  | yes      | -                                          |                                                                                                                                                                         |
| custom_filename                       | boolean | no       | false                                      | Whether you need to customize the filename                                                                                                                             |
| file_name_expression                  | string  | no       | "${transactionId}"                         | Only used when custom_filename is true                                                                                                                                  |
| filename_time_format                  | string  | no       | "yyyy.MM.dd"                               | Only used when custom_filename is true                                                                                                                                  |
| file_format_type                      | string  | no       | "csv"                                      |                                                                                                                                                                         |
| filename_extension                    | string  | no       | -                                          | Override the default file name extension with a custom one. E.g. `.xml`, `.json`, `dat`, `.customtype`                                                                 |
| field_delimiter                       | string  | no       | '\001'                                     | Only used when file_format_type is text                                                                                                                                 |
| row_delimiter                         | string  | no       | "\n"                                       | Only used when file_format_type is text                                                                                                                                 |
| have_partition                        | boolean | no       | false                                      | Whether you need processing partitions.                                                                                                                                 |
| partition_by                          | array   | no       | -                                          | Only used when have_partition is true                                                                                                                                   |
| partition_dir_expression              | string  | no       | "${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/" | Only used when have_partition is true                                                                                                                                   |
| is_partition_field_write_in_file      | boolean | no       | false                                      | Only used when have_partition is true                                                                                                                                   |
| sink_columns                          | array   | no       | -                                          | When this parameter is empty, all fields are sink columns                                                                                                               |
| is_enable_transaction                 | boolean | no       | true                                       |                                                                                                                                                                         |
| batch_size                            | int     | no       | 1000000                                    |                                                                                                                                                                         |
| compress_codec                        | string  | no       | none                                       |                                                                                                                                                                         |
| common-options                        | object  | no       | -                                          |                                                                                                                                                                         |
| max_rows_in_memory                    | int     | no       | -                                          | Only used when file_format_type is excel.                                                                                                                               |
| sheet_name                            | string  | no       | Sheet${Random number}                      | Only used when file_format_type is excel.                                                                                                                               |
| csv_string_quote_mode                 | enum    | no       | MINIMAL                                    | Only used when file_format_type is csv.                                                                                                                                 |
| xml_root_tag                          | string  | no       | RECORDS                                    | Only used when file_format_type is xml.                                                                                                                                 |
| xml_row_tag                           | string  | no       | RECORD                                     | Only used when file_format_type is xml.                                                                                                                                 |
| xml_use_attr_format                   | boolean | no       | -                                          | Only used when file_format_type is xml.                                                                                                                                 |
| single_file_mode                      | boolean | no       | false                                      | Each parallelism will only output one file. When this parameter is turned on, batch_size will not take effect. The output file name does not have a file block suffix. |
| create_empty_file_when_no_data        | boolean | no       | false                                      | When there is no data synchronization upstream, the corresponding data files are still generated.                                                                      |
| parquet_avro_write_timestamp_as_int96 | boolean | no       | false                                      | Only used when file_format_type is parquet.                                                                                                                             |
| parquet_avro_write_fixed_as_int96     | array   | no       | -                                          | Only used when file_format_type is parquet.                                                                                                                             |
| enable_header_write                   | boolean | no       | false                                      | Only used when file_format_type is text,csv. false: don't write header; true: write header.                                                                             |
| encoding                              | string  | no       | "UTF-8"                                    | Only used when file_format_type is json,text,csv,xml.                                                                                                                   |
### path [string]

The target dir path is required.

### bucket [string]

The bucket address of oss file system, for example: `oss://tyrantlucifer-image-bed`

### access_key [string]

The access key of oss file system.

### access_secret [string]

The access secret of oss file system.

### endpoint [string]

The endpoint of oss file system.
### custom_filename [boolean]

Whether to customize the filename.

### file_name_expression [string]

Only used when `custom_filename` is `true`.

`file_name_expression` describes the file expression which will be created into the `path`. We can add the variable `${now}` or `${uuid}` in the `file_name_expression`, like `test_${uuid}_${now}`. `${now}` represents the current time, and its format can be defined by specifying the option `filename_time_format`.

Please note that, if `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file.
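For instance, a sink combining these options might look like the following sketch (the bucket, credentials and endpoint are placeholders, not real values):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "json"
    custom_filename = true
    # yields names like test_<uuid>_2024.01.01.json, prefixed with <transactionId>_ while transactions are enabled
    file_name_expression = "test_${uuid}_${now}"
    filename_time_format = "yyyy.MM.dd"
  }
}
```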
### filename_time_format [string]

Only used when `custom_filename` is `true`.

When the format in the `file_name_expression` parameter is `xxxx-${now}`, `filename_time_format` can specify the time format of the path, and the default value is `yyyy.MM.dd`. The commonly used time formats are listed as follows:

| Symbol | Description        |
|--------|--------------------|
| y      | Year               |
| M      | Month              |
| d      | Day of month       |
| H      | Hour in day (0-23) |
| m      | Minute in hour     |
| s      | Second in minute   |
### file_format_type [string]

We support the following file types:

- text
- csv
- parquet
- orc
- json
- excel
- xml
- binary

Please note that the final file name will end with the file_format_type's suffix; the suffix of the text file is `txt`.
### field_delimiter [string]

The separator between columns in a row of data. Only needed by `text` file format.

### row_delimiter [string]

The separator between rows in a file. Only needed by `text` file format.
### have_partition [boolean]

Whether you need processing partitions.

### partition_by [array]

Only used when `have_partition` is `true`.

Partition data based on selected fields.

### partition_dir_expression [string]

Only used when `have_partition` is `true`.

If `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.

Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.

### is_partition_field_write_in_file [boolean]

Only used when `have_partition` is `true`.

If `is_partition_field_write_in_file` is `true`, the partition field and its value will be written into the data file.

For example, if you want to write a Hive Data File, its value should be `false`.
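Putting the partition options together, here is a sketch of a Hive-style layout (bucket and credentials are placeholders; the `age` field follows the FakeSource examples below):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "text"
    have_partition = true
    partition_by = ["age"]
    # a row with age = 20 lands under /seatunnel/sink/age=20/
    partition_dir_expression = "${k0}=${v0}"
    # false keeps the partition column out of the data file itself, as Hive expects
    is_partition_field_write_in_file = false
  }
}
```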
### sink_columns [array]

Which columns need to be written to the file; the default value is all the columns obtained from `Transform` or `Source`. The order of the fields determines the order in which the file is actually written.
### is_enable_transaction [boolean]

If `is_enable_transaction` is `true`, we will ensure that data will not be lost or duplicated when it is written to the target directory.

Please note that, if `is_enable_transaction` is `true`, we will auto add `${transactionId}_` in the head of the file.

Only `true` is supported now.
### batch_size [int]

The maximum number of rows in a file. For SeaTunnel Engine, the number of lines in a file is determined jointly by `batch_size` and `checkpoint.interval`. If the value of `checkpoint.interval` is large enough, the sink writer will keep writing rows into a file until the rows in the file exceed `batch_size`. If `checkpoint.interval` is small, the sink writer will create a new file when a new checkpoint triggers.
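As a sketch of that interplay, assuming a streaming job where `checkpoint.interval` (milliseconds) is set in the `env` block, files roll over at whichever limit is hit first:

```hocon
env {
  parallelism = 1
  job.mode = "STREAMING"
  # assumption: checkpoint every 60s; each checkpoint closes the current file
  checkpoint.interval = 60000
}

sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "text"
    # a file also rolls once it reaches 500k rows, even between checkpoints
    batch_size = 500000
  }
}
```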
### compress_codec [string]

The compress codec of files; the supported codecs per format are as follows:

- txt: `lzo` `none`
- json: `lzo` `none`
- csv: `lzo` `none`
- orc: `lzo` `snappy` `lz4` `zlib` `none`
- parquet: `lzo` `snappy` `lz4` `gzip` `brotli` `zstd` `none`

Tips: excel type does not support any compression format.
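For example, here is a sketch writing zlib-compressed orc files (bucket and credentials are placeholders):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "orc"
    # zlib is only listed for orc; pick a codec from the per-format lists above
    compress_codec = "zlib"
  }
}
```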
### common options

Sink plugin common parameters, please refer to Sink Common Options for details.
### max_rows_in_memory [int]

When the file format is Excel, the maximum number of data items that can be cached in memory.

### sheet_name [string]

The sheet name of the workbook to write.
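A sketch of an excel sink using both options (the values are illustrative):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "excel"
    # bound the rows cached in memory while building the workbook
    max_rows_in_memory = 1024
    sheet_name = "result"
  }
}
```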
### csv_string_quote_mode [string]

When the file format is CSV, the string quote mode of CSV.

- ALL: All string fields will be quoted.
- MINIMAL: Quotes fields which contain special characters such as the field delimiter, quote character or any of the characters in the line separator string.
- NONE: Never quotes fields. When the delimiter occurs in data, the printer prefixes it with the escape character. If the escape character is not set, format validation throws an exception.
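As a sketch, with `ALL` a row `(name = "foo", age = 1)` would be written as `"foo",1`, since only string fields are quoted:

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "csv"
    # ALL quotes every string field; MINIMAL (the default) quotes only when needed
    csv_string_quote_mode = "ALL"
  }
}
```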
### xml_root_tag [string]

Specifies the tag name of the root element within the XML file.

### xml_row_tag [string]

Specifies the tag name of the data rows within the XML file.

### xml_use_attr_format [boolean]

Specifies whether to process data using the tag attribute format.
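A sketch of an xml sink; with `xml_use_attr_format` set to `true`, fields would be emitted as attributes of each row tag (e.g. `<RECORD name="foo" age="1"/>`) rather than as child elements:

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "xml"
    xml_root_tag = "RECORDS"
    xml_row_tag = "RECORD"
    xml_use_attr_format = true
  }
}
```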
### parquet_avro_write_timestamp_as_int96 [boolean]

Support writing Parquet INT96 from a timestamp, only valid for parquet files.

### parquet_avro_write_fixed_as_int96 [array]

Support writing Parquet INT96 from a 12-byte field, only valid for parquet files.

### encoding [string]

Only used when file_format_type is json,text,csv,xml. The encoding of the file to write. This param will be parsed by `Charset.forName(encoding)`.
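For example, to write text files in GBK (any charset name accepted by `Charset.forName` should work; an unknown name raises an exception):

```hocon
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://example-bucket"          # placeholder
    access_key = "xxxxxxxxxxx"               # placeholder
    access_secret = "xxxxxxxxxxx"            # placeholder
    endpoint = "oss-cn-beijing.aliyuncs.com" # placeholder
    file_format_type = "text"
    # parsed via Charset.forName("GBK")
    encoding = "GBK"
  }
}
```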
## How to Create an Oss Data Synchronization Job

The following example demonstrates how to create a data synchronization job that reads data from Fake Source and writes it to the Oss.

For text file format with `have_partition`, `custom_filename` and `sink_columns`:
```hocon
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
}

# Create a source to produce data
source {
  FakeSource {
    schema = {
      fields {
        name = string
        age = int
      }
    }
  }
}

# Write data to Oss
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxx"
    access_secret = "xxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "text"
    field_delimiter = "\t"
    row_delimiter = "\n"
    have_partition = true
    partition_by = ["age"]
    partition_dir_expression = "${k0}=${v0}"
    is_partition_field_write_in_file = true
    custom_filename = true
    file_name_expression = "${transactionId}_${now}"
    filename_time_format = "yyyy.MM.dd"
    sink_columns = ["name","age"]
    is_enable_transaction = true
  }
}
```
For parquet file format with `have_partition` and `sink_columns`:
```hocon
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
}

# Create a source to produce data
source {
  FakeSource {
    schema = {
      fields {
        name = string
        age = int
      }
    }
  }
}

# Write data to Oss
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    have_partition = true
    partition_by = ["age"]
    partition_dir_expression = "${k0}=${v0}"
    is_partition_field_write_in_file = true
    file_format_type = "parquet"
    sink_columns = ["name","age"]
  }
}
```
For orc file format simple config:
```hocon
# Set the basic configuration of the task to be performed
env {
  parallelism = 1
  job.mode = "BATCH"
}

# Create a source to produce data
source {
  FakeSource {
    schema = {
      fields {
        name = string
        age = int
      }
    }
  }
}

# Write data to Oss
sink {
  OssFile {
    path = "/seatunnel/sink"
    bucket = "oss://tyrantlucifer-image-bed"
    access_key = "xxxxxxxxxxx"
    access_secret = "xxxxxxxxxxx"
    endpoint = "oss-cn-beijing.aliyuncs.com"
    file_format_type = "orc"
  }
}
```
### enable_header_write [boolean]

Only used when file_format_type is text,csv. false: don't write header; true: write header.
## Multiple Table

To extract source metadata from upstream, you can use `${database_name}`, `${table_name}` and `${schema_name}` in the path.
```hocon
env {
  parallelism = 1
  spark.app.name = "SeaTunnel"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.master = local
  job.mode = "BATCH"
}

source {
  FakeSource {
    tables_configs = [
      {
        schema = {
          table = "fake1"
          fields {
            c_map = "map<string, string>"
            c_array = "array<int>"
            c_string = string
            c_boolean = boolean
            c_tinyint = tinyint
            c_smallint = smallint
            c_int = int
            c_bigint = bigint
            c_float = float
            c_double = double
            c_bytes = bytes
            c_date = date
            c_decimal = "decimal(38, 18)"
            c_timestamp = timestamp
            c_row = {
              c_map = "map<string, string>"
              c_array = "array<int>"
              c_string = string
              c_boolean = boolean
              c_tinyint = tinyint
              c_smallint = smallint
              c_int = int
              c_bigint = bigint
              c_float = float
              c_double = double
              c_bytes = bytes
              c_date = date
              c_decimal = "decimal(38, 18)"
              c_timestamp = timestamp
            }
          }
        }
      },
      {
        schema = {
          table = "fake2"
          fields {
            c_map = "map<string, string>"
            c_array = "array<int>"
            c_string = string
            c_boolean = boolean
            c_tinyint = tinyint
            c_smallint = smallint
            c_int = int
            c_bigint = bigint
            c_float = float
            c_double = double
            c_bytes = bytes
            c_date = date
            c_decimal = "decimal(38, 18)"
            c_timestamp = timestamp
            c_row = {
              c_map = "map<string, string>"
              c_array = "array<int>"
              c_string = string
              c_boolean = boolean
              c_tinyint = tinyint
              c_smallint = smallint
              c_int = int
              c_bigint = bigint
              c_float = float
              c_double = double
              c_bytes = bytes
              c_date = date
              c_decimal = "decimal(38, 18)"
              c_timestamp = timestamp
            }
          }
        }
      }
    ]
  }
}

sink {
  OssFile {
    bucket = "oss://whale-ops"
    access_key = "xxxxxxxxxxxxxxxxxxx"
    access_secret = "xxxxxxxxxxxxxxxxxxx"
    endpoint = "https://oss-accelerate.aliyuncs.com"
    path = "/tmp/fake_empty/text/${table_name}"
    row_delimiter = "\n"
    partition_dir_expression = "${k0}=${v0}"
    is_partition_field_write_in_file = true
    file_name_expression = "${transactionId}_${now}"
    file_format_type = "text"
    filename_time_format = "yyyy.MM.dd"
    is_enable_transaction = true
    compress_codec = "lzo"
  }
}
```
## Changelog

| Change | Commit | Version |
|--------|--------|---------|
| Revert " [improve] update localfile connector config" (#9018) | [cdc79e13a](https://github.com/apache/seatunnel/commit/cdc79e13a) | 2.3.10 |
| [improve] update localfile connector config (#8765) | [def369a85](https://github.com/apache/seatunnel/commit/def369a85) | 2.3.10 |
| [Feature][Connector-V2] Add filename_extension parameter for read/write file (#8769) | [78b23c0ef](https://github.com/apache/seatunnel/commit/78b23c0ef) | 2.3.10 |
| [Improve] restruct connector common options (#8634) | [f3499a6ee](https://github.com/apache/seatunnel/commit/f3499a6ee) | 2.3.10 |
| [Feature][Connector-V2] Support create emtpy file when no data (#8543) | [275db7891](https://github.com/apache/seatunnel/commit/275db7891) | 2.3.10 |
| [Feature][Connector-V2] Support single file mode in file sink (#8518) | [e893deed5](https://github.com/apache/seatunnel/commit/e893deed5) | 2.3.10 |
| [Feature][File] Support config null format for text file read (#8109) | [2dbf02df4](https://github.com/apache/seatunnel/commit/2dbf02df4) | 2.3.9 |
| [Improve][API] Unified tables_configs and table_list (#8100) | [84c0b8d66](https://github.com/apache/seatunnel/commit/84c0b8d66) | 2.3.9 |
| [Feature][Restapi] Allow metrics information to be associated to logical plan nodes (#7786) | [6b7c53d03](https://github.com/apache/seatunnel/commit/6b7c53d03) | 2.3.9 |
| [Improve][Connector-V2] Support read archive compress file (#7633) | [3f98cd8a1](https://github.com/apache/seatunnel/commit/3f98cd8a1) | 2.3.8 |
| [Improve] Added OSSFileCatalog and it's factory (#7458) | [9006a205d](https://github.com/apache/seatunnel/commit/9006a205d) | 2.3.8 |
| [Improve][Connector] Add multi-table sink option check (#7360) | [2489f6446](https://github.com/apache/seatunnel/commit/2489f6446) | 2.3.7 |
| [Feature][Core] Support using upstream table placeholders in sink options and auto replacement (#7131) | [c4ca74122](https://github.com/apache/seatunnel/commit/c4ca74122) | 2.3.6 |
| [Improve][Files] Support write fixed/timestamp as int96 of parquet (#6971) | [1a48a9c49](https://github.com/apache/seatunnel/commit/1a48a9c49) | 2.3.6 |
| [Chore] Fix file spell errors (#6606) | [2599d3b73](https://github.com/apache/seatunnel/commit/2599d3b73) | 2.3.5 |
| [Fix][Connector-V2] Fix connector support SPI but without no args constructor (#6551) | [5f3c9c36a](https://github.com/apache/seatunnel/commit/5f3c9c36a) | 2.3.5 |
| Add support for XML file type to various file connectors such as SFTP, FTP, LocalFile, HdfsFile, and more. (#6327) | [ec533ecd9](https://github.com/apache/seatunnel/commit/ec533ecd9) | 2.3.5 |
| [Feature][OssFile Connector] Make Oss implement source factory and sink factory (#6062) | [1a8e9b455](https://github.com/apache/seatunnel/commit/1a8e9b455) | 2.3.4 |
| [Refactor][File Connector] Put Multiple Table File API to File Base Module (#6033) | [c324d663b](https://github.com/apache/seatunnel/commit/c324d663b) | 2.3.4 |
| [Hotfix][Oss File Connector] fix oss connector can not run bug (#6010) | [755bc2a73](https://github.com/apache/seatunnel/commit/755bc2a73) | 2.3.4 |
| Support using multiple hadoop account (#5903) | [d69d88d1a](https://github.com/apache/seatunnel/commit/d69d88d1a) | 2.3.4 |
| [Improve][Common] Introduce new error define rule (#5793) | [9d1b2582b](https://github.com/apache/seatunnel/commit/9d1b2582b) | 2.3.4 |
| [Improve][connector-file] unifiy option between file source/sink and update document (#5680) | [8d87cf8fc](https://github.com/apache/seatunnel/commit/8d87cf8fc) | 2.3.4 |
| [Feature] Support LZO compress on File Read (#5083) | [a4a190109](https://github.com/apache/seatunnel/commit/a4a190109) | 2.3.4 |
| [Feature][Connector-V2][File] Support read empty directory (#5591) | [1f58f224a](https://github.com/apache/seatunnel/commit/1f58f224a) | 2.3.4 |
| Support config column/primaryKey/constraintKey in schema (#5564) | [eac76b4e5](https://github.com/apache/seatunnel/commit/eac76b4e5) | 2.3.4 |
| [Feature][File Connector]optionrule FILE_FORMAT_TYPE is text/csv ,add parameter BaseSinkConfig.ENABLE_HEADER_WRITE: #5566 (#5567) | [0e02db768](https://github.com/apache/seatunnel/commit/0e02db768) | 2.3.4 |
| [Feature][Connector V2][File] Add config of 'file_filter_pattern', which used for filtering files. (#5153) | [a3c13e59e](https://github.com/apache/seatunnel/commit/a3c13e59e) | 2.3.3 |
| [Fix][Connector-V2] Fix file-oss config check bug and amend file-oss-jindo factoryIdentifier (#4581) | [5c4f17df2](https://github.com/apache/seatunnel/commit/5c4f17df2) | 2.3.2 |
| [Feature][ConnectorV2]add file excel sink and source (#4164) | [e3b97ae5d](https://github.com/apache/seatunnel/commit/e3b97ae5d) | 2.3.2 |
| Change file type to file_format_type in file source/sink (#4249) | [973a2fae3](https://github.com/apache/seatunnel/commit/973a2fae3) | 2.3.1 |
| Merge branch 'dev' into merge/cdc | [4324ee191](https://github.com/apache/seatunnel/commit/4324ee191) | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. | [423b58303](https://github.com/apache/seatunnel/commit/423b58303) | 2.3.1 |
| [improve][api] Refactoring schema parse (#4157) | [b2f573a13](https://github.com/apache/seatunnel/commit/b2f573a13) | 2.3.1 |
| [Improve][build] Give the maven module a human readable name (#4114) | [d7cd60105](https://github.com/apache/seatunnel/commit/d7cd60105) | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. (#4101) | [a2ab16656](https://github.com/apache/seatunnel/commit/a2ab16656) | 2.3.1 |
| [Feature][Connector-V2][File] Support compress (#3899) | [55602f6b1](https://github.com/apache/seatunnel/commit/55602f6b1) | 2.3.1 |
| [Feature][Connector] add get source method to all source connector (#3846) | [417178fb8](https://github.com/apache/seatunnel/commit/417178fb8) | 2.3.1 |
| [Improve][Connector-V2][File] Improve file connector option rule and document (#3812) | [bd7607766](https://github.com/apache/seatunnel/commit/bd7607766) | 2.3.1 |
| [Hotfix][OptionRule] Fix option rule about all connectors (#3592) | [226dc6a11](https://github.com/apache/seatunnel/commit/226dc6a11) | 2.3.0 |
| [Improve][Connector-V2][File] Unified excetion for file source & sink connectors (#3525) | [031e8e263](https://github.com/apache/seatunnel/commit/031e8e263) | 2.3.0 |
| [Feature][Connector-V2][File] Add option and factory for file connectors (#3375) | [db286e863](https://github.com/apache/seatunnel/commit/db286e863) | 2.3.0 |
| [Improve][Connector-V2][File] Improve code structure (#3238) | [dd5c35388](https://github.com/apache/seatunnel/commit/dd5c35388) | 2.3.0 |
| [Connector-V2][ElasticSearch] Add ElasticSearch Source/Sink Factory (#3325) | [38254e3f2](https://github.com/apache/seatunnel/commit/38254e3f2) | 2.3.0 |
| [Improve][Connector-V2][File] Support parse field from file path (#2985) | [0bc12085c](https://github.com/apache/seatunnel/commit/0bc12085c) | 2.3.0-beta |
| [Improve][connector][file] Support user-defined schema for reading text file (#2976) | [1c05ee0d7](https://github.com/apache/seatunnel/commit/1c05ee0d7) | 2.3.0-beta |
| [Improve][Connector] Improve write parquet (#2943) | [8fd966394](https://github.com/apache/seatunnel/commit/8fd966394) | 2.3.0-beta |
| [Fix][Connector-V2] Fix HiveSource Connector read orc table error (#2845) | [61720306e](https://github.com/apache/seatunnel/commit/61720306e) | 2.2.0-beta |
| [Improve][Connector-V2] Improve read parquet (#2841) | [e19bc82f9](https://github.com/apache/seatunnel/commit/e19bc82f9) | 2.2.0-beta |
| [Feature][Connector-V2] Add oss sink (#2629) | [bb2ad4048](https://github.com/apache/seatunnel/commit/bb2ad4048) | 2.2.0-beta |
| [#2606]Dependency management split (#2630) | [fc047be69](https://github.com/apache/seatunnel/commit/fc047be69) | 2.2.0-beta |
| [chore][connector-common] Rename SeatunnelSchema to SeaTunnelSchema (#2538) | [7dc2a2738](https://github.com/apache/seatunnel/commit/7dc2a2738) | 2.2.0-beta |
| [Feature][Connector-V2] Add oss source connector (#2467) | [712b77744](https://github.com/apache/seatunnel/commit/712b77744) | 2.2.0-beta |