Version: 2.3.0-beta

File

File source connector

Description

Read data from local or hdfs file.

tip

Engine Supported and plugin name

Spark: File
Flink: File

Options

Spark
Flink

name	type	required	default value
format	string	no	json
path	string	yes	-
common-options	string	yes	-

format [string]

Format for reading files, currently supports text, parquet, json, orc, csv.

name	type	required	default value
format.type	string	yes	-
path	string	yes	-
schema	string	yes	-
common-options	string	no	-
parallelism	int	no	-

format.type [string]

The format for reading files from the file system, currently supports csv , json , parquet , orc and text .

schema [string]

csv
- The schema of csv is a string of jsonArray , such as "[{\"type\":\"long\"},{\"type\":\"string\"}]" , this can only specify the type of the field , The field name cannot be specified, and the common configuration parameter field_name is generally required.
json
- The schema parameter of json is to provide a json string of the original data, and the schema can be automatically generated, but the original data with the most complete content needs to be provided, otherwise the fields will be lost.
parquet
- The schema of parquet is an Avro schema string , such as {\"type\":\"record\",\"name\":\"test\",\"fields\":[{\"name\" :\"a\",\"type\":\"int\"},{\"name\":\"b\",\"type\":\"string\"}]} .
orc
- The schema of orc is the string of orc schema , such as "struct<name:string,addresses:array<struct<street:string,zip:smallint>>>" .
text
- The schema of text can be filled with string .

parallelism [`Int`]

The parallelism of an individual operator, for FileSource

path [string]

If read data from hdfs , the file path should start with hdfs://
If read data from local , the file path should start with file://

common options [string]

Source plugin common parameters, please refer to Source Plugin for details

Examples

Spark
Flink

file {
    path = "hdfs:///var/logs"
    result_table_name = "access_log"
}

file {
    path = "file:///var/logs"
    result_table_name = "access_log"
}

    FileSource{
    path = "hdfs://localhost:9000/input/"
    format.type = "json"
    schema = "{\"data\":[{\"a\":1,\"b\":2},{\"a\":3,\"b\":4}],\"db\":\"string\",\"q\":{\"s\":\"string\"}}"
    result_table_name = "test"
}

File

Description​

Options​

format [string]​

format.type [string]​

schema [string]​

parallelism [Int]​

path [string]​

common options [string]​

Examples​