FakeSource
Support Those Engines
Spark
Flink
SeaTunnel Zeta
Description
FakeSource is a virtual data source that randomly generates rows according to a user-defined schema. It is intended for test scenarios such as type-conversion checks or testing new connector features.
Key Features
Source Options
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| tables_configs | list | no | - | Defines multiple FakeSources; each item can contain the whole fake source configuration described below |
| schema | config | yes | - | Defines the schema information |
| rows | config | no | - | The row list of fake data output per degree of parallelism; see the Options rows Case section |
| row.num | int | no | 5 | The total number of rows generated per degree of parallelism |
| split.num | int | no | 1 | The number of splits generated by the enumerator for each degree of parallelism |
| split.read-interval | long | no | 1 | The interval (in milliseconds) between two split reads in a reader |
| map.size | int | no | 5 | The size of generated map values |
| array.size | int | no | 5 | The size of generated array values |
| bytes.length | int | no | 5 | The length of generated bytes values |
| string.length | int | no | 5 | The length of generated string values |
| string.fake.mode | string | no | range | The fake mode for generating string data; supports range and template (default range). If set to template, the string.template option must also be configured |
| string.template | list | no | - | The template list for string values; if configured, the connector randomly selects an item from the list |
| tinyint.fake.mode | string | no | range | The fake mode for generating tinyint data; supports range and template (default range). If set to template, the tinyint.template option must also be configured |
| tinyint.min | tinyint | no | 0 | The minimum value of generated tinyint data |
| tinyint.max | tinyint | no | 127 | The maximum value of generated tinyint data |
| tinyint.template | list | no | - | The template list for tinyint values; if configured, the connector randomly selects an item from the list |
| smallint.fake.mode | string | no | range | The fake mode for generating smallint data; supports range and template (default range). If set to template, the smallint.template option must also be configured |
| smallint.min | smallint | no | 0 | The minimum value of generated smallint data |
| smallint.max | smallint | no | 32767 | The maximum value of generated smallint data |
| smallint.template | list | no | - | The template list for smallint values; if configured, the connector randomly selects an item from the list |
| int.fake.mode | string | no | range | The fake mode for generating int data; supports range and template (default range). If set to template, the int.template option must also be configured |
| int.min | int | no | 0 | The minimum value of generated int data |
| int.max | int | no | 0x7fffffff | The maximum value of generated int data |
| int.template | list | no | - | The template list for int values; if configured, the connector randomly selects an item from the list |
| bigint.fake.mode | string | no | range | The fake mode for generating bigint data; supports range and template (default range). If set to template, the bigint.template option must also be configured |
| bigint.min | bigint | no | 0 | The minimum value of generated bigint data |
| bigint.max | bigint | no | 0x7fffffffffffffff | The maximum value of generated bigint data |
| bigint.template | list | no | - | The template list for bigint values; if configured, the connector randomly selects an item from the list |
| float.fake.mode | string | no | range | The fake mode for generating float data; supports range and template (default range). If set to template, the float.template option must also be configured |
| float.min | float | no | 0 | The minimum value of generated float data |
| float.max | float | no | 0x1.fffffeP+127 | The maximum value of generated float data |
| float.template | list | no | - | The template list for float values; if configured, the connector randomly selects an item from the list |
| double.fake.mode | string | no | range | The fake mode for generating double data; supports range and template (default range). If set to template, the double.template option must also be configured |
| double.min | double | no | 0 | The minimum value of generated double data |
| double.max | double | no | 0x1.fffffffffffffP+1023 | The maximum value of generated double data |
| double.template | list | no | - | The template list for double values; if configured, the connector randomly selects an item from the list |
| vector.dimension | int | no | 4 | The dimension of generated vectors, excluding binary vectors |
| binary.vector.dimension | int | no | 8 | The dimension of generated binary vectors |
| vector.float.min | float | no | 0 | The minimum value of float data in generated vectors |
| vector.float.max | float | no | 0x1.fffffeP+127 | The maximum value of float data in generated vectors |
| common-options | | no | - | Source plugin common parameters; please refer to Source Common Options for details |
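For orientation, the options above can be combined into a minimal complete job. The sketch below is only an illustrative example (it assumes the Console sink is available in your SeaTunnel distribution), not a canonical configuration:

```hocon
env {
  parallelism = 1
  job.mode = "BATCH"
}

source {
  FakeSource {
    row.num = 10
    string.length = 8
    schema = {
      fields {
        id = bigint
        name = string
      }
    }
    plugin_output = "fake"
  }
}

sink {
  Console {
    plugin_input = "fake"
  }
}
```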
Task Example
Simple:
This example randomly generates data of the specified types. If you want to learn how to declare field types, click here.
schema = {
fields {
c_map = "map<string, array<int>>"
c_map_nest = "map<string, {c_int = int, c_string = string}>"
c_array = "array<int>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(30, 8)"
c_null = "null"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
c_row = {
c_map = "map<string, map<string, string>>"
c_array = "array<int>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(30, 8)"
c_null = "null"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
}
}
}
Random Generation
16 rows matching the schema are randomly generated.
source {
# This is an example input plugin **only for testing and demonstrating input plugin features**
FakeSource {
row.num = 16
schema = {
fields {
c_map = "map<string, string>"
c_array = "array<int>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(30, 8)"
c_null = "null"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
}
}
plugin_output = "fake"
}
}
Customize the data content Simple:
This example defines the data source content itself: each row specifies whether it is an insert, update, or delete operation, and what each field stores.
source {
FakeSource {
schema = {
fields {
c_map = "map<string, string>"
c_array = "array<int>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(30, 8)"
c_null = "null"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
}
}
rows = [
{
kind = INSERT
fields = [{"a": "b"}, [101], "c_string", true, 117, 15987, 56387395, 7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null, "bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
}
{
kind = UPDATE_BEFORE
fields = [{"a": "c"}, [102], "c_string", true, 117, 15987, 56387395, 7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null, "bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
}
{
kind = UPDATE_AFTER
fields = [{"a": "e"}, [103], "c_string", true, 117, 15987, 56387395, 7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null, "bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
}
{
kind = DELETE
fields = [{"a": "f"}, [104], "c_string", true, 117, 15987, 56387395, 7084913402530365000, 1.23, 1.23, "2924137191386439303744.39292216", null, "bWlJWmo=", "2023-04-22", "2023-04-22T23:20:58"]
}
]
}
}
Due to the constraints of the HOCON specification, users cannot directly create byte sequence objects. FakeSource uses strings to assign values to `bytes` type fields. In the example above, the `bytes` field is assigned `"bWlJWmo="`, which is the Base64 encoding of "miIZj". Hence, when assigning values to `bytes` fields, please use Base64-encoded strings.
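The Base64 round trip described above can be checked with a few lines of standard-library Python (shown here only to illustrate the encoding; FakeSource itself is configured purely through HOCON):

```python
import base64

# Encode the raw byte sequence "miIZj" into the form the bytes field expects.
encoded = base64.b64encode(b"miIZj").decode("ascii")
print(encoded)  # bWlJWmo=

# Decoding the configured string recovers the original bytes.
decoded = base64.b64decode("bWlJWmo=")
print(decoded)  # b'miIZj'
```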
Specified Data number Simple:
This case specifies the number of rows generated and the length or size of the generated values.
FakeSource {
row.num = 10
map.size = 10
array.size = 10
bytes.length = 10
string.length = 10
schema = {
fields {
c_map = "map<string, array<int>>"
c_array = "array<int>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(30, 8)"
c_null = "null"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
c_row = {
c_map = "map<string, map<string, string>>"
c_array = "array<int>"
c_string = string
c_boolean = boolean
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
c_decimal = "decimal(30, 8)"
c_null = "null"
c_bytes = bytes
c_date = date
c_timestamp = timestamp
}
}
}
}
Template data Simple:
Values are randomly selected from the specified templates.
Using template
FakeSource {
row.num = 5
string.fake.mode = "template"
string.template = ["tyrantlucifer", "hailin", "kris", "fanjia", "zongwen", "gaojun"]
tinyint.fake.mode = "template"
tinyint.template = [1, 2, 3, 4, 5, 6, 7, 8, 9]
smallint.fake.mode = "template"
smallint.template = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
int.fake.mode = "template"
int.template = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
bigint.fake.mode = "template"
bigint.template = [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
float.fake.mode = "template"
float.template = [40.0, 41.0, 42.0, 43.0]
double.fake.mode = "template"
double.template = [44.0, 45.0, 46.0, 47.0]
schema {
fields {
c_string = string
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
}
}
}
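Conceptually, template mode amounts to a uniform random pick from the configured list for every generated row. The Python sketch below is only a mental model of that behavior, not the connector's actual implementation:

```python
import random

# Template lists taken from the example configuration above.
string_template = ["tyrantlucifer", "hailin", "kris", "fanjia", "zongwen", "gaojun"]
tinyint_template = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def fake_row():
    # Each field independently draws one value from its template list.
    return {
        "c_string": random.choice(string_template),
        "c_tinyint": random.choice(tinyint_template),
    }

rows = [fake_row() for _ in range(5)]  # row.num = 5
```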
Range data Simple:
Values are randomly generated within the specified ranges.
FakeSource {
row.num = 5
string.template = ["tyrantlucifer", "hailin", "kris", "fanjia", "zongwen", "gaojun"]
tinyint.min = 1
tinyint.max = 9
smallint.min = 10
smallint.max = 19
int.min = 20
int.max = 29
bigint.min = 30
bigint.max = 39
float.min = 40.0
float.max = 43.0
double.min = 44.0
double.max = 47.0
schema {
fields {
c_string = string
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
}
}
}
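Range mode, by contrast, draws each value between the configured min and max. A rough Python equivalent follows (assuming inclusive bounds for illustration; the connector's exact boundary behavior is not specified here):

```python
import random

def fake_int(lo, hi):
    # Draw an integer uniformly from [lo, hi].
    return random.randint(lo, hi)

def fake_float(lo, hi):
    # Draw a float uniformly from [lo, hi].
    return random.uniform(lo, hi)

# Ranges taken from the example configuration above.
row = {
    "c_tinyint": fake_int(1, 9),
    "c_int": fake_int(20, 29),
    "c_float": fake_float(40.0, 43.0),
}
```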
Generate Multiple tables
This case generates a multi-table data source: test.table1 and test.table2.
FakeSource {
tables_configs = [
{
row.num = 16
schema {
table = "test.table1"
fields {
c_string = string
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
}
}
},
{
row.num = 17
schema {
table = "test.table2"
fields {
c_string = string
c_tinyint = tinyint
c_smallint = smallint
c_int = int
c_bigint = bigint
c_float = float
c_double = double
}
}
}
]
}
Options rows Case
rows = [
{
kind = INSERT
fields = [1, "A", 100]
},
{
kind = UPDATE_BEFORE
fields = [1, "A", 100]
},
{
kind = UPDATE_AFTER
fields = [1, "A_1", 100]
},
{
kind = DELETE
fields = [1, "A_1", 100]
}
]
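The four `kind` values form a changelog stream. The Python sketch below is only a mental model of how a downstream keyed sink would typically interpret the example rows above; it is not SeaTunnel code:

```python
# Replay the example changelog against an in-memory table keyed by the first field.
changelog = [
    ("INSERT",        (1, "A", 100)),
    ("UPDATE_BEFORE", (1, "A", 100)),    # old image of the row being updated
    ("UPDATE_AFTER",  (1, "A_1", 100)),  # new image of the same row
    ("DELETE",        (1, "A_1", 100)),
]

table = {}
for kind, fields in changelog:
    key = fields[0]
    if kind in ("INSERT", "UPDATE_AFTER"):
        table[key] = fields
    elif kind == "DELETE":
        table.pop(key, None)
    # UPDATE_BEFORE carries the pre-update image; keyed sinks can usually ignore it.

print(table)  # {} -- the row was inserted, updated, then deleted
```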
Options table-names Case
source {
# This is an example source plugin **only for testing and demonstrating source plugin features**
FakeSource {
table-names = ["test.table1", "test.table2", "test.table3"]
parallelism = 1
schema = {
fields {
name = "string"
age = "int"
}
}
}
}
Options defaultValue Case
Custom data can be generated via rows or columns. For time types, the current time can be obtained with CURRENT_TIMESTAMP, CURRENT_TIME, and CURRENT_DATE.
schema = {
fields {
pk_id = bigint
name = string
score = int
time1 = timestamp
time2 = time
time3 = date
}
}
# use rows
rows = [
{
kind = INSERT
fields = [1, "A", 100, CURRENT_TIMESTAMP, CURRENT_TIME, CURRENT_DATE]
}
]
schema = {
# use columns
columns = [
{
name = book_publication_time
type = timestamp
defaultValue = "2024-09-12 15:45:30"
comment = "book publication time"
},
{
name = book_publication_time2
type = timestamp
defaultValue = CURRENT_TIMESTAMP
comment = "book publication time2"
},
{
name = book_publication_time3
type = time
defaultValue = "15:45:30"
comment = "book publication time3"
},
{
name = book_publication_time4
type = time
defaultValue = CURRENT_TIME
comment = "book publication time4"
},
{
name = book_publication_time5
type = date
defaultValue = "2024-09-12"
comment = "book publication time5"
},
{
name = book_publication_time6
type = date
defaultValue = CURRENT_DATE
comment = "book publication time6"
}
]
}
Use Vector Example
source {
FakeSource {
row.num = 10
# Low priority
vector.dimension = 4
binary.vector.dimension = 8
# Low priority
schema = {
table = "simple_example"
columns = [
{
name = book_id
type = bigint
nullable = false
defaultValue = 0
comment = "primary key id"
},
{
name = book_intro_1
type = binary_vector
columnScale = 8
comment = "vector"
},
{
name = book_intro_2
type = float16_vector
columnScale = 4
comment = "vector"
},
{
name = book_intro_3
type = bfloat16_vector
columnScale = 4
comment = "vector"
},
{
name = book_intro_4
type = sparse_float_vector
columnScale = 4
comment = "vector"
}
]
}
}
}
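As a mental model, a float vector column of dimension 4 is simply four floats drawn between vector.float.min and vector.float.max. The Python sketch below is illustrative only; the connector's actual sampling details are not specified here:

```python
import random

def fake_float_vector(dim, lo, hi):
    # One value per dimension, uniform in [lo, hi].
    return [random.uniform(lo, hi) for _ in range(dim)]

# e.g. vector.dimension = 4 with a [0.0, 1.0] float range
vec = fake_float_vector(4, 0.0, 1.0)
```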
Change Log
| Change | Commit | Version |
|---|---|---|
| [improve] fake source options (#8950) | https://github.com/apache/seatunnel/commit/f8c47fb5f | 2.3.10 |
| [Improve] restruct connector common options (#8634) | https://github.com/apache/seatunnel/commit/f3499a6ee | 2.3.10 |
| [Feature][API] Support timestamp with timezone offset (#8367) | https://github.com/apache/seatunnel/commit/e18bfeabd | 2.3.9 |
| [Improve][dist]add shade check rule (#8136) | https://github.com/apache/seatunnel/commit/51ef80001 | 2.3.9 |
| [Improve][API] Unified tables_configs and table_list (#8100) | https://github.com/apache/seatunnel/commit/84c0b8d66 | 2.3.9 |
| [Feature][Core] Rename result_table_name/source_table_name to plugin_input/plugin_output (#8072) | https://github.com/apache/seatunnel/commit/c7bbd322d | 2.3.9 |
| [Improve][Fake] Improve memory usage when split size is large (#7821) | https://github.com/apache/seatunnel/commit/2d41b024c | 2.3.9 |
| [Improve][Connector-V2] Time supports default value (#7639) | https://github.com/apache/seatunnel/commit/33978689f | 2.3.8 |
| [Improve][Connector-V2] Fake supports column configuration (#7503) | https://github.com/apache/seatunnel/commit/39162a4e0 | 2.3.8 |
| [Feature][Core] Add event notify for all connector (#7501) | https://github.com/apache/seatunnel/commit/d71337b0e | 2.3.8 |
| [Improve][Connector-V2] update vectorType (#7446) | https://github.com/apache/seatunnel/commit/1bba72385 | 2.3.8 |
| [Feature][Connector-V2] Fake Source support produce vector data (#7401) | https://github.com/apache/seatunnel/commit/6937d10ac | 2.3.8 |
| [Feature][Kafka] Support multi-table source read (#5992) | https://github.com/apache/seatunnel/commit/60104602d | 2.3.6 |
| [Feature][Doris] Add Doris type converter (#6354) | https://github.com/apache/seatunnel/commit/518999184 | 2.3.6 |
| [Feature][Core] Support event listener for job (#6419) | https://github.com/apache/seatunnel/commit/831d0022e | 2.3.5 |
| [Fix][FakeSource] fix random from template not include the latest value issue (#6438) | https://github.com/apache/seatunnel/commit/6ec16ac46 | 2.3.5 |
| [Improve][Catalog] Use default tablepath when can not get the tablepath from source config (#6276) | https://github.com/apache/seatunnel/commit/f8158bb80 | 2.3.4 |
| [Improve][Connector-V2] Replace CommonErrorCodeDeprecated.JSON_OPERATION_FAILED (#5978) | https://github.com/apache/seatunnel/commit/456cd1771 | 2.3.4 |
| FakeSource support generate different CatalogTable for MultipleTable (#5766) | https://github.com/apache/seatunnel/commit/a8b93805e | 2.3.4 |
| [Improve][Common] Introduce new error define rule (#5793) | https://github.com/apache/seatunnel/commit/9d1b2582b | 2.3.4 |
| [Improve] Add default implement for SeaTunnelSource::getProducedType (#5670) | https://github.com/apache/seatunnel/commit/a04add699 | 2.3.4 |
| Support config tableIdentifier for schema (#5628) | https://github.com/apache/seatunnel/commit/652921fb7 | 2.3.4 |
| [Feature] Add table-names from FakeSource/Assert to produce/assert multi-table (#5604) | https://github.com/apache/seatunnel/commit/2c67cd8f3 | 2.3.4 |
| Support config column/primaryKey/constraintKey in schema (#5564) | https://github.com/apache/seatunnel/commit/eac76b4e5 | 2.3.4 |
| [Improve][CheckStyle] Remove useless 'SuppressWarnings' annotation of checkstyle. (#5260) | https://github.com/apache/seatunnel/commit/51c0d709b | 2.3.4 |
| [improve][zeta] fix zeta bugs | https://github.com/apache/seatunnel/commit/3a82e8b39 | 2.3.1 |
| [chore] Code format with spotless plugin. | https://github.com/apache/seatunnel/commit/291214ad6 | 2.3.1 |
| Merge branch 'dev' into merge/cdc | https://github.com/apache/seatunnel/commit/4324ee191 | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. | https://github.com/apache/seatunnel/commit/423b58303 | 2.3.1 |
| [improve][api] Refactoring schema parse (#4157) | https://github.com/apache/seatunnel/commit/b2f573a13 | 2.3.1 |
| [Improve][build] Give the maven module a human readable name (#4114) | https://github.com/apache/seatunnel/commit/d7cd60105 | 2.3.1 |
| [Improve][Project] Code format with spotless plugin. (#4101) | https://github.com/apache/seatunnel/commit/a2ab16656 | 2.3.1 |
| [Improve][Connector-fake] Optimizing Data Generation Strategies refer to #4004 (#4061) | https://github.com/apache/seatunnel/commit/c7c596a6d | 2.3.1 |
| [Improve][Connector-V2][Fake] Improve fake connector (#3932) | https://github.com/apache/seatunnel/commit/31f12431d | 2.3.1 |
| [Feature][Connector-v2][StarRocks] Support write cdc changelog event(INSERT/UPDATE/DELETE) (#3865) | https://github.com/apache/seatunnel/commit/8e3d158c0 | 2.3.1 |
| [Feature][Connector] add get source method to all source connector (#3846) | https://github.com/apache/seatunnel/commit/417178fb8 | 2.3.1 |
| [Feature][API & Connector & Doc] add parallelism and column projection interface (#3829) | https://github.com/apache/seatunnel/commit/b9164b8ba | 2.3.1 |
| [Hotfix][OptionRule] Fix option rule about all connectors (#3592) | https://github.com/apache/seatunnel/commit/226dc6a11 | 2.3.0 |
| [Improve][Connector-V2][Fake] Unified exception for fake source connector (#3520) | https://github.com/apache/seatunnel/commit/f371ad582 | 2.3.0 |
| [Connector-V2][Fake] Add Fake TableSourceFactory (#3345) | https://github.com/apache/seatunnel/commit/74b61c33a | 2.3.0 |
| [Connector-V2][ElasticSearch] Add ElasticSearch Source/Sink Factory (#3325) | https://github.com/apache/seatunnel/commit/38254e3f2 | 2.3.0 |
| [Improve][Engine] Improve Engine performance. (#3216) | https://github.com/apache/seatunnel/commit/7393c4732 | 2.3.0 |
| [hotfix][connector][fake] fix FakeSourceSplitEnumerator assigning duplicate splits when restoring (#3112) | https://github.com/apache/seatunnel/commit/98b1feda8 | 2.3.0-beta |
| [improve][connector][fake] supports setting the number of split rows and reading interval (#3098) | https://github.com/apache/seatunnel/commit/efabe6af7 | 2.3.0-beta |
| [feature][connector][fake] Support mutil splits for fake source connector (#2974) | https://github.com/apache/seatunnel/commit/c28c44b7c | 2.3.0-beta |
| [E2E][ST-Engine] Add test data consistency in 3 node cluster and fix bug (#3038) | https://github.com/apache/seatunnel/commit/97400a6f1 | 2.3.0-beta |
| [Improve][all] change Log to @Slf4j (#3001) | https://github.com/apache/seatunnel/commit/6016100f1 | 2.3.0-beta |
| [Improve][Connector-V2] Improve fake source connector (#2944) | https://github.com/apache/seatunnel/commit/044f62ef3 | 2.3.0-beta |
| [Improve][Connector-v2-Fake]Supports direct definition of data values(row) (#2839) | https://github.com/apache/seatunnel/commit/b7d9dde6c | 2.3.0-beta |
| [Connector-V2][ElasticSearch] Fix ElasticSearch Connector V2 Bug (#2817) | https://github.com/apache/seatunnel/commit/2fcbbf464 | 2.2.0-beta |
| [DEV][Api] Replace SeaTunnelContext with JobContext and remove singleton pattern (#2706) | https://github.com/apache/seatunnel/commit/cbf82f755 | 2.2.0-beta |
| [Bug][connector-fake] Fake date calculation error(#2573) | https://github.com/apache/seatunnel/commit/9ea01298f | 2.2.0-beta |
| [Bug][ConsoleSinkV2]fix fieldToString StackOverflow and add Unit-Test (#2545) | https://github.com/apache/seatunnel/commit/6f8709456 | 2.2.0-beta |
| [chore][connector-common] Rename SeatunnelSchema to SeaTunnelSchema (#2538) | https://github.com/apache/seatunnel/commit/7dc2a2738 | 2.2.0-beta |
| [Imporve][Fake-Connector-V2]support user-defined-schmea and random data for fake-table (#2406) | https://github.com/apache/seatunnel/commit/a5447528c | 2.2.0-beta |
| [api-draft][Optimize] Optimize module name (#2062) | https://github.com/apache/seatunnel/commit/f79e3112b | 2.2.0-beta |