Hudi
Hudi source connector
Description
Read data from Hudi.
Engine Supported and plugin name
- Spark: Hudi
- Flink
Options
name | type | required | default value |
---|---|---|---|
hoodie.datasource.read.paths | string | yes | - |
hoodie.file.index.enable | boolean | no | - |
hoodie.datasource.read.end.instanttime | string | no | - |
hoodie.datasource.write.precombine.field | string | no | - |
hoodie.datasource.read.incr.filters | string | no | - |
hoodie.datasource.merge.type | string | no | - |
hoodie.datasource.read.begin.instanttime | string | no | - |
hoodie.enable.data.skipping | string | no | - |
as.of.instant | string | no | - |
hoodie.datasource.query.type | string | no | - |
hoodie.datasource.read.schema.use.end.instanttime | string | no | - |
Refer to the Hudi read options documentation for details on each configuration.
hoodie.datasource.read.paths
Comma separated list of file paths to read within a Hudi table.
hoodie.file.index.enable
Enables the Spark file index implementation for Hudi, which speeds up listing of large tables.
hoodie.datasource.read.end.instanttime
Upper bound on the instant time of incrementally fetched data. New data written with an instant_time <= END_INSTANTTIME is fetched.
hoodie.datasource.write.precombine.field
Field used for pre-combining before the actual write. When two records have the same key value, the one with the largest value for the precombine field, as determined by Object.compareTo(..), is picked.
hoodie.datasource.read.incr.filters
For use cases like DeltaStreamer, which reads from a Hoodie incremental table and applies opaque map functions, filters that appear late in the sequence of transformations cannot be pushed down automatically. This option allows filters to be set directly on the Hoodie source.
hoodie.datasource.merge.type
For snapshot queries on a merge-on-read table, controls whether the record payload implementation is invoked to merge records (payload_combine) or merging is skipped altogether (skip_merge).
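As a rough sketch of how this option might be combined with the query type, the block below requests a snapshot read that skips merging. It assumes a merge-on-read table, and the path is the same placeholder used in the example on this page.

```hocon
hudi {
  # Placeholder path, replace with the actual table location
  hoodie.datasource.read.paths = "hdfs://"
  # Snapshot query (assumed here; see hoodie.datasource.query.type)
  hoodie.datasource.query.type = "snapshot"
  # Skip the record payload merge instead of using payload_combine
  hoodie.datasource.merge.type = "skip_merge"
}
```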
hoodie.datasource.read.begin.instanttime
Instant time to start incrementally pulling data from. The instant time given here does not need to correspond to an instant on the timeline. New data written with an instant_time > BEGIN_INSTANTTIME is fetched. For example, '20170901080000' fetches all new data written after Sep 1, 2017 08:00 AM.
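A minimal sketch of an incremental read over a bounded window, assuming the incremental mode is selected through hoodie.datasource.query.type. The end instant value is hypothetical and chosen only for illustration; the path is the same placeholder used in the example on this page.

```hocon
hudi {
  # Placeholder path, replace with the actual table location
  hoodie.datasource.read.paths = "hdfs://"
  # Incremental mode (assumed to be required for the instant bounds to apply)
  hoodie.datasource.query.type = "incremental"
  # Exclusive lower bound: fetch data with instant_time > 20170901080000
  hoodie.datasource.read.begin.instanttime = "20170901080000"
  # Inclusive upper bound (hypothetical value): instant_time <= 20170902080000
  hoodie.datasource.read.end.instanttime = "20170902080000"
}
```

With both bounds set, only data written in the window (begin, end] would be expected in the result.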
hoodie.enable.data.skipping
Enables data skipping to speed up queries after a z-order optimization has been run on the current table.
as.of.instant
The query instant for time travel. If this option is not specified, the latest snapshot is queried.
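A small sketch of a time travel read, assuming the instant is written in the same yyyyMMddHHmmss form as the instant times elsewhere on this page. The instant value is illustrative only.

```hocon
hudi {
  # Placeholder path, replace with the actual table location
  hoodie.datasource.read.paths = "hdfs://"
  # Read the table as it was at this instant (illustrative value)
  as.of.instant = "20170901080000"
}
```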
hoodie.datasource.query.type
Selects how data is read: incremental mode (new data since an instant time), read optimized mode (latest view based on base files only), or snapshot mode (latest view obtained by merging base files and, if any, log files).
hoodie.datasource.read.schema.use.end.instanttime
Uses the schema of the end instant when fetching data incrementally. Default: uses the latest instant schema.
Example
hudi {
  hoodie.datasource.read.paths = "hdfs://"
}