pyspark - Spark partitioned data with multiple files
I have 5 tables stored as CSV files (a.csv, b.csv, c.csv, d.csv, e.csv). Each file is partitioned by date. If I have a folder structure such as:

    a/ds=2017-07-01/a.csv
    a/ds=2017-07-02/a.csv
    ...
    e/ds=2017-07-02/e.csv

then the command below automatically recognizes the partitions for the table in Spark 2.x:

    data_facts = spark.read\
        .option('inferSchema', 'true')\
        .option('header', 'true')\
        .csv('/filestore/a/')

My question is whether I can still keep the same functionality if the folder structure is instead:

    data/ds=2017-07-01/a.csv
    data/ds=2017-07-01/b.csv
    data/ds=2017-07-01/c.csv
    data/ds=2017-07-01/d.csv
    data/ds=2017-07-01/e.csv
    data/ds=2017-07-02/a.csv
    data/ds=2017-07-02/b.csv
    data/ds=2017-07-02/c.csv
    data/ds=2017-07-02/d.csv
    data/ds=2017-07-02/e.csv

Is there a way to read a table across partitions in this scenario? Or am I better off moving the data into a single folder for each table?
Having the table at the top (a, b, c, etc.), then the ds partition, and the raw CSV files [same schema!] at the bottom is the right approach.

The second style you propose would require ugly hacks to make the partitions available, and to ensure each table holds only its own data without being cross-contaminated by the schemas of the other tables.
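If you do decide to move the data, here is a minimal sketch of one way to regroup the files. It assumes they live on a local or mounted filesystem that Python can reach directly (not HDFS or an object store, which would need their own tooling). The function name `regroup_by_table` and the directory arguments are hypothetical, made up for illustration. It copies rather than moves, so the original layout stays intact until you have checked the result.

```python
import shutil
from pathlib import Path


def regroup_by_table(src_root, dst_root):
    """Copy <src_root>/ds=<date>/<table>.csv to
    <dst_root>/<table>/ds=<date>/<table>.csv and return the new paths.

    Hypothetical helper; assumes a filesystem reachable via pathlib.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    # Each match looks like <src_root>/ds=2017-07-01/a.csv
    for csv_path in sorted(src_root.glob("ds=*/*.csv")):
        table = csv_path.stem             # "a" for a.csv
        partition = csv_path.parent.name  # "ds=2017-07-01"
        target_dir = dst_root / table / partition
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / csv_path.name
        shutil.copy2(csv_path, target)    # copy, leaving the source untouched
        copied.append(target)
    return copied
```

After this, each table has its own top-level folder with the ds=... partition directories underneath, so the `spark.read` command from the question can be pointed at one table's folder at a time.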