pyspark - Spark partitioned data multiple files -


I have 5 tables stored as CSV files (a.csv, b.csv, c.csv, d.csv, e.csv). Each file is partitioned by date. If I have a folder structure such as:

a/ds=2017-07-01/a.csv
a/ds=2017-07-02/a.csv
...
e/ds=2017-07-02/e.csv

then the command below automatically recognizes the partitions of the table in Spark 2.x:

data_facts = spark.read \
    .option('inferSchema', 'true') \
    .option('header', 'true') \
    .csv('/filestore/a/')

My question is whether I can still keep the same functionality if the folder structure is instead:

data/ds=2017-07-01/a.csv
data/ds=2017-07-01/b.csv
data/ds=2017-07-01/c.csv
data/ds=2017-07-01/d.csv
data/ds=2017-07-01/e.csv
data/ds=2017-07-02/a.csv
data/ds=2017-07-02/b.csv
data/ds=2017-07-02/c.csv
data/ds=2017-07-02/d.csv
data/ds=2017-07-02/e.csv

Is there a way to read one table across partitions in this scenario? Or am I better off moving the data into a single folder for each table?

Having the table at the top (a, b, c, etc.), the ds partition below it, and the raw [same schema!] CSV files at the bottom is the right approach.
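If the files already sit in the date-first layout, regrouping them table-first could be scripted roughly like this. This is only a sketch for a local or mounted filesystem: the function name regroup_by_table and both directory arguments are hypothetical, and it assumes each table's name is simply the CSV file name without its extension.

```python
import shutil
from pathlib import Path


def regroup_by_table(src_root, dst_root):
    """Copy src_root/ds=<date>/<table>.csv to dst_root/<table>/ds=<date>/<table>.csv.

    Hypothetical helper: assumes the table name is the file stem and that
    every partition directory is named ds=<date>.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for csv_file in sorted(src_root.glob('ds=*/*.csv')):
        table = csv_file.stem              # e.g. 'a'
        partition = csv_file.parent.name   # e.g. 'ds=2017-07-01'
        target = dst_root / table / partition / csv_file.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(csv_file, target)     # copy, so the source stays intact
        copied.append(target)
    return copied
```

Calling regroup_by_table('data', 'tables') would produce tables/a/ds=2017-07-01/a.csv and so on, after which each table folder can be read with the same spark.read call as in the question. Copying rather than moving leaves the original layout in place until the new one has been checked.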

The second style you propose would require ugly hacks to make the partitions available and to ensure each table contains only its own data, without cross-contaminated schemas from the other tables.
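For illustration, one such hack is to select a single table's files with a glob and set Spark's basePath option so that ds is still discovered as a partition column. The sketch below is untested against your data; table_glob and read_table are hypothetical helper names, and spark is assumed to be an existing SparkSession passed in by the caller.

```python
def table_glob(base_path, table):
    """Build a glob matching one table's CSV file in every ds=... folder."""
    return '{}/ds=*/{}.csv'.format(base_path.rstrip('/'), table)


def read_table(spark, base_path, table):
    """Read one table from the date-first layout (hypothetical helper).

    The basePath option tells Spark where partition discovery starts, so
    the ds=... directory level should still appear as a 'ds' column even
    though the load path is a glob.
    """
    return (spark.read
            .option('basePath', base_path)
            .option('inferSchema', 'true')
            .option('header', 'true')
            .csv(table_glob(base_path, table)))
```

A call such as read_table(spark, '/filestore/data/', 'a') would then load only the a.csv files. Every read has to repeat the file-name filter, which is the main reason the table-first layout is easier to live with.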

