pyspark - Spark partitioned data multiple files

April 15, 2010

I have 5 tables stored as CSV files (a.csv, b.csv, c.csv, d.csv, e.csv). Each file is partitioned by date. If I have a folder structure such as:

a/ds=2017-07-01/a.csv
a/ds=2017-07-02/a.csv
...
e/ds=2017-07-02/e.csv

then the command below automatically recognizes the partitions of the table in Spark 2.x:

data_facts = spark.read\
    .option('inferschema', 'true')\
    .option('header', 'true')\
    .csv('/filestore/a/')

My question is whether I can still keep the same functionality if the folder structure is instead:

data/ds=2017-07-01/a.csv
data/ds=2017-07-01/b.csv
data/ds=2017-07-01/c.csv
data/ds=2017-07-01/d.csv
data/ds=2017-07-01/e.csv
data/ds=2017-07-02/a.csv
data/ds=2017-07-02/b.csv
data/ds=2017-07-02/c.csv
data/ds=2017-07-02/d.csv
data/ds=2017-07-02/e.csv

Is there a way to read a single table across partitions in this scenario? Or am I better off moving the data into a single folder for each table?

Having the table at the top (a, b, c, etc.), followed by the ds partition, with the raw [same schema!] CSV files at the bottom, is the right approach.

The second style you propose would require ugly hacks to make the partitions available and to ensure that each table contains only its relevant data, without being cross-contaminated by the schemas of other tables.
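If the data already exists in the second layout, one way to get to the table-first layout is to regroup the files once. The following is only a minimal sketch, assuming the files live on a local (or locally mounted) filesystem; the function name `regroup_by_table` and the `src_root`/`dst_root` arguments are made up for illustration and are not part of Spark. It copies rather than moves, so the original tree is left untouched.

```python
import shutil
from pathlib import Path


def regroup_by_table(src_root, dst_root):
    """Copy <src_root>/ds=<date>/<table>.csv
    to   <dst_root>/<table>/ds=<date>/<table>.csv.

    Hypothetical helper: assumes every CSV sits directly inside a
    'ds=<date>' directory and that the file stem is the table name.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for csv_path in sorted(src_root.glob("ds=*/*.csv")):
        partition_dir = csv_path.parent.name  # e.g. "ds=2017-07-01"
        table = csv_path.stem                 # e.g. "a"
        target = dst_root / table / partition_dir / csv_path.name
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(csv_path, target)        # copy; the source stays in place
        copied.append(target)
    return copied
```

After regrouping, each table directory (for example `<dst_root>/a/`) has the same shape as the first layout in the question, so the read command shown there should apply to it unchanged. For data on HDFS or another distributed store, the same regrouping would need that store's own tooling instead of `shutil`.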
