amazon s3 - Is it possible to retrieve the list of files when a DataFrame is written, or have Spark store it somewhere?

March 15, 2013

With a call like df.write.csv("s3a://mybucket/mytable") I know where the files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that listing that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?

The spark-redshift library can take care of this for you. If you want to do it yourself, you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/redshiftwriter.scala#l299
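If you do build the manifest yourself, the file itself is just a small JSON document. Below is a minimal sketch, not the spark-redshift implementation: the helper names `to_redshift_url` and `build_manifest` are made up for illustration, and the sketch assumes you already hold the list of object paths that were written. It also assumes the paths may carry the Hadoop `s3a://` or `s3n://` scheme, which is rewritten to the `s3://` form used in COPY manifests.

```python
import json


def to_redshift_url(path):
    """Rewrite Hadoop-style s3a:// or s3n:// paths to plain s3:// URLs."""
    for scheme in ("s3a://", "s3n://"):
        if path.startswith(scheme):
            return "s3://" + path[len(scheme):]
    return path


def build_manifest(written_paths):
    """Return the JSON text of a Redshift COPY manifest for the given paths.

    written_paths is assumed to be the list of objects Spark wrote;
    how that list is obtained is outside this sketch.
    """
    entries = [
        # "mandatory": True makes COPY fail if a listed file is not found,
        # instead of silently loading fewer files.
        {"url": to_redshift_url(p), "mandatory": True}
        for p in sorted(written_paths)
    ]
    return json.dumps({"entries": entries}, indent=2)
```

Marking every entry as mandatory is a deliberate choice here: if an object is not yet visible to Redshift, the load fails loudly rather than skipping the file.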

Edit: I avoid further worry about consistency by using df.coalesce(filecount) to output a known number of file parts (for Redshift you want a multiple of the number of slices in your cluster). You can then check how many files are listed in the Spark code and how many files were loaded in Redshift's stl_load_commits.
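The two checks described in the edit can be sketched roughly as follows. The function names and the `multiplier` parameter are hypothetical, and the SQL string is an assumption about how one might query stl_load_commits for the most recent COPY in the current session; adjust it to your own setup before relying on it.

```python
def file_count_for(slices, multiplier=1):
    """Pick a file count that is a multiple of the cluster's slice count."""
    if slices < 1 or multiplier < 1:
        raise ValueError("slices and multiplier must be positive")
    return slices * multiplier


def missing_files(expected, loaded):
    """Return the expected files that do not appear among the loaded ones.

    Loaded names are stripped first, on the assumption that the filename
    column comes back padded with trailing spaces.
    """
    loaded_clean = {name.strip() for name in loaded}
    return sorted(set(expected) - loaded_clean)


# Assumed query, to be run in the same session right after the COPY.
LOADED_FILES_SQL = (
    "select distinct trim(filename) "
    "from stl_load_commits "
    "where query = pg_last_copy_id()"
)
```

On the Spark side, the count would be passed as `df.coalesce(file_count_for(slices))` before writing. Note that coalesce can only reduce the number of partitions, so if the DataFrame has fewer partitions than the target, `repartition` would be needed instead.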
