amazon s3 - Is it possible to retrieve the list of files when a DataFrame is written, or have Spark store it somewhere?


With a call like df.write.csv("s3a://mybucket/mytable") I know where the files/objects are written, but because of S3's eventual consistency guarantees, I can't be 100% sure that listing that location will return all (or even any) of the files that were just written. If I could get the list of files/objects Spark just wrote, I could prepare a manifest file for a Redshift COPY command without worrying about eventual consistency. Is this possible, and if so, how?

The spark-redshift library can take care of this for you. If you want to do it yourself, you can have a look at how they do it here: https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/redshiftwriter.scala#l299
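If you do build the manifest yourself, the step might look roughly like the sketch below. It assumes you already have the list of written object URLs in hand (the file names here are hypothetical) and only shows the Redshift manifest layout: an "entries" array with a "url" and a "mandatory" flag per file. It is not taken from the spark-redshift source. The scheme rewrite is there because COPY expects s3:// URLs rather than Hadoop-style s3a:// or s3n://.

```python
import json


def build_manifest(file_urls):
    """Return Redshift COPY manifest JSON for the given object URLs."""
    entries = []
    for url in file_urls:
        # Rewrite Hadoop-style schemes to the s3:// form that COPY expects.
        for scheme in ("s3a://", "s3n://"):
            if url.startswith(scheme):
                url = "s3://" + url[len(scheme):]
        # mandatory=True makes COPY fail if a listed file is missing.
        entries.append({"url": url, "mandatory": True})
    return json.dumps({"entries": entries}, indent=2)


# Hypothetical part-file names, for illustration only.
written = [
    "s3a://mybucket/mytable/part-00000.csv",
    "s3a://mybucket/mytable/part-00001.csv",
]
manifest = build_manifest(written)
```

Marking every entry as mandatory means a file that is not yet visible makes the COPY fail instead of being silently skipped, which is what you want when consistency is the concern.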

Edit: I avoid further worry about consistency by using df.coalesce(filecount) to output a known number of file parts (for Redshift you want a multiple of the number of slices in your cluster). You can then check how many files are listed in the Spark code and how many files were loaded in Redshift's stl_load_commits.
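The count check could be sketched along these lines. The function name and inputs are hypothetical: `listed_files` stands for whatever your Spark-side listing returned, and `loaded_files` for the filename values you read back from stl_load_commits for the COPY query. Note that coalesce only reduces the number of partitions, so this assumes filecount does not exceed the DataFrame's current partition count.

```python
def check_file_counts(expected, listed_files, loaded_files):
    """Return a list of problems; an empty list means all counts agree."""
    problems = []
    if len(listed_files) != expected:
        problems.append(
            f"listed {len(listed_files)} files, expected {expected}"
        )
    # Deduplicate, since a load log may contain a file name more than once.
    loaded = set(loaded_files)
    if len(loaded) != expected:
        problems.append(
            f"loaded {len(loaded)} files, expected {expected}"
        )
    return problems


# Hypothetical values: filecount was 2, and both sides report two files.
problems = check_file_counts(
    2,
    ["part-00000.csv", "part-00001.csv"],
    ["part-00000.csv", "part-00001.csv"],
)
```

If the listing comes back short, retrying it after a delay before writing the manifest is one option. A short load count after COPY means something was missed and the load should be treated as failed.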

