google cloud dataflow - How can I improve performance of TextIO or AvroIO when reading a very large number of files?


TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform well in current Apache Beam runners when the filepattern expands into a very large number of files, for example, 1M files.

How can I read such a large number of files efficiently?

When you know in advance that the filepattern being read by TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO.

for example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());
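
The same hint works on AvroIO. As a minimal sketch, assuming you have an org.apache.avro.Schema available (the schema variable and filepattern below are placeholders):

PCollection<GenericRecord> records = p.apply(
    AvroIO.readGenericRecords(schema)  // schema: a placeholder org.apache.avro.Schema
        .from("gs://some-bucket/many/files/*.avro")  // placeholder filepattern
        .withHintMatchesManyFiles());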

Using this hint causes the transforms to execute in a way optimized for reading a large number of files: the number of files that can be read in this case is practically unlimited, and the pipeline will likely run faster, cheaper, and more reliably than without the hint.

However, it may perform worse than without the hint if the filepattern matches only a small number of files (for example, a few dozen or a few hundred files).

Under the hood, this hint causes the transforms to execute via, respectively, TextIO.readAll() or AvroIO.readAll(), which are more flexible and scalable versions of read() that allow reading a PCollection<String> of filepatterns (where each String is a filepattern), with the same caveat: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
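
For illustration, a minimal sketch of feeding filepatterns to TextIO.readAll() directly, assuming the patterns are known as a small in-memory list (the paths below are placeholders; Create is org.apache.beam.sdk.transforms.Create):

// The filepatterns could equally come from an earlier transform in the pipeline.
PCollection<String> filepatterns = p.apply(Create.of(
    "gs://some-bucket/dir-a/*.txt",
    "gs://some-bucket/dir-b/*.txt"));

// readAll() expands each filepattern and reads the matched files,
// scaling to a practically unlimited total number of files.
PCollection<String> lines = filepatterns.apply(TextIO.readAll());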

