google cloud dataflow - How can I improve performance of TextIO or AvroIO when reading a very large number of files?


TextIO.read() and AvroIO.read() (as well as some other Beam IOs) by default don't perform well in current Apache Beam runners when the filepattern expands into a very large number of files - for example, 1M files.

How can I read such a large number of files efficiently?

When you know in advance that the filepattern being read with TextIO or AvroIO is going to expand into a large number of files, you can use the recently added feature .withHintMatchesManyFiles(), which is currently implemented on TextIO and AvroIO.

For example:

PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://some-bucket/many/files/*")
    .withHintMatchesManyFiles());
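The same hint applies to Avro reads. A minimal sketch, assuming a generated Avro record class MyRecord (the class name and path are hypothetical, for illustration only):

// Read Avro files matching a pattern that expands to many files.
// MyRecord is an assumed Avro-generated class.
PCollection<MyRecord> records = p.apply(AvroIO.read(MyRecord.class)
    .from("gs://some-bucket/many/avro/files/*")
    .withHintMatchesManyFiles());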

Using this hint causes the transforms to execute in a way that is optimized for reading a large number of files: the number of files that can be read in this case is practically unlimited, and the pipeline will most likely run faster, cheaper, and more reliably than it would without the hint.

However, it may perform worse than without the hint if the filepattern matches only a small number of files (for example, a few dozen or a few hundred files).

Under the hood, this hint causes the transforms to execute via TextIO.readAll() or AvroIO.readAll() respectively. These are more flexible and scalable versions of read() that allow reading a PCollection<String> of filepatterns (where each String is a filepattern), with the same caveat: if the total number of files matching the filepatterns is small, they may perform worse than a simple read() with the filepattern specified at pipeline construction time.
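For illustration, a minimal sketch of using TextIO.readAll() directly on a PCollection of filepatterns (the bucket names are made up; Create is org.apache.beam.sdk.transforms.Create):

// Build a PCollection of filepatterns, then expand and read them all.
PCollection<String> filepatterns = p.apply(Create.of(
    "gs://bucket-a/many/files/*",
    "gs://bucket-b/many/more/files/*"));
PCollection<String> lines = filepatterns.apply(TextIO.readAll());

This form is useful when the set of filepatterns is itself computed by the pipeline rather than known at construction time.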

