Process two data sources successively in Apache Flink


I'd like to batch process two files with Apache Flink, one after the other.

For a concrete example: suppose I want to assign an index to each line, such that the lines of the second file follow those of the first. Instead of doing so, the following code interleaves the lines of the two files:

val env = ExecutionEnvironment.getExecutionEnvironment

val text1 = env.readTextFile("/path/to/file1")
val text2 = env.readTextFile("/path/to/file2")

val union = text1.union(text2).flatMap { ... }

I want to make sure all of text1 is sent through the flatMap operator first, and then all of text2. What is the recommended way to do so?

Thanks in advance for the help.

DataSet.union() does not provide any order guarantees across its inputs. Records from the same input partition remain in order, but are merged with records from the other input.

But there is a more fundamental problem. Flink is a parallel data processor. When processing data in parallel, the global order cannot be preserved. For example, when Flink reads files in parallel, it tries to split these files and process each split independently. The splits are handed out without any particular order, hence the records of a single file are already shuffled. You would need to set the parallelism of the whole job to 1 and implement a custom InputFormat to make this work.
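One pattern that does not depend on union order at all is to count the first dataset and offset the second dataset's indices by that count (in Flink, the count could come from `text1.count()`). The indexing logic is sketched below with plain Scala collections; `OffsetIndexing` and the sample inputs are illustrative stand-ins, not Flink API:

```scala
object OffsetIndexing {
  // Assign indices so that every line of `first` precedes every line of `second`.
  def indexSequentially(first: Seq[String], second: Seq[String]): Seq[(Long, String)] = {
    // Index the first file's lines starting at 0.
    val firstIndexed = first.zipWithIndex.map { case (line, i) => (i.toLong, line) }
    // Offset the second file's indices by the first file's line count
    // (in a real Flink job this count would be computed up front).
    val offset = first.size.toLong
    val secondIndexed = second.zipWithIndex.map { case (line, i) => (i + offset, line) }
    firstIndexed ++ secondIndexed
  }

  def main(args: Array[String]): Unit = {
    val first = Seq("line A", "line B")   // stand-in for file1's lines
    val second = Seq("line C", "line D")  // stand-in for file2's lines
    indexSequentially(first, second).foreach(println)
  }
}
```

This sidesteps the ordering problem entirely: each file is indexed independently, so the interleaving of the merged stream no longer matters.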

You can make it work, but it won't be parallel and you'll need to tweak many things. I don't think Flink is the best tool for such a task. Have you considered using simple Unix command-line tools to concatenate the files?
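For instance, a plain pipeline does the whole job (a sketch; `file1` and `file2` stand in for the real paths):

```shell
# Sample inputs standing in for the real files.
printf 'a\nb\n' > file1
printf 'c\nd\n' > file2

# Concatenate in order, then number every line (-ba numbers blank lines too).
cat file1 file2 | nl -ba
```

Since `cat` reads its arguments strictly left to right, all of file1's lines are numbered before any of file2's, which is exactly the ordering the question asks for.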

