Process two data sources successively in Apache Flink


I'd like to batch process two files with Apache Flink, one after the other.

For a concrete example: suppose I want to assign an index to each line, such that the lines of the second file follow those of the first. Instead of doing so, the following code interleaves the lines of the two files:

val env = ExecutionEnvironment.getExecutionEnvironment

val text1 = env.readTextFile("/path/to/file1")
val text2 = env.readTextFile("/path/to/file2")

val union = text1.union(text2).flatMap { ... }

I want to make sure all of text1 is sent through the flatMap operator first, and then all of text2. What is the recommended way to do so?

Thanks in advance for the help.

DataSet.union() does not provide any order guarantees across its inputs. Records from the same input partition remain in order, but are merged with records from the other input.

But there is a more fundamental problem. Flink is a parallel data processor. When processing data in parallel, the global order cannot be preserved. For example, when Flink reads files in parallel, it tries to split these files and process each split independently. The splits are handed out without any particular order, hence the records of a single file are already shuffled. You would need to set the parallelism of the whole job to 1 and implement a custom InputFormat to make this work.
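One pattern that does not depend on union order at all is to count the first dataset and offset the second dataset's indices by that count (in Flink, the count could come from `text1.count()`). The indexing logic is sketched below with plain Scala collections; `OffsetIndexing` and the sample inputs are illustrative stand-ins, not Flink API:

```scala
object OffsetIndexing {
  // Assign indices so that every line of `first` precedes every line of `second`.
  def indexSequentially(first: Seq[String], second: Seq[String]): Seq[(Long, String)] = {
    // Index the first file's lines starting at 0.
    val firstIndexed = first.zipWithIndex.map { case (line, i) => (i.toLong, line) }
    // Offset the second file's indices by the first file's line count
    // (in a real Flink job this count would be computed up front).
    val offset = first.size.toLong
    val secondIndexed = second.zipWithIndex.map { case (line, i) => (i + offset, line) }
    firstIndexed ++ secondIndexed
  }

  def main(args: Array[String]): Unit = {
    val first = Seq("line A", "line B")   // stand-in for file1's lines
    val second = Seq("line C", "line D")  // stand-in for file2's lines
    indexSequentially(first, second).foreach(println)
  }
}
```

This sidesteps the ordering problem entirely: each file is indexed independently, so the interleaving of the merged stream no longer matters.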

You can make it work, but it won't be parallel and you'll need to tweak many things. I don't think Flink is the best tool for such a task. Have you considered using simple Unix command-line tools to concatenate the files?
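For instance, a plain pipeline does the whole job (a sketch; `file1` and `file2` stand in for the real paths):

```shell
# Sample inputs standing in for the real files.
printf 'a\nb\n' > file1
printf 'c\nd\n' > file2

# Concatenate in order, then number every line (-ba numbers blank lines too).
cat file1 file2 | nl -ba
```

Since `cat` reads its arguments strictly left to right, all of file1's lines are numbered before any of file2's, which is exactly the ordering the question asks for.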

