Process two data sources successively in Apache Flink
I'd like to batch process two files with Apache Flink, one after the other.
For a concrete example: suppose I want to assign an index to each line, such that the lines of the second file follow the lines of the first. Instead of doing so, the following code interleaves the lines of the two files:
val env = ExecutionEnvironment.getExecutionEnvironment
val text1 = env.readTextFile("/path/to/file1")
val text2 = env.readTextFile("/path/to/file2")
val union = text1.union(text2).flatMap { ... }
I want to make sure all of text1 is sent through the flatMap operator first, and then all of text2. What is the recommended way to do so?
Thanks in advance for your help.
DataSet.union() does not provide order guarantees across its inputs. Records from the same input partition remain in order, but they are merged with records from the other input in an arbitrary interleaving.
But there is a more fundamental problem. Flink is a parallel data processor. When processing data in parallel, the global order cannot be preserved. For example, when Flink reads files in parallel, it splits these files and processes each split independently. The splits are handed out without any particular order, so the records of even a single file may be shuffled. You would need to set the parallelism of the whole job to 1 and implement a custom InputFormat to make this work.
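As a rough sketch of the parallelism-1 setup described above (assuming the DataSet API; the file paths are placeholders, and a custom, order-preserving InputFormat would still be required, since union() by itself gives no cross-input ordering even at parallelism 1):

```scala
import org.apache.flink.api.scala._

object SequentialJobSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Run the whole job in a single task, so splits of one file
    // are not processed concurrently.
    env.setParallelism(1)

    val text1 = env.readTextFile("/path/to/file1")
    val text2 = env.readTextFile("/path/to/file2")

    // Even at parallelism 1, the merge order between text1 and text2
    // is still undefined; a custom InputFormat reading the files in
    // sequence would be needed for a real guarantee.
    val union = text1.union(text2)
    union.print()
  }
}
```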
You can make it work, but it won't be parallel and you will need to tweak many things. I don't think Flink is the best tool for such a task. Have you considered using simple Unix command-line tools to concatenate the files?
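For the line-indexing example from the question, the Unix route is a one-liner. A minimal sketch (the file names are placeholders) that concatenates the files in order and prefixes each line with a 0-based index:

```shell
# Concatenate file1 then file2, and number each resulting line starting at 0.
# awk's NR counts lines across the whole concatenated stream.
cat file1.txt file2.txt | awk '{ printf "%d\t%s\n", NR - 1, $0 }'
```

Because cat emits the files strictly in argument order, every line of the second file is guaranteed to follow every line of the first, which is exactly the ordering the Flink union does not provide.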