scala - Specify partition sizes with Spark
I'm using Spark to process large files, with 12 partitions. I have rdd1 and rdd2, make a join between them, and then select the result (rdd3). The problem is that the last partition is much bigger than the others:

partitions 1 to 11: ~45,000 records each
partition 12: ~9,100,000 records

So I divided 9,100,000 / 45,000 =~ 203 and repartitioned rdd3 to 214 (203 + 11), but the last partition is still too big. How can I balance the sizes of the partitions? Should I write my own custom partitioner?
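For reference, a minimal sketch of the setup described above, including a way to print how many records end up in each partition. The input paths, key type, and names are illustrative assumptions, not taken from the original post:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PartitionSizes {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sizes"))

    // Illustrative pair RDDs keyed by the join key; the real job reads large files.
    val rdd1: RDD[(String, String)] = sc.textFile("hdfs:///input1") // hypothetical path
      .map { line => val f = line.split(","); (f(0), f(1)) }
    val rdd2: RDD[(String, String)] = sc.textFile("hdfs:///input2") // hypothetical path
      .map { line => val f = line.split(","); (f(0), f(1)) }

    // Join the two RDDs and repartition the result to 214 partitions,
    // as described in the question.
    val rdd3 = rdd1.join(rdd2).repartition(214)

    // Count records per partition to see whether the sizes are balanced.
    rdd3.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx: $n records") }
  }
}
```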
"I have rdd1 and rdd2, make a join between them"

join is an expensive operation in Spark. To be able to join by key, it has to shuffle the values, and if the keys are not uniformly distributed, you get the behavior you described. A custom partitioner won't help in this case.
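If you want to confirm that the imbalance comes from a hot key rather than from the partition count, a quick sketch (rdd1 and the String key type are illustrative assumptions):

```scala
// Count records per key on one side of the join to check for a hot key.
// rdd1 is an illustrative name for a pair RDD keyed by the join key.
val hotKeys = rdd1
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .top(10)(Ordering.by[(String, Long), Long](_._2)) // ten most frequent keys

hotKeys.foreach { case (key, count) => println(s"$key -> $count records") }
```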
I'd consider adjusting your logic so that it doesn't require a full join.
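As one illustration of that suggestion: if one side of the join happens to be small enough to fit in memory (an assumption, not something stated in the question), the shuffle join can be replaced by a broadcast map-side join:

```scala
// Broadcast map-side join: avoids the shuffle entirely, so a single hot key
// cannot overload one partition. Assumes rdd2 fits in memory (an illustration,
// not something stated in the question).
val smallSide = sc.broadcast(rdd2.collectAsMap().toMap)

val joined = rdd1.flatMap { case (key, left) =>
  smallSide.value.get(key).map(right => (key, (left, right)))
}
```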