scala - specify partition sizes with Spark -


I'm using Spark to process large files, with 12 partitions. I have rdd1 and rdd2, I make a join between them, then a select (rdd3). The problem is that when I inspected the partitions, the last one is much bigger than the others: partitions 1 through 11 have around 45,000 records each, while partition 12 has around 9,100,000 records. I divided 9,100,000 / 45,000 =~ 203 and repartitioned rdd3 to 214 (203 + 11), but the last partition is still too big. How can I balance the size of the partitions?
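For reference, per-partition record counts like the ones above can be checked with a small sketch like the following. It assumes a spark-shell session where sc is predefined and rdd3 is the joined RDD from the question, so the names are placeholders:

    // Count the records in each partition of the joined RDD (rdd3 is assumed to exist).
    val partitionSizes = rdd3
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
    partitionSizes.foreach { case (idx, n) => println(s"partition $idx: $n records") }

    // The repartition that was tried: a full shuffle into 214 partitions.
    val rdd4 = rdd3.repartition(214)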

Should I write my own custom partitioner?

I have rdd1 and rdd2 and make a join between them

Join is an expensive operation in Spark. To be able to join by key, it has to shuffle the values, and if the keys are not uniformly distributed, you get the behavior you described. A custom partitioner won't help in that case.
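One way to confirm that the skew comes from the keys rather than from the partitioner is to look at how many records each join key carries. A minimal sketch, assuming rdd1 is a pair RDD keyed by the join key (adapt the names to your own RDDs):

    // Top 10 heaviest join keys in rdd1; a few dominant keys explain a single huge post-join partition.
    rdd1
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(10)
      .foreach { case (k, n) => println(s"key $k -> $n records") }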

I'd consider adjusting the logic so that it doesn't require a full join.
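As one concrete example of such an adjustment (not prescribed by the answer above, just a common pattern): if one side of the join is small enough to fit in memory, the shuffle join can be replaced with a broadcast lookup, so the hot keys are never concentrated in a single reducer. A sketch, assuming rdd2 is the small side and both RDDs are keyed by the join key:

    // Broadcast the small side and join map-side; only valid if rdd2 fits in driver/executor memory.
    val smallSide = rdd2.collectAsMap()
    val smallBc = sc.broadcast(smallSide)

    val joined = rdd1.flatMap { case (k, v) =>
      smallBc.value.get(k).map(w => (k, (v, w)))  // emit a pair only when the key exists on both sides
    }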

