python - Apply custom function to cells of selected columns of a data frame in PySpark


Let's say we have a data frame that looks like this:

+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+

I want to apply a custom function directly to the strings in the address1 and address2 columns, for example:

def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))

    return intersection_count
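(On plain Python strings the function behaves as expected; the inputs below are made up purely for illustration.)

example('main street 1', 'main street 2')  # returns 2, since {'main', 'street'} is the shared set of words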

I want to store the result in a new column, so the final data frame would look like:

+---+-----------+-----------+------+
| id|   address1|   address2|result|
+---+-----------+-----------+------+
|  1|address 1.1|address 1.2|     2|
|  2|address 2.1|address 2.2|     7|
+---+-----------+-----------+------+

I've tried to execute it the same way I once applied a built-in function to a whole column, but got an error:

>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable

What am I doing wrong, and how can I apply a custom function to the strings in the selected columns?

You have to use a UDF (user defined function) in Spark:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2))
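For reference, here is a minimal end-to-end sketch of that pattern; the SparkSession setup and the sample data are assumptions added only to make it runnable:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Hypothetical session and sample data, only to illustrate the pattern
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'address 1.1', 'address 1.2'), (2, 'address 2.1', 'address 2.2')],
    ['id', 'address1', 'address2'])

def example(string1, string2):
    # Count words shared between the two lower-cased, space-split strings
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    return len(set(name_1) & set(name_2))

example_udf = udf(example, LongType())

# withColumn returns a new DataFrame, so assign the result back
df = df.withColumn('result', example_udf(df.address1, df.address2))
df.show()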

