Apply a custom function to cells of selected columns of a data frame in PySpark
Let's say I have a data frame that looks like this:
+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+
I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
    return intersection_count
I want to store the result in a new column, so that the final data frame looks like this:
+---+-----------+-----------+------+
| id|   address1|   address2|result|
+---+-----------+-----------+------+
|  1|address 1.1|address 1.2|     2|
|  2|address 2.1|address 2.2|     7|
+---+-----------+-----------+------+
I tried to call it the same way I once applied a built-in function to a whole column, but I got an error:
>>> df.withColumn('result', example(df.address1, df.address2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in example
TypeError: 'Column' object is not callable
What am I doing wrong, and how can I apply a custom function to the strings in selected columns?
You have to use a UDF (user-defined function) in Spark. Calling example(df.address1, df.address2) directly fails because the function receives Column objects rather than strings; wrapping it in a UDF lets Spark apply it to the column values row by row:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2))
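For reference, here is a minimal end-to-end sketch, assuming a local SparkSession and the column names from the question (the sample rows are just the two shown above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# Sample data matching the frame in the question
df = spark.createDataFrame(
    [(1, "address 1.1", "address 1.2"), (2, "address 2.1", "address 2.2")],
    ["id", "address1", "address2"],
)

def example(string1, string2):
    # Number of whitespace-separated tokens shared by the two strings
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    return len(set(name_1) & set(name_2))

# Register the Python function as a UDF returning a long, then apply it column-wise
example_udf = udf(example, LongType())
df.withColumn('result', example_udf(df.address1, df.address2)).show()

On larger data frames you might also consider a pandas_udf (available since Spark 2.3), which reduces per-row Python serialization overhead, but the plain udf above is the most direct fix for this error.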