python - How to add new column to dataframe in pyspark



I'm trying this, but get a long error:

df = df.withColumn('newcolumnname', someother_df['time'])

and it doesn't work. Doing this:

df = df.withColumn('newcolumnname', someother_df.select('time'))

gives me the error: AssertionError: col should be Column

You seem to be combining 2 dataframes without common keys. (As for the error: someother_df.select('time') returns a DataFrame, not a Column, which is why withColumn raises that AssertionError.) The below code should work for you.

import pyspark.sql.functions as func

df1 = sc.parallelize([('1234', '13'), ('6789', '68')]).toDF(['col1', 'col2'])
df2 = sc.parallelize([('7777', '66'), ('8888', '22')]).toDF(['col3', 'col4'])

# since there is no common column between these 2 dataframes, add a row_index so they can be joined
df1 = df1.withColumn('row_index', func.monotonically_increasing_id())
df2 = df2.withColumn('row_index', func.monotonically_increasing_id())

# 'col3' from the second dataframe (i.e. df2) is added to the first dataframe (i.e. df1)
df1 = df1.join(df2['row_index', 'col3'], on=['row_index']).drop('row_index')
df1.show()
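One caveat, not from the original answer: monotonically_increasing_id() only guarantees unique, increasing values, and the actual values depend on partitioning, so the two dataframes are only sure to get matching row_index values when each sits in a single partition (as these toy examples do). A more robust variant, sketched below under that assumption, converts the id into a consecutive row number with a window function:

import pyspark.sql.functions as func
from pyspark.sql import Window

# ordering by the increasing id preserves the existing row order;
# row_number() then yields consecutive 1..N values in both dataframes
w = Window.orderBy(func.monotonically_increasing_id())
df1 = df1.withColumn('row_index', func.row_number().over(w))
df2 = df2.withColumn('row_index', func.row_number().over(w))

df1 = df1.join(df2['row_index', 'col3'], on=['row_index']).drop('row_index')
df1.show()

Note that Spark warns when a window has no partition specification, since it moves all rows to one partition; for small dataframes like these that is acceptable.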


Don't forget to let me know if it solved your problem :)


