PySpark - DataFrame in PySpark - How to apply aggregate functions to two columns?

I'm using a DataFrame in PySpark. I have one table, table 1 below, and I need to obtain table 2, where:

num_category - how many different categories each id has
sum(count) - the sum of the third column in table 1 for each id.

Example:

table 1

id   |category | count
1    |    4    |   1
1    |    3    |   2
1    |    1    |   2
2    |    2    |   1
2    |    1    |   1

table 2

id   |num_category| sum(count)
1    |    3       |   5
2    |    2       |   2

I tried:

table1 = data.groupby("id","category").agg(count("*"))
cat = table1.groupby("id").agg(count("*"))
count = table1.groupby("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)

Error:

     1 table1 = data.groupby("id","category").agg(count("*"))
---> 2 cat = table1.groupby("id").agg(count("*"))
       count = table1.groupby("id").agg(func.sum("count"))
       table2 = cat.join(count, cat.id == count.id)

TypeError: 'DataFrame' object is not callable
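One possible reading of the error, not confirmed by the traceback: the third line of the snippet binds a DataFrame to the name `count`, which is also the name used for the aggregate function. If the code is run again in the same session (for example in a notebook), `count("*")` would then be a call on a DataFrame. The plain-Python sketch below reproduces only the mechanism; the `count` function and `DataFrame` class in it are hypothetical stand-ins, not the Spark objects.

```python
def count(col):
    """Hypothetical stand-in for an aggregate function named `count`."""
    return f"count({col})"


class DataFrame:
    """Hypothetical stand-in for a Spark DataFrame (not callable)."""


# First use: `count` still refers to the function.
first = count("*")

# Rebinding the name, as `count = table1.groupby(...)...` does in the question.
count = DataFrame()

# Any later call through the same name now targets the DataFrame object.
try:
    count("*")
except TypeError as exc:
    message = str(exc)

print(first)
print(message)
```

If this is what happened, giving the result a different variable name, or importing the Spark functions under a module alias, would avoid the clash.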

You can do multiple-column aggregation on a single grouped DataFrame:

data.groupby('id').agg({'category':'count','count':'sum'}).withColumnRenamed('count(category)',"num_category").show()

+---+-------+--------+
| id|num_cat|sum(cnt)|
+---+-------+--------+
|  1|      3|       5|
|  2|      2|       2|
+---+-------+--------+
