pyspark - DataFrame in PySpark - How to apply aggregate functions to two columns?


I'm using a DataFrame in PySpark. I have Table 1 below and need to obtain Table 2, where:

  • num_category - how many different categories each id has
  • sum(count) - sum of the third column (count) in Table 1 for each id

example:

table 1

id | category | count
1  |    4     |   1
1  |    3     |   2
1  |    1     |   2
2  |    2     |   1
2  |    1     |   1

table 2

id | num_category | sum(count)
1  |      3       |     5
2  |      2       |     2

I tried:

    table1 = data.groupby("id","category").agg(count("*"))
    cat = table1.groupby("id").agg(count("*"))
    count = table1.groupby("id").agg(func.sum("count"))
    table2 = cat.join(count, cat.id == count.id)

error:

          1 table1 = data.groupby("id","category").agg(count("*"))
    ----> 2 cat = table1.groupby("id").agg(count("*"))
            count = table1.groupby("id").agg(func.sum("count"))
            table2 = cat.join(count, cat.id == count.id)

    TypeError: 'DataFrame' object is not callable

You can aggregate multiple columns on a single grouped DataFrame:

    data.groupby('id').agg({'category': 'count', 'count': 'sum'}).withColumnRenamed('count(category)', 'num_category').show()

    +---+------------+----------+
    | id|num_category|sum(count)|
    +---+------------+----------+
    |  1|           3|         5|
    |  2|           2|         2|
    +---+------------+----------+
