pyspark - DataFrame in PySpark - How to apply aggregate functions to two columns?
I'm using a DataFrame in PySpark. I have one table, Table 1 below, and I need to obtain Table 2, where:
- num_category - how many different categories each id has
- sum(count) - the sum of the third column in Table 1 for each id.
Example:
Table 1
id | category | count
1 | 4 | 1
1 | 3 | 2
1 | 1 | 2
2 | 2 | 1
2 | 1 | 1
Table 2
id | num_category | sum(count)
1 | 3 | 5
2 | 2 | 2
I tried:
table1 = data.groupby("id","category").agg(count("*"))
cat = table1.groupby("id").agg(count("*"))
count = table1.groupby("id").agg(func.sum("count"))
table2 = cat.join(count, cat.id == count.id)
Error:
      1 table1 = data.groupby("id","category").agg(count("*"))
----> 2 cat = table1.groupby("id").agg(count("*"))
        count = table1.groupby("id").agg(func.sum("count"))
        table2 = cat.join(count, cat.id == count.id)
TypeError: 'DataFrame' object is not callable
You can aggregate multiple columns in a single pass over the grouped data:
data.groupby('id').agg({'category':'count','count':'sum'}).withColumnRenamed('count(category)',"num_category").show()
+---+-------+--------+
| id|num_cat|sum(cnt)|
+---+-------+--------+
|  1|      3|       5|
|  2|      2|       2|
+---+-------+--------+