python - What's the idiomatic way to perform an aggregate and rename operation in pandas -
for example, how do following r data.table operation in pandas:
paths[,.( completed=sum(exists), missing=sum(not(exists)), total=.n, 'size (g)'=sum(sizemb)/1024), by=.(projectpath, pipelineid)] i.e. group projectpath , pipelineid, aggregate of columns using possibly custom functions, , rename resulting columns.
output should dataframe no hierarchical indexes, example:
projectpath pipelineid completed missing size (g) /data/pnl/projects/tracts/pnlpipe 0 2568 0 45.30824 /data/pnl/projects/tracts/pnlpipe 1 1299 0 62.69934
you can use groupby.agg:
df.groupby(['projectpath', 'pipelineid']).agg({ 'exists': {'completed': 'sum', 'missing': lambda x: (~x).sum(), 'total': 'size'}, 'sizemb': {'size (g)': lambda x: x.sum()/1024} }) sample run:
df = pd.dataframe({ 'projectpath': [1,1,1,1,2,2,2,2], 'pipelineid': [1,1,2,2,1,1,2,2], 'exists': [true, false,true,true,false,false,true,false], 'sizemb': [120032,12234,223311,3223,11223,33445,3444,23321] }) df1 = df.groupby(['projectpath', 'pipelineid']).agg({ 'exists': {'completed': 'sum', 'missing': lambda x: (~x).sum(), 'total': 'size'}, 'sizemb': {'size (g)': lambda x: x.sum()/1024} }) df1.columns = df1.columns.droplevel(0) df1.reset_index() update: if want customize aggregation without using deprecated nested dictionary syntax, can use groupby.apply , return series object each group:
df.groupby(['projectpath', 'pipelineid']).apply( lambda g: pd.series({ 'completed': g.exists.sum(), 'missing': (~g.exists).sum(), 'total': g.exists.size, 'size (g)': g.sizemb.sum()/1024 }) ).reset_index() 

Comments
Post a Comment