python - How to filter grouped by in Pandas -
i newbie in pandas. have following dataset. think of dataset departments('k1') , people('k2') of company.
dframe = pd.dataframe({'k1': ['x','x','y','y','z','z'], ...: 'k2': ['p1','p2','p3','p4','p5','p6'], ...: 'dataset1': np.random.randn(6)}) ...:
if take mean grouped dept/'k1' following
dataset1 k1 x 0.153825 y -0.648500 z 1.133334
if take mean grouped people/'k2', following
in [6]: dframe.groupby('k2').mean() out[6]: dataset1 k2 p1 1.595455 p2 -1.287805 p3 0.211858 p4 -1.508859 p5 1.350336 p6 0.916332
my question how can filter mean values grouped people/'k2' greater mean of dept/'k1' belongs. e.g. p1 mean value greater x mean value belongs
out[6]: dataset1 k2 p1 1.595455 p3 0.211858 p5 1.350336
sample (changed p6
p5
):
np.random.seed(45) dframe = pd.dataframe({'k1': ['x','x','y','y','z','z'], 'k2': ['p1','p2','p3','p4','p5','p5'], 'dataset1': np.random.randn(6)}) print (dframe) dataset1 k1 k2 0 0.026375 x p1 1 0.260322 x p2 2 -0.395146 y p3 3 -0.204301 y p4 4 -1.271633 z p5 5 -2.596879 z p5
first create new column groupby
, transform
:
dframe['meank1'] = dframe.groupby('k1').transform('mean') print (dframe) dataset1 k1 k2 meank1 0 0.026375 x p1 0.143348 1 0.260322 x p2 0.143348 2 -0.395146 y p3 -0.299723 3 -0.204301 y p4 -0.299723 4 -1.271633 z p5 -1.934256 5 -2.596879 z p5 -1.934256
then aggregate agg
mean
, first
, necessary add k1
column groupby
avoid wrong output if same k2
in k1
.
dframe = dframe.groupby(['k1','k2']).agg({'dataset1':'mean', 'meank1':'first'}) print (dframe) meank1 dataset1 k1 k2 x p1 0.143348 0.026375 p2 0.143348 0.260322 y p3 -0.299723 -0.395146 p4 -0.299723 -0.204301 z p5 -1.934256 -1.934256
last filter boolean indexing
or query
:
dframe = dframe.loc[dframe['meank1'] > dframe['dataset1'], ['dataset1']] #alternative sol #dframe = dframe.query('meank1 > dataset1')[['dataset1']] print (dframe) dataset1 k1 k2 x p1 0.026375 y p3 -0.395146
and if want remove first level of multiindex
add reset_index
:
dframe = dframe.reset_index(level=0, drop=true) print (dframe) dataset1 k2 p1 0.026375 p3 -0.395146
for column index use:
dframe = dframe.reset_index(level=0, drop=true).reset_index() print (dframe) k2 dataset1 0 p1 0.026375 1 p3 -0.395146
Comments
Post a Comment