python - Speeding up calculation of nearby groups?


I have a data frame that contains a group ID, two distance measures (longitude/latitude-type measures), and a value. Given a set of distances, I want to find the number of other groups nearby and the average value of those nearby groups.

I've written the following code, but it is so inefficient that it does not complete in a reasonable time for large data sets. The calculation of the number of nearby retailers is quick, but the calculation of the average value of the nearby retailers is extremely slow. Is there a better way to make this more efficient?

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

distances = [1, 2]

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['group', 'dist1', 'dist2', 'value'])

# 1 row per group, with the two distances for each group
df_groups = df.groupby('group')[['dist1', 'dist2']].mean()

# create a KDTree for quick searching
tree = cKDTree(df_groups[['dist1', 'dist2']])

# find points within each given radius
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)

    # put in a density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]

    # get the average value of the nearby groups
    for idx, val in enumerate(df_groups.index):
        val_idx = df_groups.iloc[closeby[idx]].index.values
        mean = df.loc[df['group'].isin(val_idx), 'value'].mean()
        df_groups.loc[val, str(i) + '_mean_values'] = mean

    # merge back into the main dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='group',
                  right_index=True)
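For reference, cKDTree.query_ball_tree returns one list per point, containing the indices of the points of the other tree that fall within radius r (when a tree is queried against itself, each point's list includes the point itself). A minimal illustration with made-up points:

import numpy as np
from scipy.spatial import cKDTree

pts = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 4.0]])
tree = cKDTree(pts)

# entry i lists the indices of points within r of point i (including i itself)
neighbors = tree.query_ball_tree(tree, r=1.0)
print(neighbors)  # e.g. [[0, 1], [0, 1], [2]] (order within each list is not guaranteed)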

It's clear the problem is the indexing of the main dataframe with the isin method: as the dataframe grows in length, a much larger search has to be done on every lookup. I propose doing the same search on the smaller df_groups data frame and calculating an updated average instead.

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['group', 'dist1', 'dist2', 'value'])
distances = [1, 2]

# mean of the distances, plus the mean and count of the values, for each group
df_groups = df.groupby('group')[['dist1', 'dist2', 'value']].agg(
    {'dist1': 'mean', 'dist2': 'mean', 'value': ['mean', 'count']})

# flatten the MultiIndex columns
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

# rename the columns
df_groups.rename(columns={'dist1 mean': 'dist1', 'dist2 mean': 'dist2',
                          'value mean': 'value', 'value count': 'count'},
                 inplace=True)

# create a KDTree for quick searching
tree = cKDTree(df_groups[['dist1', 'dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)

    # put in a density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]

    # create a column of index subsets (each group's neighbours)
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]

    # set up a column to prep the updated mean calculation
    df_groups['commean'] = df_groups['value'] * df_groups['count']

    # perform the updated mean
    df_groups[str(i) + '_mean_values'] = [
        (df_groups.loc[df_groups.index.isin(row), 'commean'].sum() /
         df_groups.loc[df_groups.index.isin(row), 'count'].sum())
        for row in df_groups['subs']]

    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='group',
                  right_index=True)

The formula for the updated (pooled) mean is (m1*n1 + m2*n2) / (n1 + n2), where m is a group's mean value and n is its count.
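As a quick sanity check with made-up numbers (not from the data above):

m1, n1 = 3.0, 2   # e.g. the values [2, 4]
m2, n2 = 8.0, 3   # e.g. the values [6, 8, 10]

pooled = (m1 * n1 + m2 * n2) / (n1 + n2)
print(pooled)  # 6.0, the same as the mean of [2, 4, 6, 8, 10]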

Old setup:

100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop

1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop

10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop

New setup:

100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop

1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop

10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
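(The timings assume each version is wrapped in a function; old and new are just hypothetical wrapper names for the two code blocks above, not library functions. A minimal sketch of the harness:)

import numpy as np
import pandas as pd

def make_df(n_rows):
    # random frame with the same shape/columns used above
    return pd.DataFrame(np.random.randint(0, 100, size=(n_rows, 4)),
                        columns=['group', 'dist1', 'dist2', 'value'])

df = make_df(100000)
# in IPython/Jupyter:
# %timeit old(df)
# %timeit new(df)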
