python - Speeding up calculation of nearby groups?
I have a data frame that contains a group ID, two distance measures (longitude/latitude-type measures), and a value. Given a set of distances, I want to find the number of other groups nearby and the average value of those nearby groups.

I've written the following code, but it is so inefficient that it does not complete in a reasonable time for large data sets. The calculation of nearby retailers is quick, but the calculation of the average value of nearby retailers is extremely slow. Is there a better way to make this more efficient?
    import numpy as np
    import pandas as pd
    from scipy.spatial import cKDTree

    distances = [1, 2]
    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                      columns=['group', 'dist1', 'dist2', 'value'])

    # one row per group, with the two distances for each group
    df_groups = df.groupby('group')[['dist1', 'dist2']].mean()

    # create a KDTree for quick searching
    tree = cKDTree(df_groups[['dist1', 'dist2']])

    for i in distances:
        # find points within the given radius
        closeby = tree.query_ball_tree(tree, r=i)
        # put in a density column
        df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
        # get the average values of nearby groups
        for idx, val in enumerate(df_groups.index):
            val_idx = df_groups.iloc[closeby[idx]].index.values
            mean = df.loc[df['group'].isin(val_idx), 'value'].mean()
            df_groups.loc[val, str(i) + '_mean_values'] = mean
        # merge back onto the main dataframe
        df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                     str(i) + '_mean_values']],
                      left_on='group', right_index=True)
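For reference, query_ball_tree returns one list of neighbour indices per point in the tree, and each point counts as its own neighbour, which is why len(x) above gives the group count. A minimal sketch (the points and radius are made up for illustration):

    import numpy as np
    from scipy.spatial import cKDTree

    pts = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
    tree = cKDTree(pts)

    # one list of positional indices per point; each point includes itself
    print(tree.query_ball_tree(tree, r=1))   # [[0, 1], [0, 1], [2]]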
It's clear that the problem is indexing into the main data frame with the isin method: as the data frame grows in length, a much larger search has to be done. I propose doing the same search on the smaller df_groups data frame instead, and computing an updated (pooled) average from the per-group means and counts.
    import numpy as np
    import pandas as pd
    from scipy.spatial import cKDTree

    df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                      columns=['group', 'dist1', 'dist2', 'value'])
    distances = [1, 2]

    # means of the distances, plus the mean and count of 'value', for each group
    df_groups = df.groupby('group')[['dist1', 'dist2', 'value']].agg(
        {'dist1': 'mean', 'dist2': 'mean', 'value': ['mean', 'count']})

    # flatten the multi-level column index
    df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

    # rename the columns
    df_groups.rename(columns={'dist1 mean': 'dist1', 'dist2 mean': 'dist2',
                              'value mean': 'value', 'value count': 'count'},
                     inplace=True)

    # create a KDTree for quick searching
    tree = cKDTree(df_groups[['dist1', 'dist2']])

    for i in distances:
        closeby = tree.query_ball_tree(tree, r=i)
        # put in a density column
        df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
        # create a column of index subsets
        df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
        # set up a column to prep the updated mean calculation
        df_groups['commean'] = df_groups['value'] * df_groups['count']
        # perform the updated mean
        df_groups[str(i) + '_mean_values'] = [
            (df_groups.loc[df_groups.index.isin(row), 'commean'].sum() /
             df_groups.loc[df_groups.index.isin(row), 'count'].sum())
            for row in df_groups['subs']]
        df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                     str(i) + '_mean_values']],
                      left_on='group', right_index=True)
The formula for the updated (pooled) mean is (m1*n1 + m2*n2) / (n1 + n2), where each m is a group's mean and each n is that group's count.
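As a quick sanity check (not part of the original post), the pooled mean computed from per-group means and counts should reproduce the brute-force isin() average exactly. A minimal sketch on a small frame; the names small, g, fast, and slow are illustrative:

    import numpy as np
    import pandas as pd
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    small = pd.DataFrame(rng.integers(0, 100, size=(500, 4)),
                         columns=['group', 'dist1', 'dist2', 'value'])

    # per-group means and counts, as in the answer above
    g = small.groupby('group').agg(dist1=('dist1', 'mean'), dist2=('dist2', 'mean'),
                                   value=('value', 'mean'), count=('value', 'count'))
    tree = cKDTree(g[['dist1', 'dist2']])
    closeby = tree.query_ball_tree(tree, r=2)

    # pooled mean: sum(m*n) / sum(n) over the nearby groups
    g['commean'] = g['value'] * g['count']
    fast = [g['commean'].iloc[idx].sum() / g['count'].iloc[idx].sum()
            for idx in closeby]

    # brute force: average the raw rows belonging to the nearby groups
    slow = [small.loc[small['group'].isin(g.index.values[idx]), 'value'].mean()
            for idx in closeby]

    assert np.allclose(fast, slow)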
Old setup:

    100000 rows:    %timeit old(df)   1 loop, best of 3: 694 ms per loop
    1000000 rows:   %timeit old(df)   1 loop, best of 3: 6.08 s per loop
    10000000 rows:  %timeit old(df)   1 loop, best of 3: 6min 13s per loop

New setup:

    100000 rows:    %timeit new(df)   10 loops, best of 3: 136 ms per loop
    1000000 rows:   %timeit new(df)   1 loop, best of 3: 525 ms per loop
    10000000 rows:  %timeit new(df)   1 loop, best of 3: 4.53 s per loop
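For context, old() and new() in the timings are presumably the two versions above wrapped in functions of the input frame. A minimal harness sketch; make_df and the function names are assumptions, not from the original post:

    import numpy as np
    import pandas as pd

    def make_df(n_rows, seed=0):
        # random test frame of the shape used in the question (assumed setup)
        rng = np.random.default_rng(seed)
        return pd.DataFrame(rng.integers(0, 100, size=(n_rows, 4)),
                            columns=['group', 'dist1', 'dist2', 'value'])

    # In IPython:
    #   df = make_df(100000)
    #   %timeit old(df)   # original isin()-based version
    #   %timeit new(df)   # pooled-mean version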