python - Pandas: expanding DataFrame by number of observations in column -


stata has function expand adds rows database corresponding values in particular column. example:

i have:

df = pd.dataframe({"a":[1, 2, 3],                     "b":[3,4,5]})      b 0  1  3 1  2  4 2  3  5 

what need:

df2 = pd.dataframe({"a":[1, 2, 3, 2, 3, 3],                      "b":[3,4,5, 4, 5, 5]})      b 0  1  3 1  2  4 2  3  5 3  2  4 4  3  5 6  3  5 

the value in df.loc[0,'a'] 1, no additional row added end of dataframe, since b=3 supposed occur once.

the value in df.loc[1,'a'] 2, 1 observation added end of dataframe, bringing total occurrences of b=4 2.

the value in df.loc[2,'a'] 3, 2 observations added end of dataframe, bringing total occurrences of b=5 3.

i've scoured prior questions me started, no luck. appreciated.

there number of possibilities, built around np.repeat:

def using_reindex(df):     return df.reindex(np.repeat(df.index, df['a'])).reset_index(drop=true)  def using_dictcomp(df):     return  pd.dataframe({col:np.repeat(df[col].values, df['a'], axis=0)                            col in df})  def using_df_values(df):     return pd.dataframe(np.repeat(df.values, df['a'], axis=0), columns=df.columns)  def using_loc(df):     return df.loc[np.repeat(df.index.values, df['a'])].reset_index(drop=true) 

for example,

in [219]: df = pd.dataframe({"a":[1, 2, 3], "b":[3,4,5]}) in [220]: df.reindex(np.repeat(df.index, df['a'])).reset_index(drop=true) out[220]:      b 0  1  3 1  2  4 2  2  4 3  3  5 4  3  5 5  3  5 

here benchmark on 1000-row dataframe; result being 500k-row dataframe:

in [208]: df = make_dataframe(1000)  in [210]: %timeit using_dictcomp(df) 10 loops, best of 3: 23.6 ms per loop  in [218]: %timeit using_reindex(df) 10 loops, best of 3: 35.8 ms per loop  in [211]: %timeit using_df_values(df) 10 loops, best of 3: 31.3 ms per loop  in [212]: %timeit using_loc(df) 1 loop, best of 3: 275 ms per loop 

this code used generate df:

import numpy np import pandas pd  def make_dataframe(nrows=100):     df = pd.dataframe(         {'a': np.arange(nrows),          'float': np.random.randn(nrows),          'str': np.random.choice('lorem ipsum dolor sit'.split(), size=nrows),          'datetime64': pd.date_range('20000101', periods=nrows)},         index=pd.date_range('20000101', periods=nrows))     return df  df = make_dataframe(1000) 

if there few columns, using_dictcomp fastest. note using_dictcomp assumes df has unique column names. dict comprehension in using_dictcomp won't repeat duplicated column names. other alternatives work repeated column names, however.

both using_reindex , using_loc assume df has unique index.


using_reindex came cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in (unfortunately) deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed wasn't necessary manually repeat values -- need repeat index , let df.loc (or df.reindex) repeat rows you. avoids accessing df.values generate intermediate numpy array of object dtype if df contains columns of multiple dtypes.


Comments

Popular posts from this blog

javascript - Create a stacked percentage column -

Optimising Firebase database by automatically overwriting data -

javascript - Angular UI-Grid customTemplate directive causing rows to load slowly/? -