python - Pandas: expanding DataFrame by number of observations in column -


stata has function expand adds rows database corresponding values in particular column. example:

i have:

df = pd.dataframe({"a":[1, 2, 3],                     "b":[3,4,5]})      b 0  1  3 1  2  4 2  3  5 

what need:

df2 = pd.dataframe({"a":[1, 2, 3, 2, 3, 3],                      "b":[3,4,5, 4, 5, 5]})      b 0  1  3 1  2  4 2  3  5 3  2  4 4  3  5 6  3  5 

the value in df.loc[0,'a'] 1, no additional row added end of dataframe, since b=3 supposed occur once.

the value in df.loc[1,'a'] 2, 1 observation added end of dataframe, bringing total occurrences of b=4 2.

the value in df.loc[2,'a'] 3, 2 observations added end of dataframe, bringing total occurrences of b=5 3.

i've scoured prior questions me started, no luck. appreciated.

there number of possibilities, built around np.repeat:

def using_reindex(df):     return df.reindex(np.repeat(df.index, df['a'])).reset_index(drop=true)  def using_dictcomp(df):     return  pd.dataframe({col:np.repeat(df[col].values, df['a'], axis=0)                            col in df})  def using_df_values(df):     return pd.dataframe(np.repeat(df.values, df['a'], axis=0), columns=df.columns)  def using_loc(df):     return df.loc[np.repeat(df.index.values, df['a'])].reset_index(drop=true) 

for example,

in [219]: df = pd.dataframe({"a":[1, 2, 3], "b":[3,4,5]}) in [220]: df.reindex(np.repeat(df.index, df['a'])).reset_index(drop=true) out[220]:      b 0  1  3 1  2  4 2  2  4 3  3  5 4  3  5 5  3  5 

here benchmark on 1000-row dataframe; result being 500k-row dataframe:

in [208]: df = make_dataframe(1000)  in [210]: %timeit using_dictcomp(df) 10 loops, best of 3: 23.6 ms per loop  in [218]: %timeit using_reindex(df) 10 loops, best of 3: 35.8 ms per loop  in [211]: %timeit using_df_values(df) 10 loops, best of 3: 31.3 ms per loop  in [212]: %timeit using_loc(df) 1 loop, best of 3: 275 ms per loop 

this code used generate df:

import numpy np import pandas pd  def make_dataframe(nrows=100):     df = pd.dataframe(         {'a': np.arange(nrows),          'float': np.random.randn(nrows),          'str': np.random.choice('lorem ipsum dolor sit'.split(), size=nrows),          'datetime64': pd.date_range('20000101', periods=nrows)},         index=pd.date_range('20000101', periods=nrows))     return df  df = make_dataframe(1000) 

if there few columns, using_dictcomp fastest. note using_dictcomp assumes df has unique column names. dict comprehension in using_dictcomp won't repeat duplicated column names. other alternatives work repeated column names, however.

both using_reindex , using_loc assume df has unique index.


using_reindex came cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in (unfortunately) deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed wasn't necessary manually repeat values -- need repeat index , let df.loc (or df.reindex) repeat rows you. avoids accessing df.values generate intermediate numpy array of object dtype if df contains columns of multiple dtypes.


Comments