python - Pandas: expanding DataFrame by number of observations in column -
stata has function expand adds rows database corresponding values in particular column. example:
i have:
df = pd.dataframe({"a":[1, 2, 3], "b":[3,4,5]}) b 0 1 3 1 2 4 2 3 5 what need:
df2 = pd.dataframe({"a":[1, 2, 3, 2, 3, 3], "b":[3,4,5, 4, 5, 5]}) b 0 1 3 1 2 4 2 3 5 3 2 4 4 3 5 6 3 5 the value in df.loc[0,'a'] 1, no additional row added end of dataframe, since b=3 supposed occur once.
the value in df.loc[1,'a'] 2, 1 observation added end of dataframe, bringing total occurrences of b=4 2.
the value in df.loc[2,'a'] 3, 2 observations added end of dataframe, bringing total occurrences of b=5 3.
i've scoured prior questions me started, no luck. appreciated.
there number of possibilities, built around np.repeat:
def using_reindex(df): return df.reindex(np.repeat(df.index, df['a'])).reset_index(drop=true) def using_dictcomp(df): return pd.dataframe({col:np.repeat(df[col].values, df['a'], axis=0) col in df}) def using_df_values(df): return pd.dataframe(np.repeat(df.values, df['a'], axis=0), columns=df.columns) def using_loc(df): return df.loc[np.repeat(df.index.values, df['a'])].reset_index(drop=true) for example,
in [219]: df = pd.dataframe({"a":[1, 2, 3], "b":[3,4,5]}) in [220]: df.reindex(np.repeat(df.index, df['a'])).reset_index(drop=true) out[220]: b 0 1 3 1 2 4 2 2 4 3 3 5 4 3 5 5 3 5 here benchmark on 1000-row dataframe; result being 500k-row dataframe:
in [208]: df = make_dataframe(1000) in [210]: %timeit using_dictcomp(df) 10 loops, best of 3: 23.6 ms per loop in [218]: %timeit using_reindex(df) 10 loops, best of 3: 35.8 ms per loop in [211]: %timeit using_df_values(df) 10 loops, best of 3: 31.3 ms per loop in [212]: %timeit using_loc(df) 1 loop, best of 3: 275 ms per loop this code used generate df:
import numpy np import pandas pd def make_dataframe(nrows=100): df = pd.dataframe( {'a': np.arange(nrows), 'float': np.random.randn(nrows), 'str': np.random.choice('lorem ipsum dolor sit'.split(), size=nrows), 'datetime64': pd.date_range('20000101', periods=nrows)}, index=pd.date_range('20000101', periods=nrows)) return df df = make_dataframe(1000) if there few columns, using_dictcomp fastest. note using_dictcomp assumes df has unique column names. dict comprehension in using_dictcomp won't repeat duplicated column names. other alternatives work repeated column names, however.
both using_reindex , using_loc assume df has unique index.
using_reindex came cᴏʟᴅsᴘᴇᴇᴅ's using_loc, in (unfortunately) deleted post. cᴏʟᴅsᴘᴇᴇᴅ showed wasn't necessary manually repeat values -- need repeat index , let df.loc (or df.reindex) repeat rows you. avoids accessing df.values generate intermediate numpy array of object dtype if df contains columns of multiple dtypes.
Comments
Post a Comment