python - Optimizing pandas computation
I have 22 million rows of house property sale data in a database table called sale_transactions. I am performing a job that reads information from the table, performs calculations, and uses the results to create entries in a new table. The process looks like this:
    for index, row in zipcodes.iterrows():  # ~100k zipcodes
        sql_string = """select * from sale_transactions where zipcode = '{zipcode}' """
        sql_query = sql_string.format(zipcode=row['zipcode'])
        df = pd.read_sql(sql_query, _engine)
        area_stat = create_area_stats(df)  # function that performs the calculations
        area_stat.save()  # saves a Django model
At the moment each iteration of the loop takes 20 seconds on a MacBook Pro (16 GB RAM), which means the code is going to take weeks to finish. The expensive part is the read_sql line.
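For reference, the bottleneck can be confirmed with a rough timing check like the sketch below; it simply reuses the sql_query, _engine, and create_area_stats names from the loop above for a single zipcode, so it is an illustration rather than measured output.

    import time

    start = time.time()
    df = pd.read_sql(sql_query, _engine)      # the suspected slow step
    read_seconds = time.time() - start

    start = time.time()
    area_stat = create_area_stats(df)         # the in-memory calculations
    calc_seconds = time.time() - start

    print('read_sql: %.1fs, create_area_stats: %.1fs, rows: %d'
          % (read_seconds, calc_seconds, len(df)))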
How can I optimize this? I can't read the whole sale_transactions table into memory because it is about 5 GB, hence I am using an SQL query each time to capture only the relevant rows with a WHERE clause.
Most answers about optimizing pandas talk about reading in chunks, but in my case the calculation needs to be performed on all of the data combined, since I am calculating things like the number of sales over a ten-year period in the create_area_stats function. I don't have easy access to a machine with loads of RAM, unless I start going to town on EC2, which I worry will be expensive and quite a lot of hassle.
Any suggestions would be appreciated.
I faced a similar problem, and the code below helped me read the database (~40 million rows) effectively.
    offsetid = 0
    totalrow = 0
    while True:
        df_batch = pd.read_sql_query(
            'set work_mem="1024MB"; select * from ' + tablename +
            ' where row_number > ' + str(offsetid) +
            ' order by row_number limit 100000', con=engine)
        if df_batch.empty:  # stop once no rows are left
            break
        offsetid = offsetid + len(df_batch)
        # your operation on df_batch goes here
        totalrow = totalrow + len(df_batch)
You have to create an index column called row_number in your table. The code then reads the table in chunks of 100,000 rows, index-wise. For example, when you want to read rows 200,000 - 210,000 you don't need to read from 0 to 210,000; it reads directly via the index, which improves performance.
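A minimal sketch of the one-time setup this assumes, namely adding and indexing a sequential row_number column: it presumes PostgreSQL (suggested by the work_mem setting above) and a SQLAlchemy engine, and the connection URL and index name are placeholders rather than part of the original answer.

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://user:password@localhost/mydb')  # placeholder URL

    with engine.begin() as conn:
        # add a sequential column; bigserial backfills existing rows from a sequence
        conn.execute(text('alter table sale_transactions add column row_number bigserial'))
        # index it so "where row_number > ..." can seek instead of scanning the table
        conn.execute(text('create index sale_transactions_row_number_idx on sale_transactions (row_number)'))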