python - Optimizing pandas computation


I have 22 million rows of house property sale data in a database table called sale_transactions. I am performing a job that reads information from the table, performs calculations, and uses the results to create entries in a new table. The process looks like this:

    for index, row in zipcodes.iterrows():  # ~100k zipcodes
        sql_string = """SELECT * FROM sale_transactions WHERE zipcode = '{zipcode}'"""
        sql_query = sql_string.format(zipcode=row['zipcode'])
        df = pd.read_sql(sql_query, _engine)
        area_stat = create_area_stats(df)  # function that performs the calculations
        area_stat.save()  # saves a Django model

At the moment each iteration of the loop takes about 20 seconds on a MacBook Pro (16 GB RAM), which means the code is going to take weeks to finish. The expensive part is the read_sql line.

How can I optimize this? I can't read the whole sale_transactions table into memory (it's about 5 GB), hence the SQL query each time with a WHERE clause to capture only the relevant rows.

Most answers about optimizing pandas talk about reading in chunks (something like the sketch below), but in my case I need to perform the calculations on the data combined, since create_area_stats computes things like the number of sales over a ten-year period. I don't have easy access to a machine with loads of RAM, unless I start going to town on EC2, which I worry would be expensive and quite a lot of hassle.
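For reference, the chunked reading those answers suggest looks roughly like this minimal sketch (process is a hypothetical per-chunk function, and the chunk size is arbitrary):

    import pandas as pd

    # pd.read_sql returns an iterator of DataFrames when chunksize is given,
    # so the full 5 GB table never has to fit in memory at once.
    for chunk in pd.read_sql('SELECT * FROM sale_transactions', _engine, chunksize=100000):
        process(chunk)  # hypothetical per-chunk operation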

Any suggestions would be appreciated.

I faced a similar problem, and the code below helped me read a database table efficiently (~40 million rows).

    offsetid = 0
    totalrow = 0

    while True:
        df_batch = pd.read_sql_query(
            "SET work_mem='1024MB'; SELECT * FROM " + tablename +
            " WHERE row_number > " + str(offsetid) +
            " ORDER BY row_number LIMIT 100000", con=engine)
        if len(df_batch) == 0:
            break  # no more rows left to read
        offsetid = offsetid + len(df_batch)

        # your operation

        totalrow = totalrow + len(df_batch)

You have to create an indexed column called row_number in the table. The code then reads the table 100,000 rows at a time, index-wise. For example, when you want to read rows 200,000 - 210,000 you don't need to scan from 0 to 210,000; the query jumps straight to them via the index, which improves performance.
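Setting up that column is a one-off step. Assuming PostgreSQL (implied by the work_mem setting above) and using the question's sale_transactions table as an example, a minimal sketch with SQLAlchemy might look like this; the connection string and index name are just placeholders:

    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://user:pass@localhost/mydb')  # hypothetical connection string

    with engine.begin() as conn:
        # Add a sequential, auto-filled row_number column...
        conn.execute(text('ALTER TABLE sale_transactions ADD COLUMN row_number BIGSERIAL'))
        # ...and index it so the WHERE row_number > ... lookups stay cheap.
        conn.execute(text('CREATE INDEX ix_sale_transactions_row_number ON sale_transactions (row_number)'))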

