Spark SQL: Infer Missing Data


I have 2 tables that I'm joining on date. One of them may not have an entry for a given key and date, in which case I need to check the previous days (up to 5) and use stale values if possible.

Table 2 contains data of the format (date, attribute, <additional columns>), uploaded ideally every day for each attribute (there are many attributes). Table 1 can have many different entries for each given (date, attribute) pair, or none at all. I'm looking for a way to join tables 1 and 2 such that every entry in table 1 is joined to the data in table 2 for the corresponding date and related attribute, and, if no corresponding entry in table 2 exists, is joined to the most recent entry in table 2 that predates the table 1 entry, provided that the table 2 entry is not more than 5 days out of date.
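For concreteness, here is a minimal sketch of what the two schemas might look like. Only date and attribute are given in the question; the event_id and value columns are hypothetical stand-ins for table 1's payload and table 2's <additional columns>.

    -- Hypothetical schemas (HiveQL); event_id and value are assumed, not from the question.
    create table table1 (event_id string, `date` date, attribute string);
    create table table2 (`date` date, attribute string, value double);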

I could code this relatively easily using a loop (or a while loop, to short-circuit), but for various reasons it needs to be in SQL. I'm using Hive for the data, so I have a HiveContext available and can use windowing functions. However, I can't figure out how to use lag() over a date range as opposed to over physical rows in the table. I think I could chain left outer joins on successive date subtractions (something like

    select ... from table1 t1
      left outer join table2 t2 on t1.date = t2.date
    ) t3 ...
    left outer join table2 on (t3.date = date_sub(t2.date, 1) and t3.table2_unique_col is null)
    ) t4
    left outer join table2 on (t4.date = date_sub(t2.date, 2) and t4.table2_unique_col is null)
    ) t5
    etc.

but this seems both highly inefficient (so many joins!) and brittle (what if in the future I want 7-day lookbacks?). I'm by no means a SQL expert, but it feels like this might reduce cleanly to a well-constructed join or group-by key.
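One way that avoids chained joins entirely is a range join plus a window function: join table 1 to every table 2 row for the same attribute within the last 5 days, then keep only the most recent match per table 1 row with row_number(). This is a sketch, not a confirmed answer; it assumes the hypothetical event_id column above uniquely identifies a table 1 row, and that your Spark/Hive version accepts a range predicate in an outer join condition.

    -- Sketch: pick the freshest table2 row within 5 days, per table1 row.
    -- event_id is a hypothetical unique key for table1 rows.
    select event_id, `date`, attribute, value
    from (
      select
        t1.event_id,
        t1.`date`,
        t1.attribute,
        t2.value,
        row_number() over (
          partition by t1.event_id        -- one winner per table1 row
          order by t2.`date` desc         -- freshest table2 entry first
        ) as rn
      from table1 t1
      left outer join table2 t2
        on  t2.attribute = t1.attribute
        and t2.`date` between date_sub(t1.`date`, 5) and t1.`date`
    ) ranked
    where rn = 1;

The left outer join preserves table 1 rows with no match in the window (their value comes back null, with rn = 1), and widening the lookback to 7 days is a one-character change to date_sub. If the between predicate in the on clause gives your engine trouble, the equivalent datediff(t1.`date`, t2.`date`) between 0 and 5 form is a common workaround.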

