Spark SQL: Infer Missing Data
I have two tables that I'm joining on date. One of them may not have an entry for a given (key, date), in which case I need to check the previous days (up to 5) and use the stale value if possible.

Table 2 contains data of the format (date, attribute, <additional columns>), uploaded ideally every day for each attribute (there are many attributes). Table 1 can have many different entries for a given (date, attribute) pair, or none at all. I'm looking for a way to join tables 1 and 2 such that every entry in table 1 is joined to the data in table 2 with the corresponding date and attribute, and, if no corresponding entry in table 2 exists, joined instead to the most recent entry in table 2 that predates the table 1 entry, provided that table 2 entry is not more than 5 days out of date.
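For concreteness, here is the lookup rule I'm describing, as a minimal Python sketch over made-up data (the table contents, attribute name "a", and the helper lookup() are all hypothetical, just to pin down the semantics):

```python
from datetime import date, timedelta

# Hypothetical table2: (date, attribute) -> value, uploaded roughly daily.
# Note there are no "a" rows after June 2.
table2 = {
    (date(2015, 6, 1), "a"): 10,
    (date(2015, 6, 2), "a"): 11,
}

# Hypothetical table1 rows: (date, attribute).
table1 = [
    (date(2015, 6, 2), "a"),  # exact date match exists
    (date(2015, 6, 5), "a"),  # should fall back to June 2 (3 days stale)
    (date(2015, 6, 9), "a"),  # June 2 is 7 days old -> no match
]

MAX_LOOKBACK = 5

def lookup(d, attr):
    """Return the freshest table2 value at most MAX_LOOKBACK days old, else None."""
    for back in range(MAX_LOOKBACK + 1):
        hit = table2.get((d - timedelta(days=back), attr))
        if hit is not None:
            return hit
    return None

joined = [(d, attr, lookup(d, attr)) for d, attr in table1]
# June 2 and June 5 both resolve to the June 2 value; June 9 gets None.
```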
I can code this relatively easily using a loop (or a while loop, to short-circuit), but for various reasons it needs to be in SQL. I'm using Hive for most of the data, so I have a HiveContext available and can use windowing functions. However, I can't figure out how to use lag() over a date range as opposed to over physical rows in the table. I think I could do it with a chain of left outer joins on successive date subtractions, something like:

    select ...
    from table1 t1
    left outer join table2 t2
      on t1.date = t2.date
    ...
    left outer join table2
      on t3.date = date_sub(t2.date, 1) and t3.table2_unique_col is null
    ...
    left outer join table2
      on t4.date = date_sub(t2.date, 2) and t4.table2_unique_col is null

and so on, but that seems both highly inefficient (so many joins!) and brittle (what if in the future I want 7-day lookbacks?). I'm by no means a SQL expert, but it feels like this might reduce cleanly to a well-constructed join or group key.
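What I suspect I'm after is a single range (non-equi) join that pulls in every table 2 row within the 5-day window, followed by a window ranking that keeps only the freshest match per table 1 row. Here is a sketch of that pattern using SQLite from Python, purely to check the semantics; the table names, columns, and data are made up, and in Spark SQL / HiveQL the date arithmetic would be date_sub(t1.date, 5) rather than SQLite's date() modifier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE table1 (date TEXT, attribute TEXT, t1_value INTEGER);
CREATE TABLE table2 (date TEXT, attribute TEXT, t2_value INTEGER);
INSERT INTO table1 VALUES
  ('2015-06-02', 'a', 1),
  ('2015-06-05', 'a', 2),
  ('2015-06-09', 'a', 3);
INSERT INTO table2 VALUES
  ('2015-06-01', 'a', 10),
  ('2015-06-02', 'a', 11);
""")

# One range join grabs all table2 rows up to 5 days back; ROW_NUMBER
# then keeps only the most recent table2 row per table1 row.
rows = cur.execute("""
SELECT date, attribute, t1_value, t2_value FROM (
  SELECT t1.date, t1.attribute, t1.t1_value, t2.t2_value,
         ROW_NUMBER() OVER (
           PARTITION BY t1.rowid
           ORDER BY t2.date DESC
         ) AS rn
  FROM table1 t1
  LEFT JOIN table2 t2
    ON  t2.attribute = t1.attribute
    AND t2.date BETWEEN date(t1.date, '-5 day') AND t1.date
)
WHERE rn = 1
ORDER BY date
""").fetchall()
# The LEFT JOIN preserves table1 rows with no in-window match (t2_value NULL),
# and changing the lookback to 7 days is a one-character edit.
```

The chain of N joins collapses into one join plus one window function, and the lookback width becomes a parameter instead of a structural property of the query. One caveat I'm unsure about: older Hive versions may reject non-equi conditions in an outer join's ON clause, in which case the range predicate might have to move into a WHERE filter over a join on attribute alone.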