Spark SQL: Infer Missing Data
I have two tables that I'm joining on date. One of them may not have an entry for a given (key, date), in which case I need to check the previous days (up to 5) and use the stale value if possible.

Table 2 contains data of the format (date, attribute, <additional columns>), uploaded ideally every day for each attribute (there are many attributes). Table 1 can have many different entries for a given (date, attribute) pair, or none at all. I'm looking for a way to join tables 1 and 2 such that every entry in table 1 is joined to the data in table 2 with the corresponding date and attribute, and, if no corresponding entry in table 2 exists, joined instead to the most recent entry in table 2 that predates the table 1 entry, provided that table 2 entry is not more than 5 days out of date.
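For concreteness, here is the lookup rule I'm describing, as a minimal Python sketch over made-up data (the table contents, attribute name "a", and the helper lookup() are all hypothetical, just to pin down the semantics):

```python
from datetime import date, timedelta

# Hypothetical table2: (date, attribute) -> value, uploaded roughly daily.
# Note there are no "a" rows after June 2.
table2 = {
    (date(2015, 6, 1), "a"): 10,
    (date(2015, 6, 2), "a"): 11,
}

# Hypothetical table1 rows: (date, attribute).
table1 = [
    (date(2015, 6, 2), "a"),  # exact date match exists
    (date(2015, 6, 5), "a"),  # should fall back to June 2 (3 days stale)
    (date(2015, 6, 9), "a"),  # June 2 is 7 days old -> no match
]

MAX_LOOKBACK = 5

def lookup(d, attr):
    """Return the freshest table2 value at most MAX_LOOKBACK days old, else None."""
    for back in range(MAX_LOOKBACK + 1):
        hit = table2.get((d - timedelta(days=back), attr))
        if hit is not None:
            return hit
    return None

joined = [(d, attr, lookup(d, attr)) for d, attr in table1]
# June 2 and June 5 both resolve to the June 2 value; June 9 gets None.
```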
I can code this relatively easily using a loop (or a while loop, to short-circuit), but for various reasons it needs to be in SQL. I'm using Hive for most of the data, so I have a HiveContext available and can use windowing functions. However, I can't figure out how to use lag() over a date range as opposed to over physical rows in the table. I think I could do it with a chain of left outer joins on successive date subtractions, something like:

    select ...
    from table1 t1
    left outer join table2 t2
      on t1.date = t2.date
    ...
    left outer join table2
      on t3.date = date_sub(t2.date, 1) and t3.table2_unique_col is null
    ...
    left outer join table2
      on t4.date = date_sub(t2.date, 2) and t4.table2_unique_col is null

and so on, but that seems both highly inefficient (so many joins!) and brittle (what if in the future I want 7-day lookbacks?). I'm by no means a SQL expert, but it feels like this might reduce cleanly to a well-constructed join or group key.
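What I suspect I'm after is a single range (non-equi) join that pulls in every table 2 row within the 5-day window, followed by a window ranking that keeps only the freshest match per table 1 row. Here is a sketch of that pattern using SQLite from Python, purely to check the semantics; the table names, columns, and data are made up, and in Spark SQL / HiveQL the date arithmetic would be date_sub(t1.date, 5) rather than SQLite's date() modifier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE table1 (date TEXT, attribute TEXT, t1_value INTEGER);
CREATE TABLE table2 (date TEXT, attribute TEXT, t2_value INTEGER);
INSERT INTO table1 VALUES
  ('2015-06-02', 'a', 1),
  ('2015-06-05', 'a', 2),
  ('2015-06-09', 'a', 3);
INSERT INTO table2 VALUES
  ('2015-06-01', 'a', 10),
  ('2015-06-02', 'a', 11);
""")

# One range join grabs all table2 rows up to 5 days back; ROW_NUMBER
# then keeps only the most recent table2 row per table1 row.
rows = cur.execute("""
SELECT date, attribute, t1_value, t2_value FROM (
  SELECT t1.date, t1.attribute, t1.t1_value, t2.t2_value,
         ROW_NUMBER() OVER (
           PARTITION BY t1.rowid
           ORDER BY t2.date DESC
         ) AS rn
  FROM table1 t1
  LEFT JOIN table2 t2
    ON  t2.attribute = t1.attribute
    AND t2.date BETWEEN date(t1.date, '-5 day') AND t1.date
)
WHERE rn = 1
ORDER BY date
""").fetchall()
# The LEFT JOIN preserves table1 rows with no in-window match (t2_value NULL),
# and changing the lookback to 7 days is a one-character edit.
```

The chain of N joins collapses into one join plus one window function, and the lookback width becomes a parameter instead of a structural property of the query. One caveat I'm unsure about: older Hive versions may reject non-equi conditions in an outer join's ON clause, in which case the range predicate might have to move into a WHERE filter over a join on attribute alone.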