python - Inconsistent GroupBy.count() with DataFrame?


I've come across some strange (and quite unacceptable) behavior of a DataFrame (df). In my case, I have a simple df queried from Hive (a subset of 500 observations).

The variable eventtype can take the values clc or imp. Initially, I wanted to create a new variable target that is 1 for clc and 0 for imp, as follows:

from pyspark.sql.functions import col, when
df = df.withColumn("target", when(col("eventtype") == 'clc', 1).otherwise(0))

So, upon checking that this works by comparing the counts, I get:

>>> df.groupby('target').count().show()
+------+-----+
|target|count|
+------+-----+
|     1|    3|
|     0|  497|
+------+-----+

>>> df.groupby('eventtype').count().show()
+---------+-----+
|eventtype|count|
+---------+-----+
|      clc|    6|
|      imp|  494|
+---------+-----+

But when I check where the mistakes occur ("eventtype='clc' and target=0" and "eventtype='imp' and target=1"), I get an empty df.
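The two counts above come from separate actions, and each action is evaluated on its own, so comparing them says little if the underlying rows can differ between runs. A check that counts every (eventtype, target) pair in a single pass over the same rows avoids that. Below is a minimal plain-Python sketch of the idea; the rows, the visitorid values and the helper add_target are made up for illustration. In PySpark the analogous single query would be df.groupBy('eventtype', 'target').count().

```python
from collections import Counter

# Hypothetical rows standing in for one fixed snapshot of the data.
rows = [
    {"visitorid": "v1", "eventtype": "clc"},
    {"visitorid": "v2", "eventtype": "imp"},
    {"visitorid": "v3", "eventtype": "imp"},
    {"visitorid": "v4", "eventtype": "clc"},
]


def add_target(row):
    """Return a copy of the row with target = 1 for 'clc', else 0."""
    return {**row, "target": 1 if row["eventtype"] == "clc" else 0}


labelled = [add_target(r) for r in rows]

# One pass over the same rows: count every (eventtype, target) pair.
pair_counts = Counter((r["eventtype"], r["target"]) for r in labelled)

# Pairs that would indicate a labelling mistake.
mismatches = pair_counts[("clc", 0)] + pair_counts[("imp", 1)]

print(dict(pair_counts))
print("mismatches:", mismatches)
```

Because both columns are read from the same rows, the pair counts cannot disagree with each other the way two independent counts can.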

Then I go on to call this:

>>> df.filter("eventtype='clc' and target=1").show()
+--------------------+---------+.....+------+
|           visitorid|eventtype|.....|target|
+--------------------+---------+.....+------+
|b2-044ae1d1b285-b...|      clc|.....|     1|
|b2-041513341845-b...|      clc|.....|     1|
|b2-044ae1d1adc5-b...|      clc|.....|     1|
+--------------------+---------+.....+------+

>>> df.filter("eventtype='clc'").show()
+--------------------+---------+.....+------+
|           visitorid|eventtype|.....|target|
+--------------------+---------+.....+------+
|b2-04b06a7db205-e...|      clc|.....|     1|
|b2-041c9bc173c5-b...|      clc|.....|     1|
+--------------------+---------+.....+------+

So the number of observations with clc changes and, more importantly, the rows have different visitorid values?

Now, when I run a simple

>>> df.groupby('eventtype').count().show()
+---------+-----+
|eventtype|count|
+---------+-----+
|      clc|    6|
|      imp|  494|
+---------+-----+

>>> df.groupby('eventtype').count().show()
+---------+-----+
|eventtype|count|
+---------+-----+
|      clc|    5|
|      imp|  495|
+---------+-----+

It changes every time! How is this possible? Is it a wrong config, or is the data somehow being loaded from the underlying DB each time? Once read, the df should be immutable. (Plus, the data should be the same anyway, as it is uploaded only every 2-3 ...)
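One possible explanation, which is an assumption and not something confirmed here, is that the df is evaluated lazily: each action such as show() or count() re-runs the whole plan, including the Hive query. If the 500-row subset is selected in a way that is not deterministic (for example a limit without an ordering, or a random sample), every action may see different rows. The plain-Python sketch below only imitates that effect; unstable_source and count_by_eventtype are hypothetical helpers, not Spark APIs.

```python
import random


def unstable_source(n_rows=500):
    """Hypothetical stand-in for a query whose row selection is not
    deterministic: every call may return a different set of rows."""
    return [
        {
            "visitorid": f"v{random.randrange(10**6)}",
            "eventtype": "clc" if random.random() < 0.01 else "imp",
        }
        for _ in range(n_rows)
    ]


def count_by_eventtype(rows):
    """Count rows per eventtype, like a group-by followed by a count."""
    counts = {"clc": 0, "imp": 0}
    for row in rows:
        counts[row["eventtype"]] += 1
    return counts


# Lazy style: every "action" re-runs the source, so results can drift.
lazy_counts = [count_by_eventtype(unstable_source()) for _ in range(3)]

# Materialized style: run the source once, then reuse the same rows.
snapshot = unstable_source()
snapshot_counts = [count_by_eventtype(snapshot) for _ in range(3)]

print("re-evaluated:", lazy_counts)
print("snapshot:    ", snapshot_counts)
```

If this is what is happening, pinning the rows once should make the counts stable, for instance by writing the subset to its own table or by calling df.cache() before the first action. Note that a cache can be evicted and recomputed, so a written table is the more robust test.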

