python - Inconsistent GroupBy.count() with DataFrame?
I've come across a strange (and quite unacceptable) behavior of a DataFrame. In my case, I have a simple DF queried from Hive (a subset of 500 observations). The variable eventtype can take the values clc or imp. Initially, I wanted to create a new variable target that is 1 for clc and 0 for imp, as follows:
from pyspark.sql.functions import col, when
df = df.withColumn("target", when(col("eventtype") == 'clc', 1).otherwise(0))
So, upon checking whether this works by comparing the counts, I get:
>>> df.groupby('target').count().show()
+------+-----+
|target|count|
+------+-----+
|     1|    3|
|     0|  497|
+------+-----+
>>> df.groupby('eventtype').count().show()
+---------+-----+
|eventtype|count|
+---------+-----+
|      clc|    6|
|      imp|  494|
+---------+-----+
But when I check where the mistakes occur (filtering on "eventtype='clc' and target=0" or "eventtype='imp' and target=1"), I get an empty DF.
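The check described above compares counts that come from separate evaluations of the DataFrame. A minimal sketch of an alternative check, written in plain Python rather than Spark, cross-tabulates both columns in a single pass so that every number comes from the same rows. The `rows` list and the `add_target` helper are hypothetical stand-ins for illustration, not part of the original code.

```python
from collections import Counter

# Hypothetical in-memory rows standing in for the queried subset.
rows = [{"eventtype": "clc"}] * 3 + [{"eventtype": "imp"}] * 7


def add_target(row):
    # Same rule as the when/otherwise expression: 1 for clc, 0 otherwise.
    return {**row, "target": 1 if row["eventtype"] == "clc" else 0}


labelled = [add_target(r) for r in rows]

# Cross-tabulate both columns in one pass over the same rows.
crosstab = Counter((r["eventtype"], r["target"]) for r in labelled)

# A ('clc', 0) or ('imp', 1) pair would be a genuine labelling mistake.
mismatches = {
    pair: n for pair, n in crosstab.items()
    if (pair[0] == "clc") != (pair[1] == 1)
}

print(crosstab)
print(mismatches)
```

In Spark, the analogous single action would be grouping by both columns at once, for example `df.groupby('eventtype', 'target').count().show()`.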
Then I call this:
>>> df.filter("eventtype='clc' and target=1").show()
+--------------------+---------+.....+------+
|           visitorid|eventtype|.....|target|
+--------------------+---------+.....+------+
|b2-044ae1d1b285-b...|      clc|.....|     1|
|b2-041513341845-b...|      clc|.....|     1|
|b2-044ae1d1adc5-b...|      clc|.....|     1|
+--------------------+---------+.....+------+
>>> df.filter("eventtype='clc'").show()
+--------------------+---------+.....+------+
|           visitorid|eventtype|.....|target|
+--------------------+---------+.....+------+
|b2-04b06a7db205-e...|      clc|.....|     1|
|b2-041c9bc173c5-b...|      clc|.....|     1|
+--------------------+---------+.....+------+
So the number of observations with clc changes and, more importantly, they have different visitorid values?
Now, when I run a simple
>>> df.groupby('eventtype').count().show()
+---------+-----+
|eventtype|count|
+---------+-----+
|      clc|    6|
|      imp|  494|
+---------+-----+
>>> df.groupby('eventtype').count().show()
+---------+-----+
|eventtype|count|
+---------+-----+
|      clc|    5|
|      imp|  495|
+---------+-----+
It changes every time! How is this possible? Is it a wrong config, or is it somehow loading the data from the underlying DB each time? I thought that once a DF is read it should be immutable. (Plus, the data should be the same anyway, as it is only uploaded 2-3 times.)
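One possible explanation can be sketched without Spark, under an assumption the post does not confirm: that the 500-row subset comes from a query with no deterministic ordering (for example a plain limit). Spark DataFrames are evaluated lazily, so each action such as `show()` or `count()` re-runs the underlying query unless the result is cached. The `TABLE` data and the `lazy_subset` helper below are hypothetical stand-ins that mimic this behavior in plain Python.

```python
import random
from collections import Counter

# Hypothetical source table: 10,000 events, roughly 1% of them clicks.
random.seed(0)
TABLE = ["clc" if random.random() < 0.01 else "imp" for _ in range(10_000)]


def lazy_subset(n=500):
    """Stand-in for a lazily evaluated query that returns *some* n rows.

    Without an explicit ordering, which rows come back is arbitrary,
    so every call may pick a different subset.
    """
    return random.sample(TABLE, n)


def count_by_type(rows):
    # Equivalent of groupby('eventtype').count() on a list of labels.
    return Counter(rows)


# Each "action" re-runs the query, so the counts can drift between calls.
first = count_by_type(lazy_subset())
second = count_by_type(lazy_subset())

# Materialising the subset once gives a stable snapshot to count against.
snapshot = lazy_subset()
stable_a = count_by_type(snapshot)
stable_b = count_by_type(snapshot)

print(first, second)
print(stable_a == stable_b)  # True
```

If this assumption holds, the analogous step in Spark would typically be to persist the subset before comparing counts, for example with `df.cache()` followed by an action, or by writing the 500 rows to their own table.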