python - Is it good to normalize/standardize data that has a large number of features with zeros?
I have data with around 60 features, and most of them are zero most of the time; in my training data only 2-3 columns may have values (to be precise, it is performance log data). However, the test data has values in other columns as well.
I applied normalization/standardization (tried both separately), fed the result to PCA/SVD (tried both separately), and used these features to fit the model, but it gives inaccurate results.
Whereas, if I skip the normalization/standardization step and feed the data directly to PCA/SVD and then to the model, it gives accurate results (roughly 90% accuracy or better).
P.S.: I'm doing anomaly detection using the Isolation Forest algorithm.
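Roughly, the two pipelines look like this (a minimal sketch with scikit-learn and placeholder random data, not my actual code or dataset):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    X_train = rng.rand(200, 60)   # placeholder for the 60-feature log data
    X_test = rng.rand(50, 60)

    # Pipeline 1: standardize -> PCA -> Isolation Forest (inaccurate for me)
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=10).fit(scaler.transform(X_train))
    iso = IsolationForest(random_state=0).fit(pca.transform(scaler.transform(X_train)))
    pred_scaled = iso.predict(pca.transform(scaler.transform(X_test)))

    # Pipeline 2: PCA directly on the raw data -> Isolation Forest (accurate for me)
    pca_raw = PCA(n_components=10).fit(X_train)
    iso_raw = IsolationForest(random_state=0).fit(pca_raw.transform(X_train))
    pred_raw = iso_raw.predict(pca_raw.transform(X_test))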
Why do these results vary?
Normalization and standardization (depending on the source they are used interchangeably, so I'm not sure what you mean by each one here, but in this case it's not important) are a general recommendation that works well in problems where the data is more or less homogeneously distributed. Anomaly detection is, by definition, not that kind of problem. If you have a data set where most examples belong to class A and a few belong to class B, it is possible (if not likely) that the sparse features (the features that are mostly zero) are the discriminative ones for the problem. Normalizing them will turn them into values at or near zero, making it hard for the classifier (or PCA/SVD) to grasp their importance. So it is not unreasonable that you get better accuracy if you skip normalization, and you shouldn't feel you are doing something "wrong" just because you are "supposed to do it".
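A toy illustration of what I mean (made-up numbers, not your data): after StandardScaler every column ends up with unit variance, so a sparse column whose rare non-zero values were large no longer stands out to PCA/SVD, and its zeros are shifted to a non-zero constant.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    sparse = np.array([0.0] * 9 + [80.0])   # 0 in 90% of rows, 80 otherwise
    dense = np.linspace(-1.0, 1.0, 10)      # ordinary small-scale feature
    X = np.column_stack([sparse, dense])

    print(X.var(axis=0))                    # raw variances: [576.0, ~0.41]
    Xs = StandardScaler().fit_transform(X)
    print(Xs.var(axis=0))                   # both columns are now 1.0
    print(np.unique(Xs[:, 0]))              # zeros became -1/3, the spike became 3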
I don't have experience with anomaly detection, but I do with unbalanced data sets. You could consider some form of "weighted normalization", where the computation of the mean and variance of each feature is weighted with a value inversely proportional to the number of examples in the class (e.g. examples_a ^ alpha / (examples_a ^ alpha + examples_b ^ alpha), with alpha a small negative number). If your sparse features have different scales (e.g. one is 0 in 90% of cases and 3 in 10% of cases, another is 0 in 90% of cases and 80 in 10% of cases), you could scale them to a common range (e.g. [0, 1]).
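A rough sketch of that weighting idea (the helper below is something I'm making up for illustration, not an existing scikit-learn transformer; for the common-range part, MinMaxScaler already maps each feature to [0, 1]):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    def class_weighted_standardize(X, y, alpha=-0.5):
        """Standardize X with per-example weights n_class**alpha / sum_k(n_k**alpha),
        so that smaller classes contribute relatively more to the mean/variance."""
        classes, counts = np.unique(y, return_counts=True)
        class_w = counts.astype(float) ** alpha
        class_w /= class_w.sum()
        w = class_w[np.searchsorted(classes, y)]   # weight of each example's class

        mean = np.average(X, axis=0, weights=w)
        var = np.average((X - mean) ** 2, axis=0, weights=w)
        return (X - mean) / np.sqrt(var + 1e-12)

    # made-up usage: 95 "normal" rows, 5 anomalous rows
    rng = np.random.RandomState(0)
    X = np.vstack([rng.rand(95, 3), 5 + rng.rand(5, 3)])
    y = np.array([0] * 95 + [1] * 5)
    Xw = class_weighted_standardize(X, y, alpha=-0.5)
    X01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # common [0, 1] range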
In any case, as I said, don't apply these techniques just because they are supposed to work. If something doesn't work for your problem or your particular dataset, you are entitled not to use it (and trying to understand why it doesn't work may yield useful insights).