python - Create Feature Using K-Nearest Neighbors
I'm relatively new to Python and machine learning, and I've been working on building out a predictive model for mortgage prices. I'm struggling to use the k-nearest neighbors algorithm to create a feature.
Here's how I understand the mechanics of what I want to accomplish:
- I have 2 data files: mortgages sold and mortgages listed.
- Both data files have the same features (including lat/long).
- I want to create a column in mortgages listed that represents the median price of closely related homes in the immediate area.
- I'll use the methodology listed in point 3 to create columns for 1-3 months, 4-6 months, and 7-12 months.
- Another column will hold the trend across those 3 columns.
Everything I've found is about KNN imputation, which doesn't seem to be what I'm looking for.
How would I go about executing this idea? Are there resources I may have missed that could help?
Any guidance is appreciated. Thanks!
So, as I understand it, you want to fit a KNN model using the mortgages-sold data to predict prices for the mortgages-listed data. This is a classical KNN problem: you need to find the nearest feature vectors in the sold data for each feature vector in the listed data, and take the median of the corresponding prices.
Consider that there are n rows in the sold data, with feature vectors x1, x2, ..., xn for each row and corresponding prices p1, p2, ..., pn:
X_train = [x1, x2, ..., xn]
y_train = [p1, p2, ..., pn]
Note that here each xi is a feature vector representing the ith row.
For now, suppose you want the 5 closest rows in the sold data for each row in the listed data. So, the KNN model parameter here (which might need to be optimised later) is:
number_of_neighbours = 5
Now, the training code looks like this (note that NearestNeighbors is the right tool here rather than a classifier, since you want the neighbours' actual prices back, not a single predicted class):
from sklearn.neighbors import NearestNeighbors
knn_model = NearestNeighbors(n_neighbors=number_of_neighbours)
knn_model.fit(X_train)
For prediction, consider that there are m rows in the listed data, with feature vectors f1, f2, ..., fm for each row. The corresponding median prices z1, z2, ..., zm need to be determined:
X_test = [f1, f2, ..., fm]
Note that the feature vectors in X_train and X_test should be vectorized using the same vectorizer/transformer, fitted on the training data only.
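As a sketch of that point, using StandardScaler as one possible transformer (the scaler choice and the toy lat/long values are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy lat/long rows standing in for the real sold (train) and listed (test) data.
X_train = np.array([[40.71, -74.00], [40.80, -73.95]])
X_test = np.array([[40.75, -73.98]])

scaler = StandardScaler().fit(X_train)  # fit on the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # reuse the same fitted scaler
```

Fitting the scaler on the training data alone keeps both sets on the same scale without leaking test statistics into training.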
The prediction code looks like this:
import numpy as np
distances, indices = knn_model.kneighbors(X_test)
Each row of indices contains the positions (in this case, 5 of them) of the closest rows in the sold data, so the corresponding prices in y_train are:
y_predicted = [(p11, p12, .., p15), (p21, p22, .., p25), .., (pm1, pm2, .., pm5)]
For the jth row of the listed data:
zj = np.median(np.asarray(y_train)[indices[j]])
Hence, in this way, you can find the median price zj for each row of the listed data.
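Putting those pieces together, here is a minimal runnable sketch; the coordinates and prices are made-up stand-ins for the real sold/listed files:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for the real data: lat/long feature vectors and sold prices.
X_train = np.array([[40.71, -74.00], [40.72, -74.01], [40.73, -73.99],
                    [40.80, -73.95], [40.81, -73.96], [40.82, -73.94]])
y_train = np.array([300_000, 320_000, 310_000, 500_000, 520_000, 510_000])

# Listed homes we want the "median price of nearby sold homes" feature for.
X_test = np.array([[40.715, -74.005], [40.805, -73.955]])

number_of_neighbours = 3
knn_model = NearestNeighbors(n_neighbors=number_of_neighbours)
knn_model.fit(X_train)

# indices[j] holds the positions of the nearest sold rows for listed row j.
_, indices = knn_model.kneighbors(X_test)

# Median of the neighbours' prices, one value per listed row.
median_prices = np.median(y_train[indices], axis=1)
print(median_prices)
```

Each listed home picks up the median price of its 3 nearest sold homes, which is exactly the new column you described.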
Now, coming to the parameter-optimisation part: the hyper-parameter in this KNN model is number_of_neighbours. You can find the optimal value of this parameter by splitting X_train in an 80:20 ratio: train on the 80% part and cross-validate on the remaining 20% part. Once you are sure the accuracy numbers are good enough, use that value of the hyper-parameter number_of_neighbours for prediction on X_test.
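That 80:20 search could be sketched like this, scoring each candidate number_of_neighbours by the mean absolute error of the median-of-neighbours price on the held-out 20% (the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Synthetic sold data: lat/long features, price driven by latitude plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(40.7, 40.9, size=(200, 2))
y = 300_000 + 1_000_000 * (X[:, 0] - 40.7) + rng.normal(0, 10_000, 200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_err = None, np.inf
for k in [3, 5, 10, 20]:
    nn = NearestNeighbors(n_neighbors=k).fit(X_tr)
    _, idx = nn.kneighbors(X_val)
    preds = np.median(y_tr[idx], axis=1)  # median-of-neighbours prediction
    err = np.mean(np.abs(preds - y_val))  # mean absolute error on held-out 20%
    if err < best_err:
        best_k, best_err = k, err

print(best_k, round(best_err, 2))
```

Whichever k wins here is the value you would then use when building the feature for the listed data.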
In the end, for the month-wise analysis, you will need to create month-wise models. For example, M1 = trained on 1-3 month sold data, M2 = trained on 4-6 month sold data, M3 = trained on 7-12 month sold data, etc.
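Those month-wise models could be organised like this, assuming the sold data carries a months_since_sale column (the column name and the pandas layout are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy sold data with a months-since-sale column (stand-in for the real file).
sold = pd.DataFrame({
    "lat":  [40.71, 40.72, 40.73, 40.80, 40.81, 40.82],
    "long": [-74.00, -74.01, -73.99, -73.95, -73.96, -73.94],
    "price": [300_000, 320_000, 310_000, 500_000, 520_000, 510_000],
    "months_since_sale": [1, 2, 5, 6, 8, 11],
})

windows = {"1-3": (1, 3), "4-6": (4, 6), "7-12": (7, 12)}
models = {}
for name, (lo, hi) in windows.items():
    subset = sold[sold["months_since_sale"].between(lo, hi)]
    nn = NearestNeighbors(n_neighbors=min(2, len(subset)))
    nn.fit(subset[["lat", "long"]].to_numpy())
    # Keep the prices alongside each model so medians can be looked up later.
    models[name] = (nn, subset["price"].to_numpy())

# One median-price feature per time window for a single listed home.
listed = np.array([[40.715, -74.005]])
for name, (nn, prices) in models.items():
    _, idx = nn.kneighbors(listed)
    print(name, np.median(prices[idx], axis=1))
```

Each window yields its own column (1-3 months, 4-6 months, 7-12 months), and the trend column can then be computed across those three values per listed home.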
References:
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html