Hello, I am using a linear SVM to train my model and generate a separating line through my data. However, my model always predicts 1 for every example. Here is my code:
print data_rdd.take(5)
[LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]), LabeledPoint(1.0, [2.9781,4.5651]), LabeledPoint(1.0, [2.932,3.5519]), LabeledPoint(1.0, [3.5772,2.856])]
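from pyspark.mllib.classification import SVMWithSGD
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from sklearn.svm import SVC

# Each row of x_df is (features, label): x[0] is the feature vector, x[1] the label
data_rdd = x_df.map(lambda x: LabeledPoint(x[1], x[0]))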
model = SVMWithSGD.train(data_rdd, iterations=1000,regParam=1)
X = x_df.map(lambda x: x[0]).collect()
Y = x_df.map(lambda x: x[1]).collect()

pred = []
for i in X:
    pred.append(model.predict(i))
print pred
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Thank you for helping me. But I don't understand what I did wrong earlier. I had done the same thing: split the data into features and labels and then train on it. Moreover, the predictions are still not entirely correct right now (there are fewer 0s at the end than there should be). Do we need to specify a threshold?
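On the threshold question: as far as I know, MLlib's SVMModel compares the raw margin w.x + b against a threshold that defaults to 0.0. It also exposes setThreshold(value) to change the threshold and clearThreshold() to make predict return the raw margin instead of a 0/1 label. Below is a minimal pure-Python sketch of that decision rule, not the MLlib implementation itself. The weights, the intercept, and the third point are made-up values for illustration only; the first two points are taken from the data shown above.

```python
# Sketch of how a linear SVM turns a raw margin into a 0/1 label.
# NOTE: weights and intercept are hypothetical, not taken from a trained model.

def margin(weights, intercept, features):
    """Raw score w . x + b for a single example."""
    return sum(w * x for w, x in zip(weights, features)) + intercept


def predict(weights, intercept, features, threshold=0.0):
    """Return 1 if the margin exceeds the threshold, otherwise 0."""
    return 1 if margin(weights, intercept, features) > threshold else 0


# Assumed model parameters (illustration only).
weights = [0.5, 0.5]
intercept = -3.0

# First two points come from the data above; the third is invented.
points = [[1.9643, 4.5957], [3.5772, 2.856], [1.0, 1.0]]

# Labels with the default threshold of 0.0.
default_labels = [predict(weights, intercept, p) for p in points]

# Raising the threshold flips points whose margin is only slightly positive.
strict_labels = [predict(weights, intercept, p, threshold=0.25) for p in points]
```

With these assumed values, the second point has a small positive margin, so it is labelled 1 at the default threshold and 0 at the stricter one. In this sketch the intercept is what lets the boundary sit away from the origin. SVMWithSGD.train takes an intercept argument that, as far as I recall, defaults to False, so it may be worth checking alongside the threshold.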