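In our ongoing project to develop a congestion-aware model, I measured the performance of a Linear Regression model implemented in Spark.
First, I ran cross-validation using the CrossValidator class.
With its default setting (numFolds = 3), CrossValidator splits the loaded data into three folds. In each split, two folds are used for training and the remaining fold for validation (roughly a 67:33 ratio), and every fold takes a turn as the validation set.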
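To make the fold rotation concrete, here is a minimal plain-Scala sketch of a 3-fold split (no Spark required). The helper name kFoldSplits and the round-robin assignment of rows to folds are assumptions made for illustration; Spark assigns rows to folds randomly.

```scala
// Illustrative k-fold split in plain Scala; this is not Spark's implementation.
// Row index i goes to fold (i % k). Spark assigns rows to folds randomly instead.
def kFoldSplits(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
  (0 until k).map { fold =>
    // Rows of the current fold are held out for validation, the rest are used for training.
    val (validation, training) = (0 until n).partition(_ % k == fold)
    (training, validation)
  }

val splits = kFoldSplits(n = 9, k = 3)
splits.zipWithIndex.foreach { case ((training, validation), i) =>
  println(s"split $i: train=${training.mkString(",")} validation=${validation.mkString(",")}")
}
```

Each row ends up in the validation set exactly once, which is why the metric can be averaged over the three splits.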
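import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// pipeline and paramGrid are defined elsewhere; numFolds keeps its default of 3
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setLabelCol("loading_time"))
  .setEstimatorParamMaps(paramGrid)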
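In the official Spark documentation, CrossValidator is covered in the section on model selection and hyperparameter tuning.
Running cross-validation as above builds a model for each parameter combination in the ParamGrid
and selects the best one as bestModel.
Suppose the parameter grid is built as follows: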
val paramGrid = new ParamGridBuilder()
  .addGrid(lrModel.maxIter, Array(5, 10))
  .addGrid(scaler.withMean, Array(false))
  .addGrid(lrModel.regParam, Array(0.1, 0.01))
  .addGrid(lrModel.fitIntercept) // BooleanParam: expands to true and false
  .addGrid(lrModel.elasticNetParam, Array(0.1, 0.5, 1.0))
  .build()
Then fit the estimator on the training data:
val cvModel = crossval.fit(usageDF)
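The parameter values above are combined exhaustively, giving 2 * 1 * 2 * 2 * 3 = 24 parameter combinations.
Each combination is trained and evaluated on every split, and the combination with the best average metric (the lowest one, since the evaluator's default metric is RMSE) is selected. bestModel is that combination refit on the whole input dataset.
(※ bestModel is a member of CrossValidatorModel. See the Scaladoc.)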
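The combination count can be checked with a few lines of plain Scala. This is only an arithmetic sketch: the strings are labels, not Spark Param objects, and the value counts are read off the ParamGridBuilder call above.

```scala
// Number of candidate values per parameter, taken from the grid definition above.
val grid: Seq[(String, Int)] = Seq(
  "maxIter"         -> 2, // Array(5, 10)
  "withMean"        -> 1, // Array(false)
  "regParam"        -> 2, // Array(0.1, 0.01)
  "fitIntercept"    -> 2, // a BooleanParam grid expands to true and false
  "elasticNetParam" -> 3  // Array(0.1, 0.5, 1.0)
)

val numCombinations = grid.map(_._2).product  // size of the full cartesian product
val numFolds = 3                              // CrossValidator default
val fitsDuringCv = numCombinations * numFolds // one fit per combination per split

println(s"$numCombinations combinations, $fitsDuringCv fits during cross-validation")
```

The grid grows multiplicatively, so adding one more parameter with three values would triple the training time.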
18/05/17 13:03:56 DEBUG CrossValidator: Train split 0 with multiple sets of parameters.
18/05/17 13:04:09 DEBUG CrossValidator: Got metric 861.5433748327362 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 5,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
18/05/17 13:04:10 DEBUG CrossValidator: Got metric 765.1945365376563 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 10,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
.
.
.
.
.
.
18/05/17 13:04:11 DEBUG CrossValidator: Train split 1 with multiple sets of parameters.
18/05/17 13:04:21 DEBUG CrossValidator: Got metric 860.2616134905805 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 5,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
18/05/17 13:04:21 DEBUG CrossValidator: Got metric 771.5677875027482 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 10,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
18/05/17 13:04:21 DEBUG CrossValidator: Got metric 1032.806862086392 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: false,
linReg_c10240015ee3-maxIter: 5,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
18/05/17 13:04:21 DEBUG CrossValidator: Got metric 880.478785585659 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: false,
linReg_c10240015ee3-maxIter: 10,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
.
.
.
.
.
.
18/05/17 13:04:22 DEBUG CrossValidator: Train split 2 with multiple sets of parameters.
18/05/17 13:04:32 DEBUG CrossValidator: Got metric 848.497654444327 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 5,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
18/05/17 13:04:32 DEBUG CrossValidator: Got metric 770.5945554442175 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 10,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
18/05/17 13:04:32 DEBUG CrossValidator: Got metric 1003.4603667812381 for model trained with {
linReg_c10240015ee3-elasticNetParam: 0.1,
linReg_c10240015ee3-fitIntercept: false,
linReg_c10240015ee3-maxIter: 5,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}.
.
.
.
.
.
.
18/05/17 13:04:33 INFO CrossValidator: Average cross-validation metrics: WrappedArray(856.7675475892145, 769.1189598282073, 1020.3820859619491, 871.7285468309349, 869.4393840467415, 781.1568743893291, 1028.9449638475828, 923.7548421118497, 834.9345231915706, 761.6526179238158, 998.0326064580495, 829.2050683755648, 869.4394673648123, 764.3736988122956, 1028.9451056318928, 906.5751822767113, 834.9406332620439, 761.6612272333043, 998.033582376551, 829.1057383186082, 856.7662036507334, 769.1155221736232, 1020.382070629361, 871.7317300580872)
18/05/17 13:04:33 INFO CrossValidator: Best set of parameters:
{
linReg_c10240015ee3-elasticNetParam: 0.5,
linReg_c10240015ee3-fitIntercept: true,
linReg_c10240015ee3-maxIter: 10,
linReg_c10240015ee3-regParam: 0.1,
stdScal_c903a8356436-withMean: false
}
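The selection step can be reproduced by hand from the log. The sketch below copies the 24 averages from the "Average cross-validation metrics" line and picks the smallest one, because RMSE is an error metric where lower is better. It is plain Scala for illustration, not the code CrossValidator runs internally.

```scala
// Average metrics copied from the "Average cross-validation metrics" log line above.
val avgMetrics = Array(
  856.7675475892145, 769.1189598282073, 1020.3820859619491, 871.7285468309349,
  869.4393840467415, 781.1568743893291, 1028.9449638475828, 923.7548421118497,
  834.9345231915706, 761.6526179238158, 998.0326064580495, 829.2050683755648,
  869.4394673648123, 764.3736988122956, 1028.9451056318928, 906.5751822767113,
  834.9406332620439, 761.6612272333043, 998.033582376551, 829.1057383186082,
  856.7662036507334, 769.1155221736232, 1020.382070629361, 871.7317300580872
)

// Lower RMSE is better, so the winning combination is the one with the minimum average.
val (bestMetric, bestIndex) = avgMetrics.zipWithIndex.minBy(_._1)
println(s"best average metric = $bestMetric at index $bestIndex")
```

The position in this array corresponds to the position of the parameter combination in the array passed to setEstimatorParamMaps.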
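Casting bestModel to get at the Linear Regression model and calling its summary yields RMSE and R2, as shown below.
(※ bestModel is typed as Model[_], so a cast is required. Because the estimator here is a Pipeline, the fitted LinearRegressionModel is one of the stages of the resulting PipelineModel.)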
18/05/17 13:04:33 INFO CrossValidator: Best cross-validation metric: 761.6526179238158.
Coefficients: [-20.861792343588,-216.55935076982016,0.0,-215.8589621083975,-77.23500485674653,-95.19472690480272,-100.4687834312409,35.360700469222586,17.15761764296805,120.7430852613709,-217.58899272477564,47.23857553027786,62.72378280985852,-3.3697283596587617,1205.258504026766,36.526816853451194,-79.25765250790774,28.0318725158937,-33.338238283046,80.86053067064137]
Intercept: 1674.2473155804978
RMSE: 754.9381498513209
r2: 0.48882869105365834
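The cast described above might look like the following sketch. The helper names bestLinearRegression and printTrainingSummary are hypothetical, and the code assumes the estimator was either a Pipeline containing a LinearRegression stage or a bare LinearRegression. The original code used for this post is not shown, so treat this as one possible way to do it.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.tuning.CrossValidatorModel

// Hypothetical helper: find the fitted LinearRegressionModel inside bestModel.
def bestLinearRegression(cvModel: CrossValidatorModel): Option[LinearRegressionModel] =
  cvModel.bestModel match {
    // Pipeline estimator: search the fitted stages for the regression model.
    case pm: PipelineModel => pm.stages.collectFirst { case lr: LinearRegressionModel => lr }
    // Bare LinearRegression estimator: bestModel is the regression model itself.
    case lr: LinearRegressionModel => Some(lr)
    case _ => None
  }

// Hypothetical helper: print the training summary values shown above.
def printTrainingSummary(cvModel: CrossValidatorModel): Unit =
  bestLinearRegression(cvModel).filter(_.hasSummary).foreach { lr =>
    val s = lr.summary
    println(s"Coefficients: ${lr.coefficients}")
    println(s"Intercept: ${lr.intercept}")
    println(s"RMSE: ${s.rootMeanSquaredError}")
    println(s"r2: ${s.r2}")
  }
```

Calling printTrainingSummary(cvModel) would print the four values. Pattern matching is used instead of a bare asInstanceOf so that an unexpected model type returns None rather than throwing a ClassCastException.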
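CrossValidator holds out part of the input data as validation data in each split, which is how it can compute a loss without a separate test set.
However, it reports only the single metric configured on its evaluator (RMSE by default), and the summary above shows just RMSE and R2 for the data the model was fit on. To examine several loss functions on a separate test set,
create RegressionEvaluator instances.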
// Make predictions on the held-out test data.
val predictions = cvModel.transform(TestUsageDF)
println(s"row count = ${predictions.count()}")
predictions.select("loading_time", "prediction")
  .show() // prints the first 20 rows; the original collect() discarded its result
println("Regression Evaluator")
val ev_rmse = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("rmse") // one of: rmse, mse, r2, mae
val rmse = ev_rmse.evaluate(predictions)
println(s"Root Means Squared Error (RMSE) on test data = $rmse" + "\n")
val ev_mse = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("mse") // one of: rmse, mse, r2, mae
val mse = ev_mse.evaluate(predictions)
println(s"Means Squared Error (MSE) on test data = $mse" + "\n")
val ev_mae = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("mae") // one of: rmse, mse, r2, mae
val mae = ev_mae.evaluate(predictions)
println(s"Means Absolute Error (MAE) on test data = $mae" + "\n")
val ev_r2 = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("r2") // one of: rmse, mse, r2, mae
val r2 = ev_r2.evaluate(predictions)
println(s"R2 (R Square) on test data = $r2" + "\n")
Running the evaluator code above prints several loss functions,
which gives a better basis for judging the model's performance.
Root Means Squared Error (RMSE) on test data = 454.8104696490563
Means Squared Error (MSE) on test data = 206852.56330239514
Means Absolute Error (MAE) on test data = 292.4397966771358
R2 (R Square) on test data = 0.5247370894934956
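For reference, the four metrics can be written out in plain Scala as below. The labels and predictions are made-up sample values, not data from this experiment, and the variable names are chosen only for this sketch. It follows the standard textbook definitions rather than Spark's internal code.

```scala
// Made-up sample values, for illustration only.
val sampleLabels = Array(3.0, 5.0, 7.0, 9.0)
val samplePreds  = Array(2.0, 5.0, 8.0, 11.0)

// Residuals: label minus prediction.
val errors = sampleLabels.zip(samplePreds).map { case (y, p) => y - p }
val sumSquaredErrors = errors.map(e => e * e).sum

val mseManual  = sumSquaredErrors / errors.length             // mean squared error
val rmseManual = math.sqrt(mseManual)                         // root mean squared error
val maeManual  = errors.map(e => math.abs(e)).sum / errors.length // mean absolute error

// R2 = 1 - (residual sum of squares / total sum of squares around the label mean).
val labelMean = sampleLabels.sum / sampleLabels.length
val totalSumSquares = sampleLabels.map(y => (y - labelMean) * (y - labelMean)).sum
val r2Manual = 1.0 - sumSquaredErrors / totalSumSquares

println(s"mse=$mseManual rmse=$rmseManual mae=$maeManual r2=$r2Manual")
```

The same relationship holds in the output above: the reported RMSE (454.81...) squared is approximately the reported MSE (206852.56...).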
The key point is that the cross-validation run can give you a loss, but only RMSE and R2 were available from it here,
so to look at more loss functions you have to create RegressionEvaluator instances!
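Since the four evaluators differ only in the metric name, they could also be built in a loop. The helper below is a hypothetical sketch (evaluateAll is not part of Spark) that returns every metric in a Map. It assumes a predictions DataFrame with a label column and a "prediction" column like the one used in this post.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame

// Hypothetical helper: evaluate each metric name with a fresh RegressionEvaluator.
def evaluateAll(predictions: DataFrame,
                labelCol: String,
                predictionCol: String = "prediction"): Map[String, Double] =
  Seq("rmse", "mse", "mae", "r2").map { metric =>
    val evaluator = new RegressionEvaluator()
      .setLabelCol(labelCol)
      .setPredictionCol(predictionCol)
      .setMetricName(metric)
    metric -> evaluator.evaluate(predictions)
  }.toMap
```

With the DataFrame from this post, evaluateAll(predictions, "loading_time") would return the same four values that the separate evaluators print.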