Validating the Congestion Recognition Model (Computing the Loss)

quantapia 2018. 5. 17. 18:13

For the congestion recognition model project currently in progress, I measured the performance of the Linear Regression model implemented in Spark.

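First, I ran cross-validation with the CrossValidator class.

By default, a CrossValidator splits the loaded data into three folds. In each round, two folds are used for training and the remaining fold is held out for evaluation (roughly a 67:33 ratio), so every fold serves as the held-out set once.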

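import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Default 3-fold cross-validation over the pipeline and the parameter grid.
// RegressionEvaluator uses RMSE as its default metric.
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator().setLabelCol("loading_time"))
  .setEstimatorParamMaps(paramGrid)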
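The three-fold split can be illustrated without Spark. The following is a minimal sketch in plain Scala, not Spark's implementation: the object name `KFoldSketch` and the round-robin fold assignment are assumptions made for readability, whereas Spark assigns rows to folds randomly.

```scala
object KFoldSketch {
  /** Splits rows into numFolds (training, validation) pairs.
    * Round-robin assignment is an assumption for readability;
    * Spark assigns rows to folds randomly.
    */
  def kFold[A](rows: Seq[A], numFolds: Int): Seq[(Seq[A], Seq[A])] = {
    val indexed = rows.zipWithIndex
    (0 until numFolds).map { fold =>
      // Rows whose index falls in this fold are held out; the rest are used for training.
      val (validation, training) = indexed.partition { case (_, i) => i % numFolds == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 9).toSeq
    kFold(rows, numFolds = 3).zipWithIndex.foreach { case ((train, valid), i) =>
      println(s"split $i: train=${train.size} rows, validation=${valid.size} rows")
    }
  }
}
```

With nine rows and three folds, every split trains on six rows and validates on three, and each row is held out exactly once.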

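On the Spark site (spark.apache.org), CrossValidator is covered in the ML Tuning section on model selection and hyperparameter tuning.

Cross-validating as above builds a model for each parameter combination in the ParamGrid and then selects the best one as BestModel.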

Suppose the parameter grid is configured as follows:

import org.apache.spark.ml.tuning.ParamGridBuilder

val paramGrid = new ParamGridBuilder()
  .addGrid(lrModel.maxIter, Array(5, 10))
  .addGrid(scaler.withMean, Array(false))
  .addGrid(lrModel.regParam, Array(0.1, 0.01))
  .addGrid(lrModel.fitIntercept) // Boolean param: tries both true and false
  .addGrid(lrModel.elasticNetParam, Array(0.1, 0.5, 1.0))
  .build()

Then fit the estimator on the training data:

val cvModel = crossval.fit(usageDF)

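Every combination of the parameters above is trained and evaluated, giving 24 candidate models, and the one that performs best (lowest loss) is selected as BestModel.

The number of combinations is 2 × 1 × 2 × 2 × 3 = 24 (maxIter × withMean × regParam × fitIntercept × elasticNetParam). With the default three folds, each combination is fitted once per fold, which is why the log below contains "Train split 0", "Train split 1", and "Train split 2".

(※ bestModel is a member of CrossValidatorModel. See the Scaladoc.)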
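The count of 24 can be checked by expanding the grid by hand. This is a small sketch in plain Scala that mimics what a parameter grid expands to; `ParamGridSketch` and its `grid` helper are hypothetical names, not part of Spark, and the parameter values are the ones from the grid above.

```scala
object ParamGridSketch {
  /** Cartesian product of named parameter value lists. */
  def grid(params: Seq[(String, Seq[Any])]): Seq[Map[String, Any]] =
    params.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (name, values)) =>
      // Extend every partial combination with every value of the next parameter.
      for (combo <- acc; v <- values) yield combo + (name -> v)
    }

  // Same values as the ParamGridBuilder calls in this post.
  val params: Seq[(String, Seq[Any])] = Seq(
    "maxIter"         -> Seq(5, 10),
    "withMean"        -> Seq(false),
    "regParam"        -> Seq(0.1, 0.01),
    "fitIntercept"    -> Seq(true, false),
    "elasticNetParam" -> Seq(0.1, 0.5, 1.0)
  )

  def main(args: Array[String]): Unit = {
    val combos = grid(params)
    println(s"parameter sets: ${combos.size}")
    println(s"fits over 3 folds: ${combos.size * 3}")
  }
}
```

The expansion yields 24 distinct parameter sets, so three folds mean 72 fits during the search.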

18/05/17 13:03:56 DEBUG CrossValidator: Train split 0 with multiple sets of parameters.

18/05/17 13:04:09 DEBUG CrossValidator: Got metric 861.5433748327362 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 5,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:10 DEBUG CrossValidator: Got metric 765.1945365376563 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 10,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:11 DEBUG CrossValidator: Train split 1 with multiple sets of parameters.

18/05/17 13:04:21 DEBUG CrossValidator: Got metric 860.2616134905805 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 5,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:21 DEBUG CrossValidator: Got metric 771.5677875027482 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 10,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:21 DEBUG CrossValidator: Got metric 1032.806862086392 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: false,

linReg_c10240015ee3-maxIter: 5,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:21 DEBUG CrossValidator: Got metric 880.478785585659 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: false,

linReg_c10240015ee3-maxIter: 10,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:22 DEBUG CrossValidator: Train split 2 with multiple sets of parameters.

18/05/17 13:04:32 DEBUG CrossValidator: Got metric 848.497654444327 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 5,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:32 DEBUG CrossValidator: Got metric 770.5945554442175 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 10,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:32 DEBUG CrossValidator: Got metric 1003.4603667812381 for model trained with {

linReg_c10240015ee3-elasticNetParam: 0.1,

linReg_c10240015ee3-fitIntercept: false,

linReg_c10240015ee3-maxIter: 5,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

18/05/17 13:04:33 INFO CrossValidator: Average cross-validation metrics: WrappedArray(856.7675475892145, 769.1189598282073, 1020.3820859619491, 871.7285468309349, 869.4393840467415, 781.1568743893291, 1028.9449638475828, 923.7548421118497, 834.9345231915706, 761.6526179238158, 998.0326064580495, 829.2050683755648, 869.4394673648123, 764.3736988122956, 1028.9451056318928, 906.5751822767113, 834.9406332620439, 761.6612272333043, 998.033582376551, 829.1057383186082, 856.7662036507334, 769.1155221736232, 1020.382070629361, 871.7317300580872)

18/05/17 13:04:33 INFO CrossValidator: Best set of parameters:

{

linReg_c10240015ee3-elasticNetParam: 0.5,

linReg_c10240015ee3-fitIntercept: true,

linReg_c10240015ee3-maxIter: 10,

linReg_c10240015ee3-regParam: 0.1,

stdScal_c903a8356436-withMean: false

}
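How the log values relate to each other can be reproduced in a few lines. The sketch below, in plain Scala with the hypothetical object name `CvMetricSketch`, averages the three per-fold metrics logged for one parameter set and then picks the smallest average. It assumes the metric is RMSE (the RegressionEvaluator default), where smaller is better; all numbers are copied from the log.

```scala
object CvMetricSketch {
  // Per-fold metrics from the log for
  // {elasticNetParam 0.1, fitIntercept true, maxIter 5, regParam 0.1}.
  val foldMetrics: Seq[Double] =
    Seq(861.5433748327362, 860.2616134905805, 848.497654444327)

  // "Average cross-validation metrics" copied from the log.
  val avgMetrics: Vector[Double] = Vector(
    856.7675475892145, 769.1189598282073, 1020.3820859619491, 871.7285468309349,
    869.4393840467415, 781.1568743893291, 1028.9449638475828, 923.7548421118497,
    834.9345231915706, 761.6526179238158, 998.0326064580495, 829.2050683755648,
    869.4394673648123, 764.3736988122956, 1028.9451056318928, 906.5751822767113,
    834.9406332620439, 761.6612272333043, 998.033582376551, 829.1057383186082,
    856.7662036507334, 769.1155221736232, 1020.382070629361, 871.7317300580872
  )

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  /** Index of the best average; for RMSE, smaller is better. */
  def bestIndex(metrics: Seq[Double], largerIsBetter: Boolean): Int =
    metrics.indexOf(if (largerIsBetter) metrics.max else metrics.min)

  def main(args: Array[String]): Unit = {
    println(s"mean of the three folds = ${mean(foldMetrics)}")
    val i = bestIndex(avgMetrics, largerIsBetter = false)
    println(s"best index = $i, best average metric = ${avgMetrics(i)}")
  }
}
```

The fold mean agrees with the first entry of the averaged array up to floating-point rounding, and the smallest average is the value the log reports as the best cross-validation metric.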

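After casting BestModel and pulling out the Linear Regression model, calling its summary gives the RMSE and R2 shown below.

(※ bestModel is declared as Model[_], so it has to be cast to the concrete model type before use.)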

18/05/17 13:04:33 INFO CrossValidator: Best cross-validation metric: 761.6526179238158.

Coefficients: [-20.861792343588,-216.55935076982016,0.0,-215.8589621083975,-77.23500485674653,-95.19472690480272,-100.4687834312409,35.360700469222586,17.15761764296805,120.7430852613709,-217.58899272477564,47.23857553027786,62.72378280985852,-3.3697283596587617,1205.258504026766,36.526816853451194,-79.25765250790774,28.0318725158937,-33.338238283046,80.86053067064137]

Intercept: 1674.2473155804978

RMSE: 754.9381498513209

r2: 0.48882869105365834

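CrossValidator can report a loss because it holds out part of the input data for evaluation in each fold. Note that the cross-validation metric comes from those held-out folds, while the RMSE and r2 printed from the model summary are computed on the data the final model was trained on.

The output above shows only RMSE and R2, so to look at more loss functions, on separate test data, you need to create a RegressionEvaluator.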

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Make predictions on the separate test DataFrame.
val predictions = cvModel.transform(TestUsageDF)
println(s"row count = ${predictions.count()}")
// Preview label vs. prediction instead of collecting every row to the driver.
predictions.select("loading_time", "prediction").show(10)

println("Regression Evaluator")

val ev_rmse = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("rmse") // one of: rmse, mse, r2, mae
val rmse = ev_rmse.evaluate(predictions)
println(s"Root Means Squared Error (RMSE) on test data = $rmse" + "\n")

val ev_mse = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("mse")
val mse = ev_mse.evaluate(predictions)
println(s"Means Squared Error (MSE) on test data = $mse" + "\n")

val ev_mae = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("mae")
val mae = ev_mae.evaluate(predictions)
println(s"Means Absolute Error (MAE) on test data = $mae" + "\n")

val ev_r2 = new RegressionEvaluator()
  .setLabelCol("loading_time")
  .setPredictionCol("prediction")
  .setMetricName("r2")
val r2 = ev_r2.evaluate(predictions)
println(s"R2 (R Square) on test data = $r2" + "\n")

Running this prints several loss functions at once, which gives a much better basis for judging the model's performance.

Root Means Squared Error (RMSE) on test data = 454.8104696490563

Means Squared Error (MSE) on test data = 206852.56330239514

Means Absolute Error (MAE) on test data = 292.4397966771358

R2 (R Square) on test data = 0.5247370894934956
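For reference, the four metrics can be written out by hand. The following is a minimal plain-Scala sketch using the standard definitions, with R2 taken as 1 − SSE/SST; the object name `RegressionMetricsSketch` and the toy values are hypothetical and unrelated to the project data.

```scala
object RegressionMetricsSketch {
  final case class Metrics(mse: Double, rmse: Double, mae: Double, r2: Double)

  /** Computes MSE, RMSE, MAE and R2 for paired labels and predictions. */
  def evaluate(labels: Seq[Double], predictions: Seq[Double]): Metrics = {
    require(labels.nonEmpty && labels.size == predictions.size,
      "labels and predictions must be non-empty and of equal length")
    val n = labels.size.toDouble
    val errors = labels.zip(predictions).map { case (y, p) => y - p }
    val sse = errors.map(e => e * e).sum               // sum of squared errors
    val mse = sse / n
    val mae = errors.map(e => math.abs(e)).sum / n
    val mean = labels.sum / n
    val sst = labels.map(y => (y - mean) * (y - mean)).sum // total sum of squares
    Metrics(mse, math.sqrt(mse), mae, 1.0 - sse / sst)
  }

  def main(args: Array[String]): Unit = {
    // Toy values for illustration only.
    val labels = Seq(1.0, 2.0, 3.0, 4.0)
    val predictions = Seq(1.5, 2.0, 2.5, 4.0)
    println(evaluate(labels, predictions))
  }
}
```

Because RMSE is the square root of MSE, the two values printed above can be checked against each other: the square root of 206852.56330239514 is about 454.81.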

The key point: the CrossValidator route can also give you a loss, but as used above it yields only two metrics (RMSE and R2). To see more loss functions, you have to create a RegressionEvaluator!

License: Attribution, NonCommercial, NoDerivatives
