on 06-24-2015 4:26 PM
The new Model Statistics and Model Compare functionality in version 2.2 is really great. I am wondering, however, what the distinction between the "Training" and "Validation" datasets is in the Model Compare Results output. There seems to be very little information in the user guide about how this is generated.
For the Auto Classification and Auto Regression algorithms, I understand how Model Statistics distinguishes "Train" and "Validation," because splitting and auto-validation are part of those algorithms. For the non-auto algorithms, however, there is no automated splitting of the data into Train and Validate samples; in fact, 100% of the data passing through the predictive algorithm (for example, an R-CNR tree) is used for model training. So what does the KR value represent? Is this prediction consistency over repeated samples of training data? And how are the charts under "Model Representation" generated with "Train" and "Validate" labels? I see differences between the two results for an R-CNR tree algorithm, but I am not sure how the Model Compare module decides which data is "Train" and which is "Validate"--I believe it is all "Train".
Would it be possible to designate Train vs. Validate explicitly so that those charts are accurate?
Thanks!
Hi Hillary,
In the 2.2 release, the Statistics node follows the default cutting strategy (similar to the APL/Auto algorithms), which is "random without test," in the absence of a Partition node. It splits the dataset into random partitions of 75% for training and 25% for validation; the KI is therefore computed from the training partition and the KR from the validation partition.
This was the first step toward model comparison, and it will become more configurable in coming releases.
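For readers wondering what the "random without test" cutting strategy amounts to, here is a minimal sketch in Python. This is an illustration only, not SAP's actual implementation; the function name `random_partition` and the fixed seed are my own, while the 75/25 split with no test partition matches the behaviour described above.

```python
import random

def random_partition(rows, train_fraction=0.75, seed=42):
    """Randomly split rows into training and validation subsets.

    Mirrors the described "random without test" strategy:
    75% training, 25% validation, and no test partition.
    """
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example: 100 rows split into 75 training and 25 validation rows.
train, validate = random_partition(list(range(100)))
print(len(train), len(validate))  # 75 25
```

Under this scheme, KI would be computed on the `train` partition and KR on the `validate` partition, which is why a non-auto algorithm such as R-CNR tree still shows distinct Train and Validate results.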
Regards,
Jayant