R Squared Calculation

 

The R Squared (R2) column indicates the fraction of the variability of the target/exposure rate that is described by each field in the Training table (on a univariate basis). This statistic can be useful for deciding which fields to include in the analysis, as well as for understanding which fields could potentially be duplicates of each other.

 

Usage

To view the R2 values, select a Target field, and, if applicable, an Exposure field. Then press the "Calculate R Squared" button. The R Squared column will be populated for each field in the table, as shown below. The R Squared calculation takes into account any Filters applied to the Training Data.

 

The user may toggle the sorting of the R2 values in the table by clicking on the "R Squared" header.

 

 

 

There are two special formats shown for the R Squared Column:

Format: Grey italicized font

Meaning: The relationship with the target variable is statistically insignificant based on an F-test.

Description: The results of an F-test indicate statistical insignificance. See the description of the F Ratio below for more details.

Format: Pink shaded background

Meaning: The field has a one-to-one correspondence with another field.

Description: This often means that the two fields describe exactly the same attribute, such as lcnfeld and License Field above, but the one-to-one correspondence could be less obvious.

Sorting the table by the R Squared value is useful for seeing which fields correspond to others.

It is not advisable to select more than one field from a group of corresponding characteristics in the analysis, since including multiple fields with the same information adds nothing. Including more than one of them will simply slow the iteration process down, or could even prevent convergence.
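To make the idea of a one-to-one correspondence concrete, the following is a minimal sketch of how two fields could be tested for it. The function name `is_one_to_one` and the sample field values are hypothetical, and this is not necessarily how the application performs the check.

```python
def is_one_to_one(values_a, values_b):
    """Return True if each value of A pairs with exactly one value of B, and vice versa."""
    a_to_b = {}
    b_to_a = {}
    for a, b in zip(values_a, values_b):
        # setdefault stores the first pairing seen and returns the stored one afterwards.
        if a_to_b.setdefault(a, b) != b:
            return False  # the same A value appeared with two different B values
        if b_to_a.setdefault(b, a) != a:
            return False  # the same B value appeared with two different A values
    return True


# Hypothetical sample data: a code field and its spelled-out equivalent.
license_code = ["A", "B", "A", "C"]
license_name = ["Auto", "Boat", "Auto", "Cycle"]
region = ["N", "N", "S", "S"]

print(is_one_to_one(license_code, license_name))
print(is_one_to_one(license_code, region))
```

When unique values are used as bins, two fields that correspond one-to-one split the records into the same groups, so they produce the same R2. This is why sorting by R Squared places them next to each other.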

 

 

 

Calculations

 

F Ratio

A result of an ANOVA (analysis of variance), the F Ratio measures how the variability in the target value across the unique values of a given characteristic (or across groups of values, for a Grouped characteristic) compares with the variability in the target value across all records in the data table.

 

First we find the sum of squares, SSTotal, and the degrees of freedom, DFTotal, for the Training Dataset as a whole (recall that this dataset is determined after the application of the Filter):

 

           

          where i represents an individual record, n = the number of records, T = Target, and E = Exposure.
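As an illustration of this step, here is a minimal sketch of a total sum of squares for a target/exposure rate. It assumes an exposure-weighted form, in which each record contributes its exposure times the squared difference between its own rate T/E and the overall rate. That weighting, the function name, and the sample numbers are assumptions, not a statement of the application's exact formula. When no Exposure field is selected, every record is given an exposure of 1.

```python
def total_sum_of_squares(targets, exposures=None):
    """Return (SSTotal, DFTotal) under the assumed exposure weighting."""
    n = len(targets)
    if exposures is None:
        exposures = [1.0] * n  # no Exposure field: every record counts once

    # Overall rate: total target divided by total exposure.
    overall_rate = sum(targets) / sum(exposures)

    # Assumes every exposure is positive; a zero exposure would divide by zero.
    ss_total = sum(
        e * (t / e - overall_rate) ** 2
        for t, e in zip(targets, exposures)
    )
    df_total = n - 1
    return ss_total, df_total


# Hypothetical data: four records, no exposure field.
print(total_sum_of_squares([1, 2, 3, 6]))
```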

          

Next, we determine the sum of squares, SSCharacteristic, and the degrees of freedom, DFCharacteristic, for each characteristic:

          

where j represents the characteristic bin and u represents the number of bins for the characteristic. In cases where there are more than 50,000 unique values of the characteristic, or where the number of unique values is more than 20% of the total record count, the default groupings for grouped data are used to establish bins. Otherwise, unique values of the characteristic are used as bins.
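The binning rule and the per-characteristic sum of squares can be sketched as follows. The two thresholds come from the text above. The exposure-weighted between-bin form and DFCharacteristic = u - 1 follow the usual one-way ANOVA convention and are assumptions here, as are the function names and sample data.

```python
from collections import defaultdict

# Thresholds quoted in the text above.
MAX_UNIQUE_VALUES = 50_000
MAX_UNIQUE_SHARE = 0.20


def needs_default_grouping(values):
    """True when the rule above calls for default groupings instead of unique values."""
    unique_count = len(set(values))
    return (unique_count > MAX_UNIQUE_VALUES
            or unique_count > MAX_UNIQUE_SHARE * len(values))


def characteristic_sum_of_squares(bins, targets, exposures=None):
    """Return (SSCharacteristic, DFCharacteristic) under the assumed weighting.

    `bins` holds one bin label per record (a unique value or a group label).
    """
    if exposures is None:
        exposures = [1.0] * len(targets)
    overall_rate = sum(targets) / sum(exposures)

    # Accumulate target and exposure totals per bin.
    bin_target = defaultdict(float)
    bin_exposure = defaultdict(float)
    for b, t, e in zip(bins, targets, exposures):
        bin_target[b] += t
        bin_exposure[b] += e

    # Each bin contributes its exposure times the squared gap between
    # the bin rate and the overall rate (assumes positive bin exposure).
    ss = sum(
        bin_exposure[b] * (bin_target[b] / bin_exposure[b] - overall_rate) ** 2
        for b in bin_target
    )
    return ss, len(bin_target) - 1  # u - 1 degrees of freedom


print(needs_default_grouping(["a", "b"] * 10))   # 2 unique values in 20 records
print(needs_default_grouping(list(range(20))))   # 20 unique values in 20 records
print(characteristic_sum_of_squares(["a", "a", "b", "b"], [1, 2, 3, 6]))
```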

          

Then we find the sum of squares error, SSError, and the degrees of freedom error, DFError, for each characteristic:

          

          

Finally, we compute the F Ratio for each characteristic:
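The last two steps can be sketched with the standard one-way ANOVA identities: the error terms are the totals minus the characteristic terms, and the F Ratio is the ratio of the two mean squares. This assumes the application follows those standard identities. The function name and the input numbers are illustrative only.

```python
def f_ratio(ss_total, df_total, ss_characteristic, df_characteristic):
    """Return the F Ratio using the standard one-way ANOVA identities (assumed)."""
    # Error terms: what remains after removing the between-bin part.
    ss_error = ss_total - ss_characteristic
    df_error = df_total - df_characteristic

    # Mean squares; a characteristic with a single bin (DF of 0) or a
    # perfect fit (SSError of 0) would need separate handling.
    ms_characteristic = ss_characteristic / df_characteristic
    ms_error = ss_error / df_error
    return ms_characteristic / ms_error


# Illustrative values: SSTotal = 14 on 3 DF, SSCharacteristic = 9 on 1 DF.
print(f_ratio(14.0, 3, 9.0, 1))
```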

          

 

R2

The R2 statistic measures the fraction of the variability of the target/exposure rate across records that is described by variation between characteristic bin averages. It is calculated using the following formula, which utilizes some of the results determined in the F Ratio calculation above:

 

           

The R2 value is displayed in the R Squared column. Any characteristic with an F Ratio less than 1 will have its R2 displayed in grey, italicized font, denoting statistical insignificance as a result of the F-test.
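A minimal sketch of this final step follows. It assumes the conventional ANOVA definition of R2 as SSCharacteristic divided by SSTotal, which matches the description above of variability explained by bin averages. The F Ratio < 1 display rule is taken from the text. The function names and numbers are hypothetical.

```python
def r_squared(ss_characteristic, ss_total):
    """Fraction of total variability explained by the bins (assumed definition)."""
    return ss_characteristic / ss_total


def is_insignificant(f_ratio_value):
    """Display rule from the text: an F Ratio below 1 is shown as insignificant."""
    return f_ratio_value < 1.0


# Illustrative values carried over from a small four-record example.
r2 = r_squared(9.0, 14.0)
print(round(r2, 4))
print(is_insignificant(3.6))
print(is_insignificant(0.8))
```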