Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAHOUT-1962 Added Ftest to OLS implemenations #300

Closed
wants to merge 3 commits into from

Conversation

dustinvanstee
Copy link
Contributor

Here are a few other mods
-- added 2 tests for OLS
-- modified print statement for summary
-- changed degreeOfFreedom to X.ncols - 1 to remove intercept
-- proposing to move intercept to first column ....

Here are a few other mods
-- added 2 tests for OLS
-- modified print statement for summary
-- changed degreeOfFreedom to X.ncols - 1 to remove intercept
-- proposing to move intercept to first column ....
val interceptCol = model.beta.length - 1
val targetMean: Double = drmTarget.colMeans().get(0)
val rssint: Double = ((drmTarget - targetMean ).t %*% (drmTarget - targetMean)).zSum()
val rssmod: Double = ete.zSum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I recall ete is a 1x1 matrix and you can also access the value via ete(0,0) see line 76.

Since we keep doing this, it might make sense just to collect the value a little ways upstream

modelOut.se = se
modelOut.tScore = tScore
modelOut.pval = pval
modelOut.degreesFreedom = X.ncol
// for degrees of freedom, dont count the intercept term that was added
modelOut.degreesFreedom = X.ncol - 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not saying I'm above making mistakes, but why are we subtracting 1 here?


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 163.179 51.915 3.143 0.0347 *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for moving intercept to top of the list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this more closely, I don't think you actually implemented like this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, I was going to explore moving the intercept to the first index in the array for seperate PR.

@@ -41,16 +41,22 @@ object FittnessTests {
val r2 = 1 - (sumSquareResiduals / sumSquareTotal)
model.r2 = r2
model.testResults += ('r2 -> r2) // need setResult and setSummary method incase you change in future, also to initialize map if non exists or update value if it does
model.summary += s"\nR^2: ${r2}"
// this is already being done in -> calculateCommonStatistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it is- good catch. That said, I think it should happen here, and be taken out of calculateCommonStatistics (adding to the summary string)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will go back and fix that ..

val mse = residuals.assign(SQUARE).sum / residuals.nrow
model.mse = mse
model.testResults += ('mse -> mse)
model.summary += s"\nMean Squared Error: ${mse}"
// this is already being done in -> calculateCommonStatistics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above


// Using Formatted Print here to pretty print the columns
"\nCoef.\t\tEstimate\t\tStd. Error\t\tt-score\t\t\tPr(Beta=0)\n" +
(0 until k).map(i => "X%-3d\t\t%+5.5f\t\t%+5.5f\t\t%+5.5f\t\t%+5.5f".format(i,model.beta(i),model.se(i),model.tScore(i),model.pval(i))).mkString("\n") +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for better formatting,

model
}

// https://en.wikipedia.org/wiki/Mean_squared_error
def MeanSquareError[R[K] <: RegressorModel[K], K](model: R[K], residuals: DrmLike[K]): R[K] = {
// TODO : I think mse denom should be (row - col) ??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are nrow errors from nrow observations? Is this a d.f. thing? Link?

import JavaConversions._
import collection.JavaConversions
/**
* Created by dustinvanstee on 3/30/17.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove boilerplate from intellij


}

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file ends in blank line

class OlsSparkTestSuite extends FunSuite with MahoutSuite with DistributedSparkSuite with OrdinaryLeastSquaresTest {
// Common tests located in OrdinaryLeastSquaresTest.scala
// The test below is common to spark as I created an random RDD for larger size
test("Simple Medium Model2 - Spark Specific") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very interesting... @dlyubimov thoughts on this- doesn't jam anything else up...

I get what you're doing- may be redundant (good test coverage in math-scala should cover the spark case)

This would maybe be more appropriate if refactored as an example...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind- build is failing on this.

Keep the math tests in the math suite. They will be tested on spark as well as others.


val synDataRDD = LinearDataGenerator.generateLinearRDD(mahoutCtx, n, features, eps, partitions, intercept)

// Maybe refine this with something from the sparkbindings, but for now this is good...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any thing worth doing is worth doing right... you'll find comments like this littered through out the code base dating back to 2008 i think...

https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala#L156

val synDataRDD = LinearDataGenerator.generateLinearRDD(mahoutCtx, n, features, eps, partitions, intercept)

// Maybe refine this with something from the sparkbindings, but for now this is good...
val tempRDD = synDataRDD.zipWithIndex.map( lv => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val tempRDD = drmWrapMLLibLabeledPoint(synDataRDD)

@@ -0,0 +1,64 @@
package org.apache.mahout.math.regression

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need a new file- refactor the FTest to be like the coefficient of determintaiton / mse, move them to FitnessTests and add test to https://github.com/apache/mahout/blob/master/math-scala/src/test/scala/org/apache/mahout/math/algorithms/RegressionTestsSuiteBase.scala with an R-Prototype

@@ -35,6 +37,10 @@ trait LinearRegressorModel[K] extends RegressorModel[K] {
var tScore: MahoutVector = _
var pval: MahoutVector = _
var degreesFreedom: Int = _
var trainingExamples :Int = _
var fScore: Double = _
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these do belong here

//println("rssmod = " + rssmod)

val groupDof = X.ncol-1
val fScore = ((rssint - rssmod) / groupDof) / ( rssmod / (X.nrow - groupDof- 1 ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be refactored as a fitness test- the rationale is unlike R for larger datasets the user will not invoke calcStandardStatistics, and instead will execute the tests she needs a la cart.

@rawkintrevo
Copy link
Contributor

Hey Dustin, thanks for the contribution.

Sorry this slipped through the cracks.

Just reviewed and made comments.

High level-

  • Awesome work on your first PR, this maybe was a more advanced 'beginner' issue
  • I think the F-Test needs to be refactored as a fitnessTest so the user can calculate it ala cart later.
  • The unit test for the F-Test needs to have an R prototype, e.g. I ran this same example in R, here was the F- Value, see the comments in fitness tests, you can probably use those and see what the F-score was, and just report it there.
  • Not sure the spark specific test belongs as a unit test. Maybe better as an example.

@rawkintrevo
Copy link
Contributor

As a second note- if you put MAHOUT-1962 in the title of the PR a lot of things will automatically integrate, not sure if you can edit it now- but that is a standard proceedure

@dustinvanstee dustinvanstee changed the title Added Ftest to OLS implemenations MAHOUT-1962 Added Ftest to OLS implemenations Apr 24, 2017
@rawkintrevo
Copy link
Contributor

Just beezing through this again... I would drop spark/src/test/scala/org/apache/mahout/math/algorithms/regression/OlsSparkTestSuite.scala all together.

@andrewpalumbo
Copy link
Member

this is not really apropos of this PR, but looking through, reminded me that we have some a classification stats package in https://github.com/apache/mahout/tree/master/math-scala/src/main/scala/org/apache/mahout/classifier/stats. There may be some useful code there, that could be refactored into to the new framework.

@@ -106,4 +106,83 @@ trait RegressionTestsSuiteBase extends DistributedMahoutSuite with Matchers {
(myAnswer - correctAnswer) should be < epsilon
}

test("OLS w/Ftest and Ttest validation") {

Copy link
Contributor

@rawkintrevo rawkintrevo May 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs an R prototype (i'm guessing you have one- since you have an answer- unless you did it by hand).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this all can be moved to the "fitness tests" , use the R-prototype in there and just confirm the F-test, should be in good shape.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. you don't need these separate tests.

val mse = residuals.assign(SQUARE).sum / residuals.nrow
model.mse = mse
model.testResults += ('mse -> mse)
model.summary += s"\nMean Squared Error: ${mse}"
model
}

// https://en.wikipedia.org/wiki/xxxx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might as well finish out the link.

@@ -47,10 +52,44 @@ object FittnessTests {

// https://en.wikipedia.org/wiki/Mean_squared_error
def MeanSquareError[R[K] <: RegressorModel[K], K](model: R[K], residuals: DrmLike[K]): R[K] = {
// TODO : I think mse denom should be (row - col) ??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may be right, why do you think this?

//println(" model.beta(0) = " + model.beta(0))
//println(" model.beta(interceptCol) = " + model.beta(interceptCol))
//println("rssint = " + rssint)
//println("rssmod = " + rssmod)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get rid of these.


// This is the residual sum of squares for just the intercept
//println(" drmTarget.ncol) = " + drmTarget.ncol)
val interceptCol = drmTarget.ncol - 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be drmFeatures.ncol?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition- it seems that this is only used in some print statements that need to be deleted anyway. Probably can drop this line all together

modelOut.degreesFreedom = X.ncol
modelOut.summary = generateSummaryString(modelOut)
// for degrees of freedom, dont count the intercept term that was added
modelOut.degreesFreedom = X.ncol - 1
Copy link
Contributor

@rawkintrevo rawkintrevo May 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're correct you but you need to check if a intercept was added- if (model.addIntercept) {

I was wrong- the intercept also costs you one degree of freedom. revert to original line 82 and youre good.

// for degrees of freedom, dont count the intercept term that was added
modelOut.degreesFreedom = X.ncol - 1

modelOut.trainingExamples = n.toInt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dangerous- could be to big of a number to fit in an Int. safeToNonNegInt(residuals.nrow)

if (calcCommonStatistics){
modelOut = calculateCommonStatistics(modelOut, drmTarget, residuals)
modelOut = calculateCommonStatistics(modelOut, X, drmTarget, residuals)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK- you changed this to take X but the only thing you use X for is to calculate teh d.o.f. and model.trainingExamples which are already included in teh model (you just added them). I would refactor this so X isn't a parameter.

modelOut
}

// Since rss is needed for multiple test statistics, use this function to cache this value
def calculateResidualSumOfSquares[M[K] <: LinearRegressorModel[K]](model: M[K],residuals: DrmLike[K]) : M[K] ={
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a stylistic point and because other non-linear regressors may/will need this- please move to FitnessTests

@@ -55,6 +55,8 @@ class OrdinaryLeastSquares[K] extends LinearRegressorFitter[K] {
}

var X = drmFeatures

// TODO : move intercept betas to the first column, not the last column ??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start a JIRA, don't introduce a comment into the code in a file you're otherwise not touching ;)

@@ -30,6 +30,9 @@ trait RegressorModel[K] extends SupervisedModel[K] {
// Common Applicable Tests- here only for convenience.
var mse: Double = _
var r2: Double = _
var fpval: Double = _
var rss:Double = -9999.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure there is a reason, but why are you initializing to -9999.0 instead of _ ?

//println("rssint = " + rssint)
//println("rssmod = " + rssmod)

val groupDof = drmFeatures.ncol-1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace this with model.degreesFreedom (and refactor to not need the features in the signature)

//println("rssmod = " + rssmod)

val groupDof = drmFeatures.ncol-1
val fScore = ((rssint - model.rss) / groupDof) / ( model.rss / (drmFeatures.nrow - groupDof- 1 ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is the crux of it- Looks good, but I'll feel better if you can verify with an R-prototype

@rawkintrevo
Copy link
Contributor

Big job @dustinvanstee ! Looking awesome. Made a few more comments, you're getting pretty close.

Thanks again!

-- DegreeOfFreedom = ncol
-- added safeToNeg wrapping ints
-- eliminated drmFeatures from calcCommonStat
-- moved rss calc to fitness tests
-- removed excess comments and prints
-- fixed weblink
-- added R prototype
-- elimated tests, and added a few checks to existing test
# new
-- refactored model.summary to just use function call (cleaner that
having summary get updated in multiple locations
# did not address
-- -9999 initialization for rss .. Need a suggestion
@rawkintrevo
Copy link
Contributor

sorry for delay @dustinvanstee

I'm still not in love with moving all of the summary string generation to one function- but I think really we should re think the entire thing and try to do as much as possible in one pass.

This however would be a separate JIRA.

lgtm unless any further comments

@asfgit asfgit closed this in cf5a33d May 21, 2017
@dustinvanstee dustinvanstee deleted the MAHOUT-1962 branch June 28, 2017 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants