
[SPARK-20677][MLLIB][ML] Follow-up to ALS recommend-all performance PRs #17919

Closed · wants to merge 4 commits

Conversation

@MLnick
Contributor

MLnick commented May 9, 2017

Small clean ups from #17742 and #17845.

## How was this patch tested?

Existing unit tests.

@SparkQA

SparkQA commented May 9, 2017

Test build #76665 has finished for PR 17919 at commit 0b1eaa3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor Author

MLnick commented May 9, 2017

cc @mpjlu @jkbradley

@mpjlu

mpjlu commented May 9, 2017

Thanks, I am OK with this change.

@jkbradley
Member

taking a look

@jkbradley
Member

jkbradley left a comment


LGTM, just 1 question/comment

Thanks for doing this!

@@ -451,6 +439,8 @@ class ALSModel private[ml] (
 @Since("1.6.0")
 object ALSModel extends MLReadable[ALSModel] {

+  @transient private[recommendation] val _f2jBLAS = new F2jBLAS
Member

Does this require significant initialization? You could use org.apache.spark.ml.linalg.BLAS.f2jBLAS

Contributor Author

No more or less than using ml.linalg.BLAS. I did think of that, but the var would need to be exposed as private[ml]. If we're OK with that, then it would be slightly cleaner to use it, yes.

@MLnick
Contributor Author

MLnick commented May 11, 2017

Just decided to use ml.BLAS and expose f2jBLAS as ml / mllib private.
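For reference, roughly what that looks like as a sketch (the object name F2jBLASHolder and the exact visibility are assumptions here, not the committed diff):

```scala
package org.apache.spark.ml.linalg

import com.github.fommil.netlib.{BLAS => NetlibBLAS, F2jBLAS}

// Sketch only: a single lazily created F2jBLAS instance that both the ml and
// mllib recommend-all code paths could reuse instead of each holding their
// own @transient copy.
private[spark] object F2jBLASHolder {

  private var _f2jBLAS: NetlibBLAS = _

  // Create the pure-JVM F2jBLAS implementation on first access.
  private[spark] def f2jBLAS: NetlibBLAS = {
    if (_f2jBLAS == null) {
      _f2jBLAS = new F2jBLAS
    }
    _f2jBLAS
  }
}
```

Callers in the recommend-all code path would then reference the shared instance rather than constructing their own F2jBLAS.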

@SparkQA

SparkQA commented May 11, 2017

Test build #76792 has finished for PR 17919 at commit 9dfad1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor Author

MLnick commented May 11, 2017

Jenkins retest this please

@SparkQA

SparkQA commented May 11, 2017

Test build #76801 has finished for PR 17919 at commit 9dfad1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor Author

MLnick commented May 16, 2017

Merged to master/branch-2.2

asfgit pushed a commit that referenced this pull request May 16, 2017
Small clean ups from #17742 and #17845.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <[email protected]>

Closes #17919 from MLnick/SPARK-20677-als-perf-followup.

(cherry picked from commit 25b4f41)
Signed-off-by: Nick Pentreath <[email protected]>
@asfgit asfgit closed this in 25b4f41 May 16, 2017
@auskalia

Hi @MLnick, we find that simply repartitioning userFeatures and productFeatures can significantly improve the efficiency of ALS recommendForAll().

Here is our procedure:

  1. Train the ALS model
  2. Save the model to HDFS
  3. Submit a new Spark job
  4. Load the model from HDFS
  5. Call recommendForAll()

Firstly, when you submit the Spark job with "spark.default.parallelism=x", the recommendForAll stage is split into x^2 tasks, because userFeatures and productFeatures each have x partitions and every user partition is joined with every product partition. This is not reasonable: finishing the stage requires far too much network I/O.

Secondly, submitting the job with "spark.dynamicAllocation.enabled=true" may lead to uneven data distribution across executors. We found that some executors (those that start early) may hold n GB of data while others (those that start later) hold only m MB. As a result, a few executors execute their tasks slowly with heavy GC, or crash with OOM.

We ran some tests with repartitioned userFeatures and productFeatures. Here are the results:

case 1:
users: 480 thousand, products: 4 million, rank: 25
executors: 600, default.parallelism: 100, executor-memory: 20G, executor-cores: 8
without repartition, recommendForAll took 24 min
after userFeatures.repartition(100) and productFeatures.repartition(100), recommendForAll took 8 min
result: 3x faster

case 2:
users: 12 million, products: 7.2 million, rank: 20
executors: 800, default.parallelism: 600, executor-memory: 16G, executor-cores: 8
without repartition, recommendForAll took 16 hours
after userFeatures.repartition(800) and productFeatures.repartition(100), recommendForAll took 30 min
result: 32x faster

Note that the partition numbers of userFeatures and productFeatures may differ.

The tests above are based on the fixes in #17742 and #17845 (see the repartitioning sketch below).
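A minimal sketch of this kind of repartitioning, applied to a model loaded from HDFS via the existing public MatrixFactorizationModel constructor (the path and partition counts are illustrative, not taken from the tests above):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Sketch only: rebuild the loaded model with repartitioned feature RDDs
// before calling the recommend-all methods. Path and partition counts are
// illustrative values.
def loadAndRepartition(sc: SparkContext, modelPath: String): MatrixFactorizationModel = {
  val model = MatrixFactorizationModel.load(sc, modelPath)
  // The two feature RDDs can be repartitioned independently.
  new MatrixFactorizationModel(
    model.rank,
    model.userFeatures.repartition(100),
    model.productFeatures.repartition(100))
}

// Example use:
// val recs = loadAndRepartition(sc, "hdfs:///path/to/als-model").recommendProductsForUsers(10)
```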

We strongly suggest providing an interface that gives users a chance to repartition the two kinds of features.

Thanks

Here is the patch for mllib, which adds two new public functions to MatrixFactorizationModel:

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
index d45866c..d4412f7 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
@@ -56,8 +56,8 @@ import org.apache.spark.util.BoundedPriorityQueue
 @Since("0.8.0")
 class MatrixFactorizationModel @Since("0.8.0") (
     @Since("0.8.0") val rank: Int,
-    @Since("0.8.0") val userFeatures: RDD[(Int, Array[Double])],
-    @Since("0.8.0") val productFeatures: RDD[(Int, Array[Double])])
+    @Since("0.8.0") var userFeatures: RDD[(Int, Array[Double])],
+    @Since("0.8.0") var productFeatures: RDD[(Int, Array[Double])])
   extends Saveable with Serializable with Logging {
 
   require(rank > 0)
@@ -154,6 +154,39 @@ class MatrixFactorizationModel @Since("0.8.0") (
     predict(usersProducts.rdd.asInstanceOf[RDD[(Int, Int)]]).toJavaRDD()
   }
 
+  /**
+   * Repartition userFeatures.
+   * @param partitionNum the number of partitions to use for userFeatures in the model
+   */
+  @Since("2.2.0")
+  def repartitionUserFeatures(partitionNum: Int = 0): Unit = {
+    if (partitionNum > 0) {
+      userFeatures = userFeatures.repartition(partitionNum)
+    } else {
+      userFeatures = userFeatures.repartition(userFeatures.getNumPartitions)
+    }
+  }
+
+  /**
+   * Repartition productFeatures.
+   * @param partitionNum the number of partitions to use for productFeatures in the model
+   */
+  @Since("2.2.0")
+  def repartitionProductFeatures(partitionNum: Int = 0): Unit = {
+    if (partitionNum > 0) {
+      productFeatures = productFeatures.repartition(partitionNum)
+    } else {
+      productFeatures = productFeatures.repartition(productFeatures.getNumPartitions)
+    }
+  }
+
   /**
    * Recommends products to a user.

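With such a patch applied, usage in a separate job might look roughly like this (a sketch only; the path and partition counts are illustrative, and the repartition methods exist only with the patch above):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Sketch of the proposed methods in use after loading a saved model.
def loadRepartitionedModel(sc: SparkContext, modelPath: String): MatrixFactorizationModel = {
  val model = MatrixFactorizationModel.load(sc, modelPath)
  model.repartitionUserFeatures(800)     // proposed method, not in upstream Spark
  model.repartitionProductFeatures(100)  // the two counts need not match
  model
}

// Example use:
// val recs = loadRepartitionedModel(sc, "hdfs:///path/to/als-model").recommendProductsForUsers(10)
```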
@mpjlu

mpjlu commented May 17, 2017

Hi @auskalia, you are right: repartitioning can improve the performance of recommendForAll.
In my experiments for PR #17742, I had 120 cores and used 20 partitions for both userFeatures and itemFeatures.
I also considered providing an interface that lets users repartition the features.
Since you can set the partition number when training the model (see the sketch below), I did not do that.
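A minimal sketch of controlling feature partitioning at training time through the block count in mllib's ALS (the rank, iteration, lambda, and block values here are illustrative):

```scala
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Sketch only: the number of blocks passed to ALS.train determines how the
// user/product factors are partitioned in the resulting model.
def trainWithBlocks(ratings: RDD[Rating]): MatrixFactorizationModel = {
  val rank = 20
  val iterations = 10
  val lambda = 0.01
  val blocks = 100  // controls the partitioning of userFeatures/productFeatures
  ALS.train(ratings, rank, iterations, lambda, blocks)
}
```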

@auskalia

Hi @mpjlu, you are right. But sometimes we have to split our work across several Spark jobs, especially when resources in the Hadoop cluster are insufficient, and saving the model in one job and reloading it in another is a common pattern in engineering applications. The partitioning chosen at training time does not carry over in that case, so I recommend exposing an interface that lets clients repartition the features.

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017

liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017