This library makes it easier to find recommendations and similarities between different things. There are a couple of use cases for it:
- Recommend a list of music albums/artists to a user
- Recommend an article that is similar to the current one that a user is reading
- Find other users that have the same values as another user (think matchmaking ;)
The easiest way to install the library in your project is with Composer:
composer require stojg/recommend
Assume that we have some data where users have rated artists on a scale of one to five:
$artistRatings = array(
    "Abe" => array(
        "Blues Traveler" => 3,
        "Broken Bells" => 2,
        "Norah Jones" => 4,
        "Phoenix" => 5,
        "Slightly Stoopid" => 1,
        "The Strokes" => 2,
        "Vampire Weekend" => 2
    ),
    "Blair" => array(
        "Blues Traveler" => 2,
        "Broken Bells" => 3,
        "Deadmau5" => 4,
        "Phoenix" => 2,
        "Slightly Stoopid" => 3,
        "Vampire Weekend" => 3
    ),
    "Clair" => array(
        "Blues Traveler" => 5,
        "Broken Bells" => 1,
        "Deadmau5" => 1,
        "Norah Jones" => 3,
        "Phoenix" => 5,
        "Slightly Stoopid" => 1
    )
);
Start by loading this data into the Data class:
$data = new \stojg\recommend\Data($artistRatings);
If we want to find artists that Blair might like, we execute the recommend method.
$recommendations = $data->recommend('Blair', new \stojg\recommend\strategy\Manhattan());
var_export($recommendations);
The result of that computation would be:
array (
  0 => array (
    'key' => 'Norah Jones',
    'value' => 4,
  ),
  1 => array (
    'key' => 'The Strokes',
    'value' => 2,
  )
)
This means that Blair might like Norah Jones. The Strokes, on the other hand, are less likely to fit her taste.
The Recommender works by finding someone in $artistRatings who has rated artists similarly to Blair. In this case that turns out to be Abe, so it then looks for artists that Abe has rated but Blair has not, and returns them as a list of recommendations.
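The steps above can be sketched in plain PHP. This is a minimal illustration, not the library's implementation: the function names `manhattanDistance` and `recommendFor` are hypothetical, and the sketch assumes that distance is measured only over artists that both users have rated.

```php
<?php
// Hypothetical helpers for illustration only; not part of stojg/recommend.

/** Manhattan distance over the items that both users have rated. */
function manhattanDistance(array $a, array $b): float
{
    $common = array_intersect_key($a, $b);
    if (count($common) === 0) {
        return INF; // assumption: no shared items means "not comparable"
    }
    $distance = 0.0;
    foreach ($common as $item => $rating) {
        $distance += abs($rating - $b[$item]);
    }
    return $distance;
}

/** Recommend items rated by the nearest neighbour but not by $user. */
function recommendFor(string $user, array $ratings): array
{
    $nearest = null;
    $best = INF;
    foreach ($ratings as $other => $otherRatings) {
        if ((string) $other === $user) {
            continue; // never compare a user with themselves
        }
        $distance = manhattanDistance($ratings[$user], $otherRatings);
        if ($distance < $best) {
            $best = $distance;
            $nearest = $other;
        }
    }
    if ($nearest === null) {
        return array();
    }
    // Keep only what the neighbour rated and $user did not.
    $candidates = array_diff_key($ratings[$nearest], $ratings[$user]);
    arsort($candidates); // highest rating first, keys preserved
    $result = array();
    foreach ($candidates as $item => $rating) {
        $result[] = array('key' => $item, 'value' => $rating);
    }
    return $result;
}
```

Called as `recommendFor('Blair', $artistRatings)` with the example data, this sketch picks Abe as the nearest neighbour (distance 8, versus 13 for Clair) and returns Norah Jones followed by The Strokes.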
How the 'nearest' neighbour is found depends on which strategy is chosen and on how big and dense the dataset is.
As a general rule, the bigger the dataset, the better. The data has to be formatted as an array in the following format:
array(
    'uniqueID' => array(
        'objectID' => (int)'rating'
    )
);
There are currently three (four, depending on how you count) strategies, and which one to pick depends on how the data is organized and populated.
If the data is dense (almost every objectID has a non-zero rating) and the magnitude of the ratings is important, Minkowski is a good strategy.
It takes a "dimension" of 1 or higher. The larger the dimension, the more a single large difference in rating dominates the resulting "score".
Manhattan is a shortcut for Minkowski with a dimension of one.
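To make the role of the dimension concrete, here is a small sketch of the Minkowski distance, the r-th root of the sum of |a - b|^r over all compared items. The function name `minkowskiDistance` is hypothetical, and restricting the sum to items rated by both users is an assumption of this sketch, not a statement about the library's internals.

```php
<?php
// Hypothetical helper for illustration only; not part of stojg/recommend.

/**
 * Minkowski distance over the items that both users have rated.
 * $dimension = 1 gives the Manhattan distance, 2 the Euclidean distance.
 */
function minkowskiDistance(array $a, array $b, int $dimension = 1): float
{
    if ($dimension < 1) {
        throw new InvalidArgumentException('dimension must be 1 or higher');
    }
    $sum = 0.0;
    foreach (array_intersect_key($a, $b) as $item => $rating) {
        // Raising to a higher power makes large differences weigh more.
        $sum += pow(abs($rating - $b[$item]), $dimension);
    }
    return pow($sum, 1 / $dimension);
}
```

For Abe and Blair from the example data, the differences on the five shared artists are 1, 1, 3, 2 and 1. With a dimension of one the distance is their plain sum, 8. With a dimension of two it is the square root of 1 + 1 + 9 + 4 + 1 = 16, so the single difference of 3 accounts for more than half of the squared total.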
Use this strategy if the data is subject to grade inflation.
For example, if I rate most items between 2 and 4 and you rate most items between 4 and 5, this strategy tries to compensate for the fact that my worst (2) is equivalent to your worst (4).
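The text above does not say how the compensation is computed. One common technique for this problem is the Pearson correlation coefficient, which compares how two users' ratings move together rather than their absolute values. The sketch below assumes that technique; the function name `pearsonCorrelation` is hypothetical and the code is not taken from the library.

```php
<?php
// Hypothetical helper for illustration only; not part of stojg/recommend.
// Assumption: grade inflation is handled with the Pearson correlation.

/**
 * Pearson correlation over the items that both users have rated.
 * Returns a value from -1 (opposite taste) to 1 (same taste), or 0.0
 * when there is nothing to compare or one user rates everything equally.
 */
function pearsonCorrelation(array $a, array $b): float
{
    $common = array_keys(array_intersect_key($a, $b));
    $n = count($common);
    if ($n === 0) {
        return 0.0;
    }
    $sumA = $sumB = $sumAB = $sumA2 = $sumB2 = 0.0;
    foreach ($common as $item) {
        $x = $a[$item];
        $y = $b[$item];
        $sumA += $x;
        $sumB += $y;
        $sumAB += $x * $y;
        $sumA2 += $x * $x;
        $sumB2 += $y * $y;
    }
    $numerator = $sumAB - ($sumA * $sumB) / $n;
    // max() guards against tiny negative values from floating point rounding.
    $varA = max(0.0, $sumA2 - ($sumA ** 2) / $n);
    $varB = max(0.0, $sumB2 - ($sumB ** 2) / $n);
    $denominator = sqrt($varA) * sqrt($varB);
    return $denominator == 0.0 ? 0.0 : $numerator / $denominator;
}
```

With ratings of 2, 4, 2 from one user and 4, 5, 4 from another for the same three items, the correlation is 1 (up to floating point rounding): the two users agree on which items are better and worse even though they use different parts of the scale.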
This is the strategy to pick if the data is sparse.
For example, if there is a list of ten thousand artists, it is quite likely that users have only listened to and rated a few of them.
It disregards the null values so that they don't influence the similarity score.
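The text above does not name the calculation. A common choice for sparse data is cosine similarity, where a missing rating behaves like zero and therefore adds nothing to the score. The sketch below assumes that technique; the function name `cosineSimilarity` is hypothetical and the code is not taken from the library.

```php
<?php
// Hypothetical helper for illustration only; not part of stojg/recommend.
// Assumption: sparse data is compared with cosine similarity.

/** Length (Euclidean norm) of a user's rating vector. */
function vectorLength(array $ratings): float
{
    $sum = 0.0;
    foreach ($ratings as $rating) {
        $sum += $rating * $rating;
    }
    return sqrt($sum);
}

/**
 * Cosine similarity between two users' ratings.
 * Items missing from either user contribute nothing to the dot product,
 * so unrated items do not pull the score in any direction.
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach (array_intersect_key($a, $b) as $item => $rating) {
        $dot += $rating * $b[$item];
    }
    $lengths = vectorLength($a) * vectorLength($b);
    return $lengths == 0.0 ? 0.0 : $dot / $lengths;
}
```

For non-negative ratings the result ranges from 0 (no overlap at all) to 1 (ratings point in exactly the same direction). Unlike the distance-based strategies, a higher value means the users are more similar.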