Clean / process / visualize small to big data in R, Python, Cascading, Hive, Pig, SQL, pandas, all in one place!
“Oh great, a 25-minute commercial!!!”
Not gonna happen. Bug me after the presentation if you want a demo of the product.
What's re-ranking?
Let's imagine you are in charge of a web search engine's quality...
If a search engine is the core of your business
(web search, E-commerce, recruiting...)
Improving its relevance has a direct impact on your ROI
How do we define the quality of a search engine?
First we're gonna need to measure the relevance of a search result!
A metric for relevance right from the log?
Say we search for "french newspaper" and take a look at the logs.
We compute the so-called dwell time of a click,
i.e. the time elapsed before the next action.
Dwell time has been shown to be correlated with relevance.
We can attribute a relevance grade to each url.
Good, we have a measure of relevance!
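For instance, here is a minimal sketch of turning a click log into such grades (toy data; the dwell-time thresholds and column names are made up for the example, not the contest's official definition):

```python
import pandas as pd

# Toy click log: one row per action, in chronological order within a session.
log = pd.DataFrame({
    "session": [1, 1, 1],
    "action":  ["query", "click", "click"],
    "url":     [None, "lemonde.fr", "lefigaro.fr"],
    "time":    [0, 10, 40],                     # arbitrary time units
})

clicks = log[log["action"] == "click"].copy()
# Dwell time = time elapsed before the next action of the session.
# The last click of a session has no "next action": one possible convention
# is to treat it as an arbitrarily long dwell.
next_time = log.groupby("session")["time"].shift(-1)
clicks["dwell"] = (next_time[clicks.index] - clicks["time"]).fillna(float("inf"))

# Map dwell time to a relevance grade in {0, 1, 2}.
# LOW / HIGH are illustrative thresholds, not the contest's official ones.
LOW, HIGH = 50, 400
clicks["relevance"] = clicks["dwell"].apply(
    lambda d: 2 if d >= HIGH else (1 if d >= LOW else 0))
print(clicks[["url", "dwell", "relevance"]])
```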
Can we get an overall score for our search engine now?
What's the spec?
A better score for a search engine that puts greater relevance on top
DCG
What's the spec?
A better score for a search engine that puts greater relevance on top
Something between 0 and 1
NDCG
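For reference, a small sketch of DCG / NDCG on a ranked list of relevance grades, assuming the usual 2^rel − 1 gain and log2 position discount:

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain of a ranked list of grades (0, 1, 2)."""
    return sum((2 ** rel - 1) / math.log2(i + 2)   # i is 0-based, hence the +2
               for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the best achievable DCG, hence something between 0 and 1."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 1.0  # convention for all-zero lists

# Putting the relevant documents on top scores higher:
print(ndcg([2, 1, 0, 0]))   # 1.0
print(ndcg([0, 0, 1, 2]))   # ~0.49
```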
Let's get to work
Your search engine will go through different stages
TFIDF? ✓
Query expansion? ✓
PageRank? ✓
Wait, relevance differs from one user to another!
Collapsing? ✓
Personalized reranking?
Personalized reranking is about reordering the n-best results
based on the user's past search history
The contest
Yandex supplied 27 days of anonymized logs:
34,573,630 Sessions with user id
21,073,569 Queries
64,693,054 Clicks
A team of 4
Kenji Lefevre - Pure Math
Christophe Bourguignat - Eng. / Signal Processing
Mathieu Scordia - Data Scientist
Me (Paul Masurel) - Softw. / Signal Processing
No researcher.
No experience in reranking.
Not much experience in ML for most of us.
Not exactly our job.
No expectations.
At this point,
We are but hobbits
Also, it is only a contest for us
Disclaimer
We did not follow the scientific method
Start simple.
Point-wise approach
Making it supervised learning
We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated),
and select test queries just like Yandex did.
Where is my X, where is my y?
For each of the annotated sessions, and for each of the 10 urls to re-rank, we compute (using the history sessions) a set of features X.
First, we try to predict the relevance of each url (0, 1, or 2).
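Concretely, one row of X per (annotated session, candidate url) and one relevance grade in y per row; the feature columns below are placeholders, just to fix ideas:

```python
import pandas as pd

# One row per (annotated session, candidate url); hypothetical feature columns.
X = pd.DataFrame({
    "session_id":            [42, 42, 42],
    "url_id":                [7, 13, 99],
    "non_personalized_rank": [1, 2, 3],         # the rank Yandex computed
    "past_click_rate":       [0.5, 0.0, 0.25],  # computed on the 24 history days
})
# Target: the relevance grade of each url in the annotated session.
y = pd.Series([2, 0, 1], name="relevance")
```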
Regression / classification?
Regression: we keep the ordering between the classes, but optimizing NDCG is cookery.
Classification: we lose the ordering, but we can optimize NDCG (more on that later).
According to
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
classification outperforms regression.
How do we proceed?
We use classification techniques to predict the outcome of each url display independently,
outputting the probabilities of the different classes.
SVM forbidden here!
We sort urls by the value of
p(relevancy=1) + 3 p(relevancy=2)
Where is this 3 coming from?
Given a Bayes classifier,
we want to optimize the expected DCG:
E[DCG] = Σ_i E[2^rel_i − 1] / log2(i + 1)
(hidden fallacy here)
Given a Bayes classifier,
the expected gain of each url is
E[2^rel − 1] = p(relevancy=1) × (2^1 − 1) + p(relevancy=2) × (2^2 − 1)
We therefore need to order urls by decreasing
p(relevancy=1) + 3 p(relevancy=2)
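A sketch of that re-ranking step, assuming a fitted scikit-learn-style classifier with classes 0 / 1 / 2 (the function and variable names are mine, not the contest code):

```python
import numpy as np

def rerank(model, X_query):
    """Reorder the candidate urls of one query by decreasing expected DCG gain.

    X_query: feature matrix, one row per candidate url (in original order).
    Returns row indices, best candidate first."""
    proba = model.predict_proba(X_query)              # columns follow model.classes_
    p = {c: proba[:, i] for i, c in enumerate(model.classes_)}
    expected_gain = p.get(1, 0) + 3 * p.get(2, 0)     # (2^1 - 1) = 1 and (2^2 - 1) = 3
    return np.argsort(-expected_gain)                 # descending expected gain
```

Sorting by this expected gain is exactly the p(relevancy=1) + 3 p(relevancy=2) rule.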
By the way,
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
get slightly better results with linear weighting.
Settled for Random Forests
Hyperparameter fitting
We are not doing typical classification here.
It is extremely important to perform the grid search
directly against the final NDCG score.
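Roughly what that looks like, reusing the ndcg and rerank sketches above; the hyperparameter grid is illustrative, not the one we actually used:

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

def mean_ndcg(model, validation_queries):
    """Mean NDCG over re-ranked validation queries.
    validation_queries: list of (X_query, relevance_grades) pairs,
    one pair per test query (10 candidate urls each)."""
    scores = []
    for X_query, grades in validation_queries:
        order = rerank(model, X_query)                    # see the sketch above
        scores.append(ndcg([grades[i] for i in order]))   # see the sketch above
    return sum(scores) / len(scores)

def grid_search(X_train, y_train, validation_queries):
    """Pick the hyperparameters that maximize the final NDCG, not accuracy."""
    best_score, best_model = None, None
    for n_estimators, max_depth in product([100, 300], [6, 10, None]):
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, n_jobs=-1)
        model.fit(X_train, y_train)
        score = mean_ndcg(model, validation_queries)
        if best_score is None or score > best_score:
            best_score, best_model = score, model
    return best_model, best_score
```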
Features
First of all: the rank.
In this contest, the rank is both
the rank at which the url was displayed to the user
(the display rank)
and the rank computed by Yandex using PageRank, non-personalized log analysis(?), TF-IDF, etc.
(the non-personalized rank)
<ASIDE>
The problem
with reranking
53% of the competitors could not improve the baseline
Why?
The ideal workbench:
compute the non-personalized rank
select the 10 best hits and serve them in order
re-rank using log analysis
put the new ranking algorithm in prod (yeah right!)
compute NDCG on the new logs
What Yandex / anybody / you can afford:
compute the non-personalized rank
select the 10 best hits
serve the 10 best hits in that (non-personalized) order
re-rank using log analysis, including the non-personalized rank as a feature
compute the score against logs produced with the former ranking
The problem here?
Users tend to click on the first few urls.
The user satisfaction metric is influenced by the display rank. Our score is not aligned with our goal.
We cannot separate the effect of the non-personalized rank signal from the effect of the display rank.
This workflow promotes an overly conservative re-ranking policy.
Even if we knew for sure that the url at rank 9 would be clicked by the user
if it were presented at rank 1, it would probably be a bad idea to rerank it to rank 1
in this contest.
Size of the biggest jump in a reranked session
In advertising,
people can afford displaying random stuff. They can (and do):
compute the non-personalized rank
select the 10 best hits
serve the 10 best hits ranked in random order
learn a ranker using log analysis
rank by setting the display rank to 0 for every hit
</ASIDE>
The other features
The meat: informative features
When to rerank: promoting / inhibiting features
Informative features (unexpected)
Revisits (at least 30%)
(document, user clicks) distance: SVD. A failure (why?)
(document, user query history) distance: TF-IDF, not finished in time. LDA: an attempt dropped a bit fast.
NOTHING SMART HERE
Revisits
In the past, when this url was displayed to the user for the exact same query, it ended in:
0 × satisfaction 2
1 × satisfaction 1
0 × satisfaction 0
0 × miss
2 × skipped
Revisits
We use additive smoothing (one pseudo-observation, counted as a miss) to make it
0 / (3+1) = 0% satisfaction 2
1 / (3+1) = 25% satisfaction 1
0 / (3+1) = 0% satisfaction 0
(0+1) / (3+1) = 25% miss
2 / (3+1) = 50% skipped
... rudimentary but does the job.
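The same arithmetic as a sketch (the single pseudo-count goes to "miss", as in the example above; the outcome names are just labels):

```python
def revisit_features(counts):
    """counts: past outcomes of displaying this url to this user for this query,
    e.g. {"satisfaction_2": 0, "satisfaction_1": 1, "satisfaction_0": 0,
          "miss": 0, "skipped": 2}."""
    total = sum(counts.values())
    smoothed = dict(counts)
    smoothed["miss"] = counts.get("miss", 0) + 1   # the pseudo-observation counts as a miss
    return {k: v / (total + 1) for k, v in smoothed.items()}

print(revisit_features({"satisfaction_2": 0, "satisfaction_1": 1,
                        "satisfaction_0": 0, "miss": 0, "skipped": 2}))
# {'satisfaction_2': 0.0, 'satisfaction_1': 0.25, 'satisfaction_0': 0.0,
#  'miss': 0.25, 'skipped': 0.5}
```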
Revisits
That's already five features. We add:
1 overall display counter
4 mean reciprocal ranks (roughly, the inverse of the harmonic mean of the rank)
1 snippet quality score: a twisted formula used to estimate snippet quality
11 features
Many variations
(In the past | within the same session),
when (the user | anyone)
(with this very query | whatever query | a subquery | a super query)
was offered (this url | this domain)
A LOT of possibilities !
We investigated 165 of them.
Query features
How complex is a query? How ambiguous is a query?
Click entropy (see the sketch after this list)
number of times it has been queried
number of terms
average position within a session
average number of occurrences in a session
MRR of its clicks
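Click entropy, for instance, is just the entropy of the distribution of clicked urls for that query; a sketch:

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Entropy of the clicked-url distribution for one query, over the whole log."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A navigational query (everyone clicks the same url) has entropy 0,
# an ambiguous one spreads its clicks over several urls:
print(click_entropy(["lemonde.fr"] * 10))                              # 0.0
print(click_entropy(["lemonde.fr", "lefigaro.fr", "liberation.fr"]))   # ~1.58
```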
User click habits
Click entropy
User click rank counters
Average number of terms
Average number of different terms in a session
Total number of queries issued by the user
Seasonality
Usually clicked on weekends, and today is a Monday
Long story short: point-wise approach, Random Forests.
Bonsai model, 30 features - 5th place...
... and then LambdaMART (90 features)
2nd place
No regrets that we didn't use LambdaMART first, though:
LambdaMART: 2 days of learning
Random Forest: 30 minutes, including hyperparameter optimization
The human story
A LOT of work
960+ emails
360+ features
Alarm clocks set at 3:00
Kaggle "makes data science a sport"
A sport, really?
Trial and Error
Experienced data scientists may take longer strides and choose their directions
more wisely,
STILL,
this problem is a random walk
Teamwork worked out great
Yet it was not a "we complete each other" story.
Teammates can fix your code / ideas
4 people will miss fewer things
Split the hard work
Motivation
Emulation
pampampam pulled the leaderboard up
totally different score distribution than I would have expected
leaderboard not stalling... more time, better results