On the Kaggle Re-ranking Contest

Tokyo Webmining - 36th

Who am I?

  • Software engineer at Dataiku
  • Former foreign student in Tokyo (Information Science and Technology)
  • http://fulmicoton.com
  • @fulmicoton

Dataiku?

French startup. We develop a data science platform.

Clean / process / visualize small to big data in R, Python, Cascading, Hive, Pig, SQL, pandas, all in one place!

“Oh great, a 25-minute commercial!!!”

Not gonna happen. Bug me after the presentation if you want a demo of the product.

What's re-ranking?

Let's imagine you are in charge of a web search engine's quality...

If a search engine is the core of your business

(web search, E-commerce, recruiting...)

Improving its relevance has a direct impact on your ROI.

How do we define the quality of a search engine?

First we're gonna need to measure the relevance of a search result!

A metric for relevance right from the logs?

Assuming we search for "french newspaper", we take a look at the logs.

We compute the so-called dwell time of a click,

i.e. the time elapsed before the next action.

Dwell time has been shown to be correlated with relevance.

We can assign a relevance grade to each URL.
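A minimal sketch of this labeling, assuming the log is a list of timestamped actions per session; the 50 / 400 thresholds below are purely illustrative, not necessarily the ones actually used:

    # Dwell time of a click = time elapsed until the next action in the session.
    def dwell_times(actions):
        # actions: list of (timestamp, kind, url) tuples, sorted by timestamp.
        dwells = []
        for (t, kind, url), nxt in zip(actions, actions[1:] + [None]):
            if kind != "click":
                continue
            # Last action of the session: treat the dwell time as unbounded.
            dwell = float("inf") if nxt is None else nxt[0] - t
            dwells.append((url, dwell))
        return dwells

    # Map a dwell time to a relevance grade in {0, 1, 2} (thresholds are made up here).
    def relevance(dwell, short=50, long=400):
        if dwell < short:
            return 0
        return 1 if dwell < long else 2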

Good, we have a measure of relevance!

Can we get an overall score for our search engine now?

What's the spec?

  • Better score for search engine putting greater relevance on top

DCG

What's the spec?

  • Better score for search engine putting greater relevance on top
  • Something between 0 and 1

NDCG
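For reference, a minimal sketch of DCG / NDCG for a single query, using the 2^rel - 1 gain and a log2 discount (one common variant; the exact contest formula may differ in details):

    import math

    def dcg(relevances):
        # relevances: grades (0, 1, 2) of the urls, in the order they are ranked.
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        # DCG divided by the DCG of the ideal ordering -> a score between 0 and 1.
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 1.0

    ndcg([0, 2, 1, 0])   # our ranking put the most relevant url at position 2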

Let's get to work

Your search engine will go through different stages

  • TFIDF? ✓
  • Query expansion? ✓
  • PageRank? ✓


Wait, different users have different relevance judgments!

  • Collapsing? ✓
  • Personalized reranking?

Personalized re-ranking is about reordering the n best results based on the user's past search history.

The contest

Yandex supplied 27 days of anonymized logs:

  • 34,573,630 Sessions with user id
  • 21,073,569 Queries
  • 64,693,054 Clicks

A team of 4


  • Kenji Lefevre - Pure Math
  • Christophe Bourguignat - Eng. / Signal Processing
  • Mathieu Scordia - Data Scientist
  • Me (Paul Masurel) - Softw. / Signal Processing


No researcher.
No experience in reranking.
Not much experience in ML for most of us.
Not exactly our job. No expectations.

At this point,

We are but hobbits

Also, it is only a contest for us

Disclaimer

We did not respect the scientific method.

Start simple.

Point-wise approach

Making it supervised learning

We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated), and select test queries just like Yandex did.

Where is my X, where is my y?

For each of the annotated sessions, and for each of the 10 URLs to re-rank, we compute (using the train sessions) a set of features X.

First, we try to predict the relevance of each URL (0, 1, or 2).

Regression / classification?

  • regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery.
  • classification: we lose the hierarchy, but we can optimize the NDCG (more on that later).

According to

P. Li, C. J. C. Burges, and Q. Wu. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.

Classification outperforms regression.

How do we proceed?

We use classification techniques to predict the outcome of each URL display independently, and output the probabilities of the different classes.

SVMs forbidden here! (we need class probabilities)

We sort urls by the value of

p(relevancy=1) + 3 p(relevancy=2)


Where is this 3 coming from?

Given a Bayes classifier

We want to maximize the expected DCG, where a url with relevance rel contributes a gain of (2^rel - 1) / log2(1 + rank).

Given a Bayes classifier, the expected gain of a url at a given position is proportional to

p(relevancy=1) × (2^1 - 1) + p(relevancy=2) × (2^2 - 1)

(the relevancy=0 term vanishes since 2^0 - 1 = 0)

(hidden fallacy here)

We therefore need to order urls by decreasing expected gain.

Hence

p(relevancy=1) + 3 p(relevancy=2)

By the way

P. Li, C. J. C. Burges, and Q. Wu. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.

get slightly better results with linear weighting.

Settled for Random Forests
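A minimal sketch of this point-wise scoring with scikit-learn (feature extraction omitted; X_train, y_train and X_test_query are placeholder names):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # One row per (session, url) pair; y_train holds the relevance grade 0 / 1 / 2.
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
    clf.fit(X_train, y_train)

    # For the 10 urls of one test query: probability of each relevance class.
    # Columns follow clf.classes_, assumed to be [0, 1, 2] here.
    proba = clf.predict_proba(X_test_query)      # shape (10, 3)
    score = proba[:, 1] + 3.0 * proba[:, 2]      # expected DCG gain (2^rel - 1)
    reranked = np.argsort(-score)                # url indices, best first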

Hyperparameter fitting

We are not doing typical classification here. It is extremely important to perform the grid search directly against the final NDCG score.
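A sketch of what this can look like, reusing an ndcg function such as the one sketched earlier; param_grid, valid_queries and the other names are placeholders:

    from itertools import product
    from sklearn.ensemble import RandomForestClassifier

    param_grid = {"n_estimators": [100, 300], "max_depth": [8, 12, None]}
    best_score, best_params = -1.0, None

    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        clf = RandomForestClassifier(n_jobs=-1, **params).fit(X_train, y_train)
        ndcgs = []
        for X_q, rel_q in valid_queries:          # one entry per validation query
            proba = clf.predict_proba(X_q)
            score = proba[:, 1] + 3.0 * proba[:, 2]
            order = (-score).argsort()
            ndcgs.append(ndcg([rel_q[i] for i in order]))
        mean_ndcg = sum(ndcgs) / len(ndcgs)       # the number we actually care about
        if mean_ndcg > best_score:
            best_score, best_params = mean_ndcg, params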

Features

First of all, the rank.

In this contest, the rank is both:

  • the display rank: the rank that has been displayed to the user
  • the non-personalized rank: the rank computed by Yandex using PageRank, non-personalized log analysis(?), TF-IDF, etc.

<ASIDE>

The problem

with reranking

53% of the competitors could not improve the baseline

Why?

The ideal workbench

  1. compute non-personalized rank
  2. select the 10 best hits and serve them in order
  3. re-rank using log analysis.
  4. put new ranking algorithm in prod (yeah right!)
  5. compute NDCG on new logs

What Yandex / anybody / you can afford

  1. compute non-personalized rank
  2. select the 10 best hits
  3. serve the 10 best hits in the non-personalized order
  4. re-rank using log analysis, including the non-personalized rank as a feature
  5. compute the score against the logs obtained with the former ranking

The problem here?

Users tend to click on the first few URLs.
  1. The user satisfaction metric is influenced by the display rank. Our score is not aligned with our goal.
  2. We cannot discriminate the effect of the non-personalized rank signal from the effect of the display rank.

This workflow promotes an over-conservative re-ranking policy.

Even if we knew for sure that the URL at rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to re-rank it to rank 1 in this contest.

Size of the biggest jump in a reranked session

In advertising,

People can afford to display random stuff. They can (and do):

  1. compute non-personalized rank
  2. select the 10 best hits
  3. serve the 10 best hits in random order
  4. learn a ranker using log analysis
  5. rank by setting the display rank feature to 0 for every hit (see the sketch below)
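Step 5 amounts to neutralizing the display rank at prediction time, so that the ranking only reflects the informative features. A sketch of the idea, reusing the classifier and scoring from the earlier point-wise sketch; display_rank is a hypothetical feature column of a pandas DataFrame:

    # Neutralize the display rank feature before scoring.
    X_neutral = X_test.copy()
    X_neutral["display_rank"] = 0
    proba = clf.predict_proba(X_neutral)
    score = proba[:, 1] + 3.0 * proba[:, 2]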

</ASIDE>

The other features

  • The meat: informative features
  • When to re-rank: promoting / inhibiting features

Informative features (unexpected)

  • Revisits (at least 30%)
  • (Document, user clicks) distance: SVD. A failure (why?)
  • (Document, user query history) distance: TF-IDF, not finished on time. LDA: an attempt dropped a bit fast.

NOTHING SMART HERE

Revisits

In the past, when this URL was displayed to the user for the exact same query, it ended up as:

  • 0 × satisfaction 2
  • 1 × satisfaction 1
  • 0 × satisfaction 0
  • 0 × miss
  • 2 × skipped

Revisits

We use additive smoothing to turn this into:
  • 0 / (3+1) = 0% satisfaction 2
  • 1 / (3+1) = 25% satisfaction 1
  • 0 / (3+1) = 0% satisfaction 0
  • (0+1) / (3+1) = 25% miss
  • 2 / (3+1) = 50% skipped

... rudimentary but does the job.
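A sketch of that smoothing, with the extra pseudo-count going to "miss" as in the example above (counter names are illustrative):

    def smooth_revisit_counters(counts):
        # counts: dict outcome -> number of past displays of this url for this
        # (user, query), e.g. {"sat2": 0, "sat1": 1, "sat0": 0, "miss": 0, "skipped": 2}
        total = sum(counts.values())
        smoothed = dict(counts)
        smoothed["miss"] += 1              # one extra, fictitious missed display
        return {k: v / (total + 1) for k, v in smoothed.items()}

    smooth_revisit_counters({"sat2": 0, "sat1": 1, "sat0": 0, "miss": 0, "skipped": 2})
    # -> {"sat2": 0.0, "sat1": 0.25, "sat0": 0.0, "miss": 0.25, "skipped": 0.5}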

Revisits

That's already five features. We add:
  • 1 overall display counter
  • 4 mean reciprocal ranks (the mean of 1/rank, i.e. related to the harmonic mean of the ranks; sketched below)
  • 1 snippet quality score: a twisted formula used to estimate snippet quality

11 features
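For instance, a mean reciprocal rank over past displays could look like this (a sketch; the variants actually used were surely more involved):

    def mean_reciprocal_rank(ranks):
        # ranks: 1-based positions at which the url appeared (or was clicked)
        # in past sessions. Returns a value in (0, 1], higher is better.
        return sum(1.0 / r for r in ranks) / len(ranks) if ranks else 0.0

    mean_reciprocal_rank([1, 3, 10])   # (1 + 1/3 + 1/10) / 3 ≈ 0.478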

Many variations

  • (In the past | within the same session),
  • when (the user | anyone)
  • (with this very query | whatever query | a subquery | a super-query)
  • and was offered (this url | this domain)

A LOT of possibilities! We investigated 165 of them.
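The combinatorics behind "a LOT of possibilities", as a sketch (each combination of scopes can then be crossed with the base counters above):

    from itertools import product

    time_scopes  = ["in the past", "within the same session"]
    who          = ["the user", "anyone"]
    query_scopes = ["this very query", "whatever query", "a subquery", "a super-query"]
    targets      = ["this url", "this domain"]

    variations = list(product(time_scopes, who, query_scopes, targets))
    len(variations)   # 2 * 2 * 4 * 2 = 32 scope combinations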

Query features

How complex is a query? How ambiguous is a query?

  • Click entropy (sketched below)
  • number of times it has been queried
  • number of terms
  • average position within a session
  • average number of occurrences in a session
  • MRR of its clicks
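Click entropy can be computed from the empirical click distribution of a query; 0 means everybody clicks on the same url (an unambiguous query). A sketch:

    import math
    from collections import Counter

    def click_entropy(clicks):
        # clicks: list of urls clicked for this query across the logs.
        counts = Counter(clicks)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    click_entropy(["a", "a", "a", "b"])   # ≈ 0.81 bits: mildly ambiguous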

User click habits

  • Click entropy
  • User click rank counters
  • Average number of terms
  • Average number of different terms in a session
  • Total number of queries issued by the user

Seasonality

  • Usually clicked on weekends, and today is a Monday

Long story short: a point-wise approach with Random Forests.

Bonsai model, 30 features - 5th place...

... and then LambdaMART (90 features)

2nd place

No regrets that we didn't use LambdaMART first, though:

  • LambdaMART: 2 days of training
  • Random Forest: 30 minutes, including hyperparameter optimization

The human story

A LOT of work

960+ emails

360+ features

Alarm clocks set at 3:00

Kaggle makes "datascience a sport"

A sport, really?

Trial and Error

Experienced data scientists may take bigger strides and choose their directions more wisely,

STILL

this problem is a random walk.

Team work worked out great

Yet no "we complete each other".

  • Teammates can fix your code / ideas
  • 4 people miss fewer things
  • Splitting the hard work
  • Motivation

Emulation

  • pampampam pulled the leaderboard up
  • totally different score distribution than I would have expected
  • leaderboard not stalling... more time, better results

Visualizing results

The End