Ranking View Query Results
ArangoSearch supports the two most popular ranking schemes:
- Okapi BM25
- TF-IDF
Under the hood, both models rely on two main components:
- Term frequency (TF): in the simplest case defined as the number of times a term occurs in a document
- Inverse document frequency (IDF): a measure of how relevant a term is, i.e. whether the word is common or rare across all documents
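As a rough illustration of how the two components combine, a common textbook form of TF-IDF is shown below. This is a sketch under the assumption of raw term counts and a natural logarithm; the exact weighting and normalization used internally by ArangoSearch may differ. Here $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents that contain the term $t$.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
% Example: a term occurs 3 times in a document and appears in
% 10 of 1000 documents:
%   3 \cdot \ln(1000 / 10) = 3 \cdot \ln 100 \approx 13.8
```

A rare term (small $\mathrm{df}(t)$) therefore contributes more to the score than a term that occurs in almost every document.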
See Ranking in ArangoSearch in the ArangoSearch Tutorial to learn more about the ranking model.
To sort View results from most relevant to least relevant, use a SORT operation with a call to a Scoring function as expression and set the order to descending. Scoring functions expect the document emitted by a FOR … IN loop that iterates over a View as first argument.
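A minimal sketch of this pattern follows. The View name viewName, the attribute text, and the search terms are placeholders chosen for illustration; the attribute is assumed to be indexed by the View with the text_en Analyzer.

```aql
// Hypothetical View "viewName" and attribute "text"; adjust to your data
FOR doc IN viewName
  SEARCH ANALYZER(doc.text IN TOKENS("quick brown fox", "text_en"), "text_en")
  SORT BM25(doc) DESC  // highest score (most relevant) first
  RETURN doc
```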
You can also return the ranking score as part of the result.
FOR doc IN viewName
  SEARCH …
  RETURN MERGE(doc, { bm25: BM25(doc), tfidf: TFIDF(doc) })
Scoring functions cannot be used outside of SEARCH operations, as the scores can only be computed in the context of a View, especially because of the inverse document frequency (IDF).
View definition:
{
  "links": {
    "imdb_vertices": {
      "fields": {
        "description": {
          "analyzers": [
            "text_en"
          ]
        }
      }
    }
  }
}
AQL Queries:
Search for movies with certain keywords in their description and rank the results using the BM25() function:
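A sketch of such a BM25-ranked query is shown here. It assumes the View is named imdb and reuses the keyword list from the other queries on this page; the LIMIT of 10 is an arbitrary choice to keep the result short.

```aql
FOR doc IN imdb
  // Match documents whose description contains any of the tokenized keywords
  SEARCH ANALYZER(doc.description IN TOKENS("amazing action world alien sci-fi science documental galaxy", "text_en"), "text_en")
  SORT BM25(doc) DESC  // rank by Okapi BM25 score, best match first
  LIMIT 10
  RETURN {
    title: doc.title,
    description: doc.description,
    score: BM25(doc)
  }
```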
Do the same but with the TFIDF() function:
FOR doc IN imdb
  SEARCH ANALYZER(doc.description IN TOKENS("amazing action world alien sci-fi science documental galaxy", "text_en"), "text_en")
  SORT TFIDF(doc) DESC
  RETURN {
    title: doc.title,
    description: doc.description,
    score: TFIDF(doc)
  }
Query Time Relevance Tuning
You can fine-tune the scores computed by the Okapi BM25 and TF-IDF relevance models at query time via the BOOST() AQL function and also calculate a custom score. In addition, the BM25() function lets you adjust the coefficients at query time.
The BOOST() function is similar to the ANALYZER() function in that it accepts any valid SEARCH expression as first argument. You can set the boost factor for that sub-expression via the second parameter. Documents that match boosted parts of the search expression get higher scores.
View definition:
{
  "links": {
    "imdb_vertices": {
      "fields": {
        "description": {
          "analyzers": [
            "text_en"
          ]
        }
      }
    }
  }
}
AQL Queries:
Prefer galaxy over the other keywords:
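A sketch of a query that boosts galaxy might look as follows. It assumes the View is named imdb and uses a boost factor of 5, as in the other boosted query on this page; the default BM25() coefficients are used for scoring.

```aql
FOR doc IN imdb
  // Matches on "galaxy" are weighted 5 times higher than the other keywords
  SEARCH ANALYZER(doc.description IN TOKENS("amazing action world alien sci-fi science documental", "text_en")
      OR BOOST(doc.description IN TOKENS("galaxy", "text_en"), 5), "text_en")
  SORT BM25(doc) DESC
  LIMIT 10
  RETURN {
    title: doc.title,
    description: doc.description,
    score: BM25(doc)
  }
```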
If you are an information retrieval expert and want to fine-tune the weighting schemes at query time, you can do so. The BM25() function accepts free coefficients as parameters, for instance to turn it into BM15:
FOR doc IN imdb
  SEARCH ANALYZER(doc.description IN TOKENS("amazing action world alien sci-fi science documental", "text_en")
      OR BOOST(doc.description IN TOKENS("galaxy", "text_en"), 5), "text_en")
  LET score = BM25(doc, 1.2, 0)
  SORT score DESC
  LIMIT 10
  RETURN {
    title: doc.title,
    description: doc.description,
    score
  }
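To see why passing 0 as the last coefficient yields BM15, it can help to look at the textbook Okapi BM25 formula. The form below is the standard one from the literature, with $k$ and $b$ standing for the two coefficients of BM25(); treating it as the exact formula ArangoSearch evaluates is an assumption. Here $|d|$ is the length of document $d$ and $\mathrm{avgdl}$ is the average document length.

```latex
\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{\mathrm{tf}(t, d)\,(k + 1)}
       {\mathrm{tf}(t, d) + k\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
% With b = 0 the length normalization term vanishes (BM15):
%   \frac{\mathrm{tf}(t, d)\,(k + 1)}{\mathrm{tf}(t, d) + k}
% With b = 1 document length is fully normalized (BM11).
```

In this form, $k$ controls how quickly repeated occurrences of a term saturate, and $b$ controls how strongly long documents are penalized.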
You can also calculate a custom score, taking into account additional fields of the document.
Match movies with the (normalized) phrase star war in the title and calculate a custom score based on BM25 and the movie runtime to favor longer movies:
FOR doc IN imdb
  SEARCH PHRASE(doc.title, "Star Wars", "text_en")
  LET score = BM25(doc) * LOG(doc.runtime + 1)
  SORT score DESC
  RETURN {
    title: doc.title,
    runtime: doc.runtime,
    bm25: BM25(doc),
    score
  }