Getting Started

Installation

You can install ir-measures using pip:

$ pip install ir-measures

You can also install the current development version from source:

$ git clone git@github.com:terrierteam/ir_measures.git
$ cd ir_measures
$ python setup.py install

Command Line Interface

ir_measures can be used on the command line with an interface similar to trec_eval:

$ ir_measures path/to/qrels path/to/run nDCG@10 P@5 'P(rel=2)@5' Judged@10
nDCG@10  0.6251
P@5   0.7486
P(rel=2)@5  0.6000
Judged@10   0.9486

You can alternatively use a dataset ID from ir_datasets in place of path/to/qrels.
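
For example, with a dataset ID from ir_datasets (the run path is illustrative; any valid ir_datasets ID can be used):

$ ir_measures trec-robust04 path/to/run nDCG@10 P@5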

You can see per-topic results using the -q flag (similar to trec_eval):

$ ir_measures -q path/to/qrels path/to/run nDCG@10 P@5 'P(rel=2)@5' Judged@10
1   P@5 0.6000
1   nDCG@10 0.5134
2   P@5 1.0000
2   nDCG@10 0.8522
3   P@5 1.0000
...
34  Judged@10   1.0000
35  Judged@10   1.0000
all nDCG@10 0.6251
all P@5 0.7486
all P(rel=2)@5  0.6000
all Judged@10   0.9486

The first column in the output is the query ID (or all for the aggregated results; the aggregated lines can be suppressed with the -n flag). Results are written to the output stream as they are calculated by iter_calc, so they may not appear in a predictable order [1].

Full list of command line arguments (an example combining several flags follows the list):

  • -h (--help): print information about running the command

  • -p X (--places X): number of decimal places to use when writing the output. Default: 4.

  • -q (--by_query): print the results by query (topic), as shown above.

  • -n (--no_summary): when used with -q, does not print aggregated (all) values.

  • --provider X: forces the use of a particular provider, rather than using the default fallback approach. Possible values are: pytrec_eval, judged, gdeval, trectools, and msmarco.
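
The flags can be combined; for instance, the following (with illustrative paths) would print per-query nDCG@10 values to 6 decimal places, without the aggregated all line:

$ ir_measures -q -n -p 6 path/to/qrels path/to/run nDCG@10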

Python Interface

Compute measures from python:

>>> import ir_measures
>>> from ir_measures import *
>>> qrels = ir_measures.read_trec_qrels('path/to/qrels')
>>> run = ir_measures.read_trec_run('path/to/run')
>>> ir_measures.calc_aggregate([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels, run)
{
    nDCG@10: 0.6251,
    P@5: 0.7486,
    P(rel=2)@5: 0.6000,
    Judged@10: 0.9486
}

Per-topic results can be calculated using iter_calc:

>>> for metric in ir_measures.iter_calc([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels, run):
...     print(metric)
Metric(query_id='1', measure=P@5, value=0.6)
Metric(query_id='1', measure=nDCG@10, value=0.5134306625775544)
Metric(query_id='2', measure=P@5, value=1.0)
Metric(query_id='2', measure=nDCG@10, value=0.8521705090845474)
Metric(query_id='3', measure=P@5, value=1.0)
...
Metric(query_id='33', measure=Judged@10, value=1.0)
Metric(query_id='34', measure=Judged@10, value=1.0)

Here again, the results from iter_calc may not be returned in a predictable order [1].

Qrels formats

Query relevance assessments can be provided in a variety of formats.

namedtuple iterable: Any iterable of named tuples. You can use ir_measures.Qrel, or any other NamedTuple with the fields query_id, doc_id, and relevance (for other NamedTuple types, the field order and any additional fields do not matter; only the field names need to match):

qrels = [
    ir_measures.Qrel("Q0", "D0", 0),
    ir_measures.Qrel("Q0", "D1", 1),
    ir_measures.Qrel("Q1", "D0", 0),
    ir_measures.Qrel("Q1", "D3", 2),
]
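
As noted above, any NamedTuple type with matching field names also works. A minimal sketch with a hypothetical MyJudgment type (the extra annotator field and the different field order are simply ignored):

from typing import NamedTuple

class MyJudgment(NamedTuple):
    doc_id: str
    query_id: str
    relevance: int
    annotator: str  # extra fields are ignored

qrels = [
    MyJudgment("D0", "Q0", 0, "assessor-1"),
    MyJudgment("D1", "Q0", 1, "assessor-1"),
    MyJudgment("D0", "Q1", 0, "assessor-2"),
    MyJudgment("D3", "Q1", 2, "assessor-2"),
]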

Note that if the qrels are provided as an iterator (such as the result of a generator), ir_measures will consume the entire sequence.

ir_measures.Qrel supports an optional fourth parameter, iteration. This is the source of the subtopic ID used by diversity measures (the name matches TREC conventions). Note that unlike TREC-formatted qrels, this parameter comes last, since optional fields in namedtuples must follow the required ones.
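
A short sketch of subtopic qrels built with the iteration keyword (the subtopic IDs are illustrative and passed as strings):

qrels = [
    ir_measures.Qrel("Q0", "D0", 1, iteration="0"),
    ir_measures.Qrel("Q0", "D1", 1, iteration="1"),
    ir_measures.Qrel("Q1", "D3", 2, iteration="0"),
]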

Pandas dataframe: A pandas dataframe with the columns query_id, doc_id, and relevance:

import pandas as pd
qrels = pd.DataFrame([
    {'query_id': "Q0", 'doc_id': "D0", 'relevance': 0},
    {'query_id': "Q0", 'doc_id': "D1", 'relevance': 1},
    {'query_id': "Q1", 'doc_id': "D0", 'relevance': 0},
    {'query_id': "Q1", 'doc_id': "D3", 'relevance': 2},
])

Dataframes support an optional fourth column, iteration. This is the source of the subtopic ID used by diversity measures (the name matches TREC conventions).

If your dataframe's columns are named something else, you can map them with pandas' rename function. For instance, if your dataframe has the columns qid, docno, and label, you can easily make a qrels dataframe that is compatible with ir-measures like so:

qrels = df.rename(columns={'qid': 'query_id', 'docno': 'doc_id', 'label': 'relevance'})

TREC-formatted qrels file: You can read a TREC-formatted qrels file:

# a file path:
qrels = ir_measures.read_trec_qrels('path/to/qrels')
# raw qrels file contents:
qrels = ir_measures.read_trec_qrels('''
Q0 0 D0 0
Q0 0 D1 1
Q1 0 D0 0
Q1 0 D3 2
''')
# TREC qrels format: "query_id iteration doc_id relevance".

Note that read_trec_qrels returns a generator. If you need to use the qrels multiple times, wrap it in the list constructor to read all the qrels into memory.
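
For example, to read all the qrels into a list that can be reused:

qrels = list(ir_measures.read_trec_qrels('path/to/qrels'))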

ir_datasets qrels: Qrels from the ir_datasets package. This mode simply adheres to the namedtuple iterable specification above:

import ir_datasets
qrels = ir_datasets.load('trec-robust04').qrels_iter()

dict-of-dict: Qrels structured in a hierarchy. At the first level, query IDs map to another dictionary. At the second level, document IDs map to (integer) relevance scores:

qrels = {
    'Q0': {
        "D0": 0,
        "D1": 1,
    },
    "Q1": {
        "D0": 0,
        "D3": 2
    }
}

Note that this format does not support the iteration field, so it should not be used with diversity measures.

Run formats

System outputs can be provided in a variety of formats.

namedtuple iterable: Any iterable of named tuples. You can use ir_measures.ScoredDoc, or any other NamedTuple with the fields query_id, doc_id, and score:

run = [
    ir_measures.ScoredDoc("Q0", "D0", 1.2),
    ir_measures.ScoredDoc("Q0", "D1", 1.0),
    ir_measures.ScoredDoc("Q1", "D0", 2.4),
    ir_measures.ScoredDoc("Q1", "D3", 3.6),
]

Note that if the run is provided as an iterator (such as the result of a generator), ir_measures will consume the entire sequence.

Pandas dataframe: A pandas dataframe with the columns query_id, doc_id, and score:

import pandas as pd
run = pd.DataFrame([
    {'query_id': "Q0", 'doc_id': "D0", 'score': 1.2},
    {'query_id': "Q0", 'doc_id': "D1", 'score': 1.0},
    {'query_id': "Q1", 'doc_id': "D0", 'score': 2.4},
    {'query_id': "Q1", 'doc_id': "D3", 'score': 3.6},
])

If your dataframe's columns are named something else, you can map them with pandas' rename function. For instance, if your dataframe has the columns qid, docno, and output, you can easily make a run dataframe that is compatible with ir-measures like so:

run = df.rename(columns={'qid': 'query_id', 'docno': 'doc_id', 'output': 'score'})

TREC-formatted run file: You can read a TREC-formatted run file:

# a file path:
run = ir_measures.read_trec_run('path/to/run')
# raw run file contents:
run = ir_measures.read_trec_run('''
Q0 0 D0 0 1.2 runid
Q0 0 D1 1 1.0 runid
Q1 0 D3 0 3.6 runid
Q1 0 D0 1 2.4 runid
''')
# TREC run format: "query_id ignored doc_id rank score runid". This parser ignores "ignored", "rank", and "runid".

Note that read_trec_run returns a generator. If you need to use the run multiple times, wrap it in the list constructor to read the entire run into memory.

dict-of-dict: Run structured in a hierarchy. At the first level, query IDs map to another dictionary. At the second level, document IDs map to (float) ranking scores:

run = {
    'Q0': {
        "D0": 1.2,
        "D1": 1.0,
    },
    "Q1": {
        "D0": 2.4,
        "D3": 3.6
    }
}

Measure Objects

Measure objects specify the measure you want to calculate, along with any parameters they take. There are several ways to create them. The easiest is to specify them directly in code:

>>> from ir_measures import * # imports all measure names
>>> AP
AP
>>> AP(rel=2)
AP(rel=2)
>>> nDCG@20
nDCG@20
>>> P(rel=2)@10
P(rel=2)@10

Notice that measures can include parameters. For instance, AP(rel=2) is the average precision measure with a minimum relevance level of 2 (i.e., documents must be judged at least 2 to count as relevant), while nDCG@20 specifies a ranking cutoff of 20. See each measure's documentation for full details of the available parameters.

If you need to get a measure object from a string (e.g., if specified by the user as a command line argument), use the ir_measures.parse_measure function:

>>> ir_measures.parse_measure('AP')
AP
>>> ir_measures.parse_measure('AP(rel=2)')
AP(rel=2)
>>> ir_measures.parse_measure('nDCG@20')
nDCG@20
>>> ir_measures.parse_measure('P(rel=2)@10')
P(rel=2)@10
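
For instance, a minimal sketch of a hypothetical script that takes measure names as command-line arguments (paths are illustrative):

import sys
import ir_measures

# e.g. `python evaluate.py nDCG@10 'P(rel=2)@5'`
measures = [ir_measures.parse_measure(arg) for arg in sys.argv[1:]]
qrels = ir_measures.read_trec_qrels('path/to/qrels')
run = ir_measures.read_trec_run('path/to/run')
print(ir_measures.calc_aggregate(measures, qrels, run))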

If you are familiar with the measure and family names from trec_eval, you can map them to measure objects using ir_measures.parse_trec_measure():

>>> ir_measures.parse_trec_measure('map')
[AP]
>>> ir_measures.parse_trec_measure('P') # expands to multiple levels
[P@5, P@10, P@15, P@20, P@30, P@100, P@200, P@500, P@1000]
>>> ir_measures.parse_trec_measure('P_3,8') # or 'P.3,8'
[P@3, P@8]
>>> ir_measures.parse_trec_measure('ndcg')
[nDCG]
>>> ir_measures.parse_trec_measure('ndcg_cut_10')
[nDCG@10]
>>> ir_measures.parse_trec_measure('official')
[P@5, P@10, P@15, P@20, P@30, P@100, P@200, P@500, P@1000, Rprec, Bpref, IPrec@0.0, IPrec@0.1, IPrec@0.2, IPrec@0.3, IPrec@0.4, IPrec@0.5, IPrec@0.6, IPrec@0.7, IPrec@0.8, IPrec@0.9, IPrec@1.0, AP, NumQ, NumRel, NumRet(rel=1), NumRet, RR]

Note that a single trec_eval measure name can map to multiple measures, so measures are returned as a list.
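
Since each name yields a list, you can flatten several trec_eval-style names into a single list of measures; a small sketch (the names are illustrative):

import ir_measures

names = ['map', 'ndcg_cut_10', 'P_5,10']
measures = [m for name in names for m in ir_measures.parse_trec_measure(name)]
# e.g. [AP, nDCG@10, P@5, P@10]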

Measures can be passed into functions like ir_measures.calc_aggregate, ir_measures.iter_calc, and ir_measures.evaluator. You can also calculate values from the measure object itself:

>>> AP.calc_aggregate(qrels, run)
0.2842120439595336
>>> (nDCG@10).calc_aggregate(qrels, run) # parens needed when @cutoff is used
0.6250748053944134
>>> for metric in (P(rel=2)@10).iter_calc(qrels, run):
...     print(metric)
Metric(query_id='1', measure=P(rel=2)@10, value=0.5)
Metric(query_id='2', measure=P(rel=2)@10, value=0.8)
...
Metric(query_id='35', measure=P(rel=2)@10, value=0.9)

Scoring multiple runs

Sometimes you need to evaluate several different systems using the same benchmark. To avoid repeating work for every run (such as processing the qrels), you can create an evaluator(measures, qrels) object that can be re-used across multiple runs. An evaluator object has calc_aggregate(run) and iter_calc(run) methods.

>>> evaluator = ir_measures.evaluator([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels)
>>> evaluator.calc_aggregate(run1)
{nDCG@10: 0.6250, P@5: 0.7485, P(rel=2)@5: 0.6000, Judged@10: 0.9485}
>>> evaluator.calc_aggregate(run2)
{nDCG@10: 0.6285, P@5: 0.7771, P(rel=2)@5: 0.6285, Judged@10: 0.9400}
>>> evaluator.calc_aggregate(run3)
{nDCG@10: 0.5286, P@5: 0.6228, P(rel=2)@5: 0.4628, Judged@10: 0.8485}
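
Per-query results work the same way through the evaluator's iter_calc method; a minimal sketch, reusing run1 from above:

for metric in evaluator.iter_calc(run1):
    print(metric)  # Metric(query_id=..., measure=..., value=...)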

Empty Set Behaviour

ir-measures normalizes the behaviour across tools by always returning results based on all queries that appear in the provided qrels, regardless of what appears in the run. This corresponds with the -c flag in trec_eval. Queries that appear in the run but not the qrels are ignored, and queries that appear in the qrels but not the run are given a score of 0.

This behaviour is based on the following reasoning:

  1. Queries that do not appear in the qrels were not judged, and therefore cannot be properly scored if returned in the run.

  2. Queries that do not appear in the run may simply have returned no results, and are therefore scored as such.

We believe these are the proper settings, so there is currently no way to change this behaviour directly in the software. If you wish to score only some of the queries provided in the qrels, you can filter the qrels passed to ir-measures down to just those queries.
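
For instance, a minimal sketch that scores only a chosen subset of query IDs (the IDs and paths are illustrative, and the qrels are assumed to be namedtuples as described in the Qrels formats section):

import ir_measures
from ir_measures import nDCG, P

keep = {'1', '2', '3'}  # hypothetical query IDs to score
qrels = [qrel for qrel in ir_measures.read_trec_qrels('path/to/qrels')
         if qrel.query_id in keep]
run = ir_measures.read_trec_run('path/to/run')
print(ir_measures.calc_aggregate([nDCG@10, P@5], qrels, run))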