ir-measures Documentation

ir-measures is a Python package that provides a common interface to several information retrieval (IR) evaluation tools, including pytrec_eval, gdeval, trectools, and others.

This package aims to simplify IR evaluation by providing an easy and flexible evaluation interface and by standardizing measure names (and their parameters).
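
For instance, 'P(rel=2)@5' denotes precision at cutoff 5, counting only documents judged at relevance level 2 or higher. The same standardized spelling is accepted on the command line and in Python; a minimal sketch (using the same measure names as the examples below):

>>> import ir_measures
>>> ir_measures.parse_measure('P(rel=2)@5')   # parse a measure name string into a measure object
P(rel=2)@5
>>> from ir_measures import P
>>> P(rel=2)@5                                # the same measure, built from the P measure object
P(rel=2)@5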

Quick Start

Install ir-measures using pip:

$ pip install ir-measures

Compute measures from the command line:

$ ir_measures path/to/qrels path/to/run nDCG@10 P@5 'P(rel=2)@5' Judged@10
nDCG@10     0.6251
P@5         0.7486
P(rel=2)@5  0.6000
Judged@10   0.9486

You can alternatively use a dataset ID from ir_datasets in place of path/to/qrels.
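
For example, if the corresponding test collection is available through ir_datasets, the qrels can be referenced by dataset ID (the ID below is only illustrative):

$ ir_measures msmarco-passage/trec-dl-2019/judged path/to/run nDCG@10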

Compute measures from python:

>>> import ir_measures
>>> from ir_measures import *
>>> qrels = ir_measures.read_trec_qrels('path/to/qrels')
>>> run = ir_measures.read_trec_run('path/to/run')
>>> ir_measures.calc_aggregate([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels, run)
{
    nDCG@10: 0.6251,
    P@5: 0.7486,
    P(rel=2)@5: 0.6000,
    Judged@10: 0.9486
}

PyTerrier Integration

ir_measures is used by the PyTerrier platform to evaluate ranking pipelines. In the following example, a BM25 ranker is evaluated using the standard measures for the TREC Deep Learning benchmark, all of which are provided by ir_measures:

import pyterrier as pt
from pyterrier.measures import *
dataset = pt.get_dataset("trec-deep-learning-passages")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")  # "index" refers to an existing Terrier index (built beforehand)
pt.Experiment(
    [bm25],
    dataset.get_topics("test-2019"),
    dataset.get_qrels("test-2019"),
    eval_metrics=[RR(rel=2), nDCG@10, nDCG@100, AP(rel=2)],
#                 ^ using ir_measures
)

Table of Contents