Measures

Measure objects specify the measure to calculate, along with any parameters it takes. (They do not define the implementation; that is the job of a Provider.)

This page provides a list of the Measures that are available in this package.
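As a rough illustration of the split between a measure's specification and its implementation, the sketch below models a measure as nothing more than a name plus parameters. `MeasureSpec` and its behaviour are hypothetical stand-ins written for this page, not the package's own classes; only the `P(rel=2)@10` style of notation is taken from the listings on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasureSpec:
    """Hypothetical stand-in for a measure object: a name plus parameters."""
    name: str
    params: tuple = ()  # sorted (key, value) pairs, so specs stay hashable

    def __call__(self, **kwargs):
        # Return a new spec with the given parameters set, e.g. P(rel=2).
        merged = dict(self.params)
        merged.update(kwargs)
        return MeasureSpec(self.name, tuple(sorted(merged.items())))

    def __matmul__(self, cutoff):
        # The @ operator is treated as shorthand for setting `cutoff`.
        return self(cutoff=cutoff)

    def __str__(self):
        args = ",".join(f"{k}={v!r}" for k, v in self.params if k != "cutoff")
        text = f"{self.name}({args})" if args else self.name
        cutoff = dict(self.params).get("cutoff")
        return text if cutoff is None else f"{text}@{cutoff}"


P = MeasureSpec("P")
spec = P(rel=2) @ 10  # describes what to compute; no scoring code involved
```

Because the specification carries no scoring logic, several providers could accept the same `spec` and each decide whether it supports that parameter combination.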

`Accuracy`

Reports the probability that a relevant document is ranked before a non-relevant one. This measure is intended for diagnosis (checking that train/test/validation accuracy match). As such, it only considers relevant documents that appear among the returned ones.

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)

Supported by:

accuracy: Accuracy(rel=ANY)@ANY
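A minimal sketch of the pairwise idea for a single query, assuming `ranking` is a list of document ids in rank order and `qrels` maps document ids to relevance scores (both formats are assumptions made for this example). Unjudged documents are counted as non-relevant here, which is also an assumption; the provider may treat them differently.

```python
def pairwise_accuracy(ranking, qrels, rel=1, cutoff=None):
    """Fraction of (relevant, non-relevant) returned pairs in the right order."""
    docs = ranking if cutoff is None else ranking[:cutoff]
    # Assumption: a document without a judgment counts as non-relevant.
    labels = [qrels.get(doc, 0) >= rel for doc in docs]
    n_rel = sum(labels)
    n_nonrel = len(labels) - n_rel
    if n_rel == 0 or n_nonrel == 0:
        return float("nan")  # no pairs to compare, so the value is undefined
    correct = 0
    nonrel_after = n_nonrel  # non-relevant documents at or below this rank
    for is_rel in labels:
        if is_rel:
            correct += nonrel_after  # each one below forms a correct pair
        else:
            nonrel_after -= 1
    return correct / (n_rel * n_nonrel)


score = pairwise_accuracy(
    ["d1", "d2", "d3", "d4"],
    {"d1": 1, "d2": 0, "d3": 1, "d4": 0},
)
```

Only returned documents enter the pair count, which mirrors the statement above that relevant documents outside the returned list are ignored.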

`alpha_DCG`

A version of DCG that accounts for multiple possible query intents.

Citation

Clarke et al. Novelty and diversity in information retrieval evaluation. SIGIR 2008. [link]

@inproceedings{DBLP:conf/sigir/ClarkeKCVABM08,
  author       = {Charles L. A. Clarke and
                  Maheedhar Kolla and
                  Gordon V. Cormack and
                  Olga Vechtomova and
                  Azin Ashkan and
                  Stefan B{\"{u}}ttcher and
                  Ian MacKinnon},
  editor       = {Sung{-}Hyon Myaeng and
                  Douglas W. Oard and
                  Fabrizio Sebastiani and
                  Tat{-}Seng Chua and
                  Mun{-}Kew Leong},
  title        = {Novelty and diversity in information retrieval evaluation},
  booktitle    = {Proceedings of the 31st Annual International {ACM} {SIGIR} Conference
                  on Research and Development in Information Retrieval, {SIGIR} 2008,
                  Singapore, July 20-24, 2008},
  pages        = {659--666},
  publisher    = {{ACM}},
  year         = {2008},
  url          = {https://doi.org/10.1145/1390334.1390446},
  doi          = {10.1145/1390334.1390446},
  timestamp    = {Sun, 25 Oct 2020 23:03:58 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/ClarkeKCVABM08.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
alpha (float) - redundancy intolerance
judged_only (bool) - calculate measure using only judged documents (i.e., discard unjudged documents)

Supported by:

pyndeval: alpha_DCG(alpha=ANY,rel=ANY,judged_only=ANY)@ANY

`alpha_nDCG`

A version of nDCG that accounts for multiple possible query intents.

Citation

Clarke et al. Novelty and diversity in information retrieval evaluation. SIGIR 2008. [link]

@inproceedings{DBLP:conf/sigir/ClarkeKCVABM08,
  author       = {Charles L. A. Clarke and
                  Maheedhar Kolla and
                  Gordon V. Cormack and
                  Olga Vechtomova and
                  Azin Ashkan and
                  Stefan B{\"{u}}ttcher and
                  Ian MacKinnon},
  editor       = {Sung{-}Hyon Myaeng and
                  Douglas W. Oard and
                  Fabrizio Sebastiani and
                  Tat{-}Seng Chua and
                  Mun{-}Kew Leong},
  title        = {Novelty and diversity in information retrieval evaluation},
  booktitle    = {Proceedings of the 31st Annual International {ACM} {SIGIR} Conference
                  on Research and Development in Information Retrieval, {SIGIR} 2008,
                  Singapore, July 20-24, 2008},
  pages        = {659--666},
  publisher    = {{ACM}},
  year         = {2008},
  url          = {https://doi.org/10.1145/1390334.1390446},
  doi          = {10.1145/1390334.1390446},
  timestamp    = {Sun, 25 Oct 2020 23:03:58 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/ClarkeKCVABM08.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
alpha (float) - redundancy intolerance
judged_only (bool) - calculate measure using only judged documents (i.e., discard unjudged documents)

Supported by:

pyndeval: alpha_nDCG(alpha=ANY,rel=ANY,judged_only=ANY)@ANY

`AP`

The [Mean] Average Precision ([M]AP). The average precision of a single query is the mean of the precision scores at each relevant item returned in a search results list.

AP is typically used for adhoc ranking tasks where retrieving as many relevant items as possible is important. It is commonly reported as MAP, the mean of AP over the query set.

Citation

Harman. Evaluation Issues in Information Retrieval. Inf. Process. Manag. 1992. [link]

@article{DBLP:journals/ipm/Harman92,
  author       = {Donna Harman},
  title        = {Evaluation Issues in Information Retrieval},
  journal      = {Inf. Process. Manag.},
  volume       = {28},
  number       = {4},
  pages        = {439--440},
  year         = {1992},
  url          = {https://doi.org/10.1016/0306-4573(92)90001-G},
  doi          = {10.1016/0306-4573(92)90001-G},
  timestamp    = {Fri, 21 Feb 2020 13:11:30 +0100},
  biburl       = {https://dblp.org/rec/journals/ipm/Harman92.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

cwl_eval: AP(rel=ANY,judged_only=False)@NOT_PROVIDED
pytrec_eval: AP(rel=ANY,judged_only=ANY)@ANY
trectools: AP(rel=1,judged_only=False)@ANY
ranx: AP(rel=ANY,judged_only=False)@ANY
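The definition above can be sketched for a single query as follows. The input formats (`ranking` as a list of document ids, `qrels` as a dict of relevance scores) are assumptions for illustration, and the sum is divided by the total number of relevant documents in `qrels`, a common TREC-style convention that individual providers may or may not share.

```python
def average_precision(ranking, qrels, rel=1, cutoff=None):
    """AP for one query: precision accumulated at each relevant hit."""
    total_relevant = sum(1 for score in qrels.values() if score >= rel)
    if total_relevant == 0:
        return 0.0
    docs = ranking if cutoff is None else ranking[:cutoff]
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(docs, start=1):
        if doc in qrels and qrels[doc] >= rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant item
    # Assumption: relevant documents that were never returned contribute 0.
    return precision_sum / total_relevant


def mean_average_precision(runs, all_qrels, rel=1):
    """MAP: the mean of AP over a dict of query id -> ranking."""
    scores = [average_precision(runs[q], all_qrels[q], rel) for q in runs]
    return sum(scores) / len(scores)


ap = average_precision(["a", "b", "c", "d"], {"a": 1, "c": 1, "e": 1})
```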

`AP_IA`

Intent-aware (Mean) Average Precision

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - calculate measure using only judged documents (i.e., discard unjudged documents)

Supported by:

pyndeval: AP_IA(rel=ANY,judged_only=ANY)

`BPM`

The Bejeweled Player Model (BPM).

Citation

Zhang et al. Evaluating Web Search with a Bejeweled Player Model. SIGIR 2017. [link]

@inproceedings{DBLP:conf/sigir/ZhangLLZXM17,
  author       = {Fan Zhang and
                  Yiqun Liu and
                  Xin Li and
                  Min Zhang and
                  Yinghui Xu and
                  Shaoping Ma},
  editor       = {Noriko Kando and
                  Tetsuya Sakai and
                  Hideo Joho and
                  Hang Li and
                  Arjen P. de Vries and
                  Ryen W. White},
  title        = {Evaluating Web Search with a Bejeweled Player Model},
  booktitle    = {Proceedings of the 40th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, Shinjuku, Tokyo,
                  Japan, August 7-11, 2017},
  pages        = {425--434},
  publisher    = {{ACM}},
  year         = {2017},
  url          = {https://doi.org/10.1145/3077136.3080841},
  doi          = {10.1145/3077136.3080841},
  timestamp    = {Tue, 15 Nov 2022 13:06:00 +0100},
  biburl       = {https://dblp.org/rec/conf/sigir/ZhangLLZXM17.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
T (float) - total desired gain (normalized)
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: BPM(T=ANY,min_rel=ANY,max_rel=REQUIRED)@ANY

`Bpref`

Binary Preference (Bpref). This measure examines the relative ranks of judged relevant and non-relevant documents. Non-judged documents are not considered.

Citation

Buckley and Voorhees. Retrieval evaluation with incomplete information. SIGIR 2004. [link]

@inproceedings{DBLP:conf/sigir/BuckleyV04,
  author       = {Chris Buckley and
                  Ellen M. Voorhees},
  editor       = {Mark Sanderson and
                  Kalervo J{\"{a}}rvelin and
                  James Allan and
                  Peter Bruza},
  title        = {Retrieval evaluation with incomplete information},
  booktitle    = {{SIGIR} 2004: Proceedings of the 27th Annual International {ACM} {SIGIR}
                  Conference on Research and Development in Information Retrieval, Sheffield,
                  UK, July 25-29, 2004},
  pages        = {25--32},
  publisher    = {{ACM}},
  year         = {2004},
  url          = {https://doi.org/10.1145/1008992.1009000},
  doi          = {10.1145/1008992.1009000},
  timestamp    = {Thu, 14 Oct 2021 10:27:19 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/BuckleyV04.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)

Supported by:

pytrec_eval: Bpref(rel=ANY)
trectools: Bpref(rel=1)

`Compat`

Compatibility measure described in:

Citation

Clarke et al. Assessing Top-k Preferences. ACM Trans. Inf. Syst. 2021. [link]

@article{DBLP:journals/tois/ClarkeVS21,
  author       = {Charles L. A. Clarke and
                  Alexandra Vtyurina and
                  Mark D. Smucker},
  title        = {Assessing Top-k Preferences},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {39},
  number       = {3},
  pages        = {33:1--33:21},
  year         = {2021},
  url          = {https://doi.org/10.1145/3451161},
  doi          = {10.1145/3451161},
  timestamp    = {Sat, 09 Apr 2022 12:20:33 +0200},
  biburl       = {https://dblp.org/rec/journals/tois/ClarkeVS21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

p (float) - persistence
normalize (bool) - apply normalization for finite ideal rankings

Supported by:

compat: Compat(p=ANY,normalize=ANY)

`ERR`

The Expected Reciprocal Rank (ERR) is a precision-focused measure. In essence, it is an extension of reciprocal rank that encapsulates both graded relevance and a more realistic cascade-based model of how users browse a ranking.

Parameters:

cutoff (int) - ranking cutoff threshold

Supported by:

gdeval: ERR@REQUIRED
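The cascade model can be sketched as below: the user scans down the ranking and stops at each document with a probability that grows with its grade. The gain mapping `(2**grade - 1) / 2**max_grade` is the usual formulation from the ERR literature and is an assumption here; the `max_grade` argument and the input format (a list of grades in rank order) are likewise illustrative choices.

```python
def expected_reciprocal_rank(grades, max_grade, cutoff=None):
    """ERR for one query, given graded labels in rank order."""
    if cutoff is not None:
        grades = grades[:cutoff]
    err = 0.0
    p_reach = 1.0  # probability that the user gets as far as this rank
    for rank, grade in enumerate(grades, start=1):
        # Assumed mapping from grade to the probability of being satisfied.
        p_stop = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_stop / rank
        p_reach *= 1 - p_stop
    return err


err_top = expected_reciprocal_rank([1], max_grade=1)
err_second = expected_reciprocal_rank([0, 1], max_grade=1)
```

With binary grades the stopping probability is 0.5 for a relevant document, so moving the only relevant document from rank 1 to rank 2 halves the score.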

`ERR_IA`

Intent-Aware Expected Reciprocal Rank with collection-independent normalisation.

Citation

Chapelle et al. Expected reciprocal rank for graded relevance. CIKM 2009. [link]

@inproceedings{DBLP:conf/cikm/ChapelleMZG09,
  author       = {Olivier Chapelle and
                  Donald Metlzer and
                  Ya Zhang and
                  Pierre Grinspan},
  editor       = {David Wai{-}Lok Cheung and
                  Il{-}Yeol Song and
                  Wesley W. Chu and
                  Xiaohua Hu and
                  Jimmy Lin},
  title        = {Expected reciprocal rank for graded relevance},
  booktitle    = {Proceedings of the 18th {ACM} Conference on Information and Knowledge
                  Management, {CIKM} 2009, Hong Kong, China, November 2-6, 2009},
  pages        = {621--630},
  publisher    = {{ACM}},
  year         = {2009},
  url          = {https://doi.org/10.1145/1645953.1646033},
  doi          = {10.1145/1645953.1646033},
  timestamp    = {Mon, 11 Mar 2024 13:45:28 +0100},
  biburl       = {https://dblp.org/rec/conf/cikm/ChapelleMZG09.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - calculate measure using only judged documents (i.e., discard unjudged documents)

Supported by:

pyndeval: ERR_IA(rel=ANY,judged_only=ANY)@ANY

`infAP`

Inferred AP. An AP implementation that accounts for pooled-but-unjudged documents by assuming that they are relevant at the same proportion as other judged documents. Essentially, it skips documents that were pooled but not judged, and assumes that other unjudged documents are non-relevant.

Pooled-but-unjudged documents are indicated by a score of -1, by convention. Note that not all qrels use this convention.

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)

Supported by:

pytrec_eval: infAP(rel=ANY)

`INSQ`

INSQ

Citation

Moffat et al. Models and metrics: IR evaluation as a user process. ADCS 2012. [link]

@inproceedings{DBLP:conf/adcs/MoffatST12,
  author       = {Alistair Moffat and
                  Falk Scholer and
                  Paul Thomas},
  editor       = {Andrew Trotman and
                  Sally Jo Cunningham and
                  Laurianne Sitbon},
  title        = {Models and metrics: {IR} evaluation as a user process},
  booktitle    = {The Seventeenth Australasian Document Computing Symposium, {ADCS}
                  '12, Dunedin, New Zealand, December 5-6, 2012},
  pages        = {47--54},
  publisher    = {{ACM}},
  year         = {2012},
  url          = {https://doi.org/10.1145/2407085.2407092},
  doi          = {10.1145/2407085.2407092},
  timestamp    = {Mon, 26 Jun 2023 20:48:56 +0200},
  biburl       = {https://dblp.org/rec/conf/adcs/MoffatST12.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

T (float) - total desired gain (normalized)
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: INSQ(T=ANY,min_rel=ANY,max_rel=REQUIRED)

`INST`

INST, a variant of INSQ

Citation

Bailey et al. User Variability and IR System Evaluation. SIGIR 2015. [link]

@inproceedings{DBLP:conf/sigir/BaileyMST15,
  author       = {Peter Bailey and
                  Alistair Moffat and
                  Falk Scholer and
                  Paul Thomas},
  editor       = {Ricardo Baeza{-}Yates and
                  Mounia Lalmas and
                  Alistair Moffat and
                  Berthier A. Ribeiro{-}Neto},
  title        = {User Variability and {IR} System Evaluation},
  booktitle    = {Proceedings of the 38th International {ACM} {SIGIR} Conference on
                  Research and Development in Information Retrieval, Santiago, Chile,
                  August 9-13, 2015},
  pages        = {625--634},
  publisher    = {{ACM}},
  year         = {2015},
  url          = {https://doi.org/10.1145/2766462.2767728},
  doi          = {10.1145/2766462.2767728},
  timestamp    = {Mon, 26 Jun 2023 20:45:16 +0200},
  biburl       = {https://dblp.org/rec/conf/sigir/BaileyMST15.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

T (float) - total desired gain (normalized)
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: INST(T=ANY,min_rel=ANY,max_rel=REQUIRED)

`IPrec`

Interpolated Precision at a given recall cutoff. Used for building precision-recall graphs. Unlike most measures, where @ indicates an absolute cutoff threshold, here @ sets the recall cutoff.

Parameters:

recall (float) - recall threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: IPrec(judged_only=ANY)@ANY
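One common way to define interpolated precision, sketched here under assumed input formats (`ranking` as a list of document ids, `qrels` as a dict of scores), is the highest precision observed at any rank whose recall is at least the requested level. Whether a provider uses exactly this rule is not stated on this page.

```python
def interpolated_precision(ranking, qrels, recall, rel=1):
    """Best precision at any rank where recall >= the requested level."""
    total_relevant = sum(1 for score in qrels.values() if score >= rel)
    if total_relevant == 0:
        return 0.0
    best = 0.0
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in qrels and qrels[doc] >= rel:
            hits += 1
        if hits / total_relevant >= recall:
            best = max(best, hits / rank)
    return best


ranking = ["a", "b", "c", "d"]
qrels = {"a": 1, "c": 1}
half = interpolated_precision(ranking, qrels, recall=0.5)
full = interpolated_precision(ranking, qrels, recall=1.0)
```

Evaluating the function at a grid of recall levels (for example 0.0, 0.1, and so on up to 1.0) gives the points of a precision-recall graph.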

`Judged`

Percentage of the top k (cutoff) results that have relevance judgments. Equivalent to P@k with a rel lower than any judgment.

Parameters:

cutoff (int) - ranking cutoff threshold

Supported by:

judged: Judged@ANY
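The equivalence with P@k can be checked with a small sketch. Both helper functions and the input formats are assumptions made for this example; in particular, the precision helper treats a document without a judgment as non-relevant regardless of `rel`.

```python
def judged_at_k(ranking, qrels, cutoff):
    """Fraction of the top `cutoff` results that have any judgment."""
    return sum(1 for doc in ranking[:cutoff] if doc in qrels) / cutoff


def precision_at_k(ranking, qrels, cutoff, rel):
    """P@k where only judged documents scoring >= rel count as relevant."""
    hits = sum(
        1 for doc in ranking[:cutoff] if doc in qrels and qrels[doc] >= rel
    )
    return hits / cutoff


ranking = ["d1", "d2", "d3", "d4"]
qrels = {"d1": 2, "d2": 0, "d9": 1}
lowest = min(qrels.values()) - 1  # a rel threshold below every judgment
judged = judged_at_k(ranking, qrels, cutoff=4)
```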

`nDCG`

The normalized Discounted Cumulative Gain (nDCG). Uses graded labels, rewarding systems that put the highest-graded documents at the top of the ranking. It is normalized with respect to the ideal DCG, i.e., the DCG of documents ranked in descending order of graded label.

Citation

Järvelin and Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 2002. [link]

@article{DBLP:journals/tois/JarvelinK02,
  author       = {Kalervo J{\"{a}}rvelin and
                  Jaana Kek{\"{a}}l{\"{a}}inen},
  title        = {Cumulated gain-based evaluation of {IR} techniques},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {20},
  number       = {4},
  pages        = {422--446},
  year         = {2002},
  url          = {http://doi.acm.org/10.1145/582415.582418},
  doi          = {10.1145/582415.582418},
  timestamp    = {Fri, 09 Jun 2017 11:03:19 +0200},
  biburl       = {https://dblp.org/rec/journals/tois/JarvelinK02.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
dcg (str) - DCG formulation
gains (dict) - custom gain mapping (int-to-int)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: nDCG(dcg='log2',gains=ANY,judged_only=ANY)@ANY
gdeval: nDCG(dcg='exp-log2',gains=NOT_PROVIDED,judged_only=False)@REQUIRED
trectools: nDCG(dcg=ANY,gains=NOT_PROVIDED,judged_only=False)@ANY
ranx: nDCG(dcg=('log2', 'exp-log2'),gains=NOT_PROVIDED,judged_only=False)@ANY
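A minimal sketch of the computation for one query follows. It assumes that `'log2'` means a linear gain with a log2 rank discount and that `'exp-log2'` means an exponential gain (`2**grade - 1`) with the same discount; those readings of the two `dcg` names, and the input formats, are assumptions rather than something this page confirms.

```python
import math


def dcg(grades, formulation="log2"):
    """Discounted cumulative gain of graded labels given in rank order."""
    total = 0.0
    for rank, grade in enumerate(grades, start=1):
        # Assumed meaning of the two formulation names.
        gain = grade if formulation == "log2" else 2 ** grade - 1
        total += gain / math.log2(rank + 1)
    return total


def ndcg(ranking, qrels, cutoff=None, formulation="log2"):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    grades = [qrels.get(doc, 0) for doc in ranking]  # unjudged -> grade 0
    ideal = sorted(qrels.values(), reverse=True)
    if cutoff is not None:
        grades, ideal = grades[:cutoff], ideal[:cutoff]
    ideal_dcg = dcg(ideal, formulation)
    return dcg(grades, formulation) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"a": 1, "b": 2}
perfect = ndcg(["b", "a"], qrels)
swapped = ndcg(["a", "b"], qrels)
```

The exponential formulation widens the gap between grades, so the same swap of a grade-1 and a grade-2 document is penalised more heavily under `'exp-log2'` than under `'log2'` in this sketch.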

`NERR10`

A version of the Not (but Nearly) Expected Reciprocal Rank (NERR) measure, from Equation (10) of the following paper.

Citation

Azzopardi et al. ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. ICTIR 2021. [link]

@inproceedings{DBLP:conf/ictir/AzzopardiMM21,
  author       = {Leif Azzopardi and
                  Joel Mackenzie and
                  Alistair Moffat},
  editor       = {Faegheh Hasibi and
                  Yi Fang and
                  Akiko Aizawa},
  title        = {{ERR} is not {C/W/L:} Exploring the Relationship Between Expected
                  Reciprocal Rank and Other Metrics},
  booktitle    = {{ICTIR} '21: The 2021 {ACM} {SIGIR} International Conference on the
                  Theory of Information Retrieval, Virtual Event, Canada, July 11, 2021},
  pages        = {231--237},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3471158.3472239},
  doi          = {10.1145/3471158.3472239},
  timestamp    = {Fri, 10 Sep 2021 14:39:10 +0200},
  biburl       = {https://dblp.org/rec/conf/ictir/AzzopardiMM21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

p (float) - persistence
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: NERR10(p=ANY,min_rel=ANY,max_rel=REQUIRED)

`NERR11`

A version of the Not (but Nearly) Expected Reciprocal Rank (NERR) measure, from Equation (12) of the following paper.

Citation

Azzopardi et al. ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. ICTIR 2021. [link]

@inproceedings{DBLP:conf/ictir/AzzopardiMM21,
  author       = {Leif Azzopardi and
                  Joel Mackenzie and
                  Alistair Moffat},
  editor       = {Faegheh Hasibi and
                  Yi Fang and
                  Akiko Aizawa},
  title        = {{ERR} is not {C/W/L:} Exploring the Relationship Between Expected
                  Reciprocal Rank and Other Metrics},
  booktitle    = {{ICTIR} '21: The 2021 {ACM} {SIGIR} International Conference on the
                  Theory of Information Retrieval, Virtual Event, Canada, July 11, 2021},
  pages        = {231--237},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3471158.3472239},
  doi          = {10.1145/3471158.3472239},
  timestamp    = {Fri, 10 Sep 2021 14:39:10 +0200},
  biburl       = {https://dblp.org/rec/conf/ictir/AzzopardiMM21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

T (float) - total desired gain (normalized)
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: NERR11(T=ANY,min_rel=ANY,max_rel=REQUIRED)

`NERR8`

A version of the Not (but Nearly) Expected Reciprocal Rank (NERR) measure, from Equation (8) of the following paper.

Citation

Azzopardi et al. ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. ICTIR 2021. [link]

@inproceedings{DBLP:conf/ictir/AzzopardiMM21,
  author       = {Leif Azzopardi and
                  Joel Mackenzie and
                  Alistair Moffat},
  editor       = {Faegheh Hasibi and
                  Yi Fang and
                  Akiko Aizawa},
  title        = {{ERR} is not {C/W/L:} Exploring the Relationship Between Expected
                  Reciprocal Rank and Other Metrics},
  booktitle    = {{ICTIR} '21: The 2021 {ACM} {SIGIR} International Conference on the
                  Theory of Information Retrieval, Virtual Event, Canada, July 11, 2021},
  pages        = {231--237},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3471158.3472239},
  doi          = {10.1145/3471158.3472239},
  timestamp    = {Fri, 10 Sep 2021 14:39:10 +0200},
  biburl       = {https://dblp.org/rec/conf/ictir/AzzopardiMM21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: NERR8(min_rel=ANY,max_rel=REQUIRED)@REQUIRED

`NERR9`

A version of the Not (but Nearly) Expected Reciprocal Rank (NERR) measure, from Equation (9) of the following paper.

Citation

Azzopardi et al. ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics. ICTIR 2021. [link]

@inproceedings{DBLP:conf/ictir/AzzopardiMM21,
  author       = {Leif Azzopardi and
                  Joel Mackenzie and
                  Alistair Moffat},
  editor       = {Faegheh Hasibi and
                  Yi Fang and
                  Akiko Aizawa},
  title        = {{ERR} is not {C/W/L:} Exploring the Relationship Between Expected
                  Reciprocal Rank and Other Metrics},
  booktitle    = {{ICTIR} '21: The 2021 {ACM} {SIGIR} International Conference on the
                  Theory of Information Retrieval, Virtual Event, Canada, July 11, 2021},
  pages        = {231--237},
  publisher    = {{ACM}},
  year         = {2021},
  url          = {https://doi.org/10.1145/3471158.3472239},
  doi          = {10.1145/3471158.3472239},
  timestamp    = {Fri, 10 Sep 2021 14:39:10 +0200},
  biburl       = {https://dblp.org/rec/conf/ictir/AzzopardiMM21.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: NERR9(min_rel=ANY,max_rel=REQUIRED)@REQUIRED

`nERR_IA`

Intent-Aware Expected Reciprocal Rank with collection-dependent normalisation.

Citation

Chapelle et al. Expected reciprocal rank for graded relevance. CIKM 2009. [link]

@inproceedings{DBLP:conf/cikm/ChapelleMZG09,
  author       = {Olivier Chapelle and
                  Donald Metlzer and
                  Ya Zhang and
                  Pierre Grinspan},
  editor       = {David Wai{-}Lok Cheung and
                  Il{-}Yeol Song and
                  Wesley W. Chu and
                  Xiaohua Hu and
                  Jimmy Lin},
  title        = {Expected reciprocal rank for graded relevance},
  booktitle    = {Proceedings of the 18th {ACM} Conference on Information and Knowledge
                  Management, {CIKM} 2009, Hong Kong, China, November 2-6, 2009},
  pages        = {621--630},
  publisher    = {{ACM}},
  year         = {2009},
  url          = {https://doi.org/10.1145/1645953.1646033},
  doi          = {10.1145/1645953.1646033},
  timestamp    = {Mon, 11 Mar 2024 13:45:28 +0100},
  biburl       = {https://dblp.org/rec/conf/cikm/ChapelleMZG09.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - calculate measure using only judged documents (i.e., discard unjudged documents)

Supported by:

pyndeval: nERR_IA(rel=ANY,judged_only=ANY)@ANY

`nNRBP`

Novelty- and Rank-Biased Precision with collection-dependent normalisation.

Citation

Clarke et al. An Effectiveness Measure for Ambiguous and Underspecified Queries. ICTIR 2009. [link]

@inproceedings{DBLP:conf/ictir/ClarkeKV09,
  author       = {Charles L. A. Clarke and
                  Maheedhar Kolla and
                  Olga Vechtomova},
  editor       = {Leif Azzopardi and
                  Gabriella Kazai and
                  Stephen E. Robertson and
                  Stefan M. R{\"{u}}ger and
                  Milad Shokouhi and
                  Dawei Song and
                  Emine Yilmaz},
  title        = {An Effectiveness Measure for Ambiguous and Underspecified Queries},
  booktitle    = {Advances in Information Retrieval Theory, Second International Conference
                  on the Theory of Information Retrieval, {ICTIR} 2009, Cambridge, UK,
                  September 10-12, 2009, Proceedings},
  series       = {Lecture Notes in Computer Science},
  volume       = {5766},
  pages        = {188--199},
  publisher    = {Springer},
  year         = {2009},
  url          = {https://doi.org/10.1007/978-3-642-04417-5\_17},
  doi          = {10.1007/978-3-642-04417-5\_17},
  timestamp    = {Sun, 25 Oct 2020 23:12:59 +0100},
  biburl       = {https://dblp.org/rec/conf/ictir/ClarkeKV09.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
alpha (float) - redundancy intolerance
beta (float) - patience

Supported by:

pyndeval: nNRBP(alpha=ANY,beta=ANY,rel=ANY)

`NRBP`

Novelty- and Rank-Biased Precision with collection-independent normalisation.

Citation

Clarke et al. An Effectiveness Measure for Ambiguous and Underspecified Queries. ICTIR 2009. [link]

@inproceedings{DBLP:conf/ictir/ClarkeKV09,
  author       = {Charles L. A. Clarke and
                  Maheedhar Kolla and
                  Olga Vechtomova},
  editor       = {Leif Azzopardi and
                  Gabriella Kazai and
                  Stephen E. Robertson and
                  Stefan M. R{\"{u}}ger and
                  Milad Shokouhi and
                  Dawei Song and
                  Emine Yilmaz},
  title        = {An Effectiveness Measure for Ambiguous and Underspecified Queries},
  booktitle    = {Advances in Information Retrieval Theory, Second International Conference
                  on the Theory of Information Retrieval, {ICTIR} 2009, Cambridge, UK,
                  September 10-12, 2009, Proceedings},
  series       = {Lecture Notes in Computer Science},
  volume       = {5766},
  pages        = {188--199},
  publisher    = {Springer},
  year         = {2009},
  url          = {https://doi.org/10.1007/978-3-642-04417-5\_17},
  doi          = {10.1007/978-3-642-04417-5\_17},
  timestamp    = {Sun, 25 Oct 2020 23:12:59 +0100},
  biburl       = {https://dblp.org/rec/conf/ictir/ClarkeKV09.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
alpha (float) - redundancy intolerance
beta (float) - patience

Supported by:

pyndeval: NRBP(alpha=ANY,beta=ANY,rel=ANY)

`NumQ`

The total number of queries.

Supported by:

pytrec_eval: NumQ

`NumRel`

The number of relevant documents the query has (independent of what the system retrieved).

Parameters:

rel (int) - minimum relevance score to be counted (inclusive)

Supported by:

pytrec_eval: NumRel(rel=1)

`NumRet`

The number of results returned. When rel is provided, counts the number of documents returned with at least that relevance score (inclusive).

Parameters:

rel (int) - minimum relevance score to be counted (inclusive), or all documents returned if NOT_PROVIDED

Supported by:

pytrec_eval: NumRet(rel=ANY)
ranx: NumRet(rel=REQUIRED)

`P`

Basic measure that computes the percentage of documents in the top cutoff results that are labeled as relevant. cutoff is a required parameter, and can be provided as P@cutoff.

Citation

Rijsbergen. Information Retrieval. 1979.

@book{DBLP:books/bu/Rijsbergen79,
  author       = {C. J. van Rijsbergen},
  title        = {Information Retrieval},
  publisher    = {Butterworth},
  year         = {1979},
  isbn         = {0-408-70929-4},
  timestamp    = {Thu, 03 Jan 2002 11:51:10 +0100},
  biburl       = {https://dblp.org/rec/books/bu/Rijsbergen79.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

cwl_eval: P(rel=ANY,judged_only=False)@ANY
pytrec_eval: P(rel=ANY,judged_only=ANY)@ANY
trectools: P(rel=1,judged_only=False)@ANY
ranx: P(rel=ANY,judged_only=False)@ANY
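The effect of the `rel` threshold on graded judgments can be seen in a short sketch. The function and its input formats are illustrative assumptions, and dividing by the cutoff itself (rather than by the number of documents actually returned) is one common convention that a given provider may not follow.

```python
def precision_at_k(ranking, qrels, cutoff, rel=1):
    """Share of the top `cutoff` results whose judgment is at least `rel`."""
    top = ranking[:cutoff]
    hits = sum(1 for doc in top if doc in qrels and qrels[doc] >= rel)
    # Assumption: divide by cutoff, so short rankings are penalised.
    return hits / cutoff


ranking = ["d1", "d2", "d3", "d4"]
qrels = {"d1": 2, "d2": 1, "d3": 0}
p_rel1 = precision_at_k(ranking, qrels, cutoff=4)          # d1 and d2 count
p_rel2 = precision_at_k(ranking, qrels, cutoff=4, rel=2)   # only d1 counts
```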

`P_IA`

Intent-aware Precision@k.

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - calculate measure using only judged documents (i.e., discard unjudged documents)

Supported by:

pyndeval: P_IA(rel=ANY,judged_only=ANY)@ANY

`R`

Recall@k (R@k). The fraction of relevant documents for a query that have been retrieved by rank k.

NOTE: Some tasks define Recall@k as whether any relevant documents are found in the top k results. This software follows the TREC convention and refers to that measure as Success@k.

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: R(judged_only=ANY)@ANY
ranx: R(judged_only=False)@ANY
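The difference between the two readings of "Recall@k" mentioned in the note can be sketched as follows. Function names and input formats are assumptions for illustration only.

```python
def recall_at_k(ranking, qrels, cutoff, rel=1):
    """Fraction of all relevant documents found in the top `cutoff`."""
    relevant = {doc for doc, score in qrels.items() if score >= rel}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranking[:cutoff])) / len(relevant)


def success_at_k(ranking, qrels, cutoff, rel=1):
    """1.0 if any relevant document appears in the top `cutoff`, else 0.0."""
    relevant = {doc for doc, score in qrels.items() if score >= rel}
    return 1.0 if any(doc in relevant for doc in ranking[:cutoff]) else 0.0


ranking = ["a", "b", "c"]
qrels = {"a": 1, "x": 1, "y": 1, "z": 1}
recall = recall_at_k(ranking, qrels, cutoff=3)
success = success_at_k(ranking, qrels, cutoff=3)
```

Here one of four relevant documents is retrieved, so recall is low while the success-style reading is already satisfied.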

`RBP`

The Rank-Biased Precision (RBP).

Citation

Moffat and Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 2008. [link]

@article{DBLP:journals/tois/MoffatZ08,
  author       = {Alistair Moffat and
                  Justin Zobel},
  title        = {Rank-biased precision for measurement of retrieval effectiveness},
  journal      = {{ACM} Trans. Inf. Syst.},
  volume       = {27},
  number       = {1},
  pages        = {2:1--2:27},
  year         = {2008},
  url          = {https://doi.org/10.1145/1416950.1416952},
  doi          = {10.1145/1416950.1416952},
  timestamp    = {Tue, 06 Nov 2018 12:51:56 +0100},
  biburl       = {https://dblp.org/rec/journals/tois/MoffatZ08.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
p (float) - persistence
rel (int) - minimum relevance score to be considered relevant (inclusive), or NOT_PROVIDED to use graded relevance

Supported by:

cwl_eval: RBP(rel=REQUIRED,p=ANY)@NOT_PROVIDED
trectools: RBP(p=ANY,rel=ANY)@ANY
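For binary relevance, RBP is commonly written as (1 - p) times the sum of p^(i-1) over the ranks i that hold a relevant document, where p is the persistence. The sketch below follows that form; the input formats are assumptions, and the graded variant (rel not provided) is not covered.

```python
def rank_biased_precision(ranking, qrels, p=0.8, rel=1, cutoff=None):
    """Binary RBP: geometric rank weights with persistence `p`."""
    docs = ranking if cutoff is None else ranking[:cutoff]
    weighted_hits = 0.0
    for index, doc in enumerate(docs):  # index 0 corresponds to rank 1
        if doc in qrels and qrels[doc] >= rel:
            weighted_hits += p ** index
    return (1 - p) * weighted_hits


rbp = rank_biased_precision(["a", "b", "c"], {"a": 1, "c": 1}, p=0.5)
```

A higher persistence models a user who reads further down the list, which spreads the weight over more ranks.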

`Rprec`

The precision at R, where R is the number of relevant documents for a given query. Has the cute property that it is also the recall at R.

Citation

Buckley and Voorhees. Retrieval System Evaluation. 2005. [link]

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: Rprec(rel=ANY,judged_only=ANY)
trectools: Rprec(rel=1,judged_only=False)
ranx: Rprec(rel=ANY,judged_only=False)
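A short sketch makes the precision/recall coincidence visible: with the cutoff set to R, both precision and recall divide the same hit count by R. The function and input formats are illustrative assumptions.

```python
def r_precision(ranking, qrels, rel=1):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = {doc for doc, score in qrels.items() if score >= rel}
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc in ranking[:r] if doc in relevant)
    # hits / r is precision at R; hits / len(relevant) is recall at R.
    return hits / r


rprec = r_precision(["a", "b", "c", "d"], {"a": 1, "c": 1, "e": 1})
```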

`RR`

The [Mean] Reciprocal Rank ([M]RR) is a precision-focused measure that scores based on the reciprocal of the rank of the highest-ranked relevant document. An optional cutoff can be provided to limit the depth explored. rel (default 1) controls which relevance level is considered relevant.

Citation

Kantor and Voorhees. The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Inf. Retr. 2000. [link]

@article{DBLP:journals/ir/KantorV00,
  author       = {Paul B. Kantor and
                  Ellen M. Voorhees},
  title        = {The {TREC-5} Confusion Track: Comparing Retrieval Methods for Scanned
                  Text},
  journal      = {Inf. Retr.},
  volume       = {2},
  number       = {2/3},
  pages        = {165--176},
  year         = {2000},
  url          = {https://doi.org/10.1023/A:1009902609570},
  doi          = {10.1023/A:1009902609570},
  timestamp    = {Thu, 14 Oct 2021 09:13:06 +0200},
  biburl       = {https://dblp.org/rec/journals/ir/KantorV00.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

cwl_eval: RR(rel=ANY,judged_only=False)@NOT_PROVIDED
pytrec_eval: RR(rel=ANY,judged_only=ANY)@NOT_PROVIDED
trectools: RR(rel=1,judged_only=False)@NOT_PROVIDED
msmarco: RR(rel=ANY,judged_only=False)@ANY
ranx: RR(rel=ANY,judged_only=False)@NOT_PROVIDED
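
A minimal per-query sketch of the measure follows. The function name and the input shapes (`ranking` as a list of document ids, `qrels` as a dict of labels) are assumptions for illustration, and judged_only is not handled. Averaging the per-query values over all queries gives MRR.

```python
def reciprocal_rank(ranking, qrels, rel=1, cutoff=None):
    """Reciprocal rank of the first relevant document for one query."""
    if cutoff is not None:
        ranking = ranking[:cutoff]
    for rank, doc_id in enumerate(ranking, start=1):
        if qrels.get(doc_id, 0) >= rel:
            return 1.0 / rank
    return 0.0  # no relevant document within the explored depth


qrels = {"y": 1, "z": 2}
rr_default = reciprocal_rank(["x", "y", "z"], qrels)          # first hit at rank 2
rr_strict = reciprocal_rank(["x", "y", "z"], qrels, rel=2)    # first hit at rank 3
```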

`SDCG`¶

The Scaled Discounted Cumulative Gain (SDCG), a variant of nDCG that assumes more fully-relevant documents exist but are not labeled.

Parameters:

cutoff (int) - ranking cutoff threshold
dcg (str) - DCG formulation
min_rel (int) - minimum relevance score
max_rel (int) - maximum relevance score

Supported by:

cwl_eval: SDCG(dcg='log2',min_rel=ANY,max_rel=REQUIRED)@REQUIRED
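
One way to read this definition is sketched below. It rests on several assumptions that the text above does not confirm: gains are scaled linearly between min_rel and max_rel, the discount is 1/log2(rank + 1), and the normalizer is the DCG of a hypothetical ranking filled with max_rel documents down to the cutoff. The function name and input shapes are also hypothetical.

```python
import math


def sdcg(ranking, qrels, cutoff, min_rel=0, max_rel=1):
    """Scaled DCG for one query (illustrative sketch under stated assumptions)."""

    def gain(doc_id):
        # Assumed: linear gain in [0, 1]; unjudged documents get the minimum.
        label = qrels.get(doc_id, min_rel)
        return max(label - min_rel, 0) / (max_rel - min_rel)

    dcg = sum(gain(d) / math.log2(i + 2) for i, d in enumerate(ranking[:cutoff]))
    # Normalize as if every one of the top `cutoff` slots could hold a
    # fully-relevant document, whether or not that many are labeled.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(cutoff))
    return dcg / ideal


score = sdcg(["a", "b"], {"a": 2}, cutoff=2, max_rel=2)
```

Unlike nDCG, the normalizer here does not depend on how many relevant documents were actually judged, so a query with a single labeled relevant document cannot reach 1.0 at a cutoff greater than 1.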

`SetAP`¶

The unranked Set AP (SetAP); i.e., SetP * SetR

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: SetAP(rel=ANY,judged_only=ANY)
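
The product can be illustrated with a small sketch that treats the retrieved documents as an unordered set. The helper names and input shapes are assumptions for this example, and judged_only is ignored.

```python
def set_precision_recall(retrieved, qrels, rel=1):
    """Return (set precision, set recall) for one query."""
    retrieved = set(retrieved)
    relevant = {doc_id for doc_id, label in qrels.items() if label >= rel}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def set_ap(retrieved, qrels, rel=1):
    """Unranked Set AP: the product of set precision and set recall."""
    precision, recall = set_precision_recall(retrieved, qrels, rel)
    return precision * recall


qrels = {"a": 1, "b": 2, "e": 1, "f": 0}
value = set_ap(["a", "b", "c", "d"], qrels)  # precision 2/4 times recall 2/3
```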

`SetF`¶

The Set F measure (SetF); i.e., the harmonic mean of SetP and SetR

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
beta (float) - relative importance of R to P in the harmonic mean
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: SetF(rel=ANY,beta=ANY,judged_only=ANY)
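
The role of beta is easiest to see in the standard weighted harmonic mean, sketched below. The function takes already-computed set precision and recall values; its name and signature are assumptions for illustration, not a provider API.

```python
def set_f(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


balanced = set_f(1.0, 0.5)                # beta=1 weighs P and R equally
recall_heavy = set_f(1.0, 0.5, beta=2.0)  # beta>1 pulls the score toward R
```

With beta above 1 the score moves toward the recall value, and with beta below 1 it moves toward the precision value.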

`SetP`¶

The Set Precision (SetP); i.e., the number of relevant docs retrieved divided by the total number retrieved

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)
relative (bool) - calculate the measure using the maximum possible SetP for the provided result size
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: SetP(rel=ANY,relative=ANY,judged_only=ANY)
ranx: SetP(rel=ANY,judged_only=False)
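
The effect of the relative flag can be sketched as follows. This assumes that the maximum possible SetP for a result set is min(number relevant, number retrieved) divided by the number retrieved, so the relative score reduces to hits / min(number relevant, number retrieved). The function name and input shapes are hypothetical, and judged_only is ignored.

```python
def set_p(retrieved, qrels, rel=1, relative=False):
    """Set precision for one query, optionally relative to the best achievable."""
    retrieved = set(retrieved)
    relevant = {doc_id for doc_id, label in qrels.items() if label >= rel}
    hits = len(retrieved & relevant)
    if relative:
        # Best case: every relevant document (or every slot) is a hit.
        denominator = min(len(retrieved), len(relevant))
    else:
        denominator = len(retrieved)
    return hits / denominator if denominator else 0.0


qrels = {"a": 1, "e": 1}
plain = set_p(["a", "b", "c", "d"], qrels)                     # 1 hit out of 4 retrieved
relative = set_p(["a", "b", "c", "d"], qrels, relative=True)   # 1 hit out of 2 possible
```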

`SetR`¶

The Set Recall (SetR); i.e., the number of relevant docs retrieved divided by the total number of relevant documents

Parameters:

rel (int) - minimum relevance score to be considered relevant (inclusive)

Supported by:

pytrec_eval: SetR(rel=ANY)
ranx: SetR(rel=ANY)

`StRecall`¶

Subtopic recall (the fraction of a query's subtopics covered by the top k docs)

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)

Supported by:

pyndeval: StRecall(rel=ANY)@ANY
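
A small sketch may help make subtopic coverage concrete. It assumes diversity judgments are available as a nested dict (subtopic id, then document id, then label) and reports coverage as a fraction of the subtopics that have at least one relevant document; that normalization follows the usual definition of subtopic recall and is an assumption here, as are the function name and input shapes.

```python
def st_recall(ranking, subtopic_qrels, cutoff, rel=1):
    """Fraction of subtopics with a relevant document in the top `cutoff`."""
    relevant_by_subtopic = {}
    for subtopic, judgments in subtopic_qrels.items():
        docs = {doc_id for doc_id, label in judgments.items() if label >= rel}
        if docs:  # skip subtopics that have no relevant documents at all
            relevant_by_subtopic[subtopic] = docs
    if not relevant_by_subtopic:
        return 0.0
    top_k = set(ranking[:cutoff])
    covered = sum(1 for docs in relevant_by_subtopic.values() if docs & top_k)
    return covered / len(relevant_by_subtopic)


subtopic_qrels = {1: {"a": 1}, 2: {"b": 1, "c": 1}, 3: {"d": 1}}
at_2 = st_recall(["a", "c", "x", "d"], subtopic_qrels, cutoff=2)  # subtopics 1 and 2
at_4 = st_recall(["a", "c", "x", "d"], subtopic_qrels, cutoff=4)  # all three
```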

`Success`¶

1 if a document with at least rel relevance is found in the first cutoff documents, else 0.

NOTE: Some refer to this measure as Recall@k. This software follows the TREC convention, where Recall@k is defined as the proportion of known relevant documents retrieved in the top k results.

Parameters:

cutoff (int) - ranking cutoff threshold
rel (int) - minimum relevance score to be considered relevant (inclusive)
judged_only (bool) - ignore returned documents that do not have relevance judgments

Supported by:

pytrec_eval: Success(rel=ANY,judged_only=ANY)@ANY
ranx: Success(rel=ANY,judged_only=False)@REQUIRED
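
The difference described in the note above can be seen by computing both quantities on the same ranking. The sketch below is illustrative only; the function names and input shapes are assumptions, and judged_only is ignored.

```python
def success(ranking, qrels, cutoff, rel=1):
    """1 if any of the top `cutoff` documents is relevant, else 0."""
    return 1 if any(qrels.get(d, 0) >= rel for d in ranking[:cutoff]) else 0


def recall_at_k(ranking, qrels, cutoff, rel=1):
    """TREC-style Recall@k: share of known relevant documents in the top k."""
    relevant = {doc_id for doc_id, label in qrels.items() if label >= rel}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:cutoff])) / len(relevant)


ranking = ["x", "a", "y"]
qrels = {"a": 1, "b": 1}
hit = success(ranking, qrels, cutoff=2)          # one relevant doc found
coverage = recall_at_k(ranking, qrels, cutoff=2) # but only half of them
```

Success saturates as soon as one relevant document appears, while Recall@k keeps rewarding each additional relevant document retrieved.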

Aliases¶

These provide shortcuts to “canonical” measures, and are typically used when multiple names or casings for the same measure exist. You can use them just like any other measure, and the identifiers are equal (e.g., AP == MAP), but the names will appear in the canonical form when printed.

BPref → Bpref
MAP → AP
MAP_IA → AP_IA
MRR → RR
NDCG → nDCG
NumRelRet → NumRet(rel=1)
Precision → P
Recall → R
RPrec → Rprec
SetRelP → SetP(relative=True)
α_DCG → alpha_DCG
α_nDCG → alpha_nDCG
