Getting Started
=======================================

Installation
---------------------------------------

You can install ir-measures using pip::

    $ pip install ir-measures

You can also install from the current development version::

    $ git clone git@github.com:terrierteam/ir_measures.git
    $ cd ir_measures
    $ python setup.py install

Command Line Interface
---------------------------------------

``ir_measures`` can be used on the command line with an interface similar to
`trec_eval <https://github.com/usnistgov/trec_eval>`_::

    $ ir_measures path/to/qrels path/to/run nDCG@10 P@5 'P(rel=2)@5' Judged@10
    nDCG@10     0.6251
    P@5         0.7486
    P(rel=2)@5  0.6000
    Judged@10   0.9486

You can alternatively use a dataset ID from `ir_datasets <https://ir-datasets.com/>`_ in place of
``path/to/qrels``.

You can see per-topic results using the ``-q`` flag (similar to trec_eval)::

    $ ir_measures -q path/to/qrels path/to/run nDCG@10 P@5 'P(rel=2)@5' Judged@10
    1    P@5         0.6000
    1    nDCG@10     0.5134
    2    P@5         1.0000
    2    nDCG@10     0.8522
    3    P@5         1.0000
    ...
    34   Judged@10   1.0000
    35   Judged@10   1.0000
    all  nDCG@10     0.6251
    all  P@5         0.7486
    all  P(rel=2)@5  0.6000
    all  Judged@10   0.9486

The first column in the output is the query ID (or ``all`` for the aggregated results, which can be
disabled with the ``-n`` flag). Results are written to the output stream as they are calculated by
``iter_calc``, so they may not appear in a predictable order [1]_.

Full list of command line arguments:

- ``-h`` (``--help``): print information about running the command
- ``-p X`` (``--places X``): number of decimal places to use when writing the output. Default: ``4``.
- ``-q`` (``--by_query``): print the results by query (topic), as shown above.
- ``-n`` (``--no_summary``): when used with ``-q``, does not print aggregated (``all``) values.
- ``--provider X``: forces the use of a particular provider, rather than using the default fallback
  approach. Possible values are: ``pytrec_eval``, ``judged``, ``gdeval``, ``trectools``, and ``msmarco``.

Python Interface
---------------------------------------

Compute measures from Python:

>>> import ir_measures
>>> from ir_measures import *
>>> qrels = ir_measures.read_trec_qrels('path/to/qrels')
>>> run = ir_measures.read_trec_run('path/to/run')
>>> ir_measures.calc_aggregate([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels, run)
{ nDCG@10: 0.6251, P@5: 0.7486, P(rel=2)@5: 0.6000, Judged@10: 0.9486 }

Per-topic results can be calculated using ``iter_calc``:

>>> for metric in ir_measures.iter_calc([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels, run):
...     print(metric)
Metric(query_id='1', measure=P@5, value=0.6)
Metric(query_id='1', measure=nDCG@10, value=0.5134306625775544)
Metric(query_id='2', measure=P@5, value=1.0)
Metric(query_id='2', measure=nDCG@10, value=0.8521705090845474)
Metric(query_id='3', measure=P@5, value=1.0)
...
Metric(query_id='33', measure=Judged@10, value=1.0)
Metric(query_id='34', measure=Judged@10, value=1.0)

Here again, the results from ``iter_calc`` may not be returned in a predictable order [1]_.
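If you need the per-topic values organized by measure (regardless of the order in which they are
produced), you can collect them yourself. Below is a minimal sketch that reuses the ``qrels`` and
``run`` from the example above and relies only on the ``query_id``, ``measure``, and ``value``
fields of ``Metric``::

    from collections import defaultdict

    # group per-topic values by measure, e.g. {nDCG@10: {'1': 0.5134, '2': 0.8522, ...}, ...}
    values_by_measure = defaultdict(dict)
    for metric in ir_measures.iter_calc([nDCG@10, P@5], qrels, run):
        values_by_measure[metric.measure][metric.query_id] = metric.value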
Qrels formats
---------------------------------------

Query relevance assessments can be provided in a variety of formats.

**namedtuple iterable**: Any iterable of named tuples. You can use ``ir_measures.Qrel``, or any
other NamedTuple with the fields ``query_id``, ``doc_id``, and ``relevance`` (if you use another
type of NamedTuple, the field order and any additional fields do not matter; the field names just
need to match)::

    qrels = [
        ir_measures.Qrel("Q0", "D0", 0),
        ir_measures.Qrel("Q0", "D1", 1),
        ir_measures.Qrel("Q1", "D0", 0),
        ir_measures.Qrel("Q1", "D3", 2),
    ]

Note that if the qrels are an iterator (such as the result of a generator), ``ir_measures`` will
consume the entire sequence.

``ir_measures.Qrel`` supports an optional fourth parameter, ``iteration``. This is the source of
the subtopic ID used for diversity measures (the name matches TREC conventions). Note that unlike
TREC-formatted qrels, this parameter comes last, since optional parameters in namedtuples must
follow the required ones.

**Pandas dataframe**: A pandas dataframe with the columns ``query_id``, ``doc_id``, and
``relevance``::

    import pandas as pd
    qrels = pd.DataFrame([
        {'query_id': "Q0", 'doc_id': "D0", 'relevance': 0},
        {'query_id': "Q0", 'doc_id': "D1", 'relevance': 1},
        {'query_id': "Q1", 'doc_id': "D0", 'relevance': 0},
        {'query_id': "Q1", 'doc_id': "D3", 'relevance': 2},
    ])

Dataframes support an optional fourth column, ``iteration``. This is the source of the subtopic ID
used for diversity measures (the name matches TREC conventions).

If your dataframe has columns named something else, you can always map them with the ``rename``
function. For instance, if your dataframe has the columns ``qid``, ``docno``, and ``label``, you
can easily make a qrels dataframe that is compatible with ir-measures like so::

    qrels = df.rename(columns={'qid': 'query_id', 'docno': 'doc_id', 'label': 'relevance'})

**TREC-formatted qrels file**: You can read a TREC-formatted qrels file::

    # a file path:
    qrels = ir_measures.read_trec_qrels('path/to/qrels')

    # raw qrels file contents:
    qrels = ir_measures.read_trec_qrels('''
    Q0 0 D0 0
    Q0 0 D1 1
    Q1 0 D0 0
    Q1 0 D3 2
    ''')
    # TREC qrels format: "query_id iteration doc_id relevance"

Note that ``read_trec_qrels`` returns a generator. If you need to use the qrels multiple times,
wrap it in the ``list`` constructor to read all the qrels into memory.

**ir_datasets qrels**: Qrels from the `ir_datasets package <https://ir-datasets.com/>`_. This mode
simply adheres to the **namedtuple iterable** specification above::

    import ir_datasets
    qrels = ir_datasets.load('trec-robust04').qrels_iter()

**dict-of-dict**: Qrels structured in a hierarchy. At the first level, query IDs map to another
dictionary. At the second level, document IDs map to (integer) relevance scores::

    qrels = {
        'Q0': {
            "D0": 0,
            "D1": 1,
        },
        "Q1": {
            "D0": 0,
            "D3": 2
        }
    }

Note that this format does not support the ``iteration`` field, so it should not be used with
diversity measures.

Run formats
---------------------------------------

System outputs can be provided in a variety of formats.

**namedtuple iterable**: Any iterable of named tuples. You can use ``ir_measures.ScoredDoc``, or
any other NamedTuple with the fields ``query_id``, ``doc_id``, and ``score``::

    run = [
        ir_measures.ScoredDoc("Q0", "D0", 1.2),
        ir_measures.ScoredDoc("Q0", "D1", 1.0),
        ir_measures.ScoredDoc("Q1", "D0", 2.4),
        ir_measures.ScoredDoc("Q1", "D3", 3.6),
    ]

Note that if the run is an iterator (such as the result of a generator), ``ir_measures`` will
consume the entire sequence.
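In practice you will usually build these tuples from your own system's output. The following is a
minimal sketch under assumed names: ``queries`` is a hypothetical iterable of
``(query_id, query_text)`` pairs and ``my_search_engine.search`` is a hypothetical function
returning ``(doc_id, score)`` pairs; neither is part of ir-measures::

    # build a run from a hypothetical search engine (not part of ir-measures)
    run = [
        ir_measures.ScoredDoc(query_id, doc_id, score)
        for query_id, query_text in queries                        # hypothetical (id, text) pairs
        for doc_id, score in my_search_engine.search(query_text)   # hypothetical search call
    ]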
**Pandas dataframe**: A pandas dataframe with the columns ``query_id``, ``doc_id``, and ``score``::

    import pandas as pd
    run = pd.DataFrame([
        {'query_id': "Q0", 'doc_id': "D0", 'score': 1.2},
        {'query_id': "Q0", 'doc_id': "D1", 'score': 1.0},
        {'query_id': "Q1", 'doc_id': "D0", 'score': 2.4},
        {'query_id': "Q1", 'doc_id': "D3", 'score': 3.6},
    ])

If your dataframe has columns named something else, you can always map them with the ``rename``
function. For instance, if your dataframe has the columns ``qid``, ``docno``, and ``output``, you
can easily make a run dataframe that is compatible with ir-measures like so::

    run = df.rename(columns={'qid': 'query_id', 'docno': 'doc_id', 'output': 'score'})

**TREC-formatted run file**: You can read a TREC-formatted run file::

    # a file path:
    run = ir_measures.read_trec_run('path/to/run')

    # raw run file contents:
    run = ir_measures.read_trec_run('''
    Q0 0 D0 0 1.2 runid
    Q0 0 D1 1 1.0 runid
    Q1 0 D3 0 3.6 runid
    Q1 0 D0 1 2.4 runid
    ''')
    # TREC run format: "query_id ignored doc_id rank score runid".
    # This parser ignores "ignored", "rank", and "runid".

Note that ``read_trec_run`` returns a generator. If you need to use the run multiple times, wrap it
in the ``list`` constructor to read the whole run into memory.

**dict-of-dict**: Run structured in a hierarchy. At the first level, query IDs map to another
dictionary. At the second level, document IDs map to (float) ranking scores::

    run = {
        'Q0': {
            "D0": 1.2,
            "D1": 1.0,
        },
        "Q1": {
            "D0": 2.4,
            "D3": 3.6
        }
    }

Measure Objects
---------------------------------------

Measure objects specify the measure you want to calculate, along with any parameters it may have.
There are several ways to create them. The easiest is to specify them directly in code:

>>> from ir_measures import *  # imports all measure names
>>> AP
AP
>>> AP(rel=2)
AP(rel=2)
>>> nDCG@20
nDCG@20
>>> P(rel=2)@10
P(rel=2)@10

Notice that measures can include parameters. For instance, ``AP(rel=2)`` is the average precision
measure with a minimum relevance level of 2 (i.e., documents need to be judged at least 2 to count
as relevant), and ``nDCG@20`` specifies a ranking cutoff threshold of 20. See each measure's
documentation for full details of the available parameters.

If you need to get a measure object from a string (e.g., if specified by the user as a command line
argument), use the ``ir_measures.parse_measure`` function:

>>> ir_measures.parse_measure('AP')
AP
>>> ir_measures.parse_measure('AP(rel=2)')
AP(rel=2)
>>> ir_measures.parse_measure('nDCG@20')
nDCG@20
>>> ir_measures.parse_measure('P(rel=2)@10')
P(rel=2)@10

If you are familiar with the measure and family names from ``trec_eval``, you can map them to
measure objects using ``ir_measures.parse_trec_measure()``:

>>> ir_measures.parse_trec_measure('map')
[AP]
>>> ir_measures.parse_trec_measure('P')  # expands to multiple cutoff levels
[P@5, P@10, P@15, P@20, P@30, P@100, P@200, P@500, P@1000]
>>> ir_measures.parse_trec_measure('P_3,8')  # or 'P.3,8'
[P@3, P@8]
>>> ir_measures.parse_trec_measure('ndcg')
[nDCG]
>>> ir_measures.parse_trec_measure('ndcg_cut_10')
[nDCG@10]
>>> ir_measures.parse_trec_measure('official')
[P@5, P@10, P@15, P@20, P@30, P@100, P@200, P@500, P@1000, Rprec, Bpref, IPrec@0.0, IPrec@0.1, IPrec@0.2, IPrec@0.3, IPrec@0.4, IPrec@0.5, IPrec@0.6, IPrec@0.7, IPrec@0.8, IPrec@0.9, IPrec@1.0, AP, NumQ, NumRel, NumRet(rel=1), NumRet, RR]

Note that a single ``trec_eval`` measure name can map to multiple measures, so measures are always
returned as a list.
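Because each ``trec_eval`` name expands to a list of measures, you will typically flatten the
parsed results before evaluating them. A minimal sketch, assuming the names arrive as user-supplied
strings (e.g., from a configuration file or command line)::

    # flatten the lists returned by parse_trec_measure into a single list of measures
    measures = []
    for name in ['ndcg_cut_10', 'P_3,8']:  # e.g., trec_eval-style names supplied by the user
        measures.extend(ir_measures.parse_trec_measure(name))
    # measures is now [nDCG@10, P@3, P@8]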
Measures can be passed into methods like ``ir_measures.calc_aggregate``, ``ir_measures.iter_calc``,
and ``ir_measures.evaluator``. You can also calculate values from the measure object itself:

>>> AP.calc_aggregate(qrels, run)
0.2842120439595336
>>> (nDCG@10).calc_aggregate(qrels, run)  # parens needed when @cutoff is used
0.6250748053944134
>>> for metric in (P(rel=2)@10).iter_calc(qrels, run):
...     print(metric)
Metric(query_id='1', measure=P(rel=2)@10, value=0.5)
Metric(query_id='2', measure=P(rel=2)@10, value=0.8)
...
Metric(query_id='35', measure=P(rel=2)@10, value=0.9)

Scoring multiple runs
---------------------------------------

Sometimes you need to evaluate several different systems using the same benchmark. To avoid
redundant work for every run (such as processing the qrels), you can create an
``evaluator(measures, qrels)`` object that can be re-used across multiple runs. An evaluator object
has ``calc_aggregate(run)`` and ``iter_calc(run)`` methods.

>>> evaluator = ir_measures.evaluator([nDCG@10, P@5, P(rel=2)@5, Judged@10], qrels)
>>> evaluator.calc_aggregate(run1)
{nDCG@10: 0.6250, P@5: 0.7485, P(rel=2)@5: 0.6000, Judged@10: 0.9485}
>>> evaluator.calc_aggregate(run2)
{nDCG@10: 0.6285, P@5: 0.7771, P(rel=2)@5: 0.6285, Judged@10: 0.9400}
>>> evaluator.calc_aggregate(run3)
{nDCG@10: 0.5286, P@5: 0.6228, P(rel=2)@5: 0.4628, Judged@10: 0.8485}

.. [1] In the examples, ``P@5`` and ``nDCG@10`` are returned first, as they are both calculated in
   one invocation of ``pytrec_eval``. Then, results for ``P(rel=2)@5`` are returned (a second
   invocation of ``pytrec_eval``, because it only supports one relevance level at a time). Finally,
   results for ``Judged@10`` are returned, as these are calculated by the ``judged`` provider.

Empty Set Behaviour
---------------------------------------

ir-measures normalizes behaviour across tools by always returning results based on all queries that
appear in the provided qrels, regardless of what appears in the run. This corresponds to the ``-c``
flag in ``trec_eval``. Queries that appear in the run but not the qrels are ignored, and queries
that appear in the qrels but not the run are given a score of 0. This behaviour is based on the
following reasoning:

1. Queries that do not appear in the qrels were not judged, and therefore cannot be properly scored
   if returned in the run.
2. Queries that do not appear in the run may simply have returned no results, and are therefore
   scored as such.

We believe these are the proper settings, so there is currently no way to change this behaviour
directly in the software. If you wish to score only some of the queries provided in the qrels, you
may of course filter the qrels passed to ir-measures down to just those queries.
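For example, assuming ``qrels`` is a list of ``ir_measures.Qrel`` tuples as in the examples above,
and ``keep_qids`` is a hypothetical set of the query IDs you want to score::

    keep_qids = {'Q0', 'Q1'}  # hypothetical subset of query IDs to evaluate
    filtered_qrels = [qrel for qrel in qrels if qrel.query_id in keep_qids]
    ir_measures.calc_aggregate([nDCG@10, P@5], filtered_qrels, run)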