Participants in the TIGER challenge will develop an algorithm that performs three tasks. Given a whole-slide image, the algorithm is expected to:

- segment several tissue compartments
- detect lymphocytes and plasma cells
- compute a single TILs score

The performance of algorithms on these three tasks will be evaluated automatically by the grand-challenge platform on hidden test data (an experimental test set during the challenge and a final test set at the end of the challenge; see the Data section for details). Two leaderboards will be created, and the top-3 methods in each leaderboard will be awarded a prize (see the Rules section for details).

#### Leaderboard 1

In Leaderboard 1 (L1), we will evaluate the performance of each
algorithm on "computer vision" tasks, i.e., __tissue segmentation__
and __cell detection__. During the TIGER challenge, an experimental
test set of n=26 WSIs will be used, for which regions of interest have
been manually annotated with regions of invasive tumor and
tumor-associated stroma, as well as with point annotations of
lymphocytes and plasma cells. At the end of TIGER, a final test set of
n=38 WSIs will be used; these WSIs contain the same type of annotations
as the experimental test set.

For __tissue segmentation__, we compute the **Dice score** for two
binary problems: (1) stroma, i.e., tumor-associated stroma (mask value
2) grouped together with inflamed stroma (mask value 6), vs. all other
classes, and (2) invasive tumor (mask value 1) vs. all other classes.
When computing each metric, all other mask values, including
'normal/healthy' stroma, are considered as rest.
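
To make the grouping of mask values concrete, below is a minimal sketch of how such a grouped Dice score could be computed; this is illustrative only, and the function name is not part of the official evaluation code:

```python
import numpy as np

def grouped_dice(pred_mask: np.ndarray, ref_mask: np.ndarray, values) -> float:
    """Dice score for the binary problem 'values vs. rest'."""
    pred = np.isin(pred_mask, values)
    ref = np.isin(ref_mask, values)
    denom = pred.sum() + ref.sum()
    return 2.0 * np.logical_and(pred, ref).sum() / denom if denom > 0 else 1.0

# Stroma: tumor-associated stroma (2) grouped with inflamed stroma (6) vs. rest.
# dice_stroma = grouped_dice(pred, ref, values=[2, 6])
# Invasive tumor: mask value 1 vs. rest.
# dice_tumor = grouped_dice(pred, ref, values=[1])
```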

The motivation is that regions of invasive tumor and of tumor-associated stroma play a central role in the definition of the TILs score.

For __cell detection__ (lymphocytes and plasma cells), we perform a
**Free Response Operating Characteristic (FROC)** analysis, computing
sensitivity (TPR) versus the average number of false positives (FP) per
mm² over all slides. Participants will have to submit a list of (x,y)
coordinates with the location of each predicted cell (see
Submission requirements for details), which will be
compared with the manual reference standard. **Note that in Leaderboard
1, we expect algorithms to predict the location of cells (lymphocytes
and plasma cells) in any tissue compartment, not solely in the
tumor-associated stroma**. This is because, in this phase, we only
assess the "computer vision performance" of algorithms. In the FROC
analysis, we will count a manually annotated cell as a "hit" if the
location of a predicted cell is within 4 microns of the manual
annotation (the "hit criterion"). From this, we will compute TPs, FPs,
and FNs, and use them in the FROC analysis. From the FROC curve, we
derive an "FROC score" by averaging the sensitivity at six pre-selected
values of FP/mm²: [10, 20, 50, 100, 200, 300]. The score computation may
be fine-tuned during the challenge to better compare the best methods.
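
As a rough illustration of the hit criterion and score aggregation, the sketch below matches predicted coordinates (assumed to be in microns) to reference cells and averages sensitivity over the six operating points; the organizers' exact matching and aggregation may differ:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_detections(pred_xy, pred_scores, ref_xy, hit_radius=4.0):
    """Greedily match predictions (highest confidence first) to reference
    cells within hit_radius microns; each reference cell is hit at most once."""
    pred_xy, pred_scores = np.asarray(pred_xy), np.asarray(pred_scores)
    tree = cKDTree(np.asarray(ref_xy))
    matched, tp_scores, fp_scores = set(), [], []
    for i in np.argsort(-pred_scores):
        candidates = tree.query_ball_point(pred_xy[i], r=hit_radius)
        hit = next((c for c in candidates if c not in matched), None)
        if hit is None:
            fp_scores.append(pred_scores[i])
        else:
            matched.add(hit)
            tp_scores.append(pred_scores[i])
    return np.asarray(tp_scores), np.asarray(fp_scores)

def froc_score(tp_scores, fp_scores, n_ref, area_mm2,
               targets=(10, 20, 50, 100, 200, 300)):
    """Average sensitivity at the pre-selected FP/mm2 operating points
    (simple O(n^2) sweep over all score thresholds, for clarity)."""
    sens_at = []
    for target in targets:
        best = 0.0
        for t in np.concatenate([tp_scores, fp_scores]):
            if (fp_scores >= t).sum() / area_mm2 <= target:
                best = max(best, (tp_scores >= t).sum() / n_ref)
        sens_at.append(best)
    return float(np.mean(sens_at))
```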

Below we show a schematic overview of the pipeline of leaderboard 1. Please note that your algorithm should expect two inputs, namely 1) an image and 2) a mask. For leaderboard 1, this mask contains multiple regions of interest: each region of interest consists of 1s, and the rest of the mask consists of 0s. For leaderboard 1, your algorithm only needs to predict the segmentation map and detect cells within the regions of interest, i.e., where the mask has a value of 1.
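
A hypothetical sketch of how an L1 algorithm might restrict its outputs to the regions of interest is shown below; `predict_segmentation` and `detect_cells` stand in for your own model and are not part of any provided API:

```python
import numpy as np

def run_l1(image: np.ndarray, roi_mask: np.ndarray):
    """Only predictions inside the ROIs (mask == 1) are evaluated."""
    seg = predict_segmentation(image)   # per-pixel tissue labels (your model)
    seg[roi_mask == 0] = 0              # blank out everything outside the ROIs
    detections = [(x, y, score) for (x, y, score) in detect_cells(image)
                  if roi_mask[int(y), int(x)] == 1]
    return seg, detections
```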

The overall score in L1 will be based on a combination of the
**rankings** for segmentation and detection performance. Baseline
performance for detection and segmentation will be provided before
Leaderboard 1 opens for submission.

Participants are allowed to make __two submissions per week__ to
leaderboard 1; each submission will be evaluated on the experimental
test set and used to update the leaderboard.

#### Leaderboard 2

In Leaderboard 2 (L2), we will evaluate the prognostic value of the automated "TILs score" generated by submitted algorithms for each WSI in the test set. During the TIGER challenge, an experimental test set of n=200 WSIs will be used, for which clinicopathological variables are available, including recurrence and survival data. At the end of TIGER, the prognostic value of the automated TILs score will be assessed on a final test set of n=707 cases.

Below we show a schematic overview of the pipeline of leaderboard 2. Please note that your algorithm should expect two inputs, namely 1) an image and 2) a mask. For leaderboard 2, this mask contains 1s for tissue and 0s for background and fat. For leaderboard 2, we expect prediction masks, a JSON with detections, and a TILs score as outputs; however, only the TILs score will be used for the evaluation.
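
For illustration, the snippet below writes the three expected outputs; the file names and JSON schema here are placeholder assumptions, and the Submission requirements define the exact interface:

```python
import json

# Placeholder file names and schema; see the Submission requirements for
# the exact output format expected by grand-challenge.org.
tils_score = 42.0                                      # slide-level TILs score
detections = [{"x": 1200.5, "y": 834.0, "probability": 0.91}]

with open("til_score.json", "w") as f:
    json.dump(tils_score, f)
with open("detections.json", "w") as f:
    json.dump(detections, f)
```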

To assess the prognostic value of the automated TILs score, for each
submission we will build a multivariate Cox regression model trained
with predefined clinical variables and the produced TILs score. We will
compute the **concordance index** (C-index) of this model and rank
algorithms based on its value. If you are familiar with Receiver
Operating Characteristic (ROC) curves, the C-index can be seen as the
equivalent of the Area Under the Curve (AUC), computed based on a fitted
survival model. In particular, we will use Uno's C-index, which is
described in this paper.
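
As a minimal sketch of this evaluation idea (assuming the scikit-survival package; all variable names are illustrative and the organizers' exact pipeline may differ):

```python
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv

# Illustrative inputs: `clinical_vars` holds the numerically encoded
# predefined clinical variables (one row per case), `tils_scores` the
# submitted TILs score per case, and `event`/`time` the recurrence-free
# survival outcomes.
X = np.column_stack([clinical_vars, tils_scores])
y = Surv.from_arrays(event=event, time=time)

cox = CoxPHSurvivalAnalysis().fit(X, y)
risk = cox.predict(X)  # higher predicted risk should mean earlier events

# Uno's C-index: an IPCW-based concordance estimate that is robust to
# censoring (here the same data estimates the censoring distribution).
c_index = concordance_index_ipcw(y, y, risk)[0]
print(f"Uno's C-index: {c_index:.2f}")
```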

In leaderboard 2, the C-index of a __baseline survival model in which
only predefined clinical variables__ are included (age, morphology
subtype, grade, molecular subtype, stage, surgery, adjuvant therapy) is
**0.63 [CI: 0.42, 0.82]**. In order to be eligible for an award, models
trained with the TILs score added to the other clinical variables will
need to have a C-index higher than that of the baseline survival model.
Additionally, we report on L2 the performance of a regression model
trained with all aforementioned clinical variables plus the TILs score
produced by the __TIGER baseline algorithm__, which results in a
C-index of **0.70 [CI: 0.51, 0.87]**. Note that participants will not
have to train the regression model; it will be trained automatically on
grand-challenge.org after submission to leaderboard 2 as part of the
evaluation procedure. Furthermore, we will provide a Kaplan-Meier curve
showing the recurrence-free survival probability for two groups split at
the median of your model's predicted TILs scores. This curve is not used
to rank your method; it only allows you to see how your model performs
in a univariate setting.
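
The univariate Kaplan-Meier comparison could look like the sketch below (using the lifelines package; the DataFrame `df` with columns `time`, `event`, and `tils` is an assumption, not a provided interface):

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Split cases at the median predicted TILs score, as described above.
high = df["tils"] >= df["tils"].median()

ax = plt.subplot()
for label, group in [("TILs >= median", df[high]), ("TILs < median", df[~high])]:
    km = KaplanMeierFitter()
    km.fit(group["time"], event_observed=group["event"], label=label)
    km.plot_survival_function(ax=ax)
ax.set_xlabel("Time")
ax.set_ylabel("Recurrence-free survival probability")
plt.show()
```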

#### Submissions and evaluation

Submissions and evaluation of algorithms in the TIGER challenge are scheduled as follows.

Once leaderboard 1 is opened, participants are limited to two submissions per week to L1. After each submission, the submitted algorithm will be run on the experimental test set of L1 and the leaderboard will be updated.

Once leaderboard 2 is opened, only a limited number of algorithms will
be run on the experimental test set of L2 each week. The expected number
of algorithms evaluated per week on L2 is five, but this may be
fine-tuned during the challenge. Every week, __only the newest
top-ranked algorithms submitted to L1 will be invited to submit to
L2__. For example, in the first week L2 is open, the top-5 methods of L1
will be evaluated. In the second week, if the top-5 contains N new
algorithms, those will be run on the experimental test set of L2, and
the 5-N highest-ranked algorithms below the top-5 that have not yet been
evaluated will be run as well. A similar mechanism will be applied until
the end of TIGER. This procedure may be fine-tuned during the challenge
depending on the number of participating teams, the number of
submissions to L1, and the computing time of submitted algorithms. __If
your algorithm is not among the invited algorithms and you still want to
submit it to L2, please get in touch with one of the organizers__.
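
To illustrate the weekly invitation mechanism, here is a sketch of our reading of the procedure (the actual selection is run by the organizers and may be fine-tuned):

```python
def select_for_l2(l1_ranking, already_evaluated, budget=5):
    """Invite up to `budget` algorithms from the L1 ranking (best first)
    that have not yet been evaluated on L2."""
    return [a for a in l1_ranking if a not in already_evaluated][:budget]

# Week 1: the top-5 of L1 are invited. Week 2: new top-5 entries go first,
# then the next-ranked algorithms that were not evaluated before.
```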