Participants in the TIGER challenge will develop an algorithm that performs three tasks. Given a whole-slide image, the algorithm is expected to:
- segment several tissue compartments
- detect lymphocytes and plasma cells
- compute a single TILs score
The performance of algorithms for these three tasks will be evaluated automatically by the grand-challenge platform on hidden test data (an experimental test set during the challenge and a final test set at the end of the challenge; see the Data section for details). Two leaderboards will be created, and the top-3 methods in each leaderboard will be awarded a prize (see the Rules section for details).
Leaderboard 1
In Leaderboard 1 (L1), we will evaluate the performance of each algorithm on the "computer vision" tasks, i.e., tissue segmentation and cell detection. During the TIGER challenge, an experimental test set of n=26 WSIs will be used, for which regions of interest have been manually annotated with invasive tumor and tumor-associated stroma, as well as with point annotations of lymphocytes and plasma cells. At the end of TIGER, a final test set of n=38 WSIs will be used; these WSIs contain the same type of annotations as the experimental test set.
For tissue segmentation, we compute two Dice scores: one for stroma, defined as tumor-associated stroma (mask value 2) grouped together with inflamed stroma (mask value 6), versus all other classes, and one for invasive tumor (mask value 1) versus all other classes. 'Normal/healthy' stroma is considered part of the rest; when computing the tumor and stroma metrics, all other mask values are treated as rest.
The motivation is that regions of invasive tumor and of tumor-associated stroma play a central role in the definition of the TILs score.
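As an illustration, here is a minimal sketch of how the two Dice scores could be computed from label masks with the values listed above. It assumes NumPy arrays of integer labels; the function names are ours, and this is not the official evaluation code.

```python
import numpy as np

def dice(pred_bin, ref_bin):
    """Dice coefficient between two binary masks."""
    intersection = np.logical_and(pred_bin, ref_bin).sum()
    denom = pred_bin.sum() + ref_bin.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def tiger_dice_scores(pred, ref):
    """Stroma = tumor-associated stroma (2) grouped with inflamed stroma (6);
    tumor = invasive tumor (1). All other mask values count as 'rest'."""
    stroma_dice = dice(np.isin(pred, [2, 6]), np.isin(ref, [2, 6]))
    tumor_dice = dice(pred == 1, ref == 1)
    return stroma_dice, tumor_dice
```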
For cell detection (lymphocytes and plasma cells), we perform a Free Response Operating Characteristic (FROC) analysis, computing sensitivity (TPR) versus the average number of false positives (FP) per mm² over all slides. Participants will have to submit a list of (x,y) coordinates with the location of each predicted cell (see Submission requirements for details), which will be compared with the manual reference standard. Note that in Leaderboard 1, we expect algorithms to predict the location of cells (lymphocytes and plasma cells) in any tissue compartment, not solely in the tumor-associated stroma, because in this phase we only assess the "computer vision" performance of algorithms. In the FROC analysis, a manually annotated cell counts as a "hit" if a predicted cell lies within 4 microns of the manual annotation ("hit criterion"). From this, we compute TPs, FPs, and FNs and use them in the FROC analysis. From the FROC curve, we derive an "FROC score" by taking the sensitivity at six pre-selected values of FP/mm²: [10, 20, 50, 100, 200, 300]. The score computation may be fine-tuned during the challenge to better compare the best methods.
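To make the hit criterion concrete, below is a simplified sketch of how predicted cell coordinates (in microns) could be matched to reference annotations within 4 microns, using a greedy highest-probability-first assignment. The exact matching scheme of the official evaluation may differ; sweeping a threshold on the prediction probabilities then yields the (FP/mm², sensitivity) points of the FROC curve.

```python
import numpy as np
from scipy.spatial import cKDTree

HIT_DISTANCE_UM = 4.0  # "hit criterion" from the challenge description

def match_detections(pred_xy, pred_prob, ref_xy, max_dist=HIT_DISTANCE_UM):
    """Greedily match predictions (highest probability first) to reference points
    within `max_dist` microns. Returns per-prediction TP flags and the number of
    missed reference cells (FN). Coordinates are assumed to be in microns."""
    order = np.argsort(-np.asarray(pred_prob))
    tree = cKDTree(ref_xy) if len(ref_xy) else None
    matched = set()
    tp_flags = np.zeros(len(pred_xy), dtype=bool)
    for i in order:
        if tree is None:
            break
        dist, j = tree.query(pred_xy[i])
        if dist <= max_dist and j not in matched:
            matched.add(j)
            tp_flags[i] = True  # unmatched predictions remain FPs
    fn = len(ref_xy) - len(matched)
    return tp_flags, fn
```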
Below we show a schematic overview of the pipeline of Leaderboard 1. Please note that your algorithm should expect two inputs, namely 1) an image and 2) a mask. For Leaderboard 1, this mask contains multiple regions of interest: a region of interest consists of 1s, and the rest of the mask consists of 0s. For Leaderboard 1, your algorithm only needs to predict the segmentation map and the detected cells within the regions of interest, i.e., where the mask has a value of 1.
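As a small, hypothetical example of what "predicting only inside the regions of interest" means in practice, the sketch below zeroes out the segmentation outside the mask and drops detections that fall outside it (array shapes and names are our assumptions, not part of the challenge interface):

```python
import numpy as np

def restrict_to_roi(segmentation, detections_xy, roi_mask):
    """Keep only output inside the regions of interest (mask value 1).
    `segmentation` and `roi_mask` are label arrays of the same shape;
    `detections_xy` is an (N, 2) array of integer (x, y) pixel coordinates."""
    segmentation = np.where(roi_mask == 1, segmentation, 0)
    keep = roi_mask[detections_xy[:, 1], detections_xy[:, 0]] == 1
    return segmentation, detections_xy[keep]
```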
The overall score in L1 will be based on a ranking combination of the segmentation and detection performance. Baseline performance for detection and segmentation will be provided before Leaderboard 1 is open for submission.
Participants are allowed to make two submissions per week to Leaderboard 1; each submission will be evaluated on the experimental test set and used to update the leaderboard.
Leaderboard 2
In Leaderboard 2 (L2), we will evaluate the prognostic value of the automated "TILs score" generated by submitted algorithms for each WSI in the test set. During the TIGER challenge, an experimental set of n=200 WSIs will be used, for which clinicopathological variables are available, including recurrence and survival data. At the end of TIGER, the prognostic value of the automated TILs score will be assessed on a final test set of n=707 cases.
Below we show a schematic overview of the pipeline of leaderboard 2. Please note that your algorithm should expect two inputs, namely 1) an image and 2) a mask. For leaderboard 2, this mask contains 1s for tissue and 0s for background and fat. For leaderboard 2, we expect prediction masks, a JSON with detections, and a TILs score. However, only the TILs score will be used for the evaluation.
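As a rough illustration only (the authoritative output formats and file names are defined in the Submission requirements section; the paths and JSON layout below are placeholders, not the official interface), an algorithm container might write two of its three outputs along these lines, with the segmentation mask saved separately as an image:

```python
import json
from pathlib import Path

def write_outputs(output_dir, detections_xy_prob, tils_score):
    """Write a JSON with detected cells and a single TILs score for the slide.
    File names and JSON layout are placeholders; see the Submission requirements
    for the authoritative formats."""
    output_dir = Path(output_dir)
    detections = {
        "points": [
            {"point": [float(x), float(y)], "probability": float(p)}
            for x, y, p in detections_xy_prob
        ]
    }
    (output_dir / "detected-cells.json").write_text(json.dumps(detections))
    (output_dir / "tils-score.json").write_text(json.dumps(float(tils_score)))
```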
To assess the prognostic value of the automated TILs score, for each submission we will build a multivariate Cox regression model trained with predefined clinical variables plus the produced TILs score. We will compute the concordance index (C-index) of this model and rank algorithms based on its value. If you are familiar with Receiver Operating Characteristic (ROC) curves, the C-index can be seen as the equivalent of the Area Under the Curve (AUC) for a fitted survival model. In particular, we will use Uno's C-index (Uno et al., Statistics in Medicine, 2011).
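For intuition, here is a minimal sketch of this kind of evaluation using scikit-survival's Cox model and its IPCW-based implementation of Uno's C-index (concordance_index_ipcw). Variable names and the in-sample evaluation are our simplifications; the actual model fitting and scoring is performed automatically on grand-challenge.org.

```python
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv

def evaluate_tils_score(clinical_df, tils_scores, event, time_to_event):
    """Fit a Cox model on (numerically encoded) clinical covariates plus the
    automated TILs score and report Uno's C-index. `clinical_df` is a pandas
    DataFrame; `event` is a boolean array (recurrence observed) and
    `time_to_event` the follow-up time."""
    X = clinical_df.copy()
    X["tils_score"] = tils_scores
    y = Surv.from_arrays(event=event, time=time_to_event)
    model = CoxPHSurvivalAnalysis().fit(X, y)
    risk = model.predict(X)  # higher predicted risk = shorter recurrence-free survival
    # Uno's IPCW-based C-index; evaluated in-sample here for simplicity.
    cindex, *_ = concordance_index_ipcw(y, y, risk)
    return cindex
```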
In Leaderboard 2, the C-index of a baseline survival model that includes only the predefined clinical variables (age, morphology subtype, grade, molecular subtype, stage, surgery, adjuvant therapy) is 0.63 [CI: 0.42, 0.82]. In order to be eligible for an award, models trained with the TILs score added to these clinical variables will need to have a C-index higher than that of the baseline survival model. We also report on L2 the performance of a regression model trained with all aforementioned clinical variables plus the TILs score produced by the TIGER baseline algorithm, which results in a C-index of 0.70 [CI: 0.51, 0.87]. Note that participants will not have to train the regression model; it will be fitted automatically on grand-challenge.org after each submission to Leaderboard 2 as part of the evaluation procedure. Furthermore, we will provide a Kaplan-Meier curve showing the recurrence-free survival probability for two groups split at the median of your model's predicted TILs scores. This curve is not used to rank your method; it only allows you to see how your model performs in a univariate setting.
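As an illustration of this univariate Kaplan-Meier comparison (again using scikit-survival, with names of our choosing; not the code used to generate the official curves), one could plot the two median-split groups as follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from sksurv.nonparametric import kaplan_meier_estimator

def km_median_split(tils_scores, event, time_to_event):
    """Kaplan-Meier recurrence-free survival curves for cases with a predicted
    TILs score above vs. below the median (for inspection only; not for ranking)."""
    tils_scores = np.asarray(tils_scores)
    event = np.asarray(event, dtype=bool)
    time_to_event = np.asarray(time_to_event)
    high = tils_scores >= np.median(tils_scores)
    for label, group in [("TILs score >= median", high), ("TILs score < median", ~high)]:
        times, survival = kaplan_meier_estimator(event[group], time_to_event[group])
        plt.step(times, survival, where="post", label=label)
    plt.xlabel("time")
    plt.ylabel("recurrence-free survival probability")
    plt.legend()
    plt.show()
```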
Submissions and evaluation
Submissions and evaluation of algorithms in the TIGER challenge are scheduled as follows.
Once leaderboard 1 is opened, participants have a limit of 2 submissions per week to L1. After each submission, the submitted algorithm will be run on the experimental test set of L1 and the leaderboard will be updated.
Once leaderboard 2 is opened, only a limited number of algorithms will be run on the experimental test set of L2 each week. The expected number of algorithms evaluated per week on L2 is five, but this may be fine-tuned during the course of the challenge. Every week, only the newest algorithms at the top of L1 will be invited to submit to L2. For example, in the first week that L2 is open, the top-5 methods of L1 will be evaluated. In the second week, if the top-5 contains N new algorithms, those will be run on the experimental test set of L2, and the 5-N highest-ranked algorithms below the top-5 that have not yet been evaluated will be run as well. A similar mechanism will be applied until the end of TIGER. This procedure may be fine-tuned during the challenge depending on the number of participating teams, the number of submissions to L1, and the computing time of submitted algorithms. If your algorithm is not among these newest top algorithms and you still want to submit it to L2, please get in touch with one of the organizers.