Training data


In TIGER, we publicly release training data consisting of whole-slide images (WSIs) of Her2-positive (Her2+) and triple-negative (TNBC) breast cancer, together with manual annotations. Training data comes from multiple sources. A subset of Her2+ and TNBC cases is provided by the Radboud University Medical Center (RUMC) (Nijmegen, Netherlands). A second subset of Her2+ and TNBC cases is provided by the Institut Jules Bordet (JB) (Brussels, Belgium). A third subset, containing TNBC cases only, is derived from the TCGA-BRCA archive obtained from the Genomic Data Commons Data Portal.

All slides and annotations can be downloaded by the participants and used to develop AI models to 1) segment several tissue compartments and 2) detect lymphocytes and plasma cells, which are the main morphological components to consider to assess the TILs in breast cancer, and finally use this information to design a "TILs score". We release training data in the format of three datasets, which we call WSIROIS, WSIBULK, and WSITILS, described on this page. All data, both at WSI and at ROI level, is released at a spacing (pixel size) of approximately 0.5 um/px.

How to download the public training dataset

The full training dataset, containing the WSIROIS, WSIBULK and WSITILS subsets, can be downloaded from AWS Open Data at this link.

To download the public training set, please make sure that the latest version of the AWS CLI is installed on your system by following these instructions:

https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

With the AWS CLI installed, you can download the public training set (no AWS account required) by running:

aws s3 cp s3://tiger-training/ /path/to/destination/ --recursive --no-sign-request
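To fetch only part of the dataset, the same command can be combined with the AWS CLI --exclude/--include filters. The following is a minimal sketch that builds such a command in Python; the helper name build_download_command and the example prefix "wsitils/" are assumptions for illustration (check data-structure.txt for the actual folder names), and --dryrun lets you preview the transfer without downloading anything.

```python
import shlex

BUCKET_URI = "s3://tiger-training/"  # bucket used in the command above


def build_download_command(destination, prefix=None, dry_run=False):
    """Return the AWS CLI argument list for downloading the training data.

    prefix is an optional folder inside the bucket (hypothetical example:
    "wsitils/"); the real folder names are listed in data-structure.txt.
    """
    cmd = ["aws", "s3", "cp", BUCKET_URI, destination,
           "--recursive", "--no-sign-request"]
    if prefix:
        # Filters are applied in order: exclude everything,
        # then re-include only the requested folder.
        cmd += ["--exclude", "*", "--include", prefix.rstrip("/") + "/*"]
    if dry_run:
        # Show what would be transferred without copying any file.
        cmd.append("--dryrun")
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line for one (assumed) subset folder.
    print(shlex.join(build_download_command(
        "/path/to/destination/", prefix="wsitils/", dry_run=True)))
```

The printed command can be pasted into a terminal, or the list can be passed to subprocess.run once the preview looks right.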

If you have any difficulties downloading the data, please ask a question in the forum.

Structure and content of the training dataset

The structure of the training data is the following (screenshot taken from the data-structure.txt file):


For all slides, we kept the original anonymized filename provided by each data source. Slides from TCGA-BRCA are identifiable via the TCGA- prefix in the filename (e.g., TCGA-A1-A0SK-01Z-00-DX1.A44D70FA-4D96-43F4-9DD7-A61535786297.tif). Slides from Radboud University Medical Center are named TC_S01_P000XXX_C0001_BYYY, where XXX is an anonymized patient ID and YYY is an internal code that is not relevant for TIGER. Slides from Institut Jules Bordet are named either ZZZB or ZZZS, where ZZZ is an anonymized patient ID, the suffix "B" indicates a biopsy, and the suffix "S" indicates a surgical resection. Note that this naming strategy is not necessarily the same as the one used in the test datasets.
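The naming conventions above can be turned into a small helper that recovers the data source of a training slide from its filename. This is a sketch under assumptions: the function name parse_slide_name is hypothetical, the XXX, YYY and ZZZ placeholders are assumed to be alphanumeric, and the RUMC pattern follows the literal template given above. It only applies to the training set, since test data may be named differently.

```python
import re

# Patterns follow the naming templates described in the text.
RUMC_PATTERN = re.compile(r"^TC_S01_P000(?P<patient>\w+)_C0001_B(?P<code>\w+)$")
JB_PATTERN = re.compile(r"^(?P<patient>\w+?)(?P<kind>[BS])$")
JB_SPECIMEN = {"B": "biopsy", "S": "surgical resection"}


def parse_slide_name(filename):
    """Return a dict with the data source (and extra fields) of a slide."""
    # Strip only a known extension: TCGA names contain dots themselves.
    stem = re.sub(r"\.(tif|xml|png)$", "", filename, flags=re.IGNORECASE)

    if stem.startswith("TCGA-"):
        # The first 12 characters of a TCGA barcode identify the patient.
        return {"source": "TCGA", "patient": stem[:12]}

    match = RUMC_PATTERN.match(stem)
    if match:
        return {"source": "RUMC", "patient": match["patient"]}

    match = JB_PATTERN.match(stem)
    if match:
        return {"source": "JB", "patient": match["patient"],
                "specimen": JB_SPECIMEN[match["kind"]]}

    raise ValueError(f"Unrecognized slide name: {filename!r}")


if __name__ == "__main__":
    print(parse_slide_name("114S.tif"))
```

Checking the TCGA prefix first avoids the permissive JB pattern matching names from the other two sources.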

1. WSIROIS: Whole-slide images with manual annotations in regions of interest

In this set, we release n=195 whole-slide images of breast cancer, including both (core-needle) biopsies and surgical resections, with regions of interest (ROIs) selected and manually annotated. This dataset contains images and annotations from multiple sources:

  • TCGA: n=151 WSIs of TNBC cases from the TCGA-BRCA archive (the original slides can also be downloaded from the GDC Data Portal). Annotations are extracted and adapted from the publicly available BCSS [1] and NuCLS [2] datasets. The full set of n=151 TCGA-BRCA slides used in TIGER, already converted to TIF at a spacing of 0.5 um/px, can be downloaded from this link. Note that the manual annotations released with TIGER refer to a spacing of 0.5 um/px and may not be directly applicable to some of the original TCGA-BRCA slides (originally scanned at 0.25 um/px).
  • RUMC: n=26 WSIs of TNBC and Her2+ cases from Radboud University Medical Center (Netherlands). Annotations were made by a panel of board-certified breast pathologists.
  • JB: n=18 WSIs of TNBC and Her2+ cases from Jules Bordet Institute (Belgium). Annotations were made by a panel of board-certified breast pathologists.

In each WSI, ROIs are manually annotated with both polygons indicating different tissue compartments, and with point annotations indicating lymphocytes and plasma cells. The following is an example of WSI (case 114S from the training set) with three ROIs manually annotated, viewed with the ASAP histopathology viewer:


In each ROI, the following regions are annotated, with the corresponding labels:

  • invasive tumor (label=1): this class contains regions of the invasive tumor, including several morphological subtypes, such as invasive ductal carcinoma and invasive lobular carcinoma;
  • tumor-associated stroma (label=2): this class contains regions of stroma (i.e., connective tissue) that are associated with the tumor. This means stromal regions contained within the main bulk of the tumor and in its close surrounding; in some cases, the tumor-associated stroma might resemble the "healthy" stroma, typically found outside of the tumor bulk; 
  • in-situ tumor (label=3): this class contains regions of in-situ malignant lesions, such as ductal carcinoma in situ (DCIS) or lobular carcinoma in situ (LCIS);
  • healthy glands (label=4): this class contains regions of glands with healthy epithelial cells;
  • necrosis not in-situ (label=5): this class contains regions of necrotic tissue that are not part of an in-situ tumor. For example, ductal carcinoma in situ (DCIS) often presents a typical necrotic pattern, which can be considered part of the lesion itself; such a necrotic region is not annotated as "necrosis" but as "in-situ tumor";
  • inflamed stroma (label=6): this class contains tumor-associated stroma that has a high density of lymphocytes (i.e., it is "inflamed"). When it comes to assessing the TILs, inflamed stroma and tumor-associated stroma can be considered together, but they were annotated separately to account for differences in their visual patterns;
  • rest (label=7): this class contains regions of several tissue compartments that are not specifically annotated in the other categories; examples are healthy stroma, erythrocytes, adipose tissue, skin, nipple, etc.
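The label values above can be kept in a lookup table, and the two stroma classes can be merged when only the TILs assessment matters. The sketch below assumes the mask is available as raw 8-bit label values (for example, the bytes of a mask array); the names TISSUE_LABELS and merge_inflamed_stroma are hypothetical. Label 0 is the "exclude" value used in the relabeling of BCSS data described later on this page.

```python
# Tissue labels as listed above; 0 marks excluded (non-annotated) pixels.
TISSUE_LABELS = {
    0: "exclude",
    1: "invasive tumor",
    2: "tumor-associated stroma",
    3: "in-situ tumor",
    4: "healthy glands",
    5: "necrosis not in-situ",
    6: "inflamed stroma",
    7: "rest",
}

# 256-entry translation table: inflamed stroma (6) becomes
# tumor-associated stroma (2); every other value is unchanged.
_MERGE_TABLE = bytes(2 if value == 6 else value for value in range(256))


def merge_inflamed_stroma(mask_bytes):
    """Return a copy of an 8-bit label mask with labels 6 and 2 merged."""
    return mask_bytes.translate(_MERGE_TABLE)


if __name__ == "__main__":
    # Toy one-row "mask" containing every label once.
    mask = bytes([0, 1, 2, 3, 4, 5, 6, 7])
    print(list(merge_inflamed_stroma(mask)))
```

The same lookup-table idea carries over to array libraries, where indexing a 256-entry table with the mask achieves the same remapping.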

Additionally, most ROIs contain annotations of lymphocytes and plasma cells in the form of bounding boxes. Cells were annotated using point annotations, and square bounding boxes of 8x8 microns were then constructed, centered on each point annotation. This fixed bounding box size is inspired by previous work on lymphocyte detection [3], where an average equivalent diameter of 8 microns was used for lymphocytes.
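As a worked example of the box construction, at a spacing of 0.5 um/px an 8-micron box spans 8 / 0.5 = 16 pixels per side. The sketch below converts a point annotation into such a box; the function name point_to_box is hypothetical, and the exact pixel size of the released boxes is an assumption, since the spacing is only approximately 0.5 um/px.

```python
SPACING_UM_PER_PX = 0.5  # approximate pixel size of the released data
BOX_SIZE_UM = 8.0        # side of the square box, in microns


def point_to_box(x, y, spacing=SPACING_UM_PER_PX, box_um=BOX_SIZE_UM):
    """Return (xmin, ymin, xmax, ymax) in pixels for a box centered on (x, y)."""
    half_side_px = box_um / spacing / 2.0  # 8 um at 0.5 um/px -> 8 px
    return (x - half_side_px, y - half_side_px,
            x + half_side_px, y + half_side_px)


if __name__ == "__main__":
    print(point_to_box(100, 200))  # (92.0, 192.0, 108.0, 208.0)
```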

Annotations at WSI level are released in two formats: as a multiresolution TIF image and as an XML file in ASAP format. The following are visual examples of the first ROI in case 114S in the WSIROIS training set (A1), the corresponding mask of manually annotated tissue compartments (B1, from the TIF file), the tissue borders and bounding boxes from the XML file (C1), and the combination of B1 and C1 (D1). On the left, we show an overview of labels when XML files are loaded in ASAP, and a legend with class names and corresponding colors when the "label" LUT is used in ASAP with a 255/127 window/level setting.


In all ROIs, both tissue compartments and lymphocytes are annotated. We release annotations in several formats, which can be found in the folders "roi-level-annotations" and "wsi-level-annotations" of the training dataset:

  • WSI-level annotations of both tissue compartments and cells in XML format compatible with ASAP 2.0, where tissue compartments are annotated as polygons, and lymphocytes and plasma cells are annotated with bounding boxes;
  • WSI-level annotations of tissue compartments as multiresolution TIF images compatible with ASAP 2.0, where regions of tissue compartments are labeled with the class labels as detailed above;
  • ROI-level annotations of both tissue compartments and cells. ROI images are released in PNG format; cell annotations are released as bounding boxes in the standard COCO format for object detection; tissue compartment annotations are released as PNG images containing pixel-wise class labels.
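For the ROI-level cell annotations, the standard COCO object-detection layout stores boxes as [x, y, width, height] under "annotations" and links them to entries in "images" via "image_id". The following sketch groups boxes per ROI image using only those standard keys; the function name boxes_per_image and the file name in the usage comment are hypothetical, so check the "roi-level-annotations" folder for the actual JSON file.

```python
import json
from collections import defaultdict


def boxes_per_image(coco):
    """Group COCO boxes by image file name as (xmin, ymin, xmax, ymax)."""
    file_names = {image["id"]: image["file_name"] for image in coco["images"]}
    grouped = defaultdict(list)
    for annotation in coco["annotations"]:
        # COCO stores the top-left corner plus width and height.
        x, y, width, height = annotation["bbox"]
        name = file_names[annotation["image_id"]]
        grouped[name].append((x, y, x + width, y + height))
    return dict(grouped)


if __name__ == "__main__":
    # Tiny in-memory example with the same structure as a COCO file.
    # For a real file (hypothetical name):
    #     with open("cell-annotations.json") as handle:
    #         coco = json.load(handle)
    coco = json.loads("""{
        "images": [{"id": 1, "file_name": "roi-a.png"}],
        "annotations": [{"id": 10, "image_id": 1, "category_id": 1,
                         "bbox": [5, 7, 16, 16]}],
        "categories": [{"id": 1, "name": "lymphocytes and plasma cells"}]
    }""")
    print(boxes_per_image(coco))
```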

Examples of Python code to read released WSIs and annotations and to generate mini-batches to train AI models are provided in the Code section of this site.

TCGA annotations derived from BCSS and NuCLS

You will notice that in annotations on RUMC and JB slides, three ROIs per slide are annotated, with an ROI size of approximately 500x500 microns. In annotations from TCGA slides, smaller ROIs are annotated and the number of ROIs varies from case to case. The reason is that TCGA annotations are derived by combining data from two existing datasets, BCSS and NuCLS, based on projects supported by the US National Institutes of Health National Cancer Institute (U24CA194362 and U01CA220401). In BCSS, a set of n=151 TNBC slides from TCGA was selected and one (large) ROI per slide was annotated with dense annotations of multiple tissue compartments (no cells were annotated). In NuCLS, a set of n=124 TNBC slides from TCGA was selected, which is a subset of the n=151 slides used in BCSS. In each slide in NuCLS, a variable number of (smaller) ROIs were selected within the (larger) ROI from BCSS and multiple cell types were annotated.

In the example below, A2 depicts an ROI from the WSIROIS dataset of TIGER derived from the BCSS dataset. In TIGER, we also release corresponding annotations from BCSS where we have merged the original classes and relabeled them (see relabeling map in the text below) to match TIGER classes (B2), both as TIF files and as XML files (see C2 and D2). Within this ROI, the NuCLS dataset provides cell annotations in smaller ROIs (see E2). In each smaller ROI, bounding box annotations of lymphocytes + plasma cells are released (see F2 for details of 3 ROIs).


In TIGER, we release the n=151 TCGA slides of TNBC cases used in BCSS. For each slide, we release tissue annotations in larger ROIs after merging and re-labeling BCSS classes (see directory annotations-tissue-bcss-masks or annotations-tissue-bcss-xmls); in the subset of n=124 slides common to BCSS and NuCLS, we release annotations of tissue and cells for the smaller ROIs (see directory annotations-tissue-cells-masks or annotations-tissue-cells-xmls).

Relabeling of BCSS and NuCLS data

From the original BCSS dataset, we took the following classes and relabeled them as follows for TIGER:

BCSS label                       TIGER label
normal_acinus_or_duct            healthy glands
mostly_lymphocytic_infiltrate    inflamed stroma
exclude                          exclude (label = 0)
mostly_stroma                    tumor-associated stroma
necrosis_or_debris               necrosis not in-situ
mostly_plasma_cells              inflamed stroma
roi                              tumor-associated stroma
glandular_secretions             rest
nerve                            rest
other_immune_infiltrate          inflamed stroma
angioinvasion                    invasive tumor
lymphatics                       rest
blood_vessel                     rest
mostly_blood                     rest
skin_adnexia                     rest
mostly_tumor                     invasive tumor
mostly_dcis                      in-situ tumor
idc                              invasive tumor
metaplasia_NOS                   rest
undetermined                     rest
mostly_fat                       rest
mostly_mucoid_material           rest
background                       exclude (label = 0)


Necrosis and glandular_secretions regions located inside DCIS were relabeled as in-situ tumor, and glandular_secretions regions located inside IDC were relabeled as invasive tumor.
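The class-level part of this relabeling can be written as a dictionary from BCSS class names to TIGER label values. This is a sketch of the mapping only, with hypothetical names (BCSS_TO_TIGER, relabel); it does not implement the spatial rule for necrosis and glandular secretions inside DCIS or IDC, which requires knowing the enclosing region.

```python
# TIGER label values, as defined for the tissue compartments.
TIGER_IDS = {
    "exclude": 0, "invasive tumor": 1, "tumor-associated stroma": 2,
    "in-situ tumor": 3, "healthy glands": 4, "necrosis not in-situ": 5,
    "inflamed stroma": 6, "rest": 7,
}

# BCSS class name -> TIGER class name, following the relabeling table.
BCSS_TO_TIGER = {
    "normal_acinus_or_duct": "healthy glands",
    "mostly_lymphocytic_infiltrate": "inflamed stroma",
    "exclude": "exclude",
    "mostly_stroma": "tumor-associated stroma",
    "necrosis_or_debris": "necrosis not in-situ",
    "mostly_plasma_cells": "inflamed stroma",
    "roi": "tumor-associated stroma",
    "glandular_secretions": "rest",
    "nerve": "rest",
    "other_immune_infiltrate": "inflamed stroma",
    "angioinvasion": "invasive tumor",
    "lymphatics": "rest",
    "blood_vessel": "rest",
    "mostly_blood": "rest",
    "skin_adnexia": "rest",
    "mostly_tumor": "invasive tumor",
    "mostly_dcis": "in-situ tumor",
    "idc": "invasive tumor",
    "metaplasia_NOS": "rest",
    "undetermined": "rest",
    "mostly_fat": "rest",
    "mostly_mucoid_material": "rest",
    "background": "exclude",
}


def relabel(bcss_name):
    """Return the TIGER label value for a BCSS class name."""
    return TIGER_IDS[BCSS_TO_TIGER[bcss_name]]


if __name__ == "__main__":
    print(relabel("mostly_dcis"))  # 3
```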

From the original NuCLS dataset, we took original annotations from the "Corrected Single-rate dataset" and adapted them to TIGER as follows. Since NuCLS contains both bounding box and cell border annotations, we took the centroid of each annotation and used that as a reference to build a fixed-size bounding box annotation for lymphocytes and plasma cells for TIGER. Furthermore, we merged annotations of lymphocytes and plasma cells into a single class and excluded the rest, including cells originally labeled as "Unlabeled":

NuCLS cell label    TIGER cell label
Tumor               excluded
Mitotic figures     excluded
Fibroblast          excluded
Vasc. end.          excluded
Macrophage          excluded
Lymphocyte          Lymphocytes and plasma cells
Plasma cells        Lymphocytes and plasma cells
Neutrophils         excluded
Eosinophils         excluded
Myoepith.           excluded
Normal ep.          excluded
Apoptotic           excluded
Unlabeled           excluded
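The adaptation of NuCLS cell annotations described above (keep lymphocytes and plasma cells, take the centroid, rebuild a fixed-size box) can be sketched as follows. Several details are assumptions: the function name nucls_to_tiger is hypothetical, label strings follow the table on this page rather than the raw NuCLS files, the centroid is approximated by the center of the original bounding box, and the box side of 16 pixels assumes 8 microns at 0.5 um/px.

```python
# NuCLS labels that are kept and merged into the single TIGER cell class.
KEPT_LABELS = {"Lymphocyte", "Plasma cells"}


def nucls_to_tiger(annotations, box_px=16):
    """Convert NuCLS boxes into fixed-size TIGER boxes.

    annotations: iterable of (label, xmin, ymin, xmax, ymax) tuples.
    Returns a list of (xmin, ymin, xmax, ymax) boxes, one per kept cell.
    """
    half = box_px / 2.0
    boxes = []
    for label, xmin, ymin, xmax, ymax in annotations:
        if label not in KEPT_LABELS:
            continue  # every other cell type is excluded
        # Center of the original box, used as the reference point.
        cx = (xmin + xmax) / 2.0
        cy = (ymin + ymax) / 2.0
        boxes.append((cx - half, cy - half, cx + half, cy + half))
    return boxes


if __name__ == "__main__":
    cells = [("Lymphocyte", 10, 10, 20, 30), ("Tumor", 0, 0, 5, 5)]
    print(nucls_to_tiger(cells))  # [(7.0, 12.0, 23.0, 28.0)]
```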



2. WSIBULK: Whole-slide images with coarse manual annotation of the tumor bulk

We also release a set of n=93 WSI of both biopsies and surgical resections of TNBC and Her2+ breast cancer tissue from RUMC and JB. For each WSI, we provide coarse annotations of the "tumor bulk". With this, we mean the (coarse) manual annotation of one or more regions in the slide that contain invasive tumor cells. The following are examples of this type of annotations in a biopsy and in a surgical resection:


We define these as annotations of the "tumor bulk" because we made sure that all cancer cells belonging to the invasive part of the tumor are confined within the manually annotated regions. This also means that those regions contain tissue compartments with cell types other than invasive tumor cells, including in-situ tumors. While making annotations, no attention was paid to the exact distance between the annotated border and the invasive front of the tumor, so this distance may vary between cases. In any case, we made sure that tissue outside of these annotations does not contain tumor cells belonging to the invasive part of the tumor (although it may contain tumor cells belonging to in-situ lesions).


3. WSITILS: Whole-slide images with visual estimation of the TILs at slide level

Finally, we release a set of n=82 WSI of both biopsies and surgical resections of TNBC and Her2+ breast cancer tissue from RUMC and JB, where the visual assessment of the TILs has been done at WSI level. In this dataset, no manual annotations are provided, but a list of TIL values, one per slide, is stored in the tiger-til-scores-wsitils.csv file. TILs in this set of cases have been assessed by a board-certified breast pathologist, following the recommendation of the TILs working group [4]. Several cases contain comments indicating potential pitfalls in either visual TILs scoring or machine-based TILs scoring, some of which were addressed in [5].
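The per-slide TIL values can be loaded with the standard csv module. The sketch below takes the column names as parameters because the header of tiger-til-scores-wsitils.csv is not described on this page; the function name read_til_scores and the column names used in the example are hypothetical and should be replaced with those found in the file.

```python
import csv
import io


def read_til_scores(handle, id_column, score_column):
    """Return {slide id: TIL value} from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return {row[id_column]: float(row[score_column]) for row in reader}


if __name__ == "__main__":
    # In-memory stand-in for the real file; column names are assumptions.
    # For the real file:
    #     with open("tiger-til-scores-wsitils.csv", newline="") as handle:
    #         scores = read_til_scores(handle, "<id column>", "<score column>")
    example = io.StringIO("slide,score\nexample-slide-1,30\nexample-slide-2,5\n")
    print(read_til_scores(example, "slide", "score"))
```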


Private test data

There are also private test sets for both leaderboard 1 and leaderboard 2, which are not directly accessible by the participants. Each leaderboard has an experimental test set and a final test set, for a total of four test sets. All sets contain whole-slide images of breast cancer.

Experimental test set

The experimental test set for leaderboard 1 consists of 26 whole-slide images with 130 regions of interest manually annotated. Annotations include regions of tumor, tumor-associated stroma, and location of lymphocytes. These annotations will be used, during the challenge, to evaluate the performance in leaderboard 1 (see Evaluation section for details).

The experimental test set for leaderboard 2 consists of 200 whole-slide images of breast cancer cases. For each case, algorithms will have to compute a single "TILs score", which will be used to perform survival analysis during the challenge to assess its prognostic value. Cases are obtained from both clinical routines and from a phase-3 clinical trial. All slides are stained with H&E. In some cases, more than one tissue sample could be present on the slide, as found in several training cases from TCGA.

Multiple submissions are allowed to run algorithms on the experimental test set; see the Rules section for details.

Final test set

The final test set for leaderboard 1 consists of 38 whole-slide images with 149 regions of interest manually annotated. As in the experimental test set, annotations include regions of tumor, tumor-associated stroma, and location of lymphocytes. These annotations will be used, at the end of the challenge, to evaluate the performance in the final leaderboard 1 (see Evaluation section for details).

The final test set for leaderboard 2 consists of 707 whole-slide images of breast cancer cases. As for the experimental test set, for each case, algorithms will have to compute a single "TILs score", which will be used to perform survival analysis at the end of the challenge to assess its prognostic value. As in the experimental test set, cases are obtained from both clinical routines and from a phase-3 clinical trial. All slides are stained with H&E. In some cases, more than one tissue sample could be present on the slide, as found in several training cases from TCGA.

Participants are allowed to submit only once to run their algorithm on the final test set; see the Rules section for details.

Data license

All training data annotations (for TCGA-BRCA, RUMC, and JB slides) are released under a CC BY-NC 4.0 license.

Training slides from RUMC and JB are also released under a CC BY-NC 4.0 license.

Training slides from TCGA-BRCA are shared in the same format as slides from RUMC and JB (i.e., multiresolution TIF files at 0.5 um/px maximum resolution), which we have adapted from the original SVS format to make a uniform format throughout all training WSIs. The same rights applicable to original TCGA-BRCA slides apply to the shared slides. 


References

[1] M. Amgad et al., "Structured crowdsourcing enables convolutional segmentation of histology images", Bioinformatics, 2019 Sep 15;35(18):3461-3467.

[2] M. Amgad et al., "NuCLS: A scalable crowdsourcing, deep learning approach and dataset for nucleus classification, localization and segmentation", arXiv:2102.09099.

[3] Z. Swiderska-Chadaj et al., "Learning to detect lymphocytes in immunohistochemistry with deep learning", Medical Image Analysis, 2019;58:101547.

[4] R. Salgado et al., "The evaluation of tumor-infiltrating lymphocytes (TILs) in breast cancer: recommendations by an International TILs Working Group 2014", Annals of Oncology, 2015 Feb;26(2):259-71.

[5] Z. Kos et al., "Pitfalls in assessing stromal tumor infiltrating lymphocytes (sTILs) in breast cancer", npj Breast Cancer, 2020;6:17. https://doi.org/10.1038/s41523-020-0156-0.