Deep-learning model for evaluating histopathology of acute renal tubular injury

Kidney sample and criteria of acute kidney injury

This study was performed with the approval of the Ethical Committee of Jeonbuk National University Hospital. All methods were performed in accordance with the relevant guidelines and regulations. In a previous study, kidney samples were collected from a mouse model of cisplatin-induced acute tubular injury13. We re-analyzed kidney samples from male C57BL/6 mice (age: 8–9 weeks; weight: 20–25 g). The mice were divided into two groups: a control (buffer-treated) group and a cisplatin-treated group. Mice in the cisplatin group were intraperitoneally administered a single dose of cisplatin (Cis; 20 mg/kg; Sigma Chemical Co., St. Louis, MO, USA), whereas mice in the control group were intraperitoneally administered saline. Histological measurements were performed 72 h after treatment with cisplatin or the control buffer. To evaluate the function of the injured kidney, blood samples were collected three days after cisplatin administration to measure serum creatinine levels. Cisplatin-induced acute kidney injury was confirmed when the serum creatinine level exceeded 0.5 mg/dL.

Histopathology and assessment of tubular injury

Kidney tissue was fixed in formalin and embedded in paraffin blocks. Hematoxylin and eosin (H&E) staining was performed to assess renal tubular injury. Sections of 3-µm thickness were stained using the Periodic acid-Schiff (PAS) Stain Kit (Abcam, Cambridge, MA, USA; catalog no. 150680) in accordance with the manufacturer’s instructions12,14. Tubular injury was evaluated by three blinded observers who examined at least 20 cortical fields (×200 magnification) of the PAS-stained kidney sections. Tubular injury (necrotic tubules) was defined as tubular dilation, tubular atrophy, tubular cast formation, brush border loss, or thickening of the tubular basement membrane. Finally, the slides were digitized using a Motic Easy ScanPRO slide scanner (Motic Asia Corp., Kowloon, Hong Kong) at 40× magnification.

Datasets

Forty-five whole-slide images (WSIs) with 400 generated patches were used for development of the segmentation model. Ground-truth annotations were created using the SUPERVISELY polygon tool (supervisely.com), in which annotation polygons are drawn by placing waypoints along the boundaries of the objects that the model must segment. All annotations were reviewed by three nephrologists with extensive experience in nephropathology, who resolved disagreements through discussion. Four predefined classes were annotated: (1) glomerulus, (2) healthy tubules, (3) necrotic tubules, and (4) tubules with casts. Figure 1A, B and C show examples of the whole-slide images of H&E- and PAS-stained kidney sections obtained using a slide scanner and a randomly generated patch without annotations, respectively. The annotations, comprising four different structures (‘glomerulus,’ ‘healthy tubules,’ ‘necrotic tubules,’ and ‘tubules with cast’), are shown in Fig. 1D. In total, 27,478 annotations, along with their corresponding patches, were partitioned into a training subset comprising 80% of the data and a testing subset constituting the remaining 20%. Patches belonging to the same WSI were never split across the training and testing subsets, to ensure robust generalization of the segmentation models. Subsequently, to fine-tune the model hyperparameters, the training subset was further randomly split into training (80%) and validation (20%) subsets. This approach facilitated the refinement of model performance by iteratively adjusting the hyperparameters based on the validation set, while preserving the independence of the testing set for the final evaluation of model generalization (Table 1 and Figs. 2, 3 and 4).
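The WSI-level separation of patches can be reproduced with a grouped split, as in the minimal sketch below. It assumes each patch carries the identifier of its parent WSI; the package choice (scikit-learn) and the variable names (`patch_paths`, `wsi_ids`) are illustrative assumptions, not details taken from the original code.

```python
# Sketch of the WSI-grouped 80/20 test split followed by an 80/20 train/validation
# split, so that patches from one WSI never cross subsets (names are hypothetical).
from sklearn.model_selection import GroupShuffleSplit


def split_by_wsi(patch_paths, wsi_ids, seed=42):
    """Split patches so that all patches from one WSI stay in a single subset."""
    # 80% train+validation / 20% test, grouped by the parent WSI.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    trainval_idx, test_idx = next(outer.split(patch_paths, groups=wsi_ids))

    # Further split train+validation into 80% train / 20% validation, again by WSI.
    inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    tv_paths = [patch_paths[i] for i in trainval_idx]
    tv_groups = [wsi_ids[i] for i in trainval_idx]
    train_idx, val_idx = next(inner.split(tv_paths, groups=tv_groups))

    train = [tv_paths[i] for i in train_idx]
    val = [tv_paths[i] for i in val_idx]
    test = [patch_paths[i] for i in test_idx]
    return train, val, test
```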

Figure 1

(A, B) Whole-slide images of H&E (A)- and PAS (B)-stained kidney sections digitized using a slide scanner at 40× magnification. (C) Randomly generated patch without annotations; H&E and PAS staining show healthy tubules, necrotic tubules, and tubules with casts after cisplatin administration. (D) Randomly generated patch with annotations comprising four different structures: “glomerulus,” “healthy tubules,” “necrotic tubules,” and “tubule with cast”.

Table 1 Number of annotations in each class used in the training and test sets for the segmentation model.
Figure 2

Representative PAS-stained images, ground-truth masks, and predicted masks generated by the CNNs in the training set.

Figure 3

Representative PAS-stained images, ground-truth masks, and predicted masks generated by the CNNs in the validation set.

Figure 4

Representative PAS-stained images, ground-truth masks, and predicted masks generated by the CNNs in the test set.

Preprocessing

Because the pathology images were represented in an RGB data structure, the pixel values of the images ranged from 0 to 255. The pixels were scaled to a range between zero and one to avoid gradient explosions during the training phase. The patch images were resized to 512 × 512 pixels before being fed into the deep-learning model for segmentation. Three different augmentation methods were used to address overfitting resulting from the limited number of samples: horizontal flipping, rotation, and brightness/contrast adjustment. The third augmentation method was used because of varying degrees of slide brightness. Although we performed PAS staining for all histological slides using the same protocol, the degree of staining, and consequently the overall brightness of the specimen, may have differed among slides because the paraffin-embedded tissue was collected at various times. Thus, randomly adjusting the brightness/contrast of the patches can improve model performance. The augmentation methods were applied only to the training and validation datasets, not to the test set. All augmentation protocols were implemented using the Python Albumentations library15. We applied the three augmentation methods to 50% of the training images: (1) horizontal flipping, (2) rotation by a random angle between −90° and 90°, and (3) brightness/contrast adjustment.
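A minimal sketch of such an augmentation pipeline with Albumentations is shown below. The application probability (0.5), rotation range, and resize step follow the description above, but the exact parameter values used in the study are not specified in the text and are assumptions here.

```python
# Illustrative Albumentations pipeline for the training/validation patches.
# Parameter values (p=0.5, rotation limit of 90°) reflect the description above
# but are assumptions, not the exact configuration used in the study.
import albumentations as A

train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),              # (1) horizontal flip
    A.Rotate(limit=90, p=0.5),            # (2) random rotation in [-90°, 90°]
    A.RandomBrightnessContrast(p=0.5),    # (3) brightness/contrast adjustment
    A.Resize(height=512, width=512),      # resize before feeding the model
])

# Pixel values are then scaled from [0, 255] to [0, 1], e.g.:
# image = train_transform(image=image)["image"].astype("float32") / 255.0
```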

Proposed model framework

In this study, we proposed using DeepLabV316 as the segmentation framework. The DeepLabV3 encoder consists of Atrous Spatial Pyramid Pooling (ASPP) blocks, which allow it to maintain the Field-of-View (FOV) of the network layers and effectively capture contextual information at different scales. Moreover, DeepLabV3 uses dilated (atrous) convolution layers to obtain high-precision predictions while maintaining a wide FOV. This is particularly critical for histopathological imaging because of its fine-grained structures and textures. In addition, the dense structure of the images leads to an extreme foreground–background class-imbalance phenomenon. To overcome this challenge, we integrated an objective function that is the summation of the Dice Loss17 and Focal Loss18 functions. Unlike classification tasks, the outputs of segmentation problems are continuous rather than categorical. Thus, Dice Loss is particularly suitable for continuous maps because it measures the overlap between a prediction and the target. Furthermore, Dice Loss is independent of the statistical distribution of labels and penalizes misclassifications based on the overlap between the predicted regions and ground truths. The last part of our objective function is the Focal Loss, which was introduced in the RetinaNet18 deep-learning model to mitigate the class-imbalance problem in dense object detection. Finally, we integrated DeepLabV3 with a MobileNet backbone, which is designed for mobile and embedded devices, so that the developed model can be deployed on devices with limited computational resources in clinical environments.
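A minimal sketch of how such a model could be instantiated in PyTorch is given below, using the DeepLabV3/MobileNetV3 variant available in torchvision. The use of torchvision and the choice of five output channels (four tissue classes plus background) are assumptions for illustration; the text only states that DeepLabV3 with a MobileNet backbone was used.

```python
# Sketch: DeepLabV3 with a MobileNetV3-Large backbone via torchvision.
# The torchvision constructor and the 5 output channels (4 classes + background)
# are assumptions, not details confirmed by the study.
import torch
from torchvision.models.segmentation import deeplabv3_mobilenet_v3_large

model = deeplabv3_mobilenet_v3_large(weights=None, num_classes=5)

dummy_patch = torch.randn(1, 3, 512, 512)   # one RGB patch, as in preprocessing
logits = model(dummy_patch)["out"]          # shape: (1, 5, 512, 512)
```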

As presented in Table 1, our datasets were imbalanced, with the number of annotations for the glomerulus class being relatively small compared with the other classes. To address this issue, the objective function assigns a higher weight to examples from the minority class and a lower weight to those from the majority class. Mathematically, the objective function can be described by the following equation:

$$L(y,\overline{p}) = 1 - \frac{2y\overline{p} + 1}{y + \overline{p} + 1} - \left(y - \overline{p}\right)^{\gamma}\log_{b}\left(\overline{p}\right), \tag{1}$$

where \(y\), \(\overline{p}\), and \(\gamma\) correspond to the ground truth, the model prediction, and the parameter that controls the degree of focus on difficult examples, respectively. If \(\gamma\) is set to 0, the Focal Loss reduces to the standard cross-entropy loss. The proposed model was implemented using PyTorch19, and the loss function was obtained from the MONAI library19,20. The training procedure took approximately 4 h on an RTX 3090 graphics processing unit (GPU) with 24 GB of memory.
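The combined Dice + Focal objective is available in MONAI as DiceFocalLoss. The sketch below shows one way it could be configured; the specific arguments (softmax activation, one-hot target encoding, the gamma value) are assumptions rather than the exact settings reported in the study.

```python
# Sketch: combined Dice + Focal objective via MONAI (argument values are assumed).
import torch
from monai.losses import DiceFocalLoss

criterion = DiceFocalLoss(
    include_background=True,   # keep the background channel in the loss
    to_onehot_y=True,          # convert integer masks to one-hot targets
    softmax=True,              # apply softmax to the raw logits
    gamma=2.0,                 # focusing parameter of the Focal term (assumed)
)

logits = torch.randn(1, 5, 512, 512)             # model output (4 classes + background)
target = torch.randint(0, 5, (1, 1, 512, 512))   # integer ground-truth mask
loss = criterion(logits, target)
```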

Table 2 Quantitative segmentation performance for the four classes in the acute tubular injury images in the training, validation, and test sets.

Data analyses

Network performance was quantitatively assessed using instance-level DICE and IoU scores. In image segmentation, DICE and IoU are commonly used to evaluate the performance of segmentation algorithms; both measure the similarity between the predicted segmentation mask and the ground-truth mask. DICE is twice the area of the intersection of the two masks divided by the sum of their areas, whereas IoU is the ratio of the intersection of the predicted and human masks to their union. In addition, sensitivity, specificity, and accuracy were calculated. In this study, we used these metrics to evaluate the performance of the proposed system comprehensively.
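For clarity, the two overlap metrics can be computed per class from binary masks as in the sketch below (NumPy is assumed; a small constant guards against empty masks).

```python
# Sketch: per-class Dice and IoU between a predicted mask and a ground-truth mask.
import numpy as np


def dice_and_iou(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7):
    """pred and truth are binary masks of the same shape for a single class."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou
```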

Comparison with other models

In our comprehensive comparative analysis, we used U-Net21 and SegFormer22, two widely used neural network architectures. U-Net, a widely used convolutional neural network architecture for semantic segmentation, features a distinctive U-shaped design comprising contracting, bottleneck, and expansive paths. It excels at capturing intricate spatial features and is known for its success in medical image segmentation tasks. SegFormer, a state-of-the-art segmentation algorithm, adopts a transformer-based architecture23 with lightweight multilayer perceptron (MLP) decoders. It achieves state-of-the-art performance on the Cityscapes24 dataset, highlighting its effectiveness in diverse computer vision applications. We applied the standard architectures of U-Net and SegFormer without modification and used the same training, validation, and test subsets as for our model. The DICE and IoU values of U-Net and SegFormer were measured for comparison.
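The two baselines could be instantiated, for example, with the segmentation-models-pytorch and Hugging Face transformers packages, as sketched below. These packages, the encoder choice, and the checkpoint name are assumptions for illustration, since the text states only that the standard U-Net and SegFormer architectures were used.

```python
# Sketch: baseline models for comparison (packages, encoder, and checkpoint are assumptions).
import segmentation_models_pytorch as smp
from transformers import SegformerForSemanticSegmentation

NUM_CLASSES = 5  # 4 tissue classes + background (assumed)

# Standard U-Net (the ResNet-34 encoder is an illustrative choice).
unet = smp.Unet(encoder_name="resnet34", encoder_weights=None, classes=NUM_CLASSES)

# SegFormer with the lightweight MiT-B0 encoder (checkpoint name is illustrative).
segformer = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b0", num_labels=NUM_CLASSES
)
```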

Statistical analyses

We used one-way ANOVA (or t-tests) to compare DeepLabV3, U-Net, and SegFormer with respect to their respective Dice and IoU coefficients. P < 0.05 was considered statistically significant.
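A minimal sketch of such a comparison with SciPy is shown below. The per-image score arrays are randomly generated placeholders, not data from the study; the same procedure applies to the IoU coefficients.

```python
# Sketch: comparing per-image Dice scores across the three models with SciPy.
# The score arrays below are synthetic placeholders for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dice_deeplab = rng.uniform(0.80, 0.95, size=20)    # hypothetical per-image scores
dice_unet = rng.uniform(0.75, 0.92, size=20)
dice_segformer = rng.uniform(0.78, 0.93, size=20)

# One-way ANOVA across the three models.
f_stat, p_value = stats.f_oneway(dice_deeplab, dice_unet, dice_segformer)
print(f"One-way ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

# Pairwise comparison example (two-sample t-test).
t_stat, p_pair = stats.ttest_ind(dice_deeplab, dice_unet)
print(f"DeepLabV3 vs. U-Net: t = {t_stat:.3f}, p = {p_pair:.4f}")
```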