This study addresses the challenge of generating synthetic histopathology images that preserve tissue heterogeneity and fine morphological details. While generative AI methods have shown success in natural image domains, their direct application to histopathology has been limited by a tendency to produce homogeneous tissue samples. The proposed framework, termed HeteroTissue-Diffuse, is a latent diffusion model that synthesizes heterogeneous histopathology images through a novel conditioning mechanism. The framework is designed to scale to both annotated and unannotated datasets, enabling the generation of realistic, diverse, and annotated synthetic tissue slides.
HeteroTissue-Diffuse is based on a latent diffusion model, operating in a compressed latent space rather than raw pixels to improve efficiency and stability. Instead of depending solely on text prompts or abstract embeddings, the model incorporates raw tissue exemplars alongside spatial information, ensuring that generated samples retain clinically relevant details such as nuclear texture, staining variations, and cellular morphology.
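As a rough illustration of this latent-space setup, the sketch below shows a single training step of a conditional latent diffusion model in PyTorch. The class names, the U-Net signature, and the cosine noise schedule are assumptions chosen for brevity, not details taken from the paper.

```python
# A minimal sketch of one conditional latent-diffusion training step,
# assuming a pretrained VAE encoder and a conditional U-Net.
# Names, signatures, and the noise schedule are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiffusionStep(nn.Module):
    def __init__(self, vae_encoder: nn.Module, unet: nn.Module):
        super().__init__()
        self.vae_encoder = vae_encoder  # maps RGB patches to compressed latents
        self.unet = unet                # predicts noise from (latent, t, condition)

    def forward(self, image, t, cond):
        # Operate in the compressed latent space rather than raw pixels.
        with torch.no_grad():
            z = self.vae_encoder(image)
        noise = torch.randn_like(z)
        # Variance-preserving forward noising with a cosine schedule (assumed);
        # t is a tensor of timesteps in [0, 1], one per batch element.
        alpha_t = torch.cos(t * torch.pi / 2).view(-1, 1, 1, 1)
        z_noisy = alpha_t * z + (1.0 - alpha_t**2).sqrt() * noise
        # Standard denoising objective: predict the injected noise given the
        # conditioning signal (here, the fused semantic/visual condition).
        noise_pred = self.unet(z_noisy, t, cond)
        return F.mse_loss(noise_pred, noise)
```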
The key innovation is the dual-conditioning mechanism, which combines:
- Semantic segmentation maps, which specify the spatial layout of tissue types within each patch; and
- Visual crops, i.e., raw tissue exemplars that anchor fine-grained appearance such as staining and nuclear texture.
One way the two streams can be fused is sketched below.
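The following is a minimal, illustrative fusion of the two conditioning streams in PyTorch; the embedding dimension, the small crop encoder, and the broadcast-add combination are assumptions for exposition, not the paper's exact design.

```python
# Illustrative fusion of the semantic-map and visual-crop conditioning
# streams; the encoders and fusion scheme are assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn

class DualCondition(nn.Module):
    def __init__(self, n_classes: int, d_cond: int = 256):
        super().__init__()
        # Semantic stream: embed per-pixel class labels into feature channels.
        self.sem_embed = nn.Embedding(n_classes, d_cond)
        # Visual stream: a small CNN summarizing the raw tissue crop.
        self.crop_enc = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_cond, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, sem_map, crop):
        # sem_map: (B, H, W) integer labels; crop: (B, 3, h, w) raw exemplar.
        sem = self.sem_embed(sem_map).permute(0, 3, 1, 2)  # (B, d_cond, H, W)
        vis = self.crop_enc(crop)[:, :, None, None]        # (B, d_cond, 1, 1)
        # Broadcast-add the global appearance code onto the spatial layout.
        return sem + vis
```

In practice, the fused tensor would be handed to the denoising U-Net as its conditioning input, e.g. via concatenation with the latent or via cross-attention.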
To address the lack of large-scale pixel-wise annotations, a self-supervised extension was developed on the TCGA dataset, which comprises 11,765 whole-slide images. Patches from these slides were clustered into 100 tissue phenotypes using embeddings from a foundation model trained on histopathology. These clusters were then used to generate pseudo-semantic maps, enabling the diffusion model to be trained without manual annotation. This approach allows the framework to scale to massive unannotated datasets while preserving patient privacy.
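The clustering step can be sketched as follows, assuming a feature extractor `embed_fn` from a histopathology foundation model and scikit-learn's MiniBatchKMeans; the tiling layout and bookkeeping are hypothetical.

```python
# Sketch of deriving pseudo-semantic maps by clustering patch embeddings.
# `embed_fn` (a foundation-model feature extractor) is an assumed callable;
# the k=100 phenotype count follows the text above.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_pseudo_semantic_map(patches, embed_fn, kmeans):
    """patches: list of (row, col, image) tiles from one slide region."""
    feats = np.stack([embed_fn(img) for _, _, img in patches])
    labels = kmeans.predict(feats)  # phenotype id per patch, in [0, 100)
    n_rows = max(r for r, _, _ in patches) + 1
    n_cols = max(c for _, c, _ in patches) + 1
    pseudo_map = np.zeros((n_rows, n_cols), dtype=np.int64)
    for (r, c, _), lab in zip(patches, labels):
        pseudo_map[r, c] = lab
    # Each entry is a phenotype id; upsampling this grid to pixel resolution
    # yields a pseudo-semantic map usable as the conditioning signal.
    return pseudo_map

# Fitting step (done once, over embeddings sampled from all slides):
# kmeans = MiniBatchKMeans(n_clusters=100).fit(sampled_embeddings)
```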
Quantitative evaluation used the Fréchet Distance (FD) to measure fidelity. On Camelyon16, the proposed conditioning reduced FD from 430.1 to 72.0, roughly a six-fold improvement.
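For reference, the Fréchet Distance between two feature sets is computed with the standard Gaussian-fit formula, FD = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2(C_r C_f)^{1/2}). A minimal NumPy/SciPy implementation is sketched below; the feature extractor the authors used is not shown here.

```python
# Minimal Fréchet Distance between two feature sets, following the
# standard FID-style formula over Gaussian fits of the features.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        # Numerical noise can introduce tiny imaginary parts; drop them.
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```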
Consistent improvements were observed on the PANDA and TCGA datasets, where FD decreased by factors of two to three. For segmentation tasks, DeepLabv3+ models trained exclusively on synthetic data achieved test IoU scores of 0.71 (Camelyon16) and 0.95 (PANDA), compared to 0.72 and 0.96 for models trained on real data. This demonstrates that synthetic data can nearly match real data in downstream tasks.
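For clarity, the IoU metric behind these scores can be computed as follows; binary foreground/background masks are assumed for simplicity.

```python
# Intersection-over-Union for a predicted vs. reference segmentation mask.
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    # Empty union (both masks blank) is treated as a perfect match.
    return inter / union if union > 0 else 1.0
```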
Qualitative evaluation involved a blinded study in which a certified pathologist rated 120 randomly selected images, drawn from both real and synthetic datasets, without prior knowledge of their origin. The images were assessed on overall quality, structural detail, and nuclear morphology. Synthetic images generated with visual prompt conditioning received scores comparable to those of real images across all criteria.
The pathologist concluded that the two types of images were indistinguishable, noting that in some cases the synthetic images appeared to be of equal or higher quality: "The generated images tended to have equal or higher quality than the real images."
For downstream applications, segmentation models trained on synthetic datasets achieved nearly the same performance as models trained on real data.
On Camelyon16 and PANDA, the IoU gap between synthetic and real training was only 1-2 percentage points, demonstrating that synthetic datasets can serve not merely for augmentation but as a substitute for real patient data. Models trained on unconditioned synthetic data showed larger performance drops, confirming that semantic and visual conditioning are essential for clinically viable synthesis.
@InProceedings{Alfasly2025HeteroTissueDiffuse,
  author    = {Alfasly, Saghir and Uegami, Wataru and Hoq, MD Enamul and Alabtah, Ghazal and Tizhoosh, H.R.},
  title     = {Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology},
  booktitle = {Neural Information Processing Systems (NeurIPS)},
  month     = {December},
  year      = {2025}
}
KIMIA Lab, Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA