Redacting sensitive text from DICOM medical images in Python

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

This article is cross-posted and originally published in the Data Science @ Microsoft Medium blog.

Image by National Cancer Institute on Unsplash.

Publicly available medical imaging datasets contribute greatly to education, research, and machine learning (ML) developments in healthcare across academia, industry, and beyond. However, reliably and securely removing sensitive Personal Health Information (PHI) from medical images so that they can be shared remains a challenge.

While proprietary solutions do exist, they often depend on moving data into the cloud and are not always accessible to smaller labs and groups without sufficient budget or resources to spin up and maintain the services. But open source and on-prem–compatible solutions have the potential to empower a large user base and can be extended to support specific use cases.

In this article, I’m excited to introduce the first lossless, high-recall open source solution to redact text PHI burnt into DICOM medical images that can run both on-prem and in the cloud.

Challenges of existing methods

Existing methods have several challenges, including ones relating to file formats and reliable text detection, as I discuss below.

DICOM images versus standard image formats

Masking areas of an image to redact sensitive information is a topic of interest for many industries and scenarios. Tools and developments in this area, however, typically focus on images and documents in common data formats (e.g., PNG, JPEG, and TIFF).

Medical images (e.g., MRI, CT, and ultrasound) are often represented in the DICOM file format rather than in formats readily supported by tools and ML models designed to redact text from images. The DICOM file format was developed to standardize medical images collected across various equipment and ensure that important metadata (e.g., patient information and equipment settings) are contained in the same file along with the pixel data.

While medical images can be converted into more common image formats and then passed to Computer Vision and Natural Language Processing (NLP) models for redacting text, the round trip back to DICOM results in image quality loss. The pixel data in DICOM files can exist with different photometric interpretations that do not necessarily align one to one with how pixels are represented in the common formats supported by the ML models. Saving DICOM pixel data to a common image format and running it through ML models works without much issue; the main problem is writing the redacted image back to DICOM after the compression and loss of metadata that occurred during the initial conversion.
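To make the quality-loss point concrete, here is a minimal sketch of one way a round trip through an 8-bit format can lose information. It assumes 12-bit stored pixel values (0 to 4095) and a simple linear rescale; the function names are hypothetical, and the actual loss in any given conversion depends on the tools and settings used.

```python
def to_8bit(value, max_in=4095):
    """Rescale a stored pixel value to the 0-255 range of an 8-bit image."""
    return round(value * 255 / max_in)


def from_8bit(value, max_out=4095):
    """Rescale an 8-bit value back to the assumed original 12-bit range."""
    return round(value * max_out / 255)


# Assumed 12-bit pixel values, including three neighboring intensities
original = [0, 100, 101, 102, 2048, 4095]

# Convert to 8-bit and back, as a format round trip would
round_trip = [from_8bit(to_8bit(v)) for v in original]

print(original)
print(round_trip)
```

In this sketch, the neighboring values 100, 101, and 102 all collapse to the same value after the round trip, so detail that existed in the original cannot be restored.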

Comparing the same DICOM image before and after converting to and from PNG using dicomviewer.net.

Reliably detecting text PHI

Solutions built to redact text PHI from images identify all text in the image, detect which text is sensitive, and then mask the pixels covering the sensitive text. Named Entity Recognition (NER) is a common NLP approach used to classify text as belonging to sensitive categories (e.g., person and address).

While this can work well with documents that have paragraphs of text and plenty of context for the model to use, performance may be less reliable with limited text such as in the case of text PHI burnt into DICOM images.

Arguably more difficult is correctly recognizing text that was not included in, and is dissimilar to, the data used to train the NER model. One of the most common challenges in this space is detecting names the model was never trained to recognize. Unfortunately, this introduces bias and lowers recall, making it less likely that names from underrepresented groups are reliably redacted.
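For readers less familiar with the term, recall here is the fraction of truly sensitive text instances that actually get caught. The sketch below is a simplified, set-based illustration with made-up values; the function name and the sample data are hypothetical and are not taken from any evaluation of Presidio.

```python
def recall(expected_sensitive, detected):
    """Fraction of the expected sensitive items that were detected."""
    expected = set(expected_sensitive)
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(expected & set(detected)) / len(expected)


# Hypothetical example: the model catches the ID but misses both name parts
expected_sensitive = ["Jane", "Doe", "12345678"]
detected = ["12345678", "CHEST"]

score = recall(expected_sensitive, detected)
print(f"recall = {score:.2f}")
```

Every missed item is PHI left visible in the shared image, which is why recall matters more than precision for this task.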

Our approach

To address these challenges and provide an open-source solution, Presidio (an open-source Python library for data de-identification) now supports redacting text PHI from DICOM images.

Diagram summarizing the Presidio DICOM image redactor methodology.

The Presidio DICOM image redactor approach circumvents the image quality loss issue by directly altering the pixel values in the original DICOM file. The pixels to alter are determined from bounding boxes identified through a combination of OCR (to detect all text) and NER (to select only the sensitive text PHI).
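The idea of altering pixel values inside bounding boxes can be sketched in a few lines. This is a simplified illustration rather than Presidio's implementation: the image is a plain nested list, the `(left, top, width, height)` box layout is an assumption, and `redact_boxes` is a hypothetical helper.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (left, top, width, height) in pixels
Box = Tuple[int, int, int, int]


def redact_boxes(pixels: List[List[int]], boxes: List[Box], fill_value: int) -> List[List[int]]:
    """Return a copy of `pixels` with every box overwritten by `fill_value`."""
    redacted = [row[:] for row in pixels]  # leave the input untouched
    n_rows = len(redacted)
    n_cols = len(redacted[0]) if redacted else 0
    for left, top, width, height in boxes:
        # Clamp each box to the image bounds before writing
        for r in range(max(top, 0), min(top + height, n_rows)):
            for c in range(max(left, 0), min(left + width, n_cols)):
                redacted[r][c] = fill_value
    return redacted


# A tiny 4x6 "image" and one box standing in for detected sensitive text
image = [[10 * r + c for c in range(6)] for r in range(4)]
sensitive_boxes = [(1, 1, 3, 2)]

result = redact_boxes(image, sensitive_boxes, fill_value=0)
for row in result:
    print(row)
```

Only the pixels inside the box change; everything else keeps its original value, which is what allows the rest of the image to stay lossless.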

In addition, the approach addresses the low recall issue by using the DICOM metadata to create a custom recognizer per image. By using patient information already present in other fields of the metadata, we extend the capabilities of the NER to recognize names and other sensitive information that it would not recognize by default.
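A minimal sketch of the metadata idea follows. It assumes the metadata is available as a plain dictionary, that `PatientName` and `PatientID` are the fields of interest, and that OCR output is a list of words; `build_deny_list` and `flag_sensitive` are hypothetical helpers, not Presidio functions.

```python
import re
from typing import Dict, List, Set


def build_deny_list(metadata: Dict[str, str]) -> Set[str]:
    """Collect lower-cased tokens from metadata values to treat as sensitive."""
    tokens: Set[str] = set()
    for value in metadata.values():
        # DICOM person names separate components with "^" (e.g. "DOE^JANE")
        for token in re.split(r"[\^\s,]+", value):
            if len(token) > 1:  # skip empty strings and single characters
                tokens.add(token.lower())
    return tokens


def flag_sensitive(ocr_words: List[str], deny_list: Set[str]) -> List[str]:
    """Return the OCR words that match the per-image deny list."""
    return [w for w in ocr_words if w.strip(".,:;").lower() in deny_list]


# Hypothetical metadata and OCR output for a single image
metadata = {"PatientName": "DOE^JANE", "PatientID": "12345678"}
ocr_words = ["Jane", "Doe", "CHEST", "PA", "ID:", "12345678"]

flagged = flag_sensitive(ocr_words, build_deny_list(metadata))
print(flagged)
```

Because the deny list is built from each image's own metadata, a name is caught even when a general-purpose NER model has never seen anything like it.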

Before and after running the Presidio DICOM image redactor on a chest scan from The Cancer Imaging Archive (TCIA) Pseudo-PHI-DICOM dataset.

Redacting text from DICOM images with Presidio is simple and can be done on loaded DICOM images, on DICOM files, or on whole directories of DICOM files.

import pydicom
from presidio_image_redactor import DicomImageRedactorEngine

# Set input and output paths
input_path = "path/to/your/dicom/file.dcm"
output_dir = "./output"

# Initialize the engine
engine = DicomImageRedactorEngine()

# Option 1: Redact from a loaded DICOM image
dicom_image = pydicom.dcmread(input_path)
redacted_dicom_image = engine.redact(dicom_image, fill="contrast")

# Option 2: Redact from DICOM file
engine.redact_from_file(input_path, output_dir, fill="contrast")

# Option 3: Redact from directory
engine.redact_from_directory("path/to/your/dicom", output_dir, fill="contrast")

The box fill can also be set to blend in with the background (fill="background").

Setting the box fill to the background color.
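To give a feel for the difference between the two fill settings, here is a rough sketch of how a fill value might be chosen. It is a hypothetical heuristic, not Presidio's actual implementation: it assumes 8-bit grayscale values, takes the most common value on the region's border as the background, and inverts that value for the contrast option.

```python
from collections import Counter
from typing import List


def pick_fill(region: List[List[int]], fill: str = "contrast", max_value: int = 255) -> int:
    """Choose a fill value for a box from the pixels in its region (assumed heuristic)."""
    # Border pixels: first row, last row, first column, last column
    border = region[0] + region[-1] + [row[0] for row in region] + [row[-1] for row in region]
    background = Counter(border).most_common(1)[0][0]
    if fill == "background":
        return background  # box blends in with its surroundings
    if fill == "contrast":
        return max_value - background  # box stands out against its surroundings
    raise ValueError(f"Unsupported fill option: {fill}")


# A small dark region with bright "text" pixels in the middle
region = [
    [0, 0, 0, 0],
    [0, 200, 200, 0],
    [0, 0, 0, 0],
]

print(pick_fill(region, fill="background"))
print(pick_fill(region, fill="contrast"))
```

A contrasting fill makes it obvious where redaction happened, while a background fill gives a cleaner image for downstream viewing.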

To try the code yourself with more sample data, see the demo notebook in the Presidio repo.

Conclusion

With the Presidio DICOM image redactor engine, there is now an open source and easy-to-use solution for de-identifying text PHI burnt into DICOM images. Not only does this maintain DICOM image quality and redact text PHI with high recall, but it can also be run completely on-prem.

If this seems useful for you or your group, or if you end up using Presidio, I would love to hear from you in the comments!

Although we are confident in our approach, the engine was tested with a small dataset given the limited availability of public DICOM data with burnt-in text PHI. If you have suggestions to improve this module, I highly encourage you to contribute to Presidio in the spirit of open source.

Dataset citation

The DICOM image redactor engine was developed using The Cancer Imaging Archive (TCIA) Pseudo-PHI-DICOM dataset for the evaluation of medical image de-identification.

Source: Rutherford, M., Mun, S.K., Levine, B., Bennett, W.C., Smith, K., Farmer, P., Jarosz, J., Wagner, U., Farahani, K., Prior, F. (2021). A DICOM dataset for evaluation of medical image de-identification (Pseudo-PHI-DICOM-Data) [Data set]. The Cancer Imaging Archive. DOI: https://doi.org/10.7937/s17z-r072

Author note and acknowledgment

Nile Wilson works as a Data Scientist 2 in the Microsoft Industry Solutions Engineering (ISE) Data & AI Healthcare and Life Sciences (HLS) industry team. With a background in biomedical engineering, Nile has combined her experiences in various customer-facing ML projects, her experience working with DICOM images in grad school, and support from colleagues to develop this open source tooling to de-identify text PHI burnt into DICOM images.

Special thanks to Guy Bertental for developing the direct pixel manipulation portion of the code, to Yousef Al-Kofahi and Karol Zak for advice on working with DICOM pixel data, and to Coby Peled, Sharon Hart, and Omri Mendels for their feedback and support in contributing to Presidio.