Principal goal: evaluating and implementing different techniques for detecting, recognising and eliminating text containing personal data in medical images.
Medical images can contain personal information “burned into” the pixel data. Anonymisation tools sometimes overlook this problem, and those that do address it usually require user input (such as marking the zones that contain the personal information).
The objective of the project is to evaluate possible solutions for the automatic handling of textual data in medical images. This process can be divided into several steps:
- Detecting the presence of text in the image.
- Localising the text.
- Recognising the text.
- Evaluating the risk (in terms of confidentiality) and deciding whether to remove it or not.
- Removing risky text.
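The steps above can be chained into a single pipeline. The following is only a minimal sketch under stated assumptions: the image is a plain list of rows of grey values, and `TextRegion`, `blank_region`, `deidentify` and the three callables (`locate`, `recognise`, `is_risky`) are hypothetical names introduced for illustration, not part of any existing tool. Real detectors and recognisers would replace the stub lambdas at the end.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # assumption: grayscale image as rows of pixel values


@dataclass
class TextRegion:
    """Hypothetical bounding box of one piece of text (bottom/right exclusive)."""
    top: int
    left: int
    bottom: int
    right: int
    text: str = ""


def blank_region(image: Image, region: TextRegion, fill: int = 0) -> None:
    """Overwrite the pixels of one region in place with a constant value."""
    for row in range(region.top, region.bottom):
        for col in range(region.left, region.right):
            image[row][col] = fill


def deidentify(
    image: Image,
    locate: Callable[[Image], List[TextRegion]],
    recognise: Callable[[Image, TextRegion], str],
    is_risky: Callable[[str], bool],
    fill: int = 0,
) -> List[TextRegion]:
    """Run the steps in order and return the regions that were removed."""
    removed = []
    # Detection and localisation: an empty list means no text was found.
    for region in locate(image):
        region.text = recognise(image, region)   # recognition
        if is_risky(region.text):                # risk evaluation
            blank_region(image, region, fill)    # removal
            removed.append(region)
    return removed


# Stub components, for illustration only.
image = [[255] * 6 for _ in range(4)]
removed = deidentify(
    image,
    locate=lambda img: [TextRegion(0, 0, 2, 3), TextRegion(2, 3, 4, 6)],
    recognise=lambda img, r: "DOE JOHN" if r.top == 0 else "L",
    is_risky=lambda text: len(text) > 1,
)
```

Passing the steps in as functions keeps the candidate techniques for each step interchangeable, which is convenient when the aim is to compare them.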
Several approaches exist for each of these steps, and they need to be evaluated to determine which are best suited to medical images.
Approaches include the use of artificial neural networks (and other machine learning methods) and wavelets.
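To illustrate how wavelets might help with the detection step, one common idea is that text produces many sharp edges, so the detail (high-frequency) coefficients of a wavelet transform tend to carry more energy in text regions than in smooth background. The sketch below computes that energy for a one-level 2D Haar transform on a small block. The function name `haar_detail_energy` and the use of mean squared detail coefficients as a feature are assumptions made for this example, not a method prescribed by the project.

```python
from typing import List


def haar_detail_energy(block: List[List[float]]) -> float:
    """Mean squared detail coefficient of a one-level 2D Haar transform.

    The block must have an even number of rows and columns.
    """
    rows, cols = len(block), len(block[0])
    energy = 0.0
    count = 0
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            a, b = block[r][c], block[r][c + 1]
            d, e = block[r + 1][c], block[r + 1][c + 1]
            col_diff = (a - b + d - e) / 2.0   # change between left and right columns
            row_diff = (a + b - d - e) / 2.0   # change between top and bottom rows
            diag_diff = (a - b - d + e) / 2.0  # diagonal change
            energy += col_diff ** 2 + row_diff ** 2 + diag_diff ** 2
            count += 3
    return energy / count


flat = [[100.0] * 4 for _ in range(4)]                   # smooth background
striped = [[0.0, 255.0, 0.0, 255.0] for _ in range(4)]   # stroke-like pattern

flat_energy = haar_detail_energy(flat)
striped_energy = haar_detail_energy(striped)
```

A detector built on this idea would slide such a block over the image and flag blocks whose energy exceeds a threshold. The threshold would have to be tuned on medical images, since anatomical structures also contain edges.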
A major challenge is deciding whether a given text is personal data; the difficulty of this task depends on the additional information (outside the pixel data) that is available.
For images in DICOM (Digital Imaging and Communications in Medicine) format, the information in the header can be used to decide whether a given text should be removed. Likewise, images generated by the same device usually show personal information in the same zone, so knowing that zone would simplify the problem.
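As an illustration of how header information could drive the decision, the sketch below compares recognised text with values taken from a DICOM header. It assumes the header has already been read into a plain dictionary (no particular DICOM library is used), and the helper names `tokens` and `matches_header` are hypothetical. `PatientName`, `PatientID` and `PatientBirthDate` are standard DICOM attribute keywords, and `^` is the component separator used in DICOM person names.

```python
import re
from typing import Dict, Set


def tokens(value: str) -> Set[str]:
    """Upper-case alphanumeric tokens of at least two characters.

    Splitting on every non-alphanumeric character also handles the '^'
    separator of DICOM person names.
    """
    parts = re.split(r"[^A-Za-z0-9]+", value.upper())
    return {t for t in parts if len(t) >= 2}


def matches_header(text: str, header: Dict[str, str]) -> bool:
    """True if the recognised text shares a token with any header value."""
    text_tokens = tokens(text)
    return any(text_tokens & tokens(value) for value in header.values())


# Assumption: identifying header values already extracted into a dictionary.
header = {
    "PatientName": "DOE^JOHN",
    "PatientID": "123456",
    "PatientBirthDate": "19700101",
}

burned_in = matches_header("Doe, John 123456", header)
annotation = matches_header("LEFT LATERAL", header)
```

Exact token matching is deliberately simple: it would miss dates rendered in a different format from the header and text containing recognition errors, so a practical version would probably need approximate matching.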
In the general case, however, where medical images come from different sources and in different formats, there is no easy way to establish a criterion for deciding which text should be removed. There has been work on anonymising free-text medical data, but these methods normally rely on syntactic and lexical information that is not available in images. The objective here is to investigate whether rules can be established for deciding when text should be removed in the general case.