Radiology Report Generation Method Based on Multi-scale Feature Parsing

2024-01-30 02:42, WANG Rui, LIANG Jianguo, HUA Rong

Acta Scientiarum Naturalium Universitatis Pekinensis, 2024, Issue 1



Acta Scientiarum Naturalium Universitatis Pekinensis, Vol. 60, No. 1 (Jan. 2024)

doi: 10.13209/j.0479-8023.2023.076

Received: 2023-05-18; Revised: 2023-07-31


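WANG Rui, LIANG Jianguo, HUA Rong†

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590; † Corresponding author, E-mail: huarong@sdust.edu.cn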

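When deep learning models are used to generate radiology reports automatically, the extreme imbalance of the data makes it hard for current models to recognize the features of abnormal regions, which leads to misjudged and missed diseases. To improve the model's ability to recognize diseases and to raise the quality of radiology reports, we propose a multi-scale feature parsing Transformer (MFPT) model for radiology report generation. A key feature enhanced attention (KFEA) module is built to strengthen the use of key features; a multi-modal feature fusion (MFF) module is designed to promote the fusion of semantic and visual features and to mitigate the effect of the gap between them; and the role of the stage awareness (SA) module in optimizing primary features for the radiology report task is explored. Finally, comparison experiments against current mainstream models on the popular radiology report dataset IU X-Ray show that the proposed model achieves state-of-the-art results.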

attention mechanism; feature fusion; radiology report; Transformer; image-text generation

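Image captioning is a cross-modal task involving natural language processing and computer vision; its goal is to make a computer understand the content of an image and generate a corresponding description[1–9]. In the model proposed by Huang et al.[8], the attention module improves performance by measuring the relevance between the attention result and the query, and Wang et al.[9] use multi-feature pre-fusion to reduce the feature loss caused by feature bias. Radiology report generation derives from image captioning. Figure 1 shows a radiology report from the IU X-Ray dataset, which contains many key words and primary words. Key words consist mainly of medical terms, and primary words consist mainly of the adjectives that describe the key words. How well the model parses key features and primary features directly affects the prediction accuracy of these two kinds of words and thus the quality of the report. Compared with ordinary images, radiology images are highly similar to one another and the data are imbalanced, so existing image captioning models do not transfer well to automatic radiology report generation, and researchers have begun to study methods tailored to this task[10–16].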

Single-underlined words are primary words; bold words are key words.

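Liu et al.[10] first predict medical words and then conditionally generate the words closely related to them to form the report. This approach depends heavily on the accuracy of the predicted medical words, which limits it. Jing et al.[11] propose using tag features to mitigate the loss of key and primary features, but they neglect learning the number of tags, which introduces a new feature bias into the model. Zhang et al.[12] build a medical knowledge graph so that the model can learn the associations between diseases, improving the prediction accuracy of key words. However, they do not use a Transformer as the decoder and do not consider the generation of primary words, so the dependencies between key words and primary words are too weak for the task. Song et al.[13] improve the prediction accuracy of key and primary words with a new approach: the input image is compared with normal images, and the model learns the difference between them to improve disease prediction. However, the features of the randomly sampled normal images inevitably differ from the normal-image features that correspond to the input, which weakens the model's ability to judge diseases and lowers overall report quality.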

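A model's ability to parse the multi-modal relationship between semantic and visual features directly determines whether it can generate a well-structured report. Recently, multi-modal models for radiology report generation have made some progress. Chen et al.[14] propose a cross-modal memory module that strengthens the learning of the mapping between text and images. Chen et al.[15] design a memory module and a memory-based normalization layer to learn multi-modal feature dependencies across different stages, enabling the model to generate well-structured radiology reports. These studies do not optimize key and primary features, so the model understands image features insufficiently, which harms disease prediction. You et al.[16] propose the Align Hierarchical Attention (AHA) module to align tag semantic features with visual features, but they do not address the gap between word-level semantic features and visual features, so the generated reports fall short of expectations.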

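To strengthen the model's feature parsing ability and mitigate the gap between multi-modal features, this paper proposes a multi-scale feature parsing Transformer (MFPT) model for radiology report generation, with the aim of reducing misjudged and missed diseases and producing higher-quality reports. We build a KFEA (key feature enhanced attention) module that reinforces key features and improves the prediction accuracy of key words. We design an MFF (multi-modal feature fusion) module that promotes the fusion of semantic and visual features and adjusts the structure of the report. We also explore how the SA (stage awareness) module strengthens primary features in the medical report domain.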

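1 Model Design and Implementation

This study uses a sequence-to-sequence (Seq2Seq) approach to generate radiology reports automatically. A visual extractor first extracts radiology image features as the source sequence, which is then encoded and decoded to produce the final report:

$$X = \{x_1, x_2, \ldots, x_S\}, \quad x_i \in \mathbb{R}^{d}, \tag{1}$$

$$Y = \{y_1, y_2, \ldots, y_T\}, \quad y_t \in \mathbb{V}, \tag{2}$$

where $x_i$ is the $i$-th extracted patch feature, $d$ is the size of the mapped feature vector, $y_t$ denotes the tensor corresponding to the $t$-th word of the generated sequence, and $\mathbb{V}$ denotes the set of all words in the generated reports.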

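1.1 Model Structure

As shown in Figure 2, the proposed multi-scale feature parsing Transformer (MFPT) model consists of three parts: a visual extractor, an encoder and a decoder. Our contributions lie mainly in the KFEA and MFF modules in the decoder and the SA module in the encoder.

Figure 2 Overall structure of the multi-scale feature parsing Transformer model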

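Visual extractor. We use a pretrained ResNet-101 model[17] as the visual extractor to extract visual features, as in Eq. (3):

$$X = f_{ve}(\mathrm{Img}), \tag{3}$$

where $X$ is the sequence of extracted patch features, $f_{ve}$ denotes the extraction operation of the visual extractor, and $\mathrm{Img}$ denotes the tensor space of the input radiology image.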

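Encoder. Our encoder differs from that of the vanilla Transformer: we add a stage awareness module that integrates the primary features from the different encoder layers. The module is expressed as:

where $f_e$ denotes the encoder, $Z_l$ denotes the output of the $l$-th encoder layer, $Z$ denotes the final output of the encoder, and SA denotes the operation of the stage awareness module.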

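Decoder. The decoder adds the KFEA and MFF modules to the vanilla Transformer decoder. Given the encoder output $Z = \{z_1, z_2, \ldots, z_S\}$ and the word sequence predicted so far, the decoder generates the next word. The computation is:

where $y_t$ denotes the tensor corresponding to the word sequence predicted at time step $t$, $f_d$ denotes the decoder, and $H_{\mathrm{CA}}$ denotes the output of the cross-attention module.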

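Loss function. We train the model with the same loss function as Chen et al.[15], namely

$$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{T} \log p\left(y_t \mid y_1, \ldots, y_{t-1}, \mathrm{Img}; \theta\right), \tag{7}$$

where $\theta$ denotes the parameters of the model, $y_t$ is the $t$-th word of a report of length $T$, and $\mathrm{Img}$ is the input image.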
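The training objective can be illustrated with a small sketch. It assumes the loss of Chen et al.[15] is the usual teacher-forced maximum-likelihood objective, that is, minimising the summed negative log-probability of each reference word given the words before it. The function name `sequence_nll` and the toy probabilities are hypothetical and serve only as an illustration.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Negative log-likelihood of one report under teacher forcing.

    step_probs[t] is the model's probability distribution over the
    vocabulary at time step t (already conditioned on the image and on
    the reference words before t); target_ids[t] is the reference word.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution per target word is required")
    nll = 0.0
    for probs, target in zip(step_probs, target_ids):
        # Accumulate -log p(y_t | y_1..y_{t-1}, image).
        nll -= math.log(probs[target])
    return nll


# Toy example (assumed values): vocabulary of 3 words, report of 2 words.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = sequence_nll(probs, [0, 1])
```

Minimising this quantity over the training reports is equivalent to maximising the log-likelihood of the reference reports.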

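1.2 Stage Awareness (SA) Module

Inspired by Zeng et al.[18], we use the SA module to reduce the loss of primary features during training. The module runs in two steps. The first step assigns different weights to the different encoder layers, as in Eq. (8):

where $\mu$ denotes an adjustable weight factor.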
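The idea of weighting encoder layers can be sketched in code. The sketch below is only one plausible reading and is not the authors' formulation: it assumes that the per-layer outputs are combined by a weighted sum and that the result is added to the last layer's output after scaling by the adjustable factor. The function name `stage_aware_merge`, the per-layer weights and the residual form are all assumptions.

```python
def stage_aware_merge(layer_outputs, layer_weights, mu=0.2):
    """Merge encoder layer outputs (hypothetical form).

    layer_outputs: list of L feature vectors, one per encoder layer,
                   each a list of floats of equal length.
    layer_weights: list of L scalars, one weight per layer.
    mu:            adjustable factor scaling the merged features.
    Returns last_layer + mu * sum_l(w_l * layer_l).
    """
    if len(layer_outputs) != len(layer_weights):
        raise ValueError("need exactly one weight per layer")
    dim = len(layer_outputs[0])
    merged = [0.0] * dim
    for out, w in zip(layer_outputs, layer_weights):
        for i in range(dim):
            merged[i] += w * out[i]
    last = layer_outputs[-1]
    return [last[i] + mu * merged[i] for i in range(dim)]


# Toy example (assumed values): three layers, two feature dimensions.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = stage_aware_merge(layers, [0.5, 0.3, 0.2], mu=0.2)
```

Under this reading, a small factor keeps the top-layer representation dominant while still letting lower-layer (primary) features contribute.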

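1.3 Multi-modal Feature Fusion (MFF) Module

The conventional feature fusion method is shown in Figure 3. It uses the semantic features alone as the query and filters the visual features directly. Specifically, the text sequence of the target report is first fed to the embedding layer to form semantic features; the Masked Multi-Head Attention module then applies attention to the semantic features, followed by an Add & Layer Norm layer; finally, the result is fed as the query to the cross-attention module, where it attends over the encoded visual features to achieve multi-modal fusion. This method ignores the gap between visual and semantic features, and the fusion has to bridge too large a gap in a single computation, which limits the model's reasoning ability.

Figure 3 Conventional feature fusion method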
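The cross-attention step of the conventional fusion can be sketched as standard scaled dot-product attention, with the semantic features as queries and the encoded visual features as keys and values. This is a minimal single-head sketch without the learned projections, masking or layer normalisation of a full Transformer decoder; the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries:      semantic (text) features, shape [T][d]
    keys, values: encoded visual features, shape [S][d]
    Returns one fused vector per query, shape [T][d].
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused


# Toy example (assumed values): one text query, two visual features.
text_queries = [[1.0, 0.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(text_queries, visual, visual)
```

Here the query is built from text alone, so the attention weights depend entirely on how well a purely semantic vector happens to align with the visual keys, which is the gap the pre-fusion design aims to narrow.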

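As shown in Figure 4, we propose a multi-modal feature fusion module to address this problem. The module uses both the semantic features and the key visual features as the query, helping the model link visual and semantic features. The key visual features are extracted first; the MFF module then pre-fuses them with the semantic features before the conventional fusion computation is performed. The computation can be written as

where $Z_m$ denotes the key visual features, Mean denotes the averaging operation, $H_1$ denotes the hidden state inside the MFF module, the equations also involve three trainable projection matrices, $\sigma$ denotes the Sigmoid activation function, $\odot$ denotes the Hadamard product, $H_2$ denotes the output of the MFF module, $H_3$ denotes the query of the cross-attention module, and MMSA denotes the Masked Multi-Head Attention module. $H_3$ is then fed to the cross-attention module to produce the current hidden state $H$.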
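A gated pre-fusion in the spirit of this description can be sketched as follows. The exact equations are not reproduced here, so everything below is an assumption: the key visual feature is taken as the mean of the encoded visual features, a Sigmoid gate is computed from projected semantic and visual vectors, and the gated visual feature is added to the semantic vector through a Hadamard product. The sketch uses two projection matrices (`w_h`, `w_z`) rather than three, and all names are hypothetical.

```python
import math


def mean_pool(features):
    """Average a list of vectors position-wise (the Mean operation)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_prefusion(h1, visual, w_h, w_z):
    """Hypothetical gated pre-fusion of one semantic vector with Z_m.

    gate = sigmoid(w_h @ h1 + w_z @ z_m)
    out  = h1 + gate * z_m   (element-wise, i.e. Hadamard, product)
    """
    z_m = mean_pool(visual)
    gate = [sigmoid(a + b)
            for a, b in zip(matvec(w_h, h1), matvec(w_z, z_m))]
    return [h + g * z for h, g, z in zip(h1, gate, z_m)]


# Toy example (assumed values): identity projections, two visual features.
identity = [[1.0, 0.0], [0.0, 1.0]]
h2 = gated_prefusion([0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], identity, identity)
```

The gate lets each feature dimension decide how much visual information enters the query before cross-attention, so the query is no longer purely semantic.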

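1.4 Key Feature Enhanced Attention (KFEA) Module

Most previous studies predict the radiology report directly from the hidden state output by the cross-attention module. Because key words occur less often than ordinary words, the model fails to generate suitable key words. During cross-attention, the visual features struggle to satisfy the queries for key words, so some key features are inevitably lost. We use the KFEA module, which applies attention over the contextual features using global features and salient features, thereby improving the use of key features. The computation proceeds as follows: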

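2 Experimental Setup and Results Analysis

2.1 Dataset and Evaluation Metrics

We use the widely used public radiology dataset IU X-Ray[19], which contains 7470 chest X-ray images and 3955 reports. Following mainstream models, we split the data into training, validation and test sets at a ratio of 7:1:2. Following common practice, we remove images without reports, convert all letters to lowercase and strip special symbols and similar noise. The model is evaluated with BLEU[20], METEOR[21] and ROUGE[22], where BN denotes the BLEU metric with n-gram order N.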
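To make the BN notation concrete, the following sketch computes a single-reference, sentence-level BLEU-N as the geometric mean of clipped 1- to N-gram precisions multiplied by the brevity penalty. It is a simplified illustration without smoothing; published results are normally computed at corpus level with a standard evaluation toolkit, and the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's count clipped to its count in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bleu(candidate, reference, max_n):
    """Single-reference BLEU-N without smoothing."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalise candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_mean)


ref = "the heart size is normal".split()
hyp = "the heart is normal".split()
b1 = bleu(hyp, ref, 1)
b2 = bleu(hyp, ref, 2)
```

Higher orders reward longer exact word sequences, which is why B4 is usually much lower than B1 for generated reports.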

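2.2 Experimental Settings

We follow the experimental settings of current mainstream models and take the vanilla Transformer as the Base model. A ResNet-101 model pretrained on ImageNet-1K extracts grid features; multi-head attention uses 8 heads and the embedding dimension is 512. The IU X-Ray dataset contains a frontal and a lateral image for each patient; the two are combined as the input to the visual extractor. The learning-rate decay per epoch is set to 0.8, the beam size to 3, the learning rates of the visual extractor and of the other parameters to 1×10⁻⁴ and 5×10⁻⁵ respectively, and the batch size to 32.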
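The settings above can be collected in a small configuration sketch. The key names are hypothetical, and the schedule assumes that the decay of 0.8 is applied multiplicatively once per epoch to both learning rates, which the text does not state explicitly.

```python
def lr_at_epoch(base_lr, epoch, decay=0.8):
    """Learning rate after `epoch` completed epochs of multiplicative decay."""
    return base_lr * decay ** epoch


# Hyperparameters as reported in the text; key names are illustrative.
config = {
    "num_heads": 8,
    "embed_dim": 512,
    "beam_size": 3,
    "batch_size": 32,
    "lr_visual_extractor": 1e-4,
    "lr_other": 5e-5,
    "lr_decay_per_epoch": 0.8,
}

# (visual extractor lr, other parameters lr) for the first three epochs.
schedule = [
    (lr_at_epoch(config["lr_visual_extractor"], e, config["lr_decay_per_epoch"]),
     lr_at_epoch(config["lr_other"], e, config["lr_decay_per_epoch"]))
    for e in range(3)
]
```

Using a larger learning rate for the visual extractor than for the remaining parameters keeps the two parameter groups on separate schedules while sharing the same decay factor.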

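2.3 Comparison Experiments

We compare the proposed model with the current mainstream models AdaAtt[23], M2Transformer[24], CMCL[25], R2Gen[15], CMN[14], PPKED[26], CA[13] and GSKET[27]; the results are listed in Table 1. Compared with the image captioning models AdaAtt[23] and M2Transformer[24], our model improves markedly, indicating that medical report generation needs dedicated models. Compared with Refs. [13–15, 25–27], our model outperforms current mainstream models on every metric, indicating that reinforcing key and primary features and pre-fusing multi-modal features help generate more reliable radiology reports.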

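2.4 Ablation Study

An ablation study compares the Base model with the proposed model; the results are listed in Table 2. Compared with the Base model, adding each module improves performance, and any combination of the three modules also performs well. The KFEA module brings the largest gain, showing that, without introducing tag features, the proposed optimization for key words works well and offers a new way to mitigate misjudged and missed diseases.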

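Table 1 Comparison of natural language generation metrics between the proposed model and existing models (%)

Note: bold numbers indicate the best performance; the same applies below.

Table 2 Comparison between the Base model and the proposed model (%)

On the IU X-Ray dataset, we run an ablation on the hyperparameter newly introduced in the SA module; the results are listed in Table 3. The combined metrics are best when μ = 0.2, and all metrics drop slightly once μ exceeds 0.2.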

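2.5 Complexity Analysis

We analyze the complexity of the proposed model on the IU X-Ray dataset; the results are listed in Table 4. Compared with the R2Gen model[15], the proposed model achieves better results with fewer parameters.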

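2.6 Analysis of Experimental Results

To better understand the quality of the generated reports, we analyze one medical case with a frontal and a lateral chest image; the results are shown in Figure 5. Compared with the Base model, our model generates more, and more accurate, key words and primary words, captures the key and primary features needed for a radiology report, and produces descriptions largely consistent with the ground truth. In addition, the reports generated by our model are clearly longer than those generated by the Base model, suggesting that the multi-modal fusion strategy is more effective.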

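3 Conclusion

This study proposes a multi-scale feature parsing Transformer (MFPT) model. The model combines a newly designed key feature enhanced attention module, a multi-modal feature fusion module and an integrated stage awareness module to generate radiology reports. It improves the parsing of key and primary features while promoting the fusion of semantic and visual features, which goes some way toward solving the problem of misreported and missed findings and yields more reasonably structured radiology reports. However, the model does not learn the relationships between diseases, which is a limitation. Future work will try to build and exploit a disease-relation knowledge graph to strengthen the model's understanding of those relationships.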

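Table 3 Ablation results for the weighting factor μ in the SA module (%)

Table 4 Complexity analysis results

Ground-truth denotes the manually written report, Base the report generated by the base model, and Ours the report generated by our model. Single-underlined words are primary words; bold words are key words.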

[1] Chen X, Fang H, Lin T Y, et al. Microsoft COCO captions: data collection and evaluation server. Computer Science, 2015, 5: 1–7

[2] Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, 2018: 6077–6086

[3] Vinyals O, Toshev A, Bengio S, et al. Show and tell: a neural image caption generator // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, 2015: 3156–3164

[4] Xu K, Ba J, Kiros R, et al. Show, attend and tell: neural image caption generation with visual attention // Proceedings of the International Conference on Machine Learning. Lille, 2015: 2048–2057

[5] Liu Fenglin, Liu Yuanxin, Ren Xuancheng, et al. Aligning visual regions and textual concepts for semantic-grounded image representations // Proceedings of the Annual Conference on Neural Information Processing Systems. Vancouver, 2019: 6847–6857

[6] Liu Fenglin, Ren Xuancheng, Liu Yuanxin, et al. Exploring and distilling cross-modal information for image captioning // Proceedings of the International Joint Conference on Artificial Intelligence. Macau, 2019: 5095–5101

[7] Liu Fenglin, Ren Xuancheng, Liu Yuanxin, et al. simNet: stepwise image-topic merging network for generating detailed and comprehensive image captions // Proceedings of the Conference on Empirical Methods in Natural Language Processing. Brussels, 2018: 137–149

[8] Huang L, Wang W, Chen J, et al. Attention on attention for image captioning // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, 2019: 4634–4643

[9] Wang Y, Xu J, Sun Y. End-to-end transformer based model for image captioning // Proceedings of the AAAI Conference on Artificial Intelligence. Online Meeting, 2022: 2585–2594

[10] Liu G, Hsu T M H, McDermott M, et al. Clinically accurate chest X-ray report generation // Proceedings of the Conference on Machine Learning for Healthcare Conference. Ann Arbor, 2019: 249–269

[11] Jing B, Xie P, Xing E. On the automatic generation of medical imaging reports // Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, 2018: 2577–2586

[12] Zhang Y, Wang X, Xu Z, et al. When radiology report generation meets knowledge graph. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 12910–12917

[13] Song X, Zhang X, Ji J, et al. Cross-modal contrastive attention model for medical report generation // Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, 2022: 2388–2397

[14] Chen Z, Shen Y, Song Y, et al. Cross-modal memory networks for radiology report generation. Computers & Electrical Engineering, 2022, 98: 1879–0755

[15] Chen Z, Song Y, Chang T H, et al. Generating radiology reports via memory-driven transformer // Proceedings of the Conference on Empirical Methods in Natural Language Processing. Online Meeting, 2020: 1439–1449

[16] You D, Liu F, Ge S, et al. Aligntransformer: hierarchical alignment of visual regions and disease tags for medical report generation // Proceedings of the 24th International Conference on Medical Image Computing and Computer Assisted Intervention. Strasbourg, 2021: 72–82

[17] He Kaiming, Zhang Xiangyu, Ren Shaoqing, et al. Deep residual learning for image recognition // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, 2016: 770–778

[18] Zeng Pengpeng, Zhang Haonan, Song Jingkuan, et al. S2 transformer for image captioning // Proceedings of the International Joint Conference on Artificial Intelligence. Vienna, 2022: 1608–1614

[19] Demner-Fushman D, Kohli M D, Rosenman M B, et al. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 2016, 23(2): 304–310

[20] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation // Proceedings of the Annual Meeting of the Association for Computational Linguistics. Philadelphia, 2002: 311–318

[21] Denkowski M, Lavie A. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems // Proceedings of the Sixth Workshop on Statistical Machine Translation. Edinburgh, 2011: 85–91

[22] Lin C Y. Rouge: a package for automatic evaluation of summaries // Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004). Barcelona, 2004: 74–81

[23] Lu Jiasen, Xiong Caiming, Parikh D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, 2017: 3242–3250

[24] Cornia M, Stefanini M, Baraldi L, et al. Meshed-memory transformer for image captioning // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Online Meeting, 2020: 10575–10584

[25] Liu Fenglin, Ge Shen, Wu Xian. Competence-based multimodal curriculum learning for medical report generation // Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. Online Meeting, 2021: 3001–3012

[26] Liu Fenglin, Wu Xian, Ge Shen, et al. Exploring and distilling posterior and prior knowledge for radiology report generation // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Online Meeting, 2021: 13753–13762

[27] Yang S, Wu X, Ge S, et al. Knowledge matters: chest radiology report generation with general and specific knowledge. Medical Image Analysis, 2022, 80: 102510

Radiology Report Generation Method Based on Multi-scale Feature Parsing

WANG Rui, LIANG Jianguo, HUA Rong†

College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590; † Corresponding author, E-mail: huarong@sdust.edu.cn

When deep learning models are used to generate radiology reports automatically, the extreme imbalance of the data makes it difficult for current models to identify features of abnormal regions, which leads to misjudged and missed diseases. To improve the model's ability to identify diseases and the quality of the reports, the authors use a multi-scale feature parsing Transformer (MFPT) model to generate radiology reports. A key feature enhanced attention (KFEA) module is constructed to strengthen the utilization of key features. A multi-modal feature fusion (MFF) module is designed to promote the fusion of semantic and visual features and to alleviate the impact of the differences between them. The role of the stage awareness (SA) module in optimizing primary features in radiology report tasks is also explored. Finally, comparison with current mainstream models on the popular radiology report dataset IU X-Ray shows that the proposed model achieves the best results to date.

attention mechanism; feature fusion; radiology report; Transformer; image-text generation