Introduction
- Vision-language models (VLMs) are useful for a wide variety of tasks
- for various reasons one may want to avoid proprietary models
- public-facing VLMs often seem confident, whether or not they are correct
- closed proprietary models have unknown post-training processes, which hinders research into explainability, generalizability, and effective evaluation (essentially, we are blind in one eye)
- instead, put in the effort to fine-tune a VLM yourself and calibrate it against your own medical expertise
- in the healthcare domain it is more efficient to build a specialist model with very distinct capabilities than to teach one model everything
- for wide practical adoption, fine-tuning should be a commodity task that is easy to do even without an entire team of ML engineers (as is often the case in academic medical centres)
- let's walk through an example of fine-tuning an open-source VLM architecture on custom medical data
LLaVA
alternatively, fine-tune SAM (Segment Anything Model) instead
https://github.com/microsoft/LLaVA-Med
https://medium.com/ubiai-nlp/how-to-fine-tune-llava-on-your-custom-dataset-aca118a90bc3
The dataset
either TotalSegmentator or MSD (Medical Segmentation Decathlon) https://github.com/wasserth/TotalSegmentator
CHECK WHICH DATASET IS EASIER TO LOAD (THEY ARE BOTH EQUALLY NICE AND WELL KNOWN)
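Whichever dataset wins, both ship CT volumes as NIfTI files, while LLaVA's vision encoder expects 2D images. A minimal sketch of the preprocessing step, slicing a volume and applying a standard soft-tissue HU window: the volume here is a synthetic stand-in (real data would be loaded with `nibabel`, as noted in the comments), and the window center/width values are common defaults, not something prescribed by either dataset.

```python
import numpy as np

# Synthetic stand-in for a CT volume in Hounsfield units; with real
# TotalSegmentator / MSD data you would load the NIfTI file instead, e.g.:
#   import nibabel as nib
#   volume = nib.load("ct.nii.gz").get_fdata()
rng = np.random.default_rng(0)
volume = rng.integers(-1024, 2000, size=(64, 64, 32)).astype(np.float32)

def slice_to_image(volume: np.ndarray, z: int,
                   center: float = 40.0, width: float = 400.0) -> np.ndarray:
    """Extract one axial slice and apply an HU window (default: soft tissue),
    returning an 8-bit grayscale image suitable for a VLM's vision encoder."""
    hu = volume[:, :, z]
    lo, hi = center - width / 2, center + width / 2
    clipped = np.clip(hu, lo, hi)          # discard HU values outside the window
    scaled = (clipped - lo) / (hi - lo) * 255.0
    return scaled.astype(np.uint8)

img = slice_to_image(volume, z=16)
print(img.shape, img.dtype)
```

The resulting array can be written out as a PNG (e.g. via Pillow's `Image.fromarray`) so each slice becomes one training image.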
Fine-tuning
https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md
https://github.com/efenocchi/torchtune/blob/feat/deeplake-v4/torchtune/datasets/_utils.py
https://ubiai.tools/how-to-fine-tune-llava-on-your-custom-dataset/
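The Finetune_Custom_Data.md guide linked above expects training data as a JSON list of conversation records, each with an `id`, an `image` path, and alternating `human`/`gpt` turns, where the `<image>` token marks where the visual features are injected. A sketch of converting our slices into that format; the sample IDs, file paths, and the organ-list question/answer are hypothetical placeholders (with TotalSegmentator, answers could be derived from the per-organ segmentation masks).

```python
import json

def make_llava_record(sample_id: str, image_path: str,
                      question: str, answer: str) -> dict:
    """Build one training record in the conversation-JSON format expected
    by LLaVA's custom-data fine-tuning script."""
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            # <image> tells LLaVA where to splice in the image embedding.
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }

# Hypothetical samples; in practice, iterate over all exported slices.
samples = [
    ("ct_0001_slice_080", "images/ct_0001_080.png",
     "Which organs are visible in this CT slice?", "Liver and spleen."),
    ("ct_0001_slice_112", "images/ct_0001_112.png",
     "Which organs are visible in this CT slice?", "Left and right kidney."),
]
records = [make_llava_record(*s) for s in samples]
print(json.dumps(records[0], indent=2))
```

Dumping `records` to a single `train.json` with `json.dump` yields the file passed to the fine-tuning script's `--data_path` argument.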