Fusing multimodal knowledge in language models