Integrating visual context into language models for situated social conversation starters
- Author
- Ruben Janssens (UGent), Pieter Wolfert, Thomas Demeester (UGent) and Tony Belpaeme (UGent)
- Abstract
- Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. Models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting, polite questions than smaller models that are fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.
- Keywords
- Oral communication, Visualization, Data models, Task analysis, Social robots, Natural language processing, Affective computing, Natural language generation, vision-and-language, conversation models, embodied conversational agents
Downloads
- (...).pdf: full text (Published version), UGent only, 2.40 MB
- DS798 acc.pdf: full text (Accepted manuscript), open access, 3.59 MB
Citation
Please use this url to cite or link to this publication: http://hdl.handle.net/1854/LU-01JTFYW2NVSHMD3Q6ZWC94VM91
- MLA
- Janssens, Ruben, et al. “Integrating Visual Context into Language Models for Situated Social Conversation Starters.” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, vol. 16, no. 1, 2025, pp. 223–36, doi:10.1109/TAFFC.2024.3428704.
- APA
- Janssens, R., Wolfert, P., Demeester, T., & Belpaeme, T. (2025). Integrating visual context into language models for situated social conversation starters. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 16(1), 223–236. https://doi.org/10.1109/TAFFC.2024.3428704
- Chicago author-date
- Janssens, Ruben, Pieter Wolfert, Thomas Demeester, and Tony Belpaeme. 2025. “Integrating Visual Context into Language Models for Situated Social Conversation Starters.” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 16 (1): 223–36. https://doi.org/10.1109/TAFFC.2024.3428704.
- Chicago author-date (all authors)
- Janssens, Ruben, Pieter Wolfert, Thomas Demeester, and Tony Belpaeme. 2025. “Integrating Visual Context into Language Models for Situated Social Conversation Starters.” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 16 (1): 223–236. doi:10.1109/TAFFC.2024.3428704.
- Vancouver
- 1. Janssens R, Wolfert P, Demeester T, Belpaeme T. Integrating visual context into language models for situated social conversation starters. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING. 2025;16(1):223–36.
- IEEE
- [1] R. Janssens, P. Wolfert, T. Demeester, and T. Belpaeme, “Integrating visual context into language models for situated social conversation starters,” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, vol. 16, no. 1, pp. 223–236, 2025.
@article{01JTFYW2NVSHMD3Q6ZWC94VM91,
abstract = {{Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. Models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting, polite questions than smaller models that are fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.}},
author = {{Janssens, Ruben and Wolfert, Pieter and Demeester, Thomas and Belpaeme, Tony}},
issn = {{1949-3045}},
journal = {{IEEE TRANSACTIONS ON AFFECTIVE COMPUTING}},
keywords = {{Oral communication,Visualization,Data models,Task analysis,Social robots,Natural language processing,Affective computing,Natural language generation,vision-and-language,conversation models,embodied conversational agents}},
language = {{eng}},
number = {{1}},
pages = {{223--236}},
title = {{Integrating visual context into language models for situated social conversation starters}},
url = {{http://doi.org/10.1109/TAFFC.2024.3428704}},
volume = {{16}},
year = {{2025}},
}