
Integrating visual context into language models for situated social conversation starters

Abstract
Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters, and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. Models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting, polite questions than smaller models that are fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features, and they can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.
Keywords
Oral communication, Visualization, Data models, Task analysis, Social robots, Natural language processing, Affective computing, Natural language generation, vision-and-language, conversation models, embodied conversational agents

Downloads

  • (...).pdf
    • full text (Published version) | UGent only | PDF | 2.40 MB
  • DS798 acc.pdf
    • full text (Accepted manuscript) | open access | PDF | 3.59 MB

Citation


MLA
Janssens, Ruben, et al. “Integrating Visual Context into Language Models for Situated Social Conversation Starters.” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, vol. 16, no. 1, 2025, pp. 223–36, doi:10.1109/TAFFC.2024.3428704.
APA
Janssens, R., Wolfert, P., Demeester, T., & Belpaeme, T. (2025). Integrating visual context into language models for situated social conversation starters. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 16(1), 223–236. https://doi.org/10.1109/TAFFC.2024.3428704
Chicago author-date
Janssens, Ruben, Pieter Wolfert, Thomas Demeester, and Tony Belpaeme. 2025. “Integrating Visual Context into Language Models for Situated Social Conversation Starters.” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 16 (1): 223–36. https://doi.org/10.1109/TAFFC.2024.3428704.
Chicago author-date (all authors)
Janssens, Ruben, Pieter Wolfert, Thomas Demeester, and Tony Belpaeme. 2025. “Integrating Visual Context into Language Models for Situated Social Conversation Starters.” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 16 (1): 223–236. doi:10.1109/TAFFC.2024.3428704.
Vancouver
1. Janssens R, Wolfert P, Demeester T, Belpaeme T. Integrating visual context into language models for situated social conversation starters. IEEE TRANSACTIONS ON AFFECTIVE COMPUTING. 2025;16(1):223–36.
IEEE
[1] R. Janssens, P. Wolfert, T. Demeester, and T. Belpaeme, “Integrating visual context into language models for situated social conversation starters,” IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, vol. 16, no. 1, pp. 223–236, 2025.
BibTeX
@article{01JTFYW2NVSHMD3Q6ZWC94VM91,
  abstract     = {{Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters, and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. Models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting, polite questions than smaller models that are fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features, and they can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.}},
  author       = {{Janssens, Ruben and Wolfert, Pieter and Demeester, Thomas and Belpaeme, Tony}},
  issn         = {{1949-3045}},
  journal      = {{IEEE TRANSACTIONS ON AFFECTIVE COMPUTING}},
  keywords     = {{Oral communication,Visualization,Data models,Task analysis,Social robots,Natural language processing,Affective computing,Natural language generation,vision-and-language,conversation models,embodied conversational agents}},
  language     = {{eng}},
  number       = {{1}},
  pages        = {{223--236}},
  title        = {{Integrating visual context into language models for situated social conversation starters}},
  url          = {{http://doi.org/10.1109/TAFFC.2024.3428704}},
  volume       = {{16}},
  year         = {{2025}},
}
