Detailed Program
Talk Session 1: Using machine learning to track behavior
The first talk session will discuss using machine learning to track infants’ behavior. Most research about infants’ everyday behavior has relied on laborious human coding of video recordings. Recent advances in wearable technologies—movement sensors, ECG, headcams/eye trackers, and audio recorders—have enormous potential to measure continuous, full-day behavior. However, sensors do not directly measure behaviors of psychological interest, such as sleep/wake states, motor actions, joint attention, and infant-caregiver proximity. Speakers will demonstrate how they have leveraged machine learning algorithms to classify infant and caregiver behavior from wearable sensors.
My talk will discuss capturing the dynamics of caregiver–child proximity across early life using wearable sensing technology. Our lab developed the TotTag, a wearable device that measures physical proximity between caregivers and children twice per second, allowing unprecedented ecological assessment of how proximity patterns differ across families and contexts, change across development, and relate to developmental outcomes. This approach enables continuous monitoring of caregiver–child interactions in natural settings. We are currently using this methodology in longitudinal studies tracking family proximity patterns from pregnancy through early childhood, capturing everyday experiences during developmental transitions that were previously difficult to measure outside laboratory settings.
Wearable movement sensors have the potential to describe real-time infant motor behavior in naturalistic settings, but raw sensor data need to be classified into relevant categories. We will present novel machine learning classifiers for infant posture (e.g., sitting, standing), locomotion (e.g., crawling, walking), and restraint (e.g., placed in a seat and/or held), and describe results for each behavior from a dataset of 850 hours of in-home recordings.
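As a rough illustration of this kind of pipeline, the sketch below classifies windowed tri-axial accelerometer data with a generic random-forest model; the window length, summary features, synthetic data, and label set are illustrative assumptions, not the classifiers presented in this talk.

```python
# Minimal sketch: classify infant posture/restraint from windowed accelerometer
# data. Window length, features, synthetic data, and labels are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def window_features(acc, sr=50, win_s=2.0):
    """Split a (n_samples, 3) accelerometer stream into windows and
    compute simple per-axis summary statistics."""
    win = int(sr * win_s)
    n = len(acc) // win
    feats = []
    for i in range(n):
        seg = acc[i * win:(i + 1) * win]
        feats.append(np.concatenate([seg.mean(0), seg.std(0),
                                     np.abs(np.diff(seg, axis=0)).mean(0)]))
    return np.array(feats)

# Hypothetical data: 10 minutes of 50 Hz tri-axial accelerometer signal
acc = np.random.randn(50 * 600, 3)
X = window_features(acc)
y = np.random.choice(["sitting", "standing", "crawling", "held"], size=len(X))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```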
Our interdisciplinary team has developed LittleBeats, an infant wearable platform that synchronously collects multimodal data (e.g., ECG, audio, motion) for extended periods of time (e.g., daylong recordings) in real-world contexts (e.g., the home). We use multimodal deep learning algorithms applied to LittleBeats data to assess a variety of constructs, including infants’ regulation of emotions, sleep/wake states, and physical activity/sedentary behavior. We will present an example of these multimodal deep learning architectures for the detection of infant wake/sleep states, including REM and non-REM sleep stages. Advantages and challenges of multimodal machine learning approaches will be discussed.
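To give a concrete flavor of multimodal fusion, the following is a minimal late-fusion sketch in PyTorch, assuming pre-computed per-window ECG, motion, and audio features; the feature dimensions, encoder sizes, and three-class output are illustrative assumptions, not the LittleBeats architecture.

```python
# Minimal late-fusion sketch for multimodal sleep/wake classification.
# Feature dimensions, encoder sizes, and the three output classes
# (wake, REM sleep, non-REM sleep) are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionSleepNet(nn.Module):
    def __init__(self, ecg_dim=64, motion_dim=32, audio_dim=128, n_classes=3):
        super().__init__()
        self.ecg_enc = nn.Sequential(nn.Linear(ecg_dim, 32), nn.ReLU())
        self.motion_enc = nn.Sequential(nn.Linear(motion_dim, 32), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 32), nn.ReLU())
        self.head = nn.Linear(32 * 3, n_classes)  # fuse by concatenation

    def forward(self, ecg, motion, audio):
        z = torch.cat([self.ecg_enc(ecg), self.motion_enc(motion),
                       self.audio_enc(audio)], dim=-1)
        return self.head(z)

model = LateFusionSleepNet()
logits = model(torch.randn(8, 64), torch.randn(8, 32), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 3]): one score per class
```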
Automatic detection of behavioral cues from video offers promising tools for supporting the early diagnosis of neurodevelopmental conditions. In this talk, I will present our work on detecting specific infant movements, such as face-touch gestures, from video recordings, enabling insights into how babies engage with themselves and their environment. I will also discuss our results on the automatic analysis of videos during the Strange Situation Procedure to predict child attachment scores. In addition, I will present a multimodal machine learning framework that processes multiple behavioral signals to model behavior more holistically. Finally, I will highlight our recent efforts on interpretable, concept-based deep learning approaches, emphasizing the importance of explainability in child behavior analysis.
Everyday movement behavior is exhibited at vastly varying temporal scales, ranging from micro-scale (~seconds) to macro-scale (~minutes to hours), where the higher-scale units (e.g., “playing football”) causally condition the observed statistics of the lower-scale units (e.g., “instantaneous posture”). Focusing on any single level of analysis discards crucially important contextual information. Thus, a grand challenge in improving real-world behavioral analytics is to find relevant and robust activity recognition targets across all temporal scales. Here, a particular challenge for inertial sensors is to reconcile the mismatch between the measured signals and the visually interpreted content of the event.
Poster Session
Large volumes of video data that capture a diverse range of a child’s experiences are essential for comprehensively observing their development. However, model training in developmental psychology is limited by 1) the inaccessibility of infant videos due to privacy concerns about sharing identifiable data and 2) the high compute demands of accurately extracting features from videos recorded in real-world settings. We created an ensemble deep learning method that merges the best capabilities of multiple tools (i.e., YuNet, RetinaFace) to automatically track and extract face and movement coordinates, quickly and accurately, while deidentifying infant and child faces in videos. The method handled complicated movements (e.g., running, dancing, turning, being upside down) and settings (e.g., playgrounds, multiple fast-paced interactions, toys and accessories blocking the face) that children commonly experience. This achieved two major goals: 1) creating and validating an affordable option for sharing deidentified videos at scale, and 2) extracting features (e.g., degree of movement, number and length of interactions, closeness with others) at a markedly reduced compute time that could enable closer-to-real-time insights from video data (e.g., from wearables). Our approach needed only 36.8% of the time required by the state-of-the-art method (i.e., RetinaFace) to detect 99.6% of all faces in large, complex naturalistic videos. Future work will apply this method to auto-deidentify video data for responsible sharing with collaborators, validate it on additional datasets to continue improving model performance while maintaining accuracy, and expand the range of features extracted from each video.
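As a minimal sketch of the deidentification step only, the code below blurs face regions in a video frame given bounding boxes from an upstream detector (e.g., YuNet or RetinaFace); the file name, box format, and blur settings are assumptions, not the ensemble method itself.

```python
# Minimal sketch of the deidentification step: blur detected face regions
# in each video frame. Boxes are assumed to come from an upstream detector
# (e.g., YuNet or RetinaFace); the (x, y, w, h) format is an assumption.
import cv2

def blur_faces(frame, boxes, ksize=(51, 51)):
    """Return a copy of the frame with each face box Gaussian-blurred."""
    out = frame.copy()
    for (x, y, w, h) in boxes:
        roi = out[y:y + h, x:x + w]
        out[y:y + h, x:x + w] = cv2.GaussianBlur(roi, ksize, 0)
    return out

cap = cv2.VideoCapture("example_home_video.mp4")  # hypothetical input file
ok, frame = cap.read()
if ok:
    boxes = [(120, 80, 60, 60)]  # placeholder for detector output
    deidentified = blur_faces(frame, boxes)
    cv2.imwrite("frame_deidentified.png", deidentified)
cap.release()
```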
Maternal education is a well-established predictor of children’s language development, but less is known about its role in the emergence of emotional vocabulary. The present study examined whether maternal education influences children’s production of emotion words. Using Wordbank’s database of MacArthur-Bates Communicative Development Inventories (CDIs), we analyzed American English-speaking children who completed the “Words and Sentences” form. Emotional vocabulary was measured by the production of eight words (“happy,” “sad,” “mad,” “scared,” “love,” “hate,” “hurt,” “tired”), and maternal education was categorized into three levels: Below College, College, and Above College. Regression analyses revealed a significant effect of education: children of mothers with Below College education produced fewer emotion words than those of mothers with College degrees (p = .02). Children whose mothers fell into the Above College category showed a marginal advantage (p ≈ .057). Age and sex were strong predictors, with older children and females consistently producing more emotional vocabulary. Additional analyses using decision tree models revealed the same trend: the data split first by age and then by sex, confirming these as the most predictive features of emotion word production. Notably, maternal education did not emerge as a splitting variable, suggesting that although education correlates with vocabulary, it is not the strongest standalone predictor. Our findings suggest that maternal education is associated with children’s emotional vocabulary; however, developmental factors such as age and sex remain stronger predictors. More broadly, this study demonstrates how applying machine learning to large-scale vocabulary datasets can capture environmental influences on language development.
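A minimal sketch of the type of regression reported here is shown below, assuming a data frame with one row per child; the column names, education coding, and toy data are illustrative, not the Wordbank/CDI analysis itself.

```python
# Minimal sketch of a regression of emotion-word production on maternal
# education, age, and sex. The data frame and column names are illustrative;
# this is not the Wordbank/CDI analysis itself.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "emotion_words": [2, 5, 7, 1, 6, 8, 3, 4],       # count of the 8 target words produced
    "age_months":    [18, 24, 30, 18, 24, 30, 20, 28],
    "sex":           ["F", "M", "F", "M", "F", "M", "F", "M"],
    "mom_ed":        ["Below College", "College", "Above College",
                      "Below College", "College", "Above College",
                      "College", "Below College"],
})

# College as the reference level for maternal education
model = smf.ols("emotion_words ~ C(mom_ed, Treatment('College')) + age_months + C(sex)",
                data=df).fit()
print(model.summary())
```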
Locomotor skill onset, such as crawling and walking, has been temporally linked to increased night wakings and decreased night sleep duration. Explanations for why motor skill onset and sleep are related in infancy remain speculative due to a dearth of automated, non-invasive ways to document infants’ movements during sleep.
Self-touch generates haptic, proprioceptive and motor information and may be fundamental for developing body sense or establishing complex behavior, such as goal-directed actions. The aim of this study was to explore the timing and frequency of infants’ self-touch during nightly wake episodes around the acquisition of independent crawling. Preliminary findings show that infants touch their own bodies during nightly wake episodes more often and for longer bouts before crawling onset than after, suggesting that self-touch may function to construct an up-to-date frame of reference for action and body sense.
This project capitalizes on auto-videosomnography (computer vision coding of sleep metrics) and manual coding of infants’ self-touch to lay the foundation for a new line of work that uses computer vision to automate the detection of infants’ self-touch and identify: 1) the movements associated with fragmented sleep and 2) whether touch location is skill-specific.
Conversational turn-taking shows intricate temporal coordination. Across cultures, conversational turn transitions occur within half a second, with minimal overlap or gaps. Interestingly, rapid and smooth turn-taking is observed in prelinguistic infants as young as five months and in many non-human species, which suggests that turn-taking might be relatively low-level and automatic.
To understand what triggers or inhibits a turn, aside from speech content, we propose to use machine learning methods to explore whether turn-taking is predictable from timing alone. We ask three questions: 1) can the next turn transition be predicted from the timing of recent turns? If so, 2) what temporal window of past turns (e.g., 0.5 s, 1 s) provides the best predictive performance, and 3) which type of model (e.g., linear regression, recurrent NN, transformer-based models) predicts turn-taking best?
We annotated naturalistic conversations between 66 Singaporean parents and their two-year-olds, and coded each member of the dyad as speaking (1) or not speaking (0). We processed the annotations into two binary streams (P(t), C(t)) with a fixed timestep of 0.1 s. Along these binary streams, a sliding window of experimentally manipulated length (5 or 10 timesteps) provides the input data. We will evaluate the models with a focus on turn transitions: for example, at the end of a turn (when the joint state at t is [0,0] and at t-1 is [0,1] or [1,0]), when will the next turn start and who will initiate it ([1,0] or [0,1])? We will compute accuracy, F1 score, and log-loss for each model to determine which performs best, and test whether the models perform better than chance.
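A minimal sketch of the windowing and a baseline model is shown below, using synthetic binary streams at 0.1 s resolution and a logistic-regression stand-in for the model comparison described above; the stream statistics and window length are illustrative assumptions.

```python
# Minimal sketch: predict the next dyadic state from a sliding window over
# two aligned binary streams (parent P(t), child C(t)) at 0.1 s resolution.
# The synthetic streams and logistic-regression baseline are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss

rng = np.random.default_rng(0)
T = 5000                                 # 500 s at 10 Hz
P = (rng.random(T) < 0.3).astype(int)    # parent speaking (1) or not (0)
C = (rng.random(T) < 0.3).astype(int)    # child speaking (1) or not (0)

win = 10                                 # 1 s of context (10 timesteps)
X, y = [], []
for t in range(win, T):
    X.append(np.concatenate([P[t - win:t], C[t - win:t]]))
    y.append(2 * P[t] + C[t])            # joint state at t, encoded 0-3
X, y = np.array(X), np.array(y)

split = int(0.8 * len(X))
clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
pred = clf.predict(X[split:])
print("macro F1:", f1_score(y[split:], pred, average="macro"))
print("log-loss:", log_loss(y[split:], clf.predict_proba(X[split:]),
                            labels=[0, 1, 2, 3]))
```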
Long-form naturalistic audio recordings offer rich opportunities to study the real-world language environments surrounding children. Automatic transcription tools such as Whisper allow researchers to convert hours of audio to analyzable text; however, determining the intended audience of each utterance remains a key limitation. Current methods cannot distinguish whether an adult is speaking to their infant or conversing with another adult, a fundamental distinction for understanding children’s linguistic input. We present an automated tool that addresses this gap by classifying transcribed utterances as either Infant-Directed Speech (IDS) or Adult-Directed Speech (ADS) using a fine-tuned DistilBERT transformer model, achieving an F-score of 90% for the classification of IDS. The original dataset comprised 8,233 manually annotated utterances from naturalistic home recordings, which were balanced across participants to create a training set of 2,838 utterances (1,419 IDS, 1,419 ADS) from 44 participants. This tool transforms the analysis of large-scale naturalistic recordings by allowing researchers to focus specifically on child-directed interactions without manual annotation, a process that is prohibitively time-consuming at scale.
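A minimal sketch of applying such a classifier at inference time is shown below; the checkpoint path and label order are hypothetical placeholders, not the released tool.

```python
# Minimal sketch: classify transcribed utterances as infant-directed (IDS)
# or adult-directed (ADS) speech with a fine-tuned DistilBERT model.
# "path/to/ids-ads-distilbert" is a hypothetical local checkpoint, and the
# label order is an assumption; this is not the released tool itself.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/ids-ads-distilbert"   # assumed fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()

utterances = ["look at the doggy!", "did you send the report yesterday?"]
batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
labels = ["ADS", "IDS"]                      # assumed label order
for utt, idx in zip(utterances, logits.argmax(dim=-1).tolist()):
    print(f"{labels[idx]:>3}  {utt}")
```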
Contingent responses to child vocalizations have been shown to benefit young language learners as early as 5 months of age (e.g., Goldstein et al., 2003). Furthermore, the properties of caretaker speech used in interactions with infants and toddlers, in particular immediately following children’s vocalizations, have a facilitating effect on infants’ speech development (e.g., Goldstein et al., 2009; Tamis-LeMonda et al., 2014). However, some studies point out that the directionality of the exaggerated prosodic and phonetic properties of “parentese” with regard to child initiation of interactions is not always clear (Marklund et al., 2021). Another less-studied area is the interaction between vocalizations and gestures.
This poster describes a study in progress addressing the question of parental responses to children’s pointing gestures. The data consist of a longitudinal corpus of 6 children acquiring Tuatschin (a Romansch variety spoken in Switzerland) between the ages of 2;0 and 4;0. Videos of the children’s interactions with their caretakers were recorded by the parents in their own home environments and annotated in ELAN. Pointing gestures are extracted by first detecting all hand movements and estimating hand pose using MediaPipe, and then using spatial-temporal graph convolutional networks to categorize the poses as pointing events. Following this, the caretakers’ response utterances immediately following the pointing gestures are extracted and analyzed for speech content (based on transcribed speech) and prosodic features (latency, F0 maximum and range, and intensity). A test set of responses to pointing gestures is contrasted with a matched sample of contingent responses to vocalizations not accompanied by pointing gestures.
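A minimal sketch of the first pipeline step (per-frame hand-landmark extraction with MediaPipe) is shown below; the video file name is a placeholder, and the downstream spatial-temporal graph convolutional classifier is not shown.

```python
# Minimal sketch of the first pipeline step: extract per-frame hand landmarks
# with MediaPipe, to be fed into a downstream pointing-gesture classifier
# (e.g., a spatial-temporal graph convolutional network, not shown here).
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
cap = cv2.VideoCapture("caregiver_child_interaction.mp4")  # hypothetical file

frames_landmarks = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        # 21 (x, y, z) landmarks per detected hand for this frame
        frames_landmarks.append([
            [(lm.x, lm.y, lm.z) for lm in hand.landmark]
            for hand in results.multi_hand_landmarks
        ])
    else:
        frames_landmarks.append([])

cap.release()
hands.close()
print(f"extracted landmarks for {len(frames_landmarks)} frames")
```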
Naturalistic day-long recordings offer researchers the opportunity to study the physical and social environments surrounding infants and children. These recordings allow for analyses across modalities, over time, and at multiple hierarchical levels. In this paper, we focus on audio data to present automated methods to measure patterns ranging from low-level acoustic features to high-level semantic complexity and surprisal. Our goal is to illustrate potential untapped applications of wearable devices, particularly audio recordings, in developmental research. Using environmental predictability as an example, we demonstrate how complex constructs that unfold over different timescales can be meaningfully analyzed. We also discuss the key considerations and limitations associated with this approach, highlighting both its utility and the practical challenges it entails.
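As one concrete example of a higher-level measure, the sketch below computes per-token surprisal under GPT-2 for a transcribed utterance; the model choice and example sentence are illustrative assumptions, not the specific pipeline used here.

```python
# Minimal sketch: per-token surprisal (negative log probability, in bits)
# of a transcribed utterance under GPT-2, one kind of higher-level measure
# mentioned above. The example sentence is invented.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "the dog is eating a banana"
ids = tokenizer(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Surprisal of token t given tokens < t (first token has no left context here)
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
for pos in range(1, ids.shape[1]):
    s = -log_probs[pos - 1, ids[0, pos]].item() / math.log(2)
    print(f"{tokenizer.decode([ids[0, pos].item()]):>10}  {s:5.2f} bits")
```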
Transcribing naturalistic caregiver-child interactions is a labor-intensive task in developmental research. While automatic speech recognition (ASR) models offer potential solutions, current ASR systems, including OpenAI’s Whisper, were primarily trained on adult-directed speech (ADS) and have not been systematically evaluated for infant-directed speech (IDS). IDS differs from ADS in prosody, phonetics, utterance structure, and lexical patterns, posing unique challenges for ASR transcription. In this study, we benchmarked Whisper’s transcription accuracy on a dataset of naturalistic caregiver-infant recordings, comparing ASR-generated transcripts to human-annotated gold-standard transcriptions. We evaluated Whisper’s performance using word error rate (WER) and accuracy, and conducted a failure analysis to identify systematic errors made by Whisper, including utterance segmentation mismatches, high error rates on short and repetitive speech, and inconsistent handling of filler words. Results indicated that Whisper struggles with short, contextually ambiguous utterances and fails to reliably segment IDS utterances. Based on these results and early user experience, we propose a preliminary pipeline for integrating ASR technology with human transcription workflows to enhance the efficiency of processing large-scale naturalistic speech data in developmental research.
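A minimal sketch of the WER comparison is shown below, using the jiwer package and an invented reference/hypothesis pair rather than the benchmarked recordings.

```python
# Minimal sketch: compare an ASR transcript to a gold-standard transcript
# with word error rate (WER) using the jiwer package. The utterance pair
# is invented, not from the benchmarked dataset.
import jiwer

reference = "look at the big red ball"    # human gold-standard transcript
hypothesis = "look at a big red bow"      # hypothetical Whisper output

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2f}")                  # 2 substitutions / 6 reference words = 0.33
```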
Talk Session 2: Using machine learning to track environments—Audio and video
The second talk session will discuss using machine learning to track early environments. Infants interact with their caregivers in context, and this environmental context crucially shapes developmental outcomes. For example, caregivers’ speech input and sensitivity, as well as broader features of environments (e.g., noise, household chaos) can impact language and socio-emotional learning. How can researchers capture these environmental features using audio and video recordings? Speakers will discuss how they have applied machine learning algorithms to capture the early auditory and visual environment.
This talk will present ongoing research on the detection and classification of acoustic scenes and events (DCASE) at the Signal Processing Research Centre. I will introduce acoustic scene classification, audio tagging, and sound event detection. Further, I will highlight the issues, and corresponding solutions, that arise when deep learning models must learn DCASE tasks incrementally on top of existing knowledge. I will discuss the available pre-trained audio models, ways to extract embeddings from them, and how to solve target DCASE tasks with those embeddings. Finally, I will outline ongoing research in our Audio Research Group and discuss collaboration opportunities.
Long-form recordings let us capture the richness of children’s everyday language environments, offering insight into how they hear and use language in real life. Using tools like LENA and open-source alternatives (e.g., the Voice Type Classifier (VTC); Lavechin et al., 2020), we measure key aspects of early language experience, including child vocalizations, adult speech, and conversational turns. I will share how we collect and analyze these recordings, our efforts toward standardization in this domain, and some of the challenges we still face. This work highlights the potential and limitations of current methods in capturing children’s linguistic experiences.
Understanding the emotional content of speech heard by infants is crucial for studying early emotional experiences and their impact on infants’ developmental outcomes. This talk will present a state-of-the-art approach for automatic emotional analysis of child-directed speech in a neonatal intensive care unit environment. I will cover the development process of the approach, including data collection, the main challenges encountered, and the solutions implemented to address them. Additionally, I will briefly discuss the performance level of the proposed approach and the applications where the approach has been used.
This talk will describe progress and lessons learned drawing from our experience training classifiers to detect parent-infant behavior in context, including infant crying, parent holding, and household chaos. We will link specific examples from our experiences with guidelines for developmentalists interested in using and developing such models for their own work. These guidelines will highlight common pitfalls and challenges with using artificial intelligence for research and intervention, matched with best practices and practical recommendations for researchers working in this field.
Infants and toddlers learn about objects quite differently from today’s artificial intelligence systems. To better understand these processes, we develop computational models of how infants and toddlers acquire hierarchies of increasingly abstract object representations. These models learn from actual or simulated first-person visual input during extended interactions with objects. We show that strong object representations can result from just minutes of such first-person experience and highlight the benefits of toddlers’ gaze behavior for learning. Furthermore, we demonstrate the emergence of very abstract concepts such as “kitchen object” or “garden object” in these models.
Talk Session 3: Using machine learning to track learning through interactions with the environment
The final talk session will discuss how to track learning through interactions with the environment by leveraging machine learning methods. Five talks will integrate research from developmental psychology, machine learning, and robotics on how infants and machines accelerate learning through embodied and multimodal interactions with their environment. Speakers will highlight research on how integrating vision, language, and action leads to greater learning efficiency, and how these findings can be applied to developing artificial intelligence that better reflects human-like learning.
Predictable vs unpredictable caregiver interactions: using machine learning to quantify multi-dimensional features of caregiver responsiveness.
This talk will present a multi-modal approach to analyzing the complexity of environments and behaviors around infants, demonstrating automatic methods for measuring both low-level salient features and high-level semantic features. We will introduce the use of spectral flux as a measure of acoustic change in the surrounding speech or in environmental sounds. In addition, we extract and quantify facial and hand movements using MMPose. Higher-level linguistic features, such as information rate and semantic surprisal, are derived from time-stamped automatic transcription (WhisperX) and processed using lossless compression techniques and GPT-2. Finally, facial novelty is measured as the Kullback-Leibler divergence between consecutive facial expressions. By integrating these multi-modal signals, we aim to provide a comprehensive framework for investigating the dynamic complexity of early environments.
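As a small illustration of the lowest-level measure, the sketch below computes spectral flux from a magnitude spectrogram; the audio file name and STFT parameters are assumptions, not the exact settings used in this work.

```python
# Minimal sketch: spectral flux as a frame-level measure of acoustic change,
# computed from a magnitude spectrogram. The audio file name is a placeholder.
import numpy as np
import librosa

y, sr = librosa.load("daylong_segment.wav", sr=16000)    # hypothetical audio file
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))  # magnitude spectrogram

# Spectral flux: summed positive change in magnitude between adjacent frames
diff = np.diff(S, axis=1)
flux = np.sum(np.maximum(diff, 0.0), axis=0)
print("mean spectral flux:", flux.mean())
```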
Unlike machines trained on extensive data, infants can extract regularities from limited, yet information-rich, input. How do they accomplish this feat? My talk will highlight work from our lab on how infants track and integrate multimodal input statistics in their environment to accelerate learning. For instance, using machine learning techniques like word2vec, we show that parents and infants jointly create semantic regularities during everyday toy play interactions. We will also discuss how integrating vision, language, and action leads to greater learning efficiency, and how these insights can be applied to developing machines capable of learning more flexibly, efficiently, and autonomously, like a child.
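A minimal sketch of the word2vec step is shown below, trained on a handful of invented toy-play utterances rather than the corpus used in the talk.

```python
# Minimal sketch: train word2vec embeddings on transcribed toy-play speech
# and inspect the semantic neighbors of a toy name. The utterances are
# invented examples, not the corpus used in the talk.
from gensim.models import Word2Vec

utterances = [
    ["look", "at", "the", "red", "truck"],
    ["the", "truck", "goes", "vroom"],
    ["can", "you", "stack", "the", "blocks"],
    ["put", "the", "block", "on", "top"],
    ["where", "did", "the", "truck", "go"],
]

model = Word2Vec(sentences=utterances, vector_size=50, window=3,
                 min_count=1, epochs=50, seed=0)
print(model.wv.most_similar("truck", topn=3))
```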
Machine learning applied to data from child-worn sensors is shedding new light on early social and language development in group settings. Sensing movement and interpersonal orientation allows for data-driven approaches to inferring when children are in social contact with other children and teachers. Movement/orientation data show evidence of homophily (children’s preference for similar others) at both the dyadic and larger group level. Machine learning applied to audio recordings suggests that child MLU (mean length of utterance) can be measured reliably in vivo, indexing assessed language ability, and that the development of child MLU over a school year is associated with child-specific teacher measures.
This talk will highlight research efforts to develop machines that perceive and learn from the world in ways that parallel humans. Specifically, I will present ongoing work in my lab on designing high-performing video-and-language models and social robot conversation agents, demonstrating how insights from developmental science have contributed to infant- and toddler-inspired machine learning architectures and robots.