1 Introduction
Molyneux’s problem, a captivating philosophical puzzle, has intrigued thinkers for centuries. First proposed by the Irish scientist and politician William Molyneux (1656-1698) in the 17th century, this intriguing thought experiment raises intricate questions about the nature of sensory perception and cognition. At its core, Molyneux’s problem poses a deceptively simple question: If a person blind from birth were suddenly granted sight, would they be able to visually recognize, name, and distinguish objects that were previously known only through touch?
The implications of this thought experiment extend far beyond its initial formulation, touching upon fundamental aspects of human experience, including multi-modal perception and integration and the construction of mental representations. Despite centuries of inquiry, a definitive resolution to Molyneux’s problem has remained elusive, with various interpretations and approaches yielding inconclusive results.
In recent years, however, advances in cognitive science and neuroscience have offered new insights into the mechanisms underlying sensory perception and cognition. One prominent approach is predictive processing (PP), which proposes that the mind operates by generating and updating internal models of the world to anticipate sensory input (Chanes & Barrett, 2020; Clark, 2013, 2015b; Friston, 2005, 2012; Hohwy, 2013). According to this framework, perception is an active, top-down process in which the mind compares predictions which are generated by the internal model with incoming sensory information. The discrepancies between predictions and sensory information generate prediction errors, which serve to update the internal model, thereby enhancing the correspondence between the internal model and the surroundings.
Alongside PP, situated accounts of cognition, collectively known as ‘4E cognition’ (embodied, embedded, extended, and enacted), have gained traction (Chemero, 2013; Clark, 1996, 1999; Clark & Chalmers, 1998; Fusaroli et al., 2014; Gallagher, 2005, 2017; Menary, 2010; Newen et al., 2018; Telakivi, 2023; Varela et al., 2016). These approaches emphasize the role of the environment, body, and action in shaping cognitive processes, offering a comprehensive framework for understanding cognition beyond purely internal mechanisms.
This paper aims to explore the potential of Situated Predictive Processing (SPP) as a novel approach to addressing Molyneux’s problem. SPP posits that traditional PP is embodied in a brain that is both neuroplastic and sparse, where content arises from the dynamic interaction with the body and environment. By integrating the principles of PP with those of 4E cognition through the concept of situated mental representations, this work seeks to establish a framework that synthesizes theoretical insights with the limited empirical evidence on Molyneux’s problem.
2 Molyneux’s problem
Molyneux’s problem, initially formulated in the context of Locke’s An Essay Concerning Humane Understanding (Locke republished in 1975), asks whether a person blind from birth could distinguish objects by sight alone if granted vision. This question explores the relationship between visual and tactile sensations and the role of experience. Empiricists like Molyneux, Locke, and Berkeley argued that sensorial cross-modal recognition depends on experience, while rationalists such as Leibniz posited that reasoning could allow recognition without prior visual experience (Bruno & Mandelbaum, 2010; Glenney, 2012). These first theoretical approaches produced neither a unanimous solution nor a shared interpretation of the problem. Empirical approaches began in the 18th century, notably with Richard Grant and William Cheselden’s cataract surgeries, which suggested that patients could not immediately recognize shapes by sight (Cheselden, 1728; Sassen, 2004; Wade, 2020). However, this empirical paradigm was not free of criticism, ranging from moderate claims regarding the experimental design (e.g., the eyes did not have enough time to recover after the surgery) to more sceptical ones about the cataracts per se (e.g., cataracts do not cause complete blindness in many cases) (Glenney, 2011; Wade, 2020).
In the second half of the 20th century, Meltzoff and Borton (1979) investigated the cross-modal perceptual abilities of one-month-old infants. The researchers gave each infant one of two pacifiers, one with nubs and one without, allowing them to explore the object tactually for ninety seconds through their mouths. Following this initial tactile exploration, the infants were then presented with visual images of both pacifiers. The researchers measured how long the infants spent looking at each image and found that they tended to spend significantly more time examining the pacifier they had previously explored. This finding implied an early cross-modal representational ability between the tactile and visual sensory modalities. However, subsequent studies which tried to replicate this finding yielded contradictory conclusions (Maurer et al., 1999).
Over the last two decades, research on treated congenital blindness has gained attention, similar to earlier works by Grant and Cheselden (Cheselden, 1728; Degenaar & Collins, 1996; Loaiza, 2020; Sassen, 2004; Wade, 2020). Held and colleagues (2011) examined five individuals with congenital blindness, aged between 8 and 17 years, who had either cataracts or corneal opacities that left them only able to perceive light and dark. Following 48 hours of recovery post-treatment, participants were presented with one object from a pair that featured subtle morphological differences, either through visual-visual, tactual-tactual, or cross-modal (tactual-visual) presentations. The findings revealed that participants could not visually recognize objects they had previously explored solely through touch. However, they demonstrated the ability to gradually establish cross-modal mappings shortly after visual restoration.
Since the publication of Held and colleagues’ (2011) empirically negative response to Molyneux’s problem, some authors have discussed and criticized their approach. Schwenkler (2012, 2013) argued that Held and colleagues (2011) failed to demonstrate whether their negative results were due to a lack of immediate cross-modal shape recognition capabilities or were indicative of a purely visual deficit, suggesting that the brain is capable of integrating sensory information even without prior visual experience. However, distinguishing how the visual system develops in conjunction with or in the absence of cross-modal integration remains a complex empirical challenge, as these processes are reciprocally and intrinsically connected. Connolly (2013) stressed the need for more refined study designs that accurately capture the perceptual abilities of newly sighted individuals, implying that previous methodologies may have missed critical aspects of sensory integration. Cheng (2015) and Clarke (2016) took an even more critical stance, rejecting Schwenkler’s second proposal. Cheng argued that while Molyneux’s problem is empirically approachable, it remains elusive due to significant methodological limitations. Together, these discussions highlight the ongoing challenges and suggest potential strategies for exploring sensory modality integration in newly sighted individuals.
While Molyneux’s problem focuses on the visuo-tactile relationship, other cross-modal integrations might provide new insights worth considering. For instance, studies employing visuo-haptic illusion paradigms have explored cross-modal perception in treated congenitally blind individuals. Pant and colleagues (2021) examined the Size-Weight Illusion (SWI), where smaller objects are perceived as heavier than larger ones of the same weight. They found no significant differences between normally sighted and treated congenitally blind individuals, suggesting that early-in-life visual disruptions do not impede the later cross-modal visuo-haptic integration necessary for the SWI. Similar conclusions were reached by Piller and colleagues (2023), even though they found more variability in the time required for adaptation after sight restoration.
Studies examining other cross-modal integrations, such as audio-visual or visuo-motor ones, revealed impairments in treated congenitally blind individuals, even decades after sight recovery (Guerreiro et al., 2016b; Putzar et al., 2007, 2010). However, other studies observed a gradual recovery of these abilities (Ostrovsky et al., 2009; Piller et al., 2023). Interestingly, illusions involving the interpretation of two-dimensional perspective cues as three-dimensional depth (e.g., Ponzo and Müller-Lyer illusions) already arise in treated congenitally blind individuals within forty-eight hours post-recovery (Gandhi et al., 2015). This phenomenon could suggest either that rapid development of visual processing occurs post-recovery or that certain cognitive processes are innate, surviving early visual deprivation. If the former holds true, it implies that cross-modal integration requires more time than intra-modal development. By contrast, Putzar and colleagues (2010) reported that treated congenitally blind individuals who had recovered sight decades earlier still performed worse than normally sighted individuals in an orientation face recognition task. Overall, these findings showcase the significant variability in recovery experiences, indicating that the type of sensory capability being restored, whether cross- or intra-modal, plays a significant role in the process.
The historical journey from early to contemporary studies on treated congenital blindness highlights the complexity of sensory perception development and cross-modal integration. A general analysis of these studies indicates the pivotal role of experience, thereby reinforcing the empiricist stance and its negative response to Molyneux’s problem. However, existing perceptual theories fall short of fully explaining why this experiential foundation is essential for the proper development of cross-modal mappings. In the following sections, a novel situated predictive processing account will be proposed to address this gap.
3 Situated predictive processing
3.1 Classical predictive processing
PP is a theoretical paradigm in computational and cognitive neuroscience that posits the mind constructs generative models of both its surroundings and the body to predict incoming sensory input during cognition, action, and perception (Clark, 2013; Friston, 2005; Hohwy, 2013). This framework has roots in Helmholtz and Kant’s theories of perception (for a historical review, see Clark, 2013; Swanson, 2016), which argued that perception is a process of probabilistic inference where sensory input is combined with prior knowledge (Helmholtz, 1867; Kant & Hatfield, 2005). PP operates through a hierarchically organized multilevel bidirectional cascade of top-down and bottom-up signals. Top-down signals, generated by probabilistic generative models, flow downward to be compared with the bottom-up signals flowing upward from sensory receptors (Clark, 2013). At each level of this hierarchy, a matching process occurs between top-down predictions (or ‘priors’) and bottom-up sensory inputs (Dempster et al., 1977; Neal & Hinton, 1998). The discrepancy between the two generates ‘prediction error’, which is transmitted upward to update the generative model, helping the system better represent its surroundings and body. The influence of top-down or bottom-up signals depends on their expected precision: imprecise sensory signals increase the influence of priors, while scenarios with less robust priors rely more on sensory information (Hohwy, 2012). This continuous updating allows the system to make more accurate predictions when it next encounters a similar scenario or, alternatively, to actively sample the world in ways that reinforce the current generative model.
Nevertheless, the system still needs a mechanism to account for the variability in the distribution of prediction errors when minimizing them. This is addressed through precision (i.e., the inverse of the variance), which is used to weight prediction errors and determine their significance in updating the generative model (Hohwy, 2012). For instance, a noisy prediction error should have less impact on the model update than a reliable one, as it may not accurately represent the discrepancies between the model and the surroundings. Overall, higher precision, which corresponds to lower uncertainty, assigns greater weight to the prediction errors deemed reliable (Clark, 2013; Friston, 2005, 2009, 2010). However, this process is context-dependent, influenced by factors like sensory modalities (for an example, see Bays & Wolpert, 2007). The precision weighting is based on the priors from the generative model concerning the noise in both the surroundings and the system itself, shaping the expectations for the precision of prediction error. In this way, the system tries to represent the variability in the system-world interaction.
PP offers a powerful theoretical framework for understanding how the mind constructs generative models to predict sensory input and iteratively updates these models by matching predictions with actual sensory information. PP also provides valuable insights into how the mind represents and interacts with the surroundings and the body, shedding light on fundamental cognitive processes such as perception and action. Nevertheless, while PP places the brain in the spotlight of all these cognitive processes, other contemporary situated approaches challenge this central role by decentralizing cognition, attributing it not only to the mind but also to external factors such as the body, culture, tools, and the environment. In the next section, 4E cognition, an umbrella term for situated accounts of cognition, will be explored and discussed.
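As an illustration only, the precision weighting described above can be sketched for the simplest possible case: a single level with a one-dimensional Gaussian belief. The names used here (GaussianBelief, update, sensory_precision) are hypothetical and do not come from the cited literature, and the sketch is a standard Gaussian belief update rather than the full hierarchical scheme. It shows that the same prediction error moves the belief far when the sensory signal is precise and little when it is noisy.

```python
from dataclasses import dataclass


@dataclass
class GaussianBelief:
    """A one-dimensional belief: a mean plus a precision (inverse variance)."""
    mean: float
    precision: float


def update(prior: GaussianBelief, observation: float,
           sensory_precision: float) -> GaussianBelief:
    """Return the belief after one precision-weighted update (toy assumption)."""
    # Prediction error: mismatch between the sensory input and the prediction.
    prediction_error = observation - prior.mean
    # Weight on the error: the share of total precision carried by the senses.
    gain = sensory_precision / (prior.precision + sensory_precision)
    return GaussianBelief(
        mean=prior.mean + gain * prediction_error,
        precision=prior.precision + sensory_precision,
    )


prior = GaussianBelief(mean=0.0, precision=1.0)
# Same observation, hence same prediction error, but different sensory precision.
reliable = update(prior, observation=10.0, sensory_precision=4.0)
noisy = update(prior, observation=10.0, sensory_precision=0.25)
print(reliable.mean, noisy.mean)
```

With these illustrative numbers, the reliable signal pulls the belief most of the way toward the observation, whereas the noisy signal leaves it close to the prior, which is the qualitative behaviour the text attributes to precision weighting.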
3.2 4E cognition
4E cognition is a framework in cognitive science that emphasizes the embodied, embedded, enacted, and extended nature of cognition (Gallagher, 2017; Menary, 2010; Newen et al., 2018; Noë, 2009; Rowlands, 2010). Embodiment refers to the idea that cognitive processes are shaped by the physical body, suggesting that our bodily experiences significantly influence how we think and perceive the world (Chemero, 2013; Clark, 1996, 1999; Clark & Chalmers, 1998; Gallagher, 2005; Varela et al., 1991). Embeddedness refers to the notion that cognition is deeply intertwined with the physical and social environment in which it occurs (Clark, 2013; De Jaegher et al., 2010; De Jaegher & Di Paolo, 2007; Gallagher, 2008; Hutto et al., 2020). Extension posits that cognitive processes can extend beyond the brain and body, incorporating tools, devices, and technologies that enhance or simulate cognitive abilities (Di Paolo, 2009; Di Paolo & Thompson, 2014; Stewart et al., 2014). Finally, enaction emphasizes the idea that cognition is dynamically shaped by our interactions with the environment, where our actions and perceptions are fundamentally linked (Di Paolo & Thompson, 2014; Gallagher, 2017; Gangopadhyay & Kiverstein, 2009; Hutto, 2022; Hutto & Myin, 2013).
The 4E cognition framework acknowledges that cognition is not something that only occurs within the brain (i.e., internalist accounts), but it is also deeply intertwined with our bodily experiences, the environments in which we live and act, and the tools and technologies that we use (i.e., externalist accounts), leading to a dynamic coupling between the brain-body-world as an autonomous, self-regulating system (Carney, 2020).
Within the 4E cognition framework, strong positions assert that cognition is fundamentally constituted by these external factors. Proponents of strong 4E cognition emphasize that cognitive processes are not merely influenced by these elements; instead, they are intrinsically linked to them, suggesting that our understanding of the mind must encompass the entire system of brain, body, and environment as a unified, self-regulating entity without the brain being in the spotlight (Gallagher, 2005, 2017; Hutto, 2022; Hutto & Myin, 2013; Noë, 2009). For instance, an individual’s ability to perceive and interact with their environment should be considered as a dynamic interplay between the brain, body, and surroundings as a whole cognitive entity. Even though the role of the mind in cognition differs slightly between 4E cognition and PP, both emphasize the enactive feature of cognitive processes, in which action and perception are two sides of the same coin. However, 4E cognition and PP also differ in their use of mental representations. In the following section, mental representations will be defined and characterized, before exploring the so-called ‘representation wars’ between 4E cognition and PP. A peace treaty between the two will then be proposed, leading to an account of SPP.
Additionally, strong 4E advocates contend that this approach leads to a more comprehensive understanding of cognition than traditional models that rely solely on internal mental representations. They argue that cognitive processes cannot be disentangled from the physical and social environments that shape them, thus calling for a re-evaluation of how we conceptualize mental representations in cognitive science. By recognizing that cognition is deeply situated, strong 4E cognition challenges the notion of a purely internal mind and suggests that understanding cognitive processes requires a broader and more integrative perspective that encompasses all dimensions of human experience.
While strong 4E proponents advocate for a model of cognition that emphasizes the importance of factors beyond the brain, PP can also align with these views by acknowledging the situatedness of cognitive processes (Clark, 2022; Nave, 2025; Ohata & Tani, 2020). However, PP maintains that mental representations play a crucial role in the way we construct and update our understanding of our surroundings. To bridge the gap between these perspectives, it is essential to invoke the concept of situated mental representations, which recognize the interplay between internal models and the dynamic interactions with our surroundings. This approach can lead to the establishment of an SPP framework that integrates the strengths of both 4E cognition and PP. Before delving into the concept of situated mental representations and how they might reconcile the two frameworks, it is essential to first explore what a ‘mental representation’ is.
3.3 Mental representations
In contemporary cognitive science, mental representations are generally understood as the building blocks of cognition. They serve as internal models of the world, allowing us to perceive, think, and act. They are mental entities that ‘contentfully’ stand in for objects, events, properties, and relations of the environment, forming the foundation for cognitive processes such as perception, memory, attention, language, and reasoning (Bermúdez, 2010; Von Eckardt, 2012). The phrase “‘contentfully’ stand in for” is key here, as it highlights the representational nature of these mental states, i.e., acting as substitutes for things in the world (see Vilarroya, 2017, for a debate that extends the concept to neural representations). However, despite its importance, the concept of mental representation remains elusive, as there is no widespread agreement on what it means for one thing to represent another (Roth, 2010). Different fields, such as cognitive neuroscience and philosophy of mind, often diverge in their views due to varying ontological and epistemological foundations. Although definitions vary (Ramsey, 2007; Roth, 2010; Vilarroya, 2017), it is generally accepted that mental representations are mental objects with semantic properties and that they ‘stand in for’ something else (Ramsey, 2016; Vilarroya, 2017).
This lack of a clear and widely accepted definition partly arises from the functional ambiguity of mental representations, which is why they are often referred to as a ‘cluster concept’ (i.e., a concept that encompasses several different properties) (Cummins, 1995; Ramsey, 2007). Ramsey, in his book ‘Representations reconsidered’ (Ramsey, 2007), elegantly illustrates this debate. However, before delving into this complex debate, it is important to briefly define two basic features and identify the main views on mental representations up to now.
Mental representations involve intentionality, i.e., they have intrinsic meaning or are about something, in contrast to other types of representations (e.g., traffic symbols) whose meaning derives from the mental states of an agent interpreting them (Egan, 2014; Ramsey, 2007; Von Eckardt, 2012; Williams, 2018). As will be discussed later, the prevailing view holds that the meaning is intrinsic to the mental representation per se, although some authors argue that an interpreter (i.e., an agent using and even creating that representation) is necessary to ascribe meaning. In addition, mental representations may possess causal properties (Egan, 2014; Ramsey, 2007), as exemplified in belief-like representations. For instance, if an agent holds a belief-like mental representation that counting to ten will help them calm down when frustrated, this belief-like mental representation will cause them to count to ten in frustrating situations. Dretske (1997) and Ramsey (2007) stated that their causality is in virtue of their content, even though their content is causally inert.
The intimate connection between intentionality and causality has led to the claim that mental representation is a functional notion (Haugeland, 1991; Peirce, 1931; Ramsey, 2007). This implies that if a cognitive account employs mental representations within its theoretical framework, then it must clearly specify the functional role they play and avoid any type of redundancy (Ramsey, 2007). However, understanding how a mental representation serves as a representation is, as Ramsey puts it, “different from the account of the conditions responsible for its representational content” (Ramsey, 2007, p. 30). While the content is crucial to the representation’s function, it does not fully define the function, nor can the function be reduced to it. The search for the specific role of mental representations within a theoretical cognitive framework is what Ramsey termed the ‘job description challenge’ (Ramsey, 2007). This challenge demands an explanatory benefit from describing the internal parts of a system in representational terms, i.e., their role as representations must not be redundant.
Mental representations have been employed across distinct theoretical frameworks, most notably the classical computational theory of cognition (CCTC) and the connectionist framework. CCTC posits that cognition is grounded in inner computations, while the connectionist framework emphasizes the connections within neural networks as the basis for cognitive processes. Mainly within the CCTC framework, but in other novel cognitive frameworks as well, two general types of mental representations have received considerable attention: Input-Output representations (IO-representations) (Cummins, 1991) and Structural/Simulation representations (S-representations) (Cummins, 1991; Swoyer, 1991; see Lee & Calder, 2023, for a recent review). The former describe mental representations as the inputs and outputs of either a computational process, according to the CCTC, or a neural network, according to the connectionists (Cummins, 1991; Ramsey, 2007). The latter describe mental representations as structurally isomorphic to their target and exploited for this resemblance. In S-representations, the pattern of relations between the parts of the target is reflected in the representation itself (Cummins, 1991). According to Ramsey (2007), IO-representations are necessary for the function of sub-systems, while S-representations allow the system to exploit the structural similarity between the representations and the target for cognitive purposes. However, some authors have pointed out that the distinction between S- and IO-representations is blurrier than initially expected, arguing that both types of representations share functional similarities and are more interconnected than initially believed (Facchin, 2021b; Morgan, 2014; Nirshberg & Shapiro, 2021; Shagrir, 2012; Sprevak, 2011). Overall, they proposed that both types involve mapping relationships between inputs and outputs or rely on structural correspondence with the world.
Recently, Facchin (2024) suggested that S-representations need to be reclassified, as the traditional notion designates a large variety of distinct types of representations, which could explain this blurriness. Whether viewed as fundamentally distinct or not, S- and IO-representations play a functional role in CCTC and meet the job description challenge.
By contrast, other types of mental representations have not successfully met the job description challenge: receptor/detector-representations (r/d-representations) (see Lettvin et al., 1959; Hubel & Wiesel, 1962, 1968, for empirical background) and tacit-representations. R/d-representations refer to the network of neurons responsible for detecting a specific stimulus, leading to the assumption that they carry/represent the information of those stimuli. However, Ramsey (2007) argued that r/d-representations can be explained purely in causal-physical terms without invoking representational terms. For instance, when a network of neurons ‘x’ is activated by the sight of a red apple through a cascade of neural communication from the retina to the visual cortex, it communicates with other neural networks that organize the response, such as grasping and eating the apple. As Ramsey pointed out, this entire process can be understood without the need for representational explanations. On the other hand, tacit-representations apply to neural networks not because they are triggered by a stimulus, but because the entire network encodes the information through its connections (Rumelhart et al., 1986a, 1986b). Even though this might seem similar to S-representations, tacit-representations do not share structural isomorphism with their target; rather, the representational information is distributed across the network’s connections. Ramsey (2007) argued that these tacit-representations display dispositional properties and constrain the capabilities of the neural network. However, this does not necessarily mean that they require a representational status.
Ramsey warned that accepting tacit-representations as mental representations would imply that anything with a disposition for something could be considered representational (e.g., “[r]ocks are now representational, since, after all, even a rock (in this sense) ‘knows how’ [it has the disposition] to roll down a hill”; Ramsey, 2007, pp. 170–171, emphasis added).
Despite their central role in cognition and the fact that the concept can be historically traced back to the philosophies of Aristotle and Aquinas, there is still no widespread agreement on the precise definition of mental representations. Most researchers adopt a working definition that describes mental representations as mental objects that stand in for something else. The majority agree that intentionality and causality are two crucial features of mental representations, often emphasizing a strong connection between the two. Given the importance of their functional role in cognitive frameworks, Ramsey (2007) proposed the job description challenge in order to evaluate whether the notion of mental representations was either redundant or significant within a cognitive framework. Ramsey maintained that IO- and S-representations successfully meet the challenge, while r/d- and tacit-representations fall short. However, the ongoing debate over their necessity in cognition has fuelled the so-called ‘representation wars’, where proponents of two leading contemporary cognitive theories, i.e., PP and 4E cognition, dispute the necessity of mental representations for a well-functioning cognitive framework. In the following subsection, the representation wars will be discussed, outlining the key arguments from both sides of this skirmish.
3.3.1 Mental representations in PP
The discussion of mental representations in PP was initiated with Clark’s (2013, 2015b) characterization of the PP framework. According to him, mental representations in PP are probabilistic and action-oriented mirrors of the world. These mental representations enable organisms to engage successfully with their environment by minimizing prediction error, a key aspect of PP that ultimately supports survival and autopoiesis. The multi-level probabilistic generative models, in which mental representations play a central role, guide perception and action (Clark, 2013, 2015b; Williams, 2018). According to Clark, these mental representations carry abstract content, serving as causal loops designed to predict states of the world through their representational properties. Gładziejewski (2016) advanced this view by proposing that the representations in PP align with prototypical S-representations (Cummins, 1991; Ramsey, 2007; Swoyer, 1991). As Gładziejewski put it, “cognitive systems navigate their actions through the use of a sort of causal-probabilistic ‘maps’ of the world” (Gładziejewski, 2016, p. 569). The structural similarity of S-representations within PP can be understood as the brain implementing Bayesian networks (Pearl, 2000), “whose structure resembles the causal-probabilistic structure of our system’s environment” (Gładziejewski, 2016, p. 571; Wiese, 2017; Williams, 2018). These networks also align with the causal loops within PP’s hierarchical predictive structure. However, Gładziejewski (2016) and Williams (2018) clarified that these maps and generative models do not function identically but share key features (e.g., they guide action through active inference (Brown et al., 2011), are detached/decoupled, and enable the detection of representational errors) which help them meet the job description challenge (Ramsey, 2007).
Prediction errors play a crucial role by constraining the structural mapping between the hierarchical generative model and the causal-probabilistic structure of the world (Clark, 2012; Gładziejewski, 2016; Hohwy, 2013, 2016). Regarding the feature of detachment/decoupling, Gładziejewski admitted that this is an open debate. Even though he was inclined to claim that representational posits in PP work in a completely detached manner, the question remains unresolved. Detachment, however, is a pivotal feature in the representation wars, as situated accounts of cognition argue that cognition involves continuous interaction between the mind and external factors. Thus, the role of detachment in PP warrants further exploration in this ongoing debate.
Wiese (2017) took Gładziejewski’s proposal regarding PP’s mental representations a step further by delving into their content. Wiese distinguished between two types of content, following Egan’s (2014) framework: (i) cognitive content and (ii) mathematical content. While the latter refers to the computational system that performs a task, the former refers to the content relative to the context and cannot be derived from the computational aspects. As proposed by Gładziejewski (2016), the structure of models in PP is composed of three elements: likelihoods, dynamic relations, and prior probabilities. Building on this, Wiese (2017) argued that these elements constitute the mathematical content of the mental representations, while the functional relations between the variables at different hierarchical levels account for the cognitive content. In terms of Ramsey’s (2007) classification of representations, Wiese’s approach suggests a combination of S-representations, which carry the cognitive content, and IO-representations, which carry the mathematical content. This combination can be viewed as part of a gradual representationalism framework, such as the one proposed by Toribio and Clark (1994), which suggests that representations may vary along a continuum depending on their function and content. This gradualism was further developed by Rutar and colleagues (2022), who identified two gradual features of S-representations: structural similarity and decoupling. Nevertheless, in light of the discussions around the functional similarity between S- and IO-representations (Facchin, 2024; Shagrir, 2012; Sprevak, 2011), if one accepts that these two types of representations are functionally equivalent, it follows that there should not be any distinction between the carriers of mathematical and cognitive content.
This means that Wiese’s account requires a slight reformulation, acknowledging that both mathematical and cognitive content refer to different functional aspects or ways of usage of the same mental representations, in line with Sprevak’s (2011) critique.
Both Wiese (2017) and Williams (2018), building on Gładziejewski’s (2016) work, argued that the representation of causal-probabilistic dependencies among variables in the surroundings forms a dynamical model of both the body and its environment. Williams (2018) claimed that the content of mental representations in PP is organism-relative, constructing a model of the world from the perspective of a self-organising entity, shaped by its body’s physiological needs.
More recently, Rutar and colleagues (2022) suggested the idea of gradation in representational features within PP, expanding on the earlier work by Toribio and Clark (1994). Since PP’s mental representations behave similarly to S-representations (Gładziejewski, 2016), Rutar and colleagues proposed that gradation should be assessed in terms of two key aspects: structural similarity and decoupling. Each feature can in turn be decomposed into graded components. Structural similarity can be broken down into the number of preserved relations (i.e., relations between the parts of the representation) and space granularity (i.e., the information carried besides the relations of the parts). Decoupling, on the other hand, can be understood through the hierarchical level (i.e., from higher levels to lower levels that are proximal to sensory information) and the precision weighting of prediction error (i.e., adapting the accuracy of the representation).
Anderson and Chemero (2013), Orlandi (2016), and later Downey (2018), van Es (2020), and Facchin (2021a, 2021b) all challenged the representationalist stance in PP. Anderson and Chemero (2013) argued that since the bottom-up and top-down signals central to PP can be interpreted in non-representational terms, there is no need for a representational theoretical framework. Going a step further, Orlandi (2016) and van Es (2020) proposed that the causal loops are better understood as covariations/correlations between two proximal levels in the hierarchy. The same would apply to priors and likelihoods. This view would align PP’s posits with r/d-representations, which fail to meet the job description challenge (as discussed previously in Section 3.3 on mental representations). Facchin (2021a, 2021b) adopted a similar position, explicitly rejecting Gładziejewski’s claim that mental representations in PP behave as prototypical S-representations. Focusing on sensorimotor contingencies (i.e., the regular ways in which sensory inputs change in response to an agent’s movements), Facchin argued that the role of generative models in PP is primarily to guide an agent’s interactions with the world, rather than to construct internal models that merely represent it. Thus, the processes in PP can be better understood by focusing on the enacted and embodied nature of cognitive processes. Downey (2018) introduced a fictionalist perspective on representationalism in PP. He proposed that although PP entails mental representations as theoretical posits, they play an explanatory role, meeting the job description challenge, without needing to metaphysically exist. Downey claimed that this fictionalist approach could resolve the representation wars. Downey’s argument is based on Orlandi’s (2016) work, presented above, and on Ramsey’s (2017) rejection of the claim that cognition must be representational.
However, this eliminativist-fictionalist perspective presents a contradiction: if mental representations do not ontologically exist, they cannot exert causal powers, thereby failing to meet the job description challenge. Nonetheless, it can be agreed that this fictionalist discourse is a ‘weak’ eliminativist position, serving as a transitional framework towards potentially non-representationalist PP accounts.
So far, the consensus is that PP posits mental representations within a multi-level hierarchical generative model that guides both perception and action (Clark, 2013, 2015b). These mental representations are thought to represent the causal relations in worldly states through causal loops (Gładziejewski, 2016). Some authors argued that they meet the job description challenge proposed by Ramsey (2007) because they resemble prototypical S-representations (Gładziejewski, 2016). Wiese (2017) further suggested that the content of these mental representations can be divided into cognitive and mathematical components, while Rutar and colleagues (2022) emphasized the importance of gradation in PP’s representationalism. However, critics like Orlandi (2016), Downey (2018), and van Es (2020) argued against the representational posits of PP by claiming that they fail to meet the job description challenge, as they seem to act more like r/d-representations. Despite this, Downey noted that they still have an explanatory role within PP, proposing a fictionalist perspective. Van Es, meanwhile, leaned toward non-representationalism. In the following section, non-representationalist perspectives advocated by proponents of 4E cognition will be explored.
3.3.2 Representation wars: 4E cognition’s non-representationalism
Ecological psychology, founded by Gibson (2015), along with more recent contributions from Favela (2023), has argued that a paradigm shift is underway in the cognitive sciences, shifting away from a representation-centred framework. Many early proponents of embodied, embedded, extended, and enacted accounts of cognition also advocated for the elimination of mental representations from cognitive accounts (Chemero, 2013; Hutto & Myin, 2013; Shapiro, 2011; Varela et al., 1991). The general upshot is that the body and world themselves serve as representations external to the mind, eliminating the need for internal mental representations (Hutto & Myin, 2013; O’Regan & Noë, 2001). In contrast to representation-centred paradigms, which posit that perception and cognition aim to build objective models of the world in an observer-independent manner (Anderson, 2017; Engel et al., 2016), action-oriented frameworks like 4E cognition advocate for a performative understanding of the mind (Anderson, 2014). Thus, 4E cognition assigns the brain the role of a control system that governs the organism’s interactions with the world, rather than creating internal models of it (Anderson, 2017; Chemero, 2013; Cisek, 1999).
At first glance, the positions of PP representationalists and 4E cognition non-representationalists seem irreconcilable, given that they place the focus of cognition on opposing extremes (for an elegant overview of the debate see Başoğlu, 2021). But is this divide truly unbridgeable? Clark claimed that peace could be reached if PP were understood in situated terms. As Clark put it,
“Dynamically speaking, the whole embodied, active system here self-organizes around the organismically-computable quantity ‘prediction error’. […] Is this an inner economy bloated with representations, detached from the world? Not at all. This is an inner economy geared for action, whose inner states bear contents in virtue of the way they lock embodied agents onto properties and features of their worlds. But it is simultaneously a structured economy built of nested system, whose communal project is both to model and engage the (organism-relative) world” (Clark, 2015a, p. 6)
In order to establish a situated PP framework, two key elements are necessary: (i) mental representations, as posited within PP frameworks, and (ii) an understanding of cognition that encompasses both internalist and externalist perspectives. This work proposes the concept of situated mental representations as a potential reconciliation between PP and 4E cognition. In the following sections, the notion of situated mental representations will be explored, drawing on recent work by Piccinini (2022), before discussing their explanatory potential to bridge the gap between PP and 4E cognition.
3.3.3 Situated mental representations: peace treaty
Situated mental representations began to take shape during the debate between Dokic and Recanati (Dokic, 2007). Dokic argued that certain authors had implicitly tied the notion of ‘situation’ to mental representations, suggesting that their truth conditions could be influenced by context. Dokic emphasized the importance of ad hoc or temporary/occasional concepts, defined as transient constructions held in working memory (Dokic, 2007, p. 205). Dokic gave an example in which the concept of ‘dog’ gives rise to distinct mental representations depending on the context (e.g., the mental representation will be different if you are in the Parc Ciutadella or in the Arctic tundra). In this view, a ‘situation’ encompasses the various factors that make a mental representation capable of expressing an absolute proposition. Thus, the situation is comprised of, as Dokic put it, “relational facts between a representation and its propositional constituents” (Dokic, 2007, p. 215). However, Dokic’s account seemed to lean towards understanding situational factors as primarily cognitive, a point that Recanati contested (Dokic, 2007, p. 218).
Both Clark (1996) and Miłkowski (2017) examined the possibility of cognition being both representational and situated. Miłkowski (2017) argued that representational computational mechanisms should be understood as embedded within larger mechanisms that dynamically process feedback from the environment. While both authors concurred that cognition should encompass both representational and situated elements, they did not develop a specific cognitive framework to encapsulate this duality.
Recently, Newen and Vosgerau (2020) claimed that mental representations must be understood as “non-static, use-dependent, and situated relative to a certain behaviour or cognitive ability” (Newen & Vosgerau, 2020, p. 2). The functional roles of the mental representations, they proposed, are realized through mechanistic relations that extend beyond the neural level to bodily and even social levels. Situated mental representations are use-dependent, meaning that their content is intimately tied to the purpose for which the representation is employed. Their situatedness refers to the fact that the vehicle of the representation can be a combination of neural and bodily states, with the content varying depending on the explanatory level, suggesting a form of gradation (Clark & Toribio, 1994). Through this, Newen and Vosgerau (2020) constructed the first comprehensive framework for situated mental representations.
Finally, Piccinini (2022) proposed a framework for situated mental representations, explaining how situatedness solves issues surrounding the content of mental representations. Piccinini’s framework builds on S-representations and informational teleosemantics (i.e., the view that the semantic content of a mental representation comes from the information it carries regarding its function; e.g., Dretske, 1997). Piccinini argued that a representational account of cognition requires situatedness, i.e., cognition needs to be understood as embodied, embedded, enacted, and affective. This necessity originates from the dynamic interactions between the nervous system, the body, and the environment, as well as the system’s use of feedback from these interactions to update its models (notably, this is similar to the principles of PP). This situated representationalism leads to representations with (i) original (i.e., not derivative) semantic content, (ii) neural (and probably bodily) vehicles that are coordinated with their content, (iii) a causal role aligned with the system’s purposes, (iv) a distal representation of stimuli, and (v) the potential to misrepresent.
According to Piccinini, “the vehicles of neural representations and their semantic content are two sides of the same coin. That is, the same functional properties that turn a system of internal states into a neural representation system are also sufficient to give such internal states their semantic content” (Piccinini, 2022, p. 5). Piccinini held that their content involves three causal processes: the learning process that creates the content, the causal process by which the representation stands in for something (i.e., one classical function of representation), and the process guiding the behaviour of the system (i.e., the other classical function). Thus, Piccinini introduced a new causal process related to the creation and updating of representations: learning. The concept of learning is inherently situated, as neurocognitive systems exhibit plasticity, a dynamic response of the system to its surroundings in which its cellular and molecular structures change. According to Piccinini, this active learning is key for generating original semantic content for mental representations and requires embodiment (i.e., the system requires a body to receive information from within and outside and to establish real-time feedback loops with its surroundings), embeddedness (i.e., the body and environment provide information sources and form part of the feedback loop), enaction (i.e., dynamism is essential as the sensory information changes over time and the body moves), and affect (i.e., affective states directly influence reinforcement learning, which is linked to active learning).
Even though Piccinini tried to demonstrate the importance of active learning for understanding situated mental representations, a system without active learning should still be able to generate mental representations with original content. However, this work aligns with Piccinini’s claim that to create original content, systems need to be embodied, embedded, extended, and enacted. While affect undoubtedly influences the content, it does not seem to be a necessary requirement. Active learning serves as an excellent example of how content might be updated and showcases the situatedness of mental representations. However, Piccinini’s argument holds even in the absence of active learning.
In summary, frameworks on the situatedness of mental representations are gradually being established (Heras-Escribano & Martínez Moreno, 2024), highlighting the importance of situatedness in addressing the problem of content. Piccinini’s work (2022) explicitly refers to 4E cognition, while also implicitly aligning with PP, given its focus on feedback loops and active learning through motion and model updates, both central aspects of PP. Therefore, Piccinini’s framework is suitable to address the next challenge: the development of an SPP account.
4 Situated predictive processing
Situated accounts of PP emphasize the importance of the environment, body, and action in shaping and implementing predictions. These approaches represent a middle ground between traditional cognitive models that treat cognition as a largely internal process and the radical situated views, which emphasize the external dynamic nature of cognitive processes. Most of the existing situated accounts focus on a specific aspect of 4E cognition (i.e., embodied, embedded, extended, or enacted).
Even though many 4E cognitivists might challenge the idea that interoceptive PP accounts are inherently embodied, interoception (i.e., the perception of internal bodily states) places the body at the center of cognition, treating it as a dynamic system rather than a merely passive vessel (see Khalsa et al., 2018; Petzschner et al., 2021 for general overviews of interoception and predictive processing). Seth and colleagues (Seth et al., 2012; Seth, 2013; Seth & Tsakiris, 2018; Seth & Friston, 2016) proposed a model in which interoceptive prediction error, which underpins the subjective sense of presence, runs in parallel with exteroceptive prediction error, which underpins the sense of agency. According to this model, subjective feeling states, such as emotions, arise from interoceptive inference (i.e., analogous to active inference but in a bodily manner). In this view, emotions are cognitive evaluations of the body’s physiological states. Barrett and collaborators (Barrett, 2016; Barrett & Simmons, 2015; Kleckner et al., 2017) described the neural underpinnings of the so-called ‘Embodied Predictive Interoception Coding’ (EPIC) model, with the term ‘embodied’ explicitly mentioned, which offers an embodied and constructed account of emotions (i.e., similar to Seth’s proposals mentioned above) and is associated with allostasis (i.e., adaptive processes that maintain stability through change; Schulkin & Sterling, 2019). Allostasis itself has been associated with interoceptive predictive processing by Schulkin and Sterling (2019). Pezzulo and colleagues (2015, 2021) reviewed the role of interoception and homeostatic regulation in active inference. Owens and colleagues (2018) approached interoceptive inference empirically by examining the connection between cardiac interoception and autonomic cardiac control. Other approaches to PP highlight the role of bodily experiences in shaping sensory processing and prediction-making (Apps & Tsakiris, 2014; Seth & Friston, 2016).
More recently, Badcock, Friston, and Ramstead (2019) proposed the ‘hierarchically mechanistic mind’, grounded in an evolutionary systems theory of psychology, which integrates a situated, embodied, Bayesian brain. They defined the brain as follows:
“[A]n embodied, complex adaptive control system that actively minimises the variational free-energy (and, implicitly, the entropy) of (far from equilibrium) phenotypic states via self-fulfilling action-perception cycles [which might be linked to PP], which are mediated by recursive interactions between hierarchically organised (functionally differentiated and differentially integrated) neurocognitive processes.” (Badcock et al., 2019, p. 17) (emphasis added).
These approaches suggest that sensory inputs are not processed purely in isolation but are instead modulated by internal bodily signals and states, such as interoceptive and proprioceptive signals. In summary, interoceptive PP accounts propose an embodied form of PP in which predictions serve to minimize the energy required to maintain the organism’s allostasis or to respond effectively to incoming external signals. At the same time, prediction errors update the internal generative model, refining it for future occasions with similar internal or external signals.
Regarding embeddedness, both social knowledge (Brodski et al., 2015; Brodski-Guerniero et al., 2017; Chanes et al., 2018; Draganov et al., 2023; Ramos-Grille et al., 2022) and environmental factors (Constant et al., 2020) have been proposed to shape perception. Kilner and colleagues (2007) proposed that the mirror neuron system, which is involved in action observation and imitation, can be understood through PP principles. Briefly, the mirror neuron system consists of distinct brain regions that are active not only when a subject executes an action but also when the subject observes the same action performed by others, effectively transforming visual information into knowledge or skills (Bonini et al., 2022; Rizzolatti & Craighero, 2004). Kilner and colleagues (2007) suggested that during action observation, this system tries to infer the most likely cause of an action by minimizing prediction errors. In this scenario, the ‘cause’ refers to the intentional mental states that caused/motivated the action. Thus, when observing two actions that are identical in sequence but driven by distinct intentions, PP makes it possible to distinguish between them by integrating motor information through the mirror neuron system with additional sensory information from other brain modules. Constant and colleagues (2020) proposed an active inference formulation that “views cognitive niche construction as a cognitive function aimed at optimizing organisms’ generative models” (Constant et al., 2020, p. 1), similar to what is normally understood as a mixture of embedded and extended cognition. Cognitive niche construction involves behaviours and knowledge supported by sociocultural practices, playing a critical role in human evolution and cognition. The reciprocal relationship between individual cognitive processes and collective sociocultural practices means that as individuals interact with their cultural surroundings, they update their generative models to better predict and navigate these sociocultural environments.
This dynamic mechanism of updating enhances the ability to function effectively within cultural contexts. As individuals grow up within a particular culture, their generative models develop in tandem with the affordances and practices of that culture, creating a reciprocal relationship between the individual’s internal models and the external social environment. This approach highlights how cultural and environmental factors are not separate from cognition but are intricately woven into the very fabric of how we predict, perceive, and interact with the world.
Extended approaches to PP have been left aside until recently, even though some criticisms against these proposals have already been raised (Facchin, 2023; Hohwy, 2016, 2018). Kirchhoff and Kiverstein (2021) argued that an extended PP is feasible, even when incorporating the Markov blanket formalism. They argued that self-evidencing processes, which contribute to maintaining the organizational integrity of the individual over time and, thus, to distinguishing it from the environment, are semipermeable. This permeability allows external elements to be integrated when necessary. Kersten (2022) supported this view, proposing that prediction error minimization can be used to frame extended systems as genuine cognitive systems. According to Kersten, extended systems need to engage in prediction error minimization at an algorithmic level in order to be part of the cognitive process. Similarly, Clark (2022) emphasized the importance of distinguishing between the process of ‘recruitment’ and actual cognitive processing. Clark argued for the continuous flow and transformation of information between the cognizant and extended systems, both working to minimize prediction error. For instance, when someone uses glasses to improve vision, the clarity of the received visual signals increases. This alters the predicted accuracy of exteroception, leading the system to update its generative model and modify its priors about the reliability of this sensory modality. Recently, Kersten (2024) expanded on Clark’s proposal by distinguishing between two important senses of recruitment: ready-to-hand and adaptive recruitments, emphasizing the role of temporality in their functioning. More sophisticated approaches involve neuromodulation, which may influence the neural mechanisms of PP, thereby altering behaviour.
Draganov and colleagues (2023) demonstrated how socio-affective predictions could be modified using transcranial alternating current stimulation, implying that modulating brain oscillations can transiently alter the generative model.
Enactive approaches to PP emphasize the importance of action in shaping and implementing predictions (Gallagher, 2017). These approaches suggest that motor control functions as an active inference process, where predictions based on proprioceptive signals are fulfilled through peripheral motor reflexes (Adams et al., 2013; Brown et al., 2011; Friston et al., 2011; Millidge et al., 2021). This mechanism enables organisms to adjust their actions based on their prior knowledge and internal models of the world (e.g., a goalkeeper repositioning to save a goal) to enhance the alignment between predictions and incoming sensory signals, thereby minimizing prediction error. Building on this perspective, Seth (2014) described an enacted and embodied account of PP that explains sensorimotor contingencies and perceptual presence (i.e., the way in which objects are experienced as whole and present in the environment). This framework also extends to explaining phenomena such as synaesthesia. Facchin (2021a) further argued that this enacted and embodied account of PP aligns more closely with anti-representationalist views, suggesting that PP can operate without relying heavily on internal representations and instead depends on the dynamic brain-body-environment interaction. Ridderinkhof and Brass (2015) proposed a PP framework to explain kinesthetic motor imagery (i.e., the cognitive ability to mentally perform and experience motor actions from a first-person perspective without actually executing them). The general idea is that this process facilitates the updating of the generative model, improving predictive motor control when actual actions must be executed.
Bruineberg and colleagues (2018) diverged from the traditional Helmholtzian perspective of perception by proposing that the generative model in the context of PP is not merely a source of internal representations, but a tool for guiding an organism’s interactions with the environment to maintain a stable brain-body-environment dynamic system. Similarly, Tschantz and colleagues (2020) developed a framework combining goal-oriented and epistemic behaviours through active inference. This approach generates models that balance action-oriented goals with the need for information gathering, in order to create accurate and detailed models that are relevant to specific actions. While both proposals retain a representational aspect, they shift towards a more situated approach, as the representations they propose do not simply mirror the causal-probabilistic structure of the environment. Instead, they emphasize the enactive coupling between brain, body, and environment. Tschantz and colleagues (2020) highlighted that these representations may be even less veridical than classical representations but are more functionally useful for a system that is actively engaged with the environment.
Finally, the affective feature of situatedness has been proposed more recently (Thompson, 2010) and, consequently, a comprehensive framework is still under development. As mentioned earlier in the discussion of embodied PP frameworks, Seth and collaborators (Seth et al., 2012; Seth, 2013; Seth & Tsakiris, 2018; Seth & Friston, 2016) and Barrett and collaborators (Barrett, 2016; Barrett & Simmons, 2015; Kleckner et al., 2017) have suggested that emotions should be understood as evaluations of the physiological states of the body through interoceptive PP. Additionally, Ridderinkhof (2014, 2017) proposed a PP framework for emotional actions, emphasizing that PP makes it possible to understand the impulsive and purposive features of emotional actions as evaluations and fine-tunings of anticipated action effects based on the predicted sensory consequences. Piccinini, in his work on situated mental representations (Piccinini, 2022), referred to the importance of affect in reinforcement learning, which in turn impacts active learning processes. Since active learning is closely linked to PP, it follows that affect can facilitate the updating of the generative model through this learning mechanism.
Overall, situated accounts of PP provide a more comprehensive and nuanced view by emphasizing the importance of embodiment, embeddedness, extension, enaction, and even affect in shaping and implementing predictions and in understanding the myriad suggested functions of prediction errors. However, these accounts face a challenge concerning their reliance on traditional mental representations. Thus, the situated mental representations proposed in the previous section serve to bridge both PP and 4E cognition, as they (i) introduce mental representations, which are essential for contemporary PP frameworks, and (ii) require a combined externalist-internalist approach to cognition. This discussion will pave the way for a future proposal of SPP that integrates all aspects of 4E cognition within a general framework, while explaining how situated mental representations do explanatory work.
4.1 Situated predictive processing as a framework to understand the Molyneux problem
To this point, this work has laid the foundations for an SPP framework which employs situated mental representations in contrast to traditional ones. However, it remains unclear how SPP can elucidate the mechanisms behind Molyneux’s problem and help interpret the findings from empirical studies conducted in recent years. In essence, Molyneux’s problem asks about the origin of cross-modal mappings: are they (i) acquired through experience or (ii) innate? This work argues that the answer is neither one nor the other but a combination of both, with experience playing a slightly more significant role.
SPP posits that the brain relies on priors from a generative model that represents the surroundings in a situated manner. In the case of a congenitally blind individual presented with two distinct tactile stimuli, the individual would use their generative model to differentiate between them. If they fail to do so, prediction errors would arise from the mismatch between top-down predictions and bottom-up sensory input. These prediction errors would then guide the update of the generative model. According to SPP, these generative models are inherently embodied; they are shaped by the specific characteristics of the individual’s brain and body and are sensitive to changes in bodily states. For instance, if an individual experiences reduced tactile sensitivity due to a peripheral nervous system injury, their generative model, which was built on the assumption of normal sensory function, would generate top-down predictions based on previous experiences. However, the reduced sensory signals caused by the injury would produce a mismatch between predictions and sensory input, leading to significant prediction errors that update the model accordingly. It is likely that the system would initially assign more precision to the generative model, but it would still update the model to improve its alignment with, and representation of, the surroundings under this new embodiment. But what would happen if the peripheral system recovered, and the bottom-up neural signals returned to their pre-injury level of activity? The same would happen, but in the opposite direction. The generative model would now predict lower bottom-up neural activity than is actually received, creating prediction errors that would again travel upwards to update the generative model. While this example illustrates how perception adjusts in response to bodily changes, the shifts involved are not as dramatic as those posed by Molyneux’s problem.
Yet, there is another critical aspect missing from this example, which is highly relevant to Molyneux’s problem: cross-modal integration.
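The injury-and-recovery example above can be made concrete with a minimal sketch. It is an illustration under strong simplifying assumptions, not a model drawn from the PP literature cited here: the generative model is reduced to a single expected tactile signal, the injury is modelled as a multiplicative `gain` on the bottom-up signal, and a fixed `learning_rate` stands in for precision weighting. The function name and all parameter values are hypothetical.

```python
def simulate_tactile_adaptation(gains, true_intensity=1.0, learning_rate=0.2):
    """Error-driven update of an expected tactile signal.

    gains: sequence of sensory gains (1.0 = intact periphery, <1.0 = injury).
    Returns a list of (observed, expected, prediction_error) per time step,
    where `expected` is the value after the update at that step.
    """
    # Assumption: the model starts fully adapted to the first gain value.
    expected = true_intensity * gains[0]
    history = []
    for gain in gains:
        observed = gain * true_intensity   # bottom-up sensory signal
        error = observed - expected        # prediction error (signed)
        expected += learning_rate * error  # top-down model update
        history.append((observed, expected, error))
    return history


if __name__ == "__main__":
    # Hypothetical time course: intact, injured, recovered.
    gains = [1.0] * 5 + [0.4] * 30 + [1.0] * 30
    for step, (obs, exp, err) in enumerate(simulate_tactile_adaptation(gains)):
        if step in (4, 5, 34, 35, 64):
            print(f"step={step:2d} observed={obs:.2f} expected={exp:.3f} error={err:+.3f}")
```

In this toy setting the prediction error is negative at injury onset (the model over-predicts the signal), shrinks as the expectation adapts, and turns positive at recovery onset (the model now under-predicts), mirroring the "same process, opposite direction" pattern described in the text.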
When thinking about generative models, it is common to focus on a single sensory modality. However, there is no contradiction in considering a more comprehensive generative model that encompasses multiple modalities to account for associations between them. This cross-modality of PP has been empirically analyzed in a few studies (Das et al., 2023; Dercksen et al., 2021; Sánchez-García et al., 2011). The findings generally suggest that the system relies primarily on intra-modal predictions that match the incoming sensory information, but cross-modal predictions are also integrated. However, the system prioritizes certain modalities over others based on their reliability in a given context. Sánchez-García and colleagues (2011) found that visual predictions tend to have an advantage over auditory predictions in a cross-modal framework. Therefore, PP, and by extension SPP, must accommodate cross-modality within their frameworks. Nonetheless, further empirical investigation is needed to understand how sensory modality weighting occurs and its broader implications.
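One common way to formalize reliability-based weighting of modalities is precision-weighted averaging of Gaussian estimates, where precision is the inverse of variance. The sketch below illustrates that textbook scheme only; it is not the analysis used in the studies cited above, and the function name, modality labels, and numbers are hypothetical.

```python
def fuse_modalities(estimates):
    """Precision-weighted fusion of per-modality estimates.

    estimates: dict mapping modality name -> (mean, precision),
               with precision = 1 / variance and precision > 0.
    Returns (fused_mean, weights), where weights sum to 1.
    """
    total_precision = sum(precision for _, precision in estimates.values())
    # Each modality's weight is its share of the total precision.
    weights = {
        name: precision / total_precision
        for name, (_, precision) in estimates.items()
    }
    fused_mean = sum(weights[name] * mean for name, (mean, _) in estimates.items())
    return fused_mean, weights


if __name__ == "__main__":
    # Hypothetical case: vision is assumed more reliable than audition here.
    estimates = {"visual": (10.0, 4.0), "auditory": (14.0, 1.0)}
    fused, weights = fuse_modalities(estimates)
    print(fused, weights)
```

Under these assumed precisions the fused estimate lies closer to the visual estimate than to the auditory one. Changing the context amounts to changing the precisions, which shifts the weights without altering the fusion rule.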
Returning to a scenario more aligned with Molyneux’s problem, let us now consider an individual who has been blind from birth. Throughout their life, they have developed generative models across various sensory modalities, and even cross-modal maps, except for the visual ones (though some studies, discussed later, question the extent of this). This individual can tactually distinguish between two objects and may also associate their distinct sounds with tactile information in a cross-modal generative model. After sight restoration, they would not have a pre-existing generative model for the visual modality and would thus be unable to make confident predictions that link visual stimuli to tactile information. The system would, therefore, assign low precision to the priors from the visual modality and would refrain from making uncertain predictions, instead placing greater weight on the bottom-up sensory information. However, because generative models for other sensory modalities have already been established, the association between incoming visual information and these stored models could allow the creation of a new associative generative model at a faster pace. The system can leverage the reliability of these pre-existing generative models to guide the active learning process for the visual modality, which may be further accelerated by integrating multiple sensory modalities simultaneously into a larger associative generative model.
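The two claims in this paragraph, that a low-precision visual prior lets bottom-up input dominate and that pre-existing models in other modalities can speed up learning, can be illustrated with a minimal Gaussian sketch. Everything in it is an assumption made for illustration: the visual feature is a single number, observations are noise-free, and the "tactile-seeded" prior simply starts near the correct value. The function names and values are hypothetical and not taken from the cited studies.

```python
def precision_weighted_update(prior_mean, prior_precision, obs, obs_precision):
    """Standard Gaussian update: combine a prior and an observation by precision."""
    post_precision = prior_precision + obs_precision
    post_mean = (prior_precision * prior_mean + obs_precision * obs) / post_precision
    return post_mean, post_precision


def steps_to_converge(prior_mean, prior_precision, target,
                      obs_precision=1.0, tol=0.02, max_steps=1000):
    """Count updates until the estimate is within `tol` of `target`.

    Assumption: every observation equals `target` (no sensory noise).
    """
    mean, precision = prior_mean, prior_precision
    for step in range(1, max_steps + 1):
        mean, precision = precision_weighted_update(mean, precision, target, obs_precision)
        if abs(mean - target) <= tol:
            return step
    return max_steps


if __name__ == "__main__":
    # Low prior precision: one observation pulls the estimate most of the way.
    print(precision_weighted_update(0.0, 0.1, 1.0, 1.0))
    # High prior precision: the same observation barely moves the estimate.
    print(precision_weighted_update(0.0, 10.0, 1.0, 1.0))
    # Uninformed visual prior versus a prior seeded from a tactile estimate.
    print(steps_to_converge(0.0, 0.1, target=1.0))
    print(steps_to_converge(0.9, 0.1, target=1.0))
```

With these assumed values, the prior seeded from the tactile estimate reaches the tolerance in fewer updates than the uninformed prior, which is the qualitative pattern the paragraph describes. The sketch says nothing about how such seeding would be implemented neurally.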
Overall, this framework aligns with the findings of Held et al. (2011), Pant et al. (2021), and Piller et al. (2023), who demonstrated that young adults with congenital blindness, after sight restoration, were able to gradually develop cross-modal visuo-tactile mappings within a short period. However, it is important to highlight that Molyneux’s problem poses a relatively simple task in the context of building a cross-modal generative model, i.e., the visual discrimination of two objects that can already be discriminated tactually. What would happen in more complex scenarios or in cases of visual impairments that cannot be fully reversed? Severe visual deprivation experiments on domestic cats, which consisted of removing or limiting the visual input to the brain through methods such as dark rearing and binocular or monocular deprivation (see Kandel, 2013 for a general overview; Wiesel & Hubel, 1965), have provided some insight. These studies revealed that visual input is crucial for the proper development of the visual system during a critical developmental period. If visual deprivation extends beyond this critical period, cats fail to develop a functioning visual system, resulting in significant visual deficits or even complete blindness. These findings are consistent with earlier empiricist perspectives. Drawing on these findings, Gallagher (1996, 2005), one of the main proponents of 4E cognition, argued that comparing the visual capabilities of blind-recovered individuals to those of control individuals may be problematic, as long-term deprivation can lead to neurodegenerative changes that impair the recovery of the visual system.
As in the experiments on visually deprived cats, the lack of visual input has significant negative implications for the system’s visual processing, as has been reported for the audio-visual (Guerreiro et al., 2016a, 2016b; Putzar et al., 2007, 2010), visuo-motor (Ostrovsky et al., 2009), and tactile-proprioceptive (Petkova et al., 2012) cross-maps (see Nava et al., 2024 for a review). Thus, cross-modal mappings appear to be both context- and modality-dependent, suggesting that Molyneux’s problem would be answered even more negatively if the visuo-tactile task involved something more complex than distinguishing two objects already known by touch. Under the SPP framework, this might be explained by the fact that PP’s generative model is embodied within a nervous system that is reciprocally intertwined with a body and that, during the development of the individual, each system requires signalling from the other to develop properly.
The system’s enactive ability to gradually create generative models through associations with other sensory modalities can be attributed to two key features of the brain: (i) neural plasticity and (ii) brain sparsity. Neural plasticity, which is important for PP, allows the brain to reorganize itself by forming new connections or eliminating old ones throughout life. It rests on Hebbian learning (Hebb, 1949), summarized by the maxim “neurons that fire together, wire together”, and on synaptic plasticity through long-term potentiation and depression (LTP and LTD, respectively) (Bear & Malenka, 1994; Bliss & Collingridge, 1993; Bliss & Gardner‐Medwin, 1973), which together enable the brain to adapt dynamically as needed. Brain sparsity challenges the traditional notion of brain modularity, which holds that specific functions are localized in discrete brain modules. Instead, brain sparsity suggests that a simple one-to-one mapping between brain areas and functions is an oversimplification, and it advocates a network-based perspective (Huntenburg et al., 2018; Pessoa, 2014). In this view, multiple networks dynamically interact to perform functions, thereby challenging rigid modular boundaries. Overall, these features imply that the mind is embodied in a neuroplastic and sparse brain, which constrains its ability to form associative and non-associative generative models. In addition, the brain’s dependence on external stimuli for proper development (e.g., as demonstrated in studies of visual deprivation) reinforces the notion that cognition is embodied within an enacted system.
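The Hebbian maxim cited above can be made concrete with a minimal sketch of the plain Hebb rule, in which the change in a connection weight is proportional to the joint activity of the two units it links. The unit labels, learning rate, and activity values are hypothetical, and the sketch deliberately omits any weakening mechanism such as LTD, so it should be read as an illustration of the principle rather than as a model of cortical learning.

```python
def hebbian_step(weight, pre, post, learning_rate=0.1):
    """Plain Hebb rule: the weight grows with the product of both activities."""
    return weight + learning_rate * pre * post


# Hypothetical connection between a tactile unit (pre) and a visual unit (post).
paired_weight = 0.0    # both units repeatedly active together
unpaired_weight = 0.0  # only the tactile unit is active

for _ in range(20):
    paired_weight = hebbian_step(paired_weight, pre=1.0, post=1.0)
    unpaired_weight = hebbian_step(unpaired_weight, pre=1.0, post=0.0)

print("weight after co-activation:        ", round(paired_weight, 3))
print("weight without visual co-activation:", round(unpaired_weight, 3))
```

Only the repeatedly co-activated connection strengthens. Without a decay or normalization term, this plain rule grows without bound, which is one reason biologically motivated accounts pair potentiation with depression.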
Amedi and colleagues (2005) elegantly reviewed how the occipital areas of blind people, which are normally in charge of visual processing in the visual system hierarchy, are repurposed for other sensory modalities, including tactile (see Sadato et al., 1996 for an example) and motor (see Ricciardi et al., 2009 for an example) processing, or even for other cognitive functions such as language and memory. Using fMRI, Peelen and colleagues (2014) found that the occipitotemporal cortex of blind individuals was activated during shape comparisons in a similar way to that of sighted individuals. Therefore, when areas normally responsible for processing visual stimuli (e.g., occipital areas) are recruited for other functions (e.g., tactile processing) in congenital blindness, these regions may develop associative generative models more quickly after visual restoration by drawing on this neural flexibility.
Molyneux’s problem has also been approached, albeit briefly, through prosthetic vision strategies, including artificial retinas, sensory substitution devices, and visual prosthetic systems, as noted by Evans (1985), leading to a reformulation of the question: “would a formerly blind individual, after having regained a degree of visual functionality by means of a prosthetic device, pass the Molyneux test?” (Jacomuzzi et al., 2003, p. 270). Artificial retinas, which produce electrical impulses when activated by light to induce phosphene perception (i.e., a luminous sensation produced by mechanical or electrical stimulation of the retina), can slightly improve vision (Chow, 2004; Ramirez et al., 2023). However, artificial retinas have mainly been used for blindness caused by degenerative conditions, which means they do not directly apply to Molyneux’s problem. Sensory substitution devices (SSDs) provide visual information by stimulating a non-visual modality. Bach-Y-Rita and colleagues (1969) used electrical stimulators on body areas with haptic receptors, showing that patterns of visual discrimination can be learnt through haptic stimulation. Their successful approach was later refined (Deroy & Auvray, 2012; Nau et al., 2015; Reich et al., 2012; Ward & Meijer, 2010), leading to striking findings. After brief training with SSDs, blind individuals demonstrated the ability to point at targets, recognize patterns, and perform tasks like motion tracking and object distance estimation. Similar to the work of Amedi and colleagues (2005), these studies emphasized the brain’s ability to reorganize and adapt, recruiting the visual system for object recognition while using haptic-derived information. SSDs effectively facilitate cross-modal perception, generating an extrinsic and artificial visuo-tactile associative model while recruiting visual system areas, which could help enhance recovery after sight restoration.
Finally, regarding cortical prostheses, Dobelle’s pioneering work (2000) demonstrated that visual cortical prostheses could help blind individuals by creating more specific and individualized phosphene perceptions than those created by artificial retinas. Nevertheless, subsequent studies have emphasized the challenges of using this type of prosthesis due to high variability in perception and the need for extensive training (Lewis et al., 2015; Najarpour Foroushani et al., 2018). Although these prosthetic devices are still evolving, they highlight an extended dimension of Molyneux’s problem, and even suggest that extrinsic associative generative models can be created and the visual system recruited following haptic stimulation.
Synaesthesia and Molyneux’s problem, while differing in their origins and dispositions, both explore the brain’s capacity for cross-modal mapping. Synaesthesia often arises from atypical neural connections that produce stable, automatic associations between sensory modalities (Ward, 2013). For example, hearing a sound might consistently evoke a perception of colour. While synaesthesia can be acquired, developmental synaesthesia is of particular interest here: if a specific developmental visuo-tactile synaesthesia were present in a congenitally blind individual, it could potentially yield a positive answer to Molyneux’s problem. Even though cases of synaesthesia in congenital blindness are rare, a case report described an acquired audio-tactile synaesthesia in a congenitally blind individual, triggered by LSD, which resulted in ‘visual-like’ qualia similar to the experiences reported by users of SSDs (Dell’Erba et al., 2018). While this case is not conclusive due to its non-developmental nature, it raises the possibility that blind individuals with visuo-tactile synaesthesia could provide a positive answer to Molyneux’s problem.
Synaesthesia has been approached theoretically through Seth’s PP model of sensorimotor contingencies (2014). Seth suggested that, in typical perception, generative models are rich in counterfactuals, contributing to a sense of perceptual presence. In synaesthetes, however, generative models may exhibit unusually high prior precision, leading to a reduced role for counterfactuals. Synaesthesia emerges through drastic changes in neural networks, influenced by neural plasticity. Both SPP and Seth’s enacted and embodied PP (2014) can account for synaesthesia, as both take the generative model to be embodied in a plastic brain. Conversely, a negative answer to Molyneux’s problem is the rule, as the counterfactually rich generative model is low in prior precision due to the absence of previous visual experience.
To sum up, SPP offers a comprehensive framework for understanding how the mind is embodied in a brain that displays neuroplasticity and sparsity and that is enacted with its surroundings. In addition, the brain can even be extended through sensory substitution devices (Bach-Y-Rita et al., 1969; Deroy & Auvray, 2012; Nau et al., 2015; Reich et al., 2012; Ward & Meijer, 2010). Neuroplasticity and brain sparsity are intrinsic features of the brain’s adaptability, suggesting that the answer to Molyneux’s problem may not be entirely negative, with synaesthesia as an extreme case. The brain is a highly flexible system capable of adjusting to novel scenarios, within certain limits. In blind individuals, for instance, the brain may repurpose visual networks for other functions that can later facilitate the development of associative generative models after visual restoration. Nevertheless, experience remains crucial for proper neurodevelopment, establishing the groundwork for associative generative models and driving neural plasticity. Thus, while the brain’s adaptability offers some potential for cross-modal learning, this process still requires experience, meaning that the answer to Molyneux’s problem still leans towards the negative.
5 Concluding remarks
Even though Molyneux’s problem was first published over three centuries ago, it remains unsolved. While empirical approaches during the 18th century (Cheselden, 1728; Sassen, 2004; Wade, 2020) and contemporary studies (Held et al., 2011; Pant et al., 2021; Piller et al., 2023) predominantly suggest a negative answer, recent critiques regarding experimental design (Cheng, 2015; Clarke, 2016; Connolly, 2013; Schwenkler, 2012, 2015) suggest that a definitive resolution to the problem is still elusive.
This work describes an SPP framework and suggests that it provides a valuable lens for understanding Molyneux’s problem. According to traditional PP, the brain generates predictions about incoming sensory information based on its internal generative models of the world, which are updated by comparing top-down predictions with bottom-up sensory inputs. Crucially, in contrast to traditional views, SPP posits that these generative models derive their content from dynamic brain-body-environment interactions, as Piccinini (2022) argued, thereby resolving the problem of content in cognitive systems.
SPP explains why individuals born blind, upon sight restoration, struggle to predict or visually distinguish objects previously familiar through touch. However, it also shows how the brain’s intrinsic properties, including neural plasticity and sparsity, allow for gradual reconfiguration and cross-modal adaptation. As such, SPP offers a ‘moderate’ negative answer to Molyneux’s problem: while cross-modal predictions are not immediately available without previous visual experience, neuroplasticity and sparsity allow for rapid adaptation once visual experience is gained, leading to the development of associative generative models.
This work lays the foundations for an SPP framework, but further theoretical and empirical research is needed to fully explore SPP in cross-modal perception. Future studies may also uncover additional implications for the study of sensory substitution and the plasticity of cognitive systems, shedding further light on the intricacies of Molyneux’s problem.
Acknowledgments
I would like to specifically thank Lorena Chanes for the discussions about mental representations, predictive processing, and 4E cognition that prompted the beginning of this manuscript. I am grateful for the indefatigable assistance and discussion of Ube Cisfúgar. I also want to express my deep gratitude to Núria Peñuelas for the general discussions that helped me to arrange my ideas. I am grateful to the anonymous reviewers from Philosophy and the Mind Sciences for their insightful feedback, which has significantly enhanced both the flow and content of the paper, transforming it from the rough initial draft into its current form.
Funding
The work for this research was generously funded by the Centro Internacional de Neurociencia y Ética (Spain).