Author: "Bernardi, Raffaella" / Language: undetermined - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bernardi, Raffaella"' showing total 9 results

Start Over Author "Bernardi, Raffaella" Language undetermined

9 results on '"Bernardi, Raffaella"'

1. Garbage In, Flowers Out: Noisy Training Data Help Generative Models at Test Time

Author: Testoni, Alberto and Bernardi, Raffaella
Subjects: training data, vision and language, General Medicine, conversational system, training data, vision and language, conversational system
Abstract: Despite important progress, conversational systems often generate dialogues that sound unnatural to humans. We conjecture that the reason lies in the different training and testing conditions: agents are trained in a controlled “lab” setting but tested in the “wild”. During training, they learn to utter a sentence given the ground-truth dialogue history generated by human annotators. On the other hand, during testing, the agents must interact with each other, and hence deal with noisy data. We propose to fill this gap between the training and testing environments by training the model with mixed batches containing both samples of human and machine-generated dialogues. We assess the validity of the proposed method on GuessWhat?!, a visual referential game. We show that our method improves the linguistic quality of the generated dialogues, and it leads to higher accuracy of the guessing task; simple perturbations of the ground-truth dialogue history that mimic machine-generated data do not account for a similar improvement. Finally, we run a human evaluation experiment on a sample of machine-machine dialogues to complement the quantitative analysis. This experiment shows that also human annotators successfully exploit dialogues generated by a model trained with mixed batches to solve the task. Hence, the mixed-batch training does not cause a language drift. Moreover, we find that the new training regime allows human annotators to be significantly more confident when selecting the target object, showing that the generated dialogues are informative.
Published: 2022
Full Text: View/download PDF

2. Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy

Author: Testoni, Alberto and Bernardi, Raffaella
Subjects: FOS: Computer and information sciences, Dialogue Strategies, Computer Science - Computation and Language, Visual Dialogues, Dialogue Strategies, Visual Dialogues, Computation and Language (cs.CL)
Abstract: Generating goal-oriented questions in Visual Dialogue tasks is a challenging and long-standing problem. State-Of-The-Art systems are shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we design Confirm-it, a model based on a beam search re-ranking algorithm that guides an effective goal-oriented strategy by asking questions that confirm the model's conjecture about the referent. We take the GuessWhat?! game as a case-study. We show that dialogues generated by Confirm-it are more natural and effective than beam search decoding without re-ranking., Comment: Accepted to EMNLP 2021 (main conference)
Published: 2021
Full Text: View/download PDF

3. Pay Attention to Those Sets! Learning Quantification from Images

Author: Sorodoc, Ionut, Pezzelle, Sandro, Herbelot, Aurélie, Dimiccoli, Mariella, and Bernardi, Raffaella
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: Major advances have recently been made in merging language and vision representations. But most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw data to perform certain types of higher-level reasoning, expressed in natural language by function words. A case in point is given by their ability to learn quantifiers, i.e. expressions like 'few', 'some' and 'all'. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in 'most fish are red', most encodes the proportion of fish which are red fish. In this paper, we study how well current language and vision strategies model such relations. We show that state-of-the-art attention mechanisms coupled with a traditional linguistic formalisation of quantifiers gives best performance on the task. Additionally, we provide insights on the role of 'gist' representations in quantification. A 'logical' strategy to tackle the task would be to first obtain a numerosity estimation for the two involved sets and then compare their cardinalities. We however argue that precisely identifying the composition of the sets is not only beyond current state-of-the-art models but perhaps even detrimental to a task that is most efficiently performed by refining the approximate numerosity estimator of the system., Comment: Submitted to Journal Paper, 28 pages, 12 figures, 5 tables
Published: 2017
Full Text: View/download PDF

4. Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation

Author: Lazaridou, Angeliki, Nguyen, Dat Tien, Bernardi, Raffaella, and Baroni, Marco
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Computation and Language (cs.CL)
Abstract: We introduce language-driven image generation, the task of generating an image visualizing the semantic contents of a word embedding, e.g., given the word embedding of grasshopper, we generate a natural image of a grasshopper. We implement a simple method based on two mapping functions. The first takes as input a word embedding (as produced, e.g., by the word2vec toolkit) and maps it onto a high-level visual space (e.g., the space defined by one of the top layers of a Convolutional Neural Network). The second function maps this abstract visual representation to pixel space, in order to generate the target image. Several user studies suggest that the current system produces images that capture general visual properties of the concepts encoded in the word embedding, such as color or typical environment, and are sufficient to discriminate between general categories of objects., Comment: A 6-page version to appear at the Multimodal Machine Learning NIPS 2015 Workshop
Published: 2015
Full Text: View/download PDF

5. Nominal coercion in space: Mass/count nouns and distributional semantics

Author: null Hürlimann, Manuela, null Bernardi, Raffaella, and null Paperno, Denis
Published: 2014
Full Text: View/download PDF

6. The Devil is in the Detail: A Magnifying Glass for the GuessWhich Visual Dialogue Game

Author: Testoni, Alberto, Ravi Shekhar, Fernández, Raquel, and Bernardi, Raffaella

7. Learning to merge - language and vision: A deep evaluation of the encoder, the role of the two modalities, the role of the training task

Author: Shekhar, Ravi, Bernardi, Raffaella, and Fernández, Raquel
Subjects: INF/01 INFORMATICA
Abstract: Most human language understanding is grounded in perception. There is thus growing interest in combining information from language and vision. Multiple models based on Neural Networks have been proposed to merge language and vision information. All the models share a common backbone consisting of an encoder which learns to merge the two types of representation to perform a specific task. While some models have seemed extremely successful on those tasks, it remains unclear how the reported results should be interpreted and what those models are actually learning. Our contribution is three-fold. We have proposed (a) a new model of Visually Grounded Dialogue; (b) a diagnostic dataset to evaluate the encoder ability to merge visual and language input; (c) a method to evaluate the quality of the multimodal representation computed by the encoder as general purposed representations. We have proposed and analyzed a cognitive plausible architecture in which dialogue system modules are connected through a common \emph{grounded dialogue state encoder}. Our in-depth analysis of the dialogues shows the importance of going beyond task-success in the evaluation of Visual Dialogues: the dialogues themselves should play a crucial role in such evaluation. We have proposed a diagnostic dataset, \emph{FOIL} which consists of images associated with incorrect captions that the model has to detect and correct. Finally, we have used FOIL to evaluate the quality of the multimodal representation produced by an encoder trained on different multimodal tasks. We have shown how the training task used effects the stability of the representation, their transferability and the model confidence.
Published: 2019

8. Learning the Meaning of Quantifiers from Language and Vision

Author: Pezzelle, Sandro and Bernardi, Raffaella
Subjects: L-LIN/01 GLOTTOLOGIA E LINGUISTICA, ING-INF/05 SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI
Abstract: Defining the meaning of vague quantifiers (‘few’, ‘most’, ‘all’) has been, and still is, the Holy Grail of a mare magnum of studies in philosophy, logic, and linguistics. The way by which they are learned by children has been largely investigated in the realm of language acquisition, and the mechanisms underlying their comprehension and processing have received attention from experimental pragmatics, cognitive psychology, and neuroscience. Very often their meaning has been tied to that of numbers, amounts, and proportions, and many attempts have been made to place them on ordered scales. In this thesis, I study quantifiers from a novel, cognitively-inspired computational perspective. By carrying out several behavioral studies with human speakers, I seek to answer several questions concerning their meaning and use: Is the choice of quantifiers modulated by the linguistic context? Do quantifiers lie on a mental, semantically-ordered scale? Which are the features of such a scale? By exploiting recent advances in computational linguistics and computer vision, I test the performance of state-of-art neural networks in performing the same tasks and propose novel architectures to model speakers’ use of quantifiers in grounded contexts. In particular, I ask the following questions: Can the meaning of quantifiers be learned from visual scenes? How does this mechanism compare with that subtending comparatives, numbers, and proportions? The contribution of this work is two-fold: On the cognitive level, it sheds new light on various issues concerning the meaning and use of such expressions, and provides experimental evidence supporting the validity of the foundational theories. On the computational level, it proposes a novel, theoretically-informed approach to the modeling of vague and context-dependent expressions from both linguistic and visual data. By carefully analyzing the performance and errors of the models, I show the effectiveness of neural networks in performing challenging, high-level tasks. At the same time, I highlight commonalities and differences with human behavior.
Published: 2018

9. Exploiting Text Corpora for Data Enrichment in Language and Vision Applications

Author: Le, Dieu Thu and Bernardi, Raffaella
Subjects: INF/01 INFORMATICA
Abstract: During the last decade, machine learning techniques have been used successfully in many applications. The performance of these systems depends largely on the quality and quantity of the training data. For many tasks, the data itself is not rich enough. For example, text documents such as user-queries, users-comments and short advertisements consist of only few words. Therefore direct word-based representations are sparse which makes it difficult to measure good similarities for clustering or classification. In many other applications, training data is too expensive to fully obtain. In the task of human action recognition from still images, the total number of possible actions is the cartesian product of objects and verbs. This combinatorial explosion of verb-object relations makes the task of learning human actions directly from their visual appearance computationally prohibitive and makes the collection of proper-sized image datasets infeasible. This thesis proposes a framework to enrich poor data with knowledge automatically extracted from large-scale text corpora. It considers various text modeling techniques to extract knowledge. The data enrichment framework is illustrated in different tasks in both language and vision applications. For language applications, we apply data enrichment to query classification. A topic model is estimated on external text corpora as a reference set. This model is then used to analyze topics for short queries and categories, generating shared context between them. The experimental results show that the data enrichment process increases the performance of the system, helping to find better categories for a given query. For vision applications, we employ the knowledge extracted from large scale text corpora to predict objects in context and recognize human actions in images. We investigate the problem of modeling text corpora for knowledge extraction and discuss which model is the most suitable for each particular task. In the first task, we learn the relations between objects from text corpora to predict how different objects often occur together using a probability model. This knowledge is then used to help predict new objects given other objects in the images. In the human action recognition task, we combine the knowledge extracted from external text corpora with the visual features from the images. Based on the visually recognized objects, scenes and relative positions between the human and objects in these images, the most plausible actions are suggested using the knowledge learned from the general external text. This model allows recognizing unseen actions and even outperforms a visual Bag-of-Words model in a realistic scenario where only few visual training examples are available.
Published: 2014

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

9 results on '"Bernardi, Raffaella"'

1. Garbage In, Flowers Out: Noisy Training Data Help Generative Models at Test Time

2. Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy

3. Pay Attention to Those Sets! Learning Quantification from Images

4. Unveiling the Dreams of Word Embeddings: Towards Language-Driven Image Generation

5. Nominal coercion in space: Mass/count nouns and distributional semantics

6. The Devil is in the Detail: A Magnifying Glass for the GuessWhich Visual Dialogue Game

7. Learning to merge - language and vision: A deep evaluation of the encoder, the role of the two modalities, the role of the training task

8. Learning the Meaning of Quantifiers from Language and Vision

9. Exploiting Text Corpora for Data Enrichment in Language and Vision Applications

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Database

Publisher

9 results on '"Bernardi, Raffaella"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources