François Mairesse

Update: now working at Vlingo in Cambridge, MA

I am a research associate at the Cambridge University Machine Intelligence Lab, in the Dialogue Systems Group headed by Prof. Steve Young. I am currently working on the EU FP7 CLASSiC project (Computational Learning in Adaptive Systems for Spoken Conversation), which focuses on statistical methods for data-driven semantic parsing, dialogue management and natural language generation. I completed my Ph.D. thesis in 2008 under the supervision of Prof. Marilyn Walker, at the Computer Science Department of the University of Sheffield, United Kingdom. I obtained a Master of Engineering and Computer Science in 2004 from the Université Catholique de Louvain in Belgium.

I have been working on statistical methods for natural language understanding, natural language generation and opinion mining. These problems require learning structured prediction models from a large amount of annotated data. I have been especially interested in crowdsourcing for collecting data, in order to model the wide range of speaking styles found in natural language.

Research Interests:

  • Opinion mining from text and spoken utterances
  • Learning to generate natural language from data
  • Robust semantic parsing for spoken utterances
  • Paraphrase acquisition from corpora
  • Expressive language generation and text-to-speech synthesis
  • Learning to detect mood, emotion and personality for user modelling

Journal articles (Google Scholar):


Peer-reviewed publications at international conferences:

Theses:

Online demos:

  • CamInfo: The Cambridge Tourist Information Dialogue System (requires a microphone)
    This Java applet is an interface to our group's live dialogue system, which provides information about most places in Cambridge, including pubs, restaurants, colleges and museums. The system can also be reached by phone at +44 1223 852 453. It implements the Hidden Information State (HIS) framework, i.e. it relies on a partially observable Markov decision process (POMDP) to reason over multiple hypotheses about the user input, which are provided by the ATK speech recogniser. Some functionalities of Personage are used for language generation (e.g., syntactic aggregation and WordNet synonym selection). The speech synthesiser is an HTS voice trained on emphasis-dependent context features using the two-pass context clustering method.

  • Personage: Language Generation with Personality
    The Personage generator can produce personality-rich utterances for presenting information in the restaurant domain. You can use the interactive interface to observe how each utterance varies along the extraversion dimension. Personage is based on models of the generation parameters computed from human personality ratings, detailed in this paper. An online demo is available, and the Java stand-alone generator can be downloaded here.

  • Automatic personality recognition
    What does your language reveal about you? The personality recognition models estimate your scores along the five main personality dimensions (the Big Five) based on your input text. The models are detailed in this paper.

Data and software:

Here are various human-annotated datasets and freely available software. Feel free to use and modify them for non-commercial purposes.

  • BAGEL training and evaluation data
    This dataset contains the 404 semantically aligned utterances used for training and evaluating the BAGEL statistical language generator, together with the naturalness and informativeness ratings of 1616 utterances generated using different learning configurations, namely active learning and random sampling. More details are given in this paper.

  • Emphasis-annotated ARCTIC database for speaker AWB
    This corpus contains word-level emphasis annotations for the first 597 utterances (set A) of the ARCTIC speech database, i.e. the words or phrases perceived as the focus of speaker AWB's utterances.

  • The Personage Language Generator is now maintained by the Natural Language and Dialogue Systems Group at UCSC.

  • Personage dataset: a personality-annotated corpus
    This dataset contains 580 utterances annotated with personality/stylistic ratings from human judges for each Big Five trait. The data also includes the generation decisions made for each utterance, as well as the intermediate content plan tree, sentence plan tree and syntactic structures. Naturalness ratings are also included. This data was used for evaluating the Personage generator, as well as for training parameter estimation models (Mairesse & Walker, 2007, 2008). More details are given in the Personage dataset readme file.

  • Personality Recognizer v1.02 (new version 06/06/2007)
    This Java command-line application extracts psycholinguistic features from multiple text files and runs the included models to compute personality scores for all Big Five traits. An online demo is also available.

  • jMRC - MRC Psycholinguistic Database Java Interface v0.9
    This Java interface allows you to query the MRC Psycholinguistic Database from your Java programs, providing psycholinguistic features for over 150,000 words.

Talks:

Teaching:

François Mairesse, 2011 -