Text Normalization

In the text normalization shared task we address the problem of chat-speak style communication. Some recent research has tackled this problem for SMS communications and from the perspective of machine translation. In this shared task we attempt to generalize the problem to Web 2.0 content and to explore any additional alternatives the participants come up with.

Motivation:

Web 2.0 has transferred the authorship of content from institutions to people. The web is no longer only a place to publish institutional information; it is a channel where users exchange, explain or write about their lives and interests, give opinions and rate others' opinions, and upload media, photos and files, most of the time in a very casual way. Users, especially young ones, write on their web sites as they do in short text messages or chats, paying no attention to spelling, or deliberately shortening words and using contextual slang.
For years there has been a big effort to produce natural language processing tools that try to understand well-written sentences, but these tools cannot be applied out of the box to analyse Web 2.0 content. Not even basic tools like stemming can map a shortened word and its full form (such as Xmas and Christmas) to a common stem. Another line of work has been devoted to reducing a family of words to a single concept in order to improve the similarity between a document and a query or other documents: going from words to stems, then to lemmas and finally to concepts reduces the number of dimensions of the vectors that represent documents.
In order to retrieve or mine Web 2.0 data, we first need to understand its content. Misspelled words increase the variability of the forms used for the same concept. There is a need either to correct and expand the text so that algorithms designed for correct text can be applied, or to adapt these techniques to deal with the texts found in Web 2.0.
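
The effect of this variability on document vectors can be illustrated with a minimal sketch. It assumes a small hand-made lookup table (the entries in VARIANTS are hypothetical examples, not taken from the task data) and a very simple tokenizer; it is not a proposed solution to the task, only a way to see how mapping variants to a standard form shrinks the vocabulary.

```python
import re

# Hypothetical lookup table: illustrative entries only, not from the task data.
VARIANTS = {
    "xmas": "christmas",
    "u": "you",
    "gr8": "great",
    "2moro": "tomorrow",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(text, variants=VARIANTS):
    """Replace each known variant with its standard form; keep other tokens."""
    return [variants.get(tok, tok) for tok in tokenize(text)]

docs = ["Merry Xmas to u", "Merry Christmas to you"]

# Each distinct token is one dimension of a bag-of-words vector.
raw_vocab = {tok for d in docs for tok in tokenize(d)}
norm_vocab = {tok for d in docs for tok in normalize(d)}

print(len(raw_vocab), "dimensions before normalization")
print(len(norm_vocab), "dimensions after normalization")
```

After normalization the two example sentences share the same tokens, so a similarity measure based on word overlap would treat them as identical.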

Objectives:

The objective of this track is to correct texts from Web 2.0 in order to produce sentences that are at least syntactically correct and suitable for further processing by existing tools or for information retrieval.

Description:

For this track there is a corpus obtained from different Web 2.0 sites. The aim was to crawl sites where users behave in different ways: Twitter, chats, product ratings, news and comments on these news items, and forums. There is no parallel corpus of misspelled/corrected text for this track. What is provided is a training set of several hundred thousand sentences, with a high proportion of misspelled words, produced by different users in different contexts. This training dataset is described and can be downloaded from this link. The evaluation dataset will be composed of similar sentences from similar sites.
We encourage participants to share the resources they generate, such as dictionaries, and to publish a parallel corpus if they annotate some of the sentences, so that others can use them. This can be done before (preferably) or after the evaluation, as one of the objectives of this workshop is to start producing a reference corpus for these tasks.
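
Since no parallel corpus is provided, one possible starting point for building a shareable dictionary is to list the frequent training tokens that do not appear in a reference word list. The sketch below is only an illustration under that assumption: the function name oov_candidates, the min_count threshold and the tiny word list are hypothetical, and a real word list would come from whatever lexicon a participant chooses.

```python
import re
from collections import Counter

def oov_candidates(sentences, word_list, min_count=2):
    """Return (token, count) pairs for tokens missing from word_list,
    most frequent first, keeping only tokens seen at least min_count times."""
    counts = Counter(
        tok
        for sentence in sentences
        for tok in re.findall(r"[a-z0-9]+", sentence.lower())
        if tok not in word_list
    )
    return [(tok, n) for tok, n in counts.most_common() if n >= min_count]

# Toy data standing in for the training set and a reference lexicon.
sample_sentences = ["c u 2moro", "c u l8r", "see you later"]
sample_word_list = {"see", "you", "later"}

for token, count in oov_candidates(sample_sentences, sample_word_list):
    print(token, count)
```

The resulting list still has to be annotated by hand or by some other method; the sketch only ranks candidates for annotation.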

Evaluation:

The evaluation of this track will be conducted on a test dataset specifically prepared for it. System outputs should provide one equivalent well-written sentence for each test sentence.
The evaluation will be performed in two steps. In the first, system outputs will be checked with spell checkers. In the second, the participating teams will be asked to rank the answers given by the different systems for a small set of sentences.
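
The exact spell checkers and scoring used in the first step are not specified here. As a rough illustration of what a spell-checker-based measure might look like, the sketch below computes the fraction of tokens in a sentence that appear in a word list. The name in_vocabulary_rate and the small WORD_LIST are assumptions made for the example, standing in for a real spell checker's dictionary.

```python
import re

# Hypothetical stand-in for a spell checker's dictionary.
WORD_LIST = {"see", "you", "tomorrow", "at", "the", "party"}

def in_vocabulary_rate(sentence, word_list=WORD_LIST):
    """Fraction of tokens in the sentence that are found in word_list.
    Returns 0.0 for a sentence with no tokens."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in word_list)
    return known / len(tokens)

original = "c u 2moro at the party"
system_output = "see you tomorrow at the party"

print(in_vocabulary_rate(original))
print(in_vocabulary_rate(system_output))
```

A measure of this kind only checks that the output words exist; it cannot tell whether the output preserves the meaning of the input, which is presumably why the second step relies on human ranking.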