Training Dataset (data already available for download)

Here you can download the training dataset for the CAW2 2009 shared-tasks. You must be logged in to download the data. If you have not yet created an account, do so now. Once you are logged in, return to this page and you will see the download link below.

The provided dataset is intended to be a representative sample of what can be found in Web 2.0. The data have been collected from six different public sites, which carry different kinds of information and are written in different styles. Some of them may be more appropriate for one specific shared-task than for the others, but all three tasks should be able to deal with any message from any site. Participant teams are encouraged to preprocess and filter the dataset to select the parts most relevant to the shared-task they intend to participate in.

Text normalization can be seen as a preliminary step before the other two tasks. However, teams are not expected to participate in all the tasks. For this reason, sites for which text normalization is critical should preferably be used for shared-task 1, and those for which it is not critical should preferably be used for shared-tasks 2 and 3.

No additional data should be used in any shared-task unless it can be shared with, and made publicly available to, all participant teams. If you intend to use any additional data or resources, you must either upload them to this site or provide a download link so that they are available to all other participant teams.

The following table presents a brief summary of the dataset. The links to the download pages appear below the legal notices.

Website      Original site URL             Messages   Zip file size   Main characteristic
Twitter      http://twitter.com/           900,000    58 MB           Short messages, chatspeak style
Myspace      http://www.myspace.com/       380,000    53 MB           Forum discussions
Slashdot     http://slashdot.org/          140,000    34 MB           Comments on news-posts
Ciao         http://www.ciao.com/          20,000     14 MB           Ratings about movies
Kongregate   http://www.kongregate.com/    150,000    2.4 MB          Real-time chats, games on-line

The data follow the DTD that you will find at the end of this page.
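Before reading the DTD in detail, it can help to see which elements actually occur in a downloaded archive. The following Python sketch is only an illustration: it assumes that each zip file contains one or more .xml files, which this page does not state explicitly, and the element names in the demo archive (posts, post) are invented rather than taken from the DTD. It uses only the standard library and streams each file, so large archives do not need to fit in memory.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(zip_source):
    """Count how often each XML element tag occurs across the .xml files in a zip archive."""
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # assumption: the messages live in .xml files
            with archive.open(name) as handle:
                # iterparse streams the file instead of loading it whole
                for _event, element in ET.iterparse(handle, events=("end",)):
                    counts[element.tag] += 1
                    element.clear()  # drop the element's content once counted
    return counts


def demo_archive():
    """Build a tiny in-memory archive; its element names are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "sample.xml",
            "<posts><post>hi</post><post>u there?</post></posts>",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace demo_archive() with the path to a downloaded zip file.
    print(count_tags(demo_archive()))
```

Comparing the resulting tag counts with the elements declared in the DTD is a quick sanity check before writing any task-specific filtering. Note that the standard-library parser does not validate against a DTD; a validating parser would be needed for that.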

Please read the following important legal notices before downloading the dataset:

1.- The training dataset has been automatically crawled from public websites by Fundación Barcelona Media (FBM). The dataset is the property of FBM and is released for research purposes only. The dataset cannot be used for any commercial or non-commercial activity other than research.

2.- FBM is not responsible for any private or sensitive information contained within the dataset. If you find any private or sensitive information that should be removed from the dataset, you should notify FBM immediately (caw2 at barcelonamedia dot org).

3.- You are not allowed to provide the dataset to any third party or collaborator. Any third party or collaborator interested in the dataset must register and download the dataset from the workshop website.

4.- You should acknowledge FBM and provide the link to this website in any publication or notification about any research work that uses the provided dataset.

5.- FBM is not responsible for the use you make of this dataset or for the related consequences.

6.- By downloading the dataset you explicitly agree to all the terms and conditions of use stated here.

Download area: Ciao, Kongregate, MySpace, Slashdot, Twitter

Attachment      Size
data.dtd_.txt   1.22 KB