Here you can download the training dataset for the CAW2 2009 shared-tasks. You must be logged in to download the data. If you have not yet created an account, do so now. Once you are logged in, come back to this page and you will see the download link below.
The provided dataset is intended to be a representative sample of what can be found on Web 2.0 sites. The data have been collected from six different public sites, which contain different kinds of information and are written in different styles. Some of them can be thought of as more appropriate for one specific shared-task than for the others, but systems for all three tasks should be able to deal with any message from any site. Participating teams are encouraged to preprocess and filter the dataset to select the parts most relevant to the shared-task they intend to participate in.
Text normalization can be seen as a preliminary step before performing the other two tasks. However, teams are not expected to participate in all the tasks. For this reason, the sites for which text normalization is critical should preferably be used for shared-task 1, and those for which it is not critical should preferably be used for shared-tasks 2 and 3.
No additional data should be used in any shared-task unless it can be shared with, and made publicly available to, all participating teams. If you intend to use any additional data or resources, you must either upload them to this site or provide a download link so that they are available to all other participating teams.
The following table presents a brief summary of the dataset. The links to the download pages are below the legal notices.
| Website | Original site URL | Messages | Zip file size | Main characteristic |
|---|---|---|---|---|
| Twitter | http://twitter.com/ | 900,000 | 58 MB | Short messages, chatspeak style |
| MySpace | http://www.myspace.com/ | 380,000 | 53 MB | Forum discussions |
| Slashdot | http://slashdot.org/ | 140,000 | 34 MB | Comments on news posts |
| Ciao | http://www.ciao.com/ | 20,000 | 14 MB | Ratings about movies |
| Kongregate | http://www.kongregate.com/ | 150,000 | 2.4 MB | Real-time chats, online games |
The data follow the DTD that you will find at the end of this page.
Please read the following important legal notices before downloading the dataset:
1.- The training dataset has been automatically crawled from public websites by Fundación Barcelona Media (FBM). The dataset is the property of FBM and is released for research purposes only. The dataset cannot be used for any commercial or non-commercial activity other than research.
2.- FBM is not responsible for any private or sensitive information contained within the dataset. If you find any private or sensitive information that should be removed from the dataset, you should notify FBM immediately (caw2 at barcelonamedia dot org).
3.- You are not allowed to provide the dataset to any third party or collaborator. Any third party or collaborator interested in the dataset must register and download the dataset from the workshop website.
4.- You should acknowledge FBM and provide the link to this website in any publication or notification about any research work that uses the provided dataset.
5.- FBM is not responsible for the use you make of this dataset or for the related consequences.
6.- By downloading the dataset you explicitly agree to all the terms and conditions of use set out here.
Download area: Ciao, Kongregate, MySpace, Slashdot, Twitter
| Attachment | Size |
|---|---|
| data.dtd_.txt | 1.22 KB |