MySpace corpus

MySpace: www.myspace.com

MySpace is a popular social networking site that lets its registered users take part in forum discussions on several predefined topics. Any user can start a new thread in these forums and participate freely in threads created by other users. Depending on the forum topic, moderators may exist who can remove certain types of content and even ban a user.

The threads have been chosen from three different forum topics:

 

The training data set contains about 380,000 comments in 16,346 threads, each dealing with one of the above-mentioned topics.

The zip file contains 16,346 files, one for each thread together with its comments. Each file has the following format:

<thread id="MS_3906_470">
<title>a string</title><!-- the title of the thread -->
<topics>
<first>
<title>Campus Life</title> <!-- the main forum topic -->
</first>
<second>
<title>General</title><!-- the secondary forum topic -->
</second>
</topics>
<posts> <!-- the list of posts of this thread -->
<post id="MS_3906_98934"><!-- the post id  -->
<user id="MS_139431535"><!-- the user id of the author of the post -->
<username>a username</username><!-- the name of the user -->
<sex>M</sex><!-- the sex of the user  (M or F) -->
<age>27</age><!-- the age of the user -->
<city>a city</city><!-- the home city of the user -->
<province>a province</province><!-- the home province of the user -->
<country>a country</country><!-- the home country of the user -->
</user>
<date>1080718080</date><!-- unix time stamp format of the date the post was published -->
<body>The text of the opening post of the thread</body>
</post>
<post id="MS_3906_4469055345"><!-- the next post id  -->
<user id="MS_7285237645"><!-- the user id of the author of the next post -->
<username>another username</username><!-- the name of this user -->
<sex>F</sex><!-- the sex of this user  (M or F) -->
<age>45</age><!-- the age of this user -->
<city>another city</city><!-- the home city of the user -->
<province>another province</province><!-- the home province of this user -->
<country>another country</country><!-- the home country of this user -->
</user>
<date>1213464300</date><!-- unix time stamp format of the date the post was published -->
<body>The text of the next post</body>
</post>
<post> <!-- other posts --></post>
<post> <!-- other posts --></post>
</posts>
</thread>

Attachment: MySpace.zip (52.8 MB)