Web mining deals with understanding the World Wide Web and discovering information in it. It focuses on analyzing three sources of information: web structure, user activity and content. In Web 2.0, data on web structure and user activity can be handled in much the same way as on the traditional Web; for content, however, conventional analysis and mining procedures are no longer suitable. This is mainly because Web 2.0 content is generated by users, who use language very freely and constantly incorporate new communication elements that are generally context dependent. This kind of language can also be found in chats, SMS, e-mails and other channels of informal textual communication.
This workshop focused on the problem of making Web 2.0 both searchable and analyzable in terms of its content. This is an extremely important endeavor for current web mining technologies for two reasons: first, user generated content (UGC) is growing faster than ever in cyberspace and, second, automatic analysis of UGC will make it possible to improve ordinary citizens' experience of Internet resources and opportunities while, at the same time, detecting and tracking criminal and terrorist activity.
In this first edition of the workshop we attempted to draw the attention of interested research groups and companies to the new challenges and opportunities related to Web 2.0 content analysis. More specifically, we focused on tasks within the scope of text content mining, with the intention of extending the coverage to multimedia data in future editions of the workshop. Accordingly, for this first edition, we collected and provided a corpus that was used as the experimental collection for research in three shared tasks: text normalization, opinion mining and misbehavior detection.
In the text normalization shared task we wanted to address the problem of chat-speak style communication. Some research has recently been carried out in this area for SMS communications and from the perspective of machine translation. In this shared task we attempted to generalize the problem to Web 2.0 content and to explore any additional alternatives the participants could come up with.
In the opinion mining shared task we wanted to address problems such as determining text subjectivity and polarity, and sentiment analysis. Although these problems have already been approached from different perspectives, most of the research has been carried out on domain-specific data and on applications where users are asked to rate services or products. Our intention was to focus attention on the more general domain in which Web 2.0 users express their sentiments and opinions in their daily interaction within a virtual community.
Finally, in the misbehavior detection shared task, we wanted to address the problem of detecting inappropriate activity in which some users of a virtual community harass or are offensive toward other members of the community. We considered that this shared task could provide a good starting point for a future shared task with the more ambitious goal of classifying users and detecting identity impersonation for on-line criminal activity.
Feel free to join us and participate in this small quest to improve our understanding of virtual communities…
The organizing committee
Barcelona, December 2008.
- Joan Codina - Universitat Pompeu Fabra
- Jens Grivolla - Barcelona Media Innovation Centre
- Andreas Kaltenbrunner - Barcelona Media Innovation Centre
- Rafael E. Banchs - Barcelona Media Innovation Centre
- Ricardo Baeza-Yates - Yahoo! Research