In the misbehavior detection shared task, we address the problem of detecting inappropriate activity in which some users of a virtual community harass or offend other members of the community. We consider this shared task a good starting point for a future shared task with the more ambitious goal of classifying users and detecting identity impersonation in on-line criminal activity.
The great amount of interaction between users on the Web 2.0, together with the nearly total absence of restrictions and control on what is published on many websites, opens the door to all types of misbehavior. Some forms of misbehavior are direct extensions of classic human rudeness in interpersonal relationships, while others can only be found on-line, where anonymity allows users to overcome certain social inhibitions of the real world. Many websites use a user-driven moderation system as a soft form of regulation of user behavior.
The objective of this task is thus the detection of the types of misbehavior which can be found on typical websites with user-generated content (UGC). This is of course a very wide area, so we will restrict the task to the following two forms of misbehavior:
1. Some users perform annoying actions towards others: they either systematically depreciate the contributions of another user or, on the contrary, are too keen on trying to establish a connection with a certain user who is not interested in this connection.
2. Some users change the subject of a discussion. This could be related to spam detection, but here we will focus on the change of subject itself.
This shared task is organized in two tracks, which correspond to the detection of the two forms of misbehavior mentioned above. For both tracks, a common training dataset should be used in order to allow for system performance comparison. This training dataset is described at, and can be downloaded from, this link. Similarly, evaluation datasets, which are specific to each track, will be provided for evaluation purposes during the first week of March 2009.
Training data for both tracks is restricted to the provided dataset, from which each participating team should filter and extract the subsets considered relevant to the specific approach to be implemented. Any approach is allowed: rule-based, statistical, supervised, unsupervised, etc. Any additional sources of information and resources, such as dictionaries, language models, annotated data (extracted from the provided dataset), etc., should be notified to and shared with the other participating teams through the resource-sharing section of the shared task.
Evaluation of system performance will be conducted over a test dataset specifically prepared for the task. For each comment in the test dataset, system outputs should provide a degree of association with each basic category. Evaluation scores for system outputs will be computed in terms of vector distances between each system output and its corresponding reference vector. Reference vectors will be computed from manual annotations of the test dataset, which will be made public after submission of the evaluation results.
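For illustration, the following Python sketch shows how such a vector-distance evaluation could be computed. The category names, the per-comment output format, and the use of Euclidean distance are assumptions made for the example only; the actual categories and distance measure will be fixed with the evaluation data.

```python
import math

# Hypothetical category layout: one degree of association per form of
# misbehavior, in a fixed order. The real task may define different categories.
CATEGORIES = ["form_1", "form_2"]

def vector_distance(system, reference):
    # Euclidean distance between a system output vector and the reference
    # vector derived from the manual annotations of one comment.
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(system, reference)))

def evaluate(system_outputs, references):
    # Average per-comment distance over the whole test set; lower values
    # mean the system's degrees of association are closer to the annotations.
    distances = [vector_distance(system_outputs[cid], vec)
                 for cid, vec in references.items()]
    return sum(distances) / len(distances)

# Toy example with two comments and two categories.
system_outputs = {"comment_1": [0.9, 0.1], "comment_2": [0.2, 0.7]}
references     = {"comment_1": [1.0, 0.0], "comment_2": [0.0, 1.0]}
print(evaluate(system_outputs, references))  # ~0.25 for this toy data
```

A system that outputs exactly the reference degrees of association would score 0 under this sketch; the evaluation thus rewards graded judgments rather than hard binary labels.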