The invited speakers will be Hugo Zaragoza and Philippe Roussel.
Hugo Zaragoza is a researcher working on Information Retrieval at Yahoo! Research Barcelona. He is interested in the applications of machine learning (ML) and natural language processing (NLP) to information retrieval (IR). More specifically, Hugo is interested in developing measures of relevance (i.e. ranking functions) between linguistic objects such as search queries and web documents. From 2001 to 2006 he worked at Microsoft Research (Cambridge, UK) with Stephen Robertson, mostly on probabilistic ranking methods for corporate and web search, but also on document classification, expert finding, relevance feedback, and dialogue generation for games. While at Microsoft Research he also spent a considerable amount of time collaborating with Microsoft product groups such as MSN Search and SharePoint Portal Server. He studied computer science and completed a Ph.D. at the LIP6 (University of Paris 6) on the application of dynamic probabilistic models to a wide range of information access problems.
At Yahoo! Research Barcelona we are developing new search prototypes that go beyond document retrieval, bringing to the user information locked inside web pages or user-generated data. I will discuss the most recent work in this area, including entity ranking in Wikipedia and news, entity visualization, browsing with semantic roles, and work on structured (semantic) data on the web.
Philippe Roussel is a senior researcher at the Barcelona Media Innovation Center, working on web crawling techniques for specific kinds of web sites, such as social networks or newspapers, for subsequent content analysis. After his early work in the field of logic programming, where he co-created the Prolog language, he has worked in the field of information systems, both as an academic at the University of Nice Sophia-Antipolis and as a consultant for private companies.
More and more, web crawling can be envisaged as part of a reverse engineering process which consists in re-constituting the original databases used to generate web pages. After a survey of past and current tools used for solving this task, and after presenting a case analysis about newspapers, we will show how a good knowledge of the page HTML templates for a given web site can not only allow the extraction of structured information, but can also improve the crawling of web sites with frequent updates.
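The template-driven extraction idea above can be sketched minimally as follows. This is only an illustration of the general technique, not the speaker's actual system: the sample page, the class names, and the field patterns are all hypothetical, standing in for a site whose fixed HTML template is known in advance.

```python
import re

# A hypothetical article page following a fixed newspaper template.
# All class names and content here are illustrative only.
html = """
<div class="article">
  <h1 class="headline">Local Team Wins Final</h1>
  <span class="byline">By A. Reporter</span>
  <span class="date">2009-05-12</span>
  <div class="body"><p>The match ended 2-1 after extra time.</p></div>
</div>
"""

# Because the site's template is known, each field can be located with a
# simple pattern anchored on the template's own markup (class names, tags).
TEMPLATE_FIELDS = {
    "headline": re.compile(r'<h1 class="headline">(.*?)</h1>', re.S),
    "author":   re.compile(r'<span class="byline">By (.*?)</span>', re.S),
    "date":     re.compile(r'<span class="date">(.*?)</span>', re.S),
}

def extract_record(page):
    """Re-constitute one structured record (a 'database row') from a page."""
    record = {}
    for field, pattern in TEMPLATE_FIELDS.items():
        m = pattern.search(page)
        record[field] = m.group(1).strip() if m else None
    return record

record = extract_record(html)
```

In the same spirit, a field like `date` extracted this way could also guide recrawling: pages whose template-extracted timestamp is recent are revisited more often, which is one way template knowledge can improve the crawling of frequently updated sites.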