SPE-CIM MEETING: STATISTICS AND DATA MINING

SPE News
2010-09-26

An SPE-CIM Meeting on the theme "Statistics and Data Mining" took place on 17 June.

 

Organizer:
Pedro Duarte Silva
FEG/CEGE, Universidade Católica Portuguesa, C.R. Porto

Discussant:
Paulo Gomes
Comissão de Coordenação e Desenvolvimento Regional do Norte & ISEGI - Univ. Nova de Lisboa

PROGRAM:

Data Mining from Data Streams 
João Gama, Faculdade de Economia – Universidade do Porto

Semi-supervised Classification and Clustering 
Mário Figueiredo, Instituto de Telecomunicações; IST

Desirable Properties of Clustering Solutions
Margarida Cardoso, ISCTE – Instituto Universitário de Lisboa

ABSTRACTS

 

Data Mining from Data Streams
João Gama
Faculdade de Economia – Universidade do Porto


The Data Mining community faces new challenges with the advent of sources that produce a continuous flow of data. These sources, known as data streams, are characterized by a high-speed flow of huge amounts of data generated from non-stationary distributions. As a consequence, new learning techniques are needed to process streaming data in reasonable time and space. The goal of this talk is to present and discuss the research problems and challenges in learning from data streams. We will also discuss current trends, open issues, and future directions in this area.
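As a small illustration of what processing a stream "in reasonable time and space" can mean, the following is a minimal sketch, not taken from the talk, of Welford's one-pass algorithm for the running mean and variance. The names `RunningStats` and `stream_stats` are hypothetical, chosen only for this example. Each observation is seen once and then discarded, so memory use is constant regardless of the stream's length.

```python
class RunningStats:
    """One-pass mean and variance (Welford's algorithm) in O(1) memory."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Incorporate one new observation without storing past values.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def stream_stats(stream):
    # Consume any iterable (including a generator) one item at a time.
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

This sketch assumes a stationary stream. Handling the non-stationary distributions mentioned in the abstract would require something further, such as a sliding window or a forgetting factor, which is not shown here.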

 

Semi-supervised Classification and Clustering
Mário Figueiredo
Instituto de Telecomunicações; Instituto Superior Técnico


Recently, there has been considerable interest in non-standard statistical learning scenarios, namely in the so-called semi-supervised learning problems.
Most formulations of semi-supervised learning view the problem from one of two (symmetrical) perspectives: supervised learning (namely, classification) with some (possibly many) missing labels, or unsupervised learning (namely, clustering) with additional information. In this talk, I will review recent work in these two areas, with special emphasis on our own work. For semi-supervised learning of classifiers, I will describe an approach that, unlike previous approaches, is non-transductive and thus computationally inexpensive to use on future data. For semi-supervised clustering, I will present a new method that is able to incorporate pairwise prior information in a computationally efficient way.


Desirable Properties of Clustering Solutions
Margarida Cardoso
ISCTE – Instituto Universitário de Lisboa


In the present work we focus on the evaluation of clustering solutions (partitions in particular). Clustering evaluation generally relies on some desirable properties of clustering solutions: cluster compactness and separation, as well as stability, are often considered indicators of clustering quality. In fact, since the real clustering is unknown (clustering being the result of an unsupervised process), one should focus on obtaining good enough partitions.
Clustering quality is, however, a difficult concept to put into practice. Furthermore, when aiming for cluster compactness and separation, one does not necessarily recover the real clusters. Similarly, when focusing on the property of stability, one may find solutions that are more stable but do not necessarily fit the real solution better. Recent contributions concerning these issues, as addressed by the different paradigms of Statistics and Data Mining, will be discussed in the presentation.
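To make the notions of compactness and separation concrete, here is a minimal sketch under simple assumptions. It is not the measure used in the presentation. Compactness is taken as the mean Euclidean distance from each point to its own cluster centroid, and separation as the smallest distance between two centroids. The function names are hypothetical, and at least two clusters are assumed.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(points):
    # Coordinate-wise mean of a non-empty list of equal-length tuples.
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def compactness_and_separation(points, labels):
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: centroid(pts) for lab, pts in clusters.items()}

    # Compactness: mean distance to the own-cluster centroid (lower is tighter).
    compactness = sum(
        dist(p, centroids[lab]) for p, lab in zip(points, labels)
    ) / len(points)

    # Separation: smallest centroid-to-centroid distance (higher is better).
    labs = list(centroids)
    separation = min(
        dist(centroids[a], centroids[b])
        for i, a in enumerate(labs)
        for b in labs[i + 1:]
    )
    return compactness, separation
```

For example, the points `(0, 0), (0, 2), (10, 0), (10, 2)` with labels `0, 0, 1, 1` give a compactness of 1.0 and a separation of 10.0. As the abstract cautions, a partition that scores well on such indices is not guaranteed to match the real clusters.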

Published in SPE News