Speaker: Brunero Liseo, Sapienza - Università di Roma
Abstract: Record linkage is a statistical method that aims to identify whether two or more records refer to the same statistical entity. Duplications of the same entity within a single source or across different files may be interpreted as "clusters of records" showing strong similarities across their fields. In this paper we frame the record linkage process as a formal Bayesian clustering model and investigate the role of species sampling models as prior distributions for the clustering structure. In fact, the different latent entities underlying one or more data sources can be treated as the sampled species, and the observed records as noisy measurements of their features. We also discuss an important issue in the clustering approach to entity resolution, namely the need to bound the cluster sizes even for large data sets. The proposed statistical models will be used both in a classical record linkage scenario and in the more complex framework of multiple duplications within the same data source, and will serve as models for supporting Bayesian regression analyses with linked data.
Key words: Species Sampling, Latent Structure, Hit and Miss Algorithm, Clustering.
1. Tancredi, A. and Liseo, B. (2015). Regression analysis with linked data: problems and possible solutions. Statistica, 75(1), 19-35.
2. Lahiri, P. and Larsen, M. D. (2005). Regression analysis with linked data. Journal of the American Statistical Association, 100, 222-230.
3. Tancredi, A. and Liseo, B. (2011). A hierarchical Bayesian approach to record linkage and population size problems. The Annals of Applied Statistics, 5(2B), 1553-1585.
4. Steorts, R. C., Hall, R. and Fienberg, S. E. (2013). A Bayesian approach to graphical record linkage and de-duplication. Journal of the American Statistical Association (in press). Preprint arXiv:1312.4645.
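
The abstract treats records as noisy observations of latent entities and places a species sampling prior on how records group into clusters. As a rough illustration only, the sketch below draws a random partition from the Chinese restaurant process, which is one well-known member of the species sampling family; it is not claimed to be the prior used in the talk. The function name `sample_crp_partition` and the parameter `alpha` are hypothetical choices made for this example.

```python
import random
from collections import Counter


def sample_crp_partition(n_records, alpha, seed=0):
    """Draw cluster labels for n_records from a Chinese restaurant process.

    Illustrative assumption: the CRP stands in for a generic species
    sampling prior; alpha > 0 controls how readily new clusters open.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster (latent entity) assigned to record i
    sizes = []   # sizes[k] = number of records currently in cluster k
    for _ in range(n_records):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels


# Example: cluster sizes for 20 records under a fairly diffuse prior.
labels = sample_crp_partition(20, alpha=5.0, seed=1)
print(Counter(labels))
```

Under this particular prior the largest clusters tend to keep growing as the number of records increases, which is hard to reconcile with an entity that is duplicated only a handful of times. That is one way to read the abstract's remark about needing to bound cluster sizes even for large data sets.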