Auto Data 3 38 2012 Srpski Jezik.epub
In this paper, we propose a method for sub-sampling the data distribution in supervised classification. We consider a classifier that can predict the value of a continuous feature, with a given probability, when that feature is associated with a given label. In particular, we consider a linear classifier on the one hand and a nearest-neighbors classifier on the other. In the first case, we propose a method for sub-sampling the data distribution for the considered label and feature, using a sparse approximation of the logistic function. In the second case, we propose a method for selecting a subset of training data points, for a given label and feature, using a sparse approximation of the distance between the training data and their nearest neighbors. Both methods are tested and compared on two datasets: the Iris dataset from the UCI repository and the diabetes dataset from the Open Source Physics repository. We conclude by discussing the limits of the proposed method.
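To make the idea of selecting a subset of training points per label and feature more concrete, here is a minimal sketch in Python. It is not the method described above: the sparse approximation is not specified in the text, so the sketch swaps in a simple greedy distance-thinning rule on a single feature. The function name `subsample_by_label` and the parameter `min_gap` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def subsample_by_label(points, min_gap):
    """Greedy per-label thinning on one continuous feature.

    points  -- list of (feature_value, label) pairs
    min_gap -- hypothetical threshold: a point is kept only if it lies at
               least this far from the previously kept point of the same label

    This is an illustrative stand-in, not the sparse approximation
    mentioned in the text.
    """
    by_label = defaultdict(list)
    for value, label in points:
        by_label[label].append(value)

    kept = []
    for label, values in by_label.items():
        last_kept = None
        # Walk the values in sorted order so that each point's nearest
        # already-kept neighbor is simply the last kept value.
        for value in sorted(values):
            if last_kept is None or value - last_kept >= min_gap:
                kept.append((value, label))
                last_kept = value
    return kept


if __name__ == "__main__":
    data = [(1.0, "a"), (1.05, "a"), (2.0, "a"), (1.0, "b")]
    print(subsample_by_label(data, min_gap=0.5))
```

With the toy data above, the point at 1.05 is dropped because it lies within `min_gap` of an already kept point with the same label, while points with different labels are never compared to each other.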
In this paper, we propose a new methodology for evaluating the quality of a dataset, based on estimating the Shannon entropy of its probability distribution. To this end, a new measurement of the entropy, called the expected entropy, is proposed. It is defined as the expected value of the Shannon entropy with respect to a given probability distribution. The proposed measurement is shown to be a good indicator of the quality of the data and of the complexity of the data distribution. Moreover, the method is able to detect the presence of more than one class in the data distribution. The proposed method is applied to the Adult dataset from the UCI repository and to the diabetes dataset from the Open Source Physics repository. An initial experimental analysis of the two datasets is performed, with the aim of identifying the best feature for the problem. Moreover, the quality of the proposed method is compared to the quality of the Iris dataset from the UCI repository.
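As an illustration of the quantities involved, the sketch below computes the Shannon entropy of a discrete distribution and one possible reading of the expected entropy: a weighted average of the entropies of several candidate distributions. The text does not give an exact formula, so the function `expected_entropy` and its `weights` argument are assumptions made for this example only.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution.

    Zero-probability outcomes contribute nothing, by the usual convention.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_entropy(distributions, weights):
    """Weighted average of Shannon entropies (an assumed interpretation).

    distributions -- list of discrete distributions (each sums to 1)
    weights       -- probability assigned to each distribution (sums to 1)
    """
    return sum(w * shannon_entropy(d) for d, w in zip(distributions, weights))


if __name__ == "__main__":
    uniform = [0.5, 0.5]      # maximally uncertain over two outcomes
    degenerate = [1.0, 0.0]   # no uncertainty at all
    print(shannon_entropy(uniform))
    print(expected_entropy([uniform, degenerate], [0.5, 0.5]))
```

Under this reading, a dataset whose candidate distributions are mostly near-uniform yields a high expected entropy, while one dominated by a single outcome yields a value close to zero.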