Auto Data 3 38 2012 Srpski Jezik.epub
In this paper, we propose a method for sub-sampling the data distribution in supervised classification. We consider a classifier that can predict, with a given probability, the value of a continuous feature associated with a given label. In particular, we consider a linear classifier on the one hand and a nearest-neighbors classifier on the other. In the first case, we propose a method for sub-sampling the data distribution for the considered label and feature, using a sparse approximation of the logistic function. In the second case, we propose a method for selecting a subset of training data points, for a given label and feature, using a sparse approximation of the distance between the training data and their nearest neighbors. Both methods are tested and compared on two datasets: the Iris dataset from the UCI repository and the diabetes dataset from the Open Source Physics repository. We conclude by discussing the limits of the proposed methods.
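The abstract does not spell out how the subset of training points is chosen for the nearest-neighbors case. As a rough illustration of the general idea of keeping only a subset of the training data, the sketch below uses Hart's condensed nearest neighbor rule, a different and much simpler technique than the sparse approximation described above. The function names `nearest_label` and `condensed_subset` and the toy data are assumptions made for this example only.

```python
import math


def nearest_label(point, prototypes):
    """Return the label of the prototype closest to `point` (Euclidean distance)."""
    best = min(prototypes, key=lambda p: math.dist(point, p[0]))
    return best[1]


def condensed_subset(data):
    """Hart's condensed nearest neighbor rule.

    `data` is a list of (features, label) pairs. A point is added to the
    subset only if 1-NN on the current subset misclassifies it. This is an
    illustrative stand-in, not the method proposed in the text.
    """
    subset = [data[0]]
    changed = True
    while changed:
        changed = False
        for item in data:
            if item in subset:
                continue
            if nearest_label(item[0], subset) != item[1]:
                subset.append(item)
                changed = True
    return subset


# Toy one-dimensional data with two well-separated labels (hypothetical values).
data = [
    ((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
    ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b"),
]
subset = condensed_subset(data)
```

On this toy input the retained subset is much smaller than the training set, yet a 1-NN classifier built on it still labels every original point correctly. Real datasets with overlapping classes will generally keep many more points.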
In this paper, we propose a new methodology for evaluating the quality of a dataset, based on an estimate of the Shannon entropy of its probability distribution. To this end, we introduce a new entropy measure, called the expected entropy, defined as the expected value of the Shannon entropy with respect to a given probability distribution. The proposed measure is shown to be a good indicator of both the quality of the data and the complexity of the data distribution. Moreover, the method can detect the presence of more than one class in the data distribution. The method is applied to the Adult dataset from the UCI repository and to the diabetes dataset from the Open Source Physics repository. An initial experimental analysis of the two datasets is performed in order to identify the best feature for the problem. Finally, the quality reported by the proposed method is compared with that obtained for the Iris dataset, also from the UCI repository.
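To make the definition concrete, here is a minimal sketch of the Shannon entropy of an empirical distribution and of an expected entropy taken as a weighted average of entropies. The text does not say which probability distribution the expectation is taken over, so the weighting over groups of samples used here is an assumption, as are the function names `shannon_entropy` and `expected_entropy` and the toy labels.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    # p * log2(1 / p) summed over observed symbols, with p = c / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def expected_entropy(groups, weights):
    """Expected value of the per-group entropy under `weights`.

    Assumption: the "given probability distribution" is a set of weights
    over groups of samples; the text does not specify this.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * shannon_entropy(g) for g, w in zip(groups, weights))


labels_a = ["x", "x", "y", "y"]  # two equally likely symbols
labels_b = ["x", "x", "x", "x"]  # a single symbol
h_a = shannon_entropy(labels_a)  # 1.0 bit
h_b = shannon_entropy(labels_b)  # 0.0 bits
h_exp = expected_entropy([labels_a, labels_b], [0.5, 0.5])  # 0.5 bits
```

A group containing a single class has zero entropy, while a balanced mixture of two classes has one bit, which is consistent with the claim that the measure can signal the presence of more than one class.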