A Novel Data Pre-processing Method: K-Harmonic Means Based Feature Weighting
Journal Name:
- Dicle Üniversitesi Mühendislik Fakültesi Mühendislik Dergisi
Abstract (2. Language):
Data mining is the analysis of data for relationships
that have not previously been discovered (Olson &
Delen, 2008). Data mining involves various stages.
One of the most important is the data pre-processing
stage. In this stage, different methods such as
normalization, feature selection, completion of
missing data and feature weighting are applied.
Data pre-processing methods are applied to input
data in order to improve the performance of the
classification algorithms. Feature weighting
methods, which form part of data pre-processing,
are of great importance in the field of machine
learning. With weighting methods, a linearly
inseparable dataset can be converted into a linearly
separable one (Polat & Gunes, 2006).
Feature weighting algorithms are usually developed
by using clustering methods. In the literature,
researchers have presented different methods on this
subject. In particular, successful results have been
obtained with k-means clustering-based
weighting methods (Güneş et al., 2010; Polat &
Durduran, 2012). The k-means method is an
effective method, although it also has some
disadvantages. The most important disadvantage is
its sensitivity to the initial choice of cluster
centres. Alternative methods have been developed
to address this problem with the k-means
algorithm. One of these is the k-harmonic
means clustering algorithm, which
uses the harmonic mean of the distances from each
data point to the cluster centres. Unlike the k-means
clustering algorithm, the k-harmonic means
clustering algorithm is insensitive to the selection of
the starting point. It is therefore expected
to give effective results when used for
feature weighting. The proposed weighting
method is called khmFW; its aim is to convert a
linearly inseparable dataset into a linearly separable
one. To achieve this, the proposed
weighting method aggregates data points which are
close to each other.
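The clustering step that khmFW builds on can be sketched as follows. This is a minimal illustration of standard k-harmonic means clustering, not the khmFW weighting itself, whose formula the abstract does not give. The function names (`khm_objective`, `khm_step`, `khm`), the exponent `p` and the small floor `eps` on distances are assumptions made for this example.

```python
import math


def khm_objective(points, centres, p=2.0, eps=1e-9):
    """Sum, over all points, of the harmonic mean of d(x, c_j)**p across centres."""
    total = 0.0
    for x in points:
        # eps keeps the reciprocal finite when a point coincides with a centre
        inv_sum = sum(1.0 / max(math.dist(x, c), eps) ** p for c in centres)
        total += len(centres) / inv_sum
    return total


def khm_step(points, centres, p=2.0, eps=1e-9):
    """One k-harmonic means update: every point contributes to every centre."""
    dim = len(points[0])
    num = [[0.0] * dim for _ in centres]
    den = [0.0] * len(centres)
    for x in points:
        d = [max(math.dist(x, c), eps) for c in centres]
        s = sum(dj ** -p for dj in d)
        for j, dj in enumerate(d):
            # soft membership times point weight, combined into one factor
            w = dj ** (-p - 2) / (s * s)
            den[j] += w
            for k in range(dim):
                num[j][k] += w * x[k]
    return [tuple(n / den[j] for n in num[j]) for j in range(len(centres))]


def khm(points, centres, p=2.0, iters=50):
    """Run a fixed number of updates (illustrative stopping rule only)."""
    for _ in range(iters):
        centres = khm_step(points, centres, p)
    return centres
```

Because each point influences every centre through the harmonic mean, a centre that starts far from any data is still pulled towards it, which is the property behind the reduced sensitivity to initialisation described above.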
To test the effectiveness of the computer-aided
medical diagnostic systems they develop,
researchers conduct experiments on publicly
available datasets. One source of such datasets
is the UCI machine learning
repository [3]. The proposed system in this
study has been tested on a heart disease dataset and
the BUPA liver disorders dataset obtained from this
repository. Neither dataset is linearly separable.
The features obtained after the feature weighting
process were then classified using three different
classification algorithms. To test the
performance of the proposed method, classification
accuracy, specificity, sensitivity, mean absolute
error, root mean square error, the kappa value and
the area under the ROC curve (AUC) were used.
The experimental results show that the proposed
method gave better results compared to existing
methods described in the literature. The proposed
system can serve as a useful medical decision
support tool for medical diagnosis.
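The evaluation measures listed above can be computed for a binary diagnosis task roughly as follows. This is a sketch under stated assumptions: labels are coded 1 (disease) and 0 (healthy), mean absolute error and root mean square error are taken on the hard 0/1 predictions rather than on predicted probabilities, and the function names `binary_metrics` and `auc_from_scores` are illustrative, not taken from the paper.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, MAE, RMSE and Cohen's kappa.

    Assumes 0/1 labels and that both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    accuracy = (tp + tn) / n
    # chance agreement from the row and column totals of the confusion matrix
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mae": sum(abs(t - p) for t, p in pairs) / n,
        "rmse": math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n),
        "kappa": (accuracy - p_e) / (1 - p_e),
    }


def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # a tie between a positive and a negative score counts as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sensitivity and specificity are reported separately because, in diagnosis, missing a sick patient and flagging a healthy one carry different costs that accuracy alone hides.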
With the proposed method, accuracy rates of
89.62% and 85.79% were obtained for the heart
disease data set and the BUPA liver disorders data
set, respectively. These results are better
than those of many studies on the same data in the literature.
The proposed weighting method can
increase the success of classification algorithms
in data mining tasks, and it may also yield
better solutions for
other problems.
Abstract (Original Language, translated from Turkish):
Data mining is the process of discovering previously unknown, valuable information in large-scale data.
This process consists of several steps, the first of which is data collection and pre-processing.
Pre-processing methods are applied to the input data in order to improve the performance of classification
algorithms. Feature weighting methods, which are among the data pre-processing methods, are of great
importance in the field of data mining. In this study, a new feature weighting method has been developed.
The proposed weighting method is based on the k-harmonic means algorithm. The aim of this weighting
method is to convert a linearly inseparable dataset into a linearly separable one. The features obtained
after the weighting process were classified with three different classification algorithms. The developed
algorithm was applied to medical datasets and its success was evaluated. Datasets whose distribution is
not linearly separable were chosen. To test the success of the proposed method, classification accuracy,
sensitivity, specificity, mean absolute error, root mean square error, the kappa value and the area under
the ROC curve (AUC) were used. The experimental results show that the proposed method gives better
results than existing methods in the literature. The proposed system can serve as a useful medical
decision support tool for medical diagnosis.