Yeni bir veri önişleme metodu: k-Harmonik kümeleme tabanlı öznitelik ağırlıklandırma

A Novel Data Pre-processing Method: K-Harmonic Means based Feature Weighting

Abstract (2. Language): 
Data mining is the analysis of data for relationships that have not previously been discovered (Olson & Delen, 2008). It involves several stages, one of the most important of which is data pre-processing. In this stage, methods such as normalization, feature selection, completion of missing data and feature weighting are applied to the input data in order to improve the performance of classification algorithms. Among these, feature weighting methods are of great importance in machine learning. Weighting methods are used to convert a linearly inseparable dataset into a linearly separable one (Polat & Gunes, 2006). Feature weighting algorithms are usually built on clustering methods, and researchers have presented a variety of such methods in the literature. In particular, successful results have been obtained with k-means clustering-based weighting methods (Güneş et al., 2010; Polat & Durduran, 2012). The k-means method is effective, but it has some disadvantages, the most important being its sensitivity to the initial values of the cluster centres. Alternative methods have been developed to address this problem. One of them is the k-harmonic means clustering algorithm, which uses the harmonic mean of the distances from each data point to the cluster centres. Unlike k-means, the k-harmonic means algorithm is insensitive to the choice of initial centres, so it is expected to give effective results when used for feature weighting. The proposed weighting method, named khmFW, aims to convert a linearly inseparable dataset into a linearly separable one. 
To achieve this, the proposed weighting method gathers together data points that are close to each other. To test the effectiveness of the computer-aided medical diagnostic systems they develop, researchers run experiments on publicly available datasets. One of the sources used for this purpose is the UCI machine learning repository (Bache & Lichman, 2013). The proposed system was tested on a heart disease dataset and the BUPA liver disorders dataset obtained from this repository. Both datasets have distributions that cannot be separated linearly. The features obtained after the feature weighting process were then classified using three different classification algorithms. Performance was measured with classification accuracy, specificity, sensitivity, mean absolute error, root mean square error, the kappa value and the area under the ROC curve (AUC). With the proposed method, accuracy rates of 89.62% and 85.79% were obtained for the heart disease dataset and the BUPA liver disorders dataset, respectively. These results are better than those reported by many studies on the same data in the literature. The proposed system can therefore serve as a useful decision support tool for medical diagnosis. More generally, the proposed weighting method can increase the success of classification algorithms in data mining tasks and may yield better solutions for other problems.
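The harmonic-mean idea behind k-harmonic means can be illustrated with a minimal sketch of its objective, following the general form given by Zhang & Hsu (1999): for each data point, take the harmonic mean of its distances to all centres (raised to a power p) and sum over the points. The function name `khm_objective`, the default `p = 2` and the small `eps` guard are assumptions made for illustration; the feature weighting step of khmFW itself is not specified in this abstract and is not shown.

```python
import math


def khm_objective(points, centres, p=2.0, eps=1e-12):
    """Sum over points of the harmonic mean of d(x, c)^p over all centres.

    Illustrative sketch only: p and eps are assumed parameters.
    """
    k = len(centres)
    total = 0.0
    for x in points:
        inv_sum = 0.0
        for c in centres:
            # Guard against division by zero when a point sits on a centre.
            d = max(math.dist(x, c), eps)
            inv_sum += 1.0 / d ** p
        # Harmonic mean of the k powered distances for this point.
        total += k / inv_sum
    return total


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 3.9)]
    good = [(0.1, 0.05), (4.05, 3.95)]   # centres near the two groups
    poor = [(2.0, 2.0), (2.1, 2.1)]      # centres far from both groups
    print(khm_objective(data, good), khm_objective(data, poor))
```

Because the harmonic mean is dominated by its smallest term, every centre contributes to each point's cost but the nearest one contributes most, which is the property usually cited for the reduced dependence on initialization.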
Abstract (Original Language): 
Data mining is the process of discovering previously unknown, high-quality information in large-scale data. This process consists of several steps, the first of which is data collection and pre-processing. Pre-processing methods are applied to the input data to improve the performance of classification algorithms. Feature weighting methods, which are among the data pre-processing methods, are of great importance in the field of data mining. In this study, a new feature weighting method was developed using the k-harmonic means algorithm. The aim of this weighting method is to convert a linearly inseparable dataset into a linearly separable one. The features obtained after the weighting process were classified with three different classification algorithms. The performance of the developed algorithm was evaluated by applying it to medical datasets, and datasets with linearly inseparable distributions were chosen. To test the performance of the proposed method, classification accuracy, sensitivity, specificity, mean absolute error, root mean square error, the kappa value and the area under the ROC curve (AUC) were used. The experimental results showed that the proposed method gives better results than existing methods in the literature. The proposed system can serve as a useful decision support tool for medical diagnosis.
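Several of the evaluation measures listed above can be computed directly from a binary confusion matrix. The following is a minimal sketch, assuming labels coded as 0 and 1 and both classes present in the data; the function names are hypothetical, the AUC is omitted for brevity, and this is not the evaluation code used in the study.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "kappa": (accuracy - p_e) / (1.0 - p_e),
    }


def error_metrics(y_true, y_score):
    """Mean absolute error and root mean square error of predicted scores."""
    errors = [t - s for t, s in zip(y_true, y_score)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(truth, preds))
    print(error_metrics([1, 0], [0.5, 0.5]))
```

Reporting kappa alongside accuracy helps on imbalanced medical datasets, since kappa discounts the agreement that would be expected by chance.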
Pages: 767-779

REFERENCES

Ahmad, F., Isa, N.A.M., Hussain, Z. and Osman, M.K., (2013). Intelligent medical disease diagnosis using improved hybrid genetic algorithm-multilayer perceptron network, Journal of Medical Systems, 37, 2, 9934.
Al-Obeidat, F., Belacel, N., Carretero, J.A. and Mahanti, P., (2011). An evolutionary framework using particle swarm optimization for classification method PROAFTN, Applied Soft Computing, 11, 8, 4971-4980.
Alpaydın, E., (2014). Introduction to machine learning, MIT Press.
Bache, K. and Lichman, M., (2013). UCI machine learning repository, http://archive.ics.uci.edu/ml, (18.05.2016).
Bezdek, J.C., (1981). Pattern recognition with fuzzy objective function algorithms, New York: Plenum Press.
Duch, W., Adamczak, R. and Grabczewski, K., (2001). A new methodology of extraction, optimization and application of crisp and fuzzy logical rules, IEEE Transactions on Neural Networks, 12, 2, 277-306.
Freedman, D.A., (2009). Statistical Models: Theory and Practice, Cambridge University Press, 128-129.
Güneş, S., Polat, K. and Yosunkaya, Ş., (2010). Efficient sleep stage recognition system based on EEG signal using k-means clustering based feature weighting, Expert Systems with Applications, 37, 12, 7922-7928.
Kahramanli, H. and Allahverdi, N., (2008). Design of a hybrid system for the diabetes and heart diseases, Expert Systems with Applications, 35, 1-2, 82-89.
Kaufman, L. and Rousseeuw, P., (1987). Clustering by means of medoids, North-Holland.
Lee, Y.J. and Mangasarian, O.L., (2001). SSVM: A smooth support vector machine for classification, Computational Optimization and Applications, 20, 1, 5-22.
Li, D.C., Liu, C.W. and Hu, S.C., (2011). A fuzzy-based data transformation for feature extraction to increase classification performance with small medical datasets, Artificial Intelligence in Medicine, 52, 1, 45-52.
López, F.M., Puertas, S.M. and Arriaza, J.T., (2014). Training of support vector machine with the use of multivariate normalization, Applied Soft Computing, 24, 1105-1111.
MacQueen, J.B., (1967). Some methods for classification and analysis of multivariate observations, Proceedings, 5th Berkeley Symposium on Mathematical Statistics and Probability, 281-297.
Mantas, C.J. and Abellán, J., (2014). Credal-C4.5: Decision tree based on imprecise probabilities to classify noisy data, Expert Systems with Applications, 41, 10, 4625-4637.
Olson, D.L. and Delen, D., (2008). Advanced data mining techniques, Springer Science & Business Media.
Ozsen, S. and Yucelbas, C., (2015). On the evolution of ellipsoidal recognition regions in artificial immune systems, Applied Soft Computing, 31, 210-222.
Ozsen, S. and Gunes, S., (2008). Effect of feature-type in selecting distance measure for an artificial immune system as a pattern recognizer, Digital Signal Processing, 18, 4, 635-645.
Ozsen, S., Gunes, S., Kara, S. and Latifoglu, F., (2009). Use of kernel functions in artificial immune systems for the nonlinear classification problems, IEEE Transactions on Information Technology in Biomedicine, 13, 4, 621-628.
Peker, M., (2016). An efficient sleep scoring system based on EEG signal using complex-valued machine learning algorithms, Neurocomputing, 207, 165-177.
Polat, K. and Gunes, S., (2006). A hybrid medical decision making system based on principles component analysis, k-NN based weighted pre-processing and adaptive neuro-fuzzy inference system, Digital Signal Processing, 16, 6, 913-921.
Polat, K., Sahan, S., Kodaz, H. and Gunes, S., (2007). Breast cancer and liver disorders classification using artificial immune recognition system (AIRS) with performance evaluation by fuzzy resource allocation mechanism, Expert Systems with Applications, 32, 1, 172-183.
Polat, K. and Durduran, S.S., (2011). Subtractive clustering attribute weighting (SCAW) to discriminate the traffic accidents on Konya–Afyonkarahisar highway in Turkey with the help of GIS: A case study, Advances in Engineering Software, 42, 7, 491-500.
Polat, K. and Gunes, S., (2009). A new feature selection method on classification of medical datasets: Kernel f-score feature selection, Expert Systems with Applications, 36, 7, 10367-10373.
Polat, K. and Durduran, S.S., (2012). Automatic determination of traffic accidents based on KMC-based attribute weighting, Neural Computing and Applications, 21, 6, 1271-1279.
Quinlan, J.R., (1993). C4.5: Programs for Machine Learning, 1st ed., Morgan Kaufmann, San Francisco, 70-80.
Sahan, S., Polat, K., Kodaz, H. and Gunes, S., (2005). The medical applications of attribute weighted artificial immune system (AWAIS): Diagnosis of heart and diabetes diseases, Lecture Notes in Computer Science, 3627, 456-468.
Savitha, R., Suresh, S., Sundararajan, N. and Kim, H.J., (2012). A fully complex-valued radial basis function classifier for real-valued classification problems, Neurocomputing, 78, 1, 104-110.
Subbulakshmi, C.V., Deepa, S.N. and Malathi, N., (2012). Extreme learning machine for two category data classification, Proceedings, 2012 IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), 458-461.
Sun, Y., (2007). Iterative RELIEF for feature weighting: algorithms, theories, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 6, 1035-1051.
Tahir, M.A., Bouridane, A. and Kurugollu, F., (2007). Simultaneous feature selection and feature weighting using hybrid tabu search/k-nearest neighbor classifier, Pattern Recognition Letters, 28, 4, 438-446.
Tian, J., Li, M. and Chen, F., (2009). A hybrid classification algorithm based on coevolutionary EBFNN and domain covering method, Neural Computing and Applications, 18, 3, 293-308.
Unal, Y., Polat, K. and Kocer, H.E., (2014). Pairwise FCM based feature weighting for improved classification of vertebral column disorders, Computers in Biology and Medicine, 46, 61-70.
Yang, C.Y., Chou, J.J. and Lian, F.L., (2013). Robust classifier learning with fuzzy class labels for large-margin support vector machines, Neurocomputing, 99, 1-14.
Zhang, B. and Hsu, M., (1999). K-harmonic means: a data clustering algorithm, Technical Report, Hewlett-Packard Labs, HPL-1999-124.
