Internal Validation Measures for Data Streams Mining Algorithms between Micro-Clusters

Dr. S. Gopinathan1, Mr. L. Ramesh2

1 Associate Professor, Department of Computer Science, University of Madras, Chennai, 600005, India.
2 Research Scholar, Department of Computer Science, University of Madras, Chennai, 600005, India.

Abstract

Clustering data streams has become a very important method for data mining and warehousing. A common technique for evaluating micro-clustering results is to use validity measures. In this paper we use a variety of internal measures, namely SSQ, Silhouette, average.between, average.within, max.diameter, ave.within.cluster.ss, g2, pearsongamma, dunn, dunn2, entropy and wb.ratio, to evaluate data stream mining algorithms. The paper covers DBSTREAM, CluStream, DenStream, D-Stream and D-Stream + Attraction, and examines the offline reclustering of micro-clusters under different radius settings.
Keywords: Validity measures, data mining and warehousing, data stream, internal measure, offline clustering

1. INTRODUCTION

In this paper, we evaluate internal validation measures using data stream mining algorithms with real data sets. Clustering is an unsupervised classification technique that is fundamental to data mining and one of the most important tasks in data analytics [2]. Clustering validation identifies which sets of micro-clusters fit the natural partitions (number of clusters) of the data best and worst, without using any class information.

There are two common types of clustering validation, based on external validation measures and on internal validation measures. External validation measures rely on previous knowledge about the data, whereas internal validation measures rely on information intrinsic to the data alone [2]. Data stream clustering is typically done as a two-stage process: an online part summarizes the data into many micro-clusters or grid cells, and an offline process then reclusters/merges these micro-clusters into a smaller number of final clusters [1].
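The two-stage process described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of any algorithm evaluated in this paper: the function names `online_micro_clusters` and `offline_merge`, the first-fit assignment rule and the single-linkage merge threshold `link_distance` are all assumptions made for the example.

```python
import math


def online_micro_clusters(stream, radius):
    """Online stage (illustrative): a point joins the first micro-cluster
    whose centre lies within `radius`; otherwise it opens a new one."""
    centers, weights = [], []
    for point in stream:
        for i, c in enumerate(centers):
            if math.dist(point, c) <= radius:
                weights[i] += 1
                # Incremental mean update of the micro-cluster centre.
                centers[i] = tuple(ci + (pi - ci) / weights[i]
                                   for ci, pi in zip(c, point))
                break
        else:
            centers.append(tuple(point))
            weights.append(1)
    return centers, weights


def offline_merge(centers, link_distance):
    """Offline stage (illustrative): merge micro-clusters whose centres are
    closer than `link_distance`; returns one final-cluster label per centre."""
    parent = list(range(len(centers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= link_distance:
                parent[find(j)] = find(i)

    roots = [find(i) for i in range(len(centers))]
    relabel = {root: k for k, root in enumerate(dict.fromkeys(roots))}
    return [relabel[root] for root in roots]


# Hypothetical toy stream: two groups near the origin and one near (5, 5).
toy = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5.0), (0.6, 0.0)]
centers, weights = online_micro_clusters(toy, radius=0.3)
labels = offline_merge(centers, link_distance=1.0)
```

A smaller `radius` generally produces more micro-clusters in the online stage, which is why the radius is the parameter varied in the experiments.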
Clustering is one of the most fundamental techniques in data mining. The goal of clustering is to divide the data elements into groups of similar objects, where each group is referred to as a cluster, consisting of objects that are similar to one another and dissimilar to objects of other groups. Clustering is used effectively in exploratory pattern analysis, machine learning, data mining and bioinformatics problems. The basic problem in the context of clustering is to group a given collection of unlabelled patterns into meaningful clusters. Cluster analysis is the automatic process of grouping data into different groups, so that the data in each group share similar trends and patterns. The clusters formed are defined as the organization of datasets into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure.
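The notions of homogeneous (compact) and well-separated groups can be made concrete with pairwise distances. The Python sketch below computes the average within-cluster distance, the average between-cluster distance and their ratio. These correspond in spirit to the average.within, average.between and wb.ratio measures used later in this paper, but under a simplified, unweighted pairwise definition that library implementations may refine. The function name `within_between` and the toy points are hypothetical.

```python
import math
from itertools import combinations


def within_between(points, labels):
    """Return (average within-cluster distance,
               average between-cluster distance,
               within/between ratio) over all point pairs.
    Assumes at least one same-cluster pair and one cross-cluster pair."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(math.dist(p, q))
    avg_within = sum(within) / len(within)
    avg_between = sum(between) / len(between)
    return avg_within, avg_between, avg_within / avg_between


# Two tight groups far apart: small within distance, large between distance.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
avg_w, avg_b, ratio = within_between(points, labels)
```

A ratio well below 1 indicates compact, well-separated clusters; a ratio close to 1 indicates that points are about as far from their own cluster members as from other clusters.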
The paper is organized as follows. Section 2 reviews related works. Section 3 gives the block diagram of the proposed system and Section 4 presents the proposed algorithm. Section 5 reports the experimental results for the internal evaluation measures, obtained with the data stream mining algorithms. Finally, Section 6 concludes the paper.

2. RELATED WORKS

This research required a review of a number of related published articles. We consulted 100 articles for the work.
Only 10 of these papers are presented in this section. The work of Milligan and Cooper [4] is the reference work on internal cluster validation. That work compared 30 cluster validity indexes (CVIs). The authors called them stopping criteria because they were used to stop the agglomerative process of a hierarchical clustering algorithm. The hierarchical clustering algorithms (single-linkage, complete-linkage, average-linkage and Ward) were applied to 108 synthetic datasets with varying numbers of non-overlapping clusters (2, 3, 4 or 5), dimensionality (4, 6 or 8) and cluster sizes. They presented the results in a tabular format, showing the number of clusters. Moreover, the tables also included the number of times that the prediction of each cluster validation index overestimated or underestimated the real number of clusters by 1 or 2.

Bezdek et al. [7] published a paper comparing 23 cluster validation indexes based on 3 runs of the EM algorithm and 12 synthetic datasets. The datasets were formed by 3 or 6 Gaussian clusters and the results were presented in tables. A comparison of 15 cluster validity indexes was performed by Dimitriadou et al. [8], based on 100 runs of k-means and 600 synthetic datasets with binary attributes, which shaped the experiment and the presentation of the results.

Brun et al. [10] compared 8 CVIs using several clustering algorithms: k-means, fuzzy c-means, SOM, single-linkage, complete-linkage and EM. They used 600 synthetic datasets based on 6 models with varying dimensionality (2 or 10), cluster shape (spherical or Gaussian) and number of clusters (2 or 4). The novelty in this work can be found in the comparison methodology. The authors compared the partitions obtained by the clustering algorithms with the correct partitions and computed an error value for each partition. Then, the "quality" of a CVI is measured as its correlation with the measured error values. In this work, not just internal but also external and relative indices are examined.
The results show that the Rand index is highly correlated with the error measure.

A further study was published by Dubes [9] two years later. The novelty of this work is that the author used tables where the score of each CVI was shown according to the values of each experimental factor: clustering algorithm, dataset dimensionality, and number of clusters. Moreover, he used the χ² statistic to test the effect of each factor on the behaviour of the compared CVIs. Certainly, the use of statistical tests to validate the experimental results is not common practice in clustering, as opposed to other areas such as supervised learning. The main drawback of this work is that it compares just 2 CVIs (Davies–Bouldin and the modified Hubert statistic). The experiment was performed in 2 parallel studies of 32 and 64 synthetic datasets, with 3 clustering algorithms (single-linkage, complete-linkage and Cluster) and 100 runs. The datasets' characteristics were controlled in the generation process and they used different sizes (50 or 100 objects), dimensionality (2, 3, 4 or 5), number of clusters (2, 4, 6 or 8), sampling window (cubic or spherical) and cluster overlap.

3. BLOCK DIAGRAM OF PROPOSED SYSTEM

The following figure represents the framework of this paper for internal validation measures for data stream mining algorithms between micro-clusters.
Fig. 1. Block diagram of proposed system

In this framework, the input dataset is passed to the data stream mining algorithms (DBSTREAM, D-Stream, DenStream, D-Stream + Attraction and CluStream), which are then evaluated on the Cassini data set using the internal validation measures SSQ, Silhouette, average.between, average.within, max.diameter, ave.within.cluster.ss, g2, pearsongamma, dunn, dunn2, entropy and wb.ratio.

4. ALGORITHM FOR PROPOSED SYSTEM

Step 1: Initialization
    Input: Read the dataset
Begin
Step 2: Radius fixation, r = 0.8, 0.6, 0.4, 0.2, 0.1
    Set radius of current data stream mining algorithm
Step 3: Update the data stream algorithm (DBSTREAM, D-Stream, CluStream, DenStream, D-Stream + Attraction)
    Micro-clusters
Step 4: Use attraction for reclustering
    Reclustering for data stream mining
Step 5: Internal evaluation measures of data stream mining algorithms
    Using macro-clusters
Step 6: Data stream mining algorithm
End

5. EXPERIMENTS AND RESULTS

In the experiment section the proposed algorithm is implemented and tested with the Cassini data set using the R software. The figures and tables show the results of the proposed work. To measure the performance of the system, we evaluated the various data stream mining algorithms. The radius is fixed at the levels r = 0.8, 0.6, 0.4, 0.2, 0.1 on the original input Cassini data for evaluating the internal validation measures.
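A rough sketch of the radius sweep in Steps 2 to 5 is given below in Python. It uses a simple leader-style micro-clustering as a stand-in, not DBSTREAM, D-Stream, DenStream or CluStream themselves, and a randomly generated toy stream rather than the Cassini data. The helper names `leader_micro_clusters`, `ssq` and `sweep` are hypothetical. SSQ is computed here as the sum of squared distances from each point to its nearest micro-cluster centre.

```python
import math
import random

RADII = (0.8, 0.6, 0.4, 0.2, 0.1)  # radius levels used in the paper


def leader_micro_clusters(points, radius):
    """Stand-in micro-clustering (assumption): a point becomes a new centre
    only if no existing centre lies within `radius` of it."""
    centers = []
    for p in points:
        if not centers or min(math.dist(p, c) for c in centers) > radius:
            centers.append(p)
    return centers


def ssq(points, centers):
    """Sum of squared distances from each point to its nearest centre."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def sweep(points, radii=RADII):
    """For each radius, return (number of micro-clusters, SSQ)."""
    results = {}
    for r in radii:
        centers = leader_micro_clusters(points, r)
        results[r] = (len(centers), ssq(points, centers))
    return results


# Toy stream (not the Cassini data): three Gaussian blobs.
random.seed(0)
toy_stream = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
              for cx, cy in [(0, 0), (3, 3), (0, 3)]
              for _ in range(100)]
results = sweep(toy_stream)
for r, (n_mc, value) in results.items():
    print(f"r={r}: micro-clusters={n_mc}, SSQ={value:.2f}")
```

With this stand-in, every point lies within `radius` of some centre, so the SSQ is bounded by the number of points times the squared radius. The algorithms evaluated in the paper use different update rules, so their SSQ values need not follow this bound.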
Figures 2, 3, 4, 5 and 6 show the Cassini data set for each algorithm.

Figure 2. Cassini data set using DBSTREAM
Figure 3. Cassini data set using D-Stream
Figure 4. Cassini data set using CluStream
Figure 5. Cassini data set using D-Stream + Attraction
Figure 6. Cassini data set using DenStream

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 1.01 | 2.85 | 0.79 | 1.22 | 4.85 |
| D-Stream | 310.36 | 141.88 | 201.98 | 184.63 | 196.27 |
| D-Stream + attraction | 204.77 | 232.75 | 215.43 | 188.38 | 217.68 |
| DenStream | 1.78 | 1.26 | 0.91 | 0.67 | 0.70 |
| CluStream | 0.81 | 1.11 | 1.02 | 0.72 | 1.10 |

Table 1. SSQ for the Cassini data using data stream mining algorithms
Figure 7. SSQ for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.54 | 0.25 | 0.44 | -0.34 | -0.06 |
| D-Stream | 0.47 | 0.33 | 0.23 | 0.24 | 0.42 |
| D-Stream + attraction | 0.30 | 0.22 | 0.29 | 0.26 | 0.13 |
| DenStream | 0.53 | 0.05 | 0.26 | -0.12 | 0.32 |
| CluStream | 0.54 | 0.38 | 0.43 | 0.49 | 0.60 |

Table 2. Silhouette for the Cassini data using data stream mining algorithms
Figure 8. Silhouette for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.41 | 0.44 | 0.45 | 0.39 | 0.32 |
| D-Stream | 4.63 | 4.91 | 5.67 | 5.50 | 5.62 |
| D-Stream + attraction | 5.73 | 5.80 | 6.45 | 5.18 | 5.68 |
| DenStream | 0.60 | 0.37 | 0.35 | 0.32 | 0.31 |
| CluStream | 0.37 | 0.46 | 0.26 | 0.32 | 0.43 |

Table 3. average.between for the Cassini data using data stream mining algorithms
Figure 9. average.between for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.14 | 0.23 | 0.14 | 0.38 | 0.33 |
| D-Stream | 2.26 | 3.53 | 3.27 | 2.38 | 3.41 |
| D-Stream + attraction | 2.77 | 2.49 | 2.13 | 2.52 | 2.38 |
| DenStream | 0.13 | 0.19 | 0.15 | 0.32 | 0.28 |
| CluStream | 0.14 | 0.18 | 0.10 | 0.10 | 0.12 |

Table 4. average.within for the Cassini data using data stream mining algorithms
Figure 10. average.within for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 1.17 | 1.17 | 1.17 | 0.95 | 1.17 |
| D-Stream | 20.11 | 15.30 | 16.96 | 13.14 | 14.57 |
| D-Stream + attraction | 17.14 | 15.74 | 11.03 | 19.93 | 9.01 |
| DenStream | 0.83 | 0.08 | 1.00 | 0.99 | 0.78 |
| CluStream | 0.71 | 0.88 | 0.63 | 0.71 | 0.58 |

Table 5. max.diameter for the Cassini data using data stream mining algorithms
Figure 11. max.diameter for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.02 | 0.04 | 0.02 | 0.08 | 0.07 |
| D-Stream | 8.64 | 5.51 | 4.50 | 6.17 | 5.67 |
| D-Stream + attraction | 6.41 | 11.36 | 5.88 | 6.92 | 6.76 |
| DenStream | 0.01 | 0.02 | 0.03 | 0.04 | 0.04 |
| CluStream | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 |

Table 6. ave.within.cluster.ss for the Cassini data using data stream mining algorithms
Figure 12. ave.within.cluster.ss for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.84 | 0.67 | 0.86 | 0.05 | 0.04 |
| D-Stream | 0.44 | 0.48 | 0.67 | 0.71 | 0.82 |
| D-Stream + attraction | 0.75 | 0.80 | 0.89 | 0.86 | 0.68 |
| DenStream | 0.99 | 0.74 | 0.63 | 0.08 | 0.07 |
| CluStream | 0.79 | 0.73 | 0.77 | 0.83 | 0.78 |

Table 7. g2 for the Cassini data using data stream mining algorithms
Figure 14. g2 for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.68 | 0.51 | 0.68 | 0.05 | - |
| D-Stream | 0.34 | 0.29 | 0.53 | 0.60 | 0.61 |
| D-Stream + attraction | 0.50 | 0.60 | 0.63 | 0.38 | 0.54 |
| DenStream | 0.88 | 0.63 | 0.53 | 0.02 | 0.08 |
| CluStream | 0.60 | 0.53 | 0.52 | 0.58 | 0.56 |

Table 8. pearsongamma for the Cassini data using data stream mining algorithms
Figure 15. pearsongamma for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.07 | 0.02 | 0.01 | 0.01 | - |
| D-Stream | 0.01 | 0.01 | - | 0.02 | 0.02 |
| D-Stream + attraction | 0.01 | 0.02 | 0.01 | 0.01 | 0.03 |
| DenStream | 0.02 | 0.47 | 0.01 | - | - |
| CluStream | 0.02 | 0.01 | 0.02 | 0.05 | 0.02 |

Table 9. dunn for the Cassini data using data stream mining algorithms
Figure 16. dunn for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.58 | 0.64 | 0.74 | 0.11 | 0.08 |
| D-Stream | 0.70 | 0.60 | 0.68 | 0.51 | 0.71 |
| D-Stream + attraction | 0.73 | 0.76 | 0.61 | 0.70 | 0.52 |
| DenStream | 1.13 | 0.68 | 0.41 | 0.05 | 0.03 |
| CluStream | 0.76 | 0.83 | 0.72 | 0.68 | 0.75 |

Table 10. dunn2 for the Cassini data using data stream mining algorithms
Figure 17. dunn2 for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 1.24 | 1.00 | 1.34 | 0.97 | 0.39 |
| D-Stream | 1.20 | 1.43 | 1.46 | 1.52 | 1.54 |
| D-Stream + attraction | 1.35 | 1.31 | 1.36 | 1.36 | 1.22 |
| DenStream | 0.73 | 0.94 | 1.56 | 1.61 | 1.71 |
| CluStream | 1.33 | 1.33 | 1.31 | 1.25 | 1.24 |

Table 11. entropy for the Cassini data using data stream mining algorithms
Figure 18. entropy for the Cassini data using data stream mining algorithms with different radius r = 0.8, 0.6, 0.4, 0.2, 0.1

| r | 0.8 | 0.6 | 0.4 | 0.2 | 0.1 |
|---|---|---|---|---|---|
| # of MCs | 31 | 47 | 57 | 74 | 80 |
| DBSTREAM without shared density | 0.33 | 0.48 | 0.31 | 0.97 | 0.98 |
| D-Stream | 0.52 | 0.52 | 0.43 | 0.47 | 0.48 |
| D-Stream + attraction | 0.40 | 0.36 | 0.37 | 0.45 | 0.40 |
| DenStream | 0.20 | 0.44 | 0.43 | 0.89 | 0.72 |
| CluStream | 0.35 | 0.56 | 0.46 | 0.29 | 0.31 |

Table 12. wb.ratio for the Cassini data using data stream mining algorithms

From the results in Table 1, we can conclude that the internal validation measure SSQ takes positive values for all the data stream mining algorithms.
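For reading the Silhouette values, it helps to recall how the measure is formed: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the silhouette width is (b - a) / max(a, b). A negative value therefore means that points are, on average, closer to another cluster than to their own. The Python sketch below illustrates this on hypothetical toy data; the function name `silhouette` and the convention of scoring singleton clusters as 0 are assumptions of the example, not taken from the software used in the experiments.

```python
import math


def silhouette(points, labels):
    """Mean silhouette width over all points.
    Assumes at least two clusters; singleton clusters score 0."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes distance 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette(points, [0, 1, 0, 1])   # groups cut across the geometry
```

Here `good` is close to 1 and `bad` is negative, which is the pattern to look for when a clustering at a given radius produces a negative Silhouette.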
Table 2 shows that the Silhouette validation measure is negative for DBSTREAM without shared density at the radius levels r = 0.2 and 0.1, and for DenStream at r = 0.2; all other values are positive. Tables 3 to 12 show that the remaining internal validation measures, namely average.between, average.within, max.diameter, ave.within.cluster.ss, g2, pearsongamma, dunn, dunn2, entropy and wb.ratio, take positive values for all the data stream mining algorithms at the radius levels r = 0.8, 0.6, 0.4, 0.2, 0.1, wherever a value is reported.

6. CONCLUSION

In this paper, we presented a comparison of 12 internal cluster validation measures using data stream clustering algorithms. To the best of our knowledge, it is the first to evaluate these internal validation indexes using micro-clusters. The experiments also covered various radius settings of the data stream mining algorithms: we used the radius levels r = 0.8, 0.6, 0.4, 0.2, 0.1 and examined the behaviour of the various data stream mining algorithms on micro-clusters.

7. REFERENCES

[1] Michael Hahsler and Matthew Bolanos, "Clustering Data Streams Based on Shared Density Between Micro-Clusters", IEEE Transactions on Knowledge and Data Engineering, Jan 2016.
[2] Erendira Rendon, Itzel Abundez, Alejandra Arizmendi and Elvia M. Quiroz, "Internal versus External cluster validation indexes", International Journal of Computers and Communications, Issue 1, Volume 5, 2011.
[3] Bezdek J. C., Pal N. R. Some new indexes of cluster validity. IEEE Trans. Syst. Man, Cyber. Part B 28 (3), 1998, pp. 301-315.
[4] G. W. Milligan, M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set", Psychometrika 50 (1985) 159-179.
[5] Davies D. L., Bouldin D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 1 (2), 1979, pp. 224-227.
[5] Halkidi M., Vazirgiannis M. Quality scheme assessment in the clustering process. In Proc. PKDD (Principles and Practice of Knowledge Discovery in Databases). Lyon, France. Lecture Notes in Artificial Intelligence. Springer-Verlag GmbH, vol. 1910, 2000, pp. 265-279.
[6] Rendón E., Abundez I., Arizmendi A., Quiroz E. M. Internal versus external cluster validation indexes. International Journal of Computers and Communications. 2011; 5(1): 27-34.
[7] J. C. Bezdek, W. Q. Li, Y. Attikiouzel, M. Windham, A geometric approach to cluster validity for normal mixtures, Soft Computing - A Fusion of Foundations, Methodologies and Applications 166-179.
[8] E. Dimitriadou, S. Dolnicar, A. Weingessel, An examination of indexes for determining the number of clusters in binary data sets, Psychometrika 137-159.
[9] M. Brun, C. Sima, J. Lowey, B. Carroll, E. Suh, E. R. Dougherty, Model-based evaluation of clustering validation measures, Pattern Recognition 807-824.
[9] R. C. Dubes, How many clusters are best? - an experiment, Pattern Recognition 645-663.
[10] Kovács F., Legány C., Babos A. Cluster validity measurement techniques. In 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005 Nov 18.

ACKNOWLEDGEMENT

Dr. S. Gopinathan is working as an Associate Professor in the Department of Computer Science, University of Madras, Chennai, India. He has 18 years of experience teaching postgraduate students in the field of Computer Science and Research. He has published a number of papers. He has produced 14 M.Phil. scholars in Computer Science, and 8 Ph.D. research scholars are registered under him. He has also been serving as a panel member for various competitive examinations and universities. His research interests are Image Processing, Data Mining and Warehousing, and Neural Networks.

L. Ramesh is currently a full-time Ph.D. research scholar in the Department of Computer Science, University of Madras. He received his B.Sc. (CS), M.Sc. (CS), B.Ed. and M.Phil. (CS) at Dr. Ambedkar Government Arts College, Chennai. His areas of interest include Data Mining and Warehousing and Neural Networks.