The K-Means Cluster Analysis

Segmentation and evaluation of customer potential - Part 4

Evaluating customers and customer relationships has always been a challenge for marketing departments. At the same time, it is an important steering element for communication strategy, media planning and budgeting. In this series of articles, we describe different approaches to customer evaluation and highlight their advantages and disadvantages.

In the fourth part, we describe K-Means cluster analysis, a more complex method for grouping objects that is also used in customer segmentation.

The methods presented so far in this series have been characterised by their simplicity and ease of explanation. The downside was that only a small set of factors could be taken into account. K-Means cluster analysis is an algorithm that incorporates several factors into the customer segmentation while remaining intuitively easy to understand and follow.

In Short

  1. Fix the number of groups k.
  2. Choose k random starting values for the cluster centres.
  3. Assign all remaining customers to the nearest cluster centre.
  4. Based on the assigned points, determine new cluster centres by taking the mean.
  5. Repeat steps 3 and 4 until the cluster assignments no longer change.
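
As an illustration, the five steps can be written out in a few lines of Python. This is only a minimal sketch with NumPy, not a production implementation; it ignores edge cases such as clusters that become empty.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal K-Means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random customers as the initial cluster centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every customer to the nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centre as the mean of its assigned customers.
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centres (and hence assignments) no longer change.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```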

The applications are manifold, but the method is especially useful when no underlying structure in the customer data is known. Nevertheless, the goal should be defined before applying it, because the number of segments to be formed must be fixed in advance.

At the beginning, the algorithm randomly selects k “start customers” from the customer data; these initially serve as the centres of their respective clusters. All other customers are assigned to the group whose start customer they are most “similar” to. The group centres are then re-determined and the data reassigned. This is repeated until shifting the group centres no longer changes the allocation of customers; at that point the algorithm stops and returns the final segments. Usually, the cluster allocation changes a lot in the first iterations but then quickly approaches the final result.

As with the other methods, there are strengths as well as weaknesses. The simple calculation and the associated low memory requirements make the method attractive for large data sets. In marketing, it guarantees cleanly separated groups, which makes it easy to tailor measures to the individual segments.

A major disadvantage of K-Means clustering is that the result depends on the (usually randomly) chosen starting values. This raises a reliability problem in concrete applications. In practice, however, the results differ only slightly, and analysing the results for a few different starting-value combinations is sufficient to estimate this uncertainty.
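
One simple way to gauge this uncertainty is to repeat the analysis with a few different random seeds and compare the final cost. A sketch using scikit-learn, with make_blobs as a stand-in for real customer data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)  # stand-in for customer data

# Run K-Means with several random starts; similar inertia values
# suggest the segmentation is stable despite the random initialisation.
for seed in range(5):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed {seed}: inertia {km.inertia_:.1f}")
```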

More restrictive in practice is the fact that categorical variables cannot be included unless a numerical scheme can be imposed on them (as would be the case with school grades, for example). As a rule of thumb, only variables for which a mean value can be calculated cleanly should be used; a variable such as “country” is an example where this is not possible. After all, the procedure minimises the deviations from a calculated mean value, and if that mean is not cleanly defined, no interpretable results can be expected.
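
To make the rule of thumb concrete, here is a small sketch; the column names and the grade mapping are purely illustrative:

```python
import pandas as pd

customers = pd.DataFrame({
    "grade":   ["A", "C", "B", "A"],      # ordinal: a numerical scheme exists
    "country": ["DE", "FR", "DE", "US"],  # nominal: no meaningful mean
})

# School grades can be mapped to numbers and then used in K-Means.
customers["grade_num"] = customers["grade"].map({"A": 1, "B": 2, "C": 3})

# "Country" has no ordering: the mean of arbitrary codes such as
# DE=0, FR=1, US=2 is meaningless, so the variable should be excluded.
```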

Furthermore, it must be considered that outliers in the data can shift the cluster centres and thus distort the boundaries between the segments. For this reason, outliers should be removed before the actual analysis and assigned to a segment of their own.
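
A simple pre-processing step along these lines, assuming a hypothetical z-score rule (three standard deviations) for what counts as an outlier, might look like this:

```python
import numpy as np

def split_outliers(X, z_threshold=3.0):
    """Separate customers whose z-score exceeds the threshold in any variable."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    regular = (z < z_threshold).all(axis=1)
    # Cluster only the regular customers; the rest form their own segment.
    return X[regular], X[~regular]
```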

More complex extensions

k-Median: This algorithm is essentially identical, but the cluster centres are computed with the median instead of the mean. The median is more robust against outliers, as it is the value above which half of the points lie and below which the other half lie; the exact values of particularly large or small points are thus irrelevant. In summary, this procedure differs only minimally from the widely used standard, but can perform better in some application areas.
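
Only the centre-update step (step 4 above) changes; the assignment step is then usually paired with the Manhattan distance. A sketch of the k-Median variant:

```python
import numpy as np

def update_centres_median(X, labels, k):
    """k-Median variant of step 4: per-coordinate median instead of the mean,
    so single extreme customers no longer pull the centres towards them."""
    return np.array([np.median(X[labels == j], axis=0) for j in range(k)])
```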

Comparing clusterings with different k: If it is not clear in advance how many groups the customers should be divided into, this raises several problems in applying the analysis. Segmentations calculated with different numbers of groups cannot be compared via the usual cost function, since the cost function delivers better values the higher the number of groups.

A solution to this problem is the so-called silhouette coefficient, which measures the quality of a clustering independently of the number of clusters. It can therefore be used to compare clusterings of different sizes and select the optimal variant.

The underlying idea is to evaluate, for each customer, how well he fits his own cluster compared with the next-closest cluster. Averaging this evaluation (the silhouette) over all customers yields the silhouette coefficient.
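
With scikit-learn, selecting k via the silhouette coefficient can be sketched as follows; the candidate range of k is an arbitrary choice here:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 11)):
    """Fit K-Means for several k and return the value with the
    highest mean silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```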

K-Means++: This extension is an approach to counter the dependence of the results on the starting values. Here, the cluster centres are not chosen purely at random but according to a seeding rule: the probability that a certain customer is chosen as a cluster centre increases the further away he is from the centres already chosen.

Normally, this algorithm converges much faster than the standard K-Means, while the results are usually just as good.
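
In scikit-learn, k-means++ seeding is in fact the default; the sketch below only makes the choice explicit for comparison with purely random starts:

```python
from sklearn.cluster import KMeans

km_plus = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
km_rand = KMeans(n_clusters=5, init="random", n_init=10, random_state=0)
# Fitting both on the same data usually shows k-means++ converging in
# fewer iterations (compare .n_iter_ after calling .fit(X)).
```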

Conclusion

K-Means cluster analysis offers a middle ground between the simple segmentation methods presented earlier and more complex clustering algorithms. It is easy to explain and delivers intuitively understandable results, behind which lies a well-considered algorithm that enjoys great popularity among analysts and data scientists.

The difficulties and challenges the algorithm faces can be controlled well with the help of the extensions described above. Thanks to its efficient calculation, it is also suitable for large amounts of data.
