Implementation of an incremental fuzzy-c-means clustering algorithm #750

NicolasBizzozzero · 2021-10-29T14:57:01Z

Hello!
After a quick look at river's current clustering algorithms, I think that adding a fuzzy clustering method would be a great idea. A fuzzy clustering method provides membership coefficients of a data point for each cluster instead of a "crisp" assignment. That gives a bit more information, which crisp clustering methods lack.

The fuzzy c-means clustering algorithm is one of the best-known and most widely used. Its behavior closely matches the famous k-means algorithm, with the addition of the fuzzy component. Moreover, in my experience, it's still a pretty fast algorithm!
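To make the "membership coefficients" concrete, here is a minimal sketch of the standard (batch) fuzzy c-means membership formula for a single point, given already-known cluster centers. The function name `fuzzy_memberships`, the tuple representation of points, and the example centers are assumptions made for illustration; this is not river code, and it does not show the incremental update itself.

```python
import math


def fuzzy_memberships(x, centers, m=2.0):
    """Return the membership degree of point `x` in each cluster.

    `centers` maps a cluster id to a center (a tuple of floats).
    `m` > 1 is the fuzzifier; m = 2 is a common default.
    """
    distances = {k: math.dist(x, c) for k, c in centers.items()}

    # A point lying exactly on one or more centers belongs fully to them.
    zero = [k for k, d in distances.items() if d == 0.0]
    if zero:
        return {k: (1.0 / len(zero) if k in zero else 0.0) for k in centers}

    exponent = 2.0 / (m - 1.0)
    # u_k = 1 / sum_j (d_k / d_j) ** (2 / (m - 1))
    return {
        k: 1.0 / sum((d_k / d_j) ** exponent for d_j in distances.values())
        for k, d_k in distances.items()
    }


# Hypothetical centers, chosen only for the example.
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 4.0)}
memberships = fuzzy_memberships((1.0, 0.0), centers)
```

With these centers, the point is closest to cluster 0, so that cluster receives the largest membership, and the three values sum to 1.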

I'm pretty confident I can implement an incremental version of this algorithm, and I would like to know beforehand whether this contribution would be useful and would match the philosophy of your library.

Thank you for your time!
Regards.

Sources:

[1] DUNN, Joseph C. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 1973, vol. 3, no 3, p. 32–57.
[2] BEZDEK, James C., EHRLICH, Robert, and FULL, William. FCM: The fuzzy c-means clustering algorithm. Computers & geosciences, 1984, vol. 10, no 2-3, p. 191-203.

MaxHalford · 2021-10-29T15:35:08Z

MaxHalford
Oct 29, 2021
Maintainer

Hey @NicolasBizzozzero, it's great to see you here! A contribution from you would be more than welcome. It's great that you opened up a discussion beforehand.

I'm not the expert on online clustering. @hoanganhngo610 and @jacobmontiel should have more insights than me. I assume they're somewhat aware of fuzzy c-means, so they should be able to answer you.

In terms of API, in River a clusterer has to assign a cluster label to an observation. In K-means, we pick the closest cluster. Maybe here you could label with the most relevant cluster? I'm mentioning this because it might be worth seeing if changing the API for clustering is worthwhile. Maybe we also want to explore the score for each cluster for a given observation. Maybe not. Again, it would be nice if @hoanganhngo610 and @jacobmontiel could chime in on this.

5 replies

NicolasBizzozzero Oct 29, 2021
Author

To be honest, I wanted to contribute for a looong time, but I never allowed myself the time to do it ahah. But better late than never, isn't it? Anyway, thank you for your answer!

Indeed, changing the base clusterer API may be overkill for one algorithm.
How do you all feel about creating a new base class FuzzyClusterer that inherits from Clusterer and provides two methods? Something like:

import abc

from river.base import Clusterer


class FuzzyClusterer(Clusterer):
    @abc.abstractmethod
    def predict_multiples(self, x: dict) -> dict:
        """Predict the membership of the features `x` in every cluster."""

    def predict_one(self, x: dict) -> int:
        # Crisp label: the cluster with the highest membership.
        memberships = self.predict_multiples(x=x)
        return max(memberships, key=memberships.get)

This way predict_one matches the current Clusterer API, and a call to predict_multiples can provide all the scores if the user wants them.
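For anyone who wants to try the pattern without installing river, here is a standalone sketch. The `Clusterer` class below is only a stand-in for river's base class, and `FixedMemberships` is a hypothetical toy subclass that returns hard-coded values; neither is part of the library.

```python
import abc


class Clusterer(abc.ABC):
    """Stand-in for river's base clusterer, so the sketch runs on its own."""

    @abc.abstractmethod
    def predict_one(self, x: dict) -> int:
        ...


class FuzzyClusterer(Clusterer):
    @abc.abstractmethod
    def predict_multiples(self, x: dict) -> dict:
        """Return a membership degree for every cluster."""

    def predict_one(self, x: dict) -> int:
        # The crisp label is the cluster with the highest membership.
        memberships = self.predict_multiples(x=x)
        return max(memberships, key=memberships.get)


class FixedMemberships(FuzzyClusterer):
    """Hypothetical toy subclass: ignores `x` and returns fixed memberships."""

    def predict_multiples(self, x: dict) -> dict:
        return {0: 0.2, 1: 0.7, 2: 0.1}


model = FixedMemberships()
label = model.predict_one({"a": 1.0})  # cluster 1 has the highest membership
```

A concrete fuzzy clusterer only has to implement predict_multiples; predict_one comes for free from the base class.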

raphaelsty Oct 29, 2021
Maintainer

@NicolasBizzozzero 🤗

MaxHalford Oct 29, 2021
Maintainer

To be honest, I wanted to contribute for a looong time, but I never allowed myself the time to do it ahah. But better late than never, isn't it?

Amen to that, this is not a job.

As to the API, I think I prefer using score_one instead of predict_multiples. But I'm OK with the idea of introducing a new class.

A question though: would predict_multiples return a score for every cluster? Or just the most likely clusters? In the former case, we might want to consider that other clusterers are fuzzy clusterers too and should also have a score_one method. I'll let you think about this :)

NicolasBizzozzero Oct 29, 2021
Author

Indeed, I envisaged predict_multiples returning a dict that assigns a score to each cluster. For instance:

>>> k_means = cluster.KMeans(n_clusters=4)
>>> k_means.predict_one({0: 0, 1: 0})              # Returns an int in [0, 4)
2

>>> fuzzy_c_means = cluster.FuzzyCMeans(n_clusters=4)
>>> fuzzy_c_means.predict_one({0: 0, 1: 0})        # Also returns an int in [0, 4)
2
>>> fuzzy_c_means.predict_multiples({0: 0, 1: 0})  # Returns the memberships of this data point in all clusters. Values sum to 1.
{
    0: 0.11,
    1: 0.03,
    2: 0.81,
    3: 0.05
}

That's the additional information fuzzy logic can provide, because each data point belongs to all clusters, but only to some degree! I think that differs from the use case of score_one, which seems to return only a float, the score of a data point?

MaxHalford Oct 29, 2021
Maintainer

I think that differs from the use case of score_one, which seems to return only a float, the score of a data point?

Actually I meant what you're doing in your last example. I was just bickering about the name :)

That's the additional information fuzzy logic can provide, because each data point belongs to all clusters, but only to some degree

OK, but isn't that also true for K-means? Each point belongs to each cluster, and the membership depends on the distance, right? Or maybe in fuzzy clustering, the scores need to sum to 1?
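One way to read this question: a crisp assignment such as K-means can be written as a degenerate membership dict, with 1 for the nearest center and 0 everywhere else. The sketch below shows that view; `crisp_memberships` and the example centers are hypothetical names and values used for illustration, not river's KMeans API.

```python
import math


def crisp_memberships(x, centers):
    """One-hot memberships: 1.0 for the nearest center, 0.0 for the others."""
    distances = {k: math.dist(x, c) for k, c in centers.items()}
    nearest = min(distances, key=distances.get)
    return {k: (1.0 if k == nearest else 0.0) for k in centers}


# Hypothetical centers, chosen only for the example.
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 4.0)}
crisp = crisp_memberships((1.0, 0.0), centers)
```

Under this view the crisp values also sum to 1, but they say nothing about how close the point is to the other clusters. Raw distances would carry that information, yet they are not normalized, which is what the fuzzy memberships add.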
