[Question] EFB before, or after, histogram binning? #6999

@adrienm7

Hello,
I would like to know the order of operations happening in LightGBM.
Reading the code, it seems that EFB (Exclusive Feature Bundling) happens when the data is converted into a LightGBM Dataset (src/io/dataset.cpp#L246).

But I'm confused whether histogram binning happens before or after that step.
The paper doesn't seem to state this, or at least not clearly. For example, there is this passage on page 6:

Since the histogram-based algorithm stores discrete bins instead of continuous values of the features, we can construct a feature bundle by letting exclusive features reside in different bins. This can be done by adding offsets to the original values of the features. For example, suppose we have two features in a feature bundle. Originally, feature A takes value from [0, 10) and feature B takes value [0, 20). We then add an offset of 10 to the values of feature B so that the refined feature takes values from [10, 30).
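To make the quoted passage concrete, here is a minimal sketch of the offset trick with the same numbers (feature A in [0, 10), feature B in [0, 20), offset 10). It is only an illustration, not LightGBM's implementation: the function names `bundle_exclusive` and `unbundle` are hypothetical, `None` stands for "this feature is not active on this row", and the sketch assumes exactly one of the two features is active on every row.

```python
from typing import List, Optional, Sequence, Tuple


def bundle_exclusive(
    a_bins: Sequence[Optional[int]],
    b_bins: Sequence[Optional[int]],
    offset_b: int,
) -> List[int]:
    """Merge two mutually exclusive, already discretized features.

    Hypothetical helper: None marks an inactive feature on a row.
    Values of B are shifted by offset_b so they cannot collide with A.
    """
    bundled = []
    for va, vb in zip(a_bins, b_bins):
        if (va is None) == (vb is None):
            # Assumption of this sketch: exactly one feature is active per row.
            raise ValueError("expected exactly one active feature per row")
        bundled.append(va if va is not None else vb + offset_b)
    return bundled


def unbundle(value: int, offset_b: int) -> Tuple[str, int]:
    """Recover which feature a bundled value came from, and its original bin."""
    if value < offset_b:
        return ("A", value)
    return ("B", value - offset_b)


# Feature A takes values in [0, 10), feature B in [0, 20); offset for B is 10.
a = [3, None, 9, None]
b = [None, 0, None, 19]
bundle = bundle_exclusive(a, b, offset_b=10)
print(bundle)
print([unbundle(v, 10) for v in bundle])
```

Because the ranges [0, 10) and [10, 30) do not overlap, a single bundled column still lets a split separate the values of A from the values of B.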

And also here shiyu1994 says:

e. Note that the feature grouping is done with the discretized version of feature values (bin values), so in this step the boundaries of the single feature histogram is not related, since the values are already discretized.

➜ From this, it seems that the first step is histogram binning, followed by EFB.

But then I see this where guolinke says:

The purpose of using EFB is to speed up the training, as the time of constructing feature histogram is reduced.

And here he writes:

In any cases, the histogram is always rebuilt for each tree, for both normal features and bundled features. The bundled feature is to merge several features into one feature. And the histogram is dynamically rebuilt during training.
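One possible reading of the two quotes above is that "histogram" there means the per-bin sums of gradient statistics accumulated during training, which is a different step from discretizing raw feature values into bin indices. The sketch below illustrates only that accumulation step for a single (possibly bundled) column of bin indices. The function name `build_histogram` and its signature are hypothetical and are not part of LightGBM's API.

```python
from typing import List, Sequence, Tuple


def build_histogram(
    bin_ids: Sequence[int],
    gradients: Sequence[float],
    hessians: Sequence[float],
    num_bins: int,
) -> Tuple[List[float], List[float]]:
    """Accumulate gradient and hessian sums per bin for one feature column.

    Hypothetical helper: bin_ids are assumed to be precomputed bin indices
    in [0, num_bins); only the sums change from one tree to the next.
    """
    grad_sum = [0.0] * num_bins
    hess_sum = [0.0] * num_bins
    for b, g, h in zip(bin_ids, gradients, hessians):
        grad_sum[b] += g
        hess_sum[b] += h
    return grad_sum, hess_sum


# Four rows falling into three bins of one column.
grads, hess = build_histogram(
    bin_ids=[0, 2, 2, 1],
    gradients=[1.0, 0.5, 0.5, -1.0],
    hessians=[1.0, 1.0, 1.0, 1.0],
    num_bins=3,
)
print(grads, hess)
```

Under this reading, bundling several features into one column means one pass like this per bundle instead of one per original feature, which would be consistent with the speed-up mentioned in the first quote.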

So I'm really confused. Is EFB happening before or after histogram binning?
What are the reasons for this order, and what is its impact?
Thanks in advance for your answers.
