Add multm_prev_ layer and enhance gemm() function for PLANE_WISE operations #3020
Conversation
…PU & CUDA) and Add `add_to` Parameter
* remove using namespace std from headers
* more std::
* more std::
* more std:: on windows stuff
* remove uses of using namespace std::chrono
* do not use C++17 features
* Add Davis suggestion
* revert some more stuff
* revert removing include
* more std::chrono stuff
Nice, I was just wondering if
We can change the name without any problem. I am already dealing with compilation issues, likely due to static uses of mult_prev_ (in the template part), and we will decide on the name to keep afterward.
On the other hand, I was thinking of using the same convention for the transformation applied to softmax, and thus having a special layer named softmaxm. Or should we go with mat_softmax, or perhaps better, msoftmax?
Would it be too difficult to have just an
It would be simpler to use for some people, but we would lose the flexibility to build attention in a potentially specific way (even though it currently follows fairly standard and structured steps). For instance, we can decide whether or not to mask, apply an additional filter to drop pad tokens before applying softmax, and so on. I was thinking more of providing, as you did for ResNet, an external definition file that gives one particular definition of the network...
Yes, we would lose flexibility, or maybe that layer could be initialized with a struct of options that control the behavior/features of the attention layer. But yes, it would still be less flexible.
It would be harder to implement something like flash attention without an explicit
Indeed, I can add a high-level declaration in the layer definition file, similar to what was done for the inception layer, like:
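(The snippet that originally followed is not preserved in this thread. Purely as an illustrative sketch, and not the declaration referenced above, a high-level alias in the spirit of dlib's inception definitions could look roughly like the block below. The `multm_prev1`/`multm_prev2` aliases are assumed to follow dlib's existing `mult_prev1` naming convention, `softmaxm` is the row-wise matrix softmax name discussed earlier, 1x1 convolutions stand in for the Q/K/V projections, and the transpose of K is omitted for brevity.)

```cpp
#include <dlib/dnn.h>
using namespace dlib;

// Hypothetical high-level attention alias, analogous to dlib's inception_*
// aliases.  multm_prev1 / multm_prev2 are assumed to multiply the current
// output, plane by plane, with the tensors tagged tag1 / tag2.

// Shared input tagged as tag3; V and K are tagged as tag1 and tag2.
template <long d_k, typename SUBNET> using attn_input = tag3<SUBNET>;
template <long d_k, typename SUBNET> using attn_v = tag1<con<d_k,1,1,1,1, attn_input<d_k,SUBNET>>>;
template <long d_k, typename SUBNET> using attn_k = tag2<con<d_k,1,1,1,1, skip3<attn_v<d_k,SUBNET>>>>;
template <long d_k, typename SUBNET> using attn_q = con<d_k,1,1,1,1, skip3<attn_k<d_k,SUBNET>>>;

// scores = Q * K (plane-wise), then row-wise softmax, then scores * V.
template <long d_k, typename SUBNET>
using self_attention = multm_prev1<softmaxm<multm_prev2<attn_q<d_k,SUBNET>>>>;
```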
Sorry, I'm just catching up on these threads. Seems like this PR is still being worked on? There are conflicts with master in any case. Let me know when I should look it over :)
@davis, no, you can do the merging. I think the conflicts with master come from the fact that I created several branches from my own Dlib fork to be able to work on several layers in parallel. The new layers created here are finished and can be integrated, please.
You please merge them :) I'll review them once there aren't merge conflicts.
Yeah that's fine :D
.vscode/settings.json
@@ -0,0 +1,5 @@
{
Do you intend to add this? Like, does this make life easier for vscode users in some way? I'm not necessarily averse to it, but it seems like the sort of thing that would normally be purely local to individual users.
No, of course not. I don't know where that came from... it was obviously added by the GitHub tooling when I made fixes. It's not required for the merging.
@davis, could you please review this PR?
@davis, I think I'm on the right track now (it was hard, for example, to find an enum shared by all the classes and accessible from both the CPU and CUDA code) and everything is working on my side now (including the dnn.cpp test). As indicated in the closed PR, to simplify the review I've merged all the modifications (i.e. for the two new layers) into this single PR.

Failure message from test: Error occurred at line 1518. Testing Finished
@arrufat, would you please be able to pull this development branch on your side and verify what I noticed, not during compilation but during test program execution? There is indeed an error, at least on my validation platform, in a Conv1D operation which is obviously unrelated to the new DNN layers I am adding to formalize the attention mechanism. Just to make sure you observe the same behavior. Thanks in advance.
I've run the test on my GPU machine (
OK, so maybe it's due to the GPU I'm using (an A4500); would it be possible to review again and merge everything now? There are still other elements to work on ;-)
Nice, this is great. Fix up the minor stuff I commented on and I'll merge it right away :D
Done.
This pull request introduces a new layer, `multm_prev_`, and enhances the `gemm()` function to support PLANE_WISE operations. These changes aim to improve the flexibility and performance of matrix multiplications in deep learning models, particularly for attention mechanisms.

New layer: multm_prev_

The `multm_prev_` layer performs element-wise matrix multiplication between the current layer's input and the previous layer's output. This new layer is particularly useful for implementing attention mechanisms and other operations that require element-wise interactions between tensors.

Key features of multm_prev_:
Enhancement to gemm() function:

The `gemm()` function has been updated to support two modes of operation: CHANNEL_WISE (default) and PLANE_WISE. This modification allows for more efficient and flexible matrix multiplications, especially when dealing with 4D tensors (a reference sketch of the plane-wise semantics is shown after this section).

Key changes to gemm():

* New parameter `g_mode` to specify the operation mode (0 for CHANNEL_WISE, 1 for PLANE_WISE)

These changes provide greater flexibility in implementing complex neural network architectures, particularly those involving attention mechanisms or other operations requiring element-wise interactions between tensors.
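For reference, here is a minimal sketch of what the PLANE_WISE mode computes, assuming the usual dlib tensor layout of [num_samples, k, nr, nc] stored in row-major order. It only illustrates the behaviour described above; it is not dlib's actual `gemm()` signature or implementation.

```cpp
#include <cstddef>

// PLANE_WISE semantics: every (sample, channel) pair holds an independent
// 2D matrix, and C[n][ch] = A[n][ch] * B[n][ch], with A: nr_a x nc_a and
// B: nc_a x nc_b.  The CHANNEL_WISE default keeps the previous behaviour.
void plane_wise_matmul(const float* a, const float* b, float* c,
                       std::size_t num_samples, std::size_t k,
                       std::size_t nr_a, std::size_t nc_a, std::size_t nc_b)
{
    for (std::size_t n = 0; n < num_samples; ++n)
    for (std::size_t ch = 0; ch < k; ++ch)
    {
        const float* ap = a + (n*k + ch)*nr_a*nc_a;   // plane of A
        const float* bp = b + (n*k + ch)*nc_a*nc_b;   // plane of B
        float*       cp = c + (n*k + ch)*nr_a*nc_b;   // plane of C
        for (std::size_t r = 0; r < nr_a; ++r)
            for (std::size_t col = 0; col < nc_b; ++col)
            {
                float sum = 0;
                for (std::size_t i = 0; i < nc_a; ++i)
                    sum += ap[r*nc_a + i] * bp[i*nc_b + col];
                cp[r*nc_b + col] = sum;
            }
    }
}
```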
A new test function, `test_multm_prev()`, has been added to verify the correct functionality of the `multm_prev` layer and the enhanced `gemm()` function.
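As a usage note, here is a minimal sketch of how the layer could appear in a network definition, assuming the PR exposes a `multm_prev1` alias following dlib's existing `mult_prev1` naming convention (multiply the current output, plane by plane, with the tensor tagged as `tag1`). It is illustrative only and not taken from the PR's test code.

```cpp
#include <dlib/dnn.h>
#include <iostream>
using namespace dlib;

int main()
{
    // For a square input matrix X this computes relu(X) * X, plane by plane.
    // A loss layer would be stacked on top for actual training.
    using net_type = multm_prev1<relu<tag1<input<matrix<float>>>>>;
    net_type net;

    matrix<float> x = matrix_cast<float>(randm(4, 4));  // one 4x4 sample
    resizable_tensor t;
    net.to_tensor(&x, &x + 1, t);        // convert the sample to an input tensor
    const tensor& out = net.forward(t);  // run the network
    std::cout << out.num_samples() << "x" << out.k() << "x"
              << out.nr() << "x" << out.nc() << std::endl;  // expect 1x1x4x4
}
```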