Incremental PBG training #232
base: main
Conversation
* Optimizer state not loaded
Hi @howardchanth! Thank you for your pull request and welcome to our community. Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Thanks for the PR! It has indeed been a very requested feature. I'd like to understand more about the design you chose to implement it.

In my mind, PBG already kind-of supported incremental training, because it allowed loading a previous checkpoint as a "prior" in order to bootstrap/warm-start the training. This however requires the old checkpoint to cover the exact same entities as the new training job. I think you were aware of that, as you mention that you're focusing on the issue of adding new entities, which is indeed a limitation.

In my mind, though, this limitation can be lifted just by changing how the importing works, without touching the training loop at all. To be more precise: all we need is that, when importing the next round of edges, the importer also produces an "initial checkpoint" by filtering/rearranging/filling in the old checkpoint. Then that new initial checkpoint can be used as a prior and everything will work. I would go for that approach, as it reduces changes to the training loop, which is already quite complicated.

Also, to give you some advance "warning": we're currently stretched a bit thin, so I don't know if we'll be able to fully review this PR shortly.
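For illustration, a minimal sketch of the importer-side enlargement described above, assuming the old and new entity name lists are available as plain Python lists and the checkpoint embeddings as a plain torch tensor (the helper name and signature are hypothetical, not an existing PBG API):

```python
import torch

def enlarge_embeddings(
    old_names: list,          # entity names covered by the old checkpoint
    new_names: list,          # entity names of the new training job
    old_embs: torch.Tensor,   # (len(old_names), dim) embeddings from the old checkpoint
) -> torch.Tensor:
    """Rearrange old embeddings to the new entity ordering and randomly
    initialize rows for entities that did not exist before."""
    dim = old_embs.shape[1]
    old_index = {name: i for i, name in enumerate(old_names)}
    # Start from a random initialization for every entity in the new job.
    new_embs = torch.randn(len(new_names), dim) * old_embs.std()
    # Copy over the rows of entities that were already present.
    for new_i, name in enumerate(new_names):
        old_i = old_index.get(name)
        if old_i is not None:
            new_embs[new_i] = old_embs[old_i]
    return new_embs
```

The resulting tensor could then be written out as the "initial checkpoint" and pointed to by the existing warm-start/prior mechanism, leaving the training loop untouched.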
Thanks @lw for the comments! Yes, the design is mostly in line with your description, except that we enlarge the previous checkpoint offline at the very beginning of the training phase rather than during the importing (or partitioning) phase. We added an extra parameter in the config schema called
And yes, feel free to take your time with the review. Please let me know if there are any further actions needed. Thanks!
@tmarkovich would you take a look at this and give your feedback, since you've been running a similar flow?
Overall this is a good bit more elegant than what I've done. I just have a hacked-together Jupyter notebook that does some of this, but this looks preferable. I'd love to see some tests put into this code.
```
random.shuffle(files)
for pq in files:
    with pq.open("rb") as tf:
        columns = [self.lhs_col, self.rhs_col]
```
Add support for weights
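As a rough sketch of what the requested weight support could look like for a parquet edge reader (the standalone function, its column parameters, and the optional `weight_col` are hypothetical, not part of the current code):

```python
from typing import Optional, Tuple

import numpy as np
import pandas as pd

def read_edges(
    parquet_file,
    lhs_col: str,
    rhs_col: str,
    weight_col: Optional[str] = None,
) -> Tuple[np.ndarray, np.ndarray, Optional[np.ndarray]]:
    """Read an edge table from a parquet file, optionally with a weight column."""
    columns = [lhs_col, rhs_col]
    if weight_col is not None:
        columns.append(weight_col)
    df = pd.read_parquet(parquet_file, columns=columns)
    weights = df[weight_col].to_numpy() if weight_col is not None else None
    return df[lhs_col].to_numpy(), df[rhs_col].to_numpy(), weights
```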
```
model_optimizer: Optimizer,
loss_fn: AbstractLossFunction,
relation_weights: List[float],
self,
```
@adamlerer Is this whitespace change in line with Meta coding standards?
I wouldn't worry about the formatting for now, because our internal formatter will reformat it before merge. We can't expect external contributors to get this right without publishing
@rkindi maybe we can add a black style file or something into the repo to make it easier for external contributors? I'm not sure if there's anything like this for the internal Meta style.
```
in (
    holder.lhs_unpartitioned_types
    | holder.rhs_unpartitioned_types
(1 if entity_type in holder.lhs_partitioned_types else 0)
```
Check this formatting too
```
@@ -495,24 +495,32 @@ def _coordinate_train(self, edges, eval_edge_idxs, epoch_idx) -> Stats:
edges_lhs = edges.lhs.tensor
edges_rhs = edges.rhs.tensor
edges_rel = edges.rel
eval_edges_lhs = None
```
How did you test that this helps fix the over-fitting to sub-buckets?
```
init_entity_storage: AbstractEntityStorage,
entity_storage: AbstractEntityStorage,
entity_counts: Dict[str, List[int]],
) -> None:
```
Could you add some tests for enlarge? It looks like it'll work correctly, but I'd rather be guaranteed that it will
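A sketch of the kind of test being asked for, written against the hypothetical `enlarge_embeddings(old_names, new_names, old_embs)` helper sketched earlier in this thread rather than the exact function added in this PR:

```python
import torch

def test_enlarge_preserves_old_rows_and_adds_new_ones():
    old_names = ["a", "b", "c"]
    new_names = ["b", "d", "a", "c", "e"]  # reordered, plus two new entities
    old_embs = torch.randn(3, 8)

    new_embs = enlarge_embeddings(old_names, new_names, old_embs)

    # The enlarged table covers every entity of the new job.
    assert new_embs.shape == (len(new_names), 8)
    # Rows for pre-existing entities are carried over exactly.
    for name in old_names:
        assert torch.equal(
            new_embs[new_names.index(name)], old_embs[old_names.index(name)]
        )
```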
```
# Enlarged embeddings with the offsets obtained from previous training
# Initialize new embeddings with random numbers
old_embs = embs[old_subset_idxs].clone()
new_embs[subset_idxs, :] = embs[old_subset_idxs].clone()
```
I would do this in one line to save a memory allocation.
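For reference, a sketch of that one-liner using the variable names from the snippet above; advanced indexing on the right-hand side already materializes a new tensor, so the explicit clone() and the intermediate `old_embs` can be dropped (assuming `old_embs` isn't needed elsewhere):

```python
# Copy the old rows directly into their new positions in one step.
new_embs[subset_idxs, :] = embs[old_subset_idxs]
```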
```
# Enlarged embeddings with the offsets obtained from previous training
# Initialize new embeddings with random numbers
old_embs = embs[old_subset_idxs].clone()
```
Cut this clone out unless in debug?
```
new_embs[subset_idxs, :] = embs[old_subset_idxs].clone()

# Test case 1: Whether the embeddings are correctly mapped into the new embeddings
assert torch.equal(new_embs[subset_idxs, :], old_embs)
```
This assert could be quite expensive.
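One way to address both this and the clone() concern above would be to keep the check only when assertions are enabled, e.g. behind Python's built-in `__debug__` flag (a sketch using the same variable names; whether the check is worth keeping at all is a judgment call for the authors):

```python
new_embs[subset_idxs, :] = embs[old_subset_idxs]
if __debug__:
    # Verified only when assertions are enabled (i.e. not under `python -O`).
    assert torch.equal(new_embs[subset_idxs, :], embs[old_subset_idxs])
```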
```
edges_lhs[eval_edge_idxs] = edges_lhs[-num_eval_edges:].clone()
edges_rhs[eval_edge_idxs] = edges_rhs[-num_eval_edges:].clone()
edges_rel[eval_edge_idxs] = edges_rel[-num_eval_edges:].clone()
full_edges_lhs = edges_lhs
full_edges_rhs = edges_rhs
```
Mapping into a new edge-set could be expensive memory-wise. I'd find a way to avoid it.
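A sketch of one memory-friendlier alternative: swap the held-out rows to the tail of the tensor in place, so the full edge list and the eval-free training set are just two views of the same storage and only eval-sized temporaries are allocated. This is an illustration of the idea, not the exact change being requested; `eval_edge_idxs` is assumed to be a LongTensor, and the same call would be applied to each of the lhs/rhs/rel tensors.

```python
import torch

def move_eval_edges_to_tail(edges: torch.Tensor, eval_edge_idxs: torch.Tensor):
    """In place, move the rows at eval_edge_idxs to the end of `edges`.

    Returns (train_view, eval_view), both views of the original storage,
    so no second full-size edge tensor is allocated.
    """
    n, k = len(edges), len(eval_edge_idxs)
    in_tail = eval_edge_idxs >= n - k
    # Eval rows that already sit in the tail can stay where they are.
    src = eval_edge_idxs[~in_tail]
    # Tail slots currently holding training rows become their destinations.
    tail_is_eval = torch.zeros(k, dtype=torch.bool)
    tail_is_eval[eval_edge_idxs[in_tail] - (n - k)] = True
    dst = torch.arange(n - k, n)[~tail_is_eval]
    # Pairwise swap: training rows move to the old eval positions,
    # eval rows move into the tail. Only eval-sized temporaries are created.
    tmp = edges[dst].clone()
    edges[dst] = edges[src]
    edges[src] = tmp
    return edges[:-k], edges[-k:]
```

After the swap, the eval rows can be read back from the tail view for evaluation instead of re-gathering them by index.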
```
if eval_edge_idxs is not None:
    bucket_logger.debug("Removing eval edges")
    tk.start("remove_eval")
    num_eval_edges = len(eval_edge_idxs)
    eval_edges_lhs = edges_lhs[eval_edge_idxs]
```
A rebase here should pick these changes up
This is a very useful feature! But I'm not sure whether I'm using it correctly. Here is my config:
I can run it normally, but it seems to have failed to train incrementally... What am I doing wrong?
Types of changes
Motivation and Context / Related issue
How Has This Been Tested (if it applies)
Acknowledgement
Thanks @jiajunshen for the unique insights and suggestions when designing and testing this new feature!
Checklist