This repository was archived by the owner on Mar 14, 2024. It is now read-only.

Incremental PBG training #232

Open

wants to merge 12 commits into main

Conversation

@howardchanth howardchanth commented Aug 15, 2021

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context / Related issue

  • Previously, PBG did not support training embeddings incrementally. When entities or relations changed, PBG was not able to load the pre-trained embeddings of the existing entities. With this new feature, PBG can enlarge a previously trained checkpoint, initialize the existing entities with their pre-trained embeddings, and initialize the embeddings of new entities with random vectors. The enlarged checkpoint is saved to a new designated folder. (related issue: Can I incremental update the embedding model? #113)
  • We also modified the Parquet reading code to support reading from a folder of Parquet files, which is particularly useful for partitioned Parquet input.
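Conceptually, the enlargement step described in the first bullet can be sketched as follows. This is a minimal, pure-Python illustration with made-up names (`enlarge_embeddings`, the dict-based embedding table); the actual implementation operates on PBG checkpoint tensors and entity offset files.

```python
import random

def enlarge_embeddings(old_embs, new_ids, dim, scale=0.1):
    """Build an enlarged embedding table for the new entity list.

    old_embs: dict mapping entity name -> embedding (list of floats)
    new_ids:  entity names in the new (post-enlargement) order
    Entities present in the old checkpoint keep their pre-trained
    vectors; entities not seen before get small random vectors.
    """
    new_embs = []
    for name in new_ids:
        if name in old_embs:
            # Reuse the pre-trained vector for an existing entity.
            new_embs.append(list(old_embs[name]))
        else:
            # Randomly initialize a brand-new entity.
            new_embs.append([random.uniform(-scale, scale) for _ in range(dim)])
    return new_embs
```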

How Has This Been Tested (if it applies)

  • The changes were tested using the built-in test files; all tests pass.
  • The incremental-training and read-Parquet-from-folder features have been validated and are working well in our current recommendation-system environment, with over 10 million users and 9 relations.

Acknowledgement

Thanks @jiajunshen for the unique insights and suggestions when designing and testing this new feature!

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot
Contributor

Hi @howardchanth!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 15, 2021
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!


@lw
Contributor

lw commented Aug 16, 2021

Thanks for the PR! It has indeed been a much-requested feature. I'd like to understand more about the design you chose to implement it. In my mind, PBG already kind of supported incremental training, because it allowed loading a previous checkpoint as a "prior" in order to bootstrap/warm-start the training. This, however, requires the old checkpoint to cover exactly the same entities as the new training job. I think you were aware of that, as you mention that you're focusing on the issue of adding new entities, which is indeed a limitation.

In my mind, though, this limitation can be lifted just by changing how the importing works, without touching the training loop at all. To be more precise: all we need is that, when importing the next round of edges, the importer also produces an "initial checkpoint" by filtering/rearranging/filling in the old checkpoint. Then that new initial checkpoint can be used as a prior and everything will work. I would go for that approach, as it reduces changes to the training loop, which is already quite complicated.

Also, to give you some advance "warning", we're currently stretched a bit thin so I don't know if we'll be able to fully review this PR shortly.

@howardchanth
Author

Thanks @lw for the comments! Yes, the design is mostly in line with your description, except that we enlarge the previous checkpoint offline at the very beginning of the training phase instead of during the importing (or partitioning) phase. We added an extra parameter to the config schema, called init_entity_path, to specify the initial offsets of the embeddings. The reasons for this design are as follows:

  • There could be multiple versions of models that still refer to the same edge partition. Users may therefore want to use different previous checkpoints as initial embeddings for the new training (while these checkpoints share the same initial edge partitions), and they shouldn't have to repartition the data again every time they retrain the model, since partitioning can take a long time, especially for large knowledge graphs.
  • The new offsets of the entities need to be consulted when mapping the embeddings of old entities to new ones (as the entities are reshuffled for every new partition). Although the new offsets could be read during the partitioning phase, we think it is best to read them offline after partitioning is done.

And yes, feel free to take your time with the review. Please let me know if any further action is needed. Thanks!
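The offset-mapping step described in the second bullet above can be sketched like this. It is a simplified illustration: `remap_offsets` and the plain name lists are hypothetical stand-ins for PBG's per-type, per-partition entity name files.

```python
def remap_offsets(old_names, new_names):
    """Pair each entity's new offset with its old offset.

    Only entities present in both orderings are returned, so old
    embedding rows can be gathered into the enlarged table at the
    correct new positions after the entities have been reshuffled.
    """
    old_pos = {name: i for i, name in enumerate(old_names)}
    return [(new_i, old_pos[name])
            for new_i, name in enumerate(new_names)
            if name in old_pos]
```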

@adamlerer
Contributor

@tmarkovich would you take a look at this and give your feedback, since you've been running a similar flow?

Contributor

@tmarkovich tmarkovich left a comment

Overall this is a good bit more elegant than what I've done. I just have a hacked-together Jupyter notebook that does some of this, but this looks preferable. I'd love to see some tests added to this code.

random.shuffle(files)
for pq in files:
    with pq.open("rb") as tf:
        columns = [self.lhs_col, self.rhs_col]
Contributor

Add support for weights

model_optimizer: Optimizer,
loss_fn: AbstractLossFunction,
relation_weights: List[float],
self,
Contributor

@adamlerer Is this whitespace change in line with Meta coding standards?

Contributor

I wouldn't worry about the formatting for now, because our internal formatter will reformat it before merge. We can't expect external contributors to get this right without publishing it.

@rkindi maybe we can add a Black style file or something to the repo to make it easier for external contributors? I'm not sure if there's anything like this for the internal Meta style.
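For example, a minimal Black configuration committed at the repo root might look like this (a hypothetical sketch; whether Black's defaults match the internal Meta style would need to be confirmed):

```toml
# pyproject.toml
[tool.black]
line-length = 88
target-version = ["py37"]
```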

in (
holder.lhs_unpartitioned_types
| holder.rhs_unpartitioned_types
(1 if entity_type in holder.lhs_partitioned_types else 0)
Contributor

Check this formatting too

@@ -495,24 +495,32 @@ def _coordinate_train(self, edges, eval_edge_idxs, epoch_idx) -> Stats:
edges_lhs = edges.lhs.tensor
edges_rhs = edges.rhs.tensor
edges_rel = edges.rel
eval_edges_lhs = None
Contributor

How did you test that this helps fix the over-fitting to sub-buckets?

init_entity_storage: AbstractEntityStorage,
entity_storage: AbstractEntityStorage,
entity_counts: Dict[str, List[int]],
) -> None:
Contributor

Could you add some tests for enlarge? It looks like it'll work correctly, but I'd rather be guaranteed that it will

# Enlarged embeddings with the offsets obtained from previous training
# Initialize new embeddings with random numbers
old_embs = embs[old_subset_idxs].clone()
new_embs[subset_idxs, :] = embs[old_subset_idxs].clone()
Contributor

I would do this in one line to save a memory allocation
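For instance (a small self-contained sketch using the variable names from the quoted diff, with toy tensors): tensor indexing on the right-hand side already materializes a copy, so the assignment can be fused into one line without the intermediate `.clone()`.

```python
import torch

embs = torch.arange(8.0).reshape(4, 2)   # stand-in for pre-trained embeddings
new_embs = torch.zeros(6, 2)             # stand-in for the enlarged table
old_subset_idxs = torch.tensor([0, 2])
subset_idxs = torch.tensor([1, 4])

# Advanced (tensor) indexing on the right-hand side returns a new tensor,
# so the extra .clone() in the quoted diff allocates a redundant copy:
new_embs[subset_idxs, :] = embs[old_subset_idxs]
```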


# Enlarged embeddings with the offsets obtained from previous training
# Initialize new embeddings with random numbers
old_embs = embs[old_subset_idxs].clone()
Contributor

Cut this clone out unless in debug?

new_embs[subset_idxs, :] = embs[old_subset_idxs].clone()

# Test case 1: Whether the embeddings are correctly mapped into the new embeddings
assert torch.equal(new_embs[subset_idxs, :], old_embs)
Contributor

This assert could be quite expensive.

edges_lhs[eval_edge_idxs] = edges_lhs[-num_eval_edges:].clone()
edges_rhs[eval_edge_idxs] = edges_rhs[-num_eval_edges:].clone()
edges_rel[eval_edge_idxs] = edges_rel[-num_eval_edges:].clone()
full_edges_lhs = edges_lhs
full_edges_rhs = edges_rhs
Contributor

Mapping into a new edge set could be expensive memory-wise. I'd find a way to avoid it.

if eval_edge_idxs is not None:
    bucket_logger.debug("Removing eval edges")
    tk.start("remove_eval")
    num_eval_edges = len(eval_edge_idxs)
    eval_edges_lhs = edges_lhs[eval_edge_idxs]
Contributor

A rebase here should pick these changes up

@DXY-lemon

It is a very useful feature! But I'm not sure whether I'm using it correctly. Here is my config:

        entity_path="data/new_entity_path",
        init_path="model/old_checkpoint_path",
        init_entity_path="data/old_entity_path",
        edge_paths=[
            "data/new_entity_path/train_partitioned",
            "data/new_entity_path/valid_partitioned",
            "data/new_entity_path/test_partitioned",
        ],
        checkpoint_path="model/new_checkpoint_path",

I can run it normally, but it seems to have failed to train incrementally... What is my problem?

Labels: CLA Signed
6 participants