This repository was archived by the owner on Mar 14, 2024. It is now read-only.

Incremental PBG training #232

Open

wants to merge 12 commits into main

Conversation

@howardchanth howardchanth commented Aug 15, 2021

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context / Related issue

  • Previously, PBG did not support training embeddings incrementally. When entities or relations changed, PBG was not able to load the pre-trained embeddings of the existing entities. With this new feature, PBG can enlarge a previously trained checkpoint, initialize the existing entities with their pre-trained embeddings, and initialize the embeddings of new entities with random vectors. The enlarged checkpoint is saved to a new designated folder. (related issue: Can I incremental update the embedding model? #113)
  • We also modified the Parquet reading code to support reading from a folder of Parquet files, which is particularly useful for partitioned Parquet input.
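Conceptually, the enlargement step described in the first bullet can be sketched as follows. This is a minimal, pure-Python illustration with made-up names (`enlarge_embeddings`, the dict-based embedding table); the actual implementation operates on PBG checkpoint tensors and entity offset files.

```python
import random

def enlarge_embeddings(old_embs, new_ids, dim, scale=0.1):
    """Build an enlarged embedding table for the new entity list.

    old_embs: dict mapping entity name -> embedding (list of floats)
    new_ids:  entity names in the new (post-enlargement) order
    Entities present in the old checkpoint keep their pre-trained
    vectors; entities not seen before get small random vectors.
    """
    new_embs = []
    for name in new_ids:
        if name in old_embs:
            # Reuse the pre-trained vector for an existing entity.
            new_embs.append(list(old_embs[name]))
        else:
            # Randomly initialize a brand-new entity.
            new_embs.append([random.uniform(-scale, scale) for _ in range(dim)])
    return new_embs
```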

How Has This Been Tested (if it applies)

  • The changes were tested using the built-in test files; all tests pass.
  • The incremental-training and read-Parquet-from-folder features have been validated and are working well in our current recommendation-system environment, with over 10 million users and 9 relations.

Acknowledgement

Thanks @jiajunshen for the unique insights and suggestions when designing and testing this new feature!

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot
Contributor

Hi @howardchanth!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 15, 2021
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!


@lw
Contributor

lw commented Aug 16, 2021

Thanks for the PR! It has indeed been a much-requested feature. I'd like to understand more about the design you chose to implement it. In my mind, PBG already kind of supported incremental training, because it allowed loading a previous checkpoint as a "prior" in order to bootstrap/warm-start the training. This, however, requires the old checkpoint to cover exactly the same entities as the new training job. I think you were aware of that, as you mention that you're focusing on the issue of adding new entities, which is indeed a limitation.

In my mind, though, this limitation can be lifted just by changing how the importing works, without touching the training loop at all. To be more precise: all we need is that, when importing the next round of edges, the importer also produces an "initial checkpoint" by filtering/rearranging/filling in the old checkpoint. Then that new initial checkpoint can be used as a prior and everything will work. I would go for that approach, as it reduces changes to the training loop, which is already quite complicated.

Also, to give you some advance "warning", we're currently stretched a bit thin so I don't know if we'll be able to fully review this PR shortly.

@howardchanth
Author

Thanks @lw for the comments! Yes, the design is mostly in line with your description, except that we enlarge the previous checkpoint offline at the very beginning of the training phase instead of during the importing (or partitioning) phase. We added an extra parameter to the config schema, called init_entity_path, to specify the initial offsets of the embeddings. The reasons for this design are as follows:

  • There could be multiple versions of models that still refer to the same edge partition. Users may therefore want to use different previous checkpoints as initial embeddings for the new training (while these checkpoints share the same initial edge partitions), and they shouldn't have to repartition the data again every time they retrain the model, since partitioning can take a long time, especially for large knowledge graphs.
  • The new offsets of the entities need to be consulted when mapping the embeddings of old entities to new ones (as the entities are reshuffled for every new partition). Although the new offsets could be read during the partitioning phase, we think it is best to read them offline after partitioning is done.

And yes, feel free to take your time with the review. Please let me know if any further action is needed. Thanks!
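The offset-mapping step described in the second bullet above can be sketched like this. It is a simplified illustration: `remap_offsets` and the plain name lists are hypothetical stand-ins for PBG's per-type, per-partition entity name files.

```python
def remap_offsets(old_names, new_names):
    """Pair each entity's new offset with its old offset.

    Only entities present in both orderings are returned, so old
    embedding rows can be gathered into the enlarged table at the
    correct new positions after the entities have been reshuffled.
    """
    old_pos = {name: i for i, name in enumerate(old_names)}
    return [(new_i, old_pos[name])
            for new_i, name in enumerate(new_names)
            if name in old_pos]
```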

@adamlerer
Contributor

@tmarkovich would you take a look at this and give your feedback, since you've been running a similar flow?

Contributor

@tmarkovich tmarkovich left a comment

Overall this is a good bit more elegant than what I've done. I just have a hacked-together Jupyter notebook that does some of this, but this looks preferable. I'd love to see some tests added to this code.

random.shuffle(files)
for pq in files:
    with pq.open("rb") as tf:
        columns = [self.lhs_col, self.rhs_col]
Contributor

Add support for weights

model_optimizer: Optimizer,
loss_fn: AbstractLossFunction,
relation_weights: List[float],
self,
Contributor

@adamlerer Is this whitespace change in line with Meta coding standards?

Contributor

I wouldn't worry about the formatting for now, because our internal formatter will reformat it before merge. We can't expect external contributors to get this right without publishing it.

@rkindi maybe we can add a Black style file or something to the repo to make it easier for external contributors? I'm not sure if there's anything like this for the internal Meta style.
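For example, a minimal Black configuration committed at the repo root might look like this (a hypothetical sketch; whether Black's defaults match the internal Meta style would need to be confirmed):

```toml
# pyproject.toml
[tool.black]
line-length = 88
target-version = ["py37"]
```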

in (
holder.lhs_unpartitioned_types
| holder.rhs_unpartitioned_types
(1 if entity_type in holder.lhs_partitioned_types else 0)
Contributor

Check this formatting too

@@ -495,24 +495,32 @@ def _coordinate_train(self, edges, eval_edge_idxs, epoch_idx) -> Stats:
edges_lhs = edges.lhs.tensor
edges_rhs = edges.rhs.tensor
edges_rel = edges.rel
eval_edges_lhs = None
Contributor

How did you test that this helps fix the over-fitting to sub-buckets?

init_entity_storage: AbstractEntityStorage,
entity_storage: AbstractEntityStorage,
entity_counts: Dict[str, List[int]],
) -> None:
Contributor

Could you add some tests for enlarge? It looks like it'll work correctly, but I'd rather be guaranteed that it will

# Enlarged embeddings with the offsets obtained from previous training
# Initialize new embeddings with random numbers
old_embs = embs[old_subset_idxs].clone()
new_embs[subset_idxs, :] = embs[old_subset_idxs].clone()
Contributor

I would do this in one line to save a memory allocation
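For instance (a small self-contained sketch using the variable names from the quoted diff, with toy tensors): tensor indexing on the right-hand side already materializes a copy, so the assignment can be fused into one line without the intermediate `.clone()`.

```python
import torch

embs = torch.arange(8.0).reshape(4, 2)   # stand-in for pre-trained embeddings
new_embs = torch.zeros(6, 2)             # stand-in for the enlarged table
old_subset_idxs = torch.tensor([0, 2])
subset_idxs = torch.tensor([1, 4])

# Advanced (tensor) indexing on the right-hand side returns a new tensor,
# so the extra .clone() in the quoted diff allocates a redundant copy:
new_embs[subset_idxs, :] = embs[old_subset_idxs]
```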


# Enlarged embeddings with the offsets obtained from previous training
# Initialize new embeddings with random numbers
old_embs = embs[old_subset_idxs].clone()
Contributor

Cut this clone out unless in debug?

new_embs[subset_idxs, :] = embs[old_subset_idxs].clone()

# Test case 1: Whether the embeddings are correctly mapped into the new embeddings
assert torch.equal(new_embs[subset_idxs, :], old_embs)
Contributor

This assert could be quite expensive.

edges_lhs[eval_edge_idxs] = edges_lhs[-num_eval_edges:].clone()
edges_rhs[eval_edge_idxs] = edges_rhs[-num_eval_edges:].clone()
edges_rel[eval_edge_idxs] = edges_rel[-num_eval_edges:].clone()
full_edges_lhs = edges_lhs
full_edges_rhs = edges_rhs
Contributor

Mapping into a new edge set could be expensive memory-wise. I'd find a way to avoid it.

if eval_edge_idxs is not None:
    bucket_logger.debug("Removing eval edges")
    tk.start("remove_eval")
    num_eval_edges = len(eval_edge_idxs)
    eval_edges_lhs = edges_lhs[eval_edge_idxs]
Contributor

A rebase here should pick these changes up

@DXY-lemon

It is a very useful feature! But I'm not sure whether I'm using it correctly. Here is my config:

        entity_path="data/new_entity_path",
        init_path="model/old_checkpoint_path",
        init_entity_path="data/old_entity_path",
        edge_paths=[
            "data/new_entity_path/train_partitioned",
            "data/new_entity_path/valid_partitioned",
            "data/new_entity_path/test_partitioned",
        ],
        checkpoint_path="model/new_checkpoint_path",

I can run it normally, but it seems to have failed to train incrementally... What is my problem?

Labels: CLA Signed
6 participants