[master < ] Add batched and parallel import #43

Merged: gitbuda merged 42 commits into master from add-batching-parallelization on May 20, 2023

Conversation

gitbuda (Member) commented Jan 25, 2023

Resolves #42: mgconsole currently imports ~1000 lines/s (VERY SLOW)

DATASET
(N = nodes, E = edges, T = total = N + E)
D1) Playground Cora Scientific Publications, N: 2708, E: 5278, T: 7986
D2) Playground Marvel Cinematic Universe, N: 21732, E: 682943, T: 704675

HARDWARE
H1) Ubuntu 20.04, Ryzen 7, 8 cores, 64GB RAM
H2) MacBook Pro M1, 16GB RAM

MEASUREMENTS

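| context | nodes | edges | serial (n+e)/s | parallel (n+e)/s | batch | workers |
|---------|-------|--------|----------------|------------------|-------|---------|
| D1 + H1 | 2708 | 5278 | 1198.37 | 2642.62 | 1000 | 32 |
| D2 + H1 | 21732 | 682943 | 5655.13 | 43820.34 | 1000 | 32 |
| D1 + H2 | 2708 | 5278 | 736.51 | 2252.75 | 1000 | 16 |
| D2 + H2 | 21732 | 682943 | 1060.32 | 7939.91 | 1000 | 16 |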
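
To make the table easier to read at a glance, the speedup of parallel over serial import can be computed per row. The snippet below is only an illustrative sketch: the `Measurement` struct and the function names are made up for this example and are not part of mgconsole; the numbers are copied from the table.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// One row of the MEASUREMENTS table; rates are (nodes + edges) per second.
// Hypothetical helper type, not part of mgconsole.
struct Measurement {
  std::string context;
  double serial_rate;
  double parallel_rate;
};

// Speedup of the parallel import over the serial import for one row.
double Speedup(const Measurement &m) {
  assert(m.serial_rate > 0.0);
  return m.parallel_rate / m.serial_rate;
}

// Values copied from the MEASUREMENTS table.
std::vector<Measurement> TableRows() {
  return {
      {"D1 + H1", 1198.37, 2642.62},
      {"D2 + H1", 5655.13, 43820.34},
      {"D1 + H2", 736.51, 2252.75},
      {"D2 + H2", 1060.32, 7939.91},
  };
}

// Print one "context: speedup" line per row.
void PrintSpeedups() {
  for (const auto &m : TableRows()) {
    std::printf("%s: %.2fx\n", m.context.c_str(), Speedup(m));
  }
}
```

The ratios come out at roughly 2.2x and 3.1x for the small dataset (D1) and roughly 7.7x and 7.5x for the large one (D2), so the parallel path pays off mostly on the bigger import.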

NOTE
a) Parsing speed depends on the size of each line/node/edge. For smaller nodes/edges, query parsing on H2) runs at >10k/s, but for bigger nodes (e.g. with large properties) it can be much slower, e.g. ~3k/s.

TASKS

  • Add STORAGE MODE to the parser
  • Improve (make it correct) the query parsing/analysis part
  • Update the readme page
  • hacked batching -> measure -> more or less the same as single-threaded execution, because the same calls are still there
  • hacked parallelization -> measure
    • ensuring all nodes+edges are properly created will be complex -> IDEA: use summary
  • proper batched and parallel execution -> measure
    • implement a limited number of batches + shuffling to minimize serialization errors
  • Add a hacked version of the query type detection state machine
  • in STORAGE MODE IN_MEMORY_ANALYTICAL, redundant edges are created for some reason
  • Measure the parsing and execution performance with the state machine
  • try a bigger dataset (500k+ edges)
  • implement benchmark test based on the lab quick start datasets
  • every character is processed + query parsing overhead is low -> use that budget to learn more about queries
  • if backoff is not there -> "mg_raw_transport_recv: Bad file descriptor" OUTCOME: Created issue "Debug and improve mgclient session inside mgconsole" #53
  • create proper .cpp implementation files for all runners (interactive, serial_import, ...)
  • somehow make the std::getline part faster -> measure OUTCOME: Abandoned for now
  • move line parsing to be parallel as well (NOTE: not easy because a query can span over many lines) -> measure OUTCOME: Abandoned for now
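
The batching, shuffling, and backoff items above could look roughly like the sketch below. This is not the code from this PR: `MakeBatches`, `RunParallel`, and the `ExecuteFn` callback are hypothetical names, the callback stands in for "run one batch as a single transaction and report whether it has to be retried", and shuffling the batch order (rather than the queries inside a batch) is an assumption.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <string>
#include <thread>
#include <vector>

using Batch = std::vector<std::string>;
// Hypothetical executor: runs one batch as a single transaction and returns
// false when the batch should be retried (e.g. after a serialization error).
using ExecuteFn = std::function<bool(const Batch &)>;

// Split queries into consecutive batches of at most batch_size queries.
std::vector<Batch> MakeBatches(const std::vector<std::string> &queries, std::size_t batch_size) {
  assert(batch_size > 0);
  std::vector<Batch> batches;
  for (std::size_t i = 0; i < queries.size(); i += batch_size) {
    const std::size_t end = std::min(queries.size(), i + batch_size);
    batches.emplace_back(queries.begin() + static_cast<std::ptrdiff_t>(i),
                         queries.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return batches;
}

// Execute all batches on `workers` threads. A failed batch is retried with
// exponential backoff. Returns how many batches still failed after max_attempts.
std::size_t RunParallel(std::vector<Batch> batches, std::size_t workers, const ExecuteFn &execute,
                        int max_attempts = 5, unsigned seed = 42) {
  assert(workers > 0 && max_attempts > 0);
  // Shuffle the batch order so neighbouring batches are less likely to run
  // at the same time and conflict with each other.
  std::mt19937 rng(seed);
  std::shuffle(batches.begin(), batches.end(), rng);

  std::atomic<std::size_t> next{0};
  std::atomic<std::size_t> failed{0};
  auto work = [&] {
    for (;;) {
      const std::size_t idx = next.fetch_add(1);  // claim the next unprocessed batch
      if (idx >= batches.size()) return;
      bool ok = false;
      for (int attempt = 0; attempt < max_attempts && !ok; ++attempt) {
        if (attempt > 0) {
          // Back off before retrying: 2 ms, 4 ms, 8 ms, ...
          std::this_thread::sleep_for(std::chrono::milliseconds(1 << attempt));
        }
        ok = execute(batches[idx]);
      }
      if (!ok) failed.fetch_add(1);
    }
  };

  std::vector<std::thread> pool;
  for (std::size_t w = 0; w < workers; ++w) pool.emplace_back(work);
  for (auto &t : pool) t.join();
  return failed.load();
}
```

Returning the number of permanently failed batches gives the caller a simple way to verify the import afterwards, similar in spirit to the "use summary" idea in the task list.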

gitbuda added the enhancement label Jan 25, 2023
gitbuda self-assigned this Jan 25, 2023
gitbuda (Member, Author) commented Jan 27, 2023

[Screenshot 2023-01-27 at 4 23 01 PM]

gitbuda (Member, Author) commented Jan 28, 2023

On M1, with query execution completely removed, only the "single thread batching" overhead remains -> a lot of room for parallel execution

[Screenshot 2023-01-28 at 3 36 04 PM]

gitbuda (Member, Author) commented Jan 28, 2023

Single-threaded batched execution with the same calls, but without index creation and __vertex_id prop removal (it's impossible to create indexes in a multi-query transaction) -> very similar results.

[Screenshot 2023-01-28 at 9 39 22 PM]
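
Since indexes cannot be created inside a multi-query transaction, one way to keep them in a batched import is to pull index-creation queries out and run them on their own. The sketch below only illustrates that idea and is not what this PR does: `StartsWithKeyword`, `SplitQueries`, and `SplitForBatching` are hypothetical names, and the prefix check is a deliberately simple heuristic (it expects exactly one space in `CREATE INDEX`), not the query type detection state machine from the task list.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if `query` starts with the upper-case `keyword`, ignoring leading
// whitespace and letter case in the query.
bool StartsWithKeyword(const std::string &query, const std::string &keyword) {
  assert(!keyword.empty());
  std::size_t pos = 0;
  while (pos < query.size() && std::isspace(static_cast<unsigned char>(query[pos]))) ++pos;
  if (query.size() - pos < keyword.size()) return false;
  for (std::size_t i = 0; i < keyword.size(); ++i) {
    const auto c = static_cast<unsigned char>(query[pos + i]);
    if (std::toupper(c) != static_cast<unsigned char>(keyword[i])) return false;
  }
  return true;
}

// Hypothetical result type: queries separated by how they may be executed.
struct SplitQueries {
  std::vector<std::string> standalone;  // run one by one, outside batched transactions
  std::vector<std::string> batchable;   // may be grouped into multi-query transactions
};

// Route index-creation queries to `standalone`, everything else to `batchable`.
// Input order is preserved within each group.
SplitQueries SplitForBatching(const std::vector<std::string> &queries) {
  SplitQueries out;
  for (const auto &q : queries) {
    if (StartsWithKeyword(q, "CREATE INDEX")) {
      out.standalone.push_back(q);
    } else {
      out.batchable.push_back(q);
    }
  }
  return out;
}
```

With a split like this, the standalone queries would run first (each in its own transaction) and the batchable ones afterwards in batches.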

gitbuda (Member, Author) commented Feb 5, 2023

Still just a bit faster 🏃
[Screenshot 2023-02-05 at 3 04 49 PM]

gitbuda changed the title from "Add batched and parallel import" to "[master < ] Add batched and parallel import" Feb 14, 2023
gitbuda added this to the mgconsole-v1.4 milestone Feb 14, 2023
gitbuda requested a review from antoniofilipovic May 16, 2023 06:29
gitbuda removed the request for review from antoniofilipovic May 20, 2023 13:08
gitbuda marked this pull request as ready for review May 20, 2023 13:08
gitbuda merged commit c1d60b7 into master May 20, 2023
gitbuda deleted the add-batching-parallelization branch May 20, 2023 13:11
Labels: enhancement
Projects: Status: Done