
How to manually set the number of items iterated per batch in parallel queries in Neo4j?

Setup:
Two servers in a remote data center, both running Neo4j 4.2.9 Enterprise. The first has about 80 GB of RAM; the second has weaker hardware and is somewhat slower.

Data:
I’m running a fairly complicated conditional path-finding query based on apoc.path.spanningTree for ~15,000 IDs divided among 8 macroregions. The heaviest part of the query is the node blacklist, which ranges from tens of thousands up to about 1 million nodes depending on the macroregion. Tracing the paths for one ID takes ~50 MB of memory and ~1 second on average (with some rare but heavy exceptions). Each macroregion currently has anywhere from 10 to 3,700 IDs, and that number is going to grow.
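For context, the per-ID tracing is built around a call roughly like the sketch below. The labels, properties, relationship type and the way the blacklist is collected are placeholders, not my real model; only the general shape matters here.

// Simplified sketch of tracing one ID (labels/properties are made up)
MATCH (start:Object {id: $id})
MATCH (b:BlacklistedNode {macroregion: $macroregion})
WITH start, collect(b) AS blacklist
CALL apoc.path.spanningTree(start, {
  relationshipFilter: 'CONNECTED>',
  maxLevel: 20,
  blacklistNodes: blacklist
}) YIELD path
RETURN path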

Problem:
I found a working solution that parallelizes the tracing of all IDs in a chosen macroregion using apoc.cypher.mapParallel (I also tried apoc.cypher.parallel, which was slower, and apoc.cypher.mapParallel2).
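Schematically the parallel call looks like the sketch below; the inner fragment is heavily simplified, and the Object label, id property, CONNECTED relationship type and $ids parameter are placeholders.

// Run the tracing fragment in parallel over one macroregion's list of IDs;
// inside the fragment, _ is bound to a single ID from the list
CALL apoc.cypher.mapParallel(
  'MATCH (start:Object {id: _})
   CALL apoc.path.spanningTree(start, {relationshipFilter: "CONNECTED>", maxLevel: 20}) YIELD path
   RETURN count(path) AS paths',
  {},
  $ids
) YIELD value
RETURN sum(value.paths) AS totalPaths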
When I check how the query is iterating through the list of IDs with CALL dbms.listQueries, the "parameters" column shows which IDs, and how many of them, are currently being processed. Example:

{
  "retries": 1,
  "batchSize": 10,
  "parallel": true,
  "concurrency": 10,
  "_": [
    "1-6Q77E9J2",
    "1-6Q7NXW2E",
    "1-6QIZ9ECU",
    "1-6QIZDA26",
    "1-6QIZGTC7"
  ]
}
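That output comes from a monitoring query along these lines (the CONTAINS filter is only an example of how I pick out the inner parallel queries):

// Show which IDs each running inner query currently holds in its _ parameter
CALL dbms.listQueries() YIELD queryId, query, parameters, elapsedTimeMillis
WHERE query CONTAINS 'spanningTree'
  AND NOT query CONTAINS 'dbms.listQueries'  // skip this monitoring query itself
RETURN queryId,
       parameters['_'] AS currentIds,
       size(parameters['_']) AS idsPerIteration,
       elapsedTimeMillis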

The thing is, I can’t change the number of IDs per iteration, and I don’t quite understand Neo4j’s logic here. I’ve tried increasing batchSize and concurrency, and I’ve tried switching to apoc.cypher.mapParallel2 with different values for partitions and timeout, but no luck: the Neo4j planner always iterates over the same number of IDs.
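For example, one of the variants I tried looked roughly like this (simplified in the same way as above; 20 partitions and a timeout of 60 are just sample values I experimented with):

CALL apoc.cypher.mapParallel2(
  'MATCH (start:Object {id: _})
   CALL apoc.path.spanningTree(start, {relationshipFilter: "CONNECTED>", maxLevel: 20}) YIELD path
   RETURN count(path) AS paths',
  {},
  $ids,
  20,  // partitions
  60   // timeout
) YIELD value
RETURN sum(value.paths) AS totalPaths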

The results of testing both servers on the different macroregions are in the tables below.

In short:

  • the longer the list of IDs being iterated, the more IDs are taken per batch
  • the server with better hardware takes about 2x more IDs per batch
  • but the resulting speed-up is not that significant

Faster server

Macroregion  Total exec time, s  IDs per iteration  Total IDs in macroregion  Avg exec time per ID, s/ID
MW           7.532               1                  10                        0.753
NW           569.955             3                  906                       0.629
SI           3239.69             10                 3717                      0.872
FE           1343.438            6                  2354                      0.571
POV          2459.619            10                 3733                      0.659
CEN          822.114             2                  647                       1.271
SO           1877.518            7                  2531                      0.742
UR           2273.111            5                  1608                      1.414

Slower server

Macroregion  Total exec time, s  IDs per iteration  Total IDs in macroregion  Avg exec time per ID, s/ID
MW           7.713               1                  10                        0.771
NW           654.403             2                  906                       0.722
SI           3511.782            5                  3717                      0.945
FE           1450.656            3                  2354                      0.616
POV          2784.704            5                  3733                      0.746
CEN          964.475             1                  647                       1.491
SO           2435.731            4                  2531                      0.962
UR           2651.325            3                  1608                      1.648

Questions:

  1. How can I override this and manually choose how many IDs are iterated in parallel?
  2. Why doesn’t "batchSize" have any effect on my query, and what does it stand for in this case?
  3. If anyone knows the details, what is the logic behind this default behaviour of the Neo4j planner?