When you need to update a large portion of the data in your graph, you will most likely reach for the APOC procedure apoc.periodic.iterate().
As its name suggests, it iterates over a large data set in batches.
Usually, iterating over the nodes you want to update is straightforward:
CALL apoc.periodic.iterate(
"MATCH (p:Person) WHERE (p)-[:ACTED_IN]->() RETURN p",
"SET p:Actor",
{batchSize:10000, parallel:true})
The first statement returns every Person node that has an [:ACTED_IN] relationship to any movie. apoc.periodic.iterate() then runs the second statement against each returned row, in batches of 10,000 (here it adds the Actor label to the node).
Now, let's look at another example:
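To check what a run like this actually did, the columns yielded by the procedure can be returned. The following is a minimal sketch under the same assumptions as the example above (Person nodes and ACTED_IN relationships); parallel is switched off here only as a conservative choice for illustration, not because the original call requires it.

```cypher
// Same labelling job as above, but returning summary columns.
// failedBatches and errorMessages show whether any batch failed and why.
CALL apoc.periodic.iterate(
  "MATCH (p:Person) WHERE (p)-[:ACTED_IN]->() RETURN p",
  "SET p:Actor",
  {batchSize: 10000, parallel: false})
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages;
```

A non-zero failedBatches value is a hint to look at errorMessages before assuming every node was updated.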
CALL apoc.periodic.iterate(
'MATCH (n:Actor) RETURN n',
'MERGE (new:Actor {name: toLower(n.name)})
WITH new AS newActor, n AS Actor
CALL apoc.refactor.mergeNodes([newActor, Actor], {properties: "discard", mergeRels: true}) YIELD node RETURN node',
{batchSize: 1}
)
YIELD batches, total, operations;
The first statement simply returns all nodes with the label :Actor.
The second statement iterates over those nodes and calls apoc.refactor.mergeNodes() to merge two nodes into one, which deletes the other node.
Written this way, the call deadlocks (and performs badly, if it completes at all). The first statement returns the nodes themselves, while the second statement, inside the loop, tries to update and delete those same nodes. As a result, the deadlock occurs immediately and your application simply hangs.
Luckily, there is a better way to write this in Cypher:
CALL apoc.periodic.iterate(
'MATCH (n:Actor) RETURN id(n) AS id',
'MATCH (n) WHERE id(n) = id
MERGE (new:Actor {name: toLower(n.name)})
WITH new AS newActor, n AS Actor
CALL apoc.refactor.mergeNodes([newActor, Actor], {properties: "discard", mergeRels: true}) YIELD node RETURN node',
{batchSize: 1}
)
YIELD batches, total, operations;
Written this way, the procedure merges the nodes without causing a deadlock.
This time the first statement returns a list of node IDs instead of the nodes themselves. When the second statement runs in the loop, it uses each node ID (in effect a reference or pointer) to fetch the node, which avoids the deadlock.
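On Neo4j 5 and later, id() is deprecated in favor of elementId(), so the same pattern can be sketched with element IDs. This is an illustrative variant, not part of the original example: it assumes a Neo4j 5 installation with a matching APOC release, and the column name eid is a hypothetical name chosen here.

```cypher
// Variant of the ID-based approach using elementId() (Neo4j 5+).
// "eid" is an illustrative column name, not taken from the original article.
CALL apoc.periodic.iterate(
  'MATCH (n:Actor) RETURN elementId(n) AS eid',
  'MATCH (n) WHERE elementId(n) = eid
   MERGE (new:Actor {name: toLower(n.name)})
   WITH new AS newActor, n AS Actor
   CALL apoc.refactor.mergeNodes([newActor, Actor], {properties: "discard", mergeRels: true})
   YIELD node RETURN node',
  {batchSize: 1})
YIELD batches, total, operations;
```

As with id(), the first statement passes only a reference to each node, and the second statement looks the node up again inside its own batch.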