If you need to load a large data set in JSON or CSV format, one of the best approaches is to combine two APOC procedures:
- apoc.periodic.iterate() splits the large JSON/CSV load into many small batch jobs, which makes high-performance parallel loading possible.
- apoc.load.json() loads the data from JSON-formatted files (apoc.load.csv() is the CSV counterpart).
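Before wiring the two procedures together, it can help to preview what apoc.load.json() actually returns. The following is a minimal sketch that assumes the same file URL used later in this article; substitute your own location.

```cypher
// Preview the first few records without writing anything to the graph.
// The file URL is the one from this article's example; replace it with your own.
CALL apoc.load.json('file:////Users/example/Neo4j/Aura/users.json')
YIELD value AS data
RETURN data
LIMIT 5
```

Each returned row is a map, so fields such as data.uuid or data.id can be checked for null before you decide how to load the record.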
But things always get complicated when you are dealing with real-world production data. In many cases you need to inspect each record in the file, which calls for conditional execution logic during data loading (node or relationship creation). That logic can't be adequately expressed in plain Cypher.
For complex conditional logic, you should use the apoc.case() or apoc.do.case() procedures. Both accept a variable-length list of condition/query pairs and execute the query that follows the first condition evaluating to true.
apoc.case() is for read-only queries. Because data loading creates nodes and relationships, we must use apoc.do.case().
So here is an example that combines three APOC procedures to load a large-scale data set from JSON files into a Neo4j AuraDB instance, with both conditional and parallel loading capabilities.
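Because apoc.case() is read-only, it can serve as a dry run of the branching logic before any data is written. The sketch below uses three hypothetical in-line records (not taken from any real file) and returns a label for the branch each record would take. The labels and the record values are illustrative assumptions.

```cypher
// Hypothetical sample records: one with uuid, one with only id, one with neither.
UNWIND [
  {uuid: 'a-1', id: 1, city: 'Oslo'},
  {id: 2, city: 'Lund'},
  {Name: 'Alice'}
] AS data
CALL apoc.case(
  [
    // Only the query after the first true condition is executed.
    data.uuid IS NOT NULL, 'RETURN "MERGE on uuid" AS branch',
    data.id IS NOT NULL, 'RETURN "MERGE on id" AS branch'
  ],
  // Fallback query when no condition is true.
  'RETURN "CREATE" AS branch',
  {data: data}
) YIELD value
RETURN data, value.branch AS branch
```

Once the branches look right, the same conditions can be moved into apoc.do.case() with the write queries in place of the RETURN statements.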
CALL apoc.periodic.iterate(
  "CALL apoc.load.json('file:////Users/example/Neo4j/Aura/users.json')
   YIELD value AS data",
  "CALL apoc.do.case(
     [
       data.uuid IS NOT NULL, 'MERGE (u:User {uuid: data.uuid}) ON CREATE SET u.city = data.city',
       data.id IS NOT NULL, 'MERGE (u:User {id: data.id}) ON CREATE SET u.city = data.city'
     ],
     'CREATE (u:User) SET u.Name = data.Name',
     {data: data}
   ) YIELD value RETURN value",
  {batchSize: 2, parallel: false})
In this example, the users.json file is loaded and each record is examined in turn. If uuid is not null, only the first MERGE statement is executed for that record. If uuid is null but id is not null, only the second MERGE is executed. If both uuid and id are null, only the CREATE is executed.
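To confirm that a load finished cleanly, you can read the summary that apoc.periodic.iterate() returns. The sketch below repeats the example call and adds a YIELD clause. The choice of columns is an assumption about what is most useful to check, and batchSize and parallel are left at the example's values.

```cypher
CALL apoc.periodic.iterate(
  "CALL apoc.load.json('file:////Users/example/Neo4j/Aura/users.json')
   YIELD value AS data",
  "CALL apoc.do.case(
     [
       data.uuid IS NOT NULL, 'MERGE (u:User {uuid: data.uuid}) ON CREATE SET u.city = data.city',
       data.id IS NOT NULL, 'MERGE (u:User {id: data.id}) ON CREATE SET u.city = data.city'
     ],
     'CREATE (u:User) SET u.Name = data.Name',
     {data: data}
   ) YIELD value RETURN value",
  {batchSize: 2, parallel: false})
// Inspect how many batches ran and whether any of them failed.
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages
```

A non-zero failedBatches value, together with the entries in errorMessages, points at the records that need attention. The example keeps parallel set to false; switching it to true runs batches concurrently, but concurrent MERGE operations on the same nodes may contend for locks, so test that setting on your own data first.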
Now you can use this example as a blueprint to create your own data loading procedure.