Typically, when we load a dataset into a Neo4j AuraDB Instance, the raw data comes in JSON or CSV format. That case is quite straightforward: we use the "LOAD CSV" Cypher statement or the apoc.load.json() procedure to get the dataset into the AuraDB Instance.
Unfortunately, if the dataset is relatively large, or the AuraDB Instance you created is relatively small (e.g. 1GB RAM, 1 CPU), the load will take a long time and can eventually fail with the error message: "Session Expired: WebSocket connection failure."
To solve this problem, you can combine your load statement with a very powerful APOC procedure: apoc.periodic.iterate. It runs the second statement for each item returned by the first statement, in batches, and returns the number of batches and the total number of processed rows:
apoc.periodic.iterate(
'statement returning items',
'statement per item',
{batchSize:1000,iterateList:true,parallel:false,params:{},concurrency:50,retries:0})
YIELD batches, total
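Before running a real import, you can get a feel for the procedure with a trivial pair of statements. The following is a minimal sketch; the :Number label and value property are made up purely for illustration:
CALL apoc.periodic.iterate(
// first statement: stream the items to process
"UNWIND range(1, 10000) AS n RETURN n",
// second statement: run once per item, committed every batchSize rows
"CREATE (:Number {value: n})",
{batchSize:1000})
YIELD batches, total
RETURN batches, total;
This creates 10,000 nodes in 10 transactions of 1,000 rows each and returns batches = 10, total = 10000.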
Here is an example that loads a large dataset in CSV format. It uses two parameters of the apoc.periodic.iterate procedure:
- batchSize: run the specified number of operation statements in a single transaction. 1000 is a fairly good number to start with; fine-tune it according to your AuraDB Instance configuration.
- retries: if the operation statement fails with an error, sleep 100ms and retry until the retry count is reached. You should set this parameter as well: when a batch fails (for example, with a session expired error), the procedure will retry it instead of aborting the whole load. Start with 3 and tune it accordingly.
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'https://storage.example.com/basics.csv' AS row
"MERGE (m:Movie {id: row.tconst})
ON CREATE SET
m.title = row.primaryTitle,
m.releaseYear = row.startYear,
m.genres = row.genres",
{batchSize:1000, retries: 3})
You should use "MERGE" to inject data into your AuraDB Instance so that you would have duplicated records after multiple times of insert the same data set.
Note that you must create the unique constraint before loading the data; otherwise, the "MERGE" statement will run very slowly.
CREATE CONSTRAINT UniqueMovieIdConstraint FOR (m:Movie) REQUIRE m.id IS UNIQUE;
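On recent Neo4j versions you can confirm the constraint is in place before starting the load:
// confirm UniqueMovieIdConstraint appears in the list before loading
SHOW CONSTRAINTS;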
The same method applies when you load a dataset in JSON format.
Normally you would use apoc.load.json() to load JSON files. When loading into a Neo4j AuraDB Instance, you should wrap it in the apoc.periodic.iterate() procedure in the same way. In this example, both the "batchSize" and "retries" parameters are set following the guidance above.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('https://storage.example.com/basics.json')
YIELD value AS data",
"MERGE (m:Movie {id: data.tconst})
ON CREATE SET
m.title = data.primaryTitle,
m.releaseYear = data.startYear,
m.genres = data.genres",
{batchSize:2000, retries:3})
Now the problem is solved, but you might find the data load is still not fast enough. According to the apoc.periodic.iterate() documentation, there are other parameters we can use to improve loading efficiency.
Neo4j Aura is a very flexible DBaaS platform, which means you can expand or shrink an AuraDB Instance on demand. When the data load takes too long, you can resize the AuraDB Instance to a larger hardware configuration.
With more CPUs and a larger memory footprint available, you can draft another version of the data loading script that adds parallel capabilities to the APOC procedure, which gives a much faster data loading experience. Keep in mind that parallel:true is only safe when different batches do not write to the same nodes or relationships; otherwise concurrent transactions can deadlock, which is also why a healthy retries value matters here.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('https://storage.example.com/basics.json')
YIELD value AS data",
"MERGE (m:Movie {id: data.tconst})
ON CREATE SET
m.title = data.primaryTitle,
m.releaseYear = data.startYear,
m.genres = data.genres",
{batchSize:2000, retries:30, parallel:true, concurrency:2})
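After the run, it is worth checking whether any batches failed, because failed batches mean skipped rows. The procedure also yields fields such as failedBatches and errorMessages; here is the same call sketched with fuller reporting (same hypothetical URL as above):
CALL apoc.periodic.iterate(
"CALL apoc.load.json('https://storage.example.com/basics.json') YIELD value AS data",
"MERGE (m:Movie {id: data.tconst})
ON CREATE SET
m.title = data.primaryTitle,
m.releaseYear = data.startYear,
m.genres = data.genres",
{batchSize:2000, retries:30, parallel:true, concurrency:2})
YIELD batches, total, failedBatches, errorMessages
// a healthy load reports failedBatches = 0 and no error messages
RETURN batches, total, failedBatches, errorMessages;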