Apache Spark has connectors for almost every database, which makes it a natural point of coordination. Adding Neo4j to the Spark environment via the connector means
you can move data from anywhere -> Apache Spark -> Neo4j, or in the reverse direction.
We will demonstrate this using a Snowflake Database as the source and a Neo4j AuraDB instance as the destination, using an Apache Spark deployment on Databricks and the Neo4j Spark Connector.
Note: In this demonstration we use Spark on Databricks, but other vendors' Spark implementations are also supported, and the same result could be achieved using Kafka, an ETL tool such as Apache Hop or, indeed, programmatically using the appropriate Snowflake and Neo4j drivers.
Neo4j Aura setup (destination/sink):
- Create a Neo4j AuraDB instance in the Aura Console.
This demo works even on Free Tier instances, as long as the node and relationship counts stay within the tier's limits (up to 200k nodes and 400k relationships).
- Make a note of the DBID or connection URI (NEO4J_URI), username (NEO4J_USERNAME) and password (NEO4J_PASSWORD) for the AuraDB instance.
These details are presented when you create the instance and are available for download as a .env file named in the format credentials-<dbid>.env.
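The downloaded file is a plain KEY=VALUE list. As a minimal sketch, assuming the standard NEO4J_URI, NEO4J_USERNAME and NEO4J_PASSWORD keys Aura uses, it can be loaded in Python like this (the filename is illustrative):

```python
# Load Neo4j Aura credentials from the downloaded .env file.
# Assumes the KEY=VALUE format of the Aura credentials file; the
# filename in the usage comment below is a placeholder.

def load_aura_credentials(path: str) -> dict:
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            # Split at the first '=' only, so values may contain '='
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip().strip('"')
    return creds

# Example usage (replace with your actual credentials file):
# creds = load_aura_credentials("credentials-<dbid>.env")
# uri = creds["NEO4J_URI"]
# user = creds["NEO4J_USERNAME"]
# password = creds["NEO4J_PASSWORD"]
```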
Snowflake setup (source):
- Create an account at https://app.snowflake.com/ if you have not already done so. Use their 30-day trial if needed - Snowflake Trial Accounts
- Create your data warehouse, database and tables in Snowflake if required.
We will be using the sample table called 'customer' in the 'tpcds_sf100tcl' schema from the 'snowflake_sample_data' database.
The query runs on the default virtual warehouse, COMPUTE_WH.
The process remains the same for custom data as well.
SELECT C_CUSTOMER_ID, C_SALUTATION, C_FIRST_NAME, C_LAST_NAME, C_BIRTH_COUNTRY FROM snowflake_sample_data.tpcds_sf100tcl.customer LIMIT 100;
To read from or write to this account using Databricks, you need the account URL, a username and password, and the names of the warehouse, database and schema you want to use.
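These details can be collected into an options dictionary for the Spark reader. A minimal sketch, using the Snowflake Spark connector's documented option keys; every value below is a placeholder for your own account details:

```python
# Connection options for the Snowflake Spark connector.
# The option names (sfUrl, sfUser, ...) are the connector's documented keys;
# all values are placeholders to be replaced with your account details.
sf_options = {
    "sfUrl": "<account_identifier>.snowflakecomputing.com",  # placeholder
    "sfUser": "<username>",                                  # placeholder
    "sfPassword": "<password>",    # fetch from a secret store in practice
    "sfDatabase": "snowflake_sample_data",
    "sfSchema": "tpcds_sf100tcl",
    "sfWarehouse": "COMPUTE_WH",
}

# On the cluster, the dictionary is passed to the reader, e.g.:
# df = (spark.read.format("snowflake")
#       .options(**sf_options)
#       .option("query", "SELECT ... FROM customer LIMIT 100")
#       .load())
```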
Databricks Setup (Apache Spark):
- Create an account at https://www.databricks.com/. Use their 14-day Free trial if required.
Note: If you are testing Databricks for personal use and training, you can use their Community Edition to ensure you don't incur additional charges from your cloud provider.
- Create a new compute cluster or use an existing one.
Ensure the cluster is running.
- Navigate to the cluster's Libraries tab. Reference: Cluster Libraries
- Install a compatible version of the Spark Connector for Snowflake. Reference: Installing and Configuring the Spark Connector
- Click 'Install new'.
- Select Maven and enter the appropriate Maven coordinates.
We will be using 'net.snowflake:spark-snowflake_2.12:2.11.1-spark_3.3', which matches the cluster's Scala 2.12 version.
- Click 'Install'.
- Configure the Snowflake connector, including secret management in Databricks. Please refer to the official Snowflake documentation for instructions.
The example code provided expects Snowflake credentials to be stored as sfUser and sfPassword.
- Similarly, set up the Neo4j credentials in Databricks' secret management. The example code expects the Neo4j Aura password in the neo4jPassword key. The connection URL and username (neo4j) are hardcoded.
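As a sketch, the Neo4j side of the notebook might assemble its connection options like this. The option names ("url", "authentication.basic.*") are the Neo4j Spark connector's documented keys; the URL, secret scope and password value below are placeholders:

```python
# Neo4j Spark connector connection options.
# The option names are the connector's documented keys; the URL is a
# placeholder for your own AuraDB instance.
# On Databricks, the password would come from the secret manager, e.g.:
#   neo4j_password = dbutils.secrets.get(scope="<scope>", key="neo4jPassword")
neo4j_password = "<password-from-secrets>"  # placeholder for the secret value

neo4j_options = {
    "url": "neo4j+s://<dbid>.databases.neo4j.io",  # placeholder AuraDB URL
    "authentication.basic.username": "neo4j",
    "authentication.basic.password": neo4j_password,
}

# On the cluster, a DataFrame write would then look like:
# (df.write.format("org.neo4j.spark.DataSource")
#    .options(**neo4j_options)
#    .option("labels", ":Customer")
#    .mode("Append")
#    .save())
```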
- Install a compatible version of the Neo4j connector for Apache Spark
Since we created a Databricks cluster with Apache Spark 3.3.2 and Scala 2.12, we will be installing neo4j-connector-apache-spark_2.12-5.0.0_for_spark_3.jar (the connector's Scala version must match the cluster's).
- Navigate to the releases page for the Neo4j Apache Spark connector on GitHub.
- Download the connector's jar file from the Assets section of the appropriate release.
- Navigate back to the Libraries tab of your Databricks cluster.
- Install the jar file you just downloaded:
- Click 'Install new'.
- Select 'Upload' as the 'Library Source' and 'JAR' as the 'Library Type', then upload the jar file you downloaded earlier.
- This is how the cluster's 'Libraries' tab will look after installing both connectors:
Reading from Snowflake and writing to Neo4j Aura:
The notebook (.ipynb) attached below can be imported into a notebook in the Databricks cluster and has examples of
- Reading data from Snowflake using the Snowflake connector and displaying the results
- Reading data from Snowflake, transforming the tabular data into nodes and relationships, and writing it into Neo4j Aura using the Neo4j Connector for Apache Spark.
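To make the second step concrete, here is a plain-Python sketch of the tabular-to-graph mapping, independent of Spark. The graph model used here (Customer and Country nodes joined by BORN_IN relationships, built from the sample query's columns) is an assumed example for illustration, not necessarily the mapping the notebook defines:

```python
# Illustrative tabular-to-graph mapping: split flat customer rows into
# node and relationship records. The model (Customer)-[:BORN_IN]->(Country)
# is an assumed example based on the sample query's columns.

def rows_to_graph(rows):
    """Return (customer nodes, country nodes, BORN_IN relationships)."""
    customers, countries, born_in = [], set(), []
    for r in rows:
        customers.append({
            "id": r["C_CUSTOMER_ID"],
            "firstName": r["C_FIRST_NAME"],
            "lastName": r["C_LAST_NAME"],
        })
        country = r["C_BIRTH_COUNTRY"]
        if country:
            countries.add(country)  # deduplicate country nodes
            born_in.append({"customerId": r["C_CUSTOMER_ID"], "country": country})
    return customers, sorted(countries), born_in

# Two made-up rows in the shape returned by the sample query:
sample = [
    {"C_CUSTOMER_ID": "AAA01", "C_FIRST_NAME": "Ada", "C_LAST_NAME": "Lovelace",
     "C_BIRTH_COUNTRY": "UNITED KINGDOM"},
    {"C_CUSTOMER_ID": "AAA02", "C_FIRST_NAME": "Alan", "C_LAST_NAME": "Turing",
     "C_BIRTH_COUNTRY": "UNITED KINGDOM"},
]
customers, countries, rels = rows_to_graph(sample)
```

In the notebook the same reshaping is done with DataFrame operations and the resulting frames are written through the Neo4j connector, but the node/relationship split is the same idea.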
You can import the notebook directly into your Databricks workspace using the Import option. No local setup is required.
Prerequisites for running the notebook:
- Ensure Neo4j Aura, Snowflake and Databricks, including the connectors, are set up as explained in the previous steps.
- Save the Snowflake username (sfUser) and password (sfPassword) and the Neo4j Aura password (neo4jPassword) as secrets in Databricks' secret manager.
- Update the Snowflake URL (Cell #1, line #8) and Neo4j URL (Cell #2, line #11) in the notebook with your environment's URLs.
Run the notebook as desired.
The video below, produced by one of our architects, runs through the example notebook on how to import data stored in Snowflake to Neo4j Aura.
The source code used in this video is included here: Import Data from Snowflake to Neo4j