If you want to use AWS Glue to ingest data into AuraDB, this is the right article to review. It walks through a step-by-step procedure for data ingestion into AuraDB using the Neo4j Spark Connector with AWS Glue.
Prerequisites:
- Sample Scala Script
- CSV File for Ingestion
- Neo4j Spark Connector Jar
- AWS Access
The sample Scala script below targets Glue 3.0. It reads data from an S3 bucket, creates a DataFrame, and inserts it into an Aura database. Please note that copying and pasting this script may introduce illegal characters on the empty lines; you will be able to see and remove these in the script editor within Glue.
Sample Scala Script:
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions.{col, lit, when}
import com.amazonaws.services.glue.util.GlueArgParser
import org.apache.spark.sql.{SaveMode, SparkSession}

object GlueApp {

  val sc: SparkContext = new SparkContext()
  val glueContext: GlueContext = new GlueContext(sc)

  // Generates a Glue DynamicFrame from an S3 file path, optionally keeping only the selected fields
  def generateFromS3Df(fileName: String, selectOnly: Seq[String] = null): DynamicFrame = {
    val file_path = Array(s"s3://<s3-bucket-name-here>/$fileName.csv")
    val fd = glueContext
      .getSourceWithFormat(
        connectionType = "s3",
        options = JsonOptions(Map("paths" -> file_path)),
        format = "csv",
        formatOptions = JsonOptions(Map("separator" -> "|", "quoteChar" -> -1, "writeHeader" -> false, "withHeader" -> true)))
      .getDynamicFrame()
    if (selectOnly != null)
      fd.selectFields(selectOnly)
    else
      fd
  }

  def main(sysArgs: Array[String]): Unit = {
    // Generate a Glue DynamicFrame from s3://<s3-bucket-name-here>/vehicles.csv
    val vehiclesDf = generateFromS3Df("vehicles", Seq("vehicle_syskey", "model_syskey", "model_year", "make_syskey", "make_desc", "model_desc"))
    // Convert the DynamicFrame to a DataFrame with the toDF() method and drop duplicates
    val df = vehiclesDf.toDF().dropDuplicates()
    /* Ingest the vehicle data frame; each DataFrame row is exposed to the Cypher query as `event` */
    df.write
      .mode("Append")
      .format("org.neo4j.spark.DataSource")
      .option("url", "neo4j+s://<insert dbid here>.databases.neo4j.io")
      .option("authentication.basic.username", "<insert username here>")
      .option("authentication.basic.password", "<insert password here>")
      .option("query", "CREATE (vehicle:Vehicle{id: event.vehicle_syskey}) SET vehicle.year = event.model_year, vehicle.make = event.make_desc, vehicle.model = event.model_desc")
      .save()
  }
}
- Copy and paste the above contents into a file and save it as <filename>.scala on your system.
- Change $fileName.csv to match your data file. Here we are uploading `vehicles.csv`, so in this case the file name will be vehicles.csv (see the note below if you plan to re-run the job).
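One note on the write at the end of the script: the `query` option uses CREATE, so re-running the job will create duplicate :Vehicle nodes. If you need the load to be safely re-runnable, a minimal sketch of an alternative (not part of the attached script) is to MERGE on the key property instead:

    /* MERGE variant of the write above: re-runs update existing nodes instead of duplicating them */
    df.write
      .mode("Append")
      .format("org.neo4j.spark.DataSource")
      .option("url", "neo4j+s://<insert dbid here>.databases.neo4j.io")
      .option("authentication.basic.username", "<insert username here>")
      .option("authentication.basic.password", "<insert password here>")
      .option("query", "MERGE (vehicle:Vehicle {id: event.vehicle_syskey}) SET vehicle.year = event.model_year, vehicle.make = event.make_desc, vehicle.model = event.model_desc")
      .save()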
CSV File for Ingestion: Please use the file vehicles.csv attached at the bottom of this article.
- Copy the CSV file to your S3 bucket.
- Download the Neo4j Spark Connector and upload the jar file to your S3 bucket. You can download the connector here: Download Neo4j Spark Connector
Steps for Ingestion using AWS Glue:
- Log in to AWS
- Search for Glue in the search bar and choose AWS Glue:
- Click on Jobs and you will be redirected to the Create job page in AWS Glue Studio.
- Select Spark script editor as the job source, choose Upload and edit an existing script, click on Choose file, and select the Scala script that you saved locally.
- The chosen file will be displayed under the Choose file button. Double-check the filename to make sure you are using the right script for creating the job, then click on Create.
This will create an Untitled job and you will be redirected to the page below:
By clicking on the icon next to Untitled job, you can change your job name.
- Now we need to make a few changes to the job script:
Change file_path to point to your file path on S3:
val file_path = Array(s"s3://<s3-bucket-name-here>/$fileName.csv")
At the bottom of the script, replace the placeholder values shown below with your dbid and database credentials. After making these changes, click on the Save button at the top right.
/*Ingest vehicle data frame*/
df.write
  .mode("Append")
  .format("org.neo4j.spark.DataSource")
  .option("url", "neo4j+s://<insert dbid here>.databases.neo4j.io")
  .option("authentication.basic.username", "<insert username here>")
  .option("authentication.basic.password", "<insert password here>")
  .option("query", "CREATE (vehicle:Vehicle{id: event.vehicle_syskey}) SET vehicle.year = event.model_year, vehicle.make = event.make_desc, vehicle.model = event.model_desc")
  .save()
- Now go to the second tab on the job page, Job Details:
There are several properties you can configure for the job that determine how it runs and what resources it uses. You must provide a name for the job and choose the IAM role that the job will use.
The job name can be any UTF-8 string with a maximum length of 255 characters.
The IAM role must include all the necessary permissions for the job to access data sources and targets, objects in the Data Catalog, Amazon S3 buckets, and any other resources.
The Advanced properties section includes additional job properties, such as script and library locations, security configuration, job metrics, continuous logging, setting job parameters, and adding tags to the job.
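As a side note, the sample script imports GlueArgParser but hardcodes the Aura URL and credentials. If you would rather supply them as job parameters under Advanced properties (job parameter keys must start with --), a minimal sketch inside main() could look like the following; the names NEO4J_URL, NEO4J_USER, and NEO4J_PASSWORD are just examples and not part of the attached script:

    // Resolve job parameters passed to the job as --NEO4J_URL, --NEO4J_USER, and --NEO4J_PASSWORD
    val args = GlueArgParser.getResolvedOptions(sysArgs, Array("JOB_NAME", "NEO4J_URL", "NEO4J_USER", "NEO4J_PASSWORD"))
    df.write
      .mode("Append")
      .format("org.neo4j.spark.DataSource")
      .option("url", args("NEO4J_URL"))
      .option("authentication.basic.username", args("NEO4J_USER"))
      .option("authentication.basic.password", args("NEO4J_PASSWORD"))
      .option("query", "CREATE (vehicle:Vehicle{id: event.vehicle_syskey}) SET vehicle.year = event.model_year, vehicle.make = event.make_desc, vehicle.model = event.model_desc")
      .save()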
For this sample script to run, please make the below changes in the Job details tab:
- Choose a name of your choice. You can skip this step if you have already edited the name as part of step 5.
- For IAM Role, select the role that has permission to access the S3 files. If you are not able to see such a role in the IAM Role dropdown, you will need to create a new service role for your Glue job with sufficient permissions to access the files from S3.
STEPS TO CREATE AN IAM ROLE FOR THE GLUE JOB:
- You can create a new service role for Glue on the IAM -> Roles -> Create role page. Select Glue in the dropdown.
- Click on Next, choose the necessary permissions for S3 access, and choose Next.
- After successfully creating the role, go back to the Job details section in Glue Studio; you will now be able to see that service role in the IAM Role dropdown. Select the role and move to the next step.
- Go to the Libraries section and provide the S3 path to the Neo4j Spark Connector jar file:
- Click on Save. You can edit other options, such as job parameters and the number of retries, as per your requirements.
- Click on Run. You can monitor the job execution using the Run details section of the job.
- After successful completion of the job, you will be able to view the data in your AuraDB instance. Log in to your AuraDB instance to view the data; a quick read-back check from Spark is also sketched below.
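If you would like to verify the load from within Spark rather than the Aura console, a minimal read-back sketch (placed inside main() after the write, reusing the same placeholder connection details as above) could look like this:

    // Read :Vehicle nodes back through the Neo4j Spark Connector to confirm the load
    val spark = SparkSession.builder.getOrCreate()
    val vehiclesCheck = spark.read
      .format("org.neo4j.spark.DataSource")
      .option("url", "neo4j+s://<insert dbid here>.databases.neo4j.io")
      .option("authentication.basic.username", "<insert username here>")
      .option("authentication.basic.password", "<insert password here>")
      .option("labels", "Vehicle")
      .load()
    println(s"Vehicle node count: ${vehiclesCheck.count()}")
    vehiclesCheck.show(5)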
Common Issues:
- Unable to find a role that has S3 access in the IAM Role dropdown under Job details: If you are facing this issue, create a new service role for AWS Glue with S3 access permissions; after creation, it will show up in the dropdown as well.
- Error Downloading from S3 Bucket:
This can happen when the role assigned to your job has insufficient permissions to access the S3 files. Double-check the role permissions under the IAM -> Roles section and add permissions if required.
- Copied the wrong path for the jar file, or unable to access the Neo4j Spark Connector jar file:
This error can be seen when you have either typed the path to the Spark Connector jar incorrectly or used the wrong format, such as `.zip` instead of `.jar`.