- The use case and the goal
- Write a Glue Job in Python
- Prepare 3rd party libraries to be used in Glue
- Prepare custom libraries to be used in Glue
- Configure and run the Glue job
Lets get started ……
The use case and the goal:
- A JSON file uploaded in AWS S3 contains details of employees.
- Read the data in the JSON file in S3 and populate the data in to a PostgreSQL database in RDS using an AWS Glue Job.
- Additionally create a custom python library for logging and use it in the Glue job.
- Employee details JSON format is as below.
- The database table “employee_info” has the following schema and is created using the following SQL command.
- Create a file and include the above mentioned employee details as a JSON and upload the file to a S3 bucket.
- Create a RDS PostgreSQL database and create the employee_info table using the SQL command mentioned in the use case section.
Write a Glue Job in Python
- Prepare the data_processor.py python file as below and upload the file to a S3 bucket.
- Important things to note:
- getResolvedOptions is a function that is provided by awsglue.utils which is a utility function that gives access to the arguments that are passed to the script when the job is executed.
Refer Accessing Parameters in AWS Glue Job for more information.
- PG8000 is the library used to connect to the postgreSQL database.
- Logger is a custom library we will be creating in the one of the next sections.
Prepare custom libraries to be used in Glue
- As we discussed in the use case section let’s build a custom logging utility.
Create a directory in your project as “modules”.
Inside the modules directory create a file called “logger_util.py” and place the below code in it.
- If you need to use your custom logging utility in your glue job, you need to generate a wheel file from your code.
- To prepare a wheel file first you need to create a setup file as below. Name this file as “setup.py”.
- To generate a wheel file run the below command and you will be able to see a directory called ‘dist’ is being created with the wheel file inside it.
python setup.py bdist_wheel
- Upload the wheel file created for the logging utility to AWS S3.
Configure and run the Glue Job
- Go to AWS Glue, click on the “Jobs” section and click on “Add Job”
- Provide the job name, IAM role and select the type as “Python Shell” and Python version as “Python 3”.
In the “This job runs section” select “An existing script that you provide” option.
Now we need to provide the script location for this Glue job.
Go to the S3 bucket location and copy the S3 URI of the data_processor.py file we created for the data processing at first.
Insert the copied S3 URI as “S3 path where the script is stored”
- Next step is to provide the Python library path, which is the place where we need to mention the additional libraries we are going to use in our Glue Job.
- Before providing the python library path go to AWS S3 and copy the S3 URI of the pg8000 library and the logging utility library.
- To provide the Python library path, first go to “Security configuration” section and make sure to add the S3 URI of the pg8000 library and the logging utility library as a comma separated string.
- Provide the DB host, name, user, password, S3 bucket name and the data file name in the Job parameters section and click on the Next button.
- Click on “Save job and edit script” and you will be redirected to the edit script page.
- If you need to make additional changes to the the code you can edit here and save.
- Click on “Run Job” to start running the glue job. The logs will be visible in the Logs tab below for the current execution.
- To view all the logs and error logs for all executions navigate back to the Jobs section in AWS Glue page and select the job. The History tab will show all the executions with the respective logs and error logs.
- After a successful job execution, connect to the PostgreSQL database and you will be able to see the data is being loaded using our employee details JSON.