dotnetpro asked:

AWS Glue PySpark

I am writing ETL scripts using PySpark in AWS Glue and have a few issues I am trying to tackle. My source and target databases are Oracle 12c Standard.

1. How do I capture incremental updates in a PySpark DataFrame?
2. How do I update existing records or insert new records in the database incrementally?
3. Is it possible to perform the above tasks using Python alone instead of PySpark?
Sharath S replied:

1. How do I capture incremental updates in a PySpark DataFrame?
[Sharath] Do you have a date field in your source table that you can use when pulling source data? With one, you can pull only the rows changed since the last load and place them in a DataFrame.
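A minimal sketch of that approach, assuming a hypothetical `ORDERS` table with a `LAST_UPDATED` timestamp column (table, column, and connection details are placeholders, not from the original thread). The idea is to push a watermark filter down to Oracle via the JDBC `dbtable` option, so only new or changed rows ever reach Spark:

```python
def incremental_query(table, ts_col, last_watermark):
    """Build a pushdown subquery that selects only rows changed since the
    last successful load (the 'watermark'). Oracle evaluates the filter,
    so Spark only receives the incremental rows."""
    return (f"(SELECT * FROM {table} "
            f"WHERE {ts_col} > TO_TIMESTAMP('{last_watermark}', "
            f"'YYYY-MM-DD HH24:MI:SS')) src")

# Hypothetical job parameters -- adjust to your own schema.
query = incremental_query("ORDERS", "LAST_UPDATED", "2024-01-01 00:00:00")

# In the Glue job, pass the subquery as the JDBC "dbtable" option:
#
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:oracle:thin:@//host:1521/service")  # placeholder URL
#       .option("dbtable", query)
#       .option("user", user)
#       .option("password", password)
#       .load())
```

After each successful run, persist the maximum `LAST_UPDATED` value you loaded (e.g. in a control table or Glue job bookmark) and use it as the watermark for the next run.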
2. How do I update existing records or insert new records in the database incrementally?
[Sharath] What is your data volume? If you perform all ETL operations and write the result to a staging table, you can then merge the data from the staging table into your target table.
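One way to sketch that stage-to-target merge is to generate an Oracle `MERGE` statement and run it after the stage load (table and column names below are illustrative placeholders):

```python
def build_merge_sql(target, stage, key_cols, update_cols):
    """Build an Oracle MERGE that upserts stage rows into the target:
    rows matching on the key columns are updated, the rest inserted."""
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    all_cols = key_cols + update_cols
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"s.{c}" for c in all_cols)
    return (f"MERGE INTO {target} t USING {stage} s ON ({on}) "
            f"WHEN MATCHED THEN UPDATE SET {set_clause} "
            f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) "
            f"VALUES ({insert_vals})")

# Hypothetical target/stage tables keyed on ORDER_ID.
sql = build_merge_sql("ORDERS", "ORDERS_STAGE",
                      key_cols=["ORDER_ID"],
                      update_cols=["STATUS", "LAST_UPDATED"])
```

The generated statement could then be executed against Oracle with a driver such as cx_Oracle/python-oracledb, or via a Glue job post-action, depending on your setup.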
3. Is it possible to perform the above tasks using Python alone instead of PySpark?
[Sharath] Yes, depending on the ETL operations you are performing and the volume of the data. If the data volume is very large, go with Spark; otherwise plain Python with pandas and similar libraries can do the same work.
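For modest volumes, the upsert itself is simple enough to express in plain Python. A minimal sketch, keyed on a hypothetical `order_id` primary key (names are placeholders; with pandas you would do the equivalent with a keyed merge):

```python
def upsert(target_rows, incoming_rows, key):
    """Merge incoming rows into target rows by primary key:
    rows with matching keys are replaced, new keys are inserted."""
    merged = {row[key]: row for row in target_rows}
    for row in incoming_rows:
        merged[row[key]] = row
    return list(merged.values())

target = [{"order_id": 1, "status": "NEW"},
          {"order_id": 2, "status": "NEW"}]
incoming = [{"order_id": 2, "status": "SHIPPED"},  # updates existing row 2
            {"order_id": 3, "status": "NEW"}]      # inserts new row 3
rows = upsert(target, incoming, key="order_id")
```

The same pattern scales down poorly once the target no longer fits in memory, which is the point at which staging plus a database-side MERGE (or Spark) becomes the better choice.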

Did you already start writing something in Spark that an expert can help you with?