Asked by J Z (Belgium)

Question on possibilities with Airflow running as an ECS Service

I have a question on Apache Airflow. We currently have an Airflow cluster running in AWS with one core (master) node and 5 worker nodes, all running in Docker containers set up and managed by an ECS service and an EC2 auto-scaling group. Right now all the workers are m5.xlarge instances.

According to the developers, the reason they all have to be m5.xlarge is that one job has a dataset that would otherwise not fit in the memory of a single instance. But the majority of the jobs are small and don't need many resources, so the 5 instances are basically idle most of the time.

I know little or nothing about Apache Airflow. My questions specifically about this setup (Airflow in docker on ECS) are:

1. Does Airflow by default support a "fleet" of different instance sizes, and can it then, based on the job type (or some other identifier), send specific jobs to a certain type of worker?

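For context on question 1: from what I can tell from the Airflow docs, this kind of routing is normally done with named queues rather than instance types directly. Each task can be pinned to a queue, and only workers listening on that queue will pull it. A minimal sketch, assuming Airflow 2.x with the CeleryExecutor; the DAG id, task ids and the `high_memory` queue name are all made up:

```python
# Sketch of per-task queue routing in Airflow 2.x (CeleryExecutor assumed).
# Ids and the "high_memory" queue name are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="etl_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Small jobs stay on the default queue, served by the small workers.
    small = BashOperator(task_id="small_job", bash_command="echo small")

    # The memory-hungry job is pinned to a dedicated queue; only workers
    # started listening on that queue will ever pick it up.
    big = BashOperator(
        task_id="big_job",
        bash_command="echo big",
        queue="high_memory",
    )
```

The one big worker would then be started with `airflow celery worker --queues high_memory`, so the small workers never receive the large job.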
2. Can the worker nodes in a default Airflow setup be spot instances? In other words, when a worker dies, would Airflow pick up the job and re-run it, or does it have no notion of a job's state?

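On question 2, my understanding so far is that Airflow keeps task state in its metadata database, and the scheduler eventually flags tasks on a dead worker as zombies; whether they get re-run then depends on each task's `retries` setting. A sketch, assuming Airflow 2.x (the values are illustrative, not recommendations):

```python
# Sketch: task-level retries so that work lost with an interrupted spot
# worker gets re-queued. Ids and values are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="interruptible_job",
        bash_command="echo work",
        retries=3,                         # re-attempt up to 3 times
        retry_delay=timedelta(minutes=5),  # wait between attempts
    )
```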
3. Is Airflow aware of how many workers there are and which jobs are running where? Is there any way from within Airflow to see which jobs are running and how many resources they are using?

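On question 3, these are the commands I would expect to use for that kind of visibility (hedged; Airflow 2.x CLI, run inside a container that has the Airflow package and metadata DB configured, placeholders in angle brackets):

```shell
# Sketch of Airflow 2.x CLI commands for inspecting cluster state.
airflow dags list                                   # DAGs known to the scheduler
airflow tasks states-for-dag-run <dag_id> <run_id>  # per-task states for one run
airflow jobs check --job-type SchedulerJob          # is the scheduler heartbeating?
airflow celery flower                               # Flower UI: workers and queues
```

As far as I can tell, Airflow itself does not report per-task CPU or memory usage; on ECS that would have to come from CloudWatch / Container Insights instead.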
4. I see a lot of Airflow core nodes with a very flat CPU utilisation line, which seems strange to me and possibly indicates a process stuck in some kind of loop. What is the best way to see from a core node what it is doing exactly?

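For question 4, this is roughly how I would inspect a core node from its EC2 host (the container name is a placeholder):

```shell
# Diagnostic sketch for a flat-CPU core node, run on the EC2 host.
docker ps                                   # find the Airflow container
docker stats --no-stream                    # CPU/memory per container
docker top <container>                      # processes inside the container
docker exec -it <container> ps aux --sort=-%cpu | head -20
docker logs --tail 100 <container>          # recent scheduler log output
```

Though from what I've read, a flat low CPU line on a scheduler node may simply be the scheduler's normal polling loop rather than a stuck process.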
Thanks