MWAA Questions



1. What is Apache Airflow?

Apache Airflow is an open-source data orchestration tool that lets you define data pipelines as code using Python.

2. What is a DAG?

A DAG (Directed Acyclic Graph) is a collection of tasks and the relationships between them. In Airflow, a DAG has a defined start and end and is never executed in a loop/cycle.

3. What are the parameters needed to define a DAG?

dag_id, start_date, and schedule (called schedule_interval in older Airflow versions) must be provided.

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="sample_dag",
    start_date=datetime(year=2024, month=1, day=1, hour=9, minute=0),
    schedule="@daily",
) as dag:
    ...  # tasks are defined here

The dag_id must be unique within the environment.

4. What is an Airflow Task?

A Task is a single unit of work in a DAG. Tasks are the building blocks of DAGs and have relationships (dependencies) between them.

Tasks are typically created from Airflow operators built for the job, such as PythonOperator, DatabricksRunNowOperator, etc.
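For example, here is a minimal sketch of two tasks and the relationship between them, assumed to sit inside the with DAG(...) block from question 3 (the function names are illustrative):

from airflow.operators.python import PythonOperator

def extract_data():
    # stand-in for the real extract logic
    print("extracting data")

def load_data():
    # stand-in for the real load logic
    print("loading data")

extract = PythonOperator(task_id="extract", python_callable=extract_data)
load = PythonOperator(task_id="load", python_callable=load_data)

extract >> load  # load runs only after extract succeeds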


5. What are Operators in Airflow?

Operators are the fundamental building blocks used in DAG tasks; each operator defines what its task does.

There are three types of operators, with examples:

1. Action Operators:

    PythonOperator: Executes Python functions.

    BashOperator: Executes bash commands.

    EmailOperator: Sends emails.

    DummyOperator / EmptyOperator: Placeholder task for structuring DAGs.

2. Transfer Operators:

    S3ToRedshiftOperator: Transfers data from Amazon S3 to Amazon Redshift.

    S3ToSnowflakeOperator: Transfers data from Amazon S3 to Snowflake.

    GoogleCloudStorageToS3Operator: Transfers data from Google Cloud Storage to Amazon S3.

3. Sensor Operators:

    S3KeySensor: Waits for a file (key) to land in an S3 bucket (see the sketch after this list).

    ExternalTaskSensor: Waits for a task in another DAG.

    TimeDeltaSensor: Waits for a time period.
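As a sketch, here is how the S3KeySensor might be used inside a DAG (the bucket name and key are placeholders):

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_name="my-data-bucket",    # placeholder bucket
    bucket_key="incoming/data.csv",  # placeholder key
    poke_interval=60,                # re-check every 60 seconds
    timeout=60 * 60,                 # fail the task after one hour of waiting
)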

Apart from these, you can create custom operators or use operators from provider packages.

Provider packages can be installed via requirements.txt, and custom plugins and Python packages can be added using plugins.zip.
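For example, to install the Amazon and Snowflake provider packages, your requirements.txt might contain lines like these (the version pins are illustrative; match them to the constraints file for your environment's Airflow version):

apache-airflow-providers-amazon==8.16.0
apache-airflow-providers-snowflake==5.2.1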


6. MWAA Architecture?

Amazon MWAA runs the Apache Airflow components (scheduler, workers, and web server) as AWS Fargate containers in a service-managed VPC attached to your VPC. DAGs, plugins.zip, and requirements.txt are stored in an Amazon S3 bucket, Airflow metadata is kept in an AWS-managed Aurora PostgreSQL database, logs and metrics go to Amazon CloudWatch, and access is controlled through IAM.

7. How does MWAA differ from a self-managed Airflow environment?

MWAA runs the same open-source Apache Airflow and is compatible with it, but setup, patching, and upgrades are managed for you in MWAA. MWAA also offers high availability because workers scale with demand. Security is another major difference: with self-managed Apache Airflow you must maintain SSH keys and authentication for the EC2 instances yourself, whereas MWAA is integrated with IAM, including role-based authentication and authorization.

8. Describe a high-performance use case in Apache Airflow and Amazon MWAA.

The following section describes the type of configurations you can use to enable high performance and parallelism on an environment.

On-premises Apache Airflow

Typically, in an on-premises Apache Airflow platform, you would configure task parallelism, auto scaling, and concurrency settings in your airflow.cfg file:

  • core.parallelism – The maximum number of task instances that can run simultaneously per scheduler.

  • core.dag_concurrency – The maximum concurrency for DAGs (not workers).

  • celery.worker_autoscale – The maximum and minimum number of tasks that can run concurrently on any worker.

For example, if core.parallelism was set to 100 and core.dag_concurrency was set to 7, and you had 2 DAGs, you would still only be able to run a total of 14 tasks concurrently, since each DAG is limited to seven concurrent tasks (core.dag_concurrency) even though overall parallelism is set to 100 (core.parallelism).
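As a sketch, the example above corresponds to airflow.cfg entries like these (the worker_autoscale values are illustrative; the format is max,min tasks per worker):

[core]
parallelism = 100
dag_concurrency = 7

[celery]
worker_autoscale = 16,12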

On an Amazon MWAA environment

On an Amazon MWAA environment, you can configure these settings directly on the Amazon MWAA console using Apache Airflow configuration options, the environment class setting, and the Maximum worker count auto scaling mechanism. While core.dag_concurrency is not available in the drop-down list of Apache Airflow configuration options on the Amazon MWAA console, you can add it as a custom Apache Airflow configuration option.

Let's say, when you created your environment, you chose the following settings:

  1. The mw1.small environment class, which controls the maximum number of concurrent tasks each worker can run by default and the vCPUs of its containers.

  2. The default setting of 10 Workers in Maximum worker count.

  3. An Apache Airflow configuration option setting celery.worker_autoscale to 5,5 tasks per worker.

This means you can run 50 concurrent tasks in your environment (10 workers × 5 tasks per worker). Any tasks beyond 50 are queued and wait for the running tasks to complete.

Run more concurrent tasks. You can modify your environment to run more tasks concurrently using the following configurations (a CLI sketch follows the list):

  1. Increase the maximum number of concurrent tasks each worker can run by default and the vCPU of containers by choosing the mw1.medium (10 concurrent tasks by default) environment class.

  2. Add celery.worker_autoscale as an Apache Airflow configuration option.

  3. Increase the Maximum worker count. In this example, increasing maximum workers from 10 to 20 would double the number of concurrent tasks the environment can run.
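As a sketch, the same changes could be applied with the AWS CLI (the environment name and the autoscale values are placeholders):

aws mwaa update-environment \
    --name my-mwaa-environment \
    --environment-class mw1.medium \
    --max-workers 20 \
    --airflow-configuration-options '{"celery.worker_autoscale": "10,10"}'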







