Taming APIs and Data Workflows with Airflow: A Developer’s Tale
Transforming a Bash Script into an Airflow DAG: Automating AEMET Data Downloads
When I first set out to download weather data from the AEMET OpenData API, I thought, “Why not use a simple Bash script?” After all, Bash scripts are great for quick automation tasks, and I had a clear plan: loop through years and months, fetch data, and save it to a CSV. But what I hadn’t accounted for were the many challenges lurking in the world of API calls and large datasets. Let me take you on this journey of learning, improvement, and automation with Apache Airflow.
The Bash Script: A Good Start with Hidden Challenges
The script worked… sort of. It iterated through dates, fetched data, and processed it into a neat CSV file. But the problems became evident very quickly:
- API Rate Limits: My eager loops bombarded the API with calls, leading to temporary bans.
- Error Handling: What happens if the API returns malformed data? Or worse, no data at all?
- Performance: The script was single-threaded and slow. With years of data to fetch, the process became painfully tedious.
- Maintainability: Scaling the script for new requirements, like retries or better scheduling, quickly turned into a nightmare.
Here’s a snippet of what the Bash script looked like:
# Fetch data for a specific month and year
fetch_data() {
    local year=$1
    local month=$2
    echo "Fetching data for $year-$month..."
    # ...the curl call to the AEMET OpenData API and the CSV processing went here...
}
It worked, but it wasn’t future-proof. Clearly, I needed something more robust. Enter Apache Airflow.
Why Airflow?
Apache Airflow is a workflow orchestrator. It lets you define workflows as Directed Acyclic Graphs (DAGs), with each step being an independent, reusable task. Here’s why it was a perfect fit for my project:
- Retry Mechanisms: Automatic retries for failed tasks. No more manual re-runs!
- Task Parallelism: Download multiple months of data simultaneously, significantly speeding up the process.
- Visibility: A web interface to monitor tasks, logs, and execution history.
- Extensibility: Adding new features or modifying workflows is easy.
Setting Up Airflow with Docker Compose
To get started with Airflow, I used Docker Compose for an easy and portable setup. Here’s how you can do the same:
- Download the docker-compose.yaml File:
  curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.3/docker-compose.yaml'
- Create Necessary Directories:
  Airflow requires several directories for its operation:
  - ./dags: Place your DAG files here.
  - ./logs: Contains logs from task execution and the scheduler.
  - ./config: Add custom log parsers or airflow_local_settings.py to configure cluster policies.
  - ./plugins: Add your custom plugins here.
  - ./output: Save the flow’s file outputs here.
  Create these directories and set the correct user permissions:
  mkdir -p ./dags ./logs ./plugins ./config ./output
  echo -e "AIRFLOW_UID=$(id -u)" > .env
Adding Output Mapping in Airflow’s Docker Compose Configuration
To ensure your Airflow setup properly maps output files to a directory on your host machine, you’ll need to manually add the following line to your docker-compose.yaml file, under the volumes section of each Airflow-related service:
- ${AIRFLOW_PROJ_DIR:-.}/output:/opt/airflow/output
Why Add This?
By default, the provided docker-compose.yaml configuration doesn’t include a mapping for an output directory. This mapping allows Airflow to save files generated during workflows (like your processed data or intermediate results) directly to a folder on your host machine for easy access.
Steps to Add the Mapping
- Open the docker-compose.yaml file you downloaded or created.
- Locate the volumes section. In the default file it is defined once under the x-airflow-common anchor that all Airflow services (airflow-webserver, airflow-scheduler, airflow-worker, etc.) inherit, so adding the line there covers every service; you can also override volumes per service, as in the example below.
- Add the following line to the list of volumes:
  ${AIRFLOW_PROJ_DIR:-.}/output:/opt/airflow/output
- Ensure the output directory exists on your host machine. You can create it with:
  mkdir -p ./output
Example of Modified Volumes Section
Here’s an example for the airflow-webserver service:
airflow-webserver:
  <<: *airflow-common
  command: webserver
  ports:
    - "8080:8080"
  healthcheck:
    test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 30s
  restart: always
  depends_on:
    <<: *airflow-common-depends-on
    airflow-init:
      condition: service_completed_successfully
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
    - ${AIRFLOW_PROJ_DIR:-.}/output:/opt/airflow/output
Verify the Mapping
After starting Airflow with docker compose up, verify that files saved in /opt/airflow/output inside the container are accessible in the output directory on your host machine.
This step ensures your workflow outputs are preserved and easily accessible outside the container, making debugging and data management much simpler.
- Initialize the Database:
  Airflow requires database migrations and a first user account. Run the following command to initialize:
  docker compose up airflow-init
- Run Airflow:
  Start all services using Docker Compose:
  docker compose up
Once everything is running, you can access the Airflow UI at http://localhost:8080/.
- Username: airflow
- Password: airflow
This is the default administrator account created during the setup process. You can use it to log in and start exploring the UI. For detailed setup instructions, refer to the official Airflow documentation.
From Bash to DAG: The Transformation
With Airflow set up, I transformed my Bash script into a DAG. Each step of the original script became an independent task in the DAG:
- Fetch the API URL: Use Airflow’s HttpHook to call the API and extract the datos URL.
- Sensor for Data Availability: Wait until the data is available for download using a PythonSensor.
- Process and Save Data: Fetch the JSON data, process it, and append it to a CSV.
Here’s a simplified version of the DAG:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor
from datetime import datetime

# Placeholder callables; the real implementations are described in the sections below.
def fetch_data_function(year, month, **_):
    print(f"Fetching data for {year}-{month:02d}...")

def check_data_availability_function(**_):
    return True

def process_data_function(**_):
    print("Processing data...")

with DAG(
    dag_id="fetch_aemet_data",
    schedule_interval=None,
    start_date=datetime(2024, 12, 1),
    catchup=False,
) as dag:

    fetch_task = PythonOperator(
        task_id="fetch_data",
        python_callable=fetch_data_function,
        op_kwargs={"year": 2024, "month": 12},
    )

    wait_task = PythonSensor(
        task_id="wait_for_data",
        python_callable=check_data_availability_function,
    )

    process_task = PythonOperator(
        task_id="process_data",
        python_callable=process_data_function,
    )

    fetch_task >> wait_task >> process_task
The Advantages of Airflow
After the transition, everything changed for the better:
- Scalability: Airflow handles task scheduling and parallelism. I could now download data for multiple months simultaneously.
- Reliability: Built-in retries and error handling saved me from constant babysitting.
- Reusability: Tasks like fetch_data and process_data are modular and easy to extend.
- Observability: The Airflow UI provided a clear view of what was happening at every step.
Explaining the Code: Why I Chose These Airflow Components
This DAG (fetch_aemet_data_with_sensor) automates the process of fetching, verifying, and processing weather data from the AEMET API. Here’s a breakdown of the key components and why I used them:
1. Setting Up the DAG
The DAG is defined using the with statement, ensuring proper scoping of the tasks. The default_args dictionary specifies:
- Retries: If a task fails, it retries once after a 5-minute delay.
- Start Date: Tasks can only run after this date.
- Catchup: Disabled to prevent backfilling tasks for past dates when the DAG wasn’t active.
This structure ensures the DAG runs efficiently and handles intermittent errors gracefully.
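As a minimal sketch, the defaults described above translate into something like this (the start date is illustrative):

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 1,                         # retry a failed task once...
    "retry_delay": timedelta(minutes=5),  # ...after a 5-minute delay
}

with DAG(
    dag_id="fetch_aemet_data_with_sensor",
    default_args=default_args,
    schedule_interval=None,            # run only when triggered
    start_date=datetime(2024, 12, 1),  # illustrative start date
    catchup=False,                     # no backfilling for past dates
) as dag:
    ...  # tasks go here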
2. Fetching the API Data
The fetch_data Task
This task retrieves the datos URL for the requested date range using Airflow’s HttpHook.
Why use HttpHook?
- It simplifies making HTTP requests and integrates seamlessly with Airflow’s connection management.
- The http_conn_id allows secure storage of API credentials in Airflow’s connections interface.
Key highlights of the function:
- Validates the API response and raises errors for missing or invalid datos URLs.
- Logs relevant information for debugging.
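Here is a rough sketch of what such a function could look like. The connection id (aemet_api) and the endpoint path are illustrative assumptions, not the original code:

import logging

from airflow.providers.http.hooks.http import HttpHook

def fetch_data_function(year: int, month: int, **context):
    """Call the AEMET API and return the 'datos' URL for the given period."""
    # "aemet_api" is a hypothetical Airflow connection storing the base URL and API key
    hook = HttpHook(method="GET", http_conn_id="aemet_api")
    endpoint = f"/api/some-aemet-endpoint/{year}/{month:02d}"  # placeholder path
    response = hook.run(endpoint)
    payload = response.json()
    datos_url = payload.get("datos")
    if not datos_url:
        raise ValueError(f"API response is missing the 'datos' URL: {payload}")
    logging.info("Got 'datos' URL for %s-%02d: %s", year, month, datos_url)
    return datos_url  # pushed to XCom so downstream tasks can use it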
3. The Sensor: Waiting for Data Availability
The check_data_availability Task
This task uses a PythonSensor to check if the datos URL is accessible. Sensors are ideal for waiting on external conditions, like data availability.
Why use a sensor?
- The sensor continuously polls the API until the data is available or a timeout is reached.
- This ensures the workflow doesn’t proceed until the required data is ready, avoiding potential errors in downstream tasks.
Configuration:
- Timeout: Stops polling after 60 seconds.
- Poke Interval: Checks the URL every 5 seconds.
- Mode: Uses the default “poke” mode for simplicity.
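A sketch of the sensor callable and its configuration, assuming the fetch task pushes the datos URL to XCom under the task id fetch_data:

import requests
from airflow.sensors.python import PythonSensor

def check_data_availability_function(**context):
    """Return True once the 'datos' URL responds successfully."""
    datos_url = context["ti"].xcom_pull(task_ids="fetch_data")  # task id is an assumption
    try:
        return requests.head(datos_url, timeout=10).status_code == 200
    except requests.RequestException:
        return False  # keep poking until the timeout is reached

# Inside the `with DAG(...)` block:
sensor_task = PythonSensor(
    task_id="check_data_availability",
    python_callable=check_data_availability_function,
    timeout=60,       # stop polling after 60 seconds
    poke_interval=5,  # check every 5 seconds
    mode="poke",      # default mode, kept explicit for clarity
)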
4. Processing the Data
The process_data Task
This task fetches the actual data from the datos URL, processes it, and appends it to a CSV file.
Key Features:
- Ensures consistent headers in the CSV by checking the schema of the first record.
- Validates the structure of the data, logging warnings for any inconsistencies.
- Handles edge cases, like empty records or malformed JSON.
Why use CSV output?
- CSVs are lightweight and widely supported, making them an ideal format for storing structured data locally.
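A simplified sketch of the processing step; the output path, file name, and XCom task id are illustrative assumptions:

import csv
import logging
import os

import requests

def process_data_function(output_dir: str = "/opt/airflow/output", **context):
    """Download the JSON at the 'datos' URL and append its records to a CSV file."""
    datos_url = context["ti"].xcom_pull(task_ids="fetch_data")  # task id is an assumption
    records = requests.get(datos_url, timeout=30).json()
    if not records:
        logging.warning("No records returned; nothing to write.")
        return
    csv_path = os.path.join(output_dir, "aemet_data.csv")  # illustrative file name
    fieldnames = list(records[0].keys())  # take the header from the first record
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        for record in records:
            if set(record) != set(fieldnames):
                logging.warning("Record fields differ from the header: %s", sorted(record))
            writer.writerow(record)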
5. Task Dependencies
The workflow follows this sequence:
- Fetch the URL: Ensures the datos URL is retrieved successfully.
- Wait for Data: Ensures the data at the datos URL is available before proceeding.
- Process the Data: Fetches and processes the data.
How dependencies are defined: the line fetch_task >> sensor_task >> process_task ensures the tasks execute in the correct order, maintaining the logical flow.
6. Why Use Variables?
The Variable.get method retrieves configuration values like the output directory. Using Airflow variables allows:
- Centralized management of settings.
- Flexibility to update configurations without modifying the code.
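For example, the output directory could be read like this (the variable name is hypothetical; set it under Admin → Variables in the UI):

from airflow.models import Variable

# "aemet_output_dir" is a hypothetical variable name with a sensible local fallback
output_dir = Variable.get("aemet_output_dir", default_var="/opt/airflow/output")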
7. Hardcoded Years and Months
For simplicity, I hardcoded the years and months. In a production setup, this could be dynamic, allowing users to specify the date range as parameters when triggering the DAG.
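One way to make the date range dynamic is to expose it as DAG params that can be overridden when triggering a run; here is a sketch under that assumption (the DAG id and param names are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_with_params(**context):
    # Read the requested period from the trigger-time configuration
    year = context["params"]["year"]
    month = context["params"]["month"]
    print(f"Fetching data for {year}-{month:02d}...")

with DAG(
    dag_id="fetch_aemet_data_parameterized",
    schedule_interval=None,
    start_date=datetime(2024, 12, 1),
    catchup=False,
    params={"year": 2024, "month": 12},  # defaults, overridable via "Trigger DAG w/ config"
) as dag:
    PythonOperator(task_id="fetch_data", python_callable=fetch_with_params)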
Advantages of This Approach
- Resilience: Sensors and retries ensure the workflow can recover from temporary API issues.
- Scalability: Tasks for each month and year are defined dynamically, making the DAG adaptable to varying data requirements.
- Modularity: Each function handles a specific responsibility, making the code easier to maintain and extend.
- Traceability: Logging at each step provides visibility into the workflow, simplifying debugging.
Full Code Example
For those interested in the complete implementation, here’s the full code for the Airflow DAG:
This code is ready to be used in your Airflow setup. Feel free to adapt it to your specific needs or let me know if you encounter any challenges!
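Putting the pieces from the walkthrough together, a condensed reconstruction of the DAG could look roughly like this. The connection id, endpoint path, variable name, output file name, and year/month ranges are assumptions for illustration rather than the original listing:

import csv
import logging
import os
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.providers.http.hooks.http import HttpHook
from airflow.sensors.python import PythonSensor

OUTPUT_DIR = Variable.get("aemet_output_dir", default_var="/opt/airflow/output")

def fetch_data(year, month, **context):
    """Return the 'datos' URL for the given period."""
    hook = HttpHook(method="GET", http_conn_id="aemet_api")  # hypothetical connection
    response = hook.run(f"/api/some-aemet-endpoint/{year}/{month:02d}")  # placeholder path
    datos_url = response.json().get("datos")
    if not datos_url:
        raise ValueError("API response is missing the 'datos' URL")
    return datos_url

def check_data_availability(fetch_task_id, **context):
    """Return True once the 'datos' URL responds successfully."""
    datos_url = context["ti"].xcom_pull(task_ids=fetch_task_id)
    try:
        return requests.head(datos_url, timeout=10).status_code == 200
    except requests.RequestException:
        return False

def process_data(fetch_task_id, **context):
    """Append the downloaded records to a single CSV file."""
    datos_url = context["ti"].xcom_pull(task_ids=fetch_task_id)
    records = requests.get(datos_url, timeout=30).json()
    if not records:
        logging.warning("No records to write for %s", fetch_task_id)
        return
    csv_path = os.path.join(OUTPUT_DIR, "aemet_data.csv")
    fieldnames = list(records[0].keys())
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        writer.writerows(records)

with DAG(
    dag_id="fetch_aemet_data_with_sensor",
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    schedule_interval=None,
    start_date=datetime(2024, 12, 1),
    catchup=False,
) as dag:
    # Hardcoded years and months, as discussed above
    for year in [2023, 2024]:
        for month in range(1, 13):
            suffix = f"{year}_{month:02d}"
            fetch = PythonOperator(
                task_id=f"fetch_data_{suffix}",
                python_callable=fetch_data,
                op_kwargs={"year": year, "month": month},
            )
            wait = PythonSensor(
                task_id=f"check_data_availability_{suffix}",
                python_callable=check_data_availability,
                op_kwargs={"fetch_task_id": fetch.task_id},
                timeout=60,
                poke_interval=5,
            )
            process = PythonOperator(
                task_id=f"process_data_{suffix}",
                python_callable=process_data,
                op_kwargs={"fetch_task_id": fetch.task_id},
            )
            fetch >> wait >> process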
Final Thoughts on the DAG
This DAG demonstrates how to build a robust, maintainable workflow using Airflow. By combining sensors, dynamic task generation, and modular functions, it handles real-world challenges like data availability and API errors gracefully. Let me know if you have any questions or suggestions for improvement! 🚀
Lessons Learned
- Start Simple: The Bash script was a good starting point. It helped me understand the problem and identify pain points.
- Invest in Better Tools: Choosing the right tool for the job (Airflow) saved me countless hours.
- Iterative Improvements: The migration to Airflow was gradual. I didn’t throw away the Bash script overnight; I built upon it.
Final Thoughts
The journey from Bash to Airflow wasn’t just about improving a script; it was about rethinking the process entirely. Airflow gave me the power to scale, monitor, and optimize my workflow in ways I hadn’t imagined. If you’re managing complex workflows, consider giving Airflow a try. And remember, every automation journey begins with a simple script.
Got questions or a similar story to share? Drop them in the comments—I’d love to hear from you! 🚀