Evaluating LLM-Generated Transformations for Data Engineering

Introduction

Data transformation is central to Nexla’s value proposition. Whether users are flattening nested records, normalizing timestamps, or enriching datasets using large language models (LLMs) and embedding APIs, the quality of generated transform code directly affects correctness, reliability, and user trust.

Over time, we observed that transform code produced by the NOVA Agent often failed in subtle but impactful ways. To identify concrete fixes, we built a structured evaluation framework to systematically understand why models fail and which models perform best on the real data transformation tasks that Nexla users care about.

This blog presents our end-to-end evaluation framework for transform code generation, including:

  • Dataset curation for ETL and LLM/API tasks
  • A sandboxed execution and LLM-based judging pipeline
  • Comparative results across mini and SOTA models
  • Common failure patterns and actionable takeaways

Every evaluation framework needs data, instructions, and containers to run in. In our setup, we split these concerns into two independent but composable layers:

  1. Dataset Generation – deterministic, reproducible task definitions with ground-truth reference code implementations
  2. Evaluation and Execution – prompt design, automated code generation, execution, and LLM-based judging

This separation allows datasets to evolve independently of evaluation logic, while preserving consistent benchmarks over time.

1. Datasets

Most existing coding datasets, such as SWE-Bench (Verified), do not specifically cover data transformation tasks. ELT-Bench comes closest, but its transformation tasks evaluate SQL generation rather than Python. We therefore set out to build synthetic datasets that test Python transform code. All our datasets are defined declaratively using Python scripts that emit structured JSON test cases, and each test case follows a uniform schema consisting of:

  • A natural-language task description
  • A representative input payload
  • An expected output structure
  • A reference (ground-truth) implementation

Canonical Test Case Schema:

{
    "task": "...",
    "input": {...},
    "output": {...},
    "ground_truth_code": "..."
}
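
To make this concrete, a minimal dataset generator might look like the sketch below. The file name and helper are hypothetical and shown for illustration only; the real generator scripts cover many more task families and cases:

import json

# Hypothetical sketch of a dataset generator script; the file name and helper
# are illustrative, not the actual Nexla generators.
def make_flatten_case():
    """Emit one test case following the canonical schema above."""
    return {
        "task": "Flatten all nested objects consistently into top-level columns.",
        "input": {"id": "INC001", "assigned_to": {"id": "u1", "name": "Alice"}},
        "output": {"id": "INC001", "assigned_to_id": "u1", "assigned_to_name": "Alice"},
        "ground_truth_code": (
            "def transform(data):\n"
            "    result = {}\n"
            "    def flatten(obj, prefix=''):\n"
            "        for key, value in obj.items():\n"
            "            new_key = f\"{prefix}_{key}\" if prefix else key\n"
            "            if isinstance(value, dict):\n"
            "                flatten(value, new_key)\n"
            "            else:\n"
            "                result[new_key] = value\n"
            "    flatten(data)\n"
            "    return result"
        ),
    }

if __name__ == "__main__":
    # Write the declarative test cases to structured JSON.
    with open("etl_transform_dataset.json", "w") as f:
        json.dump([make_flatten_case()], f, indent=2)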

1.1 Datasets: ETL Transforms

Purpose:
Evaluate deterministic, pure-Python data transformations without external dependencies.

Coverage (31 test cases):

  • Nested object flattening
  • Array explosion and pivoting
  • Data type normalization
  • Field derivation and enrichment
  • Deduplication and aggregation
  • Format standardization (dates, enums, phone numbers)

Key Constraint:
No external API calls are permitted. All logic must be implemented using standard Python constructs.

Sample Test Case:

{
  "task": "Flatten all nested objects consistently into top-level columns.",
  "input": {
    "id": "INC001",
    "assigned_to": {
      "id": "u1",
      "name": "Alice"
    },
    "opened_by": {
      "id": "u2",
      "name": "Bob"
    }
  },
  "output": {
    "id": "INC001",
    "assigned_to_id": "u1",
    "assigned_to_name": "Alice",
    "opened_by_id": "u2",
    "opened_by_name": "Bob"
  },
  "ground_truth_code": "def transform(data):\n    result = {}\n    \n    def flatten(obj, prefix=''):\n        for key, value in obj.items():\n            new_key = f\"{prefix}_{key}\" if prefix else key\n            if isinstance(value, dict):\n                flatten(value, new_key)\n            else:\n                result[new_key] = value\n    \n    flatten(data)\n    return result"
}
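
For readability, the ground_truth_code string above unescapes to the following reference implementation:

def transform(data):
    result = {}

    def flatten(obj, prefix=''):
        for key, value in obj.items():
            new_key = f"{prefix}_{key}" if prefix else key
            if isinstance(value, dict):
                flatten(value, new_key)
            else:
                result[new_key] = value

    flatten(data)
    return result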

1.2 Datasets: LLM/API Integration

Purpose:
Evaluate a model’s ability to correctly integrate with real LLM and embedding APIs.

Coverage (16 test cases):

  • Embedding generation (OpenAI, Cohere, HuggingFace, Voyage AI, Jina AI)
  • Text summarization (GPT-4, Claude)
  • Text classification (sentiment, intent, moderation)
  • Multi-step pipelines combining embeddings with downstream LLM inference

Key Challenge:
Correctness requires actual API usage, not simulated outputs or deterministic stand-ins.

Sample Test Case:

{
  "task": "Generate embeddings for product descriptions using OpenAI's text-embedding-3-small model. The output should include a 'description_embedding' field (1536 dimensions) and 'embedding_model' field.",
  "input": {
    "product_id": "P001",
    "name": "Wireless Bluetooth Headphones",
    "description": "Premium over-ear headphones with active noise cancellation and 30-hour battery life"
  },
  "output": {
    "product_id": "P001",
    "name": "Wireless Bluetooth Headphones",
    "description": "Premium over-ear headphones with active noise cancellation and 30-hour battery life",
    "description_embedding": "[vector of 1536 dimensions]",
    "embedding_model": "text-embedding-3-small"
  },
  "ground_truth_code": "def transform(data):\n    import openai\n    \n    result = data.copy()\n    \n    if 'description' in result and result['description']:\n        response = openai.embeddings.create(\n            input=result['description'],\n            model=\"text-embedding-3-small\"\n        )\n        result['description_embedding'] = response.data[0].embedding\n        result['embedding_model'] = \"text-embedding-3-small\"\n    \n    return result"
}
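
Unescaped, the reference implementation for this test case is:

def transform(data):
    import openai

    result = data.copy()

    if 'description' in result and result['description']:
        response = openai.embeddings.create(
            input=result['description'],
            model="text-embedding-3-small"
        )
        result['description_embedding'] = response.data[0].embedding
        result['embedding_model'] = "text-embedding-3-small"

    return result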

2. Evaluation and Execution Pipeline

Load Dataset → Code Generation → Code Execution → Judging → Aggregation

2.1 Code Generation for Candidate Models

For code generation, we use model-specific clients for each provider, including OpenAI, Anthropic, and Google. Depending on the experiment setup, we evaluate models under two prompt configurations: (1) No Web Search, where models operate closed-book and must rely solely on internal knowledge to generate transformation code; and (2) Web Search Enabled, where models may query external documentation to discover current API syntax and usage patterns, using each provider's native web search tool exposed through its client API. Generated responses are post-processed to extract executable code from markdown blocks, with all explanatory text removed so that only runnable code is evaluated.
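
As an illustration, the markdown post-processing step can be as simple as the sketch below. The regex-based extraction is an assumption made for illustration, not the exact code we run:

import re

def extract_code(response_text: str) -> str:
    # Prefer fenced code blocks; fall back to the raw response if none exist.
    blocks = re.findall(r"```(?:python)?\s*\n(.*?)```", response_text, re.DOTALL)
    if blocks:
        # Keep only the code, dropping any surrounding explanatory prose.
        return "\n\n".join(block.strip() for block in blocks)
    return response_text.strip()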

2.2 Code Execution for Candidate Models

All generated code is executed in a sandboxed Python environment to ensure isolation and reproducibility. During execution, the system captures standard output, standard error, and structured return values. After studying the failure modes of candidate models in early runs, we added an explicit distinction between classes of failures: runtime execution errors, API key or environment-related errors, and logical mismatches between expected and actual outputs. We also apply fallback strategies where appropriate, allowing the evaluation pipeline to continue even when execution fails while still preserving detailed error signals for downstream analysis.
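
A minimal sketch of this execution step, assuming a subprocess-based sandbox with a timeout (the real harness adds stricter isolation and finer-grained error classification):

import json
import subprocess
import sys
import tempfile

RUNNER_TEMPLATE = """
import json, sys

{generated_code}

if __name__ == "__main__":
    data = json.loads(sys.argv[1])
    print(json.dumps(transform(data)))
"""

def run_transform(generated_code: str, input_record: dict, timeout: int = 30):
    # Write the generated transform plus a tiny runner to a temp file,
    # then execute it in a separate Python process.
    script = RUNNER_TEMPLATE.format(generated_code=generated_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name

    proc = subprocess.run(
        [sys.executable, path, json.dumps(input_record)],
        capture_output=True, text=True, timeout=timeout,
    )
    # Capture stdout/stderr; parse the structured return value on success.
    actual_output = None
    if proc.returncode == 0 and proc.stdout.strip():
        # The runner prints the transform's return value as the last line.
        actual_output = json.loads(proc.stdout.strip().splitlines()[-1])
    return actual_output, proc.stdout, proc.stderr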

2.3 Judging

Broadly, we employ LLM-as-Judge as our evaluation strategy, using Claude Sonnet 4.5 as the judge model and scoring each generated solution across five dimensions on a 0–10 scale:

  • Correctness
  • Code quality
  • Efficiency
  • Robustness
  • Similarity to the ground-truth approach

While manually reviewing judgment traces during early runs, we realized the judge had its own failure modes. To keep judgments fair and correct, we added special handling (reflected in the sketch after this list) for:

  • Granting points based on semantic equivalence in text-generation tasks instead of exact match.
  • Handling API key–related execution failures gracefully and granting points if the generated code is valid, even though it errors out when loading API keys.
  • Allowing test cases to be marked as passed if the JSON constructed by the judge model has parsing issues.
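
The sketch below illustrates the judging call, assuming the Anthropic Python SDK and a judge prompt that asks for a JSON rubric; the prompt wording, model alias, and fallback behavior shown are illustrative rather than our exact implementation:

import json
import anthropic

JUDGE_PROMPT = """You are evaluating generated data-transform code.
Task: {task}
Expected output: {expected}
Actual output: {actual}
Generated code:
{code}

Score correctness, code_quality, efficiency, robustness, and
similarity_to_ground_truth on a 0-10 scale. Return JSON only."""

def judge(task, expected, actual, code):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-sonnet-4-5",  # judge model alias; illustrative
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                task=task, expected=expected, actual=actual, code=code
            ),
        }],
    )
    try:
        return json.loads(message.content[0].text)
    except json.JSONDecodeError:
        # Mirror the fallback above: a malformed judge response should not
        # hard-fail the test case.
        return {"passed": True, "feedback": "judge response was not valid JSON"}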

2.4 Results Storage

Results are stored as structured JSON, capturing per-test and per-model metrics, detailed latency breakdowns, and aggregated summary statistics.

Sample Result Trace:

{
  "metadata": {...},
  "results": {
    "test_cases": [
      {
        "test_case_id": 1,
        "task": "...",
        "input": {...},
        "expected_output": {...},
        "ground_truth_code": "...",
        "model_results": [
          {
            "model": "model-name",
            "generated_code": "...",
            "execution_success": true/false,
            "actual_output": {...},
            "error_message": "...",
            "evaluation": {
              "correctness": 0-10,
              "code_quality": 0-10,
              "efficiency": 0-10,
              "robustness": 0-10,
              "similarity_to_ground_truth": 0-10,
              "overall_score": 0-10,
              "passed": true/false,
              "feedback": "...",
              "error_analysis": "..."
            },
            "latency": {
              "code_generation": seconds,
              "execution": seconds,
              "judging": seconds,
              "total": seconds
            }
          }
        ]
      }
    ],
    "summary": {
      "model-name": {
        "average_score": float,
        "pass_rate": float,
        "passed": int,
        "total": int,
        "avg_code_generation_latency": float,
        "avg_execution_latency": float,
        "avg_judging_latency": float,
        "avg_total_latency": float
      }
    }
  }
}
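
For reference, the per-model summary block can be derived from the per-test-case records with a small aggregation pass like the sketch below (field names follow the trace above; latency fields other than total are omitted for brevity):

from collections import defaultdict

def summarize(test_cases):
    # Group results by model across all test cases.
    by_model = defaultdict(list)
    for case in test_cases:
        for result in case["model_results"]:
            by_model[result["model"]].append(result)

    summary = {}
    for model, results in by_model.items():
        scores = [r["evaluation"]["overall_score"] for r in results]
        passed = sum(1 for r in results if r["evaluation"]["passed"])
        summary[model] = {
            "average_score": sum(scores) / len(scores),
            "pass_rate": passed / len(results),
            "passed": passed,
            "total": len(results),
            "avg_total_latency": sum(r["latency"]["total"] for r in results) / len(results),
        }
    return summary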

3. Experiments

Overview

We conducted two independent evaluation runs covering mini models and SOTA models across the two core data transformation workloads: deterministic ETL transforms and LLM/API integration tasks. The selected models in both suites represent some of the strongest publicly reported options for code generation at their respective capacity tiers. To reflect realistic production constraints, all models were evaluated without extended or chain-of-thought reasoning, prioritizing low-latency code generation where time-to-output is critical.

Each model was evaluated under three conditions: ETL transformations, LLM/API integration, and LLM/API integration with web search enabled. To approximate real-world usage, we also report a unified task accuracy, which combines ETL performance with LLM/API performance under documentation-aware conditions (see the note after the table). The table below summarizes pass rates across all evaluation settings for both mini and SOTA model classes. In the following sections, we examine each experiment in detail and discuss recurring failure modes across tasks.

| Model | ETL Transforms | LLM/API Integration | LLM/API Integration (with Web Search) | Unified Task Accuracy (ETL + LLM/API with search) |
| --- | --- | --- | --- | --- |
| Mini Models | | | | |
| claude-haiku-4.5 | 45.2% | 6.2% | 56.2% | 50.7% |
| gemini-flash-3 | 93.5% | 37.5% | 75.0% | 84.3% |
| gpt-4o-mini | 83.9% | 18.8% | 37.5% | 60.7% |
| gpt-5.1-codex-mini | 90.3% | 31.2% | 56.2% | 73.3% |
| SOTA Models | | | | |
| claude-opus-4.5 | 100.0% | 37.5% | 43.8% | 71.9% |
| gemini-3-pro-preview | 100.0% | 18.8% | 56.2% | 78.1% |
| gpt-5.2 | 96.8% | 31.2% | 81.2% | 89.0% |
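
As a note on how to read the last column: the unified task accuracy values above are consistent with an equal-weight average of the ETL pass rate and the web-search-enabled LLM/API pass rate. The helper below is an illustrative reconstruction under that assumption:

def unified_task_accuracy(etl_pass_rate: float, llm_api_with_search_pass_rate: float) -> float:
    # Assumed equal weighting; reproduces the Unified Task Accuracy column,
    # e.g. claude-haiku-4.5: (45.2 + 56.2) / 2 = 50.7.
    return (etl_pass_rate + llm_api_with_search_pass_rate) / 2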

3.1 Zoom In: Mini Models

Candidate Suite:

  • claude-haiku-4.5
  • gemini-flash-3
  • gpt-4o-mini
  • gpt-5.1-codex-mini

Judge model: Claude Sonnet 4.5

Scenario 1: ETL Tasks


  • Gemini-flash-3 performs best overall among mini models, with high correctness and minimal edge-case failures. Most errors are minor and related to incomplete edge-case handling.
  • GPT-4o-mini shows generally correct logic but frequently reconstructs input objects instead of preserving all original fields, leading to silent data loss in some transformations.
  • GPT-5.1-codex-mini tends to over-generalize transformations, applying logic globally rather than selectively. Common issues include hardcoded thresholds and incorrect field scoping.
  • Claude-haiku-4.5 performs worst on ETL tasks, frequently violating constraints by calling Claude 3.5 Sonnet (via the deprecated model name claude-3-5-sonnet-20241022) for deterministic operations such as timestamp parsing, numeric checks, or simple conditionals.

Scenario 2: LLM/API Tasks (Without Web Search)


  • All mini models struggle significantly without documentation access.
  • Claude-haiku-4.5 consistently uses invalid or non-existent model names and, in some cases, fabricates outputs instead of calling APIs.
  • Gemini-flash-3 often produces correct-looking outputs but avoids required API calls, instead using deterministic or pattern-based shortcuts.
  • GPT-4o-mini commonly relies on deprecated OpenAI SDK syntax (pre-v1.0), with incorrect response parsing and outdated imports.
  • GPT-5.1-codex-mini frequently generates fake API implementations using hashing or randomness rather than real embedding or inference APIs.

Overall pass rates are extremely low in this setting, indicating that mini models cannot reliably infer modern API usage without external guidance.

Scenario 3: LLM/API Tasks (With Web Search)


  • Web search improves performance across all mini models by 2–9×, with the largest gains observed for Claude-haiku-4.5.
  • Gemini-flash-3 remains prone to shortcut behaviors even with documentation access, particularly in multi-step API workflows.
  • GPT-4o-mini continues to struggle with parameter correctness and response parsing despite improved API discovery.
  • GPT-5.1-codex-mini shows partial improvement but still exhibits outdated imports and incomplete integrations.

Summary:

[Figure: Mini Models Summary]

3.2 Zoom In: State-of-the-Art (SOTA) Models

Candidate Suite:

  • claude-opus-4.5
  • gemini-3-pro
  • gpt-5.2

Judge model: Claude Sonnet 4.5

Scenario 1: ETL Tasks


  • Gemini-3-pro and Claude-opus-4.5 achieve perfect pass rates across all ETL tasks, producing clean, constraint-compliant, and highly readable code.
  • GPT-5.2 delivers the fastest code generation but shows occasional correctness issues, primarily due to hardcoded thresholds in outlier detection and aggregation logic.

Overall, SOTA models demonstrate strong reasoning and constraint adherence for deterministic transformation tasks.

Scenario 2: LLM/API Tasks (Without Web Search)


  • Even SOTA models struggle without access to API documentation
  • Gemini-3-pro defaults to fake embedding implementations using hashing or randomness in the majority of failures.
  • GPT-5.2 exhibits similar behavior, substituting deterministic logic for real API calls when documentation is unavailable.
  • Claude-opus-4.5 performs best in this setting but still generates fake API implementations in a significant fraction of failures.

This highlights a familiar limitation: even frontier models do not reliably internalize up-to-date API contracts.

Scenario 3: LLM/API Tasks (With Web Search)


  • GPT-5.2 achieves the strongest overall performance, with an 81.25% pass rate and only minor residual issues related to execution environments and formatting.
  • Gemini-3-pro shows the largest relative improvement but still suffers from execution-time failures and response-parsing errors in some integrations.
  • Claude-opus-4.5 improves modestly with web search but remains sensitive to API key errors and provider-specific response formats.

With documentation access, SOTA models become viable for complex, real-world API-driven transformation workflows.

Summary:

[Figure: State-of-the-Art Models Summary]

For big models, we also care about speed:

[Figure: State-of-the-Art Model Speed]

4. Takeaways

4.1 Failure Pattern Taxonomy

| Failure | Description | Models |
| --- | --- | --- |
| API Overuse | Uses LLM calls for deterministic logic | claude-haiku-4.5 |
| Fake Implementations | Hash- or random-based outputs instead of real APIs | gpt-5.1-codex-mini, gpt-4o-mini, gemini-flash-3, gemini-3-pro, gpt-5.2 |
| API Version Drift | Deprecated SDKs or incorrect parameters | gpt-4o-mini, gpt-5.1-codex-mini, gpt-5.2 |
| Shortcut Heuristics | Requirement-violating logic with correct outputs | gemini-flash-3, gemini-3-pro |
| Prompt Non-Compliance | Explicit constraints ignored | claude-haiku-4.5, gpt-5.1-codex-mini |

4.2 Key Observations

  1. Gemini-flash-3 punches above its weight
    Gemini-flash-3 does more than beat the other mini models; it even outperforms Gemini-3-pro. This tracks with Google's reported coding-focused benchmarks such as SWE-bench Verified, where Flash scored higher than Pro (78% vs. 76.2%). Our results add empirical support to the hypothesis that, during distillation, Flash not only compressed the Pro model's capabilities but also sharpened its code generation signal. In theory, it also raises questions about how distillation interacts with intermediate representations, potentially pruning features that are less relevant to a given task.
  2. Web search is necessary for API calls
    The web search tool remains the single most important lever for successful real-world API integrations across all models evaluated. Without documentation access, nearly every model’s accuracy collapses, regardless of overall capacity. With web search enabled, models can ground their API usage in current syntax and examples, dramatically improving correctness. In production scenarios where timely API correctness is critical, this suggests that documentation-aware prompting should be prioritized over brute model size.
  3. GPT-5.2 emerges as the most balanced SOTA
    Among the models we evaluated, GPT-5.2 stands out as the most balanced model, combining competitive speed, high correctness on both ETL and LLM/API tasks, and strong recovery when documentation is available. Its performance with web search enabled contributes to the highest Unified Task Accuracy overall, indicating that it strikes a pragmatic balance between data transforms, API integration, and latency. This makes GPT-5.2 an especially strong choice for end-to-end transform workflows where both deterministic logic and dynamic API interactions are required.

Public Release

