Senior AI Coding Benchmark Designer and Engineer

G2i Inc. • Mexico

Remote

AI Summary

Design and build coding benchmarks and evaluation pipelines for frontier AI models. Requires 4+ years of software engineering experience and expertise in Python, Git, and modern development workflows.

Key Highlights

Design and build coding benchmarks and evaluation pipelines

Evaluate frontier AI models on real-world programming tasks

Analyze model-generated code for correctness and reliability

Key Responsibilities

Design and build coding benchmarks that evaluate frontier models on real-world programming tasks

Build and maintain scalable data pipelines for evaluation workflows

Analyze model-generated code for correctness, reliability, and edge-case failures

Technical Skills Required

Python, Git, modern development workflows, LLM coding benchmarks, data pipelines, model-generated code analysis

Benefits & Perks

Fully remote work

$80-$100/hour compensation

Weekly payment via PayPal or Stripe

Job Description

Before Applying

This role is open to contractors in accepted locations only. Please confirm that your country is on the list of accepted countries and locations before applying; we're unable to process applications from unlisted locations.

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

What You'll Be Doing

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

Design coding benchmarks that evaluate frontier models on real-world programming tasks — reasoning, debugging, and production-quality code
Build and maintain scalable data pipelines for evaluation workflows
Analyze model-generated code for correctness, reliability, and edge-case failures
Construct structured evaluation scenarios across large repos and multi-language environments
Provide detailed technical feedback on model performance and failure patterns
Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do — and shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.

What You'll Need

4+ years of professional software engineering experience (non-negotiable)
Expert Python — clean, performant, well-tested code
Hands-on experience working in large, complex codebases
Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
Strong command of Git and modern development workflows
Track record at a high-growth tech company or top-tier software organization
Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice to Have

Senior or Lead-level profile with a history of technical ownership
Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)
Proficiency in additional languages: JavaScript, Go, C++, or others
CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit)
Background in security engineering or significant open-source contributions
Familiarity with AI/ML evaluation methodologies or model benchmarking

Logistics

Location: Fully remote — work from anywhere on the accepted locations list
Compensation: $80–$100/hr based on location and seniority
Contract length: 3 months, with potential for extension
Hours: Full-time availability preferred — hours vary by project and are not guaranteed week to week
Engagement: 1099 independent contractor
Payment: Weekly via PayPal or Stripe

⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.

Job Overview

Posted Date: May 25, 2026
Employment Type: Contract
Experience Level: Mid-Senior level
Location: Mexico
Annual Salary: 96,000 - 120,000 USD
Category: Programming
Company: G2i Inc.
