{"id":97700,"date":"2026-04-27T17:13:52","date_gmt":"2026-04-27T13:13:52","guid":{"rendered":"https:\/\/herecareers.com\/job\/freelance-agent-evaluation-engineer\/"},"modified":"2026-04-27T17:13:57","modified_gmt":"2026-04-27T13:13:57","slug":"freelance-agent-evaluation-engineer","status":"publish","type":"job_listing","link":"https:\/\/herecareers.com\/ar\/job\/freelance-agent-evaluation-engineer\/","title":{"rendered":"Freelance Agent Evaluation Engineer"},"content":{"rendered":"<p><em>Please submit your CV in English and indicate your level of English proficiency.<\/em><\/p>\n<p><a href=\"https:\/\/himalayas.app\/companies\/mindrift\" rel=\"nofollow noopener\" target=\"_blank\">Mindrift<\/a> connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.<\/p>\n<h3><strong>What this opportunity involves<\/strong><\/h3>\n<p>We&#8217;re building a dataset to evaluate AI coding agents \u2014 how well a model handles real-world developer tasks. 
You&#8217;ll create challenging tasks and evaluation criteria within realistic simulated environments:<\/p>\n<ul>\n<li>Build virtual companies following a high-level plan &#8211; codebase, infrastructure, and context (conversations, documentation, tickets) that form a realistic environment with development history<\/li>\n<li>Assemble and calibrate tasks from intermediate states of the virtual company: craft the prompt, define evaluation criteria, and ensure the task is solvable and the evaluation is fair<\/li>\n<li>Design tasks set in isolated environments &#8211; emulations of a developer&#8217;s workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase<\/li>\n<li>Write tests that accept all correct solutions and reject incorrect ones &#8211; neither too strict (breaking on valid approaches) nor too lenient (passing bad ones)<\/li>\n<li>Iterate with an AI agent on tests &#8211; verifying they catch real problems, don&#8217;t miss bad solutions, and don&#8217;t break on good ones<\/li>\n<li>Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios<\/li>\n<li>Iterate based on feedback from expert QA reviewers who score your work on quality criteria<\/li>\n<\/ul>\n<h3>What this is NOT<\/h3>\n<ul>\n<li>Not data labeling<\/li>\n<li>Not prompt engineering<\/li>\n<li>Not writing code from scratch &#8211; the agent writes most of the code; you guide and evaluate<\/li>\n<\/ul>\n<p>A significant part of the work is done together with AI &#8211; it&#8217;s very hard to create tasks that challenge frontier models without using frontier models.<\/p>\n<h3><strong>What we look for<\/strong><\/h3>\n<p>This opportunity is a good fit for experienced developers, software engineers, and\/or test automation specialists open to part-time, non-permanent projects. 
Ideally, contributors will have:<\/p>\n<ul>\n<li>Degree in Computer Science, Software Engineering, or related fields<\/li>\n<li>5+ years in software development, primarily Python (FastAPI, pytest, async\/await, subprocess, file operations)<\/li>\n<li>Background in full-stack development, with experience building React-based interfaces (JavaScript\/TypeScript) and robust back-end systems<\/li>\n<li>Experience writing tests (functional, integration), not just running them<\/li>\n<li>Experience with Docker containers and familiarity with infrastructure tools (Postgres, Kafka, Redis)<\/li>\n<li>CI\/CD understanding (GitHub Actions as a user: triggers, labels, reading results)<\/li>\n<li>English proficiency &#8211; B2<\/li>\n<\/ul>\n<p>You don&#8217;t need to be an expert in every item, but you should be comfortable reading and reasoning about code across the stack.<\/p>\n<h3>Why this is hard<\/h3>\n<ol>\n<li>Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution.<\/li>\n<li>Tasks have many valid solutions. Writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.<\/li>\n<\/ol>\n<h3>How it works<\/h3>\n<p>Apply \u2192 Pass qualification(s) \u2192 Join a project \u2192 Complete tasks \u2192 Get paid<\/p>\n<h3>Effort estimate<\/h3>\n<p>Tasks for this project are estimated to take about 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. 
Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.<\/p>\n<h3>Compensation<\/h3>\n<p>On this project, contributors can earn up to <strong>$17 per hour equivalent<\/strong>, depending on their level and pace of contribution.<\/p>\n<p>Compensation varies across projects depending on scope, complexity, and required expertise. Please note that other projects on the platform may offer different earning levels based on their requirements.<\/p>\n<p>Originally posted on <a href=\"https:\/\/himalayas.app\" rel=\"nofollow noopener\" target=\"_blank\">Himalayas<\/a><\/p>","protected":false},"author":100,"featured_media":0,"comment_status":"open","ping_status":"open","template":"","job_listing_type":[1174],"job_listing_category":[2525],"job_listing_location":[],"job_listing_tag":[],"class_list":["post-97700","job_listing","type-job_listing","status-publish","hentry","job_listing_type-remote","job_listing_category-ai-evaluation-engineer"],"metas":{"_job_featured_image":"","_job_featured":"","_job_filled":"","_job_urgent":"","_job_category":{"2525":"AI-Evaluation-Engineer"},"_job_type":{"1174":"Remote"},"_job_tag":[],"_job_expiry_date":"","_job_gender":"","_job_apply_type":"external","_job_phone":"","_job_apply_url":"https:\/\/himalayas.app\/companies\/mindrift\/jobs\/freelance-agent-evaluation-engineer-1387176212","_job_apply_email":"","_job_salary_type":"","_job_salary":"","_job_experience":"","_job_career_level":"","_job_qualification":"","_job_video_url":"","_job_photos":"","_job_application_deadline_date":"","_job_address":"","_job_location":[],"_job_map_location":{"address":"","latitude":"","longitude":""},"_job_logo":"https:\/\/herecareers.com\/wp-content\/uploads\/wp-job-board-pro-uploads\/_employer_featured_image\/2025\/11\/ICON-Here-Careers-Logo-150-150x150.png","_job_employer_name":"HR","_job_employer_url":"https:\/\/herecareers.com\/ar\/employer\/hr\/"},"cmb2":{"_job_careerjet_job_fields":{"_job_careerjet_detail_url":"","_jo
b_careerjet_company_name":""}},"_links":{"self":[{"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/job_listing\/97700","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/job_listing"}],"about":[{"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/types\/job_listing"}],"author":[{"embeddable":true,"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/users\/100"}],"replies":[{"embeddable":true,"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/comments?post=97700"}],"wp:attachment":[{"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/media?parent=97700"}],"wp:term":[{"taxonomy":"job_listing_type","embeddable":true,"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/job_listing_type?post=97700"},{"taxonomy":"job_listing_category","embeddable":true,"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/job_listing_category?post=97700"},{"taxonomy":"job_listing_location","embeddable":true,"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/job_listing_location?post=97700"},{"taxonomy":"job_listing_tag","embeddable":true,"href":"https:\/\/herecareers.com\/ar\/wp-json\/wp\/v2\/job_listing_tag?post=97700"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}