AI / Systems Engineer — NeoHuman.
Own the desktop capture, trace generation, and evaluation evidence infrastructure behind NeoHuman, working at the boundary between operating systems, multimodal capture, LLM-based normalization, and assessment pipelines.
About Us.
Emergences Labs builds infrastructure for evaluating AI-native work. Our product, NeoHuman, is a desktop-first challenge and proctoring system that observes how people use AI tools, software, files, and workflows during real tasks, then turns those sessions into reliable evaluation evidence.
We care about more than final answers. We want to understand the process — what tools someone used, what they typed, what they copied, what they verified, how they changed their work, and how they moved across apps. That trace becomes the foundation for AI competency evaluation, expert review, and high-quality training and evaluation data.
We are a small, research-driven team. Our published work includes the AgentIF-OneDay benchmark. We value demonstrated capability over credentials.
The Role.
We are looking for an AI / Systems Engineer to own the desktop capture, trace generation, and evaluation evidence infrastructure behind NeoHuman.
This is not a typical full-stack role. You will work at the boundary between operating systems, desktop apps, multimodal data capture, LLM-based normalization, and assessment pipelines. The core challenge is building a trustworthy system that can reconstruct a candidate's real work session across macOS and Windows without relying on a browser extension.
What You'll Do.
Desktop Session Capture
Build and improve cross-platform capture for macOS and Windows: app and window activity, accessibility trees, OCR snapshots, clipboard events, keyboard and mouse activity boundaries, file interactions, and replay metadata.
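To give a concrete flavor of the domain, here is a deliberately simplified sketch of what a normalized capture event could look like. Every name below is illustrative rather than our production schema, and later sketches in this posting reuse the same shape:

```ts
// Illustrative only: a normalized cross-platform capture event.
// Type and field names are hypothetical, not NeoHuman's real schema.
type CaptureSource = "ax" | "uia" | "ocr" | "clipboard" | "input" | "fs" | "video";

interface CaptureEvent {
  id: string;            // stable event id, used for dedup and replay
  sessionId: string;     // scopes capture to one consented session
  ts: number;            // epoch ms, clock-skew corrected at ingestion
  source: CaptureSource; // which sensor produced the signal
  app?: string;          // frontmost app (bundle id / exe name), if known
  window?: string;       // window title, possibly redacted
  payload: unknown;      // sensor-specific body, validated downstream
  confidence: number;    // 0..1 (accessibility text ~1.0, OCR lower)
}
```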
Trace Fusion and Timeline Generation
Turn noisy low-level signals into clean, human-readable session timelines: what the user did, where they went, what they typed, what they viewed, what files they touched, and what they submitted.
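Reusing the hypothetical CaptureEvent above, a first-cut fusion pass might sort events and coalesce runs in the same app and window, always keeping pointers back to the raw evidence:

```ts
// Hypothetical fusion pass: sort raw events, then coalesce consecutive
// events in the same app/window into one human-readable timeline entry.
interface TimelineEntry {
  start: number;
  end: number;
  app?: string;
  window?: string;
  summary: string;    // e.g. "typed in Claude", "opened report.xlsx"
  eventIds: string[]; // raw evidence behind this entry, kept for audit
}

function fuse(events: CaptureEvent[], gapMs = 5_000): TimelineEntry[] {
  const sorted = [...events].sort((a, b) => a.ts - b.ts);
  const entries: TimelineEntry[] = [];
  for (const e of sorted) {
    const last = entries[entries.length - 1];
    if (last && last.app === e.app && last.window === e.window && e.ts - last.end <= gapMs) {
      // Same app/window within the gap threshold: extend the current entry.
      last.end = e.ts;
      last.eventIds.push(e.id);
    } else {
      entries.push({
        start: e.ts, end: e.ts, app: e.app, window: e.window,
        summary: describe(e), eventIds: [e.id],
      });
    }
  }
  return entries;
}

declare function describe(e: CaptureEvent): string; // summarization left abstract
```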
Evaluation Evidence Pipeline
Design the data model and ingestion pipeline that converts desktop traces, artifacts, recordings, and challenge metadata into reliable evidence for AI competency evaluation.
Multimodal Capture Reliability
Build robust fallback logic across AX/UIA, OCR, keyboard and mouse triggers, clipboard, video segments, and artifact diffs, with explicit confidence and gap tracking.
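The shape of that fallback logic, sketched with the sensor calls left abstract (all names hypothetical):

```ts
// Hypothetical fallback chain: prefer high-fidelity accessibility reads,
// degrade to OCR, and record an explicit gap rather than guessing.
type WindowRef = { pid: number; windowId: number };

type TextRead =
  | { kind: "exact"; text: string; confidence: 1 }         // AX / UIA tree
  | { kind: "inferred"; text: string; confidence: number } // OCR estimate
  | { kind: "gap"; reason: string };                       // nothing usable

async function readVisibleText(win: WindowRef): Promise<TextRead> {
  const ax = await tryAccessibilityText(win); // AX on macOS, UIA on Windows
  if (ax !== null) return { kind: "exact", text: ax, confidence: 1 };

  const ocr = await tryOcrSnapshot(win);      // screenshot + OCR fallback
  if (ocr !== null) return { kind: "inferred", text: ocr.text, confidence: ocr.score };

  // A missing read is itself evidence; never silently drop it.
  return { kind: "gap", reason: "no accessibility tree and OCR failed" };
}

// Platform sensors are deliberately abstract in this sketch.
declare function tryAccessibilityText(win: WindowRef): Promise<string | null>;
declare function tryOcrSnapshot(win: WindowRef): Promise<{ text: string; score: number } | null>;
```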
LLM Normalization and Reporting
Use LLMs carefully to summarize, normalize, and structure session traces without hallucinating or hiding evidence quality.
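A defensive pattern in miniature, sketched with zod and a hypothetical provider call: force the model into a strict schema, validate everything, and require each summarized step to cite raw evidence so quality labels can never be silently upgraded:

```ts
// Sketch only: schema-validated LLM normalization. The provider call is
// hypothetical; the validation discipline is the point.
import { z } from "zod";

const Step = z.object({
  summary: z.string().max(300),
  evidenceIds: z.array(z.string()).min(1),  // every claim must cite raw events
  quality: z.enum(["exact", "inferred"]),   // the model cannot invent "exact"
});
const Report = z.object({ steps: z.array(Step) });

async function normalize(trace: string): Promise<z.infer<typeof Report>> {
  const raw = await llmCompleteJson(trace); // hypothetical provider wrapper
  const parsed = Report.safeParse(raw);
  if (!parsed.success) {
    throw new Error(`model output failed schema validation: ${parsed.error.message}`);
  }
  return parsed.data;
}

declare function llmCompleteJson(prompt: string): Promise<unknown>;
```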
Privacy, Consent, and Proctoring Safety
Implement session-scoped capture, transparent user consent, sensitive-field handling, permission diagnostics, retention controls, and enterprise-ready auditability.
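Sensitive input should be masked before an event is ever persisted, not at display time. A hypothetical redaction pass, leaning on real platform signals (macOS exposes secure inputs via the AXSecureTextField role; Windows UIA has an analogous IsPassword property):

```ts
// Illustrative redaction pass: mask secure fields and likely secrets
// before an event reaches storage. Patterns below are examples only.
const SECRET_PATTERN = /\b(?:sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16})\b/g;

function redact(e: CaptureEvent, axRole?: string): CaptureEvent {
  // macOS AX marks secure inputs with the AXSecureTextField role;
  // on Windows, UIA's IsPassword property plays the same part.
  if (axRole === "AXSecureTextField") {
    return { ...e, payload: { redacted: true, reason: "secure-field" } };
  }
  if (typeof e.payload === "string") {
    return { ...e, payload: e.payload.replace(SECRET_PATTERN, "[REDACTED]") };
  }
  return e;
}
```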
Desktop App Engineering
Own Electron packaging, native helpers, permission flows, auto-updates, logging, crash handling, and local performance budgets.
Backend Data Infrastructure
Maintain PostgreSQL and Supabase schemas, migrations, storage flows, async jobs, replay assets, and export formats for evaluation and research partners.
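The guiding principle is that bad evidence should be hard to store. An illustrative migration (hypothetical tables, real PostgreSQL constraints) pushes validity into the database itself:

```ts
// Hypothetical migration runner signature; the SQL constraints are the point.
export const up = async (run: (sql: string) => Promise<void>) => {
  await run(`
    CREATE TABLE IF NOT EXISTS capture_event (
      id          uuid PRIMARY KEY,
      session_id  uuid NOT NULL REFERENCES session(id),
      ts          timestamptz NOT NULL,
      source      text NOT NULL
                  CHECK (source IN ('ax','uia','ocr','clipboard','input','fs','video')),
      confidence  real NOT NULL CHECK (confidence BETWEEN 0 AND 1),
      payload     jsonb NOT NULL
    )`);
  await run(`
    CREATE INDEX IF NOT EXISTS capture_event_session_ts
      ON capture_event (session_id, ts)`);
};
```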
Problems You'll Work On.
How do we reliably know what a user typed in Claude, ChatGPT, Codex, Cursor, VS Code, Figma, Slack, Google, or any other desktop or web app?
How do we combine accessibility trees, OCR, keyboard and mouse triggers, clipboard, and recordings into one clean trace?
How do we avoid noisy engineering logs in the final report while preserving enough raw evidence for audits?
How do we keep capture lightweight enough that it does not slow down a candidate's machine?
How do we make the same product work across macOS and Windows even though the underlying OS APIs are different?
How do we distinguish exact evidence from inferred evidence without making the product feel complex?
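On that last question, one plausible answer in miniature, building on the hypothetical TextRead above: keep the internal distinction rich, and collapse it into a tiny user-facing vocabulary only at render time:

```ts
// Sketch: rich internal evidence kinds collapse to two user-facing labels;
// gaps render as explicit gap markers, never as evidence.
type DisplayQuality = "verified" | "approximate";

function displayQuality(r: TextRead): DisplayQuality | null {
  switch (r.kind) {
    case "exact":    return "verified";
    case "inferred": return "approximate";
    case "gap":      return null; // the UI shows a gap marker instead
  }
}
```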
Requirements.
Must-Have
Strong production engineering experience with TypeScript and Node.js.
Experience building desktop or system-level software, ideally with Electron plus native helpers.
Strong PostgreSQL experience — schema design, migrations, indexing, and data quality constraints.
Comfortable working with operating system APIs, permissions, background processes, and cross-platform differences.
Hands-on LLM API experience for structured output, summarization, evaluation, and hallucination control.
Strong debugging instincts across frontend, backend, desktop runtime, logs, and database state.
Security and privacy mindset around sensitive user activity, PII, consent, and enterprise audit requirements.
Strong Plus
macOS Accessibility / AX APIs, ScreenCaptureKit, Vision OCR, CGEventTap, or Input Monitoring experience.
Windows UI Automation, Win32 input hooks, OCR, or desktop capture experience.
Experience with proctoring, session replay, RPA, observability, or user activity capture systems.
Experience studying or building systems similar to Screenpipe, OpenAdapt, OpenChronicle, Rewind, or workflow mining tools.
Experience with multimodal data pipelines: OCR, video, screenshots, semantic UI trees, files, and event streams.
Experience with AI evaluation, rubric scoring, pairwise comparison, competency frameworks, or RL data pipelines.
Familiarity with Supabase, Vercel, Electron packaging, background workers, and cloud storage.
Tech Stack.
TypeScript · Node.js · Electron · Native macOS and Windows helpers · Next.js 15 · React 19 · PostgreSQL and Supabase · Storage · OCR · Accessibility APIs · OpenAI · Anthropic Claude · Gemini · Vercel · GitHub Actions.
What Good Looks Like.
You can read a messy trace from a real desktop session and identify which parts are exact, inferred, duplicated, missing, or noisy.
You can design a schema that makes bad evidence hard to store and easy to audit.
You can debug why macOS captured AX text in Claude but missed Codex, then turn that into a better permission or fallback strategy.
You can ship pragmatic improvements without pretending any single sensor is perfect.
You care about both product clarity and low-level correctness.
Why Join.
Hard, unusual problem
We are building a desktop-first evaluation system that sits between proctoring, AI assessment, replay, and multimodal evidence capture.
Small team, high ownership
You will own major parts of the capture and evaluation stack directly.
Research and product
We publish benchmarks, ship production software, and build data infrastructure for AI evaluation.
Technical founders
Leadership codes daily and cares about system design, evidence quality, and product craft.
Send Us Your Work.
Email jack@emergences.ai with your resume and a brief note about a system you have built — desktop software, AI infrastructure, data pipelines, evaluation, or low-level debugging. Use your name and the position as the subject line, e.g. "Jane Doe — AI / Systems Engineer." Emergences Labs is an equal opportunity employer.