# Optimization Model for an LLM-First Programming Language

## Core definition

A programming language is **optimized for LLMs** when LLM agents can generate, verify, repair, and maintain programs in that language with high reliability under realistic constraints.

More formally, for a task `T`, model `M`, context budget `C`, tool environment `E`, and repair budget `R`, the language/toolchain pair `L` is better optimized when it increases:

```text
P(correct, safe, maintainable program | T, M, C, E, R, L)
```

while decreasing:

```text
expected(tokens + tool calls + repair iterations + human intervention)
```

Here `L` means the whole toolchain, not the language alone: compiler, standard library, package manager, documentation format, diagnostics, formatter, test runner, and project conventions.
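As a minimal sketch of how the two quantities might be estimated from benchmark runs, the Python below computes an empirical success rate and a mean weighted cost. The `Run` record shape and the `WEIGHTS` values are assumptions made for illustration; the definition above does not say how the four cost terms are converted to a common unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One benchmark attempt for a fixed (T, M, C, E, R, L). Hypothetical record shape."""
    accepted: bool            # passed the correctness, safety, and maintainability checks
    tokens: int
    tool_calls: int
    repair_iterations: int
    human_interventions: int


# Assumed weights, in token-equivalents, so the four cost terms can be summed.
WEIGHTS = {"tool_call": 500, "repair_iteration": 2_000, "human_intervention": 50_000}


def success_rate(runs: list[Run]) -> float:
    """Empirical estimate of P(correct, safe, maintainable program | ...)."""
    return sum(r.accepted for r in runs) / len(runs)


def expected_cost(runs: list[Run]) -> float:
    """Mean weighted cost per run, in token-equivalents."""
    total = sum(
        r.tokens
        + WEIGHTS["tool_call"] * r.tool_calls
        + WEIGHTS["repair_iteration"] * r.repair_iterations
        + WEIGHTS["human_intervention"] * r.human_interventions
        for r in runs
    )
    return total / len(runs)


runs = [Run(True, 1_000, 2, 0, 0), Run(False, 3_000, 4, 2, 1)]
print(success_rate(runs), expected_cost(runs))  # 0.5 30500.0
```

Comparing two candidate toolchains then reduces to comparing these two numbers over the same tasks, models, and budgets.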

## What optimization is not

LLM optimization is not simply:

- using English-like syntax,
- minimizing character count,
- removing types,
- copying Python with different keywords,
- maximizing benchmark pass rate on toy algorithms,
- assuming one specific model architecture,
- hiding complexity from all users.

The target is not “a language only machines can read.” Humans must still audit generated software. The goal is a formal system that reduces model error modes while staying inspectable.

## Optimization dimensions

### 1. Syntactic determinism

At each generation step, the set of syntactically valid continuations should be small, so that a model has few ways to emit invalid syntax.

Desired properties:

- deterministic grammar,
- minimal contextual keywords,
- canonical formatter,
- limited optional punctuation,
- no significant whitespace beyond a clearly specified block structure,
- no multiple equivalent module syntaxes,
- no hidden preprocessor behavior.

Metrics:

- parse success rate on first generation,
- grammar production entropy for benchmark corpora,
- frequency of cases where reformatting changes program semantics,
- number of syntax repair iterations.

Design implication:

The language should prefer regular constructs over ad hoc special cases. If two spellings mean the same thing, one should be removed or made formatter-canonical.
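One plausible reading of the "grammar production entropy" metric is the Shannon entropy of the productions used across a corpus of parsed programs. The sketch below assumes each parse has already been flattened into a list of production names; the names `import.named` and `import.wildcard` are hypothetical.

```python
import math
from collections import Counter


def production_entropy(productions: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of grammar productions."""
    counts = Counter(productions)
    total = len(productions)
    # Each term is -p * log2(p); a single certain production contributes zero bits.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Two equivalent spellings split the probability mass between them.
corpus_two_spellings = ["import.named", "import.wildcard", "import.named", "import.wildcard"]
# After canonicalization only one spelling remains.
corpus_canonical = ["import.named"] * 4

print(production_entropy(corpus_two_spellings))  # 1.0
print(production_entropy(corpus_canonical))      # zero bits
```

A per-nonterminal conditional entropy would be a finer measure; the unconditional version is used here only to keep the sketch short.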

### 2. Semantic locality

Correct generation should depend as much as possible on information near the code being written.

Desired properties:

- imports identify exact symbols or modules,
- public interfaces summarize required types, errors, effects, capabilities, and contracts,
- no hidden global registration,
- no runtime reflection required for ordinary dispatch,
- project graph summaries available to agents,
- dependency APIs exposed through local machine-readable manifests.

Metrics:

- cross-file mismatch rate,
- missing-symbol rate,
- number of files required in prompt context for successful generation,
- success degradation under reduced context budgets.

Design implication:

Module interfaces should be self-describing. The compiler should be able to emit compact “agent cards” for packages, modules, and functions.

### 3. Boundary explicitness

LLMs are most likely to make harmful assumptions at boundaries: I/O, public APIs, persistence, time, randomness, concurrency, environment, and unsafe operations.

Desired properties:

- public functions require explicit input and output types,
- nullable/fallible results are represented in the type system,
- effects are declared or inferred and shown at boundaries,
- environment variables and secrets are typed declarations,
- external service calls use capabilities,
- serialization schemas are linked to static types.

Metrics:

- unhandled error rate,
- unchecked nullable value rate,
- undeclared environment/dependency rate,
- security-sensitive operation detection rate,
- runtime exception rate in hidden tests.

Design implication:

Boundary contracts should be concise enough for generation but strict enough that omission is a compiler error.

### 4. Repairability

Generated code will fail. The language should make failures cheap and reliable to repair.

Desired properties:

- structured compiler diagnostics with stable error codes,
- machine-readable spans, expected/actual types, and candidate fixes,
- diagnostic explanations include relevant local interface summaries,
- test failures include minimized counterexamples where possible,
- linter and security diagnostics are integrated into the same reporting model,
- repair suggestions distinguish safe automatic fixes from design choices.

Metrics:

- median repair iterations,
- probability of fixing an error after one diagnostic,
- diagnostic token length per successful repair,
- percentage of diagnostics with actionable suggested edits,
- regression rate after repair.

Design implication:

Diagnostics should be designed as an API, not as terminal prose only. Human text can be derived from structured diagnostic data.

### 5. Verifiability

The language should make correctness claims executable or checkable.

Desired properties:

- examples are executable tests,
- contracts are part of function/module declarations,
- property tests are first-class,
- pattern matching is checked for exhaustiveness,
- optional bounded model checking for finite-state components,
- optional proof hooks for critical libraries,
- generated tests and implementation can be scored separately.

Metrics:

- hidden-test pass rate,
- property-test failure rate,
- contract coverage,
- percentage of public APIs with executable examples,
- rate of overfitting to visible examples.

Design implication:

Specifications should live in the same artifact graph as code. Comments can explain, but checkable declarations should constrain.
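To illustrate why checkable declarations constrain more than visible examples do, here is a small hand-rolled property check in Python. The `dedupe` function, the generator, and the harness are all hypothetical; a real toolchain would presumably provide property testing natively.

```python
import random
from typing import Callable, Optional


def dedupe(xs: list[int]) -> list[int]:
    """Remove duplicates, keeping the first occurrence of each value."""
    seen: set[int] = set()
    out: list[int] = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def dedupe_property(impl: Callable[[list[int]], list[int]], xs: list[int]) -> bool:
    out = impl(xs)
    return (
        len(out) == len(set(out))      # no duplicates remain
        and set(out) == set(xs)        # no value is lost or invented
        # values appear in order of their first occurrence in the input
        and all(xs.index(a) < xs.index(b) for a, b in zip(out, out[1:]))
    )


def small_lists(rng: random.Random) -> list[int]:
    return [rng.randint(0, 5) for _ in range(rng.randint(0, 8))]


def check_property(
    prop: Callable[[list[int]], bool],
    gen: Callable[[random.Random], list[int]],
    runs: int = 200,
    seed: int = 0,
) -> Optional[list[int]]:
    """Return the first failing input, or None if every generated case passes."""
    rng = random.Random(seed)
    for _ in range(runs):
        case = gen(rng)
        if not prop(case):
            return case
    return None


def overfit(xs: list[int]) -> list[int]:
    """Passes the visible example [1, 1, 2] -> [1, 2] but does not preserve input order."""
    return sorted(set(xs))
```

Running `check_property(lambda xs: dedupe_property(dedupe, xs), small_lists)` finds no failure, while the same check against `overfit` returns a counterexample, which is the kind of overfitting the metrics above are meant to expose.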

### 6. Context economy

LLMs operate under finite context windows and retrieval limits. The language should minimize necessary prompt context.

Desired properties:

- compact interface summaries,
- canonical examples for every public API,
- stable symbol IDs,
- no dependency on long prose documentation for ordinary usage,
- small standard-library surface with orthogonal operations,
- compiler-produced project maps.

Metrics:

- tokens required to describe a task and relevant APIs,
- performance under context truncation,
- number of retrieved snippets used in successful runs,
- symbol ambiguity rate.

Design implication:

Every public package should be able to emit a concise machine-readable summary optimized for generation and repair.

### 7. Hallucination resistance

LLMs invent plausible names when APIs are large, inconsistent, or version-sensitive.

Desired properties:

- standard library names are regular and semantically compositional,
- package APIs include versioned symbol indexes,
- imports fail with closest-valid alternatives and version notes,
- deprecated APIs include migration metadata,
- package manager rejects undeclared dependencies.

Metrics:

- nonexistent function/type/module references,
- wrong-version API usage,
- undeclared dependency references,
- incorrect import paths,
- repair success after unknown-symbol diagnostics.

Design implication:

The compiler and package manager should jointly prevent “ambient API guessing.”
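A minimal sketch of an unknown-symbol diagnostic with closest-valid alternatives, written in Python with the standard `difflib` module. The symbol index contents, the version string, and the `E-UNKNOWN-SYMBOL` code are invented for illustration.

```python
import difflib

# Hypothetical versioned symbol index for one installed package.
SYMBOL_INDEX = {
    "2.1.0": ["list.map", "list.filter", "list.fold", "text.split", "text.join"],
}


def resolve_symbol(name: str, version: str) -> dict:
    """Resolve a symbol against the installed version, or suggest near matches."""
    symbols = SYMBOL_INDEX[version]
    if name in symbols:
        return {"ok": True, "symbol": name, "version": version}
    # Rank existing symbols by string similarity to the guessed name.
    suggestions = difflib.get_close_matches(name, symbols, n=3, cutoff=0.6)
    return {
        "ok": False,
        "code": "E-UNKNOWN-SYMBOL",
        "symbol": name,
        "version": version,
        "suggestions": suggestions,
    }


result = resolve_symbol("list.filtr", "2.1.0")
print(result["suggestions"][0])  # list.filter
```

Because the suggestions come from the index of the installed version, a repair step can only be steered toward symbols that actually exist.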

### 8. Safe defaults

Generated code should be safe by default and noisy when unsafe.

Desired properties:

- immutable data by default,
- no implicit shell execution,
- safe path handling APIs,
- parameterized database APIs,
- no untyped deserialization by default,
- secrets are non-printable by default,
- explicit capabilities for filesystem, network, subprocess, environment, clock, randomness, and unsafe memory.

Metrics:

- vulnerability count by class,
- use of unsafe escape hatches,
- count of tainted-input paths that reach a sensitive sink,
- secret logging incidents,
- authorization-check omissions in security tasks.

Design implication:

Security-critical APIs should make the safe path shorter and more canonical than the unsafe path.
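As one illustration of "secrets are non-printable by default", the Python sketch below wraps a sensitive value so that ordinary printing and logging redact it. The `Secret` class and its `expose` method are hypothetical names, not part of any specified standard library.

```python
class Secret:
    """Wraps a sensitive string so that printing or logging it never reveals the value."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def __repr__(self) -> str:
        return "Secret(<redacted>)"

    # str() and f-strings use the same redacted form as repr().
    __str__ = __repr__

    def expose(self) -> str:
        """Explicit, easily searchable escape hatch for the one place the raw value is needed."""
        return self._value


api_key = Secret("hunter2")
print(f"connecting with {api_key}")  # connecting with Secret(<redacted>)
```

The safe path (`print(api_key)`) is the short one; leaking the value requires a deliberate `expose()` call that a reviewer or linter can find.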

### 9. Performance legibility

Models often produce asymptotically or practically inefficient code. The language should make cost visible.

Desired properties:

- standard collection operations have documented complexity in symbol summaries,
- compiler warns on common accidental quadratic patterns where feasible,
- benchmark contracts can specify input sizes and performance envelopes,
- profiler output is compact and comparable,
- value copying and allocation are visible enough to reason about.

Metrics:

- timeout rate,
- memory limit failures,
- asymptotic mismatch classification,
- performance relative to idiomatic baseline,
- avoidable allocation/copy diagnostics.

Design implication:

Performance should not depend on hidden magic. Cost models should be exposed through tooling and documentation.

### 10. Human auditability

LLM-optimized does not mean human-hostile.

Desired properties:

- readable canonical formatting,
- concise names and explicit types at important boundaries,
- generated-code provenance metadata,
- review summaries for effects and unsafe operations,
- easy diffing and semantic change reports,
- explanations linked to compiled artifacts, not unverifiable comments.

Metrics:

- human review time,
- reviewer defect detection rate,
- maintainability rubric score,
- semantic diff clarity,
- generated-comment accuracy.

Design implication:

The language should make machine-generated code easier for humans to trust, not harder.

## Candidate language affordances implied by the model

The following are not final specifications, but they are strongly suggested by the optimization criteria.

### Interface-first modules

Each module should expose a compact interface containing:

- exported symbols,
- type signatures,
- effect declarations,
- error types,
- contracts and examples,
- version or symbol identity.

Agents should be able to generate against interfaces before implementations.
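The following Python sketch shows one possible shape for such an interface and a compact text rendering of it. The field names, the card layout, and the `storage.blob` module are assumptions for illustration, not a proposed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Export:
    name: str
    signature: str
    effects: tuple[str, ...] = ()
    errors: tuple[str, ...] = ()
    example: str = ""


@dataclass(frozen=True)
class ModuleInterface:
    module: str
    version: str
    exports: tuple[Export, ...]


def render_card(iface: ModuleInterface) -> str:
    """Render a compact, line-oriented summary suitable for a prompt."""
    lines = [f"module {iface.module}@{iface.version}"]
    for e in iface.exports:
        effects = ",".join(e.effects) or "pure"
        errors = "|".join(e.errors) or "none"
        lines.append(f"  {e.name}{e.signature} effects[{effects}] errors[{errors}]")
        if e.example:
            lines.append(f"    example: {e.example}")
    return "\n".join(lines)


iface = ModuleInterface(
    module="storage.blob",
    version="1.4.0",
    exports=(
        Export(
            "read",
            "(path: Path) -> Result[Bytes, ReadError]",
            effects=("fs",),
            errors=("ReadError",),
            example='read(Path("a.bin"))',
        ),
        Export("checksum", "(data: Bytes) -> U32"),
    ),
)
print(render_card(iface))
```

An agent given only this card has the signatures, effects, and error types it needs to write a caller, without seeing the implementation.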

### Typed effects and capabilities

Effects should be tracked at least at public boundaries. A simple effect set might include:

- `fs`
- `net`
- `time`
- `random`
- `env`
- `process`
- `db`
- `log`
- `unsafe`
- `async`

Capabilities should be explicit values or declarations, preventing accidental access to sensitive operations.
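A minimal Python sketch of a capability passed as an explicit value, using the `time` effect as the example. The `Clock` type and `is_expired` function are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Clock:
    """Hypothetical capability value for the `time` effect."""
    now: Callable[[], float]


def is_expired(clock: Clock, expires_at: float) -> bool:
    # The dependency on time is visible in the signature; there is no hidden time.time() call.
    return clock.now() >= expires_at


system_clock = Clock(now=time.time)        # granted at the program's entry point
fixed_clock = Clock(now=lambda: 100.0)     # deterministic substitute for tests

print(is_expired(fixed_clock, 50.0))   # True
print(is_expired(fixed_clock, 150.0))  # False
```

A function that does not receive a `Clock` cannot observe the time, which is the property a checked effect system would enforce rather than leave to convention.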

### Result and option handling

Fallible and nullable operations should not be represented by implicit exceptions or unchecked nulls. The language should require every use site to do one of:

- handle all cases,
- propagate explicitly,
- convert with an explicit default or error mapping.
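The three options can be sketched in Python with a small `Result` type. Python does not enforce any of this at compile time, so the example only shows the shape of the call sites; `Ok`, `Err`, `parse_port`, and the other names are hypothetical.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err(Generic[E]):
    error: E


Result = Union[Ok[T], Err[E]]


def parse_port(text: str) -> Result[int, str]:
    try:
        port = int(text)
    except ValueError:
        return Err(f"not a number: {text!r}")
    if not 1 <= port <= 65535:
        return Err(f"port out of range: {port}")
    return Ok(port)


def describe(text: str) -> str:
    """Option 1: handle all cases."""
    result = parse_port(text)
    if isinstance(result, Ok):
        return f"listening on {result.value}"
    return f"invalid port ({result.error})"


def load_config(text: str) -> Result[dict, str]:
    """Option 2: propagate explicitly, mapping the error on the way."""
    result = parse_port(text)
    if isinstance(result, Err):
        return Err(f"config: {result.error}")
    return Ok({"port": result.value})


def port_or_default(text: str, default: int = 8080) -> int:
    """Option 3: convert with an explicit default."""
    result = parse_port(text)
    return result.value if isinstance(result, Ok) else default
```

In each case the failure path is written out at the call site, so a reader or a model never has to guess which exceptions might escape.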

### Exhaustive data modeling

Algebraic data types, tagged unions, records, and exhaustive pattern matching should be core features. Generated code should not need to encode domain variants through stringly typed conventions.

### Contracts and examples

Functions should support attached executable examples and contracts:

- preconditions,
- postconditions,
- invariants,
- generated boundary cases,
- property-test generators.

The syntax must remain concise enough that LLMs reliably produce it.
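A rough Python approximation of attached contracts and executable examples, using a decorator. The `contract` decorator and its keyword names are hypothetical; a dedicated language would give these a native, more concise syntax.

```python
from functools import wraps
from typing import Callable


def contract(*, pre: Callable[..., bool], post: Callable[..., bool], examples: list):
    """Attach a precondition, a postcondition, and executable examples to a function."""

    def decorate(fn):
        @wraps(fn)
        def checked(*args):
            if not pre(*args):
                raise ValueError(f"precondition failed for {fn.__name__}{args}")
            result = fn(*args)
            if not post(result, *args):
                raise AssertionError(f"postcondition failed for {fn.__name__}{args}")
            return result

        # Examples run once at definition time, so a wrong example fails immediately.
        for args, expected in examples:
            actual = checked(*args)
            if actual != expected:
                raise AssertionError(f"example failed: {args} gave {actual}, expected {expected}")
        return checked

    return decorate


@contract(
    pre=lambda xs: len(xs) > 0,
    post=lambda result, xs: result in xs and all(result >= x for x in xs),
    examples=[(([3, 1, 2],), 3), (([-5],), -5)],
)
def largest(xs: list[int]) -> int:
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


print(largest([1, 9, 4]))  # 9
```

Calling `largest([])` raises `ValueError` from the precondition instead of an `IndexError` from inside the implementation, which gives a repair step a clearer signal.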

### Structured diagnostics

Diagnostics should have a stable schema such as:

- diagnostic code,
- severity,
- primary span,
- related spans,
- expected value/type/effect,
- actual value/type/effect,
- likely cause,
- ranked suggested fixes,
- machine-applicable edits where safe,
- documentation references tied to installed versions.
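A sketch of such a schema as Python dataclasses, with JSON as the machine-readable form and the human-readable line derived from the same data. Field names follow the list above; the diagnostic code, file name, and documentation reference are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Span:
    file: str
    start_line: int
    start_col: int
    end_line: int
    end_col: int


@dataclass(frozen=True)
class Edit:
    span: Span
    replacement: str
    machine_applicable: bool   # True only when applying the edit is known to be safe


@dataclass(frozen=True)
class Diagnostic:
    code: str
    severity: str
    primary_span: Span
    expected: str
    actual: str
    likely_cause: str
    related_spans: list[Span] = field(default_factory=list)
    suggested_fixes: list[Edit] = field(default_factory=list)  # ranked, best first
    doc_refs: list[str] = field(default_factory=list)


def render_human(d: Diagnostic) -> str:
    """Derive the terminal message from the structured record."""
    s = d.primary_span
    return (
        f"{s.file}:{s.start_line}:{s.start_col}: {d.severity}[{d.code}]: "
        f"expected {d.expected}, found {d.actual}"
    )


span = Span("main.src", 12, 9, 12, 13)
diag = Diagnostic(
    code="E-TYPE-MISMATCH",
    severity="error",
    primary_span=span,
    expected="Int",
    actual="Text",
    likely_cause="string literal passed where a number is required",
    suggested_fixes=[Edit(span, "8080", machine_applicable=False)],
    doc_refs=["std@1.0.0/types#Int"],
)

print(render_human(diag))  # main.src:12:9: error[E-TYPE-MISMATCH]: expected Int, found Text
print(json.dumps(asdict(diag)))
```

An agent consumes the JSON record directly, while the terminal message is a projection of the same fields, so the two cannot drift apart.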

### Project graph summaries

The toolchain should produce summaries for:

- modules,
- exports,
- imports,
- dependencies,
- effects,
- capabilities,
- tests,
- contracts,
- generated-code provenance.

These summaries are context packets for LLM agents.

## Anti-features

The following features are risky for an LLM-first language unless heavily constrained:

- implicit global imports,
- monkeypatching,
- uncontrolled macros,
- runtime reflection as ordinary architecture,
- untyped exceptions across module boundaries,
- unchecked null,
- broad dynamic values without contracts,
- hidden dependency injection,
- order-dependent module initialization,
- package install scripts with arbitrary behavior,
- multiple competing build systems,
- formatting choices that alter semantics subtly,
- ambiguous overload resolution,
- silent numeric coercions,
- implicit string-to-code execution.

## Evaluation proxies

Because the true objective is probabilistic and expensive to measure directly, the project should use proxy metrics:

| Optimization dimension | Proxy metrics |
| --- | --- |
| Syntactic determinism | Parse success, syntax repair count, formatter stability |
| Semantic locality | Context-token requirement, cross-file mismatch rate |
| Boundary explicitness | Unhandled fallible/null operations, undeclared effects |
| Repairability | Repair iterations, diagnostic usefulness, regression rate |
| Verifiability | Hidden-test pass, contract coverage, property failure rate |
| Context economy | Success under truncated context, retrieved-token count |
| Hallucination resistance | Unknown symbols, wrong imports, wrong API versions |
| Safe defaults | Vulnerability classes, unsafe escape-hatch usage |
| Performance legibility | Timeouts, complexity mismatches, memory failures |
| Human auditability | Review score, semantic diff quality, maintainability |

## Feature acceptance test

A proposed language feature should answer yes to most of the following:

1. Does it reduce a measured LLM failure mode?
2. Can it be explained in a compact interface summary?
3. Can compiler diagnostics repair common mistakes involving it?
4. Does it avoid adding multiple equivalent spellings?
5. Does it improve or preserve human auditability?
6. Does it interact cleanly with modules, effects, tests, and packages?
7. Can benchmark tasks measure its impact?
8. Are escape hatches explicit and reviewable?

If a feature primarily helps expert human expressiveness but increases generation ambiguity, it should be deferred or redesigned.

## Summary

LLM optimization is an engineering objective over a generation-and-repair system. The language should reduce invalid generation, expose hidden assumptions, make correctness checkable, and turn failures into structured repair opportunities. The benchmark framework must test these claims against Python, Rust, and TypeScript baselines rather than assume them.