# Project Charter

## Project name

**AI Creates a Programming Language**

The language itself is not named in this milestone. Naming should occur after the initial syntax, semantics, and evaluation criteria are validated enough to avoid branding a premature design.

## Mission

Design and build a programming language, compiler, standard tooling, examples, and benchmarks optimized for LLM-assisted and LLM-autonomous software creation.

The central claim the project sets out to test:

> A programming language designed around the strengths and weaknesses of LLM code generation can achieve higher end-to-end correctness, lower repair cost, and better safety than Python, Rust, or TypeScript on comparable tasks.

## Charter outcome for this milestone

This milestone establishes:

- the research basis for language and tooling choices,
- the definition of “optimized for LLMs,”
- the initial success metrics,
- the benchmark and evaluation plan,
- the scope boundaries and project principles.

It intentionally does not choose every future language feature. It defines the decision framework by which features should be proposed, tested, accepted, or rejected.

## Stakeholders

### Primary stakeholders

- **LLM agent developers** who need a reliable target language for generated systems.
- **Human reviewers** responsible for auditing generated code.
- **Compiler/tooling engineers** building language servers, diagnostics, package tooling, and test harnesses.
- **Application developers** who will use generated code in production systems.

### Secondary stakeholders

- Programming language researchers.
- AI safety and security reviewers.
- Benchmark maintainers.
- Educators studying generated software.
- Organizations evaluating whether generated code is trustworthy.

## Problem statement

Mainstream languages were designed primarily for human authors. LLMs can generate them because they appear heavily in training data, not because the languages expose an ideal machine-generation interface. Common failures include:

- plausible but nonexistent APIs,
- mismatched interfaces across files,
- missing error handling,
- hidden framework conventions,
- wrong dependency versions,
- insufficient tests,
- security mistakes,
- poor use of compiler feedback,
- fragile patches that pass visible tests only.

A new language can address these problems by changing the generation target itself: syntax, semantics, standard library, package metadata, diagnostics, and project conventions can all be made more predictable and verifiable.

## Definition of success

The project succeeds if, on a transparent benchmark suite, the new language demonstrates measurable improvements over Python, Rust, and TypeScript for LLM-generated programs.

Success must be measured across multiple axes, not only benchmark pass rate:

1. **Correctness**: generated programs pass public and hidden tests.
2. **First-pass validity**: generated programs parse, typecheck, build, and run without repair.
3. **Repair efficiency**: compiler/test feedback leads to fewer iterations and lower token cost.
4. **Interface consistency**: multi-file symbols and schemas remain coherent.
5. **Security**: generated programs avoid common vulnerability classes.
6. **Maintainability**: code is understandable, idiomatic for the language, and easy to modify.
7. **Performance**: programs meet task-level performance constraints.
8. **Reproducibility**: builds and tests are deterministic from declared inputs.

## Initial quantitative targets

The following targets are proposed for the first serious benchmarked prototype. They are intentionally ambitious but testable.

Against the best-performing baseline among Python, Rust, and TypeScript for each task family, the new language should achieve:

- at least **20% relative improvement** in first-generation compile/build success rate,
- at least **15% relative improvement** in hidden-test pass rate after a fixed repair budget,
- at least **25% reduction** in median repair iterations for tasks requiring repair,
- at least **25% reduction** in hallucinated API or missing-symbol defects,
- no statistically significant increase in critical security defects,
- task performance within an accepted envelope defined per benchmark category.

These targets are not guaranteed outcomes. They are the first evaluation thresholds that determine whether the design direction is working.
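The relative thresholds above reduce to simple arithmetic once rates are measured. The following is a minimal sketch, assuming success rates are expressed as fractions between 0 and 1; the function names and the sample numbers are hypothetical illustrations, not measured results or part of any agreed harness.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Relative gain of candidate over baseline for higher-is-better metrics."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


def relative_reduction(baseline: float, candidate: float) -> float:
    """Relative drop of candidate below baseline for lower-is-better metrics."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


def meets_build_target(baseline_rates: dict, new_rate: float) -> bool:
    """Compare against the best baseline, as the charter requires (20% relative)."""
    best = max(baseline_rates.values())
    return relative_improvement(best, new_rate) >= 0.20


# Hypothetical first-generation build success rates for one task family.
sample_baselines = {"python": 0.60, "rust": 0.50, "typescript": 0.55}
print(meets_build_target(sample_baselines, 0.75))  # 0.60 -> 0.75 is a 25% relative gain
print(relative_reduction(4, 3))                    # median repair iterations 4 -> 3
```

Note that the comparison is relative, not absolute: moving from 0.60 to 0.75 is a 15-point absolute gain but a 25% relative improvement over the best baseline.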

## Project principles

### 1. Optimize the whole loop

The unit of optimization is not syntax alone. The project optimizes the loop:

prompt → generated code → compiler/build → diagnostics → repair → tests → review → deployment artifact.

### 2. Prefer explicitness at boundaries

LLMs fail more often when assumptions are implicit. Public functions, module exports, external I/O, effects, permissions, error types, and environment requirements should be explicit and machine-checkable.

### 3. Make the common path canonical

For every common task, there should be one obvious project shape, one obvious error model, one obvious formatter result, and one obvious testing convention.

### 4. Design diagnostics as an API

Compiler and tool diagnostics are part of the language interface. They must be stable, structured, and useful for both humans and agents.
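
As an illustration of what "stable and structured" could mean in practice, the sketch below models one diagnostic as a typed record that serializes to JSON. The field names, the `E0102` code scheme, and the file name are assumptions made for this example; the charter does not define a diagnostic schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Diagnostic:
    code: str                  # stable identifier (hypothetical scheme)
    severity: str              # "error", "warning", or "note"
    message: str               # human-readable explanation
    file: str
    line: int
    column: int
    suggestions: tuple = ()    # machine-applicable fix hints, if any


def to_json(diag: Diagnostic) -> str:
    """Serialize with sorted keys so output is byte-stable across runs."""
    return json.dumps(asdict(diag), sort_keys=True)


example = Diagnostic(
    code="E0102",
    severity="error",
    message="symbol 'parse_config' is not exported by module 'settings'",
    file="app/main.src",
    line=12,
    column=5,
    suggestions=("add 'parse_config' to the export list of 'settings'",),
)
print(to_json(example))
```

A human reads the `message`, while an agent can branch on `code` and apply `suggestions` without parsing free text.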

### 5. Support graded assurance

Not all code requires proof-level verification. The language should support a ladder of assurance:

- examples,
- unit tests,
- property tests,
- contracts,
- bounded checks,
- optional proofs or external verification.
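
To make the lower rungs of the ladder concrete, here is a small sketch written in Python as a stand-in, since the new language has no syntax yet. The `clamp` function and all helper names are hypothetical, and the property test is hand-rolled with a seeded random generator rather than taken from a testing library.

```python
import random


def clamp(value: int, low: int, high: int) -> int:
    # Contract (precondition): the range must be well-formed.
    if low > high:
        raise ValueError("low must not exceed high")
    result = max(low, min(value, high))
    # Contract (postcondition): the result always lies in the range.
    assert low <= result <= high
    return result


# Rung 1, example: a single illustrative case.
assert clamp(5, 0, 3) == 3


# Rung 2, unit test: a named, repeatable check of one behavior.
def test_clamp_inside_range() -> None:
    assert clamp(2, 0, 3) == 2


# Rung 3, property test: random inputs, seeded so runs are reproducible.
def check_clamp_property(trials: int = 1000, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        low = rng.randint(-100, 100)
        high = rng.randint(low, 100)
        value = rng.randint(-1000, 1000)
        result = clamp(value, low, high)
        if not low <= result <= high:
            return False
        if low <= value <= high and result != value:
            return False
    return True


# Rung 5, bounded check: exhaustive over a small finite domain.
def check_clamp_bounded(bound: int = 5) -> bool:
    domain = range(-bound, bound + 1)
    for low in domain:
        for high in range(low, bound + 1):
            for value in domain:
                expected = low if value < low else high if value > high else value
                if clamp(value, low, high) != expected:
                    return False
    return True
```

Each rung costs more to write and run than the one before it, which is why the ladder is graded rather than mandatory at every level.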

### 6. Keep escape hatches visible

Unsafe operations, unchecked casts, broad dynamic values, shell execution, network access, global mutation, and reflection-like behavior should be syntactically and semantically visible.

### 7. Benchmark honestly

The new language must compete against well-configured Python, Rust, and TypeScript baselines. Evaluation must not rely on strawman prompts, intentionally poor baseline tooling, or tasks tailored only to the new language’s strengths.

### 8. Preserve human accountability

The project optimizes for LLM generation but must remain reviewable by humans. Generated code should not become opaque.

## Scope

### In scope for the overall project

- Language design: syntax, semantics, type system, effects, errors, modules.
- Compiler or interpreter implementation.
- Formatter, linter, test runner, package tooling, and language server.
- Standard library design for common benchmark domains.
- Examples and documentation oriented toward LLM use.
- Benchmark harness comparing against Python, Rust, and TypeScript.
- Reproducible experiment logs and reports.

### In scope for milestone 1

- Research dossier.
- Project charter.
- Definition of LLM optimization.
- Evaluation metrics and protocol.
- Initial benchmark plan and task catalog.

### Out of scope for milestone 1

- Compiler implementation.
- Final syntax.
- Standard library implementation.
- Executable benchmark harness.
- Claims of empirical superiority.

### Non-goals for the overall project

- Replacing every mainstream language use case.
- Designing a natural-language programming system with no formal syntax.
- Maximizing brevity at the expense of correctness.
- Depending on a single proprietary LLM.
- Hiding all complexity from human reviewers.
- Treating benchmark score as the only measure of value.
- Encouraging unreviewed deployment of generated code.

## Design constraints

The future language should aim for:

- deterministic parsing with a small grammar,
- canonical formatting,
- explicit module exports and imports,
- public boundary type annotations,
- explicit effect and capability declarations,
- typed errors and nullability,
- immutable data by default,
- structured concurrency if concurrency is supported,
- integrated tests and contracts,
- machine-readable compiler output,
- reproducible package resolution,
- stable standard-library API summaries.

## Evaluation commitments

The project commits to evaluating at least the following baselines:

- Python with current CPython, type-checking where task-appropriate, formatter/linter where task-appropriate.
- Rust with current stable toolchain, Cargo, formatter, Clippy where task-appropriate.
- TypeScript with current Node.js LTS-compatible runtime, TypeScript compiler strict mode, formatter/linter where task-appropriate.

Exact tool versions should be recorded by the benchmark harness when implemented rather than hard-coded in this charter.
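
One possible way a future harness could record versions is to ask each tool directly at run time. This is a minimal sketch under stated assumptions: the function names are hypothetical, the tool list is illustrative, and a tool that is not installed is recorded as `None` rather than treated as a failure.

```python
import subprocess
import sys
from typing import Optional


def tool_version(command: list) -> Optional[str]:
    """Return the first line a tool prints for its version, or None if unavailable."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=10, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def record_versions(tools: dict) -> dict:
    """Map each tool name to its reported version string."""
    return {name: tool_version(cmd) for name, cmd in tools.items()}


# Illustrative tool list; a real harness would define its own.
versions = record_versions(
    {
        "python": [sys.executable, "--version"],
        "rustc": ["rustc", "--version"],
        "cargo": ["cargo", "--version"],
        "node": ["node", "--version"],
        "tsc": ["tsc", "--version"],
    }
)
print(versions)
```

Storing this mapping alongside each experiment log keeps results traceable to the exact toolchain without pinning versions in the charter.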

## Governance and decision process

Language and tooling decisions should be evaluated using the following decision record fields:

1. Problem addressed.
2. Alternative designs considered.
3. Expected LLM-generation benefit.
4. Expected human-maintenance impact.
5. Static/dynamic guarantees added.
6. Complexity cost.
7. Benchmark tasks affected.
8. Failure modes.
9. Decision and rationale.

A feature should not be accepted solely because it is elegant, familiar, or popular. It should improve measurable generation, repair, safety, or maintainability outcomes without disproportionate complexity.
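
The nine decision record fields map naturally onto a typed record that tooling can validate. The sketch below is one hypothetical encoding; the class name, field names, and the sample proposal are assumptions for illustration and do not represent an adopted template or an actual decision.

```python
from dataclasses import dataclass, fields


@dataclass
class DecisionRecord:
    problem: str
    alternatives: list
    llm_generation_benefit: str
    human_maintenance_impact: str
    guarantees_added: str
    complexity_cost: str
    benchmark_tasks_affected: list
    failure_modes: list
    decision_and_rationale: str


def missing_fields(record: DecisionRecord) -> list:
    """Names of fields left empty, so incomplete proposals can be rejected early."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Hypothetical proposal with one field deliberately left empty.
draft = DecisionRecord(
    problem="Generated code omits error handling at I/O boundaries",
    alternatives=["exceptions", "typed result values"],
    llm_generation_benefit="Unhandled errors become compile-time failures",
    human_maintenance_impact="More explicit call sites to review",
    guarantees_added="Static: every error path is typed",
    complexity_cost="Additional syntax for propagation",
    benchmark_tasks_affected=["file processing", "HTTP clients"],
    failure_modes=[],
    decision_and_rationale="Pending benchmark evidence",
)
print(missing_fields(draft))
```

A check like `missing_fields` lets a proposal be bounced automatically before reviewers spend time on it.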

## Milestone roadmap after this charter

### Milestone 2: Language design specification

Define the initial syntax, semantics, module system, type model, effects, errors, and core standard library surface.

### Milestone 3: Compiler or interpreter prototype

Implement parser, semantic analyzer, diagnostics, formatter, and a minimal runtime or code generation target.

### Milestone 4: Tooling for LLM workflows

Implement structured diagnostics, project graph summaries, language-server functionality, generated-code metadata, and repair-loop affordances.

### Milestone 5: Examples and standard library expansion

Build representative programs and libraries needed for benchmark coverage.

### Milestone 6: Benchmark harness and baseline implementations

Implement task runners, hidden tests, scoring, and baseline solutions/prompts for Python, Rust, TypeScript, and the new language.

### Milestone 7: Empirical evaluation and design iteration

Run controlled experiments, publish results, identify weaknesses, and revise language/tooling design.

## Risk register

| Risk | Impact | Mitigation |
| --- | --- | --- |
| LLMs perform better in mainstream languages because of training prevalence | The new language may underperform early | Provide excellent examples, canonical docs, and structured tooling; evaluate learning curves separately. |
| Language becomes too complex | Agents and humans both fail more often | Enforce small grammar and feature acceptance criteria tied to benchmarks. |
| Benchmark overfits to the new language | Results are not credible | Use diverse tasks, hidden tests, baseline experts, and public protocols. |
| Tooling burden is underestimated | Compiler exists but ecosystem is unusable | Treat tooling as core scope, not an add-on. |
| Strong guarantees reduce productivity | First-pass code becomes verbose or hard to generate | Use inference, defaults, and graded assurance while preserving explicit boundaries. |
| Security claims are overstated | Generated code may be deployed unsafely | Measure concrete vulnerability classes and require human accountability. |
| Dependency ecosystem is too small | Real-world adoption stalls | Start with domains that need few dependencies, design interop later, and expose package metadata clearly. |
| Model-specific optimization ages poorly | Language helps one generation of models only | Optimize around durable failure modes: ambiguity, hidden state, missing contracts, weak diagnostics. |

## Charter acceptance criteria

This charter is considered complete when it provides:

- a mission and problem statement,
- a definition of project success,
- measurable initial targets,
- scope and non-goals,
- design principles,
- evaluation commitments,
- governance process,
- major risks.

These criteria are satisfied by this document and the companion evaluation files in this milestone.