# Project Charter

## Project name

**AI Creates a Programming Language**

The language itself is not named in this milestone. Naming should occur after the initial syntax, semantics, and evaluation criteria are validated enough to avoid branding a premature design.

## Mission

Design and build a programming language, compiler, standard tooling, examples, and benchmarks optimized for LLM-assisted and LLM-autonomous software creation.

The project’s central claim to test:

> A programming language designed around the strengths and weaknesses of LLM code generation can achieve higher end-to-end correctness, lower repair cost, and better safety than Python, Rust, or TypeScript on comparable tasks.

## Charter outcome for this milestone

This milestone establishes:

- the research basis for language and tooling choices,
- the definition of “optimized for LLMs,”
- the initial success metrics,
- the benchmark and evaluation plan,
- the scope boundaries and project principles.

It intentionally does not choose every future language feature. It defines the decision framework by which features should be proposed, tested, accepted, or rejected.

## Stakeholders

### Primary stakeholders

- **LLM agent developers** who need a reliable target language for generated systems.
- **Human reviewers** responsible for auditing generated code.
- **Compiler/tooling engineers** building language servers, diagnostics, package tooling, and test harnesses.
- **Application developers** who will use generated code in production systems.

### Secondary stakeholders

- Programming language researchers.
- AI safety and security reviewers.
- Benchmark maintainers.
- Educators studying generated software.
- Organizations evaluating whether generated code is trustworthy.

## Problem statement

Mainstream languages were designed primarily for human authors. LLMs can generate them because they appear heavily in training data, not because the languages expose an ideal machine-generation interface.

Common failures include:

- plausible but nonexistent APIs,
- mismatched interfaces across files,
- missing error handling,
- hidden framework conventions,
- wrong dependency versions,
- insufficient tests,
- security mistakes,
- poor use of compiler feedback,
- fragile patches that pass visible tests only.

A new language can address these problems by changing the generation target itself: syntax, semantics, standard library, package metadata, diagnostics, and project conventions can all be made more predictable and verifiable.

## Definition of success

The project succeeds if the new language demonstrates, on a transparent benchmark suite, meaningful improvements over Python, Rust, and TypeScript for LLM-generated programs.

Success must be measured across multiple axes, not only benchmark pass rate:

1. **Correctness**: generated programs pass public and hidden tests.
2. **First-pass validity**: generated programs parse, typecheck, build, and run without repair.
3. **Repair efficiency**: compiler/test feedback leads to fewer iterations and lower token cost.
4. **Interface consistency**: multi-file symbols and schemas remain coherent.
5. **Security**: generated programs avoid common vulnerability classes.
6. **Maintainability**: code is understandable, idiomatic for the language, and easy to modify.
7. **Performance**: programs meet task-level performance constraints.
8. **Reproducibility**: builds and tests are deterministic from declared inputs.

## Initial quantitative targets

The following targets are proposed for the first serious benchmarked prototype. They are intentionally ambitious but testable.
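
The targets in this section are stated as relative changes against a baseline, which is not the same as a difference in percentage points. The following is a minimal sketch of that arithmetic; the function names and the example rates are hypothetical and chosen only for illustration, not taken from any measurement.

```python
def relative_improvement(baseline: float, candidate: float) -> float:
    """Relative gain over the baseline for higher-is-better metrics."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


def relative_reduction(baseline: float, candidate: float) -> float:
    """Relative drop from the baseline for lower-is-better metrics."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - candidate) / baseline


# Hypothetical numbers for illustration only; not benchmark results.
baseline_build_rate = 0.60   # best baseline's first-generation build success
candidate_build_rate = 0.75  # new language on the same task family

# A 15 percentage-point gain on a 0.60 baseline is a 25% relative improvement.
build_gain = relative_improvement(baseline_build_rate, candidate_build_rate)
meets_build_target = build_gain >= 0.20

# Median repair iterations falling from 4 to 3 is a 25% relative reduction.
repair_cut = relative_reduction(4, 3)
meets_repair_target = repair_cut >= 0.25
```

Stating each target relative to the best baseline per task family keeps the threshold meaningful even when baseline rates differ widely between task families.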

Against the best-performing baseline among Python, Rust, and TypeScript for each task family, the new language should achieve:

- at least **20% relative improvement** in compiler/build success on first generation,
- at least **15% relative improvement** in hidden-test pass rate after a fixed repair budget,
- at least **25% reduction** in median repair iterations for tasks requiring repair,
- at least **25% reduction** in hallucinated API or missing-symbol defects,
- no statistically significant increase in critical security defects,
- task performance within an accepted envelope defined per benchmark category.

These targets are not guaranteed outcomes. They are the first evaluation thresholds that determine whether the design direction is working.

## Project principles

### 1. Optimize the whole loop

The unit of optimization is not syntax alone. The project optimizes the loop:

prompt → generated code → compiler/build → diagnostics → repair → tests → review → deployment artifact.

### 2. Prefer explicitness at boundaries

LLMs fail when assumptions are implicit. Public functions, module exports, external I/O, effects, permissions, error types, and environment requirements should be explicit and machine-checkable.

### 3. Make the common path canonical

For every common task, there should be one obvious project shape, one obvious error model, one obvious formatter result, and one obvious testing convention.

### 4. Design diagnostics as an API

Compiler and tool diagnostics are part of the language interface. They must be stable, structured, and useful for both humans and agents.

### 5. Support graded assurance

Not all code requires proof-level verification. The language should support a ladder of assurance:

- examples,
- unit tests,
- property tests,
- contracts,
- bounded checks,
- optional proofs or external verification.

### 6. Keep escape hatches visible

Unsafe operations, unchecked casts, broad dynamic values, shell execution, network access, global mutation, and reflection-like behavior should be syntactically and semantically visible.

### 7. Benchmark honestly

The new language must compete against well-configured Python, Rust, and TypeScript baselines. Evaluation must not rely on strawman prompts, intentionally poor baseline tooling, or tasks tailored only to the new language’s strengths.

### 8. Preserve human accountability

The project optimizes for LLM generation but must remain reviewable by humans. Generated code should not become opaque.

## Scope

### In scope for the overall project

- Language design: syntax, semantics, type system, effects, errors, modules.
- Compiler or interpreter implementation.
- Formatter, linter, test runner, package tooling, and language server.
- Standard library design for common benchmark domains.
- Examples and documentation oriented toward LLM use.
- Benchmark harness comparing against Python, Rust, and TypeScript.
- Reproducible experiment logs and reports.

### In scope for milestone 1

- Research dossier.
- Project charter.
- Definition of LLM optimization.
- Evaluation metrics and protocol.
- Initial benchmark plan and task catalog.

### Out of scope for milestone 1

- Compiler implementation.
- Final syntax.
- Standard library implementation.
- Executable benchmark harness.
- Claims of empirical superiority.

### Non-goals for the overall project

- Replacing every mainstream language use case.
- Designing a natural-language programming system with no formal syntax.
- Maximizing brevity at the expense of correctness.
- Depending on a single proprietary LLM.
- Hiding all complexity from human reviewers.
- Treating benchmark score as the only measure of value.
- Encouraging unreviewed deployment of generated code.

## Design constraints

The future language should aim for:

- deterministic parsing with a small grammar,
- canonical formatting,
- explicit module exports and imports,
- public boundary type annotations,
- explicit effect and capability declarations,
- typed errors and nullability,
- immutable data by default,
- structured concurrency if concurrency is supported,
- integrated tests and contracts,
- machine-readable compiler output,
- reproducible package resolution,
- stable standard-library API summaries.

## Evaluation commitments

The project commits to evaluating at least the following baselines:

- Python with current CPython, type-checking where task-appropriate, formatter/linter where task-appropriate.
- Rust with current stable toolchain, Cargo, formatter, Clippy where task-appropriate.
- TypeScript with current Node.js LTS-compatible runtime, TypeScript compiler strict mode, formatter/linter where task-appropriate.

Exact tool versions should be recorded by the benchmark harness when implemented rather than hard-coded in this charter.

## Governance and decision process

Language and tooling decisions should be evaluated using the following decision record fields:

1. Problem addressed.
2. Alternative designs considered.
3. Expected LLM-generation benefit.
4. Expected human-maintenance impact.
5. Static/dynamic guarantees added.
6. Complexity cost.
7. Benchmark tasks affected.
8. Failure modes.
9. Decision and rationale.

A feature should not be accepted solely because it is elegant, familiar, or popular. It should improve measurable generation, repair, safety, or maintainability outcomes without disproportionate complexity.

## Milestone roadmap after this charter

### Milestone 2: Language design specification

Define the initial syntax, semantics, module system, type model, effects, errors, and core standard library surface.

### Milestone 3: Compiler or interpreter prototype

Implement parser, semantic analyzer, diagnostics, formatter, and a minimal runtime or code generation target.

### Milestone 4: Tooling for LLM workflows

Implement structured diagnostics, project graph summaries, language-server functionality, generated-code metadata, and repair-loop affordances.

### Milestone 5: Examples and standard library expansion

Build representative programs and libraries needed for benchmark coverage.

### Milestone 6: Benchmark harness and baseline implementations

Implement task runners, hidden tests, scoring, and baseline solutions/prompts for Python, Rust, TypeScript, and the new language.

### Milestone 7: Empirical evaluation and design iteration

Run controlled experiments, publish results, identify weaknesses, and revise language/tooling design.

## Risk register

| Risk | Impact | Mitigation |
| --- | --- | --- |
| LLMs perform better in mainstream languages because of training prevalence | The new language may underperform early | Provide excellent examples, canonical docs, and structured tooling; evaluate learning curves separately. |
| Language becomes too complex | Agents and humans both fail more often | Enforce small grammar and feature acceptance criteria tied to benchmarks. |
| Benchmark overfits to the new language | Results are not credible | Use diverse tasks, hidden tests, baseline experts, and public protocols. |
| Tooling burden is underestimated | Compiler exists but ecosystem is unusable | Treat tooling as core scope, not an add-on. |
| Strong guarantees reduce productivity | First-pass code becomes verbose or hard to generate | Use inference, defaults, and graded assurance while preserving explicit boundaries. |
| Security claims are overstated | Generated code may be deployed unsafely | Measure concrete vulnerability classes and require human accountability. |
| Dependency ecosystem is too small | Real-world adoption stalls | Start with domains that need few dependencies, design interop later, and expose package metadata clearly. |
| Model-specific optimization ages poorly | Language helps one generation of models only | Optimize around durable failure modes: ambiguity, hidden state, missing contracts, weak diagnostics. |

## Charter acceptance criteria

This charter is considered complete when it provides:

- a mission and problem statement,
- a definition of project success,
- measurable initial targets,
- scope and non-goals,
- design principles,
- evaluation commitments,
- governance process,
- major risks.

Those criteria are satisfied by this document and the companion evaluation files in this milestone.
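
As an illustration of the governance process, the nine decision record fields could be captured in a typed structure so that incomplete records are easy to detect. This is only a sketch: the class name `DecisionRecord`, its field names, the `missing_fields` helper, and the example values are all hypothetical and are not prescribed by this charter.

```python
from dataclasses import dataclass, fields


@dataclass
class DecisionRecord:
    """One language or tooling decision; fields mirror the governance list."""
    problem_addressed: str
    alternatives_considered: list[str]
    expected_llm_generation_benefit: str
    expected_human_maintenance_impact: str
    guarantees_added: str
    complexity_cost: str
    benchmark_tasks_affected: list[str]
    failure_modes: list[str]
    decision_and_rationale: str

    def missing_fields(self) -> list[str]:
        """Return names of fields that are empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft record, used only to show the completeness check.
draft = DecisionRecord(
    problem_addressed="Mismatched interfaces across files",
    alternatives_considered=["Implicit exports", "Explicit export lists"],
    expected_llm_generation_benefit="Fewer missing-symbol defects",
    expected_human_maintenance_impact="Module surface is visible in one place",
    guarantees_added="Static check that imports match exports",
    complexity_cost="One extra declaration per module",
    benchmark_tasks_affected=[],   # not yet identified
    failure_modes=["Export lists drift from implementation"],
    decision_and_rationale="",     # still open
)

print(draft.missing_fields())
# prints ['benchmark_tasks_affected', 'decision_and_rationale']
```

A check like this would let a review step reject a proposal until every field is filled in, which matches the requirement that features be justified by more than elegance or familiarity.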