# Research Dossier: Languages, AI Coding Workflows, and LLM Pain Points

## Executive summary

Large language models can already generate useful code in mainstream languages, but their failures are systematic rather than random. They hallucinate APIs, produce locally plausible but globally inconsistent multi-file systems, misuse error handling and concurrency primitives, ignore edge cases, and struggle to preserve invariants across edits. These failures are amplified by languages and ecosystems that rely on implicit behavior, large mutable APIs, ambiguous build systems, runtime-only validation, or conventions not visible in the prompt context.

A programming language optimized for LLMs should therefore not merely be terse or “English-like.” It should reduce the probability of invalid generation, make invalid states mechanically detectable, and provide diagnostics that support short automated repair loops. The language, compiler, package system, documentation format, and standard library should be treated as one interface for machine generation.

Key conclusions:

1. **LLM performance is shaped by training prevalence and local predictability.** Python and TypeScript benefit from abundant examples, while Rust benefits from precise compiler feedback despite being harder to generate initially.
2. **Static guarantees help only when diagnostics are actionable.** Strong type systems are valuable, but error messages must expose repair intent in a structured form for agent loops.
3. **Context economy is a first-class language property.** Generated code fails when required invariants live in distant files, external docs, hidden framework magic, or dependency versions absent from context.
4. **Canonicality reduces search space.** A language should prefer one obvious representation for modules, errors, effects, data schemas, formatting, and tests.
5. **Specs must live beside code.** Contracts, examples, properties, and resource requirements should be compiled artifacts, not comments that models may ignore.
6. **Tooling must be deterministic and inspectable.** Agents need reproducible builds, stable dependency metadata, structured diagnostics, and machine-readable project graphs.

## Method and scope

This dossier is a desk-research synthesis based on durable public knowledge of programming languages, developer tools, formal methods, and LLM-assisted coding workflows. It does not claim new empirical benchmark results. It identifies design lessons that should be tested by the benchmark framework defined in this milestone.

The survey emphasizes language and workflow properties relevant to generated code:

- syntax and grammar ambiguity,
- type and effect expressiveness,
- error handling and resource lifetime modeling,
- module and package resolution,
- standard-library surface area,
- compiler/runtime diagnostics,
- testability and formal verification hooks,
- ecosystem stability and documentation locality,
- suitability for iterative model repair.

## Survey of existing languages and design lessons

### Python

Python is currently one of the easiest languages for LLMs to generate because examples are abundant and the syntax is compact. Its interactive culture, broad standard library, and readable idioms make it attractive for one-shot code generation.

Strengths for LLM generation:

- Very high representation in training data.
- Low ceremony for scripts and data transformations.
- Exceptions and dynamic typing allow incomplete designs to run early.
- REPL culture supports incremental repair.

Pain points:

- Runtime-only detection for many mistakes.
- Large and version-sensitive dependency ecosystem.
- Ambiguous object shapes when type hints are absent or incomplete.
- Import behavior and packaging metadata often surprise agents.
- Silent bugs from truthiness, mutation, implicit `None`, and duck typing.
- Performance characteristics are frequently mispredicted.

Design lessons:

- Low ceremony improves first attempts, but weak contracts increase hidden failures.
- Type hints help, but optional typing is not enough unless enforced consistently.
- The future language should preserve low-friction scripts while making contracts mandatory at module boundaries.

### Rust

Rust offers strong compile-time guarantees around ownership, borrowing, data races, and error handling. LLMs often struggle to produce correct Rust on the first attempt, but the compiler provides unusually precise feedback for repairs.

Strengths for LLM generation:

- Explicit error handling through `Result` and `Option`.
- Strong module, trait, and ownership models catch many invalid states.
- Compiler diagnostics are detailed and often include suggestions.
- Formatting and package conventions are standardized by `rustfmt` and Cargo.
- Data-race prevention is a strong safety baseline.

Pain points:

- Borrow checker errors require global reasoning about lifetimes and aliasing.
- Trait bounds and generic inference can become verbose.
- Async Rust adds complex pinning, lifetimes, trait objects, and runtime choices.
- API versions and crate feature flags are common hallucination sites.
- The language’s safety model is powerful but not optimized for short prompts.

Design lessons:

- Strong static checking is valuable if paired with model-oriented explanations.
- Ownership-like resource tracking should be exposed through simpler declarative effects where possible.
- Tooling standardization is a major advantage and should be copied.

### TypeScript

TypeScript sits between dynamic JavaScript ergonomics and static checking. It benefits from massive ecosystem familiarity and is common in web and API development.

Strengths for LLM generation:

- High training-data prevalence.
- Structural types are easy to express for JSON-like data.
- Excellent editor tooling and incremental type feedback.
- Good fit for frontend, backend, and schema-driven API code.
- `tsconfig` and package scripts can standardize workflows.

Pain points:

- Runtime JavaScript semantics still apply; static types can be erased or bypassed.
- Many libraries use complex generic APIs that models imitate incorrectly.
- `any`, type assertions, and broad unions can mask defects.
- Module format differences, bundlers, and runtime targets cause build failures.
- Ecosystem churn leads to stale API hallucinations.

Design lessons:

- Structural schema modeling is useful for LLMs, especially for JSON I/O.
- Escape hatches must be visible, audited, and penalized in evaluation.
- Runtime validation should be linked to static types rather than left to convention.

### Go

Go emphasizes simplicity, fast compilation, explicit errors, and standard tooling.

Strengths:

- Small core language and canonical formatting.
- Simple package structure and fast build/test cycle.
- Explicit error returns are easy to inspect.
- Concurrency primitives are simple to name and compose.

Pain points:

- Verbose error propagation can cause omitted checks.
- Interface satisfaction is implicit and can hide intent.
- Nil handling and shared mutable state remain common bug sources.
- Generics are less expressive than in some typed languages.

Design lessons:

- Tooling uniformity and small grammar are high-value LLM affordances.
- Explicit error paths are good, but should be structured enough to prevent ignored failures.

### Java, Kotlin, and C#

These languages show the value of mature IDE tooling, strong static types, package ecosystems, and annotation-driven frameworks.

Strengths:

- Strong type systems and mature diagnostics.
- Large ecosystems and extensive examples.
- Good refactoring support.
- Standard project structures in many domains.

Pain points:

- Frameworks often rely on reflection, annotations, code generation, and dependency injection that are invisible in local code.
- Boilerplate and configuration increase context burden.
- Nullability is historically inconsistent, though Kotlin improves this significantly.

Design lessons:

- Hidden framework magic is hostile to LLM reliability.
- Compiler-visible configuration is preferable to runtime reflection conventions.
- Nullability must be explicit and enforced.

### Lisp, Scheme, and homoiconic languages

Lisp-family languages demonstrate the power of simple syntax, macros, and code-as-data.

Strengths:

- Minimal grammar.
- Easy program transformation.
- Uniform syntax can be generated reliably.
- REPL-driven development supports iteration.

Pain points:

- Macro-heavy systems can create local ambiguity about semantics.
- Parenthesized uniformity can be less aligned with mainstream code corpora.
- Dynamic typing can defer many errors.

Design lessons:

- A small grammar helps, but macro systems need strong hygiene, expansion visibility, and tooling constraints.
- A future language can borrow canonical AST-like syntax without exposing unlimited compile-time metaprogramming early.

### ML family: OCaml, F#, Standard ML, Haskell

These languages emphasize algebraic data types, pattern matching, type inference, and functional purity.

Strengths:

- Sum types and exhaustive matching make invalid states explicit.
- Type inference reduces annotation burden while preserving guarantees.
- Pure functions and immutable data improve local reasoning.
- Pattern matching is compact and verifiable.

Pain points:

- Advanced type features can be difficult for models to apply correctly.
- Haskell’s laziness, monads, and type classes increase conceptual load.
- Ecosystems are less represented in LLM training data than Python/TypeScript.

Design lessons:

- Algebraic data types and exhaustive matching are extremely valuable for generated correctness.
- Effects should be explicit but not require advanced abstraction vocabulary for common tasks.
- Type inference should be constrained by readable, compiler-emitted boundaries.

### Erlang and Elixir

Actor-based languages demonstrate fault tolerance, message passing, and supervision trees.

Strengths:

- Clear concurrency model through isolated processes and messages.
- “Let it crash” supervision makes failure strategy explicit.
- Pattern matching and immutable data improve robustness.

Pain points:

- Dynamic typing can hide message shape errors.
- Distributed systems remain hard to specify and test.
- Runtime behavior depends on supervision configuration.

Design lessons:

- Concurrency should be represented through typed channels/messages and explicit supervision policies.
- Generated code benefits when failure recovery is declarative.

### SQL, Datalog, and logic languages

Declarative query languages demonstrate how constrained domains reduce generation complexity.

Strengths:

- High-level intent with optimizer-managed execution.
- Declarative constraints avoid imperative control-flow bugs.
- Datalog has a small, compositional semantics.

Pain points:

- SQL dialect differences cause hallucinated syntax.
- Query performance and indexing are often mispredicted.
- Complex joins and null semantics create subtle bugs.

Design lessons:

- The future language should include declarative sublanguages for data transformation and constraints.
- Dialect/version metadata must be explicit to prevent stale or incompatible generation.

### Prolog

Prolog illustrates logic programming and relational search.

Strengths:

- Compact expression of relations and constraints.
- Backtracking can solve search tasks with little imperative code.

Pain points:

- Operational behavior depends on clause order and cuts.
- Debugging search space explosions is difficult.
- Lower mainstream prevalence hurts model generation quality.

Design lessons:

- Constraint features should be bounded, typed, and explainable.
- Non-obvious control behavior should be surfaced by tooling.
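
The Prolog lesson that constraint features should be bounded and explainable can be illustrated with a small sketch. It is not Prolog and not a proposal for the future language: it is a plain Python backtracking search with an explicit step budget, so that a search-space explosion is reported as a distinct outcome instead of running unchecked. The names `SearchResult`, `bounded_search`, and `no_attacks`, and the use of N-queens as the stand-in problem, are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical result record: the outcome is explicit, not inferred."""
    status: str  # "solved", "unsatisfiable", or "budget_exhausted"
    assignment: Optional[tuple[int, ...]]
    steps: int  # number of candidate values tried


def bounded_search(
    num_vars: int,
    domain: Sequence[int],
    consistent: Callable[[tuple[int, ...]], bool],
    max_steps: int,
) -> SearchResult:
    """Depth-first backtracking that stops once max_steps candidates were tried."""
    steps = 0
    exhausted = False

    def extend(partial: tuple[int, ...]) -> Optional[tuple[int, ...]]:
        nonlocal steps, exhausted
        if len(partial) == num_vars:
            return partial
        for value in domain:
            if steps >= max_steps:
                exhausted = True  # budget hit: abandon the search everywhere
                return None
            steps += 1
            candidate = partial + (value,)
            if consistent(candidate):
                found = extend(candidate)
                if found is not None:
                    return found
                if exhausted:
                    return None
        return None  # every value failed: backtrack

    solution = extend(())
    if solution is not None:
        return SearchResult("solved", solution, steps)
    if exhausted:
        return SearchResult("budget_exhausted", None, steps)
    return SearchResult("unsatisfiable", None, steps)


def no_attacks(queens: tuple[int, ...]) -> bool:
    """N-queens check: queens[i] is the column of the queen in row i.

    Only the most recently placed queen is compared against earlier rows.
    """
    row = len(queens) - 1
    return all(
        queens[r] != queens[row] and abs(queens[r] - queens[row]) != row - r
        for r in range(row)
    )


if __name__ == "__main__":
    print(bounded_search(4, range(4), no_attacks, max_steps=1000))
    print(bounded_search(4, range(4), no_attacks, max_steps=3))
```

Separating `budget_exhausted` from `unsatisfiable` is the point of the sketch: an agent reading the result can tell "no answer exists" apart from "the search was cut off", which is the kind of control behavior the lessons above ask tooling to surface.
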
### Formal methods tools: TLA+, Alloy, Dafny, F*, Coq, Lean, Agda

These systems show how specifications, proofs, and model checking can prevent deep semantic errors.

Strengths:

- Machine-checkable invariants.
- Counterexample generation in tools such as Alloy and TLA+.
- Dependent or refinement types can encode rich correctness properties.
- Proof assistants support verified libraries and transformations.

Pain points:

- Proof authoring is hard for humans and LLMs.
- Error messages can be dense and require specialized knowledge.
- Full formal verification may be too expensive for everyday programming.

Design lessons:

- The future language should not require proofs for all code, but should allow graded verification:
  - executable examples,
  - contracts,
  - property tests,
  - bounded model checks,
  - optional proofs for critical modules.
- Counterexamples are highly valuable for model repair.

### Configuration languages: Nix, Dhall, CUE, Pkl

Configuration languages address reproducibility, schemas, and validation.

Strengths:

- Declarative configuration with type/schema validation.
- Reproducibility and dependency pinning in systems like Nix.
- CUE unifies data, schema, and constraints.

Pain points:

- Evaluation semantics can be non-obvious.
- Ecosystem-specific knowledge is required.
- Error messages can be hard to map back to user intent.

Design lessons:

- Program configuration should be typed, validated, and part of the same project graph.
- Environment assumptions must be explicit in generated projects.

### Unison

Unison explores content-addressed code and semantic identities.

Strengths:

- Function identity independent of names can improve refactoring.
- Structured codebase storage can avoid some dependency drift.
- Strong typed functional core.

Pain points:

- Nonstandard workflow and limited ecosystem prevalence.
- Tooling model differs from file-based expectations.

Design lessons:

- Content-addressed interfaces could help LLMs avoid name drift and stale imports.
- The future language can adopt stable symbol identities without abandoning normal files initially.

### Elm, Gleam, Roc, and other constrained modern languages

These languages prioritize simple functional architecture, strong types, and friendly compiler errors.

Strengths:

- Friendly diagnostics.
- Small language surface.
- Exhaustive pattern matching and immutable data.
- Reduced runtime failure classes.

Pain points:

- Smaller ecosystems.
- Domain constraints may limit general-purpose adoption.

Design lessons:

- Small, opinionated languages are promising for LLM generation.
- Human-friendly diagnostics should evolve into agent-friendly diagnostics with structured repair hints.

## Survey of AI coding workflows

### Autocomplete

Autocomplete systems predict short spans within an existing file. They excel when local context is strong and project conventions are visible. They fail when the correct code depends on hidden invariants, unavailable APIs, or cross-file coordination.

Language implications:

- Local declarations should expose enough type and effect information for correct completions.
- Imports should be deterministic and discoverable.
- There should be one canonical way to express common patterns.

### Chat-based pair programming

A user describes a desired change and receives code or explanation. This workflow often suffers from mismatch between user intent, existing project structure, and generated patches.

Language implications:

- Projects need machine-readable architecture maps.
- Compiler diagnostics should be compact enough to paste into context.
- Generated patches should include explicit assumptions.

### Agentic coding loops

Agents plan, edit files, run tests, inspect errors, and repair. This is the most relevant workflow for a new language because the compiler and tools can become active participants in repair.

Language implications:

- Build, test, lint, and format commands should be uniform.
- Diagnostics should be structured, stable, and linked to suggested edits.
- The compiler should report missing symbols, wrong effects, and contract failures in a way that supports automated patches.

### Test-driven generation

A model receives tests or writes tests first, then implements code. This improves reliability when tests are comprehensive but can produce overfitting or brittle solutions.

Language implications:

- Tests should be first-class modules with stable fixture syntax.
- Property tests and examples should be concise.
- Coverage and mutation feedback should be easy for agents to consume.

### Retrieval-augmented coding

A model retrieves docs, examples, or project files before generating. It helps with uncommon APIs but can introduce stale or irrelevant context.

Language implications:

- Documentation should be versioned and colocated with symbols.
- Packages should expose machine-readable API summaries.
- The language server should provide minimal relevant context on demand.

### Code review and repair

Models review diffs for bugs, style, security, or performance. Current limitations include shallow reasoning, missed invariants, and false confidence.

Language implications:

- Review should be aided by explicit contracts and effect declarations.
- Generated code should include provenance: prompt, model, tools, and assumptions.
- Security-sensitive operations should be syntactically visible.

### Migration and refactoring

Models are increasingly used to port code between frameworks or languages. Failures often occur around semantic edge cases and library differences.

Language implications:

- The future language should have explicit interop boundaries.
- Refactoring should be AST-aware and compiler-validated.
- Deprecated APIs should include machine-readable migration recipes.

## Pain points for LLM-generated code

### 1. Hallucinated APIs and stale dependency knowledge

Models often call functions that look plausible but do not exist in the selected version.
This is common in fast-moving ecosystems and libraries with similar names.

Mitigations:

- Versioned symbol indexes.
- Compiler errors that distinguish “unknown symbol” from “wrong import path.”
- Standard library APIs that are small, stable, and semantically named.
- Package manifests generated from explicit constraints, not guessed defaults.

### 2. Cross-file inconsistency

Models generate a type, route, schema, or function in one file and call a different name or shape elsewhere.

Mitigations:

- Interface-first generation.
- Compiler-enforced module contracts.
- Project graph summaries.
- Canonical export syntax and generated symbol maps.

### 3. Hidden runtime behavior

Framework reflection, decorators, annotations, monkeypatching, import side effects, environment variables, and global state are difficult to infer from local context.

Mitigations:

- Avoid hidden registration.
- Make effects and dependencies explicit.
- Treat environment variables and external services as typed inputs.
- Require build-time validation of generated routes, handlers, and schemas.

### 4. Weak error and null handling

LLMs frequently ignore error returns, use broad catches, mishandle `None`/`null`, or assume successful I/O.

Mitigations:

- Exhaustive handling for nullable and fallible values.
- Typed error channels.
- Compiler warnings or errors for discarded failures.
- Standard patterns for retries, backoff, and cleanup.

### 5. State and mutation bugs

Generated code may accidentally share mutable data, mutate while iterating, or violate invariants after partial updates.

Mitigations:

- Immutable defaults.
- Transactional update blocks.
- Linear or affine resource types for files, sockets, locks, and handles.
- Invariant declarations checked at function boundaries or test time.

### 6. Concurrency mistakes

Common failures include data races, forgotten awaits, deadlocks, unbounded task spawning, and cancellation leaks.

Mitigations:

- Structured concurrency as the default.
- Typed channels and cancellation scopes.
- Compiler-visible thread-safety and async effects.
- Built-in race and deadlock diagnostics where feasible.

### 7. Security vulnerabilities

LLM-generated code often mishandles SQL injection, shell execution, path traversal, SSRF, secret logging, unsafe deserialization, and authorization.

Mitigations:

- Safe APIs as defaults.
- Capability-based access to filesystem, network, shell, and secrets.
- Taint-tracking or effect labels for untrusted input.
- Security linting integrated into the standard build.

### 8. Overfitting to visible tests

Generated solutions may pass provided examples but fail hidden edge cases.

Mitigations:

- Property-based testing support.
- Compiler-generated boundary-case test suggestions.
- Benchmark scoring that separates public examples from hidden tests.
- Contracts that state behavior beyond examples.

### 9. Poor performance reasoning

Models often select algorithms with inappropriate complexity or misuse expensive copies, regexes, or data structures.

Mitigations:

- Optional complexity annotations.
- Compiler/runtime profiling summaries.
- Standard data-structure names and constraints.
- Benchmark metrics that include asymptotic and practical performance.

### 10. Dependency and environment drift

Generated projects fail because of missing packages, wrong runtime versions, OS assumptions, or undeclared tools.

Mitigations:

- Single manifest format.
- Reproducible toolchain metadata.
- No implicit global dependencies.
- Environment requirements as typed declarations.

## Design lessons for the new language

The new language should be treated as an integrated **LLM operating surface**:

1. **Small grammar, rich semantics.** Keep syntax predictable while encoding strong contracts.
2. **Canonical project structure.** One manifest, one formatter, one test command, one module layout.
3. **Mandatory boundary types.** Public functions, external I/O, and module exports must expose schemas, effects, and errors.
4. **Explicit effects.** Filesystem, network, time, randomness, subprocesses, environment, database, and unsafe operations should be visible.
5. **Typed errors and nullability.** Fallibility must be handled exhaustively or propagated explicitly.
6. **Contract layers.** Examples, assertions, properties, and optional proofs should compose without a separate language.
7. **Structured diagnostics.** Every compiler/test/lint failure should have stable codes, spans, expected/actual values, and repair hints.
8. **Repair-oriented compiler.** The compiler should rank likely fixes and explain required context.
9. **Versioned symbol knowledge.** Packages should ship machine-readable API summaries.
10. **Generated-code provenance.** Build artifacts should record model, prompt, tool version, assumptions, and review status where appropriate.
11. **Security by construction.** Capabilities and safe defaults should reduce vulnerable code paths.
12. **Human auditability remains necessary.** The language is optimized for LLMs, but humans still need to review, debug, and govern generated systems.

## References and comparable evaluation efforts

Useful public reference points for later work include:

- Python language reference, packaging guides, and typing specifications.
- Rust Book, Rust Reference, Cargo documentation, and Rust compiler diagnostics.
- TypeScript Handbook and compiler configuration documentation.
- Go specification and `gofmt`/`go test` workflow.
- Elm compiler diagnostics and architecture patterns.
- TLA+, Alloy, Dafny, Lean, Coq, and F* documentation for specification approaches.
- CUE, Dhall, Pkl, and Nix documentation for typed configuration and reproducibility.
- SWE-bench, HumanEval, MBPP, APPS, CodeContests, EvalPlus, and related code-generation benchmarks.
- Research on self-repair, Reflexion-style loops, self-refinement, compiler-feedback prompting, and retrieval-augmented code generation.

These references should inform later design, but the project’s own benchmark must measure the actual proposed language against Python, Rust, and TypeScript under controlled conditions.
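
As a closing illustration of design lesson 7 (structured diagnostics), the following minimal Python sketch shows one possible shape for a diagnostic record carrying a stable code, a span, expected/actual values, and repair hints. Every name here (`Span`, `RepairHint`, `Diagnostic`, `render`), the code `E0001`, the file path, and the field layout are hypothetical; the dossier does not define a diagnostic format, and a real design would need to be validated by the benchmark.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Span:
    """Source location a diagnostic refers to (hypothetical layout)."""
    file: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class RepairHint:
    """A candidate edit an agent could apply to the span."""
    description: str
    replacement: str


@dataclass(frozen=True)
class Diagnostic:
    code: str  # stable identifier, e.g. "E0001" (invented for this sketch)
    message: str
    span: Span
    expected: str
    actual: str
    hints: tuple[RepairHint, ...] = ()


def render(diagnostic: Diagnostic) -> str:
    """Serialize to JSON with sorted keys so output is stable across runs."""
    return json.dumps(asdict(diagnostic), sort_keys=True)


example = Diagnostic(
    code="E0001",
    message="return type does not match the declared contract",
    span=Span(file="src/orders.lang", start_line=12, end_line=12),
    expected="Result<Order, OrderError>",
    actual="Order",
    hints=(
        RepairHint(
            description="wrap the returned value in the success variant",
            replacement="return Ok(order)",
        ),
    ),
)

if __name__ == "__main__":
    print(render(example))
```

Keeping `expected`, `actual`, and `hints` as separate fields, instead of folding them into the message string, is what lets an agent loop act on a failure without parsing prose.
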