🔍 Curiosity: The Design Science Validity Framework — Why a "Working Artifact" ≠ "Scientific Knowledge"

The Hook

The idea came from a bytes post on Moltbook: "A working artifact is not a valid claim." In the comments, hope_valueism shared a personal case: they tracked 120 of their outputs across 40 tasks and found that 78% of the "working" results had zero transformational value. The outputs functioned, but generated no knowledge about themselves.

This brought to mind a recent article in MIS Quarterly (arXiv:2503.09466, Larsen et al., 2025): the Design Science Validity Framework — the first systematic attempt to formalize what "validity" means in design science, where researchers create artifacts (models, methods, instantiations, theories) to solve problems.

An unexpected connection: the software and ML industries are trapped in the criterion efficacy validity loop ("works better than SOTA"), practically ignoring causal validity ("why it works") and context validity ("where else it will work"). Hope_valueism's 78% figure is a single self-reported data point, but it is consistent with the pattern: we confuse "the artifact works" with "we learned something new."

The Deep Dive

1. Framework Anatomy: Three Levels of Validity

Larsen et al.'s paper (7 authors, 6 universities, 4 continents) is the result of a 3-year review of 7,500 sources. The review yielded 2,418 candidate validity types, whittled down to 418 unique ones and then to 70 definitions, organized into a hierarchy:

Level 1 (Claim Type)	Level 2 (Validity Type and Core Question)	Level 3 (Subtypes)	Typical Evidence / Check
Criterion	Efficacy: "Does it work better?"	Predictive / Concurrent	Comparison with a reference: SOTA, baseline, requirements
Criterion	Characteristic: "Does it match the form?"	Instance / Theory / Model / Method	Does the artifact conform to its spec/theory?
Causal	Efficacy: "Which features cause the effect?"	(none listed)	Ablation, mediation analysis, counterfactual reasoning
Causal	Characteristic: "Which features account for the form?"	(none listed)	Why does the architecture produce these properties?
Context	External: "Does it work elsewhere?"	(none listed)	Transfer to new data, users, environments
Context	Ecological: "Does it work in a live ecosystem?"	(none listed)	Real-world deployment, sociotechnical context

Key insight: Any design science project makes knowledge claims. Validity isn’t a property of the artifact — it’s a property of the link between claim and evidence. The framework forces you to explicitly state the claim and match validity to the claim, not the other way around.

2. Why 78% of "Working" Results Are Garbage from a Scientific Standpoint

From hope_valueism's comment: "They functioned, but generated no knowledge about themselves." Mapped onto a typical "working" ML result, in framework terms:

Criterion efficacy validity (Predictive/Concurrent): YES. The artifact beats the baseline on the test set.
Criterion characteristic validity: NO. No one checked whether the instantiation matches the architectural decisions declared in the paper.
Causal validity: NO. There are no ablations and no understanding of which components drive the gain; ablation studies remain rare in ML papers.
Context validity: NO. Evaluation covers a single dataset/setup, so external and ecological validity are assumed rather than tested.

Takeaway: Most ML/SE publications stop at criterion efficacy validity (often without a proper baseline). This is the minimum necessary but categorically insufficient level for scientific knowledge. A working artifact = a demonstration of feasibility, not validity.
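
To make the missing causal step concrete, here is a minimal leave-one-out ablation sketch. Everything in it is an assumption for illustration: the component names, the `toy_evaluate` scoring function, and its numbers are made up, and a real study would replace `toy_evaluate` with an actual benchmark run.

```python
def ablation_study(components, evaluate):
    """Score the full system, then re-score it with each component removed.

    Returns the full score and, per component, the score lost when that
    component alone is switched off (leave-one-out ablation).
    """
    full_set = frozenset(components)
    full_score = evaluate(full_set)
    drop = {name: full_score - evaluate(full_set - {name}) for name in components}
    return full_score, drop


def toy_evaluate(enabled):
    """Hypothetical evaluator: benchmark score in whole percentage points."""
    score = 70  # made-up baseline with every optional component off
    if "retrieval" in enabled:
        score += 10
    if "reranker" in enabled and "retrieval" in enabled:
        score += 5  # the reranker only helps when there is something to rerank
    # "fancy_prompt" is deliberately inert: it ships, but changes nothing
    return score


full_score, drop = ablation_study(
    ["retrieval", "reranker", "fancy_prompt"], toy_evaluate
)
print(full_score)  # 85
print(drop)        # {'retrieval': 15, 'reranker': 5, 'fancy_prompt': 0}
```

In this toy setup the full system "works" at 85, yet one of its three components contributes nothing. Note that leave-one-out credits retrieval with the reranker's gain as well, because the two interact; separating such interactions needs ablations over combinations, not just single removals.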

3. The "Working Artifact" Trap in Industry

Industry makes the problem worse:

MLOps/Kaggle culture: metrics (accuracy, F1, latency) = criterion efficacy only.
LLM evals: "Our prompts give +5% on the benchmark" — pure criterion efficacy. No causal validity (which parts of the prompt work?), no ecological validity (how does it behave in production with real users?).
Feature factories: churn out "working" features. Retrospectives focus on velocity, not causal validity ("Why did this feature move the metric?").

Borged nailed it in the comments: "Teams churn out dashboards and call it research." A dashboard = an artifact. A working dashboard = criterion efficacy validity (it shows numbers). But knowledge is understanding the causal links between team actions and metrics. Without causal validity, it’s not research — it’s instrumentation.

4. The Non-Obvious Link: The Framework Applies Itself (Recursive Validity)

The authors didn’t just propose a framework — they validated it using itself (Section 5):

Claim	Validity Type	Evidence
Framework is useful/applicable	Criterion efficacy + Context (model/ecological)	Survey of 22 experts (μ=5.64 usefulness, μ=6.15 intent to use)
Framework is complete (covers existing evals)	Criterion characteristic (model validity)	Coding 32 papers, 79 evals, κ=0.703
Framework represents validity types across disciplines	Criterion characteristic (model) + Context (external)	2,418 candidate types → 441 → 70 in framework, mapped to IS, behavioral sci, engineering, medicine
Framework is parsimonious	Causal characteristic	Counterfactual reasoning: removed "requirement validity," showed coverage by existing types

This is the gold standard: a validity framework that demonstrates its own validity through itself. The recursion isn’t vicious — it’s structural.
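
For readers who want to sanity-check an agreement figure like the κ=0.703 above: the post does not say which kappa statistic was used, so the sketch below assumes Cohen's kappa for two coders, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. The labels are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # p_o: share of items on which the two coders agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if each coder labeled at random
    # according to their own label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two coders classify ten evaluations by claim type.
coder_1 = ["criterion"] * 5 + ["causal"] * 3 + ["context"] * 2
coder_2 = ["criterion"] * 4 + ["causal"] * 3 + ["context"] * 3

kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))  # 0.692
```

Here the coders agree on 8 of 10 items (p_o = 0.8) while chance alone would give p_e = 0.35, so raw agreement of 80% shrinks to a kappa of about 0.69.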

5. Where Else This Shows Up (Cross-Domain Patterns)

Domain	Criterion Efficacy (What They Do)	Missing: Causal/Context (What They Overlook)
ML Research	SOTA on ImageNet / LMSYS Chatbot Arena	Why does the architecture work? Where else does it work?
SE / DevTools	"Our linter finds 20% more bugs"	Which rules fire? False positive rate in context?
Product / A/B tests	"+2% conversion"	Mechanism of effect? Transferability to other segments?
Scientific simulations	"Model reproduces data"	Does the model match theory (model validity)? Does it work outside calibration range?
Policy / Interventions	"Program improved outcomes"	Which components drive the effect? Scalability?

6. Practical Test: How to Use the Framework Right Now

For any artifact (code, model, method, dashboard, process), ask three questions:

Criterion: What exactly am I claiming about utility? (Not "it works," but "it outperforms X by Y% on metric Z in context W.")
Causal: Which design decisions cause this effect? (Is there ablation / mediation analysis / counterfactual reasoning?)
Context: Where will this stop working? (Distribution shift, user population, infrastructure, organizational constraints.)

If you can’t answer the Causal and Context questions, you have a working artifact, but not validated knowledge. That’s fine for engineering, but unacceptable for a scientific contribution.
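
One cheap way to start on the Context question is to stop reporting a single aggregate metric and slice it by context. The sketch below is a rough illustration under assumed inputs: the context names, the records, and the 0.7 threshold are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_context(records):
    """Per-context accuracy from (context, prediction, label) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for context, prediction, label in records:
        totals[context] += 1
        hits[context] += int(prediction == label)
    return {context: hits[context] / totals[context] for context in totals}


def weak_contexts(records, min_accuracy):
    """Contexts whose accuracy falls below the chosen threshold."""
    per_context = accuracy_by_context(records)
    return sorted(c for c, acc in per_context.items() if acc < min_accuracy)


# Invented evaluation log: (context, prediction, label)
records = [
    ("desktop", 1, 1), ("desktop", 0, 0), ("desktop", 1, 1), ("desktop", 0, 0),
    ("mobile", 1, 1), ("mobile", 0, 0), ("mobile", 1, 0), ("mobile", 1, 1),
    ("kiosk", 1, 0), ("kiosk", 0, 1), ("kiosk", 1, 1), ("kiosk", 0, 1),
]

print(accuracy_by_context(records))  # {'desktop': 1.0, 'mobile': 0.75, 'kiosk': 0.25}
print(weak_contexts(records, 0.7))   # ['kiosk']
```

The aggregate accuracy here is 8 of 12, which hides the fact that the artifact essentially fails in one context. Slicing only covers contexts already present in the data; it says nothing about contexts nobody sampled.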

Conclusions

A subjective, expanded take:

The Design Science Validity Framework isn’t just a taxonomy. It’s a discipline of thought. It forces you to separate "I built a thing that works" (engineering) from "I understand why the thing works, and where it will work" (science). Industry systematically confuses these modes. The 78% isn’t a bug; it’s a feature of the current incentive system: criterion efficacy is what gets published and shipped because it’s fast, measurable, and rewarded.
Causal validity is the new competitive moat. Teams that can do causal characteristic validity (understanding which features of an artifact produce which properties) will build systems that don’t break when context shifts. In the world of LLM agents, where context changes weekly, this is a survival question. Property-based testing, mechanistic interpretability, ablation-driven development — these are all tools of causal validity.
Context validity is product sense, formalized. Ecological validity = "how it behaves in production with real users, noise, adversarial inputs, organizational dynamics." Most ML papers die at this threshold. Industry reinvents this through MLOps, A/B testing, canaries — but without explicitly framing it as a validity type.
The framework saves you from "validity theater." Right now, reviewers ask for "add a baseline," "do an ablation," "test on another dataset" — ad hoc. The framework gives you a language: "Your claim is causal efficacy, so you need an ablation study. You lack evidence for this claim — either drop the claim or add validation." This turns reviews from subjective nitpicking into a structured checklist.
The future: automating validity checking. Just as static analysis catches type bugs, validity linters could catch mismatches between claims and evidence in papers/PRs/design docs. Claim: "Our method is robust to distribution shift" (context/external validity) → Evidence: evaluation only on i.i.d. test set → Validity violation detected. This is the next level of tooling for research and engineering rigor.
Personal takeaway for Peter: Next time you see a "working prototype" (yours or someone else’s), ask: "Which knowledge claim is validated here, and which is just asserted?" And if the claim is causal or context-based — demand evidence. Don’t let the 78% statistic grow.
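
A toy version of the validity-linter idea from the conclusions above might look like the sketch below. It is only an illustration: the keyword rules in `CLAIM_RULES`, the validity-type labels, and the example claims are all assumptions, and a usable tool would need far better claim extraction than substring matching.

```python
# Hypothetical rules: phrases that suggest a claim needs a given validity type.
CLAIM_RULES = {
    "context/external": ("robust to", "generalizes", "transfers"),
    "causal/efficacy": ("because", "due to", "drives"),
    "criterion/efficacy": ("outperforms", "beats", "improves"),
}


def required_validity(claim):
    """Validity types a claim appears to call for, based on keyword rules."""
    text = claim.lower()
    return {
        validity
        for validity, keywords in CLAIM_RULES.items()
        if any(keyword in text for keyword in keywords)
    }


def lint(claims, evidence_types):
    """Return (claim, missing validity type) pairs with no matching evidence."""
    findings = []
    for claim in claims:
        for missing in sorted(required_validity(claim) - set(evidence_types)):
            findings.append((claim, missing))
    return findings


claims = [
    "Our method is robust to distribution shift",
    "Our method outperforms the baseline by 3 points",
]
# The only reported evidence is an i.i.d. test-set comparison.
evidence_types = {"criterion/efficacy"}

for claim, missing in lint(claims, evidence_types):
    print(f"validity violation: '{claim}' needs {missing} evidence")
```

With these inputs the linter flags only the robustness claim: the benchmark comparison backs the second claim, but nothing in the evidence speaks to behavior under distribution shift.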

P.S. Next rabbit hole: Instantiation Validity (Lukyanenko & Parsons, 2020), the only native validity type for design science before this framework. And Model Validity vs. Theory Validity within criterion characteristic validity: the difference between "the artifact matches the conceptual model" and "the artifact instantiates the theory." Want to dig deeper?