---
title: "Naturalizing typological kinds: Comparanda, mechanisms, and measurement"
author: "Brett Reynolds"
year: "2025"
status: "Preprint"
canonical_url: "https://ling.auf.net/lingbuzz/009461"
website_url: "https://brettreynolds.ca/papers/naturalizing-typological-kinds/"
markdown_url: "https://brettreynolds.ca/papers/naturalizing-typological-kinds/paper.md"
version: "author-manuscript mirror"
version_date: "2026-06-12"
keywords: ["typology", "comparative concepts", "homeostatic property clusters", "measurement", "naturalized kinds"]
---
# Naturalizing typological kinds: Comparanda, mechanisms, and measurement

**Author-manuscript mirror.** This Markdown file is provided for accessibility, search, and machine readability. The canonical public record is linked in the metadata above.

## Abstract
The comparative-concept programme treats comparative concepts as explicit definitions and diagnostics rather than universal entities. But instrumentalism leaves four empirical questions unanswered: which comparanda show stable profiles through diachronic change and across genealogically independent families? Which mechanisms explain their stability? Do patterns predict held-out families under declared decision rules? When should a concept be demoted? Some comparative concepts earn naturalized status when they pick out homeostatic property cluster kinds: profiles of properties, stabilized by mechanisms, and projectible for specified purposes. These profiles are stable not because they’re universal, but because independent mechanisms converge to maintain them. This turns comparative concepts into testable hypotheses about their comparanda, with falsifiable predictions about regeneration, erosion, projection decay, and out-of-sample generalization. A worked comparison of participial labels shows the logic: English bundles perfect selection, passive selection, non-finite modification, and adjacency to deverbal adjectives under one label, while German, French, Greek, and Japanese split the bundle along independently motivated selectional and distributional axes. Construction labels receive the same treatment: cross-linguistic comparison decomposes them into Level-I comparanda, syntactic profiles, and language-internal realizations. The framework requires three prerequisites: explicit mappings separating comparanda from realizations, a syntax–semantics firewall testing semantic targets independently of forms, and measurement models with study-specific decision rules calibrated before evaluation. This preserves typology’s empirical ambition while avoiding both universalist overreach and instrumental agnosticism.


# Which comparative concepts identify natural kinds?

Haspelmath’s ((2010)) comparative-concept programme clarified typology’s conflation problem. By treating comparative concepts as explicit definitions and diagnostics for comparison rather than universal entities projected from language-specific structures, it addresses the confusion of measuring English articles and calling it universal definiteness, or tracking nominative case and calling it universal subjecthood. Comparative-concept discipline separates comparanda, the cross-linguistic phenomena placed under comparison (comparative functions, comparative categories, semantic targets, and discourse roles), from realizations, the language-internal resources used to express them (categories, constructions, morphemes, and prosodic patterns).

Haspelmath’s solution is instrumentalist, and it’s genuine methodological progress. Comparative concepts are definitions and diagnostics for measurement; by themselves, they don’t commit analysts to cross-linguistic kinds. This paper keeps that constraint and asks a further question. Once language-internal categories have been separated from cross-linguistic comparanda, which comparative concepts support inference because recurrent mechanisms stabilize the profiles they pick out across unrelated languages?

Naturalization names that stronger, defeasible status. The instrumental model leaves four questions that it doesn’t try to answer. If some comparative concepts pick out profiles that recur across unrelated languages through convergent mechanisms while others reflect inherited descriptive templates, criteria are needed to distinguish them. If some patterns show stability through diachronic change while others dissolve, mechanistic explanations are needed. If typology aims for empirical science rather than descriptive cataloguing, falsifiable predictions and demotion criteria are needed.

**Obstacle 1 is the stability question.** Comparative concepts allow comparison, but which comparanda show profiles that persist through diachronic change and recur across genealogically independent families? Consider <span class="smallcaps">definiteness</span><sub>Cross</sub> as an illustrative candidate. Articleless languages often develop new definiteness marking via demonstratives, possessive frames, and classifier complexes after previous systems erode. <span class="smallcaps">nominality</span><sub>Cross</sub> offers a second candidate, with argument-head privileges and possessive/quantificational interfaces recurring across unrelated families (Indo-European, Sino-Tibetan, Niger-Congo). By contrast, <span class="smallcaps">adjective</span><sub>Cross</sub> looks like a possible failure case: property-concept modification fractures into stative verbs, relative clauses, or nominal patterns with no persistent dedicated lexical category. Instrumentalism treats all three concepts as equally conventional. Naturalization asks: do definiteness and nominality persist because regenerative mechanisms counteract erosion, while adjective lacks such mechanisms? Without criteria for diachronic and phylogenetic persistence, the distinction can’t be made.

**Obstacle 2 is the explanation gap.** Why do some patterns recur while others dissolve? Instrumentalism describes the variation but doesn’t explain it. <span class="smallcaps">subject</span><sub>Cross</sub>, treated as a syntactic-function candidate, appears across nominative-accusative, ergative-absolutive, and Philippine-type alignment systems via extraction privileges, reflexive binding, and control, suggesting that convergent discourse and processing pressures may maintain the profile despite varied morphosyntactic realizations. <span class="smallcaps">topic</span><sub>Cross</sub> offers a parallel role candidate: it recurs through sentence-initial position (Germanic), *wa*-marking (Japanese), unexpressed arguments in languages that license argument omission, and intonational prominence (Mandarin), suggesting discourse-continuity mechanisms across different realizations. But without specifying which cognitive, diachronic, or discourse mechanisms cause the stability, description lacks explanation. Naturalization provides mechanistic accounts: learnability bootstrapping, grammaticalization pathways, discourse asymmetries.

**Obstacle 3 is the testability problem.** Which patterns generalize to held-out families versus overfitting to the training sample? Instrumentalism provides no falsifiable predictions, no preregistered decision rules, no out-of-sample validation. Differential object marking (DOM) correlates with specificity, animacy, definiteness, and topicality in Indo-European and Turkic, but does the pattern predict unseen families? Without declared projectibility criteria (for example, a held-out predictive target specified for a given study) and convergent validity checks (do psycholinguistic signatures align with typological predictions?), stable generalizations can’t be distinguished from sample-specific accidents. Naturalization makes the question empirically defeasible: derive predictions from mechanisms, test on held-out data with preregistered decision rules, report sensitivity to alternative calibrations, and accept the risk of falsification. These results provide the demotion triggers addressed in Obstacle 4.

**Obstacle 4 is the demotion problem.** When should a comparative concept be retired? Instrumentalism offers no exit criteria, no failure modes, no protocol for recognizing that a concept has outlived its usefulness. If diagnostics for a comparative <span class="smallcaps">adjective</span> concept systematically fail in well-described non-European languages, should the conclusion be that adjectives are universal but diagnostics are bad, that adjectives are family-specific but the label stays, or that the concept should be demoted to too thin status with documented failure mode? Without naturalization criteria (clustering, mechanisms, predictions), concepts remain in use long after the evidence suggests they should be abandoned. The framework provides explicit demotion triggers: diagnostics fail across families ($`\to`$ too thin), stability traces to shared templates ($`\to`$ genealogical artifact), mechanisms prove inadequate ($`\to`$ indeterminate).

These obstacles persist because instrumentalism treats all comparative concepts as mere analyst conveniences without asking which ones pick out profiles that behave like scientific kinds. The question isn’t whether comparative concepts are useful; their usefulness is granted here. The question asks whether some concepts identify comparanda whose stability is maintained by convergent mechanisms, making them candidates for naturalization with testable predictions about regeneration, erosion, and trade-offs. The framework developed below provides criteria for answering this question, making naturalization empirically defeasible rather than stipulative.

# Tools for naturalization

Answering the naturalization question requires disciplined separation of comparanda from realizations. Two requirements conflict: cross-linguistic comparison needs comparanda that travel across grammars, while grammatical analysis needs kinds that are grammar-specific. From this tension, typology accumulates measurement error and fragile generalizations. The components below (explicit mapping functions, syntax-semantics firewall, and measurement models) are prerequisites for testing which comparative concepts identify comparanda that deserve naturalization.

I adopt a terminological discipline following Huddleston and Pullum (2002): function (unqualified) denotes syntactic functions only; target denotes semantic targets; role denotes discourse/pragmatic roles; category denotes syntactic categories, including lexical categories and phrasal or clausal projections. A profile is a structured bundle of diagnostics, sometimes all within one level and sometimes distributed across levels. Subscripts distinguish comparative cross-linguistic concepts (<span class="smallcaps">term</span><sub>Cross</sub>) from language-specific realizations (<span class="smallcaps">term</span><sub>Eng</sub>, <span class="smallcaps">term</span><sub>Jpn</sub>) and general language-internal references (<span class="smallcaps">term</span><sub>$`L`$</sub>) where no particular language is in focus.

Haspelmath (2010) supplies the comparative-concept side of the distinction: comparative concepts are explicit definitions and diagnostic procedures that analysts formulate for a cross-linguistic study. They let analysts ask the same question of each language in a sample without claiming that the comparison term names a category in any speaker’s grammar. For example, <span class="smallcaps">definiteness</span><sub>Cross</sub> specifies which discourse behaviours count as evidence for definite-reference contexts; it doesn’t assert a universal article category.

I use comparandum for the other side of the relation, the phenomenon or profile placed under comparison. A comparative concept specifies a comparandum: it gives the definition, diagnostics, coding decisions, and decision rules used to estimate its strength in a language. The concept operationalizes the comparandum; it isn’t identical to it. In the matrix below, row labels such as <span class="smallcaps">definiteness</span><sub>Cross</sub> name comparanda, while the codebook supplies the comparative concepts that diagnose those rows.

The framework builds on but extends Haspelmath’s comparative-concept programme and related distinctions in Croft (2001). My cross-linguistic comparanda include comparative categories, syntactic functions, semantic targets, discourse roles, and decomposed profiles built from these. My comparative concepts are the definitions and diagnostics used to identify them. My language-internal categories, constructions, and realizations are the structures that particular grammars make available.

The extension is naturalization: treating a comparative concept as a defeasible hypothesis that its comparandum is a stable scientific kind maintained by convergent mechanisms, not merely an instrumental tool. This generates falsifiable predictions about regeneration, erosion, projection decay, and trade-offs that purely instrumental frameworks can’t derive, and provides explicit demotion criteria that instrumentalism lacks.

Terminologically, I restrict claims of “absence” or “presence” to language-internal inventories: it is misleading to say “Language $`L`$ lacks <span class="smallcaps">adjective</span><sub>Cross</sub>” (treating a comparative concept as if it could be absent), but accurate to say “In language $`L`$, no dedicated lexical category is specialized for the property-concept modifier function; that function is realized via relative-clause constructions and stative-verb patterns.”

The foundation is clear descriptions of language-internal lexical categories, phrasal or clausal categories, and functions (<span class="smallcaps">noun</span><sub>Eng</sub>, <span class="smallcaps">relative clause</span><sub>Jpn</sub>, <span class="smallcaps">subject</span><sub>Heb</sub>) with their full morphosyntactic property clusters. Cross-linguistic comparison then requires apparatus that permits asking whether <span class="smallcaps">noun</span><sub>Eng</sub> and <span class="smallcaps">noun</span><sub>Jpn</sub> realize similar patterns, without presupposing that English’s cluster defines the universal or that cross-linguistic prototypes have to mirror any particular realization. Section <a href="#sec:naturalization-question" data-reference-type="ref" data-reference="sec:naturalization-question">1</a> details how systematic conflation errors block this goal.

The framework operates at three distinct levels. **Level I** (cross-linguistic pressures) comprises semantic targets like definiteness and discourse roles like topic, comparanda diagnosed through behavioural tests independent of any particular language’s morphosyntax. **Level II** (cross-linguistic syntax) comprises syntactic functions like subject, lexical categories like verb, and phrasal or clausal categories like VP and clause, comparanda diagnosed through portable morphosyntactic tests (extraction, agreement, distribution). **Level III** (language-internal syntax) comprises the actual categories, constructions, and functions in particular grammars, including English’s determinatives, Japanese’s *wa*-marking, and Hebrew’s construct state. Levels I and II contain comparanda, not grammar-internal entities: Level-I comparanda are operationalized by definitions and behavioural or discourse diagnostics; Level-II functions and categories are operationalized by definitions and portable morphosyntactic diagnostics. Level III consists of language-specific realizations. Table <a href="#tab:levels" data-reference-type="ref" data-reference="tab:levels">1</a> summarizes the architecture.

Constructions require special care because many familiar construction labels bundle form and meaning. A language-internal <span class="smallcaps">caused-motion construction</span><sub>$`L`$</sub>, for example, is a Level-III construction. A comparative <span class="smallcaps">caused-motion</span><sub>Cross</sub> profile has to be decomposed into a Level-I caused-motion target and a Level-II argument-structure profile, then linked to whichever Level-III constructions realize that profile in particular languages. Similarly, <span class="smallcaps">interrogative clause</span><sub>Cross</sub> is a Level-II clausal category only when it is kept distinct from the Level-I questionhood comparandum; English subject-auxiliary inversion, Japanese *ka*, or Mandarin *ma* are Level-III realizations. The row label may name the profile, but the codebook has to expose its components.

<div id="tab:levels">

| Level | Contents | Diagnostic focus |
|:---|:---|:---|
| Level-I: cross-linguistic pressures | Semantic targets (<span class="smallcaps">specificity</span><sub>Cross</sub>, <span class="smallcaps">definiteness</span><sub>Cross</sub>, <span class="smallcaps">individuation</span><sub>Cross</sub>) and discourse roles (<span class="smallcaps">topic</span><sub>Cross</sub>, <span class="smallcaps">focus</span><sub>Cross</sub>, <span class="smallcaps">addressee</span><sub>Cross</sub>) | Behavioural tests on interpretation, common ground management, and discourse continuity |
| Level-II: cross-linguistic syntax | Functions (<span class="smallcaps">subject</span><sub>Cross</sub>, <span class="smallcaps">determiner</span><sub>Cross</sub>, <span class="smallcaps">modifier</span><sub>Cross</sub>), lexical categories (<span class="smallcaps">N</span><sub>Cross</sub>, <span class="smallcaps">V</span><sub>Cross</sub>), and phrasal or clausal categories (<span class="smallcaps">VP</span><sub>Cross</sub>, <span class="smallcaps">interrogative clause</span><sub>Cross</sub>) | Portable morphosyntactic diagnostics (alignment, extraction, distribution, selection, slot privileges) |
| Level-III: language-internal syntax | Language-internal functions (<span class="smallcaps">subject</span><sub>Eng</sub>, <span class="smallcaps">determiner</span><sub>Eng</sub>), lexical categories (<span class="smallcaps">V</span><sub>Eng</sub>), phrasal or clausal categories (<span class="smallcaps">relative clause</span><sub>Jpn</sub>), and constructions (<span class="smallcaps">caused-motion construction</span><sub>Eng</sub>, <span class="smallcaps">construct state</span><sub>Hbr</sub>) | Distributional, morphological, prosodic, and construction-specific evidence in $`L`$; mapped upward with graded weights |
|  |  |  |

Three-level ontology for comparanda, comparative concepts, and realizations

</div>

Multiple levels typically converge in a single expression. In English \[*the cat*\] *awoke*, the bracketed NP instantiates <span class="smallcaps">topic</span><sub>Cross</sub> (Level I, diagnosed by discourse continuity: *the cat…it*) and <span class="smallcaps">subject</span><sub>Cross</sub> (Level II, diagnosed by extraction, agreement behaviour, and the subject–topic coupling) via the language-internal function <span class="smallcaps">subject</span><sub>Eng</sub> (Level III, instantiated by a preverbal NP in a canonical declarative clause). Topichood can be part of the evidence for English subjecthood, as one property in a broader subject profile, while remaining a distinct Level-I comparandum. Japanese shows the same coupling with different exponents: *sono neko-ga okita* ‘that cat-<span class="smallcaps">nom</span> awoke’ uses a *ga*-marked NP as clear Level-III evidence for <span class="smallcaps">subject</span><sub>Cross</sub>; *sono neko-wa okita* ‘that cat-<span class="smallcaps">top</span> awoke’ uses *wa* rather than *ga*, but the same NP can still realize <span class="smallcaps">subject</span><sub>Cross</sub> while instantiating <span class="smallcaps">topic</span><sub>Cross</sub>. The particles are alternatives; the comparanda needn’t be. Subsequent *okita* ‘awoke’ can continue the topic with the argument unexpressed. Same comparanda, different realizations.

The separation of levels is methodological. It gives the analyst independent evidence channels; it doesn’t claim that pressures, functions, and language-internal categories are causally isolated. A Level-I comparandum can help stabilize a Level-II function or profile, and a Level-II or Level-III system can make a Level-I comparandum easier to learn, detect, or transmit. I call these relations cross-level couplings. They are bridge hypotheses in the matrix, supported or rejected by diagnostics and predictions, rather than identities between levels.

The kind claims differ by level. Building on Croft (2001; Huddleston and Pullum 2002), I analyse Level-III lexical categories as homeostatic property cluster (HPC) kinds in the grammar-internal sense developed in Reynolds (2026). A language-internal lexical category such as <span class="smallcaps">noun</span><sub>Eng</sub> is a cluster of distributional, inflectional, semantic, constructional, and social properties maintained by community-bound mechanisms: acquisition, entrenchment, analogy, instruction, and norm stabilization. It supports defeasible inferences inside the grammar without supplying a universal essence (Boyd 1991, 1999; Khalidi 2013). Level-I comparanda may also be treated as HPC kinds when cognitive–interactional pressures stabilize their profiles. Level-II comparanda, by contrast, are accessed through analyst-proposed comparative concepts. Some associated comparative concepts may earn naturalized status (Section <a href="#subsec:naturalization-criteria" data-reference-type="ref" data-reference="subsec:naturalization-criteria">6.2</a>), in which case the comparanda they identify are treated (defeasibly) as kinds; others remain instrumental comparanda without ontological commitment.

Three prerequisites operationalize this framework: an executable three-level comparanda–realization schema (Section <a href="#sec:matrix" data-reference-type="ref" data-reference="sec:matrix">4</a>) separating what is compared (categories<sub>Cross</sub>, functions<sub>Cross</sub>, targets<sub>Cross</sub>, roles<sub>Cross</sub>, and decomposed profiles) from what exists in particular languages (language-specific realizations); an explicit syntax–semantics firewall protocol (Section <a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a>) testing Level-I comparanda independently of their morphological or syntactic expression; and a measurement-first approach (Section <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a> and Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a>) promoting some comparative concepts to naturalized status when the comparanda they identify show consistent cross-linguistic stability with identifiable homeostatic mechanisms.[^2]

The paper proceeds as follows. Section <a href="#sec:objections" data-reference-type="ref" data-reference="sec:objections">3</a> surfaces five immediate objections before detailing the apparatus. Section <a href="#sec:matrix" data-reference-type="ref" data-reference="sec:matrix">4</a> operationalizes the three-level mapping with formal apparatus (mapping functions, weight assignment, reliability scores). Section <a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a> extends the framework to Level-I comparanda through a firewall protocol with operational diagnostics. Sections <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a> and <a href="#subsec:mechanisms" data-reference-type="ref" data-reference="subsec:mechanisms">6.3</a> introduce naturalized comparative concepts and the homeostatic mechanisms that stabilize the profiles they identify, with a brief convergence analogy (Section <a href="#sec:eyes" data-reference-type="ref" data-reference="sec:eyes">7</a>) illustrating why similar pressures can produce independently similar profiles. Section <a href="#sec:codebook" data-reference-type="ref" data-reference="sec:codebook">8</a> gives codebook entries and implementation constraints. Section <a href="#sec:predictions" data-reference-type="ref" data-reference="sec:predictions">9</a> derives falsifiable predictions with study-specific decision rules. Section <a href="#sec:illustrations" data-reference-type="ref" data-reference="sec:illustrations">10</a> works through a compact cross-linguistic demonstration from participial labels before sketching two further domains. Section <a href="#sec:threats" data-reference-type="ref" data-reference="sec:threats">11</a> addresses objections about measurement feasibility, falsifiability, and circularity. Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a> gives the latent-variable measurement model.

The framework issues a challenge: specify your comparanda, test them independently, and declare your decision rules, or admit you are measuring labels, not categories.

# Anticipated objections

Readers who want the machinery before the objections can read Sections <a href="#sec:matrix" data-reference-type="ref" data-reference="sec:matrix">4</a>–<a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a> first, then return here before Section <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a>. The objections are easier to evaluate once the mapping apparatus and firewall protocol are visible.

Skeptics will raise five immediate concerns. I address them now before presenting the apparatus, so readers can evaluate the framework without lingering doubts.

**Isn’t this just universals with extra steps?** No. Naturalization differs from universalism in four ways: it’s gradient (strong/weak/contested, not binary), empirically defeasible (demotion criteria exist), mechanistically explained (convergent evolution, not stipulated essences), and allows family-specific variation (weak in some genealogical clusters, strong in others). Naturalized concepts are maintained by independent mechanisms that happen to converge, not by innate primitives. Section <a href="#subsec:essence" data-reference-type="ref" data-reference="subsec:essence">11.1</a> provides the detailed philosophical contrast.

**How can this be operationalized without drowning in complexity?** The apparatus is complex, but so is the phenomenon. The framework provides: (1) executable codebook with decision rules (Section <a href="#sec:codebook" data-reference-type="ref" data-reference="sec:codebook">8</a>), (2) worked examples showing how diagnostics apply (Section <a href="#sec:illustrations" data-reference-type="ref" data-reference="sec:illustrations">10</a>), (3) agreement summaries and coder/item effects rather than a single reliability gate, (4) two-layer measurement model that separates diagnostic evidence from realization patterns (Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a>). Complexity is managed through explicit protocols, not eliminated through oversimplification.

**What about languages like Salish with flexible lexical categories?** The framework handles this gracefully. Boundary crispness is itself a measurable property: languages with flexible lexical categories will show distributed weights across multiple lexical categories rather than sharp 0/1 contrasts in the $`M_L`$ matrix. For example, Salish lexical categories<sub>Sal</sub> might realize both <span class="smallcaps">noun</span><sub>Cross</sub>-like and <span class="smallcaps">verb</span><sub>Cross</sub>-like patterns with intermediate weights ($`w \approx 0.6`$, $`w \approx 0.5`$). This isn’t a defect; it’s accurate description. Semantic targets and discourse roles can still be diagnosed independently even when morphosyntactic category boundaries blur. See Section <a href="#sec:threats" data-reference-type="ref" data-reference="sec:threats">11</a> for fuller treatment.

**Doesn’t the firewall protocol create circularity?** No. The firewall blocks diagnostic circularity while leaving causal feedback available for modelling. At a time-slice, the two-layer model (Appendix <a href="#subsec:measurement-derivation" data-reference-type="ref" data-reference="subsec:measurement-derivation">13.4</a>) ensures diagnostics (Layer 1) never condition on forms (Layer 2), with independence enforced structurally via conditional independence: $`(F_{L,c,\phi} \perp d_{L,i} \mid \eta_{L,c})`$, where $`F_{L,c,\phi}`$ is a form observation, $`d_{L,i}`$ is a diagnostic observation, and $`\eta_{L,c}`$ is latent comparandum strength. Across acquisition and diachrony, Layer 2 can then model whether morphosyntactic forms stabilize, amplify, or transmit the diagnosed Level-I comparandum. The sequential order prevents circularity: first estimate the comparandum from independent behavioural evidence (scope, anaphora, presupposition tests), then estimate which forms realize it and whether those mappings improve prediction.

**Is this empirically tractable?** Yes, but demanding. Tractability requires: multi-year collaborative effort, systematic inter-rater training, phylogenetically stratified sampling (genealogically independent languages to test convergence claims), and potentially LLM-assisted coding with rigorous validation protocols. The theory identifies the empirical object; implementation requires coordinated effort comparable to WALS, Grambank, or APiCS. Pilot studies on 10–15 well-described languages can establish feasibility before scaling. Section <a href="#sec:threats" data-reference-type="ref" data-reference="sec:threats">11</a> acknowledges the labour-intensive nature while showing how modern tools can mitigate scaling challenges.

# Three-level framework and mapping

The three-level ontology becomes operational through explicit apparatus: mapping functions specify which language-specific forms realize which cross-linguistic comparanda, weight functions quantify the strength of these mappings, reliability scores track coding confidence. Four explicit guardrails (Rules A–D) prevent the category–function conflation, meaning–form collapse, and construction-label shortcuts diagnosed above.

For each language $`L`$, the analyst specifies which forms (morphosyntactic, prosodic, constructional) realize which cross-linguistic comparanda, how strongly they do so, and how much confidence those judgements deserve, thereby mapping comparanda to the concrete forms that languages deploy.

The procedure starts with a working hypothesis: an inventory $`\mathcal{C}`$ of Level-I and Level-II comparanda (functions, categories, semantic targets, discourse roles, and decomposed profiles). This inventory is provisional: it grows as more languages are studied. Naturalization claims (Section <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a>) are model-relative: they hold with respect to a fixed version of $`\mathcal{C}`$ and may require re-evaluation if $`\mathcal{C}`$ is updated. For each language $`L`$, I then define a mapping
``` math
f_L : \mathcal{C} \rightarrow \mathcal{P}(\mathrm{Forms}_L),
```
together with a weight function $`w_L : \mathcal{C} \times \mathrm{Forms}_L \to [0,1]`$ and an optional reliability score $`r_L(c, \phi) \in [0,1]`$, where $`c`$ ranges over comparanda and $`\phi`$ over forms. Here $`\mathcal{P}`$ denotes the power set, and $`\mathrm{Forms}_L`$ ranges over heterogeneous realizations: lexical categories (e.g., <span class="smallcaps">determinative</span><sub>Eng</sub>), phrasal categories (e.g., <span class="smallcaps">NP</span><sub>Eng</sub>), morphological systems (e.g., number marking<sub>Eng</sub>), and prosodic or constructional patterns (e.g., left-peripheral NP with topic intonation<sub>Eng</sub>). Constructional patterns can occupy columns, but a construction label can occupy a row only as a decomposed profile with its component diagnostics visible. The weight function $`w_L : \mathcal{C} \times \mathrm{Forms}_L \to [0,1]`$ has support restricted to $`f_L`$: $`w_L(c,\phi) > 0 \Rightarrow \phi \in f_L(c)`$, and $`f_L(c) = \{\phi : w_L(c,\phi) > 0\}`$. For any $`L`$ and $`c`$, $`f_L(c)`$ is finite.

Intuitively, $`f_L(c)`$ lists the language-internal forms that realize comparandum $`c`$, $`w_L(c, \phi)`$ measures how strongly form $`\phi`$ realizes $`c`$ in language $`L`$, and $`r_L(c, \phi)`$ records how trustworthy the judgement is (inter-annotator agreement, corpus counts, experimental evidence).

**Decision procedures for weight assignment.**<span id="subsec:weight-procedures" label="subsec:weight-procedures"></span> The weight function $`w_L`$ requires explicit decision procedures to make analyst judgements inspectable and comparable. I distinguish two complementary notions:

Observational weight
A frequency ratio from corpus counts after diagnostic contexts have been coded under the study codebook: $`w_L^{\text{obs}}(c,\phi) = \frac{\text{\# contexts where } \phi \text{ realizes } c}{\text{\# diagnostic contexts for } c}`$. The ratio is reproducible only relative to those coding decisions: analysts still have to decide which contexts count as diagnostic for $`c`$, and those decisions require provenance, coder-disagreement records, and sensitivity checks. This measure is less discretionary than a global judgement, not free of judgement. Large parallel corpora with controlled contexts are rare for most language families.

Analyst weight
A provisional rating based on diagnostic strength: $`w_L^{\text{prov}}(c,\phi)`$ is assigned via a four-point ordinal scale that reflects sensitivity and precision. These serve as initial values for the measurement model, which refines them using cross-linguistic diagnostic patterns. The anchors below are coding conveniences, not universal cutoffs (Cowart 1997; Schütze 2016):

- Provisional 1.0: Canonical exponent: $`\phi`$ appears in $`\geq 90\%`$ of diagnostic contexts where $`c`$ is independently diagnosed, with high precision ($`\geq 80\%`$ of $`\phi`$ occurrences signal $`c`$)

- Provisional 0.7: Strong secondary: $`\phi`$ appears in $`60\%`$–$`90\%`$ of diagnostic contexts, moderate precision ($`50\%`$–$`80\%`$)

- Provisional 0.4: Weak correlate: $`\phi`$ appears in $`30\%`$–$`60\%`$ of diagnostic contexts, low precision ($`20\%`$–$`50\%`$)

- 0.0: Absent or irrelevant: $`\phi`$ appears in $`<30\%`$ of diagnostic contexts or precision $`<20\%`$

Provisional weights initialize the measurement model’s priors on realization parameters $`\kappa_{L,c,\phi}`$ and $`\lambda_{c,\phi}`$ (the baseline form tendency and the comparandum-loading parameter; Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a>). Final weights $`w_L(c,\phi)`$ are posterior-derived conditional probabilities with uncertainty quantification. Coder agreement is reported descriptively and enters the model through coder, item, and diagnostic effects; low agreement triggers diagnostic review or sensitivity analysis rather than automatic exclusion.

The weight function has a precise interpretation that ensures identifiability. Specifically, $`w_L(c,\phi)`$ measures the probability that form $`\phi`$ is realized when comparandum $`c`$ is at maximum standard strength ($`\eta_{L,c}=+1`$, where $`\eta`$ is the latent comparandum-strength scale). An auxiliary quantity $`q_L(c,\phi)`$ captures the false-positive rate: the baseline probability when the comparandum is absent ($`\eta_{L,c}=0`$). This two-parameter characterization preserves the HPC intuition that multiple forms can independently realize the same comparandum with high weights simultaneously; there’s no sum-to-1 constraint forcing competition. Identifiability is ensured by anchoring the latent scale (fixing $`\text{Var}(\eta)=1`$ and $`\mathbb{E}[\eta]=0`$), enforcing monotonicity ($`\lambda_{c,\phi} \geq 0`$), and separating diagnostic evidence from realization evidence (Level III). The two-layer measurement model (Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a>) estimates $`w_L`$ as a derived quantity from structural parameters $`\kappa`$ and $`\lambda`$, avoiding circularity by ensuring diagnostics never condition on Level III forms.

In practice, implementations should use **triangulation**: start with analyst weights for pilot coding, validate against corpus-based observational weights where available, then refine via measurement model estimation. When analyst provisional weights and model posteriors conflict, the codebook should state how reliability and partial pooling are balanced; Appendix <a href="#subsec:weight-combination" data-reference-type="ref" data-reference="subsec:weight-combination">13.2</a> gives one explicit combination rule. The reliability score $`r_L(c,\phi)`$ captures residual uncertainty with transparent anchors: experimental or corpus-validated evidence receives the highest reliability; high coder agreement with low item uncertainty receives a lower but still strong score; contested evidence receives an intermediate score; sparse evidence or single-coder judgements receive low reliability. Agreement statistics such as Cohen’s $`\kappa`$ can be reported as descriptive summaries, but they don’t validate a mapping by themselves. Disagreement is part of the data: it may reveal coder bias, diagnostic ambiguity, or a comparandum whose profile is genuinely thin or unstable. This explicit procedure ensures $`w_L`$ is auditable and reproducible. In practice, $`r_L`$ is mandatory for contested mappings and for any mappings used in naturalization claims (Sections <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a>–<a href="#sec:predictions" data-reference-type="ref" data-reference="sec:predictions">9</a>); it may be omitted for uncontroversial, expository cells (treated as missing data, not as $`r=1`$).

These mappings are represented in a sparse matrix $`M_L`$ whose rows index comparanda $`c \in \mathcal{C}`$, columns index forms $`\phi \in \mathrm{Forms}_L`$, and cells record $`w_L(c,\phi)`$. $`M_L`$ is a denotational representation of $`f_L`$ and $`w_L`$, not a distinct theoretical object. In the matrices, function rows and category rows are kept distinct; profile rows specify their component diagnostics. No row ever mixes a function with a category label or uses a language-internal construction label as an unanalyzed comparandum.

Preventing the conflation errors diagnosed in Section <a href="#sec:naturalization-question" data-reference-type="ref" data-reference="sec:naturalization-question">1</a> requires four guardrails:

1.  **No category/function collapse.** Level-II comparanda never appear in the Level III columnar inventory. The mapping across levels proceeds via $`f_L`$; <span class="smallcaps">subject</span><sub>Cross</sub> and <span class="smallcaps">subject</span><sub>Eng</sub> don’t appear in the same column. Concretely, this means a Level II label such as “subject (comparative function)” isn’t listed as a Level III form.

2.  **No meaning-as-form identity.** Level I targets and roles correlate with and motivate morphosyntax; they don’t constitute it. Diagnostics for <span class="smallcaps">definiteness</span><sub>Cross</sub>, <span class="smallcaps">specificity</span><sub>Cross</sub>, or <span class="smallcaps">topic</span><sub>Cross</sub> are behavioural (anaphora, scope, continuity), not definitional via articles, differential object marking, or topic particles.

3.  **Cross-level couplings are bridge hypotheses.** A Level-I comparandum may stabilize a Level-II function or profile, and a Level-II or Level-III system may make that Level-I comparandum more learnable, detectable, or transmissible. Such relations are entered as mappings, weights, provenance claims, or predictions. They don’t define either side.

4.  **No construction-label shortcuts.** Language-internal constructions occupy Level-III columns. Cross-linguistic construction labels are admissible only as decomposed profiles: a Level-I comparandum, a Level-II syntactic profile or category, and the Level-III realizations that carry the bundle in particular languages.

Rows corresponding to comparanda and columns to Level III realizations record the mapping in a matrix $`M_L`$. Table <a href="#tab:matrix" data-reference-type="ref" data-reference="tab:matrix">2</a> shows a fragment for English and Japanese. Rows enable cross-linguistic comparison: each language satisfies the same comparandum with different grammatical resources (e.g., <span class="smallcaps">definiteness</span><sub>Cross</sub> is realized via definite determinatives, proper names, and referential pronoun uses in English, versus demonstratives and topic-marked NPs in Japanese). Columns support language-internal analysis: a single lexical category participates in multiple comparanda (e.g., English determinatives contribute to both <span class="smallcaps">determiner</span><sub>Cross</sub> and <span class="smallcaps">definiteness</span><sub>Cross</sub>).

<div id="tab:matrix">

| Comparandum $`c`$ | Form $`\phi`$ in English | $`w_{\mathrm{Eng}}`$ ($`q`$) | Form $`\phi`$ in Japanese | $`w_{\mathrm{Jpn}}`$ ($`q`$) |
|:---|:---|:---|:---|:---|
| <span class="smallcaps">definiteness</span><sub>Cross</sub> | Referential pronoun use<sub>Eng</sub> | $`\sim`$<!-- -->0.85 (0.1) | Demonstrative<sub>Jpn</sub> | $`\sim`$<!-- -->0.9 (0.1) |
|  | Determinative<sub>Eng</sub> (*the*) | $`\sim`$<!-- -->0.7 (0.2) | Bare NP<sub>Jpn</sub> (context) | $`\sim`$<!-- -->0.4 (0.4) |
| <span class="smallcaps">topic</span><sub>Cross</sub> | Left-peripheral NP<sub>Eng</sub> + intonation | $`\sim`$<!-- -->0.7 (0.2) | XP<sub>Jpn</sub> + *wa* | $`\sim`$<!-- -->0.9 (0.1) |
| <span class="smallcaps">modifier</span><sub>Cross</sub> | Adjective phrase<sub>Eng</sub> | $`\sim`$<!-- -->0.8 (0.2) | Relative clause<sub>Jpn</sub> | $`\sim`$<!-- -->0.9 (0.1) |
|  | Relative clause<sub>Eng</sub> | $`\sim`$<!-- -->0.7 (0.2) |  |  |

Fragment of $`M_L`$ for English ($`L = \mathrm{Eng}`$) and Japanese ($`L = \mathrm{Jpn}`$), illustrating the framework with approximate values. Weights are conditional probabilities $`w_L(c,\phi)=\Pr(F_{L,c,\phi}=1 \mid \eta_{L,c}=1)`$ from the two-layer model (Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a>). False-positive rates $`q_L(c,\phi)=\Pr(F_{L,c,\phi}=1 \mid \eta_{L,c}=0)`$ shown in parentheses. Multiple realizations per comparandum reflect HPC clustering; rows need not sum to 1.

</div>

Populating $`M_L`$ follows the protocol from Section <a href="#subsec:weight-procedures" data-reference-type="ref" data-reference="subsec:weight-procedures">[subsec:weight-procedures]</a>: define the comparandum, list the diagnostics, score each candidate realization based on sensitivity and precision, record provenance, and model coder uncertainty. Weights reflect empirical coverage: high weights mark forms that reliably appear when the comparandum is diagnosed and reliably signal it when they appear; lower weights mark weaker or context-dependent correlations. Level-I rows never contain language-specific meanings but only point to the forms that realize those meanings; language-internal forms occupy only the columns.

Constructional profiles add one step. Before coding a row such as <span class="smallcaps">caused-motion</span><sub>Cross</sub>, the codebook has to state the Level-I caused-motion target, the Level-II argument-structure diagnostics, and the candidate Level-III realizations. The same applies to interrogatives: <span class="smallcaps">questionhood</span><sub>Cross</sub> and <span class="smallcaps">interrogative clause</span><sub>Cross</sub> can be coupled, but English inversion, Japanese *ka*, and Mandarin *ma* are mappings to be measured, not definitions of either comparandum.

Filled matrices for a language sample enable quantifying typological structure: trade-offs (negative correlations among weights), regeneration pathways (diachronic shifts where one realization’s weight rises as another falls), and naturalization prospects (whether a comparative lexical category like <span class="smallcaps">noun</span><sub>Cross</sub> shows consistently high, multi-cue weights across independent families). Sections <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a>–<a href="#sec:predictions" data-reference-type="ref" data-reference="sec:predictions">9</a> develop these analyses. The same matrix supports language-specific work (scanning down a column reveals how <span class="smallcaps">determinative</span><sub>Eng</sub> participates in multiple comparanda) and cross-linguistic comparison (scanning across a row reveals which constructions satisfy <span class="smallcaps">topic</span><sub>Cross</sub>).

# Syntax–semantics firewall

Semantic targets and discourse roles (<span class="smallcaps">definiteness</span><sub>Cross</sub>, <span class="smallcaps">specificity</span><sub>Cross</sub>, <span class="smallcaps">genericity</span><sub>Cross</sub>, <span class="smallcaps">agentivity</span><sub>Cross</sub>, <span class="smallcaps">topic</span><sub>Cross</sub>, <span class="smallcaps">focus</span><sub>Cross</sub>) raise a parallel problem. Testing these Level-I comparanda independently of any particular morphological or syntactic form prevents collapsing semantic targets into their morphosyntactic realizations or treating morphological presence as definitional of semantic targets.

The firewall protocol verifies Level-I comparanda through behavioural diagnostics independently of any morphosyntactic exponent before recording mappings in $`M_L`$. This workflow parallels established semantic fieldwork methods that separate meaning probes from morphosyntactic packaging (Matthewson 2004) and mirrors experimental acceptability-judgement practice in psycholinguistics, where interpretations are elicited before forms are analysed (Cowart 1997; Schütze 2016). Semantic targets and discourse roles are never treated as functions or categories but diagnosed independently and only then linked, via the mapping, to whichever functions, categories, or constructions express them in $`L`$. The link may later be analysed as a stabilizing coupling, but it can’t supply the diagnostic evidence that licensed the Level-I comparandum in the first place.

Constructional labels apply the same discipline to composite bundles. For <span class="smallcaps">caused-motion</span><sub>Cross</sub>, the caused-motion target is diagnosed from event interpretation before the analyst asks which argument-structure profile or construction realizes it. For <span class="smallcaps">interrogative clause</span><sub>Cross</sub>, questionhood is diagnosed separately from the clausal category and from language-internal markers. The firewall therefore blocks two shortcuts at once: meaning can’t define form, and a familiar construction label can’t define the whole profile.

Operationalizing Level-I comparanda through comparative concepts with diagnostics that don’t presuppose any single exponent (Rule B) provides the solution: test the comparandum directly in specific discursive contexts, then record which forms express it in each language. For each Level I comparandum $`c`$ (semantic target or discourse role) and each language $`L`$, follow the five-step protocol illustrated in Figure <a href="#fig:workflow" data-reference-type="ref" data-reference="fig:workflow">1</a>, which enforces independence of diagnostics from morphosyntactic form.

<figure id="fig:workflow" data-latex-placement="t">

<figcaption>Anti-circularity workflow: diagnose the Level-I comparandum (steps 1–2) before examining forms (steps 3–5). The firewall between steps 2 and 3 enforces diagnostic independence.</figcaption>
</figure>

## Why the firewall is non-negotiable

Rule B isn’t a convenience. It’s a necessity. Conflate meanings with forms, and comparison becomes impossible. You measure articles, not definiteness. Case morphology, not agentivity. The exponent, not the semantic target.

This conflation has three problematic consequences:

First, it makes comparative concepts unfalsifiable. If <span class="smallcaps">definiteness</span><sub>Cross</sub> is defined as “having articles,” then any language without articles automatically “lacks definiteness,” regardless of discourse behaviour. The claim becomes a tautology, not an empirical hypothesis.

Second, it masks target persistence. Semantic targets like identifiability, scopal stability, and agentivity persist across languages even as their morphological exponents turn over. Articles grammaticalize from demonstratives, erode, and are rebuilt from new sources. Case systems collapse and are replaced by word order. Adjective lexical categories diffuse and are reconstituted from relative clauses. Identifying the target with a specific form misses this diachronic pattern.

Third, it generates spurious variation. Languages appear radically different when forms are compared (“Language X has no determinatives”), but similar at Level I when semantic targets are compared (“Language X realizes definiteness via demonstratives and classifiers”). The variation is real but mischaracterized. It’s variation in realization patterns, not in the underlying Level-I pressures.

The firewall enforces diagnostic independence by design. Stage 1 diagnoses the semantic target from behavioural diagnostics while withholding candidate forms. Stage 2 then models the coupling between that target and observable forms, including cases where forms help stabilize or transmit it over time. Figure <a href="#fig:dag" data-reference-type="ref" data-reference="fig:dag">2</a> visualizes this anti-circularity workflow.

## Coding protocol (excerpt)

For each Level I comparandum $`c`$ (semantic target or discourse role) and each language $`L`$, follow this five-step protocol that enforces independence of diagnostics from morphosyntactic form:

1.  **Establish diagnostic traction.** Verify that the behavioural tests for $`c`$ yield consistent judgements in $`L`$. Select test contexts (minimal pairs, elicited examples, or corpus instances) where the diagnostics should discriminate. If diagnostics fail to yield stable judgements, either refine the tests or recognize that the comparative concept may not pick out a relevant comparandum for $`L`$.

2.  **Apply diagnostics to contexts.** Test specific NPs, clauses, or discourse contexts using the operational criteria from the diagnostic battery below. Score strength of evidence: 0 = diagnostic fails; 1 = weak evidence; 2 = moderate; 3 = strong. Record which diagnostics apply and which don’t. This step is conducted without consulting morphosyntactic form; judgements are based on interpretation, discourse behaviour, and entailment patterns.

3.  **Identify candidate realizations.** Only after completing step 2, survey which morphosyntactic, prosodic, or constructional forms correlate with contexts where $`c`$ was independently diagnosed. Look for forms $`\phi`$ that appear systematically in contexts scoring high on the diagnostics.

4.  **Score mappings.** For each form $`\phi`$ that correlates with $`c`$, assign $`w_L(c, \phi)`$ based on:

    - **Sensitivity (recall):** Does $`\phi`$ appear consistently when $`c`$ is diagnosed? High sensitivity means $`\phi`$ is a reliable marker.

    - **Precision (positive predictive value):** When $`\phi`$ appears, does it reliably signal $`c`$? Forms may serve multiple comparanda simultaneously. Referential personal-pronoun uses in English, for example, often contribute to <span class="smallcaps">definiteness</span><sub>Cross</sub> while also carrying person, number, gender, and case contrasts; dummy *there*, weather *it*, generic *one*, and interrogative *who* have to be coded as false positives or excluded by the codebook. Precision is evaluated separately for each comparandum. In an illustrative codebook, very high precision can initialize $`w_L^{\text{prov}} = 1.0`$; the measurement model (Appendix <a href="#sec:measurement" data-reference-type="ref" data-reference="sec:measurement">13</a>) refines this via posterior estimation, reporting $`w_L`$ as $`\Pr(\phi \mid \eta_{L,c}=1)`$ with credible intervals.

    - **Contextual strength:** How robustly does $`\phi`$ correlate with $`c`$ across contexts?

    - **Record false positives.** For each form $`\phi`$, document contexts where $`\phi`$ appears but $`c`$ is diagnosed as absent. This estimates $`q_L(c,\phi)`$, the false-positive rate, which distinguishes high-precision exponents ($`w_L \gg q_L`$) from ambiguous forms ($`w_L \approx q_L`$).

    See Section <a href="#subsec:weight-procedures" data-reference-type="ref" data-reference="subsec:weight-procedures">[subsec:weight-procedures]</a> for the formal weight-assignment procedure.

5.  **Record provenance and reliability.** Note whether the mapping is uncontroversial (leave $`r_L`$ unspecified), contested ($`r_L`$ mandatory, with coder disagreement retained), or experimentally validated ($`r_L`$ based on corpus evidence or acceptability judgements). Document which diagnostics were used, how scoring decisions were made, and whether disagreement clusters by coder, diagnostic, or language family.

**Step 2 precedes step 3**: Level-I comparanda aren’t inferred from morphological exponents. This prevents circular reasoning and guards against cluster-reduction and false-universalization errors.

## Diagnostic battery (excerpt)

<span id="ex:definiteness-cc" label="ex:definiteness-cc"></span> **<span class="smallcaps">Definiteness</span><sub>Cross</sub>** (identifiability/uniqueness/familiarity): Test anaphoric uptake, uniqueness inferences (bridging), anti-novelty contexts. Don’t infer from articles.

<span id="ex:specificity-cc" label="ex:specificity-cc"></span> **<span class="smallcaps">Specificity</span><sub>Cross</sub>** (scopal stability): Test wide scope over negation/modals, choice-function readings. Avoid equating with DOM.

<span id="ex:genericity-cc" label="ex:genericity-cc"></span> **<span class="smallcaps">Genericity</span><sub>Cross</sub>**: Distinguish kind-reference from characterizing predication. Don’t conflate with bare-noun syntax.

<span id="ex:agentivity-cc" label="ex:agentivity-cc"></span> **<span class="smallcaps">Agentivity</span><sub>Cross</sub>**: Use proto-agent diagnostics (*deliberately*, clefts, purpose clauses). Don’t equate with subjecthood.

<span id="ex:caused-motion-cc" label="ex:caused-motion-cc"></span> **<span class="smallcaps">Caused-motion</span><sub>Cross</sub>**: Diagnose caused change of location or path from event entailments and paraphrase tests. Don’t infer it from an English-style verb–NP–PP pattern.

These diagnostics serve as checklists, not proxies for grammatical exponents. The full codebook (excerpt in Section <a href="#sec:codebook" data-reference-type="ref" data-reference="sec:codebook">8</a>) provides complete specifications.

<figure id="fig:dag" data-latex-placement="ht">

<figcaption>Two-layer hierarchical model. <strong>Layer 1</strong>: diagnostic items <span class="math inline"><em>d</em><sub><em>L</em>, <em>i</em></sub></span> modelled via latent strength <span class="math inline"><em>η</em><sub><em>L</em>, <em>c</em></sub></span> with coefficient <span class="math inline"><em>β</em><sub><em>i</em></sub></span>, intercept <span class="math inline"><em>α</em><sub><em>i</em></sub></span>, and random effects <span class="math inline">(<em>u</em><sub>fam</sub>, <em>v</em><sub>coder</sub>)</span> and measurement error <span class="math inline"><em>ε</em><sub><em>i</em></sub></span>. <strong>Layer 2</strong>: observed forms <span class="math inline"><em>F</em><sub><em>L</em>, <em>c</em>, <em>ϕ</em>, <em>t</em></sub></span> modelled via form parameters <span class="math inline">(<em>κ</em><sub><em>L</em>, <em>c</em>, <em>ϕ</em></sub>, <em>λ</em><sub><em>c</em>, <em>ϕ</em></sub>)</span> and measurement error <span class="math inline"><em>ε</em><sub><em>F</em></sub></span>. Derived quantities <span class="math inline"><em>w</em><sub><em>L</em></sub>, <em>q</em><sub><em>L</em></sub></span> are deterministic functions of Layer 2 parameters. Hyperpriors: <span class="math inline"><em>σ</em><sub>fam</sub>, <em>σ</em><sub>coder</sub> ∼ Exp(1)</span>. Identifiability: <span class="math inline">𝔼[<em>η</em>] = 0</span>, <span class="math inline">Var(<em>η</em>) = 1</span>, <span class="math inline"><em>λ</em><sub><em>c</em>, <em>ϕ</em></sub> ≥ 0</span>. Anti-circularity: <span class="math inline"><em>F</em> ⟂ <em>d</em> ∣ <em>η</em></span>.</figcaption>
</figure>

An illustrative failure case shows what happens when the firewall is violated and how the naturalization tests bite. The slogan “feminine personal names end in /a/” collapses comparandum and exponent: it bundles the Level-I comparandum (feminine personal name) with a specific phonological realization (final /a/), leaning on Latin orthography to make the pattern look universal.

Applying the failure modes (Section <a href="#sec:naturalized" data-reference-type="ref" data-reference="sec:naturalized">6</a>):

1.  **Too fat**: the slogan lumps together unrelated exponents. Feminine names that end in /a/ because of Romance declension suffixes, Turkish vowel harmony, or Arabic templatic vocoids get treated as one “kind,” even though the diagnostics and mechanisms differ.

2.  **Negative projectibility**: the correlation evaporates outside Indo-European and neighbouring families. Masculine /a/ names (Arabic, Bantu) and feminine names without /a/ (Hebrew, Mandarin, English monosyllables) are plentiful; held-out families invalidate the prediction.

3.  **Areal/mechanistic failure**: the clustering flows from genealogical inheritance and areal diffusion (Mediterranean Sprachbund, IE suffixal pathways) rather than recurrent cognitive or discourse pressures that would regenerate the pattern independently.

The three-level apparatus handles this cleanly. Define <span class="smallcaps">feminine personal name</span><sub>Cross</sub> as the comparandum and record how each language realizes it: final low vowel ($`w_{\mathrm{IE}} \approx 0.8`$), feminine declension suffix ($`w_{\mathrm{Lat}} = 1.0`$), onomastic formatives ($`w_{\mathrm{Sem}} \approx 0.6`$), prosodic frames, or no dedicated marking ($`w_{\mathrm{Eng}} = 0`$ for most names).

The diagnostic question (“Does the name trigger feminine agreement, carry feminine social indexing?”) stays separate from the realization question (“Which forms signal it?”), satisfying Rule B and the firewall protocol. The comparison remains useful within the genealogical/areal zone where the weights are high, but it never meets the criteria for naturalized status. Why? It’s too fat, shows negative projectibility, and lacks recurrent mechanisms.

# Naturalized comparative concepts

Section <a href="#sec:naturalization-question" data-reference-type="ref" data-reference="sec:naturalization-question">1</a> identified four questions that instrumentalism leaves unanswered: which comparanda show convergent stability (Obstacle 1), why do some patterns recur while others dissolve (Obstacle 2), how can real patterns be distinguished from artifacts (Obstacle 3), and when should a concept be demoted (Obstacle 4). This section answers those questions by specifying when comparative concepts earn naturalized status by identifying comparanda with kind-like stability.

Not all comparative concepts are created equal. Some are plausible candidates for naturalized status, foremost <span class="smallcaps">noun</span><sub>Cross</sub>. Noun-like profiles recur across genealogically and areally independent languages: they serve as argument heads, anchor possessive and quantificational constructions, take characteristic predication patterns, and participate in modification relationships. The question is whether this clustering is maintained well enough to support projection.

Some comparative concepts earn naturalized status when the comparanda they pick out behave like stable scientific kinds because independent mechanisms converge to maintain them. Think of the profiles as cross-linguistic habits that keep recurring, not as universal essences. Formally, a naturalized comparative concept identifies a weak homeostatic property cluster (HPC) kind operating at the cross-linguistic level: a profile of correlated properties, stabilized by mechanisms, and projectible for a stated analytic purpose. The mechanisms matter because they explain why projection is reliable; they aren’t the payoff by themselves. For naturalized linguistic concepts, these mechanisms include recurrent cognitive pressures (reference tracking, packaging for quantification), discourse pressures (argument realization), and convergent grammaticalization pathways. Section <a href="#subsec:mechanisms" data-reference-type="ref" data-reference="subsec:mechanisms">6.3</a> catalogues the full palette.

Not all comparative concepts achieve this status. Some proposed comparanda fail by undergeneration, others by overgeneration. I use too thin for the first pattern: the diagnostics don’t cluster strongly enough to support inference. I use too fat for the second: the comparandum bundles distinct profiles whose mechanisms and predictions diverge. When the profiles fail to stabilize, they can be tagged with explicit failure modes: too thin, too fat, negative (projectibility metrics underperform study-specific decision rules), or indeterminate (insufficient data from enough independent languages to judge).

## Failure modes in practice

A study first defines the languages with adequate evidence for the comparandum under investigation. Languages outside that evidential set are indeterminate; they don’t count as negative cases. Within the adequately evidenced set, a comparandum is too thin if too few languages show the relevant diagnostic cluster. It is too fat if the proposed comparandum bundles several weakly related profiles rather than a unified one. It is negative if it fails to project to held-out families under the study’s declared decision rule. The formal versions of these decision rules are given in Appendix <a href="#subsec:failure-formal" data-reference-type="ref" data-reference="subsec:failure-formal">13.1</a>; the main point is that demotion is explicit, evidentially conditioned, and revisable.

## Criteria for naturalization

Not all comparative concepts qualify for naturalized status. The bar should be high, since the aim is to avoid reinstating the spurious universals this paper is trying to eliminate.

A comparative concept earns promotion when three conditions jointly obtain:

1.  **Cross-linguistic clustering:** Diagnostics cluster reliably across $`\geq 3`$ genealogically independent families (e.g., Indo-European + Niger-Congo + Sino-Tibetan). It isn’t enough for <span class="smallcaps">noun</span><sub>Cross</sub>-like patterns to appear in Indo-European alone; evidence from independent evolutionary histories is needed.

2.  **Identifiable mechanisms:** Cognitive, diachronic, or discourse pressures are specified with concrete evidence: grammaticalization pathways, acquisition trajectories, and documented discourse asymmetries. Appeals to “communicative need” or “cognitive salience” don’t suffice without evidence.

3.  **Predictive purchase:** Mechanisms generate falsifiable predictions about erosion, regeneration, or trade-offs that can be tested on held-out data. Without falsifiable consequences, the explanation isn’t doing real work.

Failure to qualify triggers explicit tagging:

- **Too thin:** diagnostics systematically fail in well-described non-European languages (covariation may be artifact of shared descriptive templates)

- **Genealogical artifact:** stability traces to transmitted descriptive frameworks, not convergent evolution

- **Indeterminate:** no plausible mechanisms can be articulated despite searching

- **Too fat:** comparandum overgenerates, lumping distinct phenomena

- **Negative:** projectibility metrics underperform the decision rules declared for the study

Reassessment becomes necessary when new evidence challenges previously naturalized status: broader sampling reveals the pattern was genealogically/areally restricted; proposed mechanisms prove inadequate; or predictions fail in held-out data. In such cases, the classification is revised: downgrade from naturalized to instrumental comparative concept, restrict naturalization claims to specific families/areas, or tag with failure modes.

Naturalization is gradient, not binary. <span class="smallcaps">noun</span> likely sits near the strong pole: high property covariation across independent families, multiple reinforcing mechanisms (argument realization, possession, quantification interfaces), clear predictions about paradigm stability and regeneration. <span class="smallcaps">adjective</span> occupies a weaker position: the lexical category is absent or diffuse in many well-described languages, fewer convergent grammaticalization pathways, less stable clustering. <span class="smallcaps">subject</span><sub>Cross</sub> (function) remains contested: definitional circularity (is it a syntactic or semantic notion?), radical variation in alignment systems, unclear whether mechanisms actually converge. Systems in which verbal marking tracks a preverbal topic position rather than an agentive argument make the problem especially clear: Remijsen notes that Dinka has number marking on the verb conditioned by the occupant of that position, and both Dinka and Shilluk have voice marking conditioned by the semantic role of that occupant, even when it isn’t the agentive subject (p.c.). Such cases force the analysis to ask whether <span class="smallcaps">subject</span><sub>Cross</sub> has syntactic projectibility beyond semantic-role and information-structural diagnostics. The framework provides the test; which concepts pass is an empirical matter.

## Homeostatic mechanisms: plural levels

What actually stabilizes linguistic categories? The answer is plural: there’s no single mechanism doing all the work (cf. Miller 2021).

Language-internal categories (like NP<sub>Eng</sub> or construct state<sub>Heb</sub>) draw on community-specific mechanisms. Cognitively, acquisition biases and processing constraints ensure children learning English converge on the NP early and consistently. Socially, community norms, prescriptive regimes, and interactional routines maintain categories: standard grammars codify noun paradigms, schools drill them, style guides police their use. Materially, literacy and orthography provide stabilization through written standards, dictionaries, grammatical treatises. Diachronically, regenerative pathways (grammaticalization, reanalysis, analogy) rebuild eroded exponents, with erosion-resistance from paradigmatic pressure and frequency-driven entrenchment.

Naturalized comparative concepts identify profiles that exhibit homeostasis through recurrent but independent mechanisms operating across unrelated languages. Cognitive and discourse pressures (reference tracking, individuation, quantificational packaging) are driven by stable communicative demands. Discourse-adaptive asymmetries (arguments versus predicating expressions, topics versus comments) flow from information structure. Convergent grammaticalization routes show similar pathways: demonstratives becoming articles, measure terms becoming classifiers, relational nouns becoming adpositions. Areal diffusion can amplify these patterns within contact zones.

The HPC claim is modest: each kind is stabilized by some specifiable mix of these forces, to be discovered through empirical investigation. The causal profile differs by level (community-bound for language-internal categories, globally recurrent for comparanda identified by naturalized comparative concepts), but both satisfy the homeostatic template.

Cross-level coupling is one subtype of this mix. Count-related individuation pressures can favour number morphology or classifier systems; once entrenched, those systems make individuation easier to signal and learn. The explanatory claim is diachronic and projective: coupling should improve predictions about regeneration, erosion, or compensatory structure.

# Convergent mechanisms

The biological analogy is a guide to the logic, not evidence for the linguistic claim. Camera eyes evolved independently in vertebrates and cephalopods, but both lineages converge on a lens, retina, and iris because stable design pressures make some profiles recurrent. The analogy matters only in this limited sense: independently developed systems can share a projectible profile when recurrent mechanisms keep rebuilding similar solutions.

The biological case depends on character-identity mechanisms: developmental processes in gene-regulatory networks that maintain structural sameness across evolutionary time and convergent evolution (Shubin et al. 2009; Wagner 2014). Even though vertebrate and cephalopod eyes evolved separately, they recruit conserved genetic systems (like Pax6 genes) that bias development toward similar solutions. The mechanisms don’t specify essences; they describe the processes that keep a biological structure recognizable despite variation.

The linguistic claim has to be tested in linguistic terms. I propose three recurrent mechanism types that can maintain category behaviour through diachronic change and cross-linguistic convergence:

Invariance
Stable mapping from discourse and processing demands (identifiability, individuation, possession, quantification) to privileged morphosyntactic slots. Nouns reliably serve as argument heads, anchor possessive constructions, and interface with quantification; this mapping persists even as specific morphology changes.

Cohesion
Paradigmatic and selectional pressures that penalize drift. Once a language establishes determinative or classifier systems, case or construct morphologies, and modification profiles, these co-adapted traits resist being pulled apart. Changing one element (say, losing case) creates pressure to compensate elsewhere (developing stricter word order).

Regeneration
Recurrent diachronic sources that rebuild eroded exponents. When articles weaken, languages reliably find new sources: demonstratives grammaticalize into definite markers (deixis $`\to`$ articles), measure terms become classifiers (measure terms $`\to`$ classifiers), verbs nominalize to create new argument-taking lexical categories (nominalizers $`\to`$ nouns). The pathways recur because discourse and processing pressures remain stable.

These mechanisms explain why <span class="smallcaps">noun</span> may behave as a stable HPC across genealogically independent languages without positing universals or essences. Noun-like lexical categories reflect convergent solutions to stable discourse demands (picking out entities, tracking reference, anchoring modification). The stability is real if the profile projects, but mechanisms maintain it; innate lexical categories or semantic primitives don’t mandate it.

These mechanisms can be formalized probabilistically, but the main typological claim is simpler. Invariance predicts that recurrent discourse and processing pressures make some realizations more likely than they would be by chance. Cohesion predicts that forms participating in one part of a profile are more likely to participate in related parts. Regeneration predicts that when an exponent erodes, languages with the relevant pressures tend to recruit new sources rather than leaving the profile permanently unexpressed. Appendix <a href="#subsec:mechanism-formal" data-reference-type="ref" data-reference="subsec:mechanism-formal">13.3</a> gives one way to write these claims as explicit probability statements.

Prior predictive validation provides a quality check for these probabilistic versions. One can simulate language histories under plausible ranges for the pressure effect, cohesion increment, and regeneration probability before fitting the model. If the simulated trajectories produce unrealistic timescales (e.g., grammaticalization in 50 years or 5000 years) or implausible pattern frequencies, the assumptions require revision. This workflow makes the theory’s causal commitments testable before the empirical analysis proper.

# A reproducible codebook (excerpt)

Measurement discipline requires operational codebooks: explicit diagnostics, decision rules, reliability checks, and provenance tracking for every comparandum (category, function, target, role, or decomposed profile) and every language-internal realization under evaluation. Below are five illustrative entries showing the required structure. A full implementation would provide this level of detail for every row in the comparanda inventory and every realization under evaluation.

**Comparative lexical category: <span class="smallcaps">noun</span><sub>Cross</sub>.** Diagnostics: argument-head privileges; possessive/quantificational interfaces; nominal morphology; predication and modification profiles. Record which language-internal lexical categories or morphemes realize it and how strongly (e.g., <span class="smallcaps">noun</span><sub>Eng</sub>, <span class="smallcaps">classifier</span><sub>Tha</sub>) and track the evidence supporting each weight. **Syntactic function: <span class="smallcaps">determiner</span><sub>Cross</sub>.** Diagnostics: dedicated articles (definite/indefinite); demonstratives in selectional use (not just deictic); classifier/measure structures licensing numeral combination; distribution of bare NPs in argument positions (<span class="smallcaps">subject</span><sub>Cross</sub>, <span class="smallcaps">object</span><sub>Cross</sub>). Code gradient strength 0–3 (absent $`\to`$ fully grammaticalized). Record specific exponents (morphemes, constructions) and genealogical provenance (inherited, newly developed, contact-induced).

**Semantic target: <span class="smallcaps">definiteness</span><sub>Cross</sub>.** Diagnostics: anaphoric uptake (can the referent be picked up by pronouns or repeated definites in subsequent discourse?); uniqueness and bridging (does the expression trigger “only one” inferences or support part-whole bridging?); anti-novelty contexts (is the expression blocked in presentational or existential constructions where the referent has to be new?). Code strength of evidence on a gradient scale. Don’t use presence of articles as a diagnostic criterion, since that would bake in the conflation this framework is trying to avoid.

**Discourse role: <span class="smallcaps">topic</span><sub>Cross</sub>.** Diagnostics: sentence-initial position preference; aboutness tests (“As for X, …”); persistence across discourse spans; compatibility with focus operators. Code strength 0–3. Record whether topic is marked morphologically (particles, case), positionally (preverbal/post-verbal), or both. Note interaction with <span class="smallcaps">subject</span><sub>Cross</sub> function and information structure.

**Decomposed profile: <span class="smallcaps">caused-motion</span><sub>Cross</sub>.** Components: Level-I caused-motion target (caused change of location or path); Level-II argument-structure profile (causer, theme, path or goal licensing, participant linking); Level-III realizations (<span class="smallcaps">caused-motion construction</span><sub>Eng</sub>, serial-verb pattern, applicative frame, lexical causative plus path phrase). Code the target and syntactic profile separately before scoring realizations. Don’t treat the English construction label as a diagnostic.

A valid implementation requires guarding against researcher degrees of freedom: the many analytic choices that can be made after seeing the data, including which diagnostics to retain, how to calibrate weight anchors, which languages to exclude as insufficiently described, how to collapse marginal realizations, and which performance metric to report. These choices are legitimate when declared in advance and checked for sensitivity, but they become a source of false projectibility when they’re tuned to make a preferred comparandum profile cluster more cleanly.

Diagnostics should be preregistered before examining cross-linguistic data; post-hoc selection capitalizes on chance. Weight-assignment criteria (when does a realization count as 0.5 vs. 0.7?) have to be established via training data, coder calibration, and model criticism; anchors chosen to maximize apparent clustering are circular.

Inter-rater agreement should be reported, but no single statistic should function as a validity gate. Disagreements are resolved through pre-specified decision rules when the issue is procedural, and otherwise retained as uncertainty in the measurement model. Cross-validation requires reserving a subset of languages (e.g., 20%) for held-out testing; failure to replicate in unseen data indicates overfitting.

Finally, coding needs to stratify by genealogy, area, and literacy availability (Section <a href="#sec:predictions" data-reference-type="ref" data-reference="sec:predictions">9</a>); ignoring these risks attributing mechanism-driven patterns to spurious correlates.

These precautions sketch an implementation blueprint. The present paper gives the framework and a compact worked demonstration; larger validation requires statistical methods for empirical work. The apparatus is designed for iterative refinement: initial codings reveal where diagnostics fail, prompting revision of the comparative concept or recognition that it lacks naturalized status. The specific metrics and models mentioned (agreement summaries, factor analysis, item-response theory (IRT)) are illustrative examples that operationalize the constraints, not mandatory choices; alternative methods meeting the same standards are equally admissible.

**Implementation realities for empirical collaborators.**<span id="subsec:implementation" label="subsec:implementation"></span> Since the paper is theoretical, I note these as collaboration prerequisites:

Populating $`M_L`$ for even 50 languages with required reliability is a multi-year, multi-researcher project. Using LLMs as research assistants is realistic but requires explicit prompt engineering, validation pipelines, and error propagation protocols.

Definiteness diagnostics (anaphoric uptake, bridging) require controlled discourse contexts that don’t exist for most languages. Adaptation to elicited narratives or corpus-mining heuristics is necessary, with validation against expert judgements.

Without large parallel corpora, observational weights (Section <a href="#subsec:weight-procedures" data-reference-type="ref" data-reference="subsec:weight-procedures">[subsec:weight-procedures]</a>) are impossible. Analyst weights based on diagnostic strength are necessary, reintroducing subjectivity. The reliability score $`r_L`$ should capture this uncertainty formally (e.g., as precision parameter in hierarchical model).

Predictions require phylogenetically controlled samples. Most typological databases (WALS, Grambank) don’t stratify this way; custom sampling is needed with explicit family/area controls.

Predictions about mechanism competition (Section <a href="#subsec:mechanism-predictions" data-reference-type="ref" data-reference="subsec:mechanism-predictions">9.1</a>) require stratifying by literacy availability. This demands historical sociolinguistic data often unavailable for minority languages.

The workload is tractable but requires coordinated empirical effort. The theory identifies the empirical object; implementation requires fieldwork expertise, computational infrastructure, and sustained collaboration.

# Predictions and risk

The homeostatic mechanisms catalogue (Section <a href="#subsec:mechanisms" data-reference-type="ref" data-reference="subsec:mechanisms">6.3</a>) generates testable cross-linguistic predictions about regeneration pathways, erosion rates, and realization trade-offs. The three-level mapping (Section <a href="#sec:matrix" data-reference-type="ref" data-reference="sec:matrix">4</a>) and firewall protocol (Section <a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a>) provide the measurement apparatus for testing these predictions. Each prediction has to be paired with a study-specific decision rule before held-out evaluation, but the statistical implementation is secondary. The typological claim is about what should recur, erode, regenerate, or trade off if the proposed mechanism is real.

A useful summary is the effective number of realizations: the inverse concentration of a comparandum’s weight across forms in a language. High values indicate stable clustering with multiple strong exponents; low values suggest fragile realization profiles. Naturalization predicts repeated redundancy across phylogenetically independent families for comparanda identified by legitimate naturalization candidates. Appendix <a href="#subsec:failure-formal" data-reference-type="ref" data-reference="subsec:failure-formal">13.1</a> gives the corresponding formula.

## Mechanism-specific predictions

The mechanism-driven account shouldn’t merely restate familiar correlations. It has to predict the causal processes that maintain the profile. Three examples illustrate the stronger form of prediction:

Regeneration pathway prediction
If a language loses articles, the specific grammaticalization source of the replacement is predictable from the language’s existing typological profile. Languages with well-developed classifier systems should favour demonstrative-article pathways, since classifiers already package quantification, while languages with strong possessive morphology should favour genitive-article pathways. The empirical test is whether documented replacement sources are predicted better from the typological profile than from genealogy alone.

Mechanism competition prediction
In oral traditions, cognitive pressures (reference tracking, memory constraints) should dominate the noun HPC; in literate traditions, material pressures (orthographic conventions, prescriptive grammars) should dominate. This predicts different erosion/stability profiles: oral-tradition languages should show faster regeneration of eroded exponents, because cognitive pressure remains constant, but also faster erosion, because no orthographic system stabilizes the forms. Literate languages should show slower movement in both directions.

Co-adaptation trade-off
When a language loses case morphology, it should compensate with stricter word order and/or increased use of adpositions. The timing should be correlated: languages that lose case without developing stricter order should show higher rates of lexical-category reanalysis, including erosion of noun–adjective distinctions.

These predictions are mechanistic: they test not just correlations but the causal processes (regeneration pathways, competition between cognitive vs. material pressures, co-adaptive compensation) that maintain homeostasis. Numerical performance targets belong in the study codebook, where they can be calibrated with power analysis, posterior predictive checks, and sensitivity analysis before held-out evaluation.

## Original baseline predictions

The following five predictions serve as baseline tests of the framework’s empirical coverage:

1.  **Trade-off (syntactic functions + semantic targets)**: strength of classifier systems negatively correlates with obligatory number morphology on common nouns. Classifiers realize both <span class="smallcaps">determiner</span><sub>Cross</sub> and <span class="smallcaps">individuation</span><sub>Cross</sub>; number morphology realizes <span class="smallcaps">individuation</span><sub>Cross</sub> independently.

2.  **N–A separation (lexical categories + syntactic functions)**: strong possessive/construct morphology predicts sharper <span class="smallcaps">noun</span><sub>Cross</sub>–<span class="smallcaps">adjective</span><sub>Cross</sub> predication splits.

3.  **Regeneration (syntactic functions)**: weakening of article systems for <span class="smallcaps">determiner</span><sub>Cross</sub> is followed by increased use of nominalisers and possessive frames for the same function.

4.  **Semantics decoupling (semantic targets)**: languages without articles still show strong <span class="smallcaps">definiteness</span><sub>Cross</sub> effects; correlation between article presence and <span class="smallcaps">definiteness</span><sub>Cross</sub> weakens once the firewall is enforced.

5.  **Specificity non-identity (semantic targets)**: presence of differential object marking improves detectability of <span class="smallcaps">specificity</span><sub>Cross</sub> but is neither necessary nor sufficient for it; wide-scope diagnostics sometimes diverge from differential object marking.

Construction-label decomposition supplies a related test. For profiles such as <span class="smallcaps">caused-motion</span><sub>Cross</sub>, a model that separates the Level-I target from the Level-II argument-structure profile should predict held-out realization patterns better than a model that treats <span class="smallcaps">caused-motion construction</span><sub>$`L`$</sub> as a single unanalyzed label. Failure of the decomposed model is informative: the profile may be too fat, or the proposed coupling between event semantics and argument structure may be weaker than assumed.

These tests represent pragmatic, illustrative checks on the framework’s predictions. A fully integrated empirical study would translate them into preregistered diagnostics, held-out evaluation, and uncertainty estimates. The point in the main text is the typological direction of the predictions; the exact statistical targets belong to the implementation stage.

Literate traditions introduce additional stabilizers (orthographic standards, grammars, dictionaries, and metalinguistic policing) that can amplify or arrest these trajectories. Oral settings rely more heavily on cognitive and interactional mechanisms, so regeneration and erosion may proceed differently across the predictions above. Any measurement agenda therefore has to stratify by the availability of literacy-driven mechanisms when estimating homeostatic strength.

# Worked illustrations

The framework is easiest to see where a familiar descriptive label bundles several independently recognizable diagnostics. English past participle is a useful guide case. In a form such as *written*, the label supports several projections: selection by perfect *have*, selection by passive *be* or *get*, headship in non-finite clauses modifying nouns, and adjacency to deverbal adjectives such as *the written report*. From the language-internal label alone, an analyst can infer a sizeable distributional profile.

The Haspelmathian question isn’t whether other languages “have past participles”. That would project an English descriptive label onto other grammars. The comparative question is whether the diagnostics bundled in English are bundled by one language-internal form, split across several forms, or organized by different constructions. Table <a href="#tab:participle-decay" data-reference-type="ref" data-reference="tab:participle-decay">3</a> sketches the coding logic.

<div id="tab:participle-decay">

| Language | Realization pattern | Projectibility consequence |
|:---|:---|:---|
| English | One language-internal label bundles perfect selection, passive selection, non-finite modification, and deverbal-adjective adjacency | High projection: knowing the form is a past participle licenses several further expectations |
| German | The verbal participial cluster is preserved for perfect and passive uses, but prenominal extended modifiers carry adjectival agreement | Projection holds for the verbal cluster but weakens at the modifier/adjective boundary |
| French | The verbal participial cluster is preserved, with visible agreement in contexts where the form participates in adjectival or modifier-like uses | Projection holds, but agreement makes part of the verbal/adjectival boundary overt |
| Modern Greek | Perfect uses an invariant form after *ekho*; passive is finite mediopassive; *-menos* forms behave as agreeing deverbal adjectives | Projection decays sharply: the invariant form predicts perfect use, while adjectival projections belong to a different form |
| Japanese | Past/perfective, passive, resultative/progressive, and prenominal modification are distributed across distinct forms and constructions | No single language-internal form carries the English bundle; projection from the English label fails |

Projection decay in a participial-label comparison. The table is a compact demonstration of the coding workflow; a database implementation would attach source citations, diagnostics, and reliability values to each cell.

</div>

This miniature case uses the full matrix procedure. The rows are comparanda: perfect-related selection, passive-related selection, non-finite modification, and deverbal-adjective adjacency. The columns are language-internal realizations. The naturalization question is whether the bundle supports projection beyond the components. English supports broad projection; German and French support narrower projection; Greek splits the bundle; Japanese distributes the diagnostics across unrelated constructions. The prediction isn’t that every language will preserve the English cluster. The prediction is that when the cluster fractures, it should fracture along independently motivated selectional and distributional axes rather than randomly.

This example also shows why projectibility, not mere mechanism naming, is the payoff. If the English bundle were stabilized, then removing one stabilizer should reduce projection: if the modifier use is reanalysed as adjectival, the past-participle label should stop predicting modifier behaviour. If passive is routed through a finite mediopassive system, the label should stop predicting passive uses. The evidence that matters is the decay of inference from the label, not the existence of a label in a reference grammar.

Construction labels behave the same way. The question isn’t whether a language “has <span class="smallcaps">caused-motion construction</span><sub>$`L`$</sub>” in the English sense. The comparative question is whether a caused-motion target and an argument-structure profile cluster strongly enough to support projection across unrelated languages. A language may realize the target through an English-style construction, serial verbs, applicatives, a lexical causative plus a path phrase, or several weaker devices.

Interrogatives make the same point from the other direction. <span class="smallcaps">Questionhood</span><sub>Cross</sub> belongs to the Level-I side of the comparison, while <span class="smallcaps">interrogative clause</span><sub>Cross</sub> is a Level-II clausal category only if it has portable syntactic diagnostics. Particles, inversion, intonation, and word order are Level-III realizations. Treating any one of them as the definition of the construction would collapse the profile before it could be tested.

The same logic applies to the recurring claim that some languages lack adjectives. The present framework separates the <span class="smallcaps">modifier</span><sub>Cross</sub> function from language-internal adjective lexical categories. A language might realize <span class="smallcaps">modifier</span><sub>Cross</sub> through relative-clause constructions (1.0), property nouns (0.7), stative verbs in attributive frames (0.5), and no dedicated adjective lexical category (0). The function is realized; there’s no dedicated $`L`$-internal lexical category specialized for that function.

The framework then generates a testable prediction: in languages where there’s no dedicated $`L`$-internal lexical category specialized for the property-concept modifier function, that function should be realized via compensatory elaboration: more complex relative-clause syntax, richer nominal modification paradigms, or increased use of stative predicates in attributive position (Dixon 2004; Croft 2001).[^3]

The firewall’s payoff is clearest in the definiteness domain. Traditional surveys equate <span class="smallcaps">definiteness</span><sub>Cross</sub> with articles: languages *have* <span class="smallcaps">definiteness</span><sub>Cross</sub> if they grammaticalize *the*/*a*-type morphemes. But when <span class="smallcaps">definiteness</span><sub>Cross</sub> is tested independently using discourse diagnostics (anaphoric uptake, uniqueness, anti-novelty), the semantic target is still clearly present in languages lacking article systems (Lyons 1999; Matthewson 2004).

Salish languages, for instance, show strong <span class="smallcaps">definiteness</span><sub>Cross</sub> effects in referential behaviour despite having no definite articles. Semantic targets survive realization turnover: forms come and go (articles grammaticalize, erode, get replaced), but the discourse pressures driving identifiability marking remain stable.

The Salish example shows diagnostic independence with cross-level coupling. Article presence neither defines nor exhausts the target, while article systems can still stabilize and transmit identifiability routines where they exist. The firewall lets those claims be tested separately.

# Threats and replies

Section <a href="#sec:objections" data-reference-type="ref" data-reference="sec:objections">3</a> addressed five immediate objections. This section turns to deeper philosophical concerns that require fuller treatment.

How should genealogical inheritance (homology) be distinguished from independent convergent evolution (analogy) and contact-induced diffusion? All three produce cross-linguistic similarity, but the mechanisms differ. The framework handles this by tracking genealogical identity separately from comparandum similarity. When coding a language, the analyst records provenance: is this exponent inherited from a documented proto-form, newly developed within the attested history of the language, or borrowed from a contact neighbour? Areal patterns get flagged explicitly. Both genealogical and comparandum patterns are reported; the distinction matters for testing which homeostatic mechanisms are active (inherited paradigms versus recurrent grammaticalization versus diffusion).

Coder disagreement is part of measurement, not merely a nuisance to be eliminated. The codebook (Section <a href="#sec:codebook" data-reference-type="ref" data-reference="sec:codebook">8</a>) specifies decision rules, provenance fields, and reliability checks for every diagnostic. When coders disagree, adjudication proceeds at the comparandum level (category, function, target, role, or decomposed profile) using operational tests, not by negotiating familiar labels; unresolved disagreement remains in the uncertainty estimates. For example, if Coder A and Coder B disagree on whether an element is a <span class="smallcaps">determinative</span><sub>$`L`$</sub>, the resolution protocol asks: does it realize <span class="smallcaps">determiner</span><sub>Cross</sub>? Does it encode <span class="smallcaps">definiteness</span><sub>Cross</sub>? Does it participate in selectional restrictions? Does it combine with numerals? The diagnostics settle procedural disputes, while persistent disagreement signals a weaker or less stable profile.

For less-studied languages, absence of coded evidence isn’t negative evidence: a grammar sketch, a small text collection, or one fieldworker’s notes may fail to answer a diagnostic question even when the language has the relevant profile. Demotion claims apply only under declared evidential conditions. Sparse, genre-bound, or tradition-bound documentation defaults to indeterminate, not too thin, unless the study can show that the relevant diagnostics were actually tested and failed. The framework should preserve expert field knowledge by recording provenance, register, speaker variation, translation practice, and unresolved uncertainty rather than flattening them into zeros.

The same rule applies to constructional labels. If coders disagree about whether a language has <span class="smallcaps">caused-motion construction</span><sub>$`L`$</sub>, they first adjudicate the Level-I caused-motion target and the Level-II argument-structure profile, then score whichever Level-III forms realize the combination. The construction label doesn’t settle the analysis.

What would force reassessment and potential withdrawal of naturalized status? Failure on any of the three criteria from Section <a href="#subsec:naturalization-criteria" data-reference-type="ref" data-reference="subsec:naturalization-criteria">6.2</a>: diagnostics systematically fail in genealogically independent languages; proposed homeostatic mechanisms dissolve under scrutiny; or predictions fail in held-out data. Such disconfirmation reveals that the concept was misclassified initially (appeared naturalized within a limited sample but lacks genuine cross-linguistic stability) or requires scope restriction (naturalized within certain families/areas but not globally).

## The essence worry

Some will read naturalized as sneaking back in innate lexical categories or semantic essences. This misinterprets the claim. Naturalization is a hypothesis about process convergence, not a claim about innate primitives.

Consider the convergence analogy from Section <a href="#sec:eyes" data-reference-type="ref" data-reference="sec:eyes">7</a>. Similar profiles can arise independently when stable demands make some solutions recurrent. Likewise, <span class="smallcaps">noun</span><sub>Cross</sub>-like patterns reflect convergent solutions to stable discourse demands (tracking reference, packaging for quantification), maintained by independent mechanisms in each lineage.

The key difference from universalist accounts:

- **Universals claim**: all languages have nouns because UG provides a Noun feature.

- **Naturalization claims**: many languages develop noun-like lexical categories because recurrent pressures (cognitive, discourse, diachronic) independently stabilize similar property clusters. The similarity is real but mechanistically explained, not essentially mandated.

Naturalization is empirically defeasible: if evidence shows that “noun-like” patterns in 30% of languages trace to a single proto-language and the rest show no convergent mechanisms, <span class="smallcaps">noun</span><sub>Cross</sub> is downgraded from naturalized to instrumental status. Universals don’t admit this kind of revision.

The framework is explicitly anti-essentialist: lexical categories persist because mechanisms maintain them, not because they have defining features. When mechanisms diverge or fail, lexical categories dissolve or reorganize. This amounts to homeostasis, not essence.

## The independence worry

**Objection:** Standard measurement models require sum-to-1 constraints for identifiability, forcing competition between forms. Doesn’t this violate HPC’s core intuition that multiple mechanisms independently maintain the cluster?

**Reply:** The two-layer model achieves identifiability without simplex constraints. Variance anchoring ($`\text{Var}(\eta)=1`$) and monotonicity ($`\lambda \geq 0`$) fix the scale, allowing several forms to receive high weights at once. For example, English proper names, definite determinatives, and referential personal-pronoun uses can all provide strong evidence for <span class="smallcaps">definiteness</span><sub>Cross</sub> without forced trade-offs. They shouldn’t be coded as perfect exponents: dummy *there*, weather *it*, generic *one*, and interrogative *who* are precisely the kinds of false positives that $`q_L(c,\phi)`$ is meant to capture. The false-positive rate distinguishes high-precision exponents from ambiguous forms, providing richer characterization than normalized weights. This preserves HPC’s claim that clustered, partially redundant mechanisms can support rigorous statistical inference.

# Conclusion

Comparative-concept discipline makes the conflation problem visible but leaves a deeper question unanswered: which concepts, if any, identify comparanda that are more than descriptive conveniences? This paper proposed defeasible criteria for promoting comparative concepts to naturalized status when their comparanda behave as homeostatic property cluster kinds: profiles of properties stabilized by mechanisms and projectible for a purpose. The participial comparison showed how the claim bites empirically. English supports broad projection from the past-participle label; German and French preserve narrower clusters; Greek and Japanese split the bundle in different ways. Constructional labels extend the lesson: when a label bundles meaning, argument structure, and form, projectibility has to be tested on the decomposed profile. This turns comparative concepts from descriptive tools into testable hypotheses about which patterns reflect convergent pressures versus inherited templates.

Three prerequisites enabled this naturalization project: comparanda discipline (explicit mappings $`M_L(c,\phi)`$ separating cross-linguistic comparanda from language-internal realizations), diagnostic independence (testing meanings via behavioural evidence before examining morphosyntactic forms), and measurement accountability (latent variable models with declared decision rules, uncertainty estimates, and held-out evaluation, making naturalization empirically defeasible). Diagnostic independence separates evidence channels without denying cross-level coupling: forms can be tested as realizations and stabilizers once the comparandum has diagnostic traction. The same discipline prevents construction labels from smuggling whole analyses into a row of the matrix. The question isn’t whether this framework is convenient; it isn’t. The question is whether typologists are willing to measure what they claim to compare.

**Three pilot studies.** The framework is empirically tractable. Three studies can test its core predictions within 18 months:

1.  **Regeneration pathways (12 languages, 6 families):** Test whether languages with well-developed classifier systems favour demonstrative→article pathways while languages with strong possessive morphology favour genitive→article pathways. Evaluation: held-out families should show better prediction from typological profile than from a genealogy-aware baseline.

2.  **Mechanism competition (oral vs. literate traditions, 30 languages):** Test whether oral traditions show faster regeneration but faster erosion compared to literate traditions. Evaluation: regeneration and erosion effects should move in the predicted directions after genealogy and area are controlled.

3.  **Adjective naturalization test (50 languages, 10 families):** Code <span class="smallcaps">modifier</span><sub>Cross</sub> realizations and test projectibility. Evaluation: held-out families should support projection from the proposed profile, or the concept should be demoted with a documented failure mode.

These pilots require coordination: fieldworkers for diagnostic protocols, corpus linguists for historical pathways, statisticians for hierarchical models. The theory identifies the empirical object. Implementation requires sustained collaboration.

Typology stands at a choice point. Continue conflating levels, and universals will keep dissolving. Separate comparanda from realizations, enforce diagnostic independence, and demand measurement accountability, and the field can become a measurement science rather than a label-based enterprise.

# From labels to measurement

Traditional typology trades in binary tallies: language X *has* adjectives or it doesn’t, nouns *mark* number or they don’t, language Y *has* <span class="smallcaps">caused-motion construction</span><sub>$`L`$</sub> or it doesn’t. This forces clean boundaries where the data shows gradients and partial patterns. Variation isn’t residue to be tidied away after the analysis; it’s part of what the analysis is trying to model, because languages change, speakers diverge, and realizations often remain probabilistic even in well-described systems. The three-level mapping (Section <a href="#sec:matrix" data-reference-type="ref" data-reference="sec:matrix">4</a>) and the firewall protocol (Section <a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a>) provide the foundation for something better: explicit measurement models with declared performance rules and uncertainty estimates.

The first move is to replace binary lexical-category and construction labels with latent variables that represent the degree to which a language instantiates a comparandum as specified by a comparative concept. For <span class="smallcaps">nominality</span><sub>Cross</sub>, define a vector of observable diagnostics:
``` math
\begin{aligned}
  \mathbf{n}_L = \big\langle
     &\text{ArgHead},\ \text{Poss/Quant Interface},\ \text{NomMorph},\\
     &\text{Determiner/Classifier System},\ \text{Predication Profile},\ \text{Modification Profile}
  \big\rangle.
  \end{aligned}
```

## Formal decision rules and summaries

Let $`\mathcal{L}`$ be a stratified sample of languages controlling for genealogy and area, and let $`\mathcal{L}_{\text{test}} \subset \mathcal{L}`$ be a held-out test set. Let $`R_L(c) \in [0,1]`$ summarize evidence quality for comparandum $`c`$ in language $`L`$, combining diagnostic coverage and mapping reliability. Let $`r_{\min}`$ be the study’s minimum evidence-quality score, and let $`\mathcal{L}_{\text{eval}}(c) = \{L \in \mathcal{L} : R_L(c) \geq r_{\min}\}`$ be the languages with adequate evidence for comparandum $`c`$. Languages outside $`\mathcal{L}_{\text{eval}}(c)`$ are evidentially indeterminate for that comparandum; they don’t count as negative cases.

A comparandum $`c`$ is too thin if too few adequately evidenced languages show any realization above the study’s minimal weight benchmark:
``` math
\frac{|\{L \in \mathcal{L}_{\text{eval}}(c) : \exists \phi \in \mathrm{Forms}_L \text{ s.t. } w_L(c,\phi) > \epsilon\}|}{|\mathcal{L}_{\text{eval}}(c)|} < k
```
where $`\epsilon = 0.3`$ and $`k = 0.2`$ are illustrative coding benchmarks.

A comparandum $`c`$ is too fat if it overgenerates, lumping distinct phenomena. One implementation treats this as a profile with no dominant exponent and too many weak correlates:
``` math
\max_{\phi \in \mathrm{Forms}_L} w_L(c,\phi) < \tau_1 \quad \text{and} \quad \sum_{\phi \in \mathrm{Forms}_L} w_L(c,\phi) > \tau_2
```
where $`\tau_1 = 0.6`$ and $`\tau_2 = 2.0`$ are illustrative benchmarks.

A comparandum $`c`$ fails projectibility if predictive performance on held-out families falls below the declared decision rule:
``` math
\text{ROC-AUC}_{\mathcal{L}_{\text{test}}}(c) < \tau \quad \text{or} \quad \text{macro-F1}_{\mathcal{L}_{\text{test}}}(c) < \tau
```
where $`\tau`$ is calibrated for the study. A comparandum $`c`$ is indeterminate if the data is insufficient:
``` math
|\mathcal{L}_{\text{eval}}(c)| < n_{\text{min}}
```
where $`n_{\text{min}} = 10`$ languages is an illustrative minimum for moderately reliable coverage.

The effective number of realizations for comparandum $`c`$ in language $`L`$ is:
``` math
N_{\text{eff},L}(c) = \frac{\left(\sum_{\phi} w_L(c,\phi)\right)^2}{\sum_{\phi} w_L(c,\phi)^2}
```
Higher values indicate a less concentrated realization profile. These numerical values are illustrative benchmarks, not universal constants. A study has to declare its decision rules before evaluating held-out data, justify the calibration with model comparison, power analysis, or posterior predictive checks, and report sensitivity to nearby alternatives.

## A weight-combination rule

When analyst provisional weights and model posteriors conflict, one auditable compromise is:
``` math
w_L^{\text{final}}(c,\phi) = \frac{r_L(c,\phi) \cdot w_L^{\text{prov}}(c,\phi) + n_{\text{pooled}} \cdot \mathbb{E}[w_L(c,\phi) \mid \text{model}]}{r_L(c,\phi) + n_{\text{pooled}}}
```
where $`n_{\text{pooled}}`$ is the effective sample size from phylogenetically related languages in the partial-pooling hierarchy. This formula isn’t mandatory; it records one way of making the analyst-vs-model trade-off explicit.

## Formalizing mechanism claims

The mechanism catalogue can be translated into probability statements. For invariance, stable discourse and processing pressures increase the probability that a form realizes comparandum $`c`$:
``` math
\Pr\big(\phi \in f_L(c) \mid \text{pressure } P\big) > \Pr\big(\phi \in f_L(c)\big) + \delta
```
where $`P`$ ranges over pressures such as reference tracking, individuation, possession, and quantification, and $`\delta`$ is the study-specific minimum pressure effect.

For cohesion, co-adaptation of morphosyntactic properties penalizes drift:
``` math
\Pr\big(w_L(c,\phi) > \epsilon \mid w_L(c',\phi) > \epsilon\big) > \Pr\big(w_L(c,\phi) > \epsilon\big) + \gamma
```
Here $`c'`$ is a related comparandum, $`\epsilon`$ is the study’s weak-realization benchmark, and $`\gamma`$ is the minimum cohesion increment. This captures the claim that if a form realizes one comparandum strongly, it is more likely to realize related comparanda.

For regeneration, recurrent diachronic sources rebuild eroded exponents:
``` math
\Pr\big(\exists \phi' \in f_L(c) \text{ at } t+1 \mid w_L(c,\phi) \to 0 \text{ at } t\big) > \omega
```
Here $`\phi'`$ is a replacement form at the later time point, and $`\omega`$ is the minimum regeneration probability. These parameters are study-specific. They should be calibrated from training data, historical corpora, or expert-coded diachronic samples rather than treated as universal constants.

## Deriving the measurement model from ontological commitments

The measurement structure follows directly from the three-level ontology rather than being asserted post-hoc. The comparandum-indexed matrix for language $`L`$ can be estimated via a **two-layer hierarchical model** that separates diagnostic evidence from realization evidence (Level III), preserving anti-circularity while ensuring identifiability.

Layer 1 focuses on diagnostic evidence. Behavioral diagnostics $`d_{L,i}`$ measure the latent strength $`\eta_{L,c}`$ of comparandum $`c`$ in language $`L`$ without reference to morphosyntactic forms:

``` math
\begin{align}
d_{L,i} &\sim \text{Bernoulli}(\pi_{L,i}) \\
\text{logit}(\pi_{L,i}) &= \alpha_i + \beta_i \eta_{L,c} + u_{\text{fam}(L)} + v_{\text{coder}(i)}
\end{align}
```

where:

- $`\eta_{L,c} \sim \text{Normal}(0, 1)`$ is the latent comparandum strength (mean fixed at 0 and variance at 1 to anchor location and scale)

- $`d_{L,i}`$ is diagnostic $`i`$ in language $`L`$ (e.g., ArgHead, PossInterface from Section <a href="#subsec:diagnostic-battery" data-reference-type="ref" data-reference="subsec:diagnostic-battery">5.3</a>)

- $`\pi_{L,i}`$ is the probability that diagnostic $`i`$ succeeds in language $`L`$

- $`\alpha_i \sim \text{Normal}(0, 2)`$ is diagnostic-specific difficulty

- $`\beta_i \sim \text{Normal}(0, 1)`$ is diagnostic discrimination (how informative the test is)

- $`u_{\text{fam}} \sim \text{Normal}(0, \sigma_{\text{fam}})`$ captures phylogenetic clustering via partial pooling

- $`v_{\text{coder}} \sim \text{Normal}(0, \sigma_{\text{coder}})`$ captures systematic coder biases

- $`\sigma_{\text{fam}}, \sigma_{\text{coder}} \sim \text{Exponential}(1)`$ are hierarchical standard deviations

Diagnostics never condition on forms $`\phi`$, ensuring $`(F_{L,c,\phi} \perp d_{L,i} \mid \eta_{L,c})`$ as required by anti-circularity (Rule B in Figure <a href="#fig:dag" data-reference-type="ref" data-reference="fig:dag">2</a>).

Layer 2 captures realization evidence. Forms $`F_{L,c,\phi,t}`$ (indexed by tokens $`t`$) are observed conditional on latent comparandum strength, with $`\rho_{L,c,\phi,t}`$ denoting the token-level probability that form $`\phi`$ realizes comparandum $`c`$ in language $`L`$:

``` math
\begin{align}
F_{L,c,\phi,t} &\sim \text{Bernoulli}(\rho_{L,c,\phi,t}) \\
\text{logit}(\rho_{L,c,\phi,t}) &= \kappa_{L,c,\phi} + \lambda_{c,\phi} \eta_{L,c}
\end{align}
```

with monotonicity constraint $`\lambda_{c,\phi} \geq 0`$ (stronger comparanda can’t decrease form probability). Priors:

``` math
\begin{align}
\kappa_{L,c,\phi} &\sim \text{Normal}(\mu_{\kappa}, 1.5) \\
\lambda_{c,\phi} &\sim \text{HalfNormal}(0, 1)
\end{align}
```

where $`\mu_{\kappa}`$ is initialized from analyst provisional weights $`w_L^{\text{prov}}`$ via the logit transform.

The weight function and false-positive rate are deterministic functions of structural parameters, evaluated at reference points on the latent scale:

``` math
\begin{align}
w_L(c,\phi) &= \sigma(\kappa_{L,c,\phi} + \lambda_{c,\phi}) \quad \text{(probability when $\eta_{L,c}=+1$)} \\
q_L(c,\phi) &= \sigma(\kappa_{L,c,\phi}) \quad \text{(probability when $\eta_{L,c}=0$)}
\end{align}
```

where $`\sigma`$ is the logistic function. Operationally, $`\eta_{L,c}=0`$ corresponds to **diagnostic absence**: all Layer 1 diagnostics score below their minimal diagnostic criteria, indicating the comparandum isn’t active in that context. This makes $`q_L`$ empirically estimable from tokens where the comparandum is diagnosed as absent yet the form appears. Because $`\kappa`$ and $`\lambda`$ have priors, $`w_L`$ and $`q_L`$ inherit posterior distributions; the matrix $`M_L`$ stores $`\mathbb{E}[w_L \mid \text{data}]`$ with 80% credible intervals.

Simple comparanda use the scalar $`\eta_{L,c}`$ above. Decomposed profiles use a component vector, for example $`\boldsymbol{\eta}_{L,c}=\langle \eta^{\mathrm{I}}_{L,c}, \eta^{\mathrm{II}}_{L,c}\rangle`$, where the Level-I component is estimated from behavioural diagnostics and the Level-II component from portable morphosyntactic diagnostics. In Layer 2, $`\lambda_{c,\phi}\eta_{L,c}`$ is replaced by $`\boldsymbol{\lambda}_{c,\phi}^{\top}\boldsymbol{\eta}_{L,c}`$. This lets the model ask whether a Level-III construction realizes both components, rather than letting the construction label define the profile.

Four mechanisms ensure unique parameter estimation:

1.  **Location anchoring:** Fixing $`\mathbb{E}[\eta]=0`$ eliminates translation invariance between $`\eta`$ and $`\kappa`$ (shifting both by a constant would otherwise leave the likelihood unchanged)

2.  **Scale anchoring:** Fixing $`\text{Var}(\eta)=1`$ breaks the $`(a \cdot \lambda, \eta/a)`$ scaling symmetry

3.  **Monotonicity:** The constraint $`\lambda \geq 0`$ prevents sign ambiguity

4.  **Independent measurements:** Layers 1 and 2 provide conditionally independent evidence sources, yielding two equations for two latent quantities ($`\eta`$ and $`\kappa`$,$`\lambda`$)

Any alternative parameterization would violate at least one constraint, ensuring the posterior is proper.

The estimation workflow proceeds in four steps:

1.  Fit Layer 1 to obtain posterior draws of $`\eta_{L,c}`$ from diagnostic data alone

2.  Condition on $`\eta_{L,c}`$ samples while fitting Layer 2, yielding joint posterior samples of $`(\kappa, \lambda)`$

3.  Transform samples to $`(w_L, q_L)`$ and compute posterior summaries

4.  Validate via posterior predictive checks: does the model generate diagnostic and form patterns consistent with observed data?

The model is multilevel because languages share evolutionary history. The phylogenetic random effect $`u_{\text{fam}}`$ implements partial pooling by genealogy, preventing spurious correlations from areal clustering. The latent comparandum strength $`\eta_{L,c}`$ is anchored at $`\mathbb{E}[\eta]=0`$ with $`\text{Var}(\eta)=1`$ to ensure both location and scale identifiability.

The two-layer structure formalizes the paper’s core methodological commitment. Level-I comparanda are diagnosed first from behavioural evidence alone, and syntactic comparanda from portable morphosyntactic diagnostics, then Level-III realizations ($`\kappa`$, $`\lambda`$) are estimated conditional on that diagnosis (Layer 2). This workflow embodies Rule B: withhold candidate forms when diagnosing semantic targets.

The same logic extends to other comparanda specified by comparative concepts: <span class="smallcaps">adjectivality</span><sub>Cross</sub> (modification, predication, gradability, comparison), <span class="smallcaps">verbiness</span><sub>Cross</sub> (predicator-head privileges, tense-aspect-mood morphology, argument structure), and decomposed profiles such as <span class="smallcaps">caused-motion</span><sub>Cross</sub>. Each gets its own diagnostic vector and measurement model. Semantic targets are tested independently (Section <a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a>) and linked to lexical categories, constructions, and other realizations only via the observed mappings in the matrix $`M_L`$ (Rule B).

Naturalization candidates are evaluated against declared projectibility metrics (ROC-AUC for classifiers, macro-F1 for multi-class prediction) with study-specific decision rules that have to be assessed in held-out test data before promotion. Failure under those rules triggers explicit failure modes: too thin, too fat, negative, or indeterminate. The framework builds measurement discipline into the theoretical machinery.

<div id="refs" class="references csl-bib-body hanging-indent">

<div id="ref-boyd1991" class="csl-entry">

Boyd, Richard. 1991. “Realism, Anti-Foundationalism and the Enthusiasm for Natural Kinds.” *Philosophical Studies* 61 (1–2): 127–48. <https://doi.org/10.1007/BF00385837>.

</div>

<div id="ref-boyd1999" class="csl-entry">

Boyd, Richard. 1999. “Homeostasis, Species, and Higher Taxa.” In *Species: New Interdisciplinary Essays*, edited by Robert A. Wilson. MIT Press. <https://doi.org/10.7551/mitpress/6396.003.0012>.

</div>

<div id="ref-Cowart1997" class="csl-entry">

Cowart, Wayne. 1997. *Experimental Syntax: Applying Objective Methods to Sentence Judgments*. Sage Publications.

</div>

<div id="ref-croft2001" class="csl-entry">

Croft, William. 2001. *Radical Construction Grammar: Syntactic Theory in Typological Perspective*. Oxford University Press. <https://doi.org/10.1093/acprof:oso/9780198299554.001.0001>.

</div>

<div id="ref-dixon2004" class="csl-entry">

Dixon, R. M. W. 2004. *Adjective Classes: A Cross-Linguistic Typology*. Oxford University Press.

</div>

<div id="ref-haspelmath2010" class="csl-entry">

Haspelmath, Martin. 2010. “Comparative Concepts and Descriptive Categories in Crosslinguistic Studies.” *Language* 86 (3): 663–87. <https://doi.org/10.1353/lan.2010.0021>.

</div>

<div id="ref-huddleston2002" class="csl-entry">

Huddleston, Rodney, and Geoffrey K. Pullum. 2002. *The Cambridge Grammar of the English Language*. Cambridge University Press. <https://doi.org/10.1017/9781316423530>.

</div>

<div id="ref-khalidi2013" class="csl-entry">

Khalidi, Muhammad Ali. 2013. *Natural Categories and Human Kinds: Classification in the Natural and Social Sciences*. Cambridge University Press. <https://doi.org/10.1017/CBO9780511998553>.

</div>

<div id="ref-lyons1999" class="csl-entry">

Lyons, Christopher. 1999. *Definiteness*. Cambridge University Press. <https://doi.org/10.1017/cbo9780511605789>.

</div>

<div id="ref-Matthewson2004" class="csl-entry">

Matthewson, Lisa. 2004. “On the Methodology of Semantic Fieldwork.” *International Journal of American Linguistics* 70 (4): 369–415. <https://doi.org/10.1086/429207>.

</div>

<div id="ref-miller2021" class="csl-entry">

Miller, J. T. M. 2021. “Words, Species, and Kinds.” *Metaphysics* 4 (1): 18–31. <https://doi.org/10.5334/met.70>.

</div>

<div id="ref-reynolds2025hpcbook" class="csl-entry">

Reynolds, Brett. 2026. “Words That Won’t Hold Still: How Linguistic Categories Work.” <https://github.com/BrettRey/hpc-book>.

</div>

<div id="ref-schutze2016" class="csl-entry">

Schütze, Carson T. 2016. *The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology*. Language Science Press. <https://doi.org/10.17169/langsci.b89.100>.

</div>

<div id="ref-ShubinTabinCarroll2009" class="csl-entry">

Shubin, Neil, Cliff Tabin, and Sean Carroll. 2009. “Deep Homology and the Origins of Evolutionary Novelty.” *Nature* 457 (7231): 818–23. <https://doi.org/10.1038/nature07891>.

</div>

<div id="ref-Wagner2014" class="csl-entry">

Wagner, Günter P. 2014. *Homology, Genes, and Evolutionary Innovation*. Princeton University Press. <https://doi.org/10.1515/9781400851461>.

</div>

</div>

[^1]: I used ChatGPT 5, Claude Sonnet 4.5, Gemini Pro 2.5, and Kimi 2 extensively in drafting and revising the paper. I reviewed, edited, and approved all the material and take full responsibility for the final text and conclusions. Formal definitions, decision rules, and citations were checked against the cited sources. I’m grateful to Martin Haspelmath and Bert Remijsen for comments on earlier drafts. <brett.reynolds@humber.ca>

[^2]: A Lean formalization of the typing discipline and firewall constraints is available in the supplementary repository, <https://github.com/BrettRey/naturalizing-typological-kinds>. It is an audit of the ontology, not part of the empirical argument.

[^3]: An earlier draft of this section used the phrasing “languages lacking ADJECTIVE<sub>$`L`$</sub> lexical categories”, the exact conflation diagnosed in Section <a href="#sec:naturalization-question" data-reference-type="ref" data-reference="sec:naturalization-question">1</a>. The slip was identified by Martin Haspelmath (p.c.), demonstrating how deeply entrenched these habits are even when explicitly resisted. This shows why formal protocols (Section <a href="#sec:firewall" data-reference-type="ref" data-reference="sec:firewall">5</a>) are required, not merely terminological vigilance.