---
title: "Expert Grammaticality Judges as Evaluators, Not Participants: Grammaticality Judgments, Rater Roles, and Research Ethics Review"
author: "Brett Reynolds"
year: "2026"
status: "Under review at Canadian Journal of Linguistics / Revue canadienne de linguistique"
canonical_url: "https://philarchive.org/rec/REYEGJ"
website_url: "https://brettreynolds.ca/papers/expert-grammaticality-judges/"
markdown_url: "https://brettreynolds.ca/papers/expert-grammaticality-judges/paper.md"
version: "author-manuscript mirror"
version_date: "2026-06-20"
keywords: ["grammaticality judgments", "acceptability judgments", "research ethics review", "rater reliability", "TCPS 2", "morphology", "syntax", "phonology"]
---
# Expert Grammaticality Judges as Evaluators, Not Participants: Grammaticality Judgments, Rater Roles, and Research Ethics Review

**Author-manuscript mirror.** This Markdown file is provided for accessibility, search, and machine readability. The canonical public record is linked in the metadata above.

## Abstract
A grammaticality judgment is always a human act, but it isn’t always participant data. Whether the person who supplies one is a participant or an evaluator should turn on what the judgment is evidence of: a person (reactions, processing, or dialect) or an object (an item’s status under a stated standard). The role isn’t only an ethical classification; it also fixes the generalization a judgment licenses, over speakers or items. In the object case, the judge resembles an essay scorer or prescriptive annotator more than a survey respondent, and systematic, reliability-checked procedures mark measurement rather than a study of the judges. Canada’s Tri-Council Policy Statement (TCPS 2) reaches the distinction by focus of inquiry; the United States Common Rule states it through its “about whom” definition. The proposal routes ethics review rather than loosening it: it neither licenses private intuition nor erases the cases where judges are participants.


**Keywords:** grammaticality judgments; acceptability judgments; research ethics review; rater reliability; TCPS 2

# The problem

Marking a sentence acceptable is one act with more than one evidential life: a response sampled from a population, a datum about processing or dialect, a consultant’s contribution of community knowledge, a colleague’s check on a constructed contrast, or an expert’s assessment of an item under a stated standard. Ask whether the grammaticality judgments in a linguistics paper needed consent and ethics review, and the answer turns on which life is in play: on whether the study is about that person, not just whether a person produced the judgment.

In Canadian institutions the question is sharp from the start. Canada’s Tri-Council Policy Statement (TCPS 2) defines participants through the data and responses relevant to a research question, so on a literal reading a requested grammaticality judgment looks captured before its evidential role has even been described. The argument below is that the role, not the bare request, should settle the classification.

The pull toward the first description has a respectable source. A strand of experimental syntax has argued that informal intuitions, often the author’s own, are a weak evidential base, and that acceptability should be measured on samples of naive speakers under controlled conditions (Gibson and Fedorenko 2010, 2013; Gibson et al. 2011). On that view the design samples the judges, takes their responses as data, and counts as participant research. Informal intuitions can mislead, and the corrective has improved the field.

The corrective answers a measurement question, not a role question. That informal intuitions can be unreliable is a reason to calibrate, sample, or formalize the measurement, and the reliability of informal judgments has itself been tested, with results more favourable than the sharpest critiques predicted (Sprouse et al. 2013; Schütze 2016). It isn’t a reason to treat everyone who supplies a grammaticality judgment as the object of study.

The answer isn’t that expert judgment escapes scrutiny, but that the scrutiny should follow the role. A judge is a participant when the question concerns that person’s reaction, processing, background, identity, or distribution of responses. A judge is an evaluator when the question concerns the linguistic status of an item and expertise is used to assess it against disciplinary standards. Misclassifying costs both ways: treating evaluation as sampling overstates what the judgment shows, and treating sampling as evaluation strips persons of protections they’re owed.

The stakes aren’t only ethical. The role also fixes which generalization a judgment licenses, over a kind of item or over a population of speakers, so misreading it corrupts the inference, not just the oversight.

Expert judges of grammaticality shouldn’t be treated as participants merely because their judgments are human acts. The first question is what role the judgment plays in the evidential design.

# Participants, informants, and evaluators

The central distinction is between producing data about oneself and applying expertise to a target object. A participant contributes evidence about a population of people, a psychological process, a distribution of acceptability, a demographic contrast, or an intervention effect. An evaluator contributes an assessment of an object: an essay, a translation, a diagnosis, a code, a performance, or a linguistic example.

Between the two sits informant, less a third role than a recurring hard case. A fieldwork consultant who supplies primary data about their own language is, by the test in the previous paragraph, contributing evidence about a speaker and a community, which places them on the participant or community-contributor side of the contrast, not the essay-scorer side. The term turns slippery because the same person may later be asked to apply a stated criterion to a specified item, which is evaluation. Informant marks where the two roles meet, not a stable category between them.

The category needs special care in documentary, Indigenous, and community-based work. Where a judgment draws its authority from community membership, lived practice, or collective knowledge rather than from a task-specific criterion applied to a defined object, the safer classification is participant or community contributor, not evaluator. TCPS 2 gives this its own chapter, on research involving the First Nations, Inuit, and Métis peoples of Canada, with requirements of engagement and community guidance that the evaluator framing must not bypass (Canadian Institutes of Health Research et al. 2022, chap. 9). The evaluator category shouldn’t become a way to relabel community knowledge as expert service.

Calling someone an evaluator is a functional move, not an honorific one. It doesn’t make the judgment infallible, unregulated, or methodologically private; it changes the evidential question. The relevant issues become qualification, calibration, independence, adjudication, documentation, and conflict of interest, not sampling, participant burden, demographic representativeness, or the protection of private personal data.

What counts as expertise here isn’t a professional title alone but a role in a design: a judge is an evaluator when selected to apply a stated standard to a linguistic object, so the judgment’s evidential value turns on qualification, independence, and task specification, not credentials. The cases form a spectrum, from authorial self-checking and informal consultation to independent expert evaluation and the systematic collection of judgments from a sample. A linguist reporting their own reaction as one of a population isn’t an evaluator in this sense; one applying a disciplinary criterion to a specified item may be.

Research ethics didn’t invent the contrast. Data annotation in language technology draws the same line, between descriptive labelling, which sets out to capture and model the annotator’s own beliefs, and prescriptive labelling, which asks the annotator to apply a fixed scheme to an object (Röttger et al. 2022). The descriptive annotator sits nearer the participant pole; the prescriptive annotator, nearer the evaluator pole.

For grammaticality work, the distinction matters because many expert judgments don’t estimate speaker-preference distributions. The task is to decide whether a proposed sentence, construction, contrast, or diagnostic behaves as the theory implies. That use of expertise is evaluative.

# The essay-scorer analogy

Educational and psychometric measurement draws a line that grammaticality research often blurs: the scorers aren’t the scored. Essay scorers, oral-proficiency raters, and constructed-response markers can introduce error, bias, drift, and disagreement, and assessment researchers take that seriously. None of it converts the scorers into participants alongside the students. The same holds inside the research process: a colleague or reviewer who judges whether an author’s example is acceptable, or whether a claimed contrast holds, is assessing the materials, not thereby enrolling as a subject.

What makes the analogy bite is that this scoring isn’t casual. Constructed-response programs train raters against rubrics, double-score responses, monitor severity and drift, adjudicate disagreements, and model rater effects statistically (McCaffrey et al. 2022; American Educational Research Association et al. 2014). The scoring is about as systematic, multi-rater, and quantitatively scrutinized as human judgment gets. The raters are still part of the measurement apparatus, not its object.

That ordering matters, because the usual worry about grammaticality judgments runs the other way. The intuition is that once judgment collection becomes systematic, with many judges, fixed materials, and reliability statistics, it must be participant research. The essay-scorer case shows that this inference runs backwards. Systematicity, multiple raters, and reliability analysis are signatures of a measurement procedure, not marks of a study about the judges.

The sharpest form of the worry presses on reliability itself. If a study reports how far several judges agree, isn’t it then reporting facts about those people? It’s reporting a property of the measurement, not a finding about the judges. Inter-rater agreement is what licenses the step from the judges’ verdicts to the status of the object, much as a rater facet functions in psychometric models (American Educational Research Association et al. 2014). The statistic characterizes the instrument; the object stays what the study is about.

The experimental-syntax worry of Section <a href="#sec:problem" data-reference-type="ref+label" data-reference="sec:problem">1</a> has a stronger form: role and measurement can’t be separated, because an expert judgment is an instrument only as a calibrated sample of experts, and to calibrate it you have to study them. But calibrating an instrument validates a procedure; it doesn’t enrol the instrument as a subject. A thermometer checked against a reference doesn’t become the object of the reading, and a scoring panel trained and modelled for severity isn’t the object of the study when its agreement underwrites a verdict about the item. Experimental syntax already works this way: questionable judgments, Phillips (2010) notes, are caught by colleagues, audiences, and reviewers before they enter the literature, and that scrutiny is adjudication, an evaluator-side control, not a sample drawn to learn about the judges. Riemer (2009) draws a related evidence/prediction line.

The line can move, though, on a principle rather than a stipulation. An aggregate agreement coefficient is used instrumentally, as a premise that licenses the step from the raters’ verdicts to the item’s status. A model of rater severity by training, dialect, gender, or theoretical orientation makes that variation the conclusion, so the raters become what the study is about. The same numbers can serve either role; what differs is the direction of inference. The clearest case is a design that contrasts expert with naive agreement: the divergence between the two groups is itself the finding (Dąbrowska 2010), which puts the judges on the participant side (Section <a href="#sec:boundary-cases" data-reference-type="ref+label" data-reference="sec:boundary-cases">5</a>).
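The two directions of inference can be sketched on toy data. The ratings below are invented for illustration, and Fleiss’ kappa is an assumed choice of agreement coefficient (the argument does not depend on a particular statistic); the function names are hypothetical. The first function summarizes the procedure; the second breaks the same numbers down by rater, which is the step that would make the raters the topic.

```python
from collections import Counter

# Hypothetical toy data: rows are items, columns are raters.
# 1 = judged to meet the stated criterion, 0 = judged not to.
RATINGS = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]


def fleiss_kappa(ratings):
    """Chance-corrected agreement across raters (needs at least two raters)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    per_item = []
    totals = Counter()
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        # Share of ordered rater pairs that gave this item the same label.
        same_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(same_pairs / (n_raters * (n_raters - 1)))
    observed = sum(per_item) / n_items

    # Agreement expected by chance from the overall label proportions.
    expected = sum((n / (n_items * n_raters)) ** 2 for n in totals.values())
    if expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)


def acceptance_rate_by_rater(ratings):
    """Per-rater share of items marked 1: a fact about raters, not items."""
    n_items = len(ratings)
    return [sum(column) / n_items for column in zip(*ratings)]


# Instrumental use: one number characterizing the measurement procedure.
agreement = fleiss_kappa(RATINGS)

# Rater-as-topic use: variation across raters becomes the thing reported.
severity = acceptance_rate_by_rater(RATINGS)
```

Nothing in the arithmetic marks one use as evaluation and the other as participant research; the classification comes from which quantity the study’s conclusion is about.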

The analogy buys neither infallibility nor exemption from scrutiny. It relocates the scrutiny. If raters can be unreliable, the response is to train, blind, rotate, calibrate, adjudicate, report: the standards that govern scoring, not those that govern participant sampling (American Educational Research Association et al. 2014). An expert grammaticality judge deciding whether an item behaves as a theory implies is doing the rater’s job, and should be held to the rater’s standards.

The analogy isn’t exact, and the gap cuts in a useful direction. Essay scorers normally judge performances produced independently of them, whereas grammaticality evaluators often judge examples constructed for the argument they support. That raises the need for disclosure, independence, and, where feasible, blinding to the prediction. It changes the controls the role requires, not the role itself.

The essay analogy has a limit worth pressing: a sentence’s grammaticality isn’t fixed independently of the judge the way an essay’s merits are, but constituted by the community’s competence and read through the evaluator’s own. Expert diamond grading is the closer model. A grader assigns a stone’s colour by comparison with master stones under fixed lighting, so colour here isn’t response-independent either: it is keyed to a community-fixed standard and read through trained perception. Yet the grade is about the stone, inter-grader agreement characterizes the procedure rather than the grader, and no one treats the grade as a datum about the grader. A grammaticality verdict has the same shape: a constituted object, a statable criterion, a competent judge as instrument.

# What expert grammaticality judgments do

Expert grammaticality judgments can serve several roles. They can screen examples, test diagnostic contrasts, adjudicate corpus candidates, check paraphrases, evaluate minimal pairs, or assess whether a constructed item successfully isolates the intended variable. These aren’t the same role as estimating how a population would rate the sentence on a Likert scale.

The methodological burden is correspondingly different. An expert-judgment design should say what qualifies the judge, what the judge was asked to evaluate, what information was available, whether judgments were independent, how disagreements were handled, and whether the judgment bears on form, meaning, register, dialect, processing difficulty, or theoretical classification.

This frame carries a precondition: a standard the evaluator can state and be checked against. Where the criterion is articulable, the judge applies it to the item and the verdict is open to challenge. Where the supposed standard is only the theorist’s own intuition in other dress, the evaluator role collapses into the private authority this paper declines to defend, since expert and naive judgments can diverge (Dąbrowska 2010). The label is earned by a statable criterion, not assumed.

A study in this journal shows the precondition at work. Studying resumptive pronouns in English relative clauses, Loss and Wicklund (2020) begin from a stated generalization in the descriptive literature: a restrictive relative clause requires a gap, not a resumptive pronoun (Huddleston and Pullum 2002, 1091). Against that criterion, the resumptive *She got a couch at Sears that it was on sale* is marked unacceptable beside the gapped *She got a couch at Sears that was on sale*, even though the resumptive form is attested in speech. That mark evaluates the sentence under a statable rule; it isn’t a datum about whoever produced it.

Where the rule runs out, the role changes. For appositive relatives the literature offered only an informal intuition that resumption is better tolerated, with no settled criterion. So Loss and Wicklund (2020) did what the participant role calls for: they collected acceptability ratings from a sample of speakers and compared the distributions across clause types. The same phenomenon supports an evaluator’s verdict where a criterion can be stated and a participant study where the open question is how speakers respond.

The expertise must also match the object. Competence in one variety, register, or theoretical diagnostic doesn’t license evaluation across all of them; a syntactician at home with a standard-variety contrast may be a poor judge of a youth register, a minoritized variety, or second-language classroom usage. Where the judgment concerns a socially located pattern, the relevant competence is competence for that pattern, not disciplinary standing in general.

Level discipline matters here. A grammaticality claim, an acceptability-distribution claim, a processing claim, and a social-distribution claim are different claims. Treating every expert judgment as participant data blurs those levels instead of protecting them.

The role also fixes which projectible predicate a judgment confirms, and over what reference class (Goodman 1955). Read as evaluation, marking a sentence confirms a predicate over linguistic items, that structures of this type are grammatical; read as participation, it confirms a predicate over speakers, that people of this variety accept it. A single asterisk feeds a different induction depending on the role. Treating the expert assessment as a sampled response invites bad inference, such as a confidence interval drawn around one expert as though it sampled a population. The role isn’t only an ethical classification; it fixes which generalizations are licit.

One might object that grammaticality, grounded in speakers’ competence, makes any judgment evidence about persons after all. A sentence’s grammaticality does supervene on the community’s competence, so the fact is grounded in speakers. But grounding fixes a truthmaker, not a topic. The objection slides twice: from what makes the property obtain (an idealized population) to what bears it (the item); and from the community whose practice constitutes the language to the individual whose verdict reports its status. The evaluator’s competence is the route to the item’s status, not what the verdict is about.

# When judges are participants

The evaluator role has boundaries. If the study asks how linguists respond, how expertise changes judgments, how quickly people process a contrast, how dialect background affects acceptability, or how training changes ratings, then the judges belong to the target population. In such cases, the design studies their responses as responses, so they function as participants or participant-like contributors.

The clearest limiting case sits at the other end. A single-authored syntax article rests on the author’s own judgments: the author builds the examples and marks them, and the marks carry the argument. If supplying a judgment relevant to the research question made one a participant, the author would be a participant in their own study, owing self-consent, a right to withdraw, and review of the author as a subject. No plausible ethics regime treats this as ordinary participant research; on the literal reading, the field’s pre-experimental history would count as unconsented self-experimentation. Producing the judgment can’t be the trigger.

Parity shows what the trigger is instead. Let the author judge fifty sentences and a colleague judge the same fifty; the judgments do identical evidential work, and only the producer differs. The line falls on the focus of inquiry, not the human act. A literalist might exempt the author for free, since no one else is studied, while counting the colleague a minimal participant; but that relocates the question rather than answering it, because production is constant across the two cases and the one thing that varies, who produced the verdict, has nothing to do with what the study is about. This doesn’t make a lone intuition good evidence. That it can be unreliable is the measurement worry of Section <a href="#sec:problem" data-reference-type="ref+label" data-reference="sec:problem">1</a>, met by calibration and replication, not by recasting the author as a research subject.

A single project can hold both roles. Experts may screen or validate materials as evaluators, while naive speakers then rate those materials as participants. If the study goes on to analyze the experts’ own distribution of responses, their expertise effects, or their disagreement with naive speakers as a finding, the experts have become participants for that part of the design.

Because the role follows the design, it can look manipulable: relabel a judgment study as “evaluation” and the protections fall away. What blocks the move is that the role is fixed by the target of the research question, not by the label a researcher prefers. Take the hard case: thirty trained linguists rate two hundred sentences and report the reliability. If the ratings certify the status of the items, the design is evaluation. If the distribution of ratings is itself the finding, the linguists are the object of study, and they’re participants. The diagnostic is what the judgment is evidence of, settled before any label is applied.

The distinction cuts both ways, against overreach and against special pleading. Expertise doesn’t exempt a study from human-participant review when the human is the object of inquiry. But neither does the mere presence of human expertise convert every scholarly evaluation into participant research.

# What the regulations actually say

Of the two regulatory frameworks at issue, the Canadian one poses the harder case. TCPS 2 defines participants as individuals whose data or “responses to interventions, stimuli or questions by the researcher, are relevant to answering the research question(s)” (Canadian Institutes of Health Research et al. 2022). A rater’s judgment answers a question and is plainly relevant to the research question, so on its face the definition takes in the evaluator. The distinction can’t rest on the bare fact that a judgment was requested.

It needn’t. TCPS 2 allows “interaction with individuals who are not themselves the focus of the research, in order to obtain information” (Canadian Institutes of Health Research et al. 2022). The test it sets is the focus of inquiry: whether the person is the object of study, not whether a response was produced.

Its own example is institutional: “authorized personnel” releasing role-based information about an organization’s policies, procedures, or statistics “are not considered participants” (Canadian Institutes of Health Research et al. 2022). The Panel sharpens the test, exempting information that staff “normally provide as part of their work duties” because “the information is the focus of the research, not the views of the staff member” (Panel on Research Ethics 2022).

A grammaticality judgment isn’t an institutional fact of that kind; it’s a cognitive assessment, not a registrar’s enrolment figures. But the test doesn’t turn on the kind of information, only on the focus of inquiry. An expert asked whether an item meets a stated criterion is interacted with so that the item, not the expert, can be studied: the focus is the item, and the assessor isn’t the object.

The same interpretation marks the boundary. Asking someone “to provide personal opinions outside the scope of their job roles” brings review back (Panel on Research Ethics 2022), so the evaluator role holds only while the judgment is read as a within-competence assessment of the object. Once a judgment is taken as the person’s own reaction, idiolect, or attitude, the focus has moved to the person, and the judge becomes a participant (Section <a href="#sec:boundary-cases" data-reference-type="ref+label" data-reference="sec:boundary-cases">5</a>).

This locates the point about method correctly. That “choice of methodology and/or intent or ability to publish findings” don’t determine whether an activity is research (Canadian Institutes of Health Research et al. 2022) blocks the inference from systematicity to participant status, but it doesn’t by itself separate participant from evaluator. The focus-of-inquiry test does that work.

United States policy states the same line more explicitly. Under the Common Rule, a human subject is a living individual “about whom” an investigator obtains data through intervention or interaction, or whose identifiable private information is used (U.S. Department of Health and Human Services 2018). The *from*/*about* contrast is built into the definition: a rater’s score can be obtained *from* a person without being *about* that person. Where the American text names the distinction, the Canadian framework reaches it through focus of inquiry. The judge evaluates the item; the item doesn’t evaluate the judge.

# Reporting and ethics

The practical proposal is a reporting distinction that follows the role. Participant studies should report recruitment, eligibility, consent, relevant demographics, compensation, exclusion, and the statistical treatment of responses. Evaluation studies should report a different set (American Educational Research Association et al. 2014):

- the evaluator’s qualifications for the standard applied;

- the evaluation task and the criterion or diagnostic used;

- the materials and context available to the evaluator;

- whether the evaluator was blind to the hypothesis;

- independence from the author and the project;

- how disagreements were adjudicated;

- any conflict of interest or relationship to the work;

- whether the judgment concerns form, meaning, register, dialect, processing difficulty, or theoretical classification.

These aren’t bureaucratic decoration. They’re what makes the evaluator classification credible: without a statable criterion, a matched domain of expertise, and some account of independence or adjudication, the work isn’t expert evaluation in the relevant sense, and the role-based case for treating it as evaluation rather than participant research lapses.
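As a hedged illustration, the reporting items listed above can be kept as a simple record whose gaps are easy to spot. The class and field names here are hypothetical, chosen only to mirror the list; they are not drawn from any standard or reporting tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvaluationReport:
    """One record per evaluation step; field names are illustrative only."""
    qualifications: Optional[str] = None         # for the standard applied
    task_and_criterion: Optional[str] = None     # task plus criterion or diagnostic
    materials_and_context: Optional[str] = None  # what the evaluator could see
    blind_to_hypothesis: Optional[bool] = None   # False is still a reported value
    independence: Optional[str] = None           # from the author and the project
    adjudication: Optional[str] = None           # how disagreements were handled
    conflict_of_interest: Optional[str] = None   # or relationship to the work
    judgment_concerns: Optional[str] = None      # form, meaning, register, ...

    def missing(self):
        """Names of the items that have not been reported at all."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Usage with invented values: two items reported, six still outstanding.
report = EvaluationReport(
    qualifications="trained in the descriptive grammar applied",
    task_and_criterion="does the restrictive relative contain a gap?",
)
outstanding = report.missing()
```

A record with outstanding items is not thereby participant research; it is an evaluation whose classification has not yet been made credible.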

This routing doesn’t weaken ethics review; it aims it. A study that targets persons belongs in participant review with the usual protections; one that uses expert competence to assess linguistic materials belongs under the standards that govern measurement (American Educational Research Association et al. 2014). The routing for common designs is summarized in Table <a href="#tab:routing" data-reference-type="ref+label" data-reference="tab:routing">1</a>.

<div id="tab:routing">

| **What the design studies** | **Role** | **What to report** |
|:---|:---|:---|
| Whether speakers accept an item | Participant | recruitment, consent, demographics, sampling, compensation, exclusion, statistical treatment of responses |
| Whether an item meets a stated criterion | Evaluator | qualification, criterion, materials, independence, blinding, adjudication, conflict of interest |
| Whether experts and naive speakers diverge | Participant (for that comparison) | group definition, sampling, consent, the statistical contrast |
| Whether materials pass expert screening before a participant study | Evaluator (for that step) | qualification, validation task, handling of disagreement |

Table 1: Routing judgment evidence by what the design studies. The role follows the focus of inquiry, so the same materials change rows as the research question changes, and a mixed design occupies more than one.

</div>
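Read as a procedure, the routing table is a lookup keyed on what the design studies, never on the label a researcher prefers. The following is a minimal sketch under that reading; the enum members, the `Route` type, and the `route` function are hypothetical names introduced for illustration, and the reporting items are copied from the table rows.

```python
from dataclasses import dataclass
from enum import Enum


class Focus(Enum):
    """What the design studies (one member per table row)."""
    SPEAKER_ACCEPTANCE = "whether speakers accept an item"
    ITEM_MEETS_CRITERION = "whether an item meets a stated criterion"
    EXPERT_NAIVE_DIVERGENCE = "whether experts and naive speakers diverge"
    EXPERT_SCREENING = "whether materials pass expert screening"


@dataclass(frozen=True)
class Route:
    role: str
    report: tuple


ROUTING = {
    Focus.SPEAKER_ACCEPTANCE: Route("participant", (
        "recruitment", "consent", "demographics", "sampling",
        "compensation", "exclusion", "statistical treatment of responses")),
    Focus.ITEM_MEETS_CRITERION: Route("evaluator", (
        "qualification", "criterion", "materials", "independence",
        "blinding", "adjudication", "conflict of interest")),
    Focus.EXPERT_NAIVE_DIVERGENCE: Route("participant", (
        "group definition", "sampling", "consent", "statistical contrast")),
    Focus.EXPERT_SCREENING: Route("evaluator", (
        "qualification", "validation task", "handling of disagreement")),
}


def route(focuses):
    """Return one route per focus; a mixed design occupies several rows."""
    return [ROUTING[focus] for focus in focuses]


# A mixed design: experts screen the materials, then naive speakers rate them.
mixed = route([Focus.EXPERT_SCREENING, Focus.SPEAKER_ACCEPTANCE])
roles = [r.role for r in mixed]
```

The lookup takes a list because a single project can hold both roles; each step is routed separately by its own focus of inquiry.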

One protection can’t fall through the gap. Classifying a judge as an evaluator settles their status as a source of human-subjects data; it doesn’t settle their standing as a contributor of labour. Even where a graduate student pressed to supply judgments is no one’s research subject, they’re still owed fair acknowledgement, a reasonable burden, and protection against coercion. Those attach to the person as a worker, a matter of labour ethics rather than human-subjects review, a line sharpened in the parallel debate over whether machine-learning crowdworkers are subjects or workers (Kaushik et al. 2023). The evaluator framing relocates one set of obligations and leaves the other untouched.

Several questions travel together but shouldn’t answer one another: whether the judgment is evidence about the judge (participant status), whether it’s reliable (measurement), whether the person was fairly treated, credited, and free from coercion (labour), and whether their identity or judgment will be disclosed (confidentiality and attribution). A negative answer to the first doesn’t settle the rest.

The two roles put different pressures on identity, but naming isn’t the diagnostic. Participants are normally anonymized because their identities are data-bearing and disclosure can harm them. Evaluators may need attribution, internal traceability, or conflict-of-interest disclosure, because their qualifications and relationships bear on measurement quality, but they too may be anonymized, for independence, blinding, or protection from retaliation. What a study owes is an account of how it handles attribution, anonymity, and conflict of interest, not a rule that reads the role off whether the person is named.

Misclassifying an evaluator as a participant can become a small instance of what Haggerty (2004) calls ethics creep: the outward expansion of research-ethics review over activities it was never built to cover. A role-test checks the creep well beyond grammaticality, in any field where qualified assessment of materials is read as a sample of the assessors. It checks the opposite error too, refusing the relabeling that would dodge review where persons are the object.

The institutional claim stays narrow. Local boards decide the scope of their own review, and no definition settles policy. The conceptual point, argued in Section <a href="#sec:regulatory-asymmetry" data-reference-type="ref+label" data-reference="sec:regulatory-asymmetry">6</a>, is only that the word “judgment” shouldn’t trigger participant review on its own. What should trigger review is whether the design studies persons as persons. The bridge is normative: evidential role is what participant status should track, not something that by itself settles how a board must classify a study.

# Conclusion

Expert grammaticality judgments should be classified by evidential role. When the research target is a population of judgments, an effect on judges, or the social distribution of acceptability, judges are participants. When the target is the linguistic status of materials and expert judgment is used to evaluate those materials, judges are evaluators.

Classifying by evidential role makes grammaticality research more transparent, not less: participant protections where persons are the object of inquiry; qualification, calibration, independence, disclosure, and fair treatment of labour where judgment is part of the measurement procedure. The right analogy is often not the survey respondent but the essay scorer, a fallible, reportable, correctable evaluator whose work affects measurement quality. Where the judgment is the object, protect the judge; where the judgment is the instrument, calibrate the procedure.

# Use of AI tools

The large language models Claude (Anthropic; Opus 4.8) and GPT-5.5 (OpenAI), used as released and accessed through their providers’ web and programming interfaces in June 2026, served as drafting and editing aids. I used them to draft and revise prose, to develop and stress-test the argument and its counterarguments, and to propose candidate sources. Every source, quotation, and citation was verified against the primary document before inclusion; no bibliographic content was taken from model output unverified. I conceived the thesis, performed all verification, and am responsible for all claims, arguments, errors, and interpretive choices. I declare no competing interest arising from the tools’ use.

<div id="refs" class="references csl-bib-body hanging-indent">

<div id="ref-aera2014standards" class="csl-entry">

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. 2014. *Standards for Educational and Psychological Testing*. American Educational Research Association.

</div>

<div id="ref-tcps2_2022" class="csl-entry">

Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of Canada, and Social Sciences and Humanities Research Council of Canada. 2022. *Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans*. Policy statement. Government of Canada. <https://ethics.gc.ca/eng/policy-politique_tcps2-eptc2_2022.html>.

</div>

<div id="ref-dabrowska2010" class="csl-entry">

Dąbrowska, Ewa. 2010. “Naive v. Expert Intuitions: An Empirical Study of Acceptability Judgments.” *The Linguistic Review* 27 (1): 1–23. <https://doi.org/10.1515/tlir.2010.001>.

</div>

<div id="ref-gibsonfedorenko2010" class="csl-entry">

Gibson, Edward, and Evelina Fedorenko. 2010. “Weak Quantitative Standards in Linguistics Research.” *Trends in Cognitive Sciences* 14 (6): 233–34. <https://doi.org/10.1016/j.tics.2010.03.005>.

</div>

<div id="ref-gibsonfedorenko2013" class="csl-entry">

Gibson, Edward, and Evelina Fedorenko. 2013. “The Need for Quantitative Methods in Syntax and Semantics Research.” *Language and Cognitive Processes* 28 (1–2): 88–124. <https://doi.org/10.1080/01690965.2010.515080>.

</div>

<div id="ref-gibson2011mturk" class="csl-entry">

Gibson, Edward, Steven T. Piantadosi, and Kristina Fedorenko. 2011. “Using Mechanical Turk to Obtain and Analyze English Acceptability Judgments.” *Language and Linguistics Compass* 5 (8): 509–24. <https://doi.org/10.1111/j.1749-818X.2011.00295.x>.

</div>

<div id="ref-Goodman1955" class="csl-entry">

Goodman, Nelson. 1955. *Fact, Fiction, and Forecast*. Harvard University Press.

</div>

<div id="ref-haggerty2004" class="csl-entry">

Haggerty, Kevin D. 2004. “Ethics Creep: Governing Social Science Research in the Name of Ethics.” *Qualitative Sociology* 27 (4): 391–414. <https://doi.org/10.1023/B:QUAS.0000049239.15922.a3>.

</div>

<div id="ref-huddleston2002" class="csl-entry">

Huddleston, Rodney, and Geoffrey K. Pullum. 2002. *The Cambridge Grammar of the English Language*. Cambridge University Press. <https://doi.org/10.1017/9781316423530>.

</div>

<div id="ref-kaushik2023crowdworkers" class="csl-entry">

Kaushik, Divyansh, Zachary C. Lipton, and Alex John London. 2023. “Resolving the Human-Subjects Status of Machine Learning’s Crowdworkers.” *Queue* 21 (6): 101–27. <https://doi.org/10.1145/3639452>.

</div>

<div id="ref-lossWicklund2020" class="csl-entry">

Loss, Sara S., and Mark Wicklund. 2020. “Is English Resumption Different in Appositive Relative Clauses?” *Canadian Journal of Linguistics/Revue canadienne de linguistique* 65 (1): 25–51. <https://doi.org/10.1017/cnj.2019.19>.

</div>

<div id="ref-mccaffrey2022scoring" class="csl-entry">

McCaffrey, Daniel F., Jodi M. Casabianca, Kathryn L. Ricker-Pedley, René R. Lawless, and Cathy Wendler. 2022. “Best Practices for Constructed-Response Scoring.” *ETS Research Report Series* 2022 (1): 1–58. <https://doi.org/10.1002/ets2.12358>.

</div>

<div id="ref-pre_scope" class="csl-entry">

Panel on Research Ethics. 2022. “TCPS 2 Interpretations: Scope of REB Review.” <https://ethics.gc.ca/eng/policy-politique_interpretations_scope-portee.html>.

</div>

<div id="ref-phillips2010" class="csl-entry">

Phillips, Colin. 2010. “Should We Impeach Armchair Linguists?” In *Japanese/Korean Linguistics*, edited by Shoichi Iwasaki, Hajime Hoji, Patricia M. Clancy, and Sung-Ock Sohn, vol. 17. CSLI Publications.

</div>

<div id="ref-riemer2009" class="csl-entry">

Riemer, Nick. 2009. “Grammaticality as Evidence and as Prediction in a Galilean Linguistics.” *Language Sciences* 31 (5): 612–33. <https://doi.org/10.1016/j.langsci.2008.04.001>.

</div>

<div id="ref-rottger2022annotation" class="csl-entry">

Röttger, Paul, Bertie Vidgen, Dirk Hovy, and Janet B. Pierrehumbert. 2022. “Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks.” In *Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, edited by Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz. Association for Computational Linguistics. <https://doi.org/10.18653/v1/2022.naacl-main.13>.

</div>

<div id="ref-schutze2016" class="csl-entry">

Schütze, Carson T. 2016. *The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology*. Language Science Press. <https://doi.org/10.17169/langsci.b89.100>.

</div>

<div id="ref-sprouse2013" class="csl-entry">

Sprouse, Jon, Carson T. Schütze, and Diogo Almeida. 2013. “A Comparison of Informal and Formal Acceptability Judgments Using a Random Sample from *Linguistic Inquiry* 2001–2010.” *Lingua* 134: 219–48. <https://doi.org/10.1016/j.lingua.2013.07.002>.

</div>

<div id="ref-commonrule2018" class="csl-entry">

U.S. Department of Health and Human Services. 2018. “Federal Policy for the Protection of Human Subjects (Common Rule), 45 CFR Part 46.” <https://www.ecfr.gov/current/title-45/part-46>.

</div>

</div>

[^1]: Contact: <brett.reynolds@humber.ca>
