Introduced in 1950 by the English mathematician and logician Alan Turing, the Turing test is a behavioral assessment of whether a machine’s text-based conversational performance is indistinguishable from that of a human interlocutor, a proposal Turing framed as an "imitation game." According to Turing’s paper in Mind, the interrogator exchanges typed messages with unseen partners—a human and a machine—and must decide which is which. Turing predicted that, within about fifty years, computers might play well enough that an average interrogator would have no more than a 70% chance of making the right identification after five minutes of questioning.
I.—COMPUTING MACHINERY AND INTELLIGENCE (Mind/Oxford Academic);
The Turing Test (Stanford Encyclopedia of Philosophy);
Turing test | Britannica
Definition and formulation
- The test operationalizes the question of whether we may "speak of machines thinking" by replacing the ambiguous question “Can machines think?” with an empirical game of verbal discrimination. The interrogator’s task is constrained to textual dialogue; success is measured statistically by the judge’s rate of misclassification under specified conditions.
I.—COMPUTING MACHINERY AND INTELLIGENCE (Mind/Oxford Academic);
The Turing Test (Stanford Encyclopedia of Philosophy)
- Turing’s 1950 prediction referenced a five‑minute exchange and a 70% ceiling on correct identification, but subsequent scholarship emphasizes that such parameters are part of his forecast, not inherent to the test itself, and may be varied in implementations.
The Turing Test (Stanford Encyclopedia of Philosophy);
Turing test | Britannica
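The statistical criterion described above can be made concrete with a toy calculation. The sketch below uses the parameters from Turing's forecast (five minutes, a 70% ceiling on correct identification); the function name and the trial counts are illustrative, not part of any standard protocol.

```python
# Toy illustration of Turing's 1950 forecast criterion. The 70% figure is
# from his prediction for the year 2000, not a fixed rule of the test:
# the machine meets the forecast if an average judge identifies it
# correctly no more than 70% of the time after five minutes of questioning.

def meets_turing_forecast(correct_identifications: int, trials: int,
                          threshold: float = 0.70) -> bool:
    """True if the judges' correct-identification rate is at or below
    the threshold Turing predicted."""
    rate = correct_identifications / trials
    return rate <= threshold

# Example: judges identify the machine correctly in 62 of 100 games.
print(meets_turing_forecast(62, 100))  # 0.62 <= 0.70, forecast met
```

Varying `threshold` or the session length yields the family of "implementations" that the scholarship above distinguishes from Turing's original forecast.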
Context and early systems
- In the 1960s, Joseph Weizenbaum’s ELIZA program demonstrated how simple pattern‑matching rules could produce conversationally plausible replies, notably in a psychotherapist script, prompting debate about the superficiality of language mimicry.
ELIZA—a computer program for the study of natural language communication between man and machine (Communications of the ACM, 1966).
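The pattern-matching approach can be sketched in a few lines. The rules below are illustrative stand-ins, not Weizenbaum's actual DOCTOR script, but they show the core trick: a regex matches a fragment of the user's utterance and a template reflects it back as a question.

```python
import re

# Minimal ELIZA-style rule engine (illustrative rules, not the 1966 script).
# Each rule pairs a regex with a response template that echoes captured text.
RULES = [
    (re.compile(r"i feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.+)", re.I), "Tell me more about your {0}."),
]

def respond(utterance: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Please go on."  # content-free deflection when nothing matches

print(respond("I feel anxious about exams"))
# -> Why do you feel anxious about exams?
```

That a handful of such rules can sustain a plausible-seeming dialogue is precisely what fueled the debate over superficial mimicry noted above.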
- Public competitions adopted Turing‑style formats, most prominently the Loebner Prize (1991–2019), which annually awarded the “most humanlike” chatbot under controlled text‑chat sessions; the event’s rules and judging standards drew criticism from AI researchers for rewarding deception and narrow tricks rather than general intelligence.
Loebner Prize (Wikipedia overview of competition history);
AISB report on the 2019 event.
Interpretation and scope
- The test is behaviorist and performance‑based: it evaluates indistinguishability in dialogic behavior, not the internal mechanisms or representational states of the system. Turing’s article also catalogs and responds to objections, making the test a philosophical device as well as a methodological proposal.
I.—COMPUTING MACHINERY AND INTELLIGENCE (Mind/Oxford Academic);
The Turing Test (Stanford Encyclopedia of Philosophy)
- Encyclopaedia Britannica summarizes the contemporary reading: a judge interacts remotely, usually via text, for a fixed time, and success is measured by the probability of misidentification across trials.
Turing test | Britannica
Philosophical critiques
- John Searle’s Chinese room argument (1980) claims that symbol manipulation alone—of the sort a program performs—does not constitute understanding, even if the system passes a Turing test; the scenario imagines a person following rules to produce Chinese outputs without knowing Chinese.
Minds, brains, and programs (Behavioral and Brain Sciences, 1980);
The Chinese Room Argument (Stanford Encyclopedia of Philosophy)
- Responses include the “systems reply,” which locates understanding in the whole system rather than the rule‑following individual, and other replies that target the assumptions of the argument.
The Chinese Room Argument (Stanford Encyclopedia of Philosophy)
Variants and alternatives
- The “Total Turing Test” extends the original’s linguistic focus to include perceptual and motor capacities—proposing that a convincing agent should perform across the broader spectrum of human skills.
Other bodies, other minds (Minds and Machines, 1991);
The Truly Total Turing Test (Minds and Machines, 1998).
- The Winograd Schema Challenge was introduced as an alternative emphasizing commonsense reasoning via carefully crafted pronoun‑resolution problems; early competitions in 2016 saw no system reach human‑level accuracy, though by 2019–2024 transformer‑based models surpassed 90% on benchmark variants, prompting debate about dataset artifacts and what the scores reveal.
Planning, Executing, and Evaluating the Winograd Schema Challenge (AI Magazine, 2016);
The First Winograd Schema Challenge at IJCAI‑16 (AI Magazine, 2017);
The Defeat of the Winograd Schema Challenge (AAAI 2024)
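The structure of a Winograd schema can be shown with Levesque's canonical trophy/suitcase example: a pair of sentences differing in one "special" word flips the correct referent of an ambiguous pronoun, so surface statistics alone should not resolve it. The data layout and scoring function below are an illustrative sketch, not the official challenge harness.

```python
# One Winograd schema (the canonical trophy/suitcase pair): swapping the
# special word ("big" vs "small") flips the pronoun's correct referent.
SCHEMA = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {}.",
    "pronoun": "it",
    "candidates": ["the trophy", "the suitcase"],
    "answers": {"big": "the trophy", "small": "the suitcase"},
}

def score(predict, schema) -> float:
    """Accuracy of a pronoun resolver over both variants of one schema."""
    correct = 0
    for word, gold in schema["answers"].items():
        sentence = schema["sentence"].format(word)
        if predict(sentence, schema["pronoun"], schema["candidates"]) == gold:
            correct += 1
    return correct / len(schema["answers"])

# A baseline that always picks the first candidate gets exactly one variant
# right, which is why chance performance on schema pairs is 50%.
print(score(lambda s, p, c: c[0], SCHEMA))  # -> 0.5
```

Because the two variants have opposite answers, any resolver that ignores the special word is pinned at 50%, which is what makes the schemas a sharper probe than free-form chat.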
Modern language models and empirical results
- Large language models are optimized to produce human‑like text and can sometimes fool judges in short interactions, but controlled tests to date show mixed results relative to human baselines, and scholars caution that “passing” depends on test design and statistical criteria. A 2023 public Turing‑test study reported GPT‑4 being judged human in roughly half of games, below human foils.
Does GPT‑4 pass the Turing test? (arXiv, 2023)
- Communications of the ACM summarizes that recent LLMs have not “convincingly” passed robust formulations and discusses the interpretive limits of the test for general intelligence.
Beyond Turing: Testing LLMs for Intelligence (Communications of the ACM, 2024)
- A 2024 Science commentary reviews recent claims of passing under looser, two‑player protocols and contrasts them with Turing’s original three‑party setup.
The Turing Test and our shifting conceptions of intelligence (Science, 2024)
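The point that "passing" depends on the statistical criterion can be sketched directly: studies like the one above compare the rate at which the machine is judged human against the rate for human foils, rather than against an absolute 50% bar. The function and the numbers below are illustrative, not figures from any specific study.

```python
# Sketch of a baseline-relative pass criterion, as used in three-party
# public Turing-test studies: the machine is compared to human foils,
# not to a fixed 50% bar. Numbers are illustrative only.

def passes_against_baseline(machine_judged_human: float,
                            human_baseline: float) -> bool:
    """Strict reading: the machine should be judged human at least as
    often as actual human participants are."""
    return machine_judged_human >= human_baseline

# A system judged human in 49% of games fails against foils at 66%,
# even though a naive 50% criterion would call this a near-pass.
print(passes_against_baseline(0.49, 0.66))  # -> False
```

Looser two-player protocols effectively drop the foil baseline, which is one reason claims of "passing" under them are contested.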
Reverse Turing tests and applications
- CAPTCHAs are sometimes described as reverse Turing tests: automated challenges designed so that humans succeed and bots fail, used widely for web security. Their modern formulation and applications were articulated in work by von Ahn, Blum, and Langford.
Telling Humans and Computers Apart Automatically (Communications of the ACM, 2004)
- Subsequent research documents both evolving designs and growing vulnerabilities as machine perception improves.
A CAPTCHA design based on visual reasoning (ICASSP, 2018)
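The reverse-test protocol reduces to two server-side steps: generate a challenge presumed easy for humans and hard for bots, then check the response. The sketch below uses a plain text code as a trivially simple stand-in; real CAPTCHAs distort images or pose visual-reasoning tasks precisely because plain text is machine-readable.

```python
import random
import string

# Minimal reverse-Turing-test sketch in the CAPTCHA spirit. The challenge
# here is a bare text code; in practice it would be rendered as a distorted
# image or a visual-reasoning puzzle so that OCR alone cannot solve it.

def make_challenge(length=6, rng=None):
    """Generate the expected answer; the server keeps it secret."""
    rng = rng or random.Random()
    return "".join(rng.choices(string.ascii_uppercase, k=length))

def verify(expected: str, submitted: str) -> bool:
    """Case-insensitive check of the user's transcription."""
    return submitted.strip().upper() == expected

challenge = make_challenge(rng=random.Random(0))
print(verify(challenge, challenge.lower()))  # -> True
```

The arms race documented in the research above plays out entirely in `make_challenge`: as machine perception improves, the generator must produce challenges that stay human-solvable while defeating ever-stronger solvers.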
Terminology and influence
- Turing’s own terminology—the Imitation Game—highlights the behavioral criterion; the test has been a touchstone for research agendas, public competitions, and critiques within artificial intelligence and philosophy of mind. Contemporary discourse often situates the test relative to the capacities and limitations of modern systems, including CAPTCHA defenses, historical chatbots such as ELIZA, and alternative reasoning benchmarks like the Winograd Schema Challenge.
I.—COMPUTING MACHINERY AND INTELLIGENCE (Mind/Oxford Academic);
The Turing Test (Stanford Encyclopedia of Philosophy);
Turing test | Britannica