Humans outperform AI in a reasoning test

What can’t today’s AI systems do? Solve simple visual logic puzzles.

Forming and abstracting concepts is at the heart of human intelligence. These abilities enable humans to understand and create internal models of the world, to use these models to make sense of new information, often via analogy, and to decide how to behave in novel situations. 

In AI, research on concept formation and abstraction often utilizes idealized domains that capture some of the essential aspects of abstraction and analogy in the real world. In such domains, one can be explicit about the assumed prior knowledge without requiring the open-ended knowledge involved in real-world language and imagery. 

Research by Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell at the Santa Fe Institute examined how to evaluate the degree to which an AI system has learned or understood a concept in a generalizable way. Their report describes tests of human participants as well as three machine solvers: the top two programs from a 2021 ARC competition and OpenAI’s GPT-4.

The results show that humans outperform the machine solvers on this benchmark, demonstrating abilities to abstract and generalize concepts that current AI systems do not yet capture.

Understanding the research

François Chollet proposed the Abstraction and Reasoning Corpus (ARC) to evaluate abstract concept understanding and reasoning abilities in humans and AI systems. ARC is a set of analogy problems, each consisting of a set of demonstrations (initial and transformed grids) and one or more test input grids.

In Chollet’s terminology, the demonstrations coupled with the test inputs form a task to be solved. To solve a task, an agent must infer the abstract rule governing the demonstrations and apply that rule to each test input to produce a correct output grid. 
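The sketch below shows one illustrative way to represent such a task in code. It is not the official ARC data format, and the mirroring rule is a made-up stand-in for the kind of abstract rule an agent might infer from the demonstrations.

```python
from typing import List, Tuple

Grid = List[List[int]]  # each integer is a colour index; 0 is the background


class ArcTask:
    """Bundle of demonstration (input, output) grid pairs plus test input grids."""

    def __init__(self, demonstrations: List[Tuple[Grid, Grid]], test_inputs: List[Grid]):
        self.demonstrations = demonstrations
        self.test_inputs = test_inputs


def mirror_rule(grid: Grid) -> Grid:
    """Hypothetical abstract rule, used only for illustration: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]


# One demonstration pair generated by the illustrative rule.
demo_input: Grid = [[1, 0, 0],
                    [0, 2, 0]]
task = ArcTask(
    demonstrations=[(demo_input, mirror_rule(demo_input))],
    test_inputs=[[[3, 0],
                  [0, 4]]],
)

# An agent that has inferred the rule applies it to every test input.
predicted = [mirror_rule(g) for g in task.test_inputs]
print(predicted)  # [[[0, 3], [4, 0]]]
```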

The researchers created ConceptARC by choosing 16 concepts. Each concept is central to one or more tasks in Chollet’s published ARC “training” and “evaluation” sets, though those sets were not organized around specific concepts. 

Human participants were recruited through the Amazon Mechanical Turk and Prolific platforms and tested on tasks from the corpus. They solved the tasks in a visual interface adapted from the original ARC viewer and programmed for online data collection using the Psiturk framework.

Each participant was presented with a random selection of tasks. Each task had three test inputs, but these were randomly split among participants, with each participant seeing only one test input of a given task. The participants were given three attempts to solve each test input. If participants solved the test input correctly, they were asked to verbally describe their solution before moving on to the next task. Participants were excluded from further analysis if they failed to solve two or more minimal tasks or provided empty or nonsensical explanations for their solutions.

The ARC-Kaggle programs were allowed to make three predictions for each test input, and if any of the predictions was correct, the test input was considered solved. To test the language-only version of GPT-4 on the tasks in ConceptARC, the researchers used the API provided by OpenAI.
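A minimal sketch of that scoring convention, assuming an exact-match comparison between predicted and target grids (the function and variable names here are illustrative, not taken from the researchers’ code):

```python
from typing import List

Grid = List[List[int]]


def is_solved(predictions: List[Grid], target: Grid, max_attempts: int = 3) -> bool:
    """A test input counts as solved if any of up to three predicted grids matches the target exactly."""
    return any(p == target for p in predictions[:max_attempts])


# Example: the second of three attempts is an exact match, so this test input is solved.
target: Grid = [[0, 1], [1, 0]]
attempts: List[Grid] = [[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[0, 0], [0, 0]]]
print(is_solved(attempts, target))  # True
```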

The results

Human participants achieved substantially higher accuracy than the machine solvers on the ConceptARC benchmark, exhibiting over 90% average accuracy on 11 of the 16 concepts and over 80% accuracy on each of the remaining five. Average accuracy across the test inputs in a concept group measures how well solvers generalize across the different tasks representing that concept.
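As a concrete illustration of that per-concept measure, the sketch below (with made-up outcomes and illustrative concept names) averages solved/unsolved results over the test inputs in each concept group:

```python
from typing import Dict, List


def concept_accuracy(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of test inputs solved within each concept group."""
    return {concept: sum(solved) / len(solved) for concept, solved in results.items()}


# Hypothetical outcomes for two concept groups (True = test input solved).
results = {
    "Inside vs. Outside": [True, True, False, True, True, True],
    "Top vs. Bottom":     [True, False, True, True, False, True],
}
for concept, acc in concept_accuracy(results).items():
    print(f"{concept}: {acc:.0%}")
# Inside vs. Outside: 83%
# Top vs. Bottom: 67%
```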

GPT-4, whose performance might be considered impressive given that it was not designed or trained for such tasks, still had an accuracy below 30% on 15 of the 16 concepts. Its weak performance here contrasts with its much better performance on other idealized domains for analogy-making.

The ARC-Kaggle programs also did not reach human-level accuracy, though it is interesting to note that they achieved significantly higher accuracy on the ConceptARC tasks than they did on the original tasks in the ARC-Kaggle competition, where their respective accuracies were 21% and 19%.

To summarize, the results show that humans exhibit strong conceptual generalization abilities in the ARC domain, while current AI programs, both those designed for this domain and more general-purpose large language models, remain much weaker. The scientists believe that their benchmark and future extensions will spur improvements both in the development of AI systems for conceptual abstraction and in the effective evaluation of such systems.

 
