So-called neurosymbolic models, which combine learning algorithms with symbolic reasoning techniques, appear much better suited to predicting, explaining, and considering counterfactual possibilities than neural networks alone. But researchers at DeepMind claim that neural networks can outperform neurosymbolic models under the right test conditions. In a preprint paper, the co-authors describe an architecture for spatio-temporal reasoning about videos in which all components are learned and all intermediate representations are distributed (rather than symbolic) throughout the layers of the neural network. The team says it outperforms neurosymbolic models on all question types in a popular dataset, with the biggest advantage on counterfactual questions.
DeepMind's research could have implications for the development of machines that can reason about their experiences. Contrary to what previous studies concluded, models based solely on distributed representations can perform well on visual tasks that measure high-level cognitive function, according to the researchers – at least insofar as they outperform existing neurosymbolic models.
The neural network architecture proposed in the paper uses attention to integrate information effectively. (Attention is the mechanism by which the algorithm focuses on a single element or a few elements at a time.) It is self-supervised, which means that the model must infer masked-out objects in videos from the underlying dynamics in order to extract more information. And the architecture ensures that the visual elements in videos correspond to physical objects, a step the co-authors consider essential for higher-level reasoning.
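To make the parenthetical on attention concrete, here is a minimal sketch of scaled dot-product attention, a common formulation of the mechanism. It is not taken from the DeepMind paper: the function names `softmax` and `attend` and the toy "object" vectors are hypothetical, and nothing here is learned or trained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns (weights, output). The weights sum to 1 and say how strongly
    the query focuses on each element; the output is the weighted mix of
    the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical toy data: three "object" vectors from one video frame.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A query that resembles the first and third objects more than the second.
query = [1.0, 0.0]

weights, output = attend(query, objects, objects)
print(weights)  # the second object receives the smallest weight
print(output)   # a blend of the objects, dominated by the first and third
```

In a trained model the queries, keys, and values would be produced by learned projections of the video features; the fixed vectors above only illustrate how the weights concentrate on the elements most similar to the query.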
The researchers benchmarked their neural network on CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a dataset that draws on insights from psychology. CLEVRER contains over 20,000 5-second videos of colliding objects (three shapes of two materials and eight colors) generated by a physics engine and over 300,000 questions and answers, all focusing on four elements of logical reasoning: descriptive (e.g., “what color”), explanatory (“what’s responsible for”), predictive (“what will happen next”), and counterfactual (“what if”).
According to the DeepMind co-authors, their neural network matched the performance of the best neurosymbolic models without pretraining or labeled data and with 40% less training data, challenging the idea that neural networks are more data-hungry than neurosymbolic models. Moreover, it scored 59.8% on the most difficult counterfactual questions – better than both chance and all other models – and generalized to other tasks, including those in CATER, an object-tracking video dataset in which the goal is to predict the location of a target object in the final frame.
“Our results … add to a body of evidence that deep networks can reproduce many properties of human knowledge and reasoning, while benefiting from the flexibility and expressiveness of distributed representations,” the co-authors wrote. “Neural models have also had some success in mathematics, an area that, intuitively, seems to require the execution of formal rules and the manipulation of symbols. Somewhat surprisingly, large-scale neural language models … can acquire a propensity for arithmetic reasoning and analogy-making without being explicitly trained for such tasks, suggesting that current neural network limitations are ameliorated when scaling to more data and using larger and more efficient architectures.”