Knowledge-Aware Reasoning over Multimodal Semi-structured Tables

About

Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data.

This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI's comprehension and capabilities in analyzing multimodal structured data.

Dataset Details

We create the MMTabQA dataset, with 69,740 questions over 25,026 tables, by augmenting existing tables from four data sources:

WikiSQL dataset, to benchmark model capabilities in parsing entities accurately and answering basic SQL-based questions.
WikiTableQuestions dataset, to include questions that require more complex reasoning.
FeTaQA dataset, to include long-form answer-based questions that involve reasoning over multiple rows and columns.
HybridQA dataset, which includes extra contextual passages beyond the tables and requires hybrid complex reasoning.

The dataset contains questions of the following types:

Explicit Questions: questions that mention an entity that is replaced by an image in the table.
Answer-Mention Questions: questions whose answer contains an entity that is replaced by an image in the table, while the question itself does not mention it.
Implicit Questions: questions where an image-replaced entity is involved in intermediate reasoning but is mentioned in neither the question nor the answer.
Visual Questions: questions that involve visual aspects of entities. These are created by synthetically augmenting the explicit questions in the dataset with visual attributes of the entity they refer to.
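The four categories above can be read as a simple decision rule over which text mentions an image-replaced entity. The following is a minimal sketch of that rule, not the labelling pipeline used to build the dataset: the `Example` fields, the case-insensitive string matching, and the order in which the checks are applied are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """Hypothetical record; field names are illustrative, not the dataset schema."""
    question: str
    answer: str
    # Entities in the table that were replaced by images.
    image_entities: set
    # Entities needed for intermediate reasoning steps (assumed to be known).
    reasoning_entities: set = field(default_factory=set)
    # True if the question was synthetically augmented with visual attributes.
    visual_augmented: bool = False


def question_type(ex: Example) -> str:
    """Assign one of the four question types described above."""
    question = ex.question.lower()
    answer = ex.answer.lower()
    if ex.visual_augmented:
        return "visual"
    if any(e.lower() in question for e in ex.image_entities):
        return "explicit"
    if any(e.lower() in answer for e in ex.image_entities):
        return "answer-mention"
    if ex.image_entities & ex.reasoning_entities:
        return "implicit"
    return "unaffected"  # no image-replaced entity is involved


if __name__ == "__main__":
    ex = Example(
        question="Which club did Player A join in 2010?",
        answer="Club B",
        image_entities={"Player A"},
    )
    print(question_type(ex))
```

In this sketch a visual question is checked first because, per the description above, it starts out as an explicit question and would otherwise be labelled as one.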

Data Source	No. of Questions	Avg. Images per Table	No. of Tables
WikiSQL	21,472	13.68	9,784
WikiTableQuestions	10,052	17.67	1,259
FeTaQA	7,476	10.43	5,898
HybridQA	30,470	14.64	8,085
Overall	69,740	14.10	25,026

Table: Primary Dataset Statistics

Data Source	Explicit Ques	Implicit Ques	Visual Ques	Answer-Mention Ques
FeTaQA	2,499	612	1,185	3,180
WikiTableQuestions	3,523	2,879	877	2,773
WikiSQL	12,956	315	1,827	6,374
HybridQA	5,819	17,647	1,874	5,130

Question Type Statistics: Distribution of the different types of questions in the dataset.
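To make the distribution easier to compare across sources, the counts can be turned into per-source percentages. The sketch below copies the counts from the question type table; the names `COUNTS` and `type_shares` are illustrative only.

```python
# Question counts per data source, copied from the question type table.
COUNTS = {
    "FeTaQA": {"explicit": 2499, "implicit": 612, "visual": 1185, "answer_mention": 3180},
    "WikiTableQuestions": {"explicit": 3523, "implicit": 2879, "visual": 877, "answer_mention": 2773},
    "WikiSQL": {"explicit": 12956, "implicit": 315, "visual": 1827, "answer_mention": 6374},
    "HybridQA": {"explicit": 5819, "implicit": 17647, "visual": 1874, "answer_mention": 5130},
}


def type_shares(row: dict) -> dict:
    """Return each question type's share of a source's questions, in percent."""
    total = sum(row.values())
    return {qtype: round(100 * n / total, 1) for qtype, n in row.items()}


if __name__ == "__main__":
    for source, row in COUNTS.items():
        print(source, sum(row.values()), type_shares(row))
```

The row totals reproduce the per-source question counts in the primary statistics table. The shares show that implicit questions make up the majority of the HybridQA portion, while explicit questions dominate WikiSQL.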

Dataset Creation

The figure below shows the steps followed for creating the dataset:

Dataset Examples

Experimental Results

Dataset	WikiTableQuestions				WikiSQL				FeTaQA
Model	EQ	AQ	IQ	VQ	EQ	AQ	IQ	VQ	EQ	AQ	IQ	VQ
Partial Input Baseline
Gemini-1.5 Flash	40.99	27.38	48.95	31.4	39.14	28.71	62.22	28	0.51	0.44	0.44	0.47
GPT-4o	57.45	38.02	70.83	42.40	52.57	43.86	72.38	39.00	0.51	0.46	0.42	0.44
Llama-3 70B	41.13	26.48	43.75	31.8	41.117	30.75	61.27	30.6	0.52	0.46	0.45	0.48
Mixtral 8x7B	26.56	9.90	30.26	20.2	23.42	17.71	28.88	19.2	0.44	0.39	0.38	0.39
Oracle-Entity Replaced Baseline
Gemini-1.5 Flash	74.89	78.19	54.86	-	82.28	81.86	77.46	-	0.56	0.50	0.41	-
GPT-4o	87.80	84.86	84.55	-	85.57	82.71	79.05	39.00	0.53	0.48	0.43	-
Llama-3 70B	75.74	75.31	58.85	-	78.28	78.57	68.25	-	0.49	0.46	0.41	-
Mixtral 8x7B	54.89	53.87	40.69	-	59.28	69.28	33.96	-	0.44	0.41	0.33	-
Image-Captioning Baseline
Gemini-1.5 Flash	52.34	42.16	51.39	42.2	50.42	40.85	67.30	46.6	0.57	0.46	0.42	0.43
Table-as-an-Image Baseline
Gemini-1.5 Flash	44.22	25.65	41.01	37.8	47.08	35.75	52.38	35.25	0.62	0.43	0.42	0.47
GPT-4o	66.12	50.60	62.12	46.0	72.28	64.82	58.49	32.6	0.62	0.46	0.43	0.50
Llama-3 70B	57.76	41.90	45.54	42.1	55.24	48.54	59.84	45.8	0.56	0.45	0.41	0.44
Mixtral 8x7B	40.34	20.38	41.16	30.9	38.74	29.29	39.12	30.9	0.55	0.43	0.41	0.42

Results on a sampled subset of MMTabQA. Substring match is reported for the Wiki-related data sources (WikiTableQuestions and WikiSQL), and ROUGE-L is reported for FeTaQA. EQ: Explicit Questions, AQ: Answer-Mention Questions, IQ: Implicit Questions, VQ: Visual Questions. Best performing models are highlighted in red.
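For readers unfamiliar with the two metrics named in the caption, the following is a minimal sketch of how they are commonly computed. It is not the evaluation script behind the table: the normalization (lowercasing and whitespace collapsing), the whitespace tokenization, and the choice of the F1 variant of ROUGE-L are assumptions, and established ROUGE implementations may differ in details such as stemming.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def substring_match(prediction: str, gold: str) -> bool:
    """True if the normalized gold answer occurs inside the prediction."""
    return normalize(gold) in normalize(prediction)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall over tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(substring_match("The answer is Club B.", "club b"))
    print(rouge_l_f1("player a joined club b in 2010", "player a joined club b"))
```

Substring match suits the short answers of WikiTableQuestions and WikiSQL, where a model may wrap the correct entity in a longer sentence, while an overlap measure such as ROUGE-L suits the long-form answers of FeTaQA. Note that the table reports substring match on a 0 to 100 scale and ROUGE-L on a 0 to 1 scale.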

People

The MMTabQA dataset was prepared through a collaboration across multiple institutions (IIIT Hyderabad, IIT Guwahati, UCLA, UPenn, and UNC Chapel Hill) by the following people:

Citation

Please cite our paper as below if you use the MMTabQA dataset.

@misc{mathur2024knowledgeawarereasoningmultimodalsemistructured,
	title={Knowledge-Aware Reasoning over Multimodal Semi-structured Tables}, 
	author={Suyash Vardhan Mathur and Jainit Sushil Bafna and Kunal Kartik and Harshita Khandelwal and Manish Shrivastava and Vivek Gupta and Mohit Bansal and Dan Roth},
	year={2024},
	eprint={2408.13860},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2408.13860}, 
}

Acknowledgement

Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-20-1-0080. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. This work was partially funded by ONR Contract N00014-19-1-2620. Lastly, we extend our appreciation to the reviewing team for their insightful comments.