FUNSD is a dataset for form understanding in noisy scanned documents. It consists of 199 real scanned forms, of which 149 are for training and 50 for testing. Previous works evaluate performance on the SER task and the RE task. The SER task aims to classify text blocks into four categories: header, question, answer, and other. The RE task aims to identify the linking relationships between known entities (SER ground-truth labels).
The XFUND dataset is a multilingual extension of FUNSD that covers seven languages (Chinese, Japanese, French, Italian, German, Spanish, and Portuguese). Each language subset contains 199 real scanned forms, of which 149 are for training and 50 for testing.
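To make the SER and RE setup concrete, the sketch below reads FUNSD-style annotations and tallies the four SER labels and the annotated links. It assumes the public FUNSD release layout (per-page JSON files with a top-level "form" list whose entries carry "label" and "linking" fields); the local directory path is a placeholder.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical local path to the FUNSD training annotations; adjust as needed.
ANNOT_DIR = Path("funsd/training_data/annotations")

label_counts = Counter()
link_count = 0
for ann_file in ANNOT_DIR.glob("*.json"):
    page = json.loads(ann_file.read_text(encoding="utf-8"))
    for entity in page["form"]:                        # one entry per text block (entity)
        label_counts[entity["label"]] += 1             # header / question / answer / other
        link_count += len(entity.get("linking", []))   # annotated entity links (may repeat per endpoint)

print("SER label distribution:", dict(label_counts))
print("Annotated link entries:", link_count)
```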
Model Version: Sep. 25
Please read the text in this image and return the information in the following JSON format (note xxx is placeholder, if the information is not available in the image, put "N/A" instead): {"header": [xxx, ...], "key": [xxx, ...], "value": [xxx, ...]}
You are a document understanding AI, who reads the contents in the given document image and tells the information that the user needs. Respond with the original content in the document image, do not reformat. No extra explanation is needed. Extract all the key-value pairs from the document image.
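For reference, here is a minimal sketch of how such a prompt could be sent to a vision-capable chat model via the OpenAI Python SDK. The model name, image path, and token limit are placeholders, not the exact settings used in this evaluation.

```python
import base64
from openai import OpenAI

SYSTEM_PROMPT = (
    "You are a document understanding AI, who reads the contents in the given "
    "document image and tells the information that the user needs. Respond with "
    "the original content in the document image, do not reformat. No extra "
    "explanation is needed."
)
USER_PROMPT = "Extract all the key-value pairs from the document image."

def query_gpt4v(image_path: str) -> str:
    """Send one form image plus the prompts above to a vision-capable chat model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name; substitute the version under test
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": USER_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
        max_tokens=2048,
    )
    return resp.choices[0].message.content

# Example (hypothetical path): raw_reply = query_gpt4v("path/to/form.png")
```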
| Method | FUNSD Precision ↑ | FUNSD Recall ↑ | FUNSD F1 ↑ | FUNSD 1-NED ↑ | XFUND-zh Precision ↑ | XFUND-zh Recall ↑ | XFUND-zh F1 ↑ | XFUND-zh 1-NED ↑ |
|---|---|---|---|---|---|---|---|---|
| GPT-4V | 41.85% | 29.36% | 34.51% | 0.2697 | 25.87% | 15.15% | 19.11% | 0.1544 |
| Supervised-SOTA | - | - | - | 0.5500 | - | - | - | - |
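Here, 1-NED denotes one minus the normalized edit distance between predicted and ground-truth text. As an illustration (the exact matching and averaging procedure behind the scores above is not specified here), a per-string version could be computed as follows:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def one_minus_ned(pred: str, gold: str) -> float:
    """1-NED: one minus the edit distance normalized by the longer string."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
```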
We observe that the FUNSD and XFUND datasets contain some erroneous annotations that are inconsistent with human understanding. Hence, the results shown below may not be fully accurate.
| Dataset | Precision ↑ (%) | Recall ↑ (%) | F1 ↑ (%) | 1-NED ↑ |
|---|---|---|---|---|
| FUNSD | 20.69 | 10.25 | 13.71 | 0.1979 |
| XFUND-zh | 0.07 | 0.02 | 0.03 | 0.0420 |
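As an illustration of how pair-level precision, recall, and F1 could be scored, below is a minimal exact-match sketch over (key, value) tuples; the whitespace normalization and matching rules are assumptions and may differ from the protocol used for the tables above.

```python
def pair_prf(pred_pairs: list[tuple[str, str]],
             gold_pairs: list[tuple[str, str]]) -> tuple[float, float, float]:
    """Exact-match precision/recall/F1 over (key, value) pairs.

    A predicted pair counts as correct only if the identical (key, value)
    tuple appears in the ground truth; whitespace is normalized first.
    """
    norm = lambda p: (" ".join(p[0].split()), " ".join(p[1].split()))
    pred = {norm(p) for p in pred_pairs}
    gold = {norm(p) for p in gold_pairs}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```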
Illustration of error cases in the SER task. The text content enclosed within the red boxes is incorrectly identified as header entities.
Illustration of entity prediction on full document images in the FUNSD dataset. Due to GPT-4V's limited capability in recognizing Chinese characters, examples from the XFUND-zh dataset are excluded here. Zoom in for the best view.
Cases from the pair extraction task. GPT-4V generates the keys enclosed within the red boxes, which do not exist in the document image.