SIGNAL

Dataset for Semantic and Inferred Grammar Neurological Analysis of Language

Repository structure

This repository follows the next structure:

├── stimuli_generation                  # Linguistic stimuli preparation
|   ├──stimuli_check                    # Code for estimation and selection of stimuli parameters
|   ├──break_grammar                    # Code for generation of grammatically incongruent sentences  
|   └──break_semantics                  # Code for generation of semantically incongruent sentences                
├── EEG_processing                      # Source code for EEG data analysis
|   ├── z-scores_estimation             # Code for pairwise conditions comparison in EEG data
|   └── draw_plots                      # Code for visualisation
├── LLM_processing                      # Source code for LLM data analysis
|   └── LLM_probing                     # Code for pairwise condition comparison of LLM data
├── STIMULI.xlsx                        # Dataset with linguistic stimuli and their main parameters
├── README.md                           # README file
└── requirements.txt                    # A file with requirements

Dataset

In this paper, we present SIGNAL, a dataset for Semantic and Inferred Grammar Neurological Analysis of Language. Our dataset contains 600 Russian language sentences along with the 64-channel EEG recordings from humans reading these sentences in a carefully designed experimental paradigm.

The dataset include well-controlled stimuli balanced on key lexical-semantic properties and controlled syntactic structure including sentence groups distinguished by three syntactic structures and four congruency conditions (semantical, grammatical, and semantical-grammatical).

The possible syntactic structures were:

Subject + VERB + OBJECT
- Avtory poluchili podarki
- /Authors received presents/
SUBJECT + VERB + ADJECTIVE + OBJECT
- Dramaturg pridumal sovremenniy syujet
- /Writer invented modern storyline/
SUBJECT + VERB + OBJECT+ GENITIVE
- Programma pokajet mestopolozhenie predmeta
- /The programm will show location of the item/

The congruency conditions were semantical, grammatical, or semantical-grammatical (in)congruency of the Object argument within each sentence:

Congruent sentence:
- Storony podpisali soglashenie.
- /The parties signed an agreement (accusative)/
Semantically incongruent sentence:
- Storony podpisali detstvo.
- /The parties signed childhood (accusative)/
Grammatically incongruent sentence:
- Storony podpisali soglashenii.
- /The parties signed an agreement (locative)/
Semantically and gramatically incongruent sentence:
- Storony podpisali detstve.
- /The parties signed childhood (locative)/

Anomalous stimuli were generated using language model, and validity of them was checked via an online validation study with 133 respondents to prove that (in)congruence type is correctly identified by Russian native speakers. The reliability and interpretability of dataset was proven by EEG estimation results and LLMs probing.

Stimuli generation

The code allows to

control congruent sentences for balance of lexical-semantic parameters
generate a semantically/grammatically incongruent counterpart of the congruent sentence

To generate semantically incongruent sentences run the following script:

python break_semantics.py --input congruent_sentences.csv --output sem_inconguent.csv

To generate grammatically incongruent sentences run the following script:

python break_grammar.py --input congruent_sentences.csv --output gram_inconguent.csv

EEG analysis

EEG data include recordings of 21 participants revealing a statistical difference between stimuli congruence conditions on a neuro-physiological level.

The code allows to:

z-scores_estimation.py
- compute averageg event-related potential data within each condition
- compute z-scores to estimate pairwise differences between congruency conditions
- compute statistically significance of the results via permutation tests
- obtain significant spatial-temporal clusters contrasting ERP between four congruency conditions
draw_plots.py
- visualise z-score estimation
- make topographical plots of significantly different clusters

The results demonstrated the presence of significant topically organized neurolinguistically plausible differences in the EEG data between incongruity conditions.

The preprocessed and epoched EEG data is available at https://huggingface.co/datasets/zhuravlevahana/SIGNAL/tree/main.

LLM probing

LLM probing data include experiments for the probing validation study (including supplementary tokenization effect study) and the algorithm of layer-wise condition contrasting based on ruBERT LLM activations. LLM probing allows for model inference and subsequent diagnostic classification study on datasets compatible with the one used in the study.

SIGNAL_SPREADSHEET should be replaced with the link to the spreadsheet containing the data of interest.

We applied Representational Similarity Analysis (RDM) to evaluate activation difference between 12 types of stimuli (three groups of sentences different by syntax structure each divided into four congruency conditions) detected by LLMs. As a result, we obtained layer-wise Representational Dissimilarity Matrices (RDMs) contrasting each pair of condition presented. The results show that the discrimination accuracy grows with a layer number, and the lates layers are significantly more sensible to sentence structure than to the congruency type.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SIGNAL

Repository structure

Dataset

Stimuli generation

EEG analysis

LLM probing

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
EEG_processing		EEG_processing
LLM_processing		LLM_processing
stimuli_generation		stimuli_generation
README.md		README.md
STIMULI.xlsx		STIMULI.xlsx
requirements.txt		requirements.txt

AIRI-Institute/SIGNAL

Folders and files

Latest commit

History

Repository files navigation

SIGNAL

Repository structure

Dataset

Stimuli generation

EEG analysis

LLM probing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages