Data Schema

Moein Owhadi-Kareshk edited this page Dec 6, 2018 · 7 revisions

This page describes the data schema of our merge conflict dataset. As the figure below illustrates, the dataset stores data about repositories, merge scenarios, the commits in each scenario, merge replays with respect to different merging techniques, conflicting files and regions, code complexity, and code style violations.

[Figure: Data Schema]

Data Relations

Here we describe each of the tables separately.


Repository

The Repository table stores information about the repositories in the dataset. This information is retrieved through the GitHub API.

| Field | Data Type | Description |
| --- | --- | --- |
| id | INT | The repository id on GitHub |
| update_date | DATETIME | The time at which the repository information was retrieved from GitHub and inserted into the dataset |
| name | CHAR(100) | The repository name |
| description | CHAR(400) | The repository description |
| url | VARCHAR(120) | The repository URL in <USER_NAME>/<REPOSITORY_NAME> format |
| language | VARCHAR(20) | The main programming language of the repository |
| watch_num | INT | The number of repository watches |
| star_num | INT | The number of repository stars |
| fork_num | INT | The number of repository forks |
| issue_num | INT | The number of repository issues |
| size | BIGINT | The size of the repository |
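
Because the url field holds only the <USER_NAME>/<REPOSITORY_NAME> pair rather than a full web address, consumers usually need to split it. The following is a minimal sketch of that step; the names RepoRef, parse_repo_url, and github_web_url are hypothetical helpers introduced for illustration and are not part of the dataset tooling.

```python
from typing import NamedTuple


class RepoRef(NamedTuple):
    """Owner and repository name taken from the Repository.url field."""
    owner: str
    name: str


def parse_repo_url(url: str) -> RepoRef:
    """Split a '<USER_NAME>/<REPOSITORY_NAME>' value into its two parts."""
    owner, sep, name = url.strip().partition("/")
    # Reject values with no slash, an empty part, or more than one slash.
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected '<USER_NAME>/<REPOSITORY_NAME>', got {url!r}")
    return RepoRef(owner, name)


def github_web_url(url: str) -> str:
    """Build the repository's github.com address from the stored value."""
    ref = parse_repo_url(url)
    return f"https://github.com/{ref.owner}/{ref.name}"
```

For example, github_web_url("octocat/Hello-World") yields the address of that repository on github.com, while a value without a slash raises ValueError instead of silently producing a malformed link.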

Merge Scenario

The information about merge scenarios is stored in the Merge_Scenario table.

| Field | Data Type | Description |
| --- | --- | --- |
| merge_commit_hash | CHAR(40) | The SHA-1 of the merge commit |
| ancestor_hash | CHAR(40) | The SHA-1 of the ancestor |
| parent1_hash | CHAR(40) | The SHA-1 of parent 1 |
| parent2_hash | CHAR(40) | The SHA-1 of parent 2 |
| parallel_changed_num | INT | The number of changes that were made in both parents |
| merge_commit_can_compile | INT | Whether the merge commit compiles: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
| merge_commit_can_pass_test | INT | Whether the merge commit passes the tests: 1 if it passes, 0 if it fails, and -1 if the tests were not run |
| ancestor_can_compile | INT | Whether the ancestor compiles: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
| ancestor_can_pass_test | INT | Whether the ancestor passes the tests: 1 if it passes, 0 if it fails, and -1 if the tests were not run |
| parent1_can_compile | INT | Whether parent 1 compiles: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
| parent1_can_pass_test | INT | Whether parent 1 passes the tests: 1 if it passes, 0 if it fails, and -1 if the tests were not run |
| parent2_can_compile | INT | Whether parent 2 compiles: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
| parent2_can_pass_test | INT | Whether parent 2 passes the tests: 1 if it passes, 0 if it fails, and -1 if the tests were not run |
| merge_commit_date | DATETIME | The date of the merge commit |
| ancestor_date | DATETIME | The date of the ancestor |
| parent1_date | DATETIME | The date of merge parent 1 |
| parent2_date | DATETIME | The date of merge parent 2 |
| pull_request | INT | 1 if the merge is a pull request, 0 otherwise |
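
The compile and test columns use three values (1, 0, and -1), so treating them as plain booleans would count "not attempted" as a pass or a failure. The sketch below shows one way to decode them; decode_flag and build_broken_by_merge are hypothetical helper names, and the row is assumed to be a plain dict keyed by the column names above.

```python
from typing import Optional


def decode_flag(value: int) -> Optional[bool]:
    """Map 1 -> True, 0 -> False, and -1 -> None (not attempted)."""
    mapping = {1: True, 0: False, -1: None}
    if value not in mapping:
        raise ValueError(f"unexpected flag value: {value!r}")
    return mapping[value]


def build_broken_by_merge(row: dict) -> Optional[bool]:
    """Return True when both parents compile but the merge commit does not.

    Returns None when any of the three compilations was not attempted,
    because the question cannot be answered from the stored data.
    """
    merged = decode_flag(row["merge_commit_can_compile"])
    parent1 = decode_flag(row["parent1_can_compile"])
    parent2 = decode_flag(row["parent2_can_compile"])
    if merged is None or parent1 is None or parent2 is None:
        return None
    return parent1 and parent2 and not merged
```

Keeping None as a distinct outcome makes it explicit that rows with -1 are excluded from, rather than counted against, any compile or test statistics.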


Merge Related Commit

The information about all commits in both parents, from the ancestor to the merge commit, is stored in the Merge_Related_Commit table.

| Field | Data Type | Description |
| --- | --- | --- |
| commit_hash | CHAR(40) | The SHA-1 of the commit |
| date | DATETIME | The date of the commit |
| message | VARCHAR(40) | The commit message |
| branch | VARCHAR(45) | The branch name of the commit |
| merge_commit_parent | INT | The merge parent whose history contains the commit |
| file_added_num | INT | The number of added files |
| file_removed_num | INT | The number of removed files |
| file_renamed_num | INT | The number of renamed files |
| file_copied_num | INT | The number of copied files |
| file_modified_num | INT | The number of modified files |
| line_added_num | INT | The number of added lines |
| line_removed_num | INT | The number of removed lines |
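
The per-commit counts above can be aggregated per parent, for example to compare how much each side of a merge changed. The following is a small sketch under two assumptions: the rows are available as dicts keyed by the column names, and merge_commit_parent takes the values 1 and 2 (the page does not state the encoding). The function name churn_per_parent is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def churn_per_parent(commits: Iterable[dict]) -> Dict[int, int]:
    """Sum added plus removed lines for each value of merge_commit_parent."""
    totals: Dict[int, int] = defaultdict(int)
    for commit in commits:
        churn = commit["line_added_num"] + commit["line_removed_num"]
        totals[commit["merge_commit_parent"]] += churn
    return dict(totals)


# Made-up rows for illustration only; they are not taken from the dataset.
example_commits = [
    {"merge_commit_parent": 1, "line_added_num": 10, "line_removed_num": 2},
    {"merge_commit_parent": 1, "line_added_num": 3, "line_removed_num": 0},
    {"merge_commit_parent": 2, "line_added_num": 7, "line_removed_num": 7},
]
example_churn = churn_per_parent(example_commits)
```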


Merge Replay

We replay each merge scenario to extract its characteristics. Since our dataset can store the replay results of different merging techniques, we store the merge replay information in a separate table called Merge_Replay.

| Field | Data Type | Description |
| --- | --- | --- |
| merge_technique | VARCHAR(15) | The merging technique; the value default denotes git merge with the default configuration |
| is_conflict | INT | 1 if there is at least one conflict in the merge replay, 0 otherwise |
| can_compile | INT | Whether the replay compiles: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
| can_pass_test | INT | Whether the replay passes the tests: 1 if it passes, 0 if it fails, and -1 if the tests were not run |
| execution_time | FLOAT | The execution time of the merge replay in seconds |
| result_is_equal_to_replay | INT | 1 if the merge replay is equal to the merge commit, 0 otherwise |
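
Storing one row per replay and technique makes it possible to compare techniques with a simple aggregate. The sketch below computes the conflict rate per merging technique. It uses an in-memory SQLite database purely so that the example is self-contained; the engine, the helper names, the technique name "hypothetical", and all inserted rows are assumptions for illustration and not part of the dataset.

```python
import sqlite3
from typing import Dict, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory Merge_Replay table holding made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Merge_Replay (
            merge_technique VARCHAR(15),
            is_conflict INT,
            can_compile INT,
            can_pass_test INT,
            execution_time FLOAT,
            result_is_equal_to_replay INT
        )
        """
    )
    conn.executemany(
        "INSERT INTO Merge_Replay VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("default", 1, -1, -1, 0.25, 0),
            ("default", 0, 1, 1, 0.5, 1),
            ("hypothetical", 0, 1, 0, 1.0, 1),
        ],
    )
    return conn


def conflict_rate_by_technique(conn: sqlite3.Connection) -> Dict[str, Tuple[int, float]]:
    """Return {technique: (number of replays, fraction with a conflict)}."""
    rows = conn.execute(
        """
        SELECT merge_technique, COUNT(*), AVG(is_conflict)
        FROM Merge_Replay
        GROUP BY merge_technique
        ORDER BY merge_technique
        """
    ).fetchall()
    return {technique: (count, rate) for technique, count, rate in rows}
```

Because is_conflict is stored as 0 or 1, its average is directly the fraction of replays with at least one conflict. The same is not true of can_compile and can_pass_test, whose -1 rows would have to be filtered out first.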


Conflicting File

For each merge replay, the data about the files that have conflicts is stored in the Conflicting_File table.

| Field | Data Type | Description |
| --- | --- | --- |
| file_path_name | VARCHAR(1000) | The relative path of the conflicting file |
| conflict_type | INT | 1 if content conflict, 2 if rename conflict, and 3 if remove/edit conflict |
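
The integer codes of conflict_type are easier to work with when given names. The following is a minimal sketch that mirrors the three codes listed above; the enum ConflictType and the function describe_conflict are hypothetical names introduced here.

```python
from enum import IntEnum


class ConflictType(IntEnum):
    """Named equivalents of the conflict_type codes."""
    CONTENT = 1
    RENAME = 2
    REMOVE_EDIT = 3


def describe_conflict(file_path_name: str, conflict_type: int) -> str:
    """Render a short human-readable label for a Conflicting_File row."""
    kind = ConflictType(conflict_type)  # raises ValueError for unknown codes
    label = kind.name.lower().replace("_", "/")
    return f"{file_path_name}: {label} conflict"
```

With this sketch, describe_conflict("src/Main.java", 3) returns "src/Main.java: remove/edit conflict", and a code outside 1 to 3 raises ValueError rather than being mislabeled.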


Conflicting Region

The data about conflicting regions is stored in the Conflicting_Region table.

| Field | Data Type | Description |
| --- | --- | --- |
| parent1_start_line | INT | The start line of the conflicting region in parent 1 |
| parent1_length | INT | The length of the conflicting region in parent 1 |
| parent2_start_line | INT | The start line of the conflicting region in parent 2 |
| parent2_length | INT | The length of the conflicting region in parent 2 |
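
Each region is stored as a start line and a length, so the covered lines have to be derived. The sketch below does this under the assumption that the length counts lines beginning at the start line itself; the page does not state whether line numbers are 1-based or how zero-length regions are encoded, so treat the convention as hypothetical. The names region_lines and region_size_difference are also illustrative.

```python
def region_lines(start_line: int, length: int) -> range:
    """Return the line numbers covered by a conflicting region.

    Assumes the region spans `length` lines starting at `start_line`.
    """
    if length < 0:
        raise ValueError("length must not be negative")
    return range(start_line, start_line + length)


def region_size_difference(row: dict) -> int:
    """Return how many more lines parent 1's side has than parent 2's side."""
    return row["parent1_length"] - row["parent2_length"]
```

For instance, a region with start line 10 and length 3 covers lines 10, 11, and 12 under this assumption.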


Code Style Violation

The number of code style violations is stored in the Code-Style_Violation table.

| Field | Data Type | Description |
| --- | --- | --- |
| merge_commit_style_violation_num | INT | The number of code style violations in the merge commit |
| ancestor_style_violation_num | INT | The number of code style violations in the ancestor |
| parent1_style_violation_num | INT | The number of code style violations in parent 1 |
| parent2_style_violation_num | INT | The number of code style violations in parent 2 |


Code Complexity

The differences between the code complexity of the two parents are stored in the Code_Complexity table.

| Field | Data Type | Description |
| --- | --- | --- |
| merge_commit_hash | CHAR(40) | The SHA-1 of the merge commit |
| TODO_diff | INT | The difference in measure1 between the two parents |