Data Schema
This page describes the data schema of our merge conflict dataset. As the figure below illustrates, the dataset stores data on repositories, merge scenarios, the commits in each scenario, merge replays with respect to different merging techniques, conflicting files and regions, code complexity, and code style violations.
Here we describe each of the tables separately.
The Repository table stores the information of the repositories in the dataset. This information is retrieved with the GitHub API.
Field | Data Type | Description |
---|---|---|
id | INT | The repository id on GitHub |
update_date | DATETIME | The time at which the repository information was retrieved from GitHub and inserted into the dataset |
name | CHAR(100) | The repository name |
description | CHAR(400) | The repository description |
url | VARCHAR(120) | The repository URL in <USER_NAME>/<REPOSITORY_NAME> format |
language | VARCHAR(20) | The main programming language of the repository |
watch_num | INT | The number of repository watchers |
star_num | INT | The number of repository stars |
fork_num | INT | The number of repository forks |
issue_num | INT | The number of repository issues |
size | BIGINT | The size of the repository |
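
Since the url field holds a <USER_NAME>/<REPOSITORY_NAME> pair rather than a full web address, consumers typically need to split it before use. The following is a minimal sketch of that split; the names `RepoRef` and `parse_repo_url` are hypothetical helpers introduced here for illustration and are not part of the dataset tooling.

```python
from typing import NamedTuple


class RepoRef(NamedTuple):
    """Hypothetical container for the two parts of the url field."""
    owner: str
    name: str


def parse_repo_url(url: str) -> RepoRef:
    """Split a '<USER_NAME>/<REPOSITORY_NAME>' value into owner and name."""
    owner, sep, name = url.partition("/")
    # Reject values without exactly one separator or with an empty part.
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected '<USER_NAME>/<REPOSITORY_NAME>', got {url!r}")
    return RepoRef(owner, name)
```

The strict validation is a design choice: a value that does not match the documented format raises an error instead of being silently misread.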
The information of merge scenarios is stored in the Merge_Scenario table.
Field | Data Type | Description |
---|---|---|
merge_commit_hash | CHAR(40) | The SHA-1 of the merge commit |
ancestor_hash | CHAR(40) | The SHA-1 of the ancestor |
parent1_hash | CHAR(40) | The SHA-1 of parent 1 |
parent2_hash | CHAR(40) | The SHA-1 of parent 2 |
parallel_changed_num | INT | The number of changes made in both parents |
merge_commit_can_compile | INT | Whether the merge commit can compile: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
merge_commit_can_pass_test | INT | Whether the merge commit can pass the tests: 1 if it passes, 0 if it does not, and -1 if the tests were not run |
ancestor_can_compile | INT | Whether the ancestor can compile: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
ancestor_can_pass_test | INT | Whether the ancestor can pass the tests: 1 if it passes, 0 if it does not, and -1 if the tests were not run |
parent1_can_compile | INT | Whether parent 1 can compile: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
parent1_can_pass_test | INT | Whether parent 1 can pass the tests: 1 if it passes, 0 if it does not, and -1 if the tests were not run |
parent2_can_compile | INT | Whether parent 2 can compile: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
parent2_can_pass_test | INT | Whether parent 2 can pass the tests: 1 if it passes, 0 if it does not, and -1 if the tests were not run |
merge_commit_date | DATETIME | The date of the merge commit |
ancestor_date | DATETIME | The date of the ancestor |
parent1_date | DATETIME | The date of merge parent 1 |
parent2_date | DATETIME | The date of merge parent 2 |
pull_request | INT | 1 if the merge is a pull request, 0 otherwise |
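
The compile and test columns above use a three-valued integer encoding (1, 0, -1). As a minimal sketch of how a consumer might decode it, the hypothetical helper `decode_tristate` below maps the "not attempted" value to `None` so it cannot be mistaken for a failure; the sample row is made up for illustration.

```python
from typing import Optional


def decode_tristate(value: int) -> Optional[bool]:
    """Map 1 -> True, 0 -> False, and -1 -> None (not attempted)."""
    mapping = {1: True, 0: False, -1: None}
    if value not in mapping:
        raise ValueError(f"unexpected flag value: {value!r}")
    return mapping[value]


# Made-up row using two of the documented column names.
row = {"merge_commit_can_compile": 1, "merge_commit_can_pass_test": -1}
decoded = {column: decode_tristate(flag) for column, flag in row.items()}
```

Keeping -1 distinct from 0 matters when computing rates: rows where compilation or testing was never attempted should be excluded from the denominator, not counted as failures.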
The information of all commits in both parents from the ancestor to the merge commit is stored in the Merge_Related_Commit table.
Field | Data Type | Description |
---|---|---|
commit_hash | CHAR(40) | The SHA-1 of the commit |
date | DATETIME | The date of the commit |
message | VARCHAR(40) | The commit message |
branch | VARCHAR(45) | The branch name of the commit |
merge_commit_parent | INT | The merge parent that the commit belongs to |
file_added_num | INT | The number of added files |
file_removed_num | INT | The number of removed files |
file_renamed_num | INT | The number of renamed files |
file_copied_num | INT | The number of copied files |
file_modified_num | INT | The number of modified files |
line_added_num | INT | The number of added lines |
line_removed_num | INT | The number of removed lines |
We replay each merge scenario to extract its characteristics. Since our dataset can store the replay results of different merging techniques, we keep the merge replay information in a separate table, called Merge_Replay.
Field | Data Type | Description |
---|---|---|
merge_technique | VARCHAR(15) | The merging technique; the value default denotes git merge with its default configuration |
is_conflict | INT | 1 if there is at least one conflict in the merge replay, 0 otherwise |
can_compile | INT | Whether the replay can compile: 1 if it compiles, 0 if it does not, and -1 if compilation was not attempted |
can_pass_test | INT | Whether the replay can pass the tests: 1 if it passes, 0 if it does not, and -1 if the tests were not run |
execution_time | FLOAT | The execution time of the merge replay in seconds |
result_is_equal_to_replay | INT | 1 if the merge replay is equal to the merge commit, 0 otherwise |
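
Because is_conflict is a 0/1 flag, averaging it per merging technique gives the share of replays that conflicted. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in database; the engine, the SQLite column types, the technique name `hypothetical`, and all rows are assumptions made for the example, not data from the dataset.

```python
import sqlite3


def conflict_rate_by_technique(conn: sqlite3.Connection) -> dict:
    """Return the fraction of conflicting replays for each merge technique."""
    rows = conn.execute(
        "SELECT merge_technique, AVG(is_conflict) "
        "FROM Merge_Replay GROUP BY merge_technique"
    ).fetchall()
    return {technique: rate for technique, rate in rows}


# In-memory stand-in with the documented columns (types adapted to SQLite).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Merge_Replay ("
    "merge_technique TEXT, is_conflict INTEGER, can_compile INTEGER, "
    "can_pass_test INTEGER, execution_time REAL, "
    "result_is_equal_to_replay INTEGER)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO Merge_Replay VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("default", 1, -1, -1, 0.5, 0),
        ("default", 0, 1, 1, 0.25, 1),
        ("hypothetical", 0, 1, -1, 1.0, 1),
    ],
)
rates = conflict_rate_by_technique(conn)
```

With the made-up rows above, the `default` technique conflicts in one of its two replays and the `hypothetical` technique in none.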
For each merge replay, the data of files that have conflicts is stored in the Conflicting_File table.
Field | Data Type | Description |
---|---|---|
file_path_name | VARCHAR(1000) | The relative path of the conflicting file |
conflict_type | INT | 1 if content conflict, 2 if rename conflict, and 3 if remove/edit conflict |
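
The conflict_type codes can be given readable names on the consumer side. This is a minimal sketch using an `IntEnum`; the class name `ConflictType` and its member names are chosen here for illustration, while the numeric values follow the table above.

```python
from enum import IntEnum


class ConflictType(IntEnum):
    """Readable names for the documented conflict_type codes."""
    CONTENT = 1
    RENAME = 2
    REMOVE_EDIT = 3


# Converting a stored integer raises ValueError for undocumented codes.
kind = ConflictType(2)
```

An `IntEnum` still compares equal to the stored integers, so existing comparisons such as `conflict_type == 1` keep working.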
The data of conflicting regions is stored in the Conflicting_Region table.
Field | Data Type | Description |
---|---|---|
parent1_start_line | INT | The start line of the conflicting region in parent 1 |
parent1_length | INT | The length of the conflicting region in parent 1 |
parent2_start_line | INT | The start line of the conflicting region in parent 2 |
parent2_length | INT | The length of the conflicting region in parent 2 |
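
A start line and a length together describe a span of lines. The sketch below expands such a pair into the line numbers it covers, assuming that the length is counted in lines and that the start line is the first line of the region; the schema does not state either convention, and `region_lines` is a hypothetical helper.

```python
def region_lines(start_line: int, length: int) -> range:
    """Return the line numbers covered by a conflicting region.

    Assumes the length is in lines and the region begins at start_line.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return range(start_line, start_line + length)


# A region starting at line 10 with length 3 covers lines 10, 11 and 12.
lines = list(region_lines(10, 3))
```

A length of 0 yields an empty range, which is how this sketch represents a side that contributes no lines to the conflict.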
The number of code style violations is stored in the Code-Style_Violation table.
Field | Data Type | Description |
---|---|---|
merge_commit_style_violation_num | INT | The number of code style violations in the merge commit |
ancestor_style_violation_num | INT | The number of code style violations in the ancestor |
parent1_style_violation_num | INT | The number of code style violations in parent 1 |
parent2_style_violation_num | INT | The number of code style violations in parent 2 |
The differences between the code complexity of the two parents are stored in the Code_Complexity table.
Field | Data Type | Description |
---|---|---|
merge_commit_hash | CHAR(40) | The SHA-1 of the merge commit |
TODO_diff | INT | The difference in measure1 between the two parents |