Data Schema

This page describes the data schema of our merge conflict dataset. As the figure below illustrates, the dataset stores data on repositories, merge scenarios, the commits in each scenario, merge replays with respect to different merging techniques, conflicting files and regions, code complexity, and code style violations.

Data Schema

Data Relations

Here we describe each of the tables separately.

Repository

The Repository table stores information about the repositories in the dataset. This information is retrieved through the GitHub API.

Field	Data Type	Description
`id`	`INT`	The repository id on GitHub
`update_date`	`DATETIME`	The time at which the repository information was retrieved from GitHub and inserted into the dataset
`name`	`CHAR(100)`	The repository name
`description`	`CHAR(400)`	The repository description
`url`	`VARCHAR(120)`	The repository URL in <USER_NAME>/<REPOSITORY_NAME> format
`language`	`VARCHAR(20)`	The main programming language of the repository
`watch_num`	`INT`	The number of repository watches
`star_num`	`INT`	The number of repository stars
`fork_num`	`INT`	The number of repository forks
`issue_num`	`INT`	The number of repository issues
`size`	`BIGINT`	The size of the repository
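
As a rough illustration of how this table could be materialised and queried, here is a minimal sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in, since this page does not name the database engine; the primary key on `id` and the sample row are assumptions made for the example.

```python
import sqlite3

# Column names and types follow the table above. SQLite accepts these
# MySQL-style type names through its type-affinity rules.
REPOSITORY_DDL = """
CREATE TABLE Repository (
    id          INT PRIMARY KEY,  -- assumption: the GitHub id is the key
    update_date DATETIME,
    name        CHAR(100),
    description CHAR(400),
    url         VARCHAR(120),
    language    VARCHAR(20),
    watch_num   INT,
    star_num    INT,
    fork_num    INT,
    issue_num   INT,
    size        BIGINT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(REPOSITORY_DDL)

# Hypothetical row, for illustration only.
conn.execute(
    "INSERT INTO Repository (id, update_date, name, url, language, star_num) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    (1, "2018-01-01 00:00:00", "example", "some-user/example", "Java", 42),
)

# Repositories with at least 10 stars, most starred first.
popular = conn.execute(
    "SELECT url FROM Repository WHERE star_num >= ? ORDER BY star_num DESC",
    (10,),
).fetchall()

# Column names as SQLite recorded them.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Repository)")]
```

Columns left out of the `INSERT` are stored as `NULL`, which is convenient when only part of the GitHub metadata is available.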

Merge Scenario

The information about merge scenarios is stored in the Merge_Scenario table.

Field	Data Type	Description
`merge_commit_hash`	`CHAR(40)`	The SHA-1 of the merge commit
`ancestor_hash`	`CHAR(40)`	The SHA-1 of the ancestor
`parent1_hash`	`CHAR(40)`	The SHA-1 of parent 1
`parent2_hash`	`CHAR(40)`	The SHA-1 of parent 2
`parallel_changed_num`	`INT`	The number of changes that were made in both parents
`merge_commit_can_compile`	`INT`	Whether the merge commit can compile: 1 if it can, 0 if it cannot, and -1 if compilation was not attempted
`merge_commit_can_pass_test`	`INT`	Whether the merge commit can pass the tests: 1 if it can, 0 if it cannot, and -1 if the tests were not run
`ancestor_can_compile`	`INT`	Whether the ancestor can compile: 1 if it can, 0 if it cannot, and -1 if compilation was not attempted
`ancestor_can_pass_test`	`INT`	Whether the ancestor can pass the tests: 1 if it can, 0 if it cannot, and -1 if the tests were not run
`parent1_can_compile`	`INT`	Whether parent 1 can compile: 1 if it can, 0 if it cannot, and -1 if compilation was not attempted
`parent1_can_pass_test`	`INT`	Whether parent 1 can pass the tests: 1 if it can, 0 if it cannot, and -1 if the tests were not run
`parent2_can_compile`	`INT`	Whether parent 2 can compile: 1 if it can, 0 if it cannot, and -1 if compilation was not attempted
`parent2_can_pass_test`	`INT`	Whether parent 2 can pass the tests: 1 if it can, 0 if it cannot, and -1 if the tests were not run
`merge_commit_date`	`DATETIME`	The date of the merge commit
`ancestor_date`	`DATETIME`	The date of the ancestor
`parent1_date`	`DATETIME`	The date of merge parent 1
`parent2_date`	`DATETIME`	The date of merge parent 2
`pull_request`	`INT`	1 if the merge is a pull request, 0 otherwise
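
The compile and test columns share a three-valued encoding (1, 0, -1). The sketch below shows one way to decode it in Python so that "not attempted" is not mistaken for a failure. The function names `decode_tristate` and `merge_broke_build` are hypothetical helpers introduced for this example, not part of the dataset tooling.

```python
from typing import Optional


def decode_tristate(value: int) -> Optional[bool]:
    """Map the 1 / 0 / -1 encoding to True / False / None (not attempted)."""
    if value == 1:
        return True
    if value == 0:
        return False
    if value == -1:
        return None
    raise ValueError(f"unexpected flag value: {value!r}")


def merge_broke_build(parent1: int, parent2: int, merge: int) -> Optional[bool]:
    """True if both parents compile but the merge commit does not.

    Returns None when any of the three compilations was not attempted,
    because the question cannot be answered from the stored data.
    """
    p1, p2, m = (decode_tristate(v) for v in (parent1, parent2, merge))
    if p1 is None or p2 is None or m is None:
        return None
    return p1 and p2 and not m
```

For example, `merge_broke_build(1, 1, 0)` reports a build broken by the merge, while `merge_broke_build(1, -1, 0)` yields `None` because parent 2 was never compiled.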

Merge_Related_Commits

The information about all commits in both parents, from the ancestor to the merge commit, is stored in the Merge_Related_Commit table.

Field	Data Type	Description
`commit_hash`	`CHAR(40)`	The SHA-1 of the commit
`date`	`DATETIME`	The date of the commit
`message`	`VARCHAR(40)`	The commit message
`branch`	`VARCHAR(45)`	The branch name of the commit
`merge_commit_parent`	`INT`	The merge parent that the commit belongs to
`file_added_num`	`INT`	The number of added files
`file_removed_num`	`INT`	The number of removed files
`file_renamed_num`	`INT`	The number of renamed files
`file_copied_num`	`INT`	The number of copied files
`file_modified_num`	`INT`	The number of modified files
`line_added_num`	`INT`	The number of added lines
`line_removed_num`	`INT`	The number of removed lines

Merge_Replay

We replay each merge scenario to extract its characteristics. Since our dataset can store the replay results of different merging techniques, we store the merge replay information in a separate table, called Merge_Replay.

Field	Data Type	Description
`merge_technique`	`VARCHAR(15)`	The merging technique; the value default denotes `git merge` with its default configuration
`is_conflict`	`INT`	1 if there is at least one conflict in the merge replay, 0 otherwise
`can_compile`	`INT`	Whether the replay can compile: 1 if it can, 0 if it cannot, and -1 if compilation was not attempted
`can_pass_test`	`INT`	Whether the replay can pass the tests: 1 if it can, 0 if it cannot, and -1 if the tests were not run
`execution_time`	`FLOAT`	The execution time of the merge replay in seconds
`result_is_equal_to_replay`	`INT`	1 if the merge replay is equal to the merge commit, 0 otherwise
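
Because replays are stored per merging technique, techniques can be compared with a simple aggregate. The following is a minimal sketch with Python's `sqlite3` module (SQLite as a stand-in engine). It uses only the columns listed above, the rows are hypothetical, and `other_tool` is a made-up technique name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns listed above; any key columns linking a replay to its
# merge scenario are omitted because this page does not list them.
conn.execute(
    """
    CREATE TABLE Merge_Replay (
        merge_technique           VARCHAR(15),
        is_conflict               INT,
        can_compile               INT,
        can_pass_test             INT,
        execution_time            FLOAT,
        result_is_equal_to_replay INT
    )
    """
)

# Hypothetical rows; "other_tool" is a made-up technique name.
rows = [
    ("default", 1, -1, -1, 0.12, 0),
    ("default", 0, 1, 1, 0.08, 1),
    ("other_tool", 0, 1, 0, 0.30, 1),
]
conn.executemany("INSERT INTO Merge_Replay VALUES (?, ?, ?, ?, ?, ?)", rows)


def conflict_rate_by_technique(conn: sqlite3.Connection) -> list:
    """Return (technique, replay count, fraction of conflicting replays)."""
    return conn.execute(
        """
        SELECT merge_technique, COUNT(*), AVG(is_conflict)
        FROM Merge_Replay
        GROUP BY merge_technique
        ORDER BY merge_technique
        """
    ).fetchall()


rates = conflict_rate_by_technique(conn)
```

Since `is_conflict` is stored as 0 or 1, its average per group is the conflict rate of that technique.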

Conflicting_File

For each merge replay, the data about files that have conflicts is stored in the Conflicting_File table.

Field	Data Type	Description
`file_path_name`	`VARCHAR(1000)`	The relative path of the conflicting file
`conflict_type`	`INT`	1 if content conflict, 2 if rename conflict, and 3 if remove/edit conflict
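
When reading this table from code, the integer codes of `conflict_type` can be given names. The sketch below uses Python's `IntEnum`; the class name `ConflictType` and the sample values are assumptions for illustration, while the codes 1 to 3 come from the table above.

```python
from collections import Counter
from enum import IntEnum


class ConflictType(IntEnum):
    """Names for the conflict_type codes listed above."""

    CONTENT = 1
    RENAME = 2
    REMOVE_EDIT = 3


# Hypothetical conflict_type values as they might come back from a query.
raw_types = [1, 1, 3, 2, 1]

# Count the conflicting files per named conflict type.
counts = Counter(ConflictType(value).name for value in raw_types)
```

`ConflictType(value)` raises `ValueError` for a code outside 1 to 3, which surfaces unexpected data early.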

Conflicting_Region

The data about conflicting regions is stored in the Conflicting_Region table.

Field	Data Type	Description
`parent1_start_line`	`INT`	The start line of the conflicting region in parent 1
`parent1_length`	`INT`	The length of the conflicting region in parent 1
`parent2_start_line`	`INT`	The start line of the conflicting region in parent 2
`parent2_length`	`INT`	The length of the conflicting region in parent 2
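
A start line and a length together identify a block of lines in each parent's version of the file. The sketch below shows how such a pair could be turned back into the affected lines. It assumes that start lines are 1-based and that the length is counted in lines; this page does not state either convention, so check both against the data before relying on it. The helper name `extract_region` is hypothetical.

```python
from typing import List


def extract_region(file_lines: List[str], start_line: int, length: int) -> List[str]:
    """Return the lines of one side of a conflicting region.

    Assumes start_line is 1-based and length is a line count (both are
    assumptions, not documented conventions of the dataset).
    """
    if start_line < 1 or length < 0:
        raise ValueError("start_line must be >= 1 and length must be >= 0")
    begin = start_line - 1  # convert to a 0-based list index
    return file_lines[begin:begin + length]


# Hypothetical file content from parent 1.
parent1_file = ["a", "b", "c", "d"]
region = extract_region(parent1_file, start_line=2, length=2)
```

A length of 0 yields an empty list, which would correspond to a side that contributes no lines to the region.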

Code-Style_Violation

The number of code style violations is stored in the Code-Style_Violation table.

Field	Data Type	Description
`merge_commit_style_violation_num`	`INT`	The number of code style violations in the merge commit
`ancestor_style_violation_num`	`INT`	The number of code style violations in the ancestor
`parent1_style_violation_num`	`INT`	The number of code style violations in parent 1
`parent2_style_violation_num`	`INT`	The number of code style violations in parent 2

Code_Complexity

The differences between the code complexity of the two parents are stored in the Code_Complexity table.

Field	Data Type	Description
`merge_commit_hash`	`CHAR(40)`	The SHA-1 of the merge commit
`TODO_diff`	`INT`	The differences of measure1 in two parents
