
ML Conversational Analytic Tool

The ML Conversational Analytic Tool is a proof of concept (POC) machine learning framework to automatically assess pull request comments and reviews for constructive and inclusive communication.

This repo contains experimental code for discussion and collaboration and is not ready for production use.

Motivation

Constructive and inclusive communication ensures a productive and healthy working environment in open source communities. In open source, communication happens in many forms, including pull requests, whose text-based conversations are crucial to collaboration. The ML Conversational Analytic Tool identifies constructive and inclusive pull request conversations to foster a healthier open source community.

Overview

  1. Motivation
  2. Overview
  3. Build and Run
    1. Environment Setup
    2. Build Dataset
    3. Train models
  4. Documentation
  5. Contributing
  6. License

Build and Run

Environment Setup

Prerequisites

  • Python 3.6+

Installation

A virtualenv or a similar tool for creating an isolated Python environment is recommended for this project.

  1. Install virtualenv

    pip install virtualenv
  2. Set up ML Conversational Analytic Tool in a virtualenv

    python3 -m venv virtualenv-ml-conversational
  3. Activate the virtualenv

    source ./virtualenv-ml-conversational/bin/activate
  4. Update pip

    pip install --upgrade pip
  5. Install the required Python libraries by running the command below

    pip install -r requirements.txt

The libraries used in the project are listed in requirements.txt.
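
After installing, you can optionally verify that the installed packages have no conflicting dependencies by running pip's built-in consistency check

python -m pip check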

Build Dataset

Extract Raw Data from GitHub

runDataExtraction.py extracts raw data from GitHub based on parameters passed in by the user. To successfully run the script, a GitHub access token is required and must be set as an environment variable.

Note: There is a rate limit associated with the GitHub API. Please read more about GitHub API Rate Limits before extracting data from a GitHub repo.

export GITACCESS=<YOUR_TOKEN>

Run the script by passing in the organization and repo

python runDataExtraction.py <organization> <repo>
  • organization is the name of the repository owner
  • repo is the name of the repository; use 'all' to extract all repositories owned by the organization
  • (optional) -reactions extracts comment and review reactions
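
For example, to extract data for a single repository, including comment and review reactions (the organization and repository names below are placeholders; substitute your own)

python runDataExtraction.py octocat Hello-World -reactions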

Annotate

featureVector.py prepares your data for annotation. Run the script by passing in the paths to rawdatafile and words.

python featureVector.py <rawdatafile> <words> -unannotated
  • rawdatafile is the location of the raw data CSV
  • words is the location of the file with lookup words (not needed for annotation purposes)
  • (optional) -unannotated generates data for annotation
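
For example, to prepare an extracted dataset for annotation (the file names below are placeholders)

python featureVector.py rawData.csv words.txt -unannotated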

To annotate the extracted raw data, we recommend using Data Annotator For Machine Learning. The quality of the data, and of the resulting model, depends heavily on following annotation best practices.

Train models

After both raw and annotated datasets are available, models can be trained to predict Constructiveness and Inclusiveness.

There are two models available for training

  • BaseCNN
  • BaseLSTM

To train, run the script with the required parameters annotated_filename, dataset_filename, model, and outcome.

python run.py <annotated_filename> <dataset_filename> <model> <outcome>
  • annotated_filename is the location of the annotated dataset file
  • dataset_filename is the location of the raw data
  • model is the type of model and can be 'LSTM' or 'CNN'
  • outcome can be 'Constructive', 'Inclusive' or 'Both'
  • (optional) -roleRelevant indicates that the generated encoding should be a stacked matrix representing user roles in the conversation. If it is not set, a single matrix representing each comment/review without the role is generated.
  • (optional) -pad indicates that the number of comments/reviews should be padded to a constant value. This flag must be set for CNN and must not be set for LSTM.
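
For example, to train a CNN on both outcomes (the file names below are placeholders; -pad is required for CNN)

python run.py annotatedData.csv rawData.csv CNN Both -pad

or to train an LSTM on Constructiveness only (without -pad)

python run.py annotatedData.csv rawData.csv LSTM Constructive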

Both BaseCNN and BaseLSTM also provide prediction explanation mechanisms, accessible through the .explain(obs) method on each class.

If you have ideas on how to improve the framework to assess text conversation for constructive and inclusive communication, we welcome your contributions!

Documentation

Auto-generated API documentation can be found in the docs/ml-conversational-analytic-tool directory.

Run the following command to update the API documentation

PYTHONPATH=./ml-conversational-analytic-tool pdoc --html --output-dir docs ml-conversational-analytic-tool

Contributing

The ml-conversational-analytic-tool project team welcomes contributions from the community. If you wish to contribute code and you have not signed our contributor license agreement, our bot will update the issue when you open a Pull Request. For any questions about the CLA process, please refer to our FAQ. For more detailed information, refer to CONTRIBUTING.md.

Please remember to read our Code of Conduct and keep it in mind during your collaboration.

License

Apache License v2.0: see LICENSE for details.
