JEMMA is an Extensible Java dataset for Many ML4Code Applications. It is primarily a dataset of Java code entities at multiple granularities, their properties, and representations. To help users interact and work with the data seamlessly, we have added Workbench capabilities to it as well.
This repository hosts the Workbench part of JEMMA. The raw data is hosted on Zenodo and can be downloaded at any time while using the Workbench. The following sections provide more details.
First steps: Install jemma locally
1. $ git clone https://github.com/giganticode/jemma.git
2. $ cd jemma/
3. $ pip install -r requirements.txt
4. $ pip install -e .
Next steps: Downloading all the datasets
Sign up on Zenodo.org and generate an API token [IMPORTANT!]
5. $ cd jemma/download/
6. $ nano config.ini (& replace the dummy `access_token` with your API key)
7. $ python3 download.py
8. $ python3 sanity_checks.py
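Before running `download.py`, it can help to confirm that the dummy token was actually replaced. The sketch below is not part of JEMMA: the `[zenodo]` section name and the `REPLACE_` placeholder prefix are assumptions made for illustration, so adjust both to match the real `config.ini`.

```python
import configparser

# Hypothetical config contents; the real config.ini in jemma/download/ may
# use a different section name or carry additional keys.
SAMPLE_CONFIG = """
[zenodo]
access_token = REPLACE_WITH_YOUR_TOKEN
"""


def read_access_token(config_text, section="zenodo"):
    """Return the access token, or None if it is missing or still a placeholder."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback="" also covers the case where the section itself is absent.
    token = parser.get(section, "access_token", fallback="").strip()
    if not token or token.startswith("REPLACE_"):
        return None
    return token


if __name__ == "__main__":
    if read_access_token(SAMPLE_CONFIG) is None:
        print("access_token is not set; edit config.ini first")
```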
Link to metadata | columns |
---|---|
projects | project_id, project_path, project_name |
packages | project_id, package_id, package_path, package_name |
classes | project_id, package_id, class_id, class_path, class_name |
methods | project_id, package_id, class_id, method_id, method_name, start_line, end_line |
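The shared id columns are what tie the metadata tables together: a method row carries the ids of its class and project, so any entity can be resolved upwards. The following is a minimal sketch of that idea on made-up rows. It uses plain dictionaries rather than the Workbench API, and `index_by` and `describe_method` are hypothetical helpers.

```python
# Toy rows shaped like the metadata tables above; every id and name is made up.
PROJECTS = [{"project_id": "p1", "project_path": "demo", "project_name": "demo"}]
CLASSES = [{"project_id": "p1", "package_id": "k1", "class_id": "c1",
            "class_path": "demo/src/Foo.java", "class_name": "Foo"}]
METHODS = [{"project_id": "p1", "package_id": "k1", "class_id": "c1",
            "method_id": "m1", "method_name": "bar",
            "start_line": 3, "end_line": 9}]


def index_by(rows, key):
    """Build a dict that maps each row's `key` value to the row itself."""
    return {row[key]: row for row in rows}


def describe_method(method_id):
    """Resolve a method to 'project/Class.method (lines a-b)', or None if unknown."""
    method = index_by(METHODS, "method_id").get(method_id)
    if method is None:
        return None
    # Follow the foreign keys stored on the method row.
    cls = index_by(CLASSES, "class_id")[method["class_id"]]
    project = index_by(PROJECTS, "project_id")[method["project_id"]]
    return "{}/{}.{} (lines {}-{})".format(
        project["project_name"], cls["class_name"], method["method_name"],
        method["start_line"], method["end_line"])


if __name__ == "__main__":
    print(describe_method("m1"))
```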
Representation Code | Representation Name | Link to dataset |
---|---|---|
TEXT | raw_source_code | https://doi.org/10.5281/zenodo.5813705 |
TKNA | code_tokens (spaced) | https://doi.org/10.5281/zenodo.5813717 |
TKNB | code_tokens (comma) | https://doi.org/10.5281/zenodo.5813730 |
C2VC | code2vec* | https://doi.org/10.5281/zenodo.5813993 |
C2SQ | code2seq* | https://doi.org/10.5281/zenodo.5814059 |
FTGR | feature_graph* | https://doi.org/10.5281/zenodo.5813933 |
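To give a feel for how the two token representations differ, here is a rough sketch that splits method text into tokens and joins them with spaces or commas. It reads "(spaced)" and "(comma)" as the token separator, which is an assumption, and the regex tokenizer is a deliberately naive stand-in rather than the tokenizer used to build the dataset.

```python
import re

# Naive Java-ish tokenizer, for illustration only.
TOKEN_PATTERN = re.compile(
    r'[A-Za-z_$][A-Za-z_$0-9]*'   # identifiers and keywords
    r'|\d+(?:\.\d+)?'             # integer or decimal literals
    r'|"(?:\\.|[^"\\])*"'         # double-quoted string literals
    r'|\S'                        # any other single non-space character
)


def tokenize(method_text):
    """Split method text into a flat list of token strings."""
    return TOKEN_PATTERN.findall(method_text)


def to_spaced(tokens):
    """Join tokens with single spaces (assumed TKNA-like layout)."""
    return " ".join(tokens)


def to_comma(tokens):
    """Join tokens with commas (assumed TKNB-like layout)."""
    return ",".join(tokens)


if __name__ == "__main__":
    tokens = tokenize("int add(int a, int b) { return a + b; }")
    print(to_spaced(tokens))
    print(to_comma(tokens))
```

Note that a comma-joined string is ambiguous whenever the code itself contains a comma token, so the published TKNB files may well escape or encode tokens differently.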
Link to callgraphs data | columns |
---|---|
Callgraphs | caller_project_id, caller_class_id, caller_method_id, call_direction, callee_project_id, callee_class_id, callee_method_id |
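Each call graph row is one caller-to-callee edge, so direct callers and callees of a method can be read off by grouping rows on either method id column. The sketch below shows this on made-up rows. It ignores `call_direction`, whose exact semantics are not described here, and `build_call_maps` is a hypothetical helper rather than a Workbench function.

```python
from collections import defaultdict

# Made-up call graph rows; only the two method id columns are used here.
ROWS = [
    {"caller_method_id": "m1", "callee_method_id": "m2"},
    {"caller_method_id": "m1", "callee_method_id": "m3"},
    {"caller_method_id": "m3", "callee_method_id": "m2"},
]


def build_call_maps(rows):
    """Return (callees_of, callers_of) dicts built from call graph rows."""
    callees_of = defaultdict(list)
    callers_of = defaultdict(list)
    for row in rows:
        caller = row["caller_method_id"]
        callee = row["callee_method_id"]
        callees_of[caller].append(callee)
        callers_of[callee].append(caller)
    # Convert to plain dicts so missing keys do not get created on lookup.
    return dict(callees_of), dict(callers_of)


if __name__ == "__main__":
    callees_of, callers_of = build_call_maps(ROWS)
    print(callees_of.get("m1", []))
    print(callers_of.get("m2", []))
```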
-
get_project_id
Returns the project_id of the project (queried by project name).
Parameters:
- project_name: (str) - name of the project
Returns:
- Returns a str uuid of the corresponding project (project_id)
- Returns None if no such project_id was found
- Returns None if multiple projects were found with the same name
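The name-based lookups return an id only when the match is unique, so a `None` result can mean either "not found" or "ambiguous". A minimal sketch of that contract on toy rows follows. `unique_lookup` is a hypothetical helper, not the JEMMA implementation.

```python
# Toy project rows; two projects deliberately share the same name.
PROJECT_ROWS = [
    {"project_id": "p1", "project_name": "utils", "project_path": "alice/utils"},
    {"project_id": "p2", "project_name": "utils", "project_path": "bob/utils"},
    {"project_id": "p3", "project_name": "parser", "project_path": "carol/parser"},
]


def unique_lookup(rows, key, value, id_column):
    """Return the id of the single row where row[key] == value, else None."""
    matches = [row[id_column] for row in rows if row[key] == value]
    # Zero matches and multiple matches are both reported as None.
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    print(unique_lookup(PROJECT_ROWS, "project_name", "parser", "project_id"))
    print(unique_lookup(PROJECT_ROWS, "project_name", "utils", "project_id"))
    print(unique_lookup(PROJECT_ROWS, "project_path", "bob/utils", "project_id"))
```

When a name is ambiguous, querying by path is the natural fallback, since paths are more specific than names.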
-
get_project_id_by_path
Returns the project id of the project (queried with project path).
Parameters:
- project_path: (str) - path of the project defined in jemma
Returns:
- Returns a str uuid of the corresponding project (project_id)
- Returns None if no such project_path was found
- Returns None if multiple projects were found with the same path
-
get_project_id_by_class_id
Returns the project id of the project (queried with class id)
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a str uuid of the corresponding project (project_id)
- Returns None if no such project_id was found
-
get_project_id_by_method_id
Returns the project id of the project (queried with method id)
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a str uuid of the corresponding project (project_id)
- Returns None if no such project_id was found
-
get_project_name
Returns the project name of the project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a str of the corresponding project name
- Returns None if no such project_id is defined in jemma
-
get_project_path
Returns the project path of the project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a str of the corresponding project path
- Returns None if no such project_id is defined in jemma
-
get_project_size_by_classes
Returns the size of a project, by the number of classes.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a str of the corresponding project size, by the number of classes
- Returns None if no such project_id is defined in jemma
-
get_project_size_by_methods
Returns the size of a project, by the number of methods.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a str of the corresponding project size, by the number of methods
- Returns None if no such project_id is defined in jemma
-
get_project_class_ids
Returns all class ids defined within the project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a (List[str]) corresponding to all class ids in the project
- Returns an empty List if no classes are found
-
get_project_method_ids
Returns all method ids defined within the project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a (List[str]) corresponding to all method ids in the project
- Returns an empty List if no methods are found
-
get_project_class_names
Returns all class names defined within the project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a (List[str]) corresponding to all class names in the project
- Returns an empty List if no classes are found
-
get_project_method_names
Returns all method names defined within the project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a (List[str]) corresponding to all method names in the project
- Returns an empty List if no methods are found
-
get_project_metadata
Returns all metadata related to a particular project.
Parameters:
- project_id: (str) - any project_id defined within jemma
Returns:
- Returns a dictionary of project metadata values
- Returns None if no such project_id is defined in jemma
-
get_class_id
Returns the class id of a class in project (queried by class name).
Parameters:
- project_id: (str) - project_id of a project
- class_name: (str) - class name of a class within the project
Returns:
- Returns a str uuid of the corresponding class (class_id)
- Returns None if no such project_id or class_name was found
- Returns None if multiple classes were found with the same name (use: get_class_id_by_path)
-
get_class_id_by_path
Returns the class id of a class (queried with class path).
Parameters:
- class_path: (str) - path of the class defined in jemma
Returns:
- Returns a str uuid of the corresponding class (class_id)
- Returns None if no such class_path was found
- Returns None if multiple classes were found with the same path
-
get_class_id_by_method_id
Returns the class id of a class (queried with method id)
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a str uuid of the corresponding class (class_id)
- Returns None if no such class_id was found
-
get_class_name
Returns the class name of a particular class.
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a str of the corresponding class name
- Returns None if no such class_id is defined in jemma
-
get_class_path
Returns the class path of a particular class.
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a str of the corresponding class path
- Returns None if no such class_id is defined in jemma
-
get_class_size_by_methods
Returns the size of a class, by the number of methods.
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a str of the corresponding class size, by the number of methods
- Returns None if no such class_id is defined in jemma
-
get_class_method_ids
Returns all method ids defined within a particular class.
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a (List[str]) corresponding to all method ids in the class
- Returns an empty List if no methods are found
-
get_class_method_names
Returns all method names within a particular class.
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a (List[str]) corresponding to all method names in the class
- Returns an empty List if no methods are found
-
get_class_metadata
Returns all metadata related to a particular class.
Parameters:
- class_id: (str) - any class_id defined within jemma
Returns:
- Returns a dictionary of class metadata values
- Returns None if no such class_id is defined in jemma
-
get_method_id
Returns the method id of a method in a class (queried by method name).
Parameters:
- class_id: (str) - any class_id defined within jemma
- method_name: (str) - method name of a method within the class
Returns:
- Returns a str uuid of the corresponding method (method_id)
- Returns None if no such class_id or method_name was found
- Returns None if multiple methods were found with the same name (use: get_method_id_by_stln_enln)
-
get_method_id_by_stln_enln
Returns the method id of a method in a class (queried by method name, start line, and end line).
Parameters:
- class_id: (str) - any class_id defined within jemma
- method_name: (str) - method name of a method within the class
- stln: (str) - start line of the method within the class
- enln: (str) - end line of the method within the class
Returns:
- Returns a str uuid of the corresponding method (method_id)
- Returns None if no such class_id or method_name was found
-
get_method_path
Returns the class path of the parent class of a method.
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a str of the corresponding class path
- Returns None if no such method_id is defined in jemma
-
get_start_line
Returns the start line of a particular method
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a str of the corresponding start line of the method
- Returns None if no such method_id is defined in jemma
-
get_end_line
Returns the end line of a particular method
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a str of the corresponding end line of the method
- Returns None if no such method_id is defined in jemma
-
get_method_metadata
Returns all metadata related to a particular method.
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a dictionary of method metadata values
- Returns None if no such method_id is defined in jemma
-
get_properties
Get property values for a list of methods.
Parameters:
- property : (str) - property code
- methods : (list[str]) - list of unique method ids
Returns:
- pandas DataFrame object (with method_id, property) of the passed list of methods
-
get_balanced_properties
Get balanced property values for a list of methods.
Parameters:
- property : (str) - property code
- methods : (list[str]) - list of unique method ids [OPTIONAL]
Returns:
- pandas DataFrame object (with method_id, property) of the passed list of methods
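How the balancing is done is not spelled out above. One common reading is random undersampling, where every label is cut down to the size of the rarest one. The sketch below illustrates that technique on plain dictionaries. It is an assumption about what "balanced" means here, and `undersample` is a hypothetical helper.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, seed=0):
    """Randomly drop rows so that every label value occurs equally often."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    if not groups:
        return []
    # Every label is reduced to the size of the rarest label.
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


if __name__ == "__main__":
    rows = [
        {"method_id": "m1", "label": True},
        {"method_id": "m2", "label": True},
        {"method_id": "m3", "label": True},
        {"method_id": "m4", "label": False},
    ]
    print(undersample(rows, "label"))
```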
-
get_representations
Get representation values of a list of methods.
Parameters:
- representation : (str) - representation code
- methods : (list[str]) - list of unique method ids
Returns:
- pandas DataFrame object (with method_id, representation) of the passed list of methods
-
get_callees
Get a list of method ids for direct callees of a particular method.
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a (List[str]) of method ids for direct callees
- Returns an empty List if no such method_id exists
-
get_callers
Get a list of method ids for direct callers of a particular method.
Parameters:
- method_id: (str) - any method_id defined within jemma
Returns:
- Returns a (List[str]) of method ids for direct callers
- Returns an empty List if no such method_id exists
-
get_caller_context
Get all caller method ids from n-hop neighborhood for a particular method.
Parameters:
- method_id: (str) - method_id for which callers are to be determined
- n_neighborhood: (int) - size of n-hop neighborhood callers that are to be considered
- df: (pandas DataFrame) - pandas DataFrame containing the caller-callee data for the project
Returns:
- Returns a (List[str]) of caller method ids
- Returns an empty List if no callers could be found for method_id
- Returns an empty List if n_neighborhood is 0
-
get_callee_context
Get all callee method ids from n-hop neighborhood for a particular method.
Parameters:
- method_id: (str) - method_id for which callees are to be determined
- n_neighborhood: (int) - size of n-hop neighborhood callees that are to be considered
- df: (pandas DataFrame) - pandas DataFrame containing the caller-callee data for the project
Returns:
- Returns a (List[str]) of callee method ids
- Returns an empty List if no callees could be found for method_id
- Returns an empty List if n_neighborhood is 0
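An n-hop neighborhood can be collected with a breadth-first walk over the call graph that stops expanding at depth n. The sketch below shows the caller direction on a plain list of `(caller, callee)` pairs instead of a DataFrame. `n_hop_callers` is a hypothetical helper, and details such as result order and cycle handling are assumptions and may differ in the Workbench.

```python
from collections import deque


def n_hop_callers(method_id, n_neighborhood, edges):
    """Collect callers reachable within n hops; edges are (caller, callee) pairs."""
    if n_neighborhood <= 0:
        return []
    callers_of = {}
    for caller, callee in edges:
        callers_of.setdefault(callee, []).append(caller)

    seen = {method_id}          # guards against cycles and duplicates
    found = []
    queue = deque([(method_id, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == n_neighborhood:
            continue            # do not expand beyond the requested radius
        for caller in callers_of.get(current, []):
            if caller not in seen:
                seen.add(caller)
                found.append(caller)
                queue.append((caller, depth + 1))
    return found


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b")]
    print(n_hop_callers("d", 2, edges))
```

The callee direction is the mirror image: index the edges by caller instead of callee and walk the same way.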
-
gen_TKNA_from_method_text
Processes the method text of a method and returns the TKNA representation.
Parameters:
- method_id: (str) - method_id for which TKNA representation is to be generated
- method_text: (str) - corresponding method_text for the method_id
Returns:
- Returns the TKNA representation of a method
-
gen_TKNB_from_method_text
Processes the method text of a method and returns the TKNB representation.
Parameters:
- method_id: (str) - method_id for which TKNB representation is to be generated
- method_text: (str) - corresponding method_text for the method_id
Returns:
- Returns the TKNB representation of a method
-
gen_C2VC_from_method_text
Processes the method text of a method and returns the C2VC representation.
Parameters:
- method_id: (str) - method_id for which C2VC representation is to be generated
- method_text: (str) - corresponding method_text for the method_id
Returns:
- Returns the C2VC representation of a method
-
gen_C2SQ_from_method_text
Processes the method text of a method and returns the C2SQ representation.
Parameters:
- method_id: (str) - method_id for which C2SQ representation is to be generated
- method_text: (str) - corresponding method_text for the method_id
Returns:
- Returns the C2SQ representation of a method
-
gen_FTGR_from_method_text
Processes the method text of a method and returns the FTGR representation.
Parameters:
- method_id: (str) - method_id for which FTGR representation is to be generated
- method_text: (str) - corresponding method_text for the method_id
Returns:
- Returns the FTGR representation of a method
-
gen_representation
Processes the method text of a method and returns the selected representation.
Parameters:
- representation: (str) - representation (code) which is to be generated
- method_id: (str) - method_id for which the representation is to be generated
- method_text: (str) - corresponding method_text for the method
Returns:
- Returns the selected representation for the method
-
run_models
Trains/finetunes a set of models for a given task and representation, from the specified data
Parameters:
- property: (str) - property (code) which is to be used
- representation: (str) - representation (code) which is to be used
- train_methods: (List[str]) - list of methods (method_ids) to be considered as training samples
- test_methods: (List[str]) - list of methods (method_ids) to be considered as test samples
- models: (List[str]) - List of models (huggingface paths or codes) to train and evaluate
Returns:
- None: Prints the evaluation scores for each model
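Since `train_methods` and `test_methods` are plain lists of method ids, the caller decides how to split them. One option, sketched below, is to hold out whole projects so that no project contributes to both lists. `split_by_project` is a hypothetical helper, and the project-level split is a suggestion rather than something the Workbench enforces.

```python
def split_by_project(method_to_project, test_projects):
    """Split method ids into (train, test) by holding out whole projects."""
    held_out = set(test_projects)
    train_methods, test_methods = [], []
    for method_id, project_id in method_to_project.items():
        if project_id in held_out:
            test_methods.append(method_id)
        else:
            train_methods.append(method_id)
    return train_methods, test_methods


if __name__ == "__main__":
    # Made-up mapping from method ids to project ids.
    mapping = {"m1": "p1", "m2": "p1", "m3": "p2", "m4": "p3"}
    train, test = split_by_project(mapping, ["p2"])
    print(train, test)
```

The mapping itself could be built by calling `get_project_id_by_method_id` for each method id.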
[COMING SOON]
To contribute new data to the JEMMA datasets, users must fork this repository and clone it locally. Once JEMMA is cloned locally, users can run the processing scripts on local projects, which will generate a set of CSV files (metadata, representations, properties, and call graphs) that make up the new data.
The freshly generated CSV files are to be included in the next commit. Users are advised to review the data before committing. Users can then push the changes to their fork of the JEMMA repository and submit a new pull request for the generated data files.
Once a pull request (data contribution) is submitted, the generated data will be validated for errors and inconsistencies, and then integrated into our original dataset if approved. The new dataset will subsequently be updated on Zenodo, which lets us host multiple versions.
Here's the step-by-step procedure for submitting a pull request to JEMMA:
- Fork the JEMMA repository
- Clone the JEMMA repository to your local workspace
- Create a new branch
- Make your changes (run the JEMMA processing scripts)
- Commit the changes (commit new files generated)
- Push the changes to your JEMMA fork
- Create a pull request on GitHub
This is the alpha release of JEMMA. We have tested it with several use cases. However, there might still be bugs in the implementation that we hope to iron out in the next few months.
If you encounter any bugs, please open a GitHub issue.