Modeling Developer Expertise with Zipf's law

  • ZipfModeling is an expertise model to identify expert developers based on how they write code.
  • "Tokenizer" tokenizes Python source code into its AST nodes and then collects the syntax patterns.
  • "ZipfModeling_syndt" shows the probability distribution of the Zipf parameter alpha after fitting on synthetic data.
  • "Model_train_realdt" and "Zipf_logLikelihood" collect syntax patterns from real projects fetched from GitHub and fit each developer's distribution of syntax patterns with Zipf's law.
  • "Zipf-Validity" explores threats to the validity of fitting the data with the Zipf distribution.
  • "sourcecodeAnalyzer" shows, with a few examples, how to convert code into an AST and then collect its nodes as syntax patterns.
  • We generate a labeled synthetic dataset by resampling real data. This dataset contains 1200 developers in two categories, "Expert" and "Novice". With this data you can explore the validity of expertise models that are based on the content of source code.
  • "Dataset" contains the raw real dataset extracted from GitHub repositories. The dataset includes the following features:
    . 'commit_ID', 'Author', 'Authored_Date', 'email', 'msg', 'Commiter', 'committer_date', 'project_path', 'Commit_before', 'Commit_after', 'diff', 'Added_LOC', 'Removed_LOC', 'Num_LOC'
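
The two core steps described above (collecting syntax patterns from the AST, then fitting their frequencies with Zipf's law) can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: it assumes that a "syntax pattern" is simply an AST node type name, that the Zipf law is truncated to the observed ranks, and that alpha is estimated by maximum likelihood with a plain ternary search. The function names `collect_syntax_patterns`, `zipf_log_likelihood`, and `fit_zipf_alpha` are hypothetical and only used here.

```python
import ast
import math
from collections import Counter


def collect_syntax_patterns(source):
    """Count AST node type names in Python source (assumed pattern definition)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def zipf_log_likelihood(ranked_counts, alpha):
    """Log-likelihood of rank-frequency counts under a truncated Zipf law.

    ranked_counts[r - 1] is the frequency of the pattern with rank r, and
    P(r) = r**(-alpha) / sum_{k=1..N} k**(-alpha), with N = number of ranks.
    """
    n_ranks = len(ranked_counts)
    log_norm = math.log(sum(k ** -alpha for k in range(1, n_ranks + 1)))
    return sum(c * (-alpha * math.log(r) - log_norm)
               for r, c in enumerate(ranked_counts, start=1))


def fit_zipf_alpha(counts, lo=0.0, hi=10.0, iters=100):
    """Estimate alpha by ternary search; the log-likelihood is concave in alpha."""
    ranked = sorted(counts, reverse=True)  # rank 1 = most frequent pattern
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if zipf_log_likelihood(ranked, m1) < zipf_log_likelihood(ranked, m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


if __name__ == "__main__":
    sample = (
        "def mean(values):\n"
        "    total = 0\n"
        "    for v in values:\n"
        "        total += v\n"
        "    return total / len(values)\n"
    )
    patterns = collect_syntax_patterns(sample)
    print(patterns.most_common(5))
    print("alpha estimate:", fit_zipf_alpha(list(patterns.values())))
```

In a per-developer setting, one would presumably aggregate the pattern counts over all code attributed to a developer before fitting, so that each developer gets a single alpha estimate.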

Citation

Assessing developer expertise from the statistical distribution of programming syntax patterns

@inproceedings{,
  title={Assessing developer expertise from the statistical distribution of programming syntax patterns},
  author={Moradi Dakhel, Arghavan and Desmarais, Michel C. and Khomh, Foutse},
  booktitle={Evaluation and Assessment in Software Engineering},
  pages={90--99},
  year={2021}
}