The lack of an automated way to convert a codebase into documentation costs time, hurts accuracy, and makes code harder to understand. Developers often neglect documentation, especially on fast-moving teams, and this leads to severe technical debt. Because writing technical documentation is hard and existing tools are limited or expensive, there is a need for comprehensive automatic documentation generation.
Our prototype turns a full codebase into comprehensive developer documentation in a single step: upload a zip file containing the codebase, and the documentation is generated for you. The resulting documentation includes function explanations, API specs, table schemas, and dependencies, all in Markdown format.
Documentation generation is powered by GPT-3.5. This language model lets us produce accurate and contextually relevant documentation for the given codebase.
- Codebase Traversal: The process begins by walking the codebase as a directory tree to read the contents of each file.
- Code Embeddings with CodeBERT: To extract meaningful information from the code, we use Microsoft's CodeBERT to compute code embeddings. However, CodeBERT cannot effectively handle large code files.
- Handling Large Code Files: To work around this limitation, we devised our own algorithm that tokenizes a file in a sliding-window manner. Given a window size and an overlap region, consecutive windows share tokens so that essential context is maintained, and the file's embedding is the average of the embeddings produced for each window.
- Maintaining Context with Agglomerative Clustering: To preserve context across the codebase, we use Agglomerative Clustering. This technique groups similar code files that share semantic meaning and features, which improves the quality of the generated documentation. We chose this type of clustering to exploit the hierarchical relations among the clusters it forms.
- Efficient Documentation Generation: After clustering, we concatenate the code files that belong to the same cluster. The concatenated code is then sent to GPT-3.5 with a carefully engineered prompt. The generated documentation provides comprehensive insights into the codebase.
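The windowing, averaging, and per-cluster concatenation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names, the default `window_size` and `overlap` values, and the `# File:` header are all assumptions, and `embed_window` is a hypothetical stand-in for whatever callable runs one window of tokens through CodeBERT.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def sliding_windows(tokens: Sequence[int], window_size: int, overlap: int) -> List[List[int]]:
    """Split a token sequence into windows that share `overlap` tokens."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    if len(tokens) == 0:
        return []
    stride = window_size - overlap  # how far each window advances
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):  # this window reached the end
            break
        start += stride
    return windows


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot average zero vectors")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_long_file(
    tokens: Sequence[int],
    embed_window: Callable[[List[int]], Vector],  # hypothetical: one model call per window
    window_size: int = 512,  # illustrative default, not taken from the project
    overlap: int = 64,       # illustrative default, not taken from the project
) -> Vector:
    """Embed each window separately, then average into one file-level vector."""
    windows = sliding_windows(tokens, window_size, overlap)
    return mean_pool([embed_window(window) for window in windows])


def concatenate_by_cluster(sources: Dict[str, str], labels: Dict[str, int]) -> Dict[int, str]:
    """Join the source text of all files that share a cluster label."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for path in sorted(sources):  # sorted so the output order is deterministic
        # The "# File:" header is an assumed convention for marking file boundaries.
        grouped[labels[path]].append(f"# File: {path}\n{sources[path]}")
    return {label: "\n\n".join(parts) for label, parts in grouped.items()}
```

In this sketch the cluster labels are simply passed in. In practice they would come from running an agglomerative clustering implementation (for example scikit-learn's `AgglomerativeClustering`) over the file-level embeddings; which library the prototype actually uses is not stated here.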
We also use the LLM to refactor code: a detailed prompt asks it to rewrite a given code block into cleaner, more efficient, and structurally sound code. We keep the prompt itself clean and take various code analytics into account to get the best output.
We also provide a way to generate tests for a specific code block. This is an integral part of the developer experience and reduces the time spent thinking about testing. Here too, we rely on carefully designed prompts to produce thorough tests.
List of technologies used to build the prototype:
- Frontend: Next.js
- Backend: FastAPI
- Check out the Server Setup guide here.
- Client Frontend Setup:
  - Install Node.js dependencies: `npm install`
  - Run the development server for the frontend: `npm run dev`
- Documentation of ComicifyAI:
  - Input repo: https://github.com/ayush4345/Comicify.ai
  - Output Docs:
- Documentation of Cluboard:
  - Input repo: https://github.com/mittal-parth/Cluboard/
  - Output Docs:
- Read our Contribution Guidelines to get started here