This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[BREAKING CHANGE] New Top Language Detection Method #481
Comments
Not sure, but the current method seems right to me. Imagine having 2 HTML/JS repositories with more than 51% HTML in both. According to the newly proposed method, I wouldn't know JavaScript. So in my opinion, the current method makes more sense. |
@saurabhdaware We would not just take the primary language of each repo; we would also take the top 10 languages of each repo. That is what we are already doing:
So it would look like -: HTML ---------- some% |
I am not sure I understood. Wouldn't calculating the top 10 languages of each repository be the same as calculating how much code the user has in bytes? That's how GitHub calculates the percentage as well, no? |
No, we would just "count" them. In the current method we take each "language.size", reduce (sum) the sizes per language, and then sort the totals. |
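A minimal sketch of the current size-based approach described in the comment above, written in Python for brevity. The repo data and the function name `top_languages_by_size` are hypothetical; only the idea (sum each language's byte size across repos, then sort) comes from this thread.

```python
from collections import defaultdict

# Hypothetical data: each repo maps a language name to its size in bytes,
# standing in for the "language.size" values mentioned in the thread.
repos = [
    {"HTML": 5100, "JavaScript": 4900},
    {"HTML": 5200, "JavaScript": 4800},
]


def top_languages_by_size(repos, limit=10):
    """Sum byte sizes per language across all repos, then sort descending."""
    totals = defaultdict(int)
    for languages in repos:
        for name, size in languages.items():
            totals[name] += size
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


print(top_languages_by_size(repos))
```

Because sizes are summed, JavaScript still shows up with a large share here even though HTML is the bigger language in both repos.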
Oh ok so in this example
I would have |
Repo 1 - JS x1 & HTML x1 HTML - 50% We would just count them. |
Oh cool. Seems good to me then. |
I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of. |
If there are people who are happier with the current way, could there be an additional parameter that switches between the different calculation modes? |
Unfortunately we cannot do that. It would make the logic complex, and we would have two different statistics, which would hamper consistency. |
I will first publish an experimental query param to enable this, and then if people like it I will make it the default. |
I personally wouldn't use this new one; I think the current one is better.
You could exclude the language, but in my PR (#480) I am making it so you can just exclude a repo. Which is probs better. |
@Bas950 yeah, I can understand what you are saying, but the main reason so many people use github-readme-stats is its simplicity and ease of use. Of course we can add "exclude_repo" options and make it better, but not many people have the time or patience to go through all of their repositories, check which ones contain vendor code, and exclude them one by one. This is also impractical for users who have a lot of repos. That is why I'm considering this new approach, which would mitigate these issues. |
@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to? |
Consider that I have five repositories.
Now the repo-count approach should give my top language as HTML, right? |
I will be adding a feature in my PR soon, though, that will allow you to opt in to or opt out of using forks, depending on what @anuraghazra wants to use. |
Will this make it support other languages, or should I make a separate PR for this? Currently I can't see GDScript on my stats. |
Please check issue #450, where I've provided a GraphQL query (I don't remember whether I tested it or whether it works) that shows the most starred repos. An alternative might be to get the repos with the most commits (but there is no ordering by number of commits yet). Using the default 100 repos is stupid because the order can be random, and those repos can contain code that you didn't add a single commit to and that you got from somewhere else. Forks are not the only repos with forked code: if the code is not on GitHub you can't fork it, so you need to copy the code into your own repo. |
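The query from #450 is not reproduced in this thread, so the following is only a sketch of the idea: ask for a user's own, non-fork repositories ordered by star count instead of relying on the default order. The helper name `build_payload` is hypothetical, and the field and enum names (`ownerAffiliations`, `isFork`, `STARGAZERS`, `SIZE`) are taken from GitHub's public GraphQL schema as I understand it; check them in the GraphQL explorer before relying on them.

```python
import json

# GraphQL query text. Excluding forks and ordering by stars are assumptions
# made for this sketch, not the exact query posted in #450.
QUERY = """
query ($login: String!, $count: Int!) {
  user(login: $login) {
    repositories(
      first: $count
      ownerAffiliations: OWNER
      isFork: false
      orderBy: {field: STARGAZERS, direction: DESC}
    ) {
      nodes {
        name
        languages(first: 10, orderBy: {field: SIZE, direction: DESC}) {
          edges { size node { name } }
        }
      }
    }
  }
}
"""


def build_payload(login, count=10):
    """Build the JSON request body for a POST to the GraphQL endpoint."""
    body = {"query": QUERY, "variables": {"login": login, "count": count}}
    return json.dumps(body)


print(build_payload("octocat"))
```

The sketch only builds the request body; sending it requires an authenticated POST, which is left out here.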
I think the suggestion (in that ticket) to use
I just ran it through the graphql explorer, it works 👍
For such repos, I would suggest using the
Not stupid. Easy. The API won't let you get more in a single request, so multiple requests would need to be made. |
By stupid I mean the default 100 with sorting like this. Even 10 repos is better if the sorting is done right: most faved repos, repos with the most commits, or maybe most recent commits. Anything but the default order, which looks random with a fixed seed. |
Thank you for clarifying. 🙌
Yes! I completely agree with this! I was thinking about this some more and I think the main issue here is conflicting use-cases... The GH API gives language stats for Repositories, not for Users. It might be easier to add a separate card and/or split the use-case to support both sides? @anuraghazra If @jcubic and I were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well. It would be a shame to let all the hard work and thought that went into this issue come to nothing... |
@Potherca feel free to experiment with different ways to make it more accurate. I can surely take a look at them and give some feedback. |
Another possible way to count language stats is by using github's search api https://docs.github.com/en/rest/reference/search#search-code |
See this API:
https://codetabs.com/count-loc/count-loc-online.html
It could help in taking into account the total lines of code of a
specific repository and its language.
|
It will not work for the case where someone "forks" a repo not as a git fork but as a copy with a single commit. I have one repo like this, written in C and Lua, and it would credit me with those languages even though I've never written a single line in either. |
Is the proposed solution, just to count the languages which appear most in the result of the GitHub API response? |
No that sucks #481 (comment) |
Is it that bad though? I guess the problem here is that we are unsure what we want these numbers to represent. I wasn't expecting these numbers to represent the distribution of the total number of lines I wrote in each language. That would also be a bad estimate, as in C I write many more lines for a simple sum than in Python. That doesn't necessarily mean I do more C than Python. Again, it really depends on what one wants these estimates to measure. At least for me, doing multiple projects, it would be cool to have an estimate stating, repository-wise, what the most common language you have used is. Initially, this is actually what I thought these numbers represented. But there are scenarios where such a measure might be suboptimal, as mentioned above, if one assumes otherwise. I do not think there is an optimum here that suits all users. Perhaps it could be an option to support both designs, or even multiple? That would at least solve my issue, and thus make me happy :] But perhaps having more than one design that estimates these measures might introduce even more noise into how to interpret these values... Idk anymore |
I agree with @theLMGN. I have a fork on my account, which I haven't contributed to yet, but which apparently contains a shit ton of C#, which I have never used. This results in my "Most used languages" being roughly 80% C#, which is sort of funny considering I have contributed to roughly 40 open repos, of which are Python/C++. I'm also wondering if C# and C++ are switched, or if C++ code is misinterpreted as C# in github-readme-stats. According to github-readme-stats I do not do any C++, but I have contributed to C++ projects (forks). |
@theLMGN couldn't you just use the exclude_repo option to exclude that one repository? Since you only did a minor PR, including this repo in the calculation is probably not necessary? |
As you all might know there are various bugs/issues regarding the top languages calculation.
The problem
The main issue I see is that people often get confused by how the calculations are done.
Currently the top languages are calculated based on how much code in bytes you have in a particular language and then we choose the top languages.
This method is the main reason people are confused about the calculations, because normally users perceive how much they code in languages by how many repositories they have with that particular language.
Quirks with the current calculation method
The Solution
The most straightforward solution I see is that instead of calculating how much code they have, we can calculate how many repositories they have with each language.
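A minimal sketch of the proposed repository-count idea, written in Python for brevity. The repo data and the function name `top_languages_by_repo_count` are hypothetical, and whether only the primary language or every detected language of a repo should be counted is an open choice; this sketch counts every language a repo contains.

```python
from collections import Counter

# Hypothetical data: one very large C# repo next to several small Python repos.
# Each repo maps a language name to its size in bytes.
repos = [
    {"C#": 900_000},
    {"Python": 12_000, "Shell": 500},
    {"Python": 8_000},
    {"Python": 15_000, "HTML": 2_000},
]


def top_languages_by_repo_count(repos, limit=10):
    """Count how many repos each language appears in, ignoring byte sizes."""
    counts = Counter()
    for languages in repos:
        counts.update(languages.keys())
    return counts.most_common(limit)


print(top_languages_by_repo_count(repos))
```

With this data Python ranks first because it appears in three repos, even though the single C# repo holds far more bytes.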
Related issues
#432 #403 #270 #136 #358