This issue was moved to a discussion.

[BREAKING CHANGE] New Top Language Detection Method #481

Closed
anuraghazra opened this issue Sep 20, 2020 · 36 comments
Labels
help wanted: Extra attention is needed.
lang-card: Issues related to the language card.
stats-card: Features, enhancements, and fixes related to the stats card.

Comments

@anuraghazra
Owner

As you all might know, there are various bugs/issues regarding the top languages calculation.

The problem

The main issue I see is that people often get confused by how the calculations are done.

Currently, the top languages are calculated from how much code (in bytes) you have in each language, and then we pick the languages with the most bytes.
This method is the main reason people are confused about the calculations, because users normally judge how much they code in a language by how many of their repositories use it.
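
To make the byte-based behaviour concrete, here is a minimal Python sketch. The input shape (one dict per repo, mapping language name to size in bytes) and the name `top_languages_by_bytes` are assumptions for illustration, not the project's actual code.

```python
from collections import Counter


def top_languages_by_bytes(repos, limit=5):
    """Sum language sizes (in bytes) across repos and return percentage shares.

    `repos` is a hypothetical list of dicts mapping language name -> bytes.
    """
    totals = Counter()
    for languages in repos:
        totals.update(languages)  # adds the byte counts per language

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    # most_common() sorts by total bytes, largest first
    return [
        (name, round(100 * size / grand_total, 2))
        for name, size in totals.most_common(limit)
    ]


# Made-up sizes: one very large project outweighs two smaller repos.
repos = [
    {"C#": 900_000},
    {"Python": 40_000, "HTML": 10_000},
    {"Python": 50_000},
]
ranking = top_languages_by_bytes(repos)
print(ranking)
```

With these made-up sizes, the single large repo decides the ranking even though Python appears in more repositories.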

Quirks with the current calculation method

The Solution

The most straightforward solution I see is that instead of calculating how much code they have, we can count how many repositories they have in each language.

Related issues

#432 #403 #270 #136 #358

@anuraghazra added the help wanted, stats-card, and lang-card labels Sep 20, 2020
@anuraghazra pinned this issue Sep 20, 2020
@saurabhdaware

Not sure, but the current method seems right to me. Imagine having 2 HTML/JS repositories with more than 51% HTML in both.

According to the new proposed method, I wouldn't know JavaScript.

So in my opinion, the current method makes more sense.

@anuraghazra
Owner Author

anuraghazra commented Sep 20, 2020

@saurabhdaware We would not just take the primary language of each repo; we would also take the top 10 languages of each repo.

That is what we are already doing:

languages(first: 10, orderBy: {field: SIZE, direction: DESC}) {

So it would look like:

HTML ---------- some%
JavaScript ------ some%
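
For context, the `languages(...)` line quoted above sits inside a per-repository selection. The sketch below wraps it in a complete query and builds the JSON body for GitHub's GraphQL endpoint. Only the `languages(...)` line comes from this thread; the surrounding query shape is an assumption, and `build_payload` is a hypothetical helper.

```python
import json

# Assumed query shape; only the `languages(...)` line is quoted from the thread.
QUERY = """
query ($login: String!) {
  user(login: $login) {
    repositories(ownerAffiliations: OWNER, first: 100) {
      nodes {
        name
        languages(first: 10, orderBy: {field: SIZE, direction: DESC}) {
          edges {
            size
            node { name }
          }
        }
      }
    }
  }
}
"""


def build_payload(login):
    """Return the JSON body to POST to https://api.github.com/graphql."""
    return json.dumps({"query": QUERY, "variables": {"login": login}})


payload = build_payload("octocat")
```

Each `edges` entry carries a `size` (bytes) next to the language name, so the same response can feed either a byte sum or a simple count.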

@saurabhdaware

I am not sure if I understood. Wouldn't calculating the top 10 languages of each repository be the same as calculating how much code the user has in bytes? That's how GitHub calculates the percentage as well, no?

@anuraghazra
Owner Author

anuraghazra commented Sep 20, 2020

No, we would just "count" them. In the current method we take each "language.size", reduce (sum) the sizes per language, and then sort the totals.

@saurabhdaware

Oh ok so in this example

Imagine having 2 HTML/JS repositories with more than 51% HTML in both.

I would have
51% HTML
49% JavaScript
right?

@anuraghazra
Owner Author

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1

HTML - 50%
JS - 50%

We would just count them.
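
A minimal sketch of this counting, assuming each repo is represented as a dict of language name to bytes (the sizes and the name `top_languages_by_repo_count` are made up for illustration):

```python
from collections import Counter


def top_languages_by_repo_count(repos):
    """Count how many repos each language appears in, ignoring byte sizes."""
    counts = Counter()
    for languages in repos:
        counts.update(set(languages))  # each language counts once per repo

    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Two repos, each 51% HTML and 49% JavaScript by bytes (sizes are invented).
repos = [
    {"HTML": 5_100, "JavaScript": 4_900},
    {"HTML": 5_100, "JavaScript": 4_900},
]
shares = top_languages_by_repo_count(repos)
print(shares)
```

Because the byte sizes are ignored, the 51/49 split inside each repo has no effect: both languages end up with two appearances out of four.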

@saurabhdaware

Oh cool. Seems good to me then.

@DenverCoder1
Contributor

I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.

@DenverCoder1
Contributor

If there are people who are happier with the current way, there could possibly be an additional parameter that switches between the different calculation modes?

@anuraghazra
Owner Author

If there are people who are happier with the current way, there could possibly be an additional parameter that switches between the different calculation modes?

Unfortunately we cannot do that. It would make the logic complex, and we would have two different statistics, which would hamper consistency.

@anuraghazra
Owner Author

I will first publish an experimental query param to enable this, and then if people like it I will make it the default.

@Bas950
Contributor

Bas950 commented Sep 23, 2020

I personally wouldn't use this new one; I think the current one is better.

I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.

You could exclude the language, but in my PR (#480) I am making it so you can just exclude a repo. Which is probs better.

@anuraghazra
Owner Author

@Bas950 yeah, I can understand what you are saying, but the main reason why so many people use github-readme-stats is its simplicity and ease of use. Of course we can add "exclude_repo" options and make it better, but not many people have the time/patience to go through all of their repositories, check which ones have vendor code, and exclude them one by one. Not to mention that this is impractical for users who have a lot of repos.

So this is why I'm considering this new approach, which would mitigate these issues.
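
For comparison, the "exclude_repo" idea amounts to filtering repos by name before any aggregation. A rough sketch, assuming each repo is a dict with a `name` key; the helper `exclude_repos` and the case-insensitive matching are assumptions, not the behaviour of the PR mentioned above:

```python
def exclude_repos(repos, excluded_names):
    """Drop repos whose name is in `excluded_names` before any aggregation."""
    # Case-insensitive matching is an assumption made for this sketch.
    excluded = {name.lower() for name in excluded_names}
    return [repo for repo in repos if repo["name"].lower() not in excluded]


# Invented data: one own project and one repo full of vendored code.
repos = [
    {"name": "my-app", "languages": {"JavaScript": 20_000}},
    {"name": "vendored-lib", "languages": {"C": 900_000}},
]
kept = exclude_repos(repos, ["vendored-lib"])
print([repo["name"] for repo in kept])
```

The filter itself is trivial; the cost described above is that the user has to know and list every repo name by hand.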

@crazy-max

@mchelen-gov

Quirks with the current calculation method

@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to?

@ghost

ghost commented Sep 25, 2020

Consider I have five repositories.

  1. A 50k-line C++ machine learning library I wrote.
  2. html page with hello
  3. html page with hello
  4. html page with hello
  5. html page with hello

Now the repo approach should give my top language as HTML! Right??
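
The scenario above can be worked through in a few lines of Python. The byte sizes are invented (the thread only says "50k line" and "html page with hello"), so only the direction of the result matters, not the exact numbers.

```python
from collections import Counter

# Byte sizes are invented for illustration: one large C++ library and four
# tiny HTML pages.
repos = [{"C++": 1_500_000}] + [{"HTML": 100} for _ in range(4)]

by_bytes = Counter()
by_count = Counter()
for languages in repos:
    by_bytes.update(languages)         # sum sizes per language
    by_count.update(languages.keys())  # one vote per repo per language


def shares(counter):
    """Convert raw totals into percentage shares."""
    total = sum(counter.values())
    return {name: 100 * value / total for name, value in counter.items()}


print(shares(by_bytes))  # C++ dominates
print(shares(by_count))  # HTML dominates
```

So yes: under plain repo counting HTML would come out on top at 4 of 5 repos, while the byte-based method puts C++ at nearly 100%.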

@anuraghazra
Owner Author

Consider I have five repositories.

  1. A 50k-line C++ machine learning library I wrote.
  2. html page with hello
  3. html page with hello
  4. html page with hello
  5. html page with hello

Now the repo approach should give my top language as HTML! Right??

@Bas950
Contributor

Bas950 commented Sep 25, 2020

Quirks with the current calculation method

@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to?

I will be adding a feature in a PR soon, though, that will allow you to opt in to or opt out of using forks, depending on what @anuraghazra wants to use.

@casperiv0

This comment has been minimized.

@Vicellken

This comment has been minimized.

@benstigsen

Will this make it support other languages, or should I make a separate PR for this? Currently I can't see GDScript on my stats.

@ghost

This comment has been minimized.

@stale bot added the stale label Dec 6, 2020
@anuraghazra removed the stale label Dec 8, 2020
@jcubic

jcubic commented Jan 6, 2021

Please check issue #450, where I've provided a GraphQL query (I don't remember if it works or if I tested it) that shows the most starred repos. An alternative might be to get the repos with the most commits (but there is no ordering by number of commits yet).

Using the default 100 repos is stupid because the order can be random, and those repos can contain code that you never added a single commit to and got from somewhere else. Forks are not the only repos with forked code: if the code is not on GitHub you can't fork it, so you need to copy the code into your own repo.

@Potherca

Potherca commented Jan 30, 2021

Please check issue #450, where I've provided a GraphQL query

I think the suggestion (in that ticket) to use orderBy might be a good one. The next issue will, of course, be "Sort by what?", which will undoubtedly lead to "I want to sort by X, not Y, can you make it configurable?". But I think the basic premise is a good addition to resolving this rather sticky puzzle.

(I don't remember if it works or if I tested it)

I just ran it through the graphql explorer, it works 👍

those repos can contain code that you never added a single commit to and got from somewhere else. Forks are not the only repos with forked code: if the code is not on GitHub you can't fork it, so you need to copy the code into your own repo.

For such repos, I would suggest using the exclude_repo setting (or just creating a separate org and moving such repos there).
I don't think we should really expect such a level of intelligence from a project such as this. I think the KISS principle would apply here.

Using default 100 repos is stupid

Not stupid. Easy. The API won't let you get more in a single request, so multiple requests would need to be made.
Making more calls means more code, more work, potentially more issues, etc.
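
To give a sense of the extra code involved, here is a rough sketch of cursor-based pagination. GitHub's GraphQL connections expose `pageInfo { hasNextPage endCursor }` and accept an `after` argument; everything else here, including the `fetch_page` callable and the fake data source, is hypothetical.

```python
def fetch_all_repos(fetch_page, page_size=100):
    """Collect repos across pages using cursor-based pagination.

    `fetch_page(first, after)` is a hypothetical callable that performs one
    API request and returns (nodes, has_next_page, end_cursor).
    """
    repos, cursor = [], None
    while True:
        nodes, has_next_page, cursor = fetch_page(page_size, cursor)
        repos.extend(nodes)
        if not has_next_page:
            return repos


# Fake data source standing in for the API: 250 repos, served 100 at a time.
ALL_REPOS = [f"repo-{i}" for i in range(250)]
calls = []


def fake_fetch_page(first, after):
    start = 0 if after is None else after
    end = min(start + first, len(ALL_REPOS))
    calls.append(start)  # record each simulated request
    return ALL_REPOS[start:end], end < len(ALL_REPOS), end


all_repos = fetch_all_repos(fake_fetch_page)
print(len(all_repos), len(calls))
```

Even in this simplified form the loop adds state (the cursor) and one request per 100 repos, which is the trade-off described above.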

@jcubic

jcubic commented Jan 30, 2021

By stupid I mean the default 100 with sorting like this. Even 10 repos is better if the sorting is done right: most starred repos, repos with the most commits, or maybe the most recent commits. Anything but the default order, which looks random with a fixed seed.
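
As a sketch of what "sorting done right" could look like, the helper below builds a query that asks for a user's top repos by a chosen order field. The field names listed are values of GitHub's `RepositoryOrderField` enum (which, as noted earlier in the thread, has no commit-count option); the helper `build_repo_query` and its defaults are assumptions, not the query from issue #450.

```python
def build_repo_query(order_field="STARGAZERS", first=10):
    """Build a GraphQL query for the top `first` repos ordered by `order_field`."""
    allowed = {"CREATED_AT", "UPDATED_AT", "PUSHED_AT", "NAME", "STARGAZERS"}
    if order_field not in allowed:
        raise ValueError(f"unsupported order field: {order_field}")
    return f"""
query ($login: String!) {{
  user(login: $login) {{
    repositories(first: {first}, ownerAffiliations: OWNER,
                 orderBy: {{field: {order_field}, direction: DESC}}) {{
      nodes {{ name stargazerCount }}
    }}
  }}
}}
"""


query = build_repo_query()
print(query)
```

Switching to `PUSHED_AT` would approximate "most recently worked on" without any extra requests.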

@Potherca

Potherca commented Jan 31, 2021

Thank you for clarifying. 🙌

Even 10 repos is better if the sorting is done right

Yes! I completely agree with this!

I was thinking about this some more and I think the main issue here is conflicting use-cases... The GH API gives language stats for Repositories, not for Users. It might be easier to add a separate card and/or split the use-case to support both sides?

@anuraghazra If @jcubic and myself were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well.

It would be a shame to let all the hard work and thoughts that went into this issue come to nothing...

@anuraghazra
Owner Author

@anuraghazra If @jcubic and myself were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well.

@Potherca feel free to experiment with different ways to make it more accurate. I can surely take a look at them and give some feedback.

@anuraghazra
Owner Author

Another possible way to count language stats is by using GitHub's search API: https://docs.github.com/en/rest/reference/search#search-code

image

Repository owner deleted a comment from stale bot Jan 31, 2021
@ghost

ghost commented Jan 31, 2021 via email

@jcubic

jcubic commented Jan 31, 2021

It will not work for the case where someone's repo is not a git fork but a copy with a single commit. I have one repo like this, written in C and Lua, and it will give me those languages even though I've never written a single line in them.

@mushahidq
Contributor

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1

HTML - 50%
JS - 50%

We would just count them.

Is the proposed solution just to count the languages that appear most often in the GitHub API response?

@anuraghazra
Owner Author

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1
HTML - 50%
JS - 50%
We would just count them.

Is the proposed solution just to count the languages that appear most often in the GitHub API response?

No that sucks #481 (comment)

@andreped

andreped commented Mar 30, 2021

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1
HTML - 50%
JS - 50%
We would just count them.

Is the proposed solution just to count the languages that appear most often in the GitHub API response?

No that sucks #481 (comment)

Is it that bad though? I guess the problem here is that we are unsure what we want these numbers to represent. I wasn't expecting these numbers to represent the distribution of the total number of lines I wrote in each language. That would also be a bad estimate, as in C I write many more lines for a simple sum than in Python. That doesn't necessarily mean I do more C than Python. Again, it really depends on what one wants these estimates to measure.

At least for me, doing multiple projects, it would be cool to have an estimate of which language you have used most often, repository-wise. Initially, this is actually what I thought these numbers represented. But there are scenarios where such a measure might be suboptimal, as mentioned above, if one assumes otherwise.

I do not think there is an optimum here that suits all users. Perhaps it could be an option to support both designs, or even multiple? That would at least solve my issue, and thus make me happy :]

But perhaps having more than one design for estimating these measures might introduce even more noise into how to interpret these values... Idk anymore

@foxt

foxt commented Apr 13, 2021

It'd also be nice to be able to exclude forks.

My card currently shows 86% Python because I'm making a minor PR to a Python repo.
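
A small sketch of how excluding forks changes a byte-based result. `isFork` is a real field on GitHub's `Repository` type; the repo data, the sizes, and the `include_forks` switch are invented for illustration and chosen to mirror the 86% figure above.

```python
def shares_by_bytes(repos, include_forks=True):
    """Return each language's percentage of total bytes across the chosen repos."""
    totals = {}
    for repo in repos:
        if repo["isFork"] and not include_forks:
            continue  # skip forked repos entirely
        for language, size in repo["languages"].items():
            totals[language] = totals.get(language, 0) + size

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {name: 100 * size / grand_total for name, size in totals.items()}


# Sizes are invented: a large forked Python project next to a small own repo.
repos = [
    {"name": "upstream-fork", "isFork": True, "languages": {"Python": 8_600}},
    {"name": "own-site", "isFork": False, "languages": {"JavaScript": 1_400}},
]
with_forks = shares_by_bytes(repos)
without_forks = shares_by_bytes(repos, include_forks=False)
print(with_forks, without_forks)
```

With the fork included, Python takes most of the card; with forks skipped, only the user's own code is counted.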

@andreped

andreped commented Apr 13, 2021

I agree with @theLMGN.

I have a fork among my repos, which I haven't contributed to yet, but which apparently contains a shit ton of C#, which I have never used. This results in my "Most used languages" being roughly 80% C#, which is sort of funny considering I have contributed to roughly 40 open repos, which are Python/C++.

I'm also wondering if C# and C++ are switched, or if C++ code is misinterpreted as C# in github-readme-stats. According to github-readme-stats I do not do any C++, but I have contributed to C++ projects (forks).

@andreped

@theLMGN couldn't you just use the exclude_repo option to exclude that one repository? Since you only did a minor PR, including this repo in the calculation is probably not necessary?

Repository owner locked and limited conversation to collaborators Apr 27, 2021
@anuraghazra unpinned this issue Apr 27, 2021
