This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Language detection in feed crawler #16

Open
schliflo opened this issue May 11, 2020 · 5 comments
Labels: enhancement (New feature or request), task (Needs to be done)

Comments

@schliflo
Member

We currently serve all feed entries to users regardless of the article language. This leads to situations where users get served "mixed" content:
[Image: Screenshot 2020-05-11 at 15 33 23]

This could be solved by adding some kind of language detection. Ideally, the API would provide a language filter argument or language-specific endpoints.
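A minimal sketch of what such a filter argument could look like on the backend side. The `FeedEntry` model, its `language` field, and the `filter_entries` helper are assumptions made for illustration; the actual schema and API layer are not shown in this issue.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FeedEntry:
    # Hypothetical entry model; the real schema is not shown in this issue.
    title: str
    url: str
    language: Optional[str] = None  # ISO 639-1 code such as "de", or None if unknown


def filter_entries(
    entries: Iterable[FeedEntry], language: Optional[str] = None
) -> List[FeedEntry]:
    """Return all entries, or only those matching `language` when it is given."""
    if language is None:
        return list(entries)
    wanted = language.strip().lower()
    # Entries without a known language are dropped once a filter is requested.
    return [
        e for e in entries
        if e.language is not None and e.language.lower() == wanted
    ]


entries = [
    FeedEntry("Neue Regeln für Reisende", "https://example.org/a", "de"),
    FeedEntry("New rules for travellers", "https://example.org/b", "en"),
    FeedEntry("Untitled", "https://example.org/c"),
]
german_only = filter_entries(entries, language="de")
```

Whether entries with an unknown language should be hidden or shown when a filter is set is a design choice; this sketch hides them.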

@johanneshiry
Member

From an API perspective, providing a language filter isn't a big deal. I think the harder part is determining the language from the headline when we cannot be sure that a whole feed offers only one language. If each feed were single-language, it would be very easy to just add a language field in the database.

I'll check whether the feeds contain mixed languages. If they do, it would make sense to discuss whether we want to spend some time looking into automated language detection or whether we only use single-language feeds in the future.
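One rough way to run such a check is to count detected languages per feed. The sketch below is only illustrative: the `(feed_url, title)` input shape, the function names, and the tiny stop-word lists are all assumptions, and `guess_language` is a toy heuristic standing in for a real detector.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Tiny stop-word lists, purely illustrative; a real check would use a proper detector.
STOPWORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "für", "mit", "ein", "eine"},
    "en": {"the", "and", "is", "not", "for", "with", "a", "an", "of", "to"},
}


def guess_language(text: str) -> Optional[str]:
    """Toy heuristic: pick the language whose stop words occur most often."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first language listed
    return best if scores[best] > 0 else None


def language_counts(
    entries: Iterable[Tuple[str, str]],
    detect: Callable[[str], Optional[str]] = guess_language,
) -> Dict[str, Counter]:
    """Count detected languages per feed; undetectable titles count as 'unknown'."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for feed_url, title in entries:
        counts[feed_url][detect(title) or "unknown"] += 1
    return dict(counts)


def mixed_feeds(counts: Dict[str, Counter]) -> List[str]:
    """Feeds with more than one detected language, ignoring 'unknown'."""
    return sorted(f for f, c in counts.items() if len(set(c) - {"unknown"}) > 1)


sample = [
    ("https://example.org/feed-a", "Die Regierung und die Länder einigen sich"),
    ("https://example.org/feed-a", "The government and the states reach a deal"),
    ("https://example.org/feed-b", "Das ist ein Test für den Crawler"),
]
counts = language_counts(sample)
```

Because `detect` is passed in as a parameter, the toy heuristic can be swapped for a real library without changing the counting logic.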

@schliflo
Member Author

Maybe this lib is an easy solution for now: https://pypi.org/project/langdetect/
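A minimal sketch of how langdetect could be wired in, assuming it is installed via `pip install langdetect`. The helper names `make_langdetect_detector` and `keep_language` are hypothetical. The langdetect import is kept inside the factory function so that the filtering helper can be used and tested with any other detector.

```python
from typing import Callable, Iterable, List, Optional


def make_langdetect_detector() -> Callable[[str], Optional[str]]:
    """Build a detector backed by langdetect (third-party, must be installed)."""
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic unless a seed is set.
    DetectorFactory.seed = 0

    def detector(text: str) -> Optional[str]:
        try:
            return detect(text)  # returns a language code such as "de" or "en"
        except LangDetectException:
            return None  # e.g. empty text or text without usable features

    return detector


def keep_language(
    titles: Iterable[str],
    detector: Callable[[str], Optional[str]],
    language: str = "de",
) -> List[str]:
    """Keep only the titles whose detected language equals `language`."""
    return [t for t in titles if detector(t) == language]


# Usage, once langdetect is installed:
#   detector = make_langdetect_detector()
#   german_titles = keep_language(all_titles, detector)
```

Detection on short headlines tends to be less reliable than on full article text, so running it on the headline plus a summary, where available, may give better results.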

@johanneshiry
Member

Just took a short look, but it seems promising to me. Dunno if it's worth investigating if we plan to fully overhaul the current backend implementation, though?

@schliflo added the enhancement (New feature or request) and task (Needs to be done) labels on Mar 12, 2021
@schliflo
Member Author

@johanneshiry maybe it's feasible to port the language detection logic used in https://github.com/coverified/platform_crawler. We basically only need to filter out all non-German entries.

@johanneshiry
Member

Any reason why we don't do a full switch to https://github.com/coverified/platform_crawler? Maybe that would make more sense. However, I could also provide a small fix here. What's your preferred solution?
