SingleProxy returns True but failed to query #498

Open
1 task
TingxunShi opened this issue Mar 26, 2023 · 6 comments

@TingxunShi

Describe the bug
scholarly fails to run queries even though I set up a proxy and SingleProxy returns True. The code snippet is below:

from scholarly import scholarly, ProxyGenerator


pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # True here
scholarly.use_proxy(pg)
search_query = scholarly.search_pubs('A paper title')
pub = next(search_query)
print(pub.bib['cites'])

The error reported is:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
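TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.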

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "myenv\lib\site-packages\urllib3\connection.py", line 309, in connect
    conn = self._new_conn()
  File "myenv\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    raise NewConnectionError(
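urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.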

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "myenv\lib\site-packages\urllib3\util\retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\fp\fp.py", line 32, in get_proxy_list
    page = requests.get(self.__website(repeat))
  File "myenv\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "myenv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "myenv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scratch.py", line 10, in <module>
    scholarly.use_proxy(pg)
  File "myenv\lib\site-packages\scholarly\_scholarly.py", line 78, in use_proxy
    self.__nav.use_proxy(proxy_generator, secondary_proxy_generator)
  File "myenv\lib\site-packages\scholarly\_navigator.py", line 68, in use_proxy
    proxy_works = self.pm2.FreeProxies()
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 550, in FreeProxies
    proxy = self._proxy_gen(None)  # prime the generator
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 509, in _fp_coroutine
    all_proxies = freeproxy.get_proxy_list(repeat=False)  # free-proxy >= 1.1.0
  File "myenv\lib\site-packages\fp\fp.py", line 35, in get_proxy_list
    raise FreeProxyException(
fp.errors.FreeProxyException: Request to https://www.sslproxies.org failed

Desktop (please complete the following information):

  • Proxy service: Single Proxy, a socks5 proxy started locally
  • python version: 3.8
  • OS: Windows 10
  • Version 1.7.11

Do you plan on contributing?
Your response below will clarify whether the maintainers can expect you to fix the bug you reported.

  • Yes, I will create a Pull Request with the bugfix.

@TingxunShi TingxunShi added the bug label Mar 26, 2023
@arunkannawadi
Collaborator

Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

@TingxunShi
Author

Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

It reports scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar. However, the proxy itself seems to work. The modified code snippet is shown below:

import requests
from scholarly import scholarly, ProxyGenerator


proxies = {
    "http": "socks5://localhost:1208",
    "https": "socks5://localhost:1208"
}

url = 'https://api.ipify.org'
response = requests.get(url, proxies=proxies)
print(response.text) # code 200, returns a US IP address

pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success)              # Print True here
scholarly.use_proxy(pg, pg)
search_query = scholarly.search_pubs('Paper title here')
pub = next(search_query)
print(pub.bib['cites'])

@arunkannawadi
Collaborator

A working proxy, with success = True, means the proxy is able to receive responses. However, Google Scholar might still identify the request as automated and block it. That means you'll need a more robust proxy.
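
To separate "the proxy answers" from "Google Scholar accepts the request", one could probe Scholar directly through the same proxy and inspect the response. The following is only a minimal sketch: the helper names, the probe URL and query, and the block markers (status codes 403/429, the words "captcha" and "unusual traffic") are assumptions made for illustration, not part of scholarly, and Google may signal a block differently.

```python
# Hypothetical helpers, not part of scholarly.
# Assumed heuristics: Scholar signals a block with HTTP 403/429 or with a page
# mentioning a captcha or "unusual traffic". Treat these as guesses.
BLOCK_STATUS_CODES = (403, 429)
BLOCK_MARKERS = ("captcha", "unusual traffic")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if a response looks like an anti-bot block page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    text = body.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


def scholar_accepts_proxy(proxies: dict, timeout: float = 10.0) -> bool:
    """Send one probe request to Google Scholar through the given proxies."""
    # Imported lazily so looks_blocked() can be used without the dependency.
    # SOCKS proxies additionally require the requests[socks] extra (PySocks).
    import requests

    resp = requests.get(
        "https://scholar.google.com/scholar",
        params={"q": "test"},
        proxies=proxies,
        timeout=timeout,
    )
    return not looks_blocked(resp.status_code, resp.text)
```

Calling scholar_accepts_proxy({"http": "socks5://localhost:1208", "https": "socks5://localhost:1208"}) would then indicate whether the block happens at Scholar rather than at the proxy, under the assumptions above.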

@TingxunShi
Author

A working proxy, with success = True, means the proxy is able to receive responses. However, Google Scholar might still identify the request as automated and block it. That means you'll need a more robust proxy.

I had considered the case you suggested, so I visited Google Scholar in a web browser through the same proxy, and it worked. However, I will also follow your suggestion and look for a more robust proxy to check.

@TingxunShi
Author

TingxunShi commented Mar 29, 2023

I have figured out the reason: I am behind a SOCKS proxy, but in _proxy_generator.py, if the proxy string doesn't start with "http", the prefix is added, so the configuration became "http": "http://socks5://localhost:1208". I removed the corresponding logic and now the response code is 200. However, another bug, involving captcha resolution, was then triggered:

Traceback (most recent call last):
  File "lib\site-packages\scholarly\_navigator.py", line 132, in _get_page
    session = pm._handle_captcha2(pagerequest)
  File "lib\site-packages\scholarly\_proxy_generator.py", line 404, in _handle_captcha2
    cur_host = urlparse(self._get_webdriver().current_url).hostname
AttributeError: 'NoneType' object has no attribute 'current_url'
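
The prefixing problem described in this comment can be illustrated with a small sketch. Both functions below are hypothetical and written only for illustration: naive_normalize mimics the behaviour reported here (it is not copied from scholarly's source), and scheme_aware_normalize is one possible alternative that leaves proxy URLs with a recognised scheme untouched. The set of accepted schemes is an assumption.

```python
# Hypothetical illustration of the reported behaviour; not scholarly's code.
# Assumed set of proxy URL schemes that should be passed through unchanged.
KNOWN_SCHEMES = {"http", "https", "socks4", "socks4a", "socks5", "socks5h"}


def naive_normalize(proxy: str) -> str:
    """Prefix 'http://' to anything that does not start with 'http'."""
    if not proxy.startswith("http"):
        return "http://" + proxy
    return proxy


def scheme_aware_normalize(proxy: str) -> str:
    """Add 'http://' only when the proxy string carries no scheme at all."""
    if "://" in proxy:
        scheme = proxy.split("://", 1)[0].lower()
        if scheme not in KNOWN_SCHEMES:
            raise ValueError(f"Unsupported proxy scheme: {scheme!r}")
        return proxy
    return "http://" + proxy


print(naive_normalize("socks5://localhost:1208"))         # http://socks5://localhost:1208
print(scheme_aware_normalize("socks5://localhost:1208"))  # socks5://localhost:1208
print(scheme_aware_normalize("localhost:1208"))           # http://localhost:1208
```

The first call reproduces the malformed "http://socks5://localhost:1208" value mentioned above, while the scheme-aware variant keeps the SOCKS URL intact and still accepts bare host:port strings.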

arunkannawadi added a commit that referenced this issue Jun 17, 2023
to reflect the changes that happened when we moved to httpx.
Fixes an issue reported in #498.
@arunkannawadi
Collaborator

The error above regarding the captcha failure is definitely a legitimate bug that I'm fixing right now. Thank you for reporting this.

arunkannawadi added a commit that referenced this issue Jun 17, 2023
arunkannawadi added a commit that referenced this issue Jun 18, 2023
to reflect the changes that happened when we moved to httpx.
Fixes an issue reported in #498.
arunkannawadi added a commit that referenced this issue Jun 18, 2023