SingleProxy returns True but failed to query #498

TingxunShi · 2023-03-26T18:16:25Z

Describe the bug
scholarly fails to run a query even though I set up a proxy and SingleProxy returns True. The code snippet is below:

from scholarly import scholarly, ProxyGenerator


pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success) # True here
scholarly.use_proxy(pg)
search_query = scholarly.search_pubs('A paper title')
pub = next(search_query)
print(pub['num_citations'])  # scholarly 1.x yields dict-like publications

The error reported is:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
  File "myenv\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
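TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond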

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "myenv\lib\site-packages\urllib3\connection.py", line 309, in connect
    conn = self._new_conn()
  File "myenv\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    raise NewConnectionError(
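urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond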

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "myenv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "myenv\lib\site-packages\urllib3\util\retry.py", line 446, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "myenv\lib\site-packages\fp\fp.py", line 32, in get_proxy_list
    page = requests.get(self.__website(repeat))
  File "myenv\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "myenv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "myenv\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "myenv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.sslproxies.org', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000021FDB6B5310>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scratch.py", line 10, in <module>
    scholarly.use_proxy(pg)
  File "myenv\lib\site-packages\scholarly\_scholarly.py", line 78, in use_proxy
    self.__nav.use_proxy(proxy_generator, secondary_proxy_generator)
  File "myenv\lib\site-packages\scholarly\_navigator.py", line 68, in use_proxy
    proxy_works = self.pm2.FreeProxies()
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 550, in FreeProxies
    proxy = self._proxy_gen(None)  # prime the generator
  File "myenv\lib\site-packages\scholarly\_proxy_generator.py", line 509, in _fp_coroutine
    all_proxies = freeproxy.get_proxy_list(repeat=False)  # free-proxy >= 1.1.0
  File "myenv\lib\site-packages\fp\fp.py", line 35, in get_proxy_list
    raise FreeProxyException(
fp.errors.FreeProxyException: Request to https://www.sslproxies.org failed

Desktop (please complete the following information):

Proxy service: Single Proxy, a socks5 proxy started locally
python version: 3.8
OS: Windows 10
scholarly version: 1.7.11

Do you plan on contributing?

Yes, I will create a Pull Request with the bugfix.

arunkannawadi · 2023-03-27T17:36:14Z

Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

TingxunShi · 2023-03-28T14:09:11Z

> Can you try with scholarly.use_proxy(pg, pg) and see if that runs successfully?

It raises scholarly._proxy_generator.MaxTriesExceededException: Cannot Fetch from Google Scholar. However, the proxy itself seems to work. The modified code snippet is shown below:

import requests
from scholarly import scholarly, ProxyGenerator


proxies = {
    "http": "socks5://localhost:1208",
    "https": "socks5://localhost:1208"
}

url = 'https://api.ipify.org'
response = requests.get(url, proxies=proxies)
print(response.text) # code 200, returns a US IP address

pg = ProxyGenerator()
success = pg.SingleProxy(http='socks5://localhost:1208', https='socks5://localhost:1208')
print(success)              # prints True here
scholarly.use_proxy(pg, pg)
search_query = scholarly.search_pubs('Paper title here')
pub = next(search_query)
print(pub['num_citations'])  # scholarly 1.x yields dict-like publications

arunkannawadi · 2023-03-28T14:31:29Z

A working proxy, i.e. success = True, only means that the proxy is able to receive responses. Google Scholar might still identify the traffic as automated and block the request, in which case you'll need a more robust proxy.
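
To make the distinction concrete, here is a minimal sketch of a check that separates "the proxy answered" from "Google Scholar accepted the request". The function name and all three signals (status codes 403/429, a redirect to a /sorry path, the word "captcha" in the body) are assumptions chosen for illustration; they are not scholarly's own detection logic.

```python
from urllib.parse import urlparse


def scholar_response_looks_blocked(status_code: int, final_url: str, body: str) -> bool:
    """Heuristic guess at whether a response is a bot block.

    Every signal here is an illustrative assumption, not scholarly's logic.
    """
    # Assumed signal 1: explicit "forbidden" or "too many requests" status.
    if status_code in (403, 429):
        return True
    # Assumed signal 2: the request was redirected to an interstitial page.
    if urlparse(final_url).path.startswith("/sorry"):
        return True
    # Assumed signal 3: the page body mentions a captcha.
    return "captcha" in body.lower()
```

With requests, the three arguments would come from response.status_code, response.url and response.text of a request sent through the same proxy.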

TingxunShi · 2023-03-28T15:22:27Z

> A working proxy, i.e. success = True, only means that the proxy is able to receive responses. Google Scholar might still identify the traffic as automated and block the request, in which case you'll need a more robust proxy.

I had considered that case, so I visited Google Scholar in a web browser through the same proxy, and it worked. However, I will also follow your suggestion and test with a more robust proxy.

TingxunShi · 2023-03-29T15:29:09Z

I have figured out the cause: I am behind a SOCKS proxy, but in _proxy_generator.py, if the proxy URL doesn't start with "http", the prefix "http://" is prepended, so the configuration became "http": "http://socks5://localhost:1208". I removed that logic and now the response code is 200. However, this exposed another bug, in the captcha handling:

Traceback (most recent call last):
  File "lib\site-packages\scholarly\_navigator.py", line 132, in _get_page
    session = pm._handle_captcha2(pagerequest)
  File "lib\site-packages\scholarly\_proxy_generator.py", line 404, in _handle_captcha2
    cur_host = urlparse(self._get_webdriver().current_url).hostname
AttributeError: 'NoneType' object has no attribute 'current_url'
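
The prefixing problem described at the start of this comment can be illustrated with a minimal sketch. Both helper names are hypothetical and neither is scholarly's actual code: the first mimics the behavior described here, and the second shows one possible way to leave any explicit scheme untouched.

```python
def prefix_if_not_http(proxy: str) -> str:
    # Mimics the described behavior: anything that does not start with
    # "http" gets "http://" prepended, including SOCKS URLs.
    if not proxy.startswith("http"):
        proxy = "http://" + proxy
    return proxy


def prefix_if_no_scheme(proxy: str) -> str:
    # Only prepend "http://" when the string carries no scheme at all.
    if "://" in proxy:
        return proxy
    return "http://" + proxy


print(prefix_if_not_http("socks5://localhost:1208"))   # http://socks5://localhost:1208
print(prefix_if_no_scheme("socks5://localhost:1208"))  # socks5://localhost:1208
print(prefix_if_no_scheme("localhost:8080"))           # http://localhost:8080
```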

arunkannawadi · 2023-06-17T23:24:26Z

The error above regarding the captcha failure is definitely a legitimate bug that I'm fixing right now. Thank you for reporting this.

TingxunShi added the bug label Mar 26, 2023

arunkannawadi added a commit that referenced this issue Jun 17, 2023

Update the proxy keys in _get_webdriver routines

5561061

to reflect the changes that happened when we moved to httpx. Fixes an issue reported in #498.

arunkannawadi added a commit that referenced this issue Jun 17, 2023

Stop prepending proxy with http if it is socks

6cdbaf0

Raised in #498.

arunkannawadi added a commit that referenced this issue Jun 18, 2023

Update the proxy keys in _get_webdriver routines

4df5aad

to reflect the changes that happened when we moved to httpx. Fixes an issue reported in #498.

arunkannawadi added a commit that referenced this issue Jun 18, 2023

Stop prepending proxy with http if it is socks

cd260d6

Raised in #498.

ma-ji mentioned this issue Jul 11, 2023

resolve conflict between proxy format: HTTPX and Requests #507

Merged

4 tasks
