In this guide, you will learn how to use proxies with Python requests, particularly for web scraping, to bypass website restrictions by changing your IP and location:
- Using a Proxy with a Python Request
- Installing Packages
- Components of Proxy IP Address
- Setting Proxies Directly in Requests
- Setting Proxies via Environment Variables
- Rotating Proxies Using a Custom Method and an Array of Proxies
- Using the Bright Data Proxy Service with Python
- Conclusion
Use pip install (for example, pip install requests beautifulsoup4) to install the following Python packages, which send requests to the web page and collect the links:
- requests: sends HTTP requests to the website you want to scrape.
- beautifulsoup4: parses HTML and XML documents to extract all the links.
A proxy server address has three primary components:
- Protocol: typically either HTTP or HTTPS.
- Address: an IP address or a DNS hostname.
- Port number: any value between 0 and 65535, e.g. 2000.
Thus, a proxy address would look like this: https://192.167.0.1:2000 or https://proxyprovider.com:2000.
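To make the three components concrete, here is a minimal sketch that splits a proxy address with the standard library's urllib.parse module. The helper name parse_proxy_address is hypothetical and used only for illustration; restricting the protocol to HTTP and HTTPS is an assumption taken from the list above.

```python
from urllib.parse import urlsplit


def parse_proxy_address(proxy_address):
    """Split a proxy address into its protocol, address, and port.

    Hypothetical helper for illustration; not part of Requests.
    """
    parts = urlsplit(proxy_address)
    # Assumption: only the two protocols named above are accepted.
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported protocol: {parts.scheme!r}")
    # urlsplit itself raises ValueError when the port is outside 0-65535.
    if parts.hostname is None or parts.port is None:
        raise ValueError("Proxy address must include both an address and a port")
    return parts.scheme, parts.hostname, parts.port


protocol, address, port = parse_proxy_address("https://proxyprovider.com:2000")
print(protocol, address, port)  # https proxyprovider.com 2000
```

A check like this can catch a missing port or a mistyped protocol before any request is sent.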
This guide covers three ways to set proxies in requests. The first approach passes them directly to the request call.
Do as follows:
- Import the Requests and Beautiful Soup packages in your Python script.
- Create a dictionary called proxies that contains the proxy server information.
- In the proxies dictionary, define both the HTTP and HTTPS connections to the proxy URL.
- Define a Python variable that holds the URL of the web page you want to scrape. The script below uses https://example.com/.
- Send a GET request to the web page using the requests.get() method with two arguments: the URL of the website and proxies. The response is stored in the response variable.
- Pass response.content and html.parser as arguments to the BeautifulSoup() method to parse the page.
- Use the find_all() method with a as an argument to find all the links on the web page.
- Extract the href attribute of each link using the get() method.
Here is the complete source code:
# import packages.
import requests
from bs4 import BeautifulSoup
# Define proxies to use.
proxies = {
    'http': 'http://proxyprovider.com:2000',
    'https': 'https://proxyprovider.com:2000',
}
# Define a link to the web page.
url = "https://example.com/"
# Send a GET request to the website.
response = requests.get(url, proxies=proxies)
# Use BeautifulSoup to parse the HTML content of the website.
soup = BeautifulSoup(response.content, "html.parser")
# Find all the links on the website.
links = soup.find_all("a")
# Print all the links.
for link in links:
    print(link.get("href"))
To use the same proxy for all requests, it's best to set environment variables in the terminal window:
export HTTP_PROXY='http://proxyprovider.com:2000'
export HTTPS_PROXY='https://proxyprovider.com:2000'
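Before relying on these variables, it can help to confirm that the Python process actually sees them. The following is a minimal sketch using only the standard library; proxies_from_environment is a hypothetical helper for inspection, not a Requests function, and checking the lowercase name before the uppercase one is an assumption about which spelling should win.

```python
import os


def proxies_from_environment(environ=None):
    """Collect HTTP/HTTPS proxy settings from environment variables.

    Hypothetical helper for inspection only. The lowercase variable name
    is checked first, then the uppercase one.
    """
    if environ is None:
        environ = os.environ
    found = {}
    for scheme in ("http", "https"):
        value = environ.get(f"{scheme}_proxy") or environ.get(f"{scheme.upper()}_PROXY")
        if value:
            found[scheme] = value
    return found


# Prints an empty dictionary if neither variable is set in this shell.
print(proxies_from_environment())
```

If the printed dictionary is empty, the variables were probably exported in a different terminal session than the one running the script.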
You can remove the proxies definition from the script now:
# import packages.
import requests
from bs4 import BeautifulSoup
# Define a link to the web page.
url = "https://example.com/"
# Send a GET request to the website.
response = requests.get(url)
# Use BeautifulSoup to parse the HTML content of the website.
soup = BeautifulSoup(response.content, "html.parser")
# Find all the links on the website.
links = soup.find_all("a")
# Print all the links.
for link in links:
    print(link.get("href"))
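If you would rather keep the proxy configuration inside the script while still sharing it across every request, a requests.Session object is one option. The sketch below is an illustration under assumptions, not the only way to do it: the fetch helper and its 10-second timeout are hypothetical, and trust_env is turned off so that the session's own proxies are used instead of any proxy variables in the environment.

```python
import requests

proxies = {
    "http": "http://proxyprovider.com:2000",
    "https": "https://proxyprovider.com:2000",
}

# One session shares its settings across all requests made through it.
session = requests.Session()
session.proxies.update(proxies)
# Do not read proxy settings (or other configuration such as .netrc
# credentials) from the environment, so the values above are the ones used.
session.trust_env = False


def fetch(url, timeout=10):
    """Hypothetical helper: GET a page through the session's proxies."""
    response = session.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response
```

Calling fetch("https://example.com/") would then route the request through the proxies defined at the top of the block.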
Rotating proxies helps you work around the restrictions that websites impose when they receive a large number of requests from the same IP address.
Do as follows:
- Import the following Python libraries: Requests, Beautiful Soup, and Random.
- Create a list of proxies to use during the rotation process. Use the http://proxyserver.com:port format:
# List of proxies
proxies = [
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",
    "http://proxyprovider3.com:2090"
]
- Create a custom method called get_proxy(). It randomly selects a proxy from the list of proxies using the random.choice() method and returns the selected proxy in dictionary format (both HTTP and HTTPS keys). You’ll use this method whenever you send a new request:
# Custom method to rotate proxies
def get_proxy():
    # Choose a random proxy from the list
    proxy = random.choice(proxies)
    # Return a dictionary with the proxy for both http and https protocols
    return {'http': proxy, 'https': proxy}
- Create a loop that sends a certain number of GET requests using the rotated proxies. In each request, the get() method uses a randomly chosen proxy returned by the get_proxy() method.
- Collect the links from the HTML content of the web page using the Beautiful Soup package, as explained previously.
- Catch and print any exceptions that occur during the request process.
Here is the complete source code for this example:
# import packages
import requests
from bs4 import BeautifulSoup
import random
# List of proxies
proxies = [
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",
    "http://proxyprovider3.com:2090"
]
# Custom method to rotate proxies
def get_proxy():
    # Choose a random proxy from the list
    proxy = random.choice(proxies)
    # Return a dictionary with the proxy for both http and https protocols
    return {'http': proxy, 'https': proxy}
# Send requests using rotated proxies
for i in range(10):
    # Set the URL to scrape
    url = 'https://brightdata.com/'
    try:
        # Send a GET request with a randomly chosen proxy
        response = requests.get(url, proxies=get_proxy())
        # Use BeautifulSoup to parse the HTML content of the website.
        soup = BeautifulSoup(response.content, "html.parser")
        # Find all the links on the website.
        links = soup.find_all("a")
        # Print all the links.
        for link in links:
            print(link.get("href"))
    except requests.exceptions.RequestException as e:
        # Handle any exceptions that may occur during the request
        print(e)
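Random selection can pick the same proxy several times in a row. As an alternative to the random.choice() approach above, the following sketch rotates through the list in a fixed round-robin order using itertools.cycle, so every proxy is used once before any is reused. The names proxy_list and rotating_proxies are hypothetical, and the shortened list reuses placeholder addresses from this guide.

```python
import itertools

# Placeholder proxies in the same format as the list above.
proxy_list = [
    "http://proxyprovider1.com:2010",
    "http://proxyprovider2.com:2040",
    "http://proxyprovider3.com:2070",
]


def rotating_proxies(proxy_urls):
    """Yield requests-style proxy dictionaries in a fixed, repeating order."""
    for proxy in itertools.cycle(proxy_urls):
        yield {"http": proxy, "https": proxy}


rotation = rotating_proxies(proxy_list)
# Take four proxies to show that the fourth wraps around to the first.
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)
```

In the request loop, next(rotation) could be passed as the proxies argument in place of get_proxy().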
Bright Data has a large network of more than 72 million residential proxy IPs and more than 770,000 datacenter proxies.
You can integrate Bright Data’s datacenter proxies into your Python requests. Once you have an account with Bright Data, follow these steps to create your first proxy:
- Click View proxy product on the welcome page to view the different types of proxy offered by Bright Data.
- Select Datacenter Proxies to create a new proxy, add your details on the subsequent page, and save it.
- Once your proxy is created, the dashboard shows the parameters to use in your scripts, such as the host, the port, the username, and the password.
- Copy these parameters into your script and use the following proxy URL format: username-session-(session-id):password@host:port.
Note: The session-id is a random number generated with Python's random package.
Here is the code that uses a proxy from Bright Data in a Python request:
import requests
from bs4 import BeautifulSoup
import random
# Define parameters provided by Bright Data
host = 'brd.superproxy.io'
port = 33335
username = 'username'
password = 'password'
session_id = random.random()
# Format your proxy URL
proxy_url = ('http://{}-session-{}:{}@{}:{}'.format(username, session_id,
                                                     password, host, port))
# Define your proxies in a dictionary
proxies = {'http': proxy_url, 'https': proxy_url}
# Send a GET request to the website
url = "https://example.com/"
response = requests.get(url, proxies=proxies)
# Use BeautifulSoup to parse the HTML content of the website
soup = BeautifulSoup(response.content, "html.parser")
# Find all the links on the website
links = soup.find_all("a")
# Print all the links
for link in links:
    print(link.get("href"))
Running this code with your own credentials sends the request through Bright Data’s proxy service.
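If the username or password contains characters such as "@" or ":", placing them in the proxy URL unchanged makes the URL ambiguous. The following sketch percent-encodes the credentials before assembling the URL. The build_proxy_url helper, the sample password, and the integer session ID are hypothetical values chosen for illustration; only the host, port, and URL layout come from the example above.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, session_id):
    """Build a proxy URL in the username-session-<id>:password@host:port form.

    Hypothetical helper for illustration; not part of Requests or Bright Data.
    """
    # Percent-encode the credentials so reserved characters such as "@" or ":"
    # cannot be mistaken for URL delimiters.
    user_part = quote(f"{username}-session-{session_id}", safe="")
    password_part = quote(password, safe="")
    return f"http://{user_part}:{password_part}@{host}:{port}"


# Placeholder credentials; replace them with the values from your dashboard.
proxy_url = build_proxy_url("username", "p@ss:word", "brd.superproxy.io", 33335, session_id=12345)
proxies = {"http": proxy_url, "https": proxy_url}
print(proxy_url)
# http://username-session-12345:p%40ss%3Aword@brd.superproxy.io:33335
```

The resulting proxies dictionary has the same shape as the one in the script above and can be passed to requests.get() in the same way.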
With Bright Data’s web platform, you can get reliable proxies for your project that cover any country or city in the world. Try Bright Data's proxy services for free now!