<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>Python: package data_scraper</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body bgcolor="#f0f0f8">
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="heading">
<tr bgcolor="#7799ee">
<td valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"> <br><big><big><strong>data_scraper</strong></big></big></font></td
><td align=right valign=bottom
><font color="#ffffff" face="helvetica, arial"><a href=".">index</a><br><a href="file:c%3A%5Cusers%5Chp%5Cdownloads%5Cdata_scraper%5C__init__.py">c:\users\hp\downloads\data_scraper\__init__.py</a></font></td></tr></table>
<p><tt># coding: utf-8</tt></p>
<p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
<tr bgcolor="#aa55cc">
<td colspan=3 valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"><big><strong>Package Contents</strong></big></font></td></tr>
<tr><td bgcolor="#aa55cc"><tt> </tt></td><td> </td>
<td width="100%"><table width="100%" summary="list"><tr><td width="25%" valign=top><a href="data_scraper.Data_Scrapping_with_Functions.html">Data_Scrapping_with_Functions</a><br>
</td><td width="25%" valign=top></td><td width="25%" valign=top></td><td width="25%" valign=top></td></tr></table></td></tr></table><p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
<tr bgcolor="#eeaa77">
<td colspan=3 valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"><big><strong>Functions</strong></big></font></td></tr>
<tr><td bgcolor="#eeaa77"><tt> </tt></td><td> </td>
<td width="100%"><dl><dt><a name="-get_processed_urls"><strong>get_processed_urls</strong></a>(urls)</dt><dd><tt>Separate the mail links from the profile links in the given list of URLs.<br>
A list of URLs is passed in; the links to profiles are separated from the mail links and appended to processed_urls.</tt></dd></dl>
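<p><tt>The separation described for get_processed_urls can be sketched as follows. This is a minimal illustration, not the package's code: the name split_mail_and_profile_links is hypothetical, and treating any URL that starts with mailto: as a mail link is an assumption, since the docstring does not say how mail links are recognized.</tt></p>
<pre>
```python
def split_mail_and_profile_links(urls):
    """Return (profile_links, mail_links) from a list of URL strings."""
    profile_links = []
    mail_links = []
    for url in urls:
        # Assumption: a mail link is any URL that uses the mailto: scheme.
        if url.strip().lower().startswith("mailto:"):
            mail_links.append(url)
        else:
            profile_links.append(url)
    return profile_links, mail_links
```
</pre>
<p><tt>Returning both lists keeps the mail links available to the caller; the documented function appears to keep only processed_urls.</tt></p>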
<dl><dt><a name="-main"><strong>main</strong></a>()</dt><dd><tt>Holds the list of links from which data is to be scraped.<br>
A for loop traverses this list and passes each link to process_a_department.<br>
process_a_department then calls the other functions needed to process the link and create the output text file containing all the relevant information.</tt></dd></dl>
<dl><dt><a name="-make_text_file"><strong>make_text_file</strong></a>(vals, processed_urls)</dt><dd><tt>Create text files for the data extracted from the Profile section and the relative links of the processed URLs.<br>
The URLs for the Profile section are analyzed, and the data scraped from them is stored in .txt files.<br>
Once a text file has been created for each unique name id, the relative links are checked.<br>
The list vals is traversed; each of these relative links is checked and written to the corresponding name id text file with UTF-8 encoding.</tt></dd></dl>
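<p><tt>The per-name-id UTF-8 output described for make_text_file might look like the sketch below. The helper append_section, its parameters, and the "name_id.txt" file naming are all hypothetical; the docstring only states that one text file exists per unique name id and that it is written with UTF-8 encoding.</tt></p>
<pre>
```python
from pathlib import Path


def append_section(output_dir, name_id, section, text):
    """Append one scraped section to the text file for name_id."""
    # Assumption: one file per name id, named after the id itself.
    path = Path(output_dir) / f"{name_id}.txt"
    # An explicit encoding keeps non-ASCII names and titles intact
    # regardless of the platform's default encoding.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"{section}\n{text}\n\n")
    return path
```
</pre>
<p><tt>Opening the file in append mode lets the Profile data be written first and each entry from vals be added afterwards without overwriting earlier sections.</tt></p>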
<dl><dt><a name="-my_function"><strong>my_function</strong></a>(link)</dt><dd><tt>Build the list of URLs to scrape by creating a Response object for the given link.<br>
The link is passed to this function, which calls requests.get() to create its Response object.<br>
The content of the Response object is parsed with Beautiful Soup using the lxml parser.<br>
URLs are appended to the list by reading the href attribute of the links on the page.</tt></dd></dl>
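<p><tt>The href collection step of my_function can be illustrated as below. The documented function fetches the page with requests.get() and parses it with Beautiful Soup and lxml; this sketch leaves the fetch out and swaps in the standard library's html.parser so that it runs without third-party packages. HrefCollector and collect_hrefs are hypothetical names.</tt></p>
<pre>
```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every anchor tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as a lone href.
            if name == "href" and value:
                self.hrefs.append(value)


def collect_hrefs(html_text):
    """Return every non-empty href found in html_text."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs
```
</pre>
<p><tt>In the documented flow, the text passed to a helper like collect_hrefs would come from the Response object returned by requests.get(link).</tt></p>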
<dl><dt><a name="-process_a_department"><strong>process_a_department</strong></a>(link)</dt><dd><tt>Sequentially call the functions required to process the link and scrape its data.</tt></dd></dl>
</td></tr></table><p>
<table width="100%" cellspacing=0 cellpadding=2 border=0 summary="section">
<tr bgcolor="#55aa55">
<td colspan=3 valign=bottom> <br>
<font color="#ffffff" face="helvetica, arial"><big><strong>Data</strong></big></font></td></tr>
<tr><td bgcolor="#55aa55"><tt> </tt></td><td> </td>
<td width="100%"><strong>vals</strong> = ['CurriculumVitae', 'SponsoredProjects', 'Courses', 'Publications', 'ProfessionalWork', 'ResearchInterest', 'Seminars', 'Memberships', 'ResearchPublications', 'VisitsAbroad', 'ProfessionalMemberships', 'Awards', 'ResearchProjects', 'InstitutionalContribution', 'AwardsandRecognitions', 'ResearchScholar', 'InstitutionalResponsibilities', 'InviteTalks', 'ProfessionalActivities', 'Students', ...]</td></tr></table>
</body></html>