add README.md
Lewis Chen authored and recherchetts committed Apr 30, 2019
1 parent 4c600b3 commit 4971977
Showing 8 changed files with 74 additions and 80 deletions.
2 changes: 1 addition & 1 deletion .idea/SRT.iml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 1 addition & 4 deletions .idea/misc.xml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

56 changes: 55 additions & 1 deletion README.md
@@ -1 +1,55 @@
# SRT
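# Crawler V1.0

## Framework

Scrapy

## Websites

1. National Development and Reform Commission (NDRC, 国家发改委)
1. National Energy Administration (NEA, 国家能源局)

* All sections of both sites are crawled

## Page Rendering

* Splash (running in Docker)

The pages contain dynamically rendered content (JavaScript and the like), so rendering has to finish before the page content is extracted. In other words, the dynamically rendered parts visible in a browser do not appear in the directly fetched page source, so a single shared renderer (effectively a browser) is needed to process that content.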
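
As a rough illustration of what "render first, then extract" means, the sketch below builds a URL for Splash's `render.html` HTTP endpoint, which returns a page's HTML after JavaScript has run. The host and port (`http://localhost:8050`, Splash's default port) and the helper name `splash_render_url` are assumptions for illustration; the crawler itself goes through `SplashRequest` rather than hand-built URLs.

```python
from urllib.parse import urlencode

# Assumption: Splash runs in Docker and is reachable on its default port.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=1.0, base=SPLASH_BASE):
    """Build a render.html URL asking Splash to load page_url,
    wait `wait` seconds for scripts to run, and return the rendered HTML."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


if __name__ == "__main__":
    # Placeholder target; fetching this URL returns the rendered page source.
    print(splash_render_url("http://example.com/list?page=1"))
```

Fetching the resulting URL with any HTTP client yields the post-rendering HTML, which is what the parsing step then works on.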

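## Features

### Main features

Parsing:

1. Page content (title, body, date, author, etc.)
1. Attachment content (docx, doc, xlsx, xls, pdf)

The results are saved to two databases:

1. MongoDB (NoSQL database)
1. ElasticSearch (search engine backend database)

### Details

1. Incremental crawling (pages that have already been crawled are not crawled again; uses the DeltaFetch library, which is backed by Berkeley DB)
1. Image recognition of scanned documents via the Baidu AI platform
1. Attachment reading is guarded against blocking: reading a large file is cut off once it exceeds a set time limit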
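
The idea behind incremental crawling can be sketched in a few lines: derive a stable key for each page, remember the keys of pages that have been processed, and skip any page whose key is already known. The sketch below is a simplified stand-in, not DeltaFetch's implementation: it keeps keys in an in-memory set instead of Berkeley DB, and `url_key`, `SeenStore`, and `filter_new` are hypothetical names.

```python
import hashlib


def url_key(url):
    """Return a stable hex digest used as the 'already crawled' key for a URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class SeenStore:
    """In-memory stand-in for an on-disk key store (DeltaFetch uses Berkeley DB)."""

    def __init__(self):
        self._keys = set()

    def is_new(self, url):
        return url_key(url) not in self._keys

    def mark_done(self, url):
        # Called once a page has been parsed and its item saved.
        self._keys.add(url_key(url))


def filter_new(urls, store):
    """Keep only URLs that have not been crawled yet, preserving order."""
    return [u for u in urls if store.is_new(u)]
```

With one key per article URL, a later run re-reads the index pages but only requests the articles that were not seen before.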

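# ElasticSearch search engine

An open-source, distributed, real-time full-text search engine.

When it receives data written by Scrapy, it performs Chinese word segmentation and generates search suggestions from the article title, body, and attachment content.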
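
One way such suggestions could be supplied is through a completion-type field, which accepts a document value of the form `{"input": [...]}`. The sketch below assembles an index document in that shape. The field names (`title`, `body`, `attachment`, `suggest`), the `build_document` helper, and the idea that `keywords` come from a separate word-segmentation step are all assumptions; the README does not show the actual mapping or how the suggestions are derived.

```python
def build_document(title, body, attachment_text="", keywords=()):
    """Assemble an index document with a completion-suggester input list.

    `keywords` is assumed to be produced by a word-segmentation step
    that runs elsewhere; it is not computed here.
    """
    inputs = []
    for text in (title, *keywords):
        text = text.strip()
        # Skip empty strings and duplicates while preserving order.
        if text and text not in inputs:
            inputs.append(text)
    return {
        "title": title,
        "body": body,
        "attachment": attachment_text,
        "suggest": {"input": inputs},  # hypothetical completion field
    }
```

The returned dictionary can be passed as the document body to whichever ElasticSearch client the pipeline uses.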

# ReactiveSearch

An open-source front end for the ElasticSearch search engine.

Official website: [https://opensource.appbase.io/reactivesearch/](https://opensource.appbase.io/reactivesearch/)

GitHub: [https://github.com/appbaseio/reactivesearch](https://github.com/appbaseio/reactivesearch)

It offers a variety of customizable components, including a search box and filters (date, content); see the official website and the GitHub README for details.


Binary file removed 发改委/NDRC/antiword
Binary file not shown.
72 changes: 0 additions & 72 deletions 发改委/NDRC/kantiword

This file was deleted.

13 changes: 13 additions & 0 deletions 发改委/NDRC/pdftest.py
@@ -0,0 +1,13 @@
import pdfminer
import subprocess
from 发改委.NDRC.pdf2txt import extract_text


def readPDF(file_paths):
    for file_path in file_paths:
        subprocess.Popen(['python', r'C:\Users\haoli\PycharmProjects\SRT\发改委\NDRC\pdf2txt.py', file_path])


if __name__ == '__main__':
    file_list = [r"C:\Users\haoli\PycharmProjects\SRT\发改委\NDRC\test_files\test.pdf"]
    readPDF(file_list)
3 changes: 2 additions & 1 deletion 发改委/NDRC/spiders/general.py
@@ -98,7 +98,8 @@ def parse(self, response):  # first-level page parsing (index page)
             yield SplashRequest(url=linklist[i], callback=self.parse_content,
                                 meta={'date': datelist[i], 'title': titlelist[i],
                                       'class0': determine_class0(response.url),
-                                      'class1': determine_class1(response.url)},
+                                      'class1': determine_class1(response.url),
+                                      "deltafetch_key": request_fingerprint(response.request)},  # pass the fingerprint for incremental crawling
                                 splash_headers={"User-Agent": USER_AGENT, "Referer": response.url})
         next = response.urljoin(response.xpath('//li//a[text()="下一页"]/@href').extract_first())
         """
3 changes: 2 additions & 1 deletion 发改委/NDRC/utility.py
@@ -236,7 +236,8 @@ def readPDF(file_path):

 @func_set_timeout(5)  # timeout limit of 5 s, see http://www.cnblogs.com/hester/p/7641258.html
 def pdf2text(file_path):
-    cmd = 'python3 ' + '/Users/chenhaolin/PycharmProjects/SRT/发改委/NDRC/pdf2txt.py ' \
+    # path changed
+    cmd = 'python ' + '/Users/chenhaolin/PycharmProjects/SRT/发改委/NDRC/pdf2txt.py ' \
           + file_path
     output_text = os.popen(cmd)
     return output_text.read()
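
The hunk above switches between `python3` and `python` and builds a shell command string by concatenation. As an alternative sketch, not the commit's method, the standard library can run the converter with the current interpreter and enforce the time limit directly. `run_with_timeout` and `pdf_to_text` are hypothetical names, and the converter script is assumed to print the extracted text to stdout.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_s=5.0):
    """Run a command and return its stdout, or None if it exceeds timeout_s seconds."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(args, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


def pdf_to_text(script_path, pdf_path, timeout_s=5.0):
    """Run a converter script (assumed to print text to stdout) on one PDF."""
    # sys.executable is the interpreter running this code, so there is no
    # 'python' versus 'python3' choice to hard-code.
    return run_with_timeout([sys.executable, script_path, pdf_path], timeout_s)
```

Passing the arguments as a list also avoids shell quoting problems when a file path contains spaces or non-ASCII characters.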
