Tistory Crawler

Warning! All legal responsibility rests with the person who runs the crawler. This code was developed for learning purposes.

Introduction

Tistory Crawler is a library for collecting Tistory posts in bulk. It helps you put together a Korean-language blog dataset with little effort, and it runs on Chrome Driver and Selenium.

Setup

Install Chrome Driver
- Download the Chrome Driver build that matches your OS and place it in the project's top-level folder.
- In main.py, change the driver PATH to the absolute path where the driver is stored.
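
The driver setup above can be sketched as follows, assuming Selenium 4 or later. `DRIVER_PATH` and `make_driver` are hypothetical names used for illustration; the variable in main.py may be named differently.

```python
from pathlib import Path

# Hypothetical location: a chromedriver binary in the project's top-level folder.
# absolute() turns the relative name into an absolute path.
DRIVER_PATH = Path("chromedriver").absolute()


def make_driver(driver_path=DRIVER_PATH):
    """Start Chrome through the driver binary at driver_path (Selenium 4 API)."""
    # Imported here so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=str(driver_path)))
```

Calling `make_driver()` opens a browser window, so it needs both Chrome and a matching driver binary on the machine.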
(Optional) Connect a database
- This project was developed to store its data in PostgreSQL. Create a Secrets.py file and fill in the SECRET_HOST, SECRET_DBNAME, SECRET_USER, SECRET_PASSWORD, and SECRET_PORT variables.
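
As a rough sketch, Secrets.py could hold plain module-level variables like the ones below, which map directly onto the keyword arguments of `psycopg2.connect`. Every value is a placeholder, and `connection_kwargs` and `connect` are hypothetical helpers, not functions from this project.

```python
# What Secrets.py might contain; every value here is a placeholder.
SECRET_HOST = "localhost"
SECRET_DBNAME = "tistory"
SECRET_USER = "crawler"
SECRET_PASSWORD = "change-me"
SECRET_PORT = 5432


def connection_kwargs():
    """Map the SECRET_* variables onto psycopg2.connect keyword arguments."""
    return {
        "host": SECRET_HOST,
        "dbname": SECRET_DBNAME,
        "user": SECRET_USER,
        "password": SECRET_PASSWORD,
        "port": SECRET_PORT,
    }


def connect():
    """Open a PostgreSQL connection (needs psycopg2 and a reachable server)."""
    import psycopg2  # imported lazily so the settings can be checked without it

    return psycopg2.connect(**connection_kwargs())
```

Because the file holds credentials, it is a good candidate for .gitignore.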

Run

python main.py

Crawl Blog Path

Crawls the blog host names from the five tabs of the Story tab on the Tistory main page and stores them in the database. As of 2023-03-21, a single tab held 6,100 recommended posts.
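
Pulling the host name out of a recommended post's URL might look like the sketch below. `blog_host` and `unique_hosts` are hypothetical helpers, and skipping `www.tistory.com` is an assumption about how the main site should be treated.

```python
from urllib.parse import urlparse


def blog_host(url):
    """Return the Tistory blog host for a post URL, or None for other sites."""
    # urlparse only fills in hostname when the URL has a scheme or starts with //.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname
    # Assumption: the main site itself is not a blog and is skipped.
    if host and host.endswith(".tistory.com") and host != "www.tistory.com":
        return host
    return None


def unique_hosts(urls):
    """Collect distinct blog hosts, keeping first-seen order."""
    hosts = []
    for url in urls:
        host = blog_host(url)
        if host is not None and host not in hosts:
            hosts.append(host)
    return hosts
```

Deduplicating before writing to the database keeps a blog that appears in several tabs from being stored more than once.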

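Tistory's recommendation feed is implemented much like infinite scroll, so the crawler scrolls down one row at a time and parses the URLs as it goes. For this reason, you must not change the window size. See the tistory_recommendation function.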
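
The general scroll-and-parse idea can be sketched as below. This is not the project's tistory_recommendation function: `collect_links_by_scrolling`, the `a[href]` selector, and the step, count, and pause values are all assumptions chosen for illustration. The `driver` argument is expected to be a Selenium WebDriver that already has the page loaded.

```python
import time


def collect_links_by_scrolling(driver, step_px=300, max_steps=50, pause_s=0.5):
    """Scroll down in fixed steps and gather every link href seen on the way."""
    seen = []
    for _ in range(max_steps):
        # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
        for anchor in driver.find_elements("css selector", "a[href]"):
            href = anchor.get_attribute("href")
            if href and href not in seen:
                seen.append(href)
        # Move the viewport down by a fixed number of pixels.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        time.sleep(pause_s)  # give newly revealed items time to render
    return seen
```

A fixed pixel step only lines up with the rows of the page while the layout stays the same, which is one reason a crawler of this kind is sensitive to window size.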

Crawl Blog Body

Only when the PATH of a collected blog post consists solely of digits does the crawler loop from post 1 up to the collected PATH and crawl each HTML body. Blogs whose PATH contains letters are currently not collected.

Example: a blog that uses an address scheme like sweetdev.tistory.com/about-python is not collected, while posts from a blog that uses a numeric address scheme like sweetdev.tistory.com/14 are collected.
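
The numeric-path rule and the loop from post 1 could be expressed roughly as follows. `numeric_post_id` and `post_urls` are hypothetical helpers, and the `https://` scheme in the generated URLs is an assumption.

```python
from urllib.parse import urlparse


def numeric_post_id(url):
    """Return the post number if the URL path is digits only, else None."""
    if "://" not in url and not url.startswith("//"):
        url = "//" + url  # let urlparse see the host when no scheme is given
    path = urlparse(url).path.strip("/")
    # isascii() guards against non-ASCII digit characters that isdigit() accepts.
    return int(path) if path.isascii() and path.isdigit() else None


def post_urls(host, last_id):
    """Yield post URLs from 1 up to and including last_id."""
    for post_id in range(1, last_id + 1):
        yield f"https://{host}/{post_id}"
```

With the example above, `numeric_post_id("sweetdev.tistory.com/14")` gives 14, so the crawler would visit posts 1 through 14, while the `about-python` address yields `None` and is skipped.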

Dependencies

beautifulsoup4
urllib3
selenium
psycopg2


jonyejin/Tistory-Crawler
