[狗爬], [aims to be] a high-performance, distributed, and lightweight spider written in Go.
- pipeline config moved to dedicated section, can be referenced by id
- elasticsearch config moved to dedicated section, can be referenced by id
- remove indexer module, now nested in the elastic module
- add parse_pdf joint to parse PDF files
- support major elasticsearch versions, auto handle API differences
- auto clone and update framework/vendor repo
- move vendor out of this repo
- fix relative links not being resolved properly with https
- fix unhandled exception on redirected links
- extract common codebase to another repo: https://github.com/infinitbyte/framework
- sqlite retired, elasticsearch is the first-class citizen
- add a new cmd static_fs to support loading static files from a folder
- auto generate elasticsearch mapping and template, no need to manually create the mapping first
- add new backup command to support migration
- optimize sql, speed up task list
- enable cross domain requests
- fix mysql as database option
- update update_check_time, fix init next_fetch time
- refactor domain to host, api and mapping have changed
- refactor module, update yml settings: module->name
- dynamically create pipelines
- init plugin architecture
- support extract tags by css path
- add chrome fetch joint, via chrome debug protocol
- add auto-completion to search ui
- search ui support mobile
- support access control by github oauth
- remove goleveldb due to memory leak
- update logo
- remove hard coded version
- update task UI, support filter by status and host
- clean offset_canvas menu
- move repo to infinitbyte/gopa, for better collaboration, namespace changed as well
- separate API and UI, listen on different port
- add mysql as database option
- add elasticsearch as database option
- add elasticsearch as blob(snapshot) datastore
- task fetch and update with stepped delay
- add hash joint to crawler pipeline
- dispatch tasks and auto update tasks
- add proxy to fetch joint
- filter url before push to checker
- add rules config to url filter
- support elasticsearch as database store
- add task_deduplication in the check phase
- add content hash check to detect duplication
- refactor webhunter, support basic auth
- add pipeline joint to detect the language of webpage
- add search ui
- multi instance support on local machine
- streamline clustering on local machine
- modules and pipelines dynamic config ready
- pipeline and context refactored to support dynamic parameters
- save snapshot to KV store and update task management
- optimize shutdown logic, halve the number of goroutines
- add a wiki about how to build gopa on windows
- remove timeout in queue by default
- improve statsd performance with buffered client
- refine log level, make pprof listen address configurable
- update task ui, limit length of name
- detect dead process, re-place lock file
- persist auto-incremented id sequence to disk
- simplified joint register
- add high performance tolowercase and touppercase func
- add queue stats api
- remove simhash due to poor performance and memory leak
- fix wrong relative url by using unicode index
- fix statsd sending no data
- fix poor string merge performance
- fix http goroutine leak
- raft clustering
- dynamically change logging settings from the console, logs can be filtered by level, message, file and function name
- dynamically create pipeline
- add tls to security api and websocket
- add proxy to crawler pipeline
- use template engine, UI refactoring
- add a logo
- fix incorrect stats number, incorrect task filter
- fix incorrect redirect handler, url ignored
- add stats api to expose the task info, http://localhost:8001/stats
- add websocket and simple ui to interact with Gopa, http://localhost:8001/ui/
- add task api to accept seed
- dynamically change the seelog config via api, [GET/POST] http://localhost:8001/setting/seelog/
- follow 301/302 redirects and continue fetching
- add boltdb status page, http://localhost:8001/ui/boltdb
- add pipeline framework to create crawler
- add command to dynamically change logging level and add seed url
- export metrics to statsD
- support daemon mode in linux and darwin
- add task management api
- add update_ui setup to Makefile in order to build static ui
- add git commit log and build_date to gopa binary
- console ui support websocket reconnect
- remove bloom, use leveldb to store urls
- crawling speed control
- cookie supported
- brief logging format
- shutdown nil exception
- wrong relative link in parse phase
- ruled fetch
- fetch/parse offset can be persisted and reloadable
- http console
- refactor storage interface, data path is now configurable
- disable pprof by default
- use local storage instead of kafka, kafka will be removed later
- check whether the local file exists before fetching the remote page
- resolve memory leak caused by sbloom filter
- download by url template
- list page download
- add golang pprof, http://localhost:6060/debug/pprof/
- go tool pprof http://localhost:6060/debug/pprof/heap
- go tool pprof http://localhost:6060/debug/pprof/profile
- go tool pprof http://localhost:6060/debug/pprof/block
- integrate with kafka to make task controllable and recoverable
- parameters configurable
- goroutine can be controlled now
- bloom-filter persistence
- building script works
- just up and run.
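
Several entries above deal with resolving relative links found on a fetched page (over https, and in the parse phase). As a rough illustration only, and not gopa's actual code, the sketch below shows one way to do this with Go's standard net/url package. The function name resolveLink and the example.com URLs are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink is a hypothetical helper, not part of gopa.
// It resolves href against the URL of the page it was found on,
// following the RFC 3986 reference resolution that net/url implements.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", fmt.Errorf("parse page url: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href: %w", err)
	}
	// ResolveReference keeps the base scheme (e.g. https) for
	// path-relative and scheme-relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/a/b.html" // assumed sample page
	hrefs := []string{"../c.html", "/img/logo.png", "//cdn.example.com/x.js", "d.html?q=1"}
	for _, href := range hrefs {
		abs, err := resolveLink(page, href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

Parsing the page URL once per page and reusing it for every link would avoid repeated work in a real crawler; the helper parses it on each call only to stay self-contained.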