- Feature: create a table to join journals from downloads with journals from E data
- Feature: add bash script to parse NLM data and transform the useful data to CSV
- Feature: keep "ar" suffixes in article ids to make matching with E. data easier
- Feature: quote all values when exporting minimal fields (see the CSV quoting sketch below)
- Feature: add region to geolocation data
- Feature: add a flag to the parsing script to exclude selected columns from the CSV export
- Feature: add city to geolocation data
- Feature: filter HTML downloads based on journal reference data
- Feature: user can now specify source and output directories
- Feature: parsing can be stopped and resumed (see the checkpointing sketch below)
- Feature: package the parsing script in a Docker container
- Feature: add device type to the download table: "p" -> pc, "m" -> mobile, "t" -> tablet (see the mapping sketch below)
- Feature: journals now have a single (standard) classification
- Feature: all indexes are now anonymous, to make renaming easier
- Bug fix: parsing can now be stopped with Ctrl-C
- Feature: sampling script now samples all years at once
- Bug fix: parsable log line counts were wrong, which also affected robot detection
- Update robot detection to use the number of requests instead of the number of downloads as the threshold
- More robust IP parsing for geolocation (see the IP parsing sketch below)
- Added R script to draw samples from the log file set
- Database optimization: relations have been normalized to improve DB creation time, DB disk space, and query speed
- Referer support: similar referers are grouped into categories (see the grouping sketch below)
- HTML files are not counted as downloads unless the URL ends with ".html" or ".html?vue=integral" (see the predicate sketch below)
- Good robots are detected using the user agent string.
- Bad robots are detected using per-day, IP-based activity stats (both detection paths are sketched below).
- Domains are associated with domains.
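
The CSV quoting entry above can be illustrated with Python's standard `csv` module; the file name and field names below are illustrative, not the project's actual schema.

```python
import csv

# Quote every value in the minimal-fields export. csv.QUOTE_ALL wraps
# each field in quotes, so downstream tools never misread separators.
with open("minimal_fields.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["journal_id", "article_id", "device_type"])
    writer.writerow(["j1", "a1ar", "p"])  # written as "j1","a1ar","p"
```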
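
A minimal sketch of stop-and-resume parsing, assuming a checkpoint file that records which log files were already processed; the file name, function, and flow are hypothetical, not the script's actual implementation.

```python
import json
import os
import signal

CHECKPOINT = "parse_checkpoint.json"  # hypothetical checkpoint file
stop_requested = False

def _on_sigint(signum, frame):
    # Ctrl-C only sets a flag; the loop exits at a safe point below.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGINT, _on_sigint)

def parse_all(log_files):
    done = set()
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            done = set(json.load(f))
    for path in log_files:
        if stop_requested:
            break  # stopped: remaining files are handled on relaunch
        if path in done:
            continue  # already parsed in a previous run
        # ... parse one log file here ...
        done.add(path)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)  # record progress per file
```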
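
The device-type codes stored in the download table map as stated in their entry above; the dict and helper names are illustrative.

```python
# One-letter device codes and their meaning, per the changelog entry.
DEVICE_TYPES = {"p": "pc", "m": "mobile", "t": "tablet"}

def device_label(code: str) -> str:
    """Translate a device code into a readable label."""
    return DEVICE_TYPES.get(code, "unknown")
```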
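
One plausible reading of "more robust IP parsing", sketched with the standard `ipaddress` module; the changelog does not document the actual rules, so the comma-splitting and validation below are assumptions.

```python
import ipaddress

def parse_ip(raw_field: str):
    """Return a validated IP address, or None for malformed fields.

    Log fields sometimes carry several comma-separated addresses or
    stray whitespace; take the first token and validate it.
    """
    candidate = raw_field.split(",")[0].strip()
    try:
        return ipaddress.ip_address(candidate)
    except ValueError:
        return None
```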
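
Referer grouping could look like the sketch below; the category names and host patterns are placeholders, not the project's actual list.

```python
# Map similar referers to a common category by substring matching.
REFERER_CATEGORIES = {
    "search engine": ("google.", "bing.", "duckduckgo."),
    "social network": ("facebook.", "twitter.", "t.co"),
}

def referer_category(referer: str) -> str:
    host = referer.lower()
    for category, patterns in REFERER_CATEGORIES.items():
        if any(p in host for p in patterns):
            return category
    return "other"
```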
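
The HTML download rule is fully stated in its entry, so only the function name below is invented:

```python
def is_html_download(url: str) -> bool:
    """An HTML file only counts as a download when the URL ends with
    ".html" or ".html?vue=integral" (full-text view)."""
    return url.endswith((".html", ".html?vue=integral"))
```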
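
A hedged sketch of the two robot-detection paths; the agent substrings and the daily threshold are placeholders, since the changelog only says that a per-day, per-IP request count is used.

```python
# Good robots announce themselves; bad robots are inferred from volume.
KNOWN_ROBOT_AGENTS = ("googlebot", "bingbot", "crawler", "spider")
MAX_REQUESTS_PER_DAY = 1000  # illustrative threshold, not the real one

def is_good_robot(user_agent: str) -> bool:
    """Good robots identify themselves in the user agent string."""
    ua = user_agent.lower()
    return any(name in ua for name in KNOWN_ROBOT_AGENTS)

def is_bad_robot(requests_that_day: int) -> bool:
    """Bad robots are flagged from per-day, IP-based activity stats."""
    return requests_that_day > MAX_REQUESTS_PER_DAY
```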