Goodscrapes - Goodreads.com HTML-API
Updated: 2022-01-21
Since: 2014-11-05
- focuses on analysing, not updating, information on Goodreads.com (GR)
- less limited than the official API, e.g., it can read the shelves and reviews of other members: Goodscrapes can scrape thousands of fulltext reviews
- the official API is slow too, and API users are treated as second-class citizens
- in theory, this library is more likely to break, but Goodreads progresses very, very slowly: nothing actually broke between 2014 (when I started this) and 2019; their API actually seems to change more often than their web pages; they can and do disable API functions without the majority noticing, but they cannot easily disable important web pages that we use too; there are unit tests to detect markup changes on the scraped Goodreads.com website
- this library grew with every new use case and program; it retries operations on Goodreads.com errors, which are not rare (over capacity, exceptions etc.); it has seen a lot of flawed data, such as wrong review dates ("Jan 01, 1010"), which broke Time::Piece
- Goodreads "isn't eating its own dog food": https://www.goodreads.com/topic/show/18536888-is-the-public-api-maintained-at-all#comment_number_1
- slow: a version with concurrent AnyEvent::HTTP requests was only marginally faster, so I stuck with the simpler code; it doesn't actually matter much given Amazon's and Goodreads' request throttling. You can only speed things up significantly with a pool of work-sharing computers and unique IP addresses...
- just text pattern matching, no ECMAScript execution and no DOM parsing with a headless renderer (so far sufficient and faster). Regex is not meant for HTML parsing, and an HTML parser would have been easier from time to time; I would use one today. However, regular expressions proved good enough for goodreads.com, given that user-generated content is very restricted and cannot easily confuse the regex patterns. The regex code is small, too. We just look at the server response as text with some features that mark the start and end of a value of interest.
- for real-world usage examples see Andre's Goodreads Toolbox. There are unit tests in the "t" directory, too. Tests are good (up-to-date) tutorials and might help in comprehending the still terse API documentation.
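
For orientation, a minimal sketch of typical usage; assuming the module exports its g-functions by default (the export list is not shown in this excerpt) and using an illustrative book ID:

    use strict;
    use warnings;
    use Goodscrapes;

    # Illustrative ID; greadbook() and the %book fields are documented below:
    my %book = greadbook( '5759' );
    printf "%s is rated %.2f by %d members.\n",
           $book{title}, $book{avg_rating}, $book{num_ratings};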

Coding conventions:
- "_" prefix means private function or constant (for use in this module only)
- "ra" prefix means array reference
- "rh" prefix means hash reference
- "on" prefix or "fn" suffix means function variable
- constants are uppercase, functions lowercase
- Goodscrapes code in your program is usually recognizable by the 'g' or 'GOOD' prefix in the function or constant name
- common internal abbreviations: pfn = progress function, bfn = book handler function, pag = page number, nam = name, au = author, bk = book, uid = user id, bid = book id, aid = author id, rat = rating, tit = title, q = query string, slf = shelf name, shv = shelves names, t0 = start time of an operation, ret = return code, tmp = temporary helper variable, gp = group, gid = group id, us = user
- never cast an 'id' to int and never use a %d format string, even though IDs consist of digits only; always compare IDs as strings
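
A small plain-Perl illustration of that rule (the IDs are hypothetical):

    my $id  = '0765348276';           # keep IDs as strings
    my $num = '765348276';

    print "equal\n" if $id eq $num;   # string comparison: correctly unequal here
    # $id == $num would compare numerically, drop the leading zero and match
    printf "id=%s\n", $id;            # use %s, never %d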

Don't expect all attributes to be set (undef); this depends on the info available on the scraped page.

%book attributes:
- id => string
- title => string
- isbn => string
- isbn13 => string
- num_pages => int
- num_reviews => int
- num_ratings => int, e.g., 103
- avg_rating => float, e.g., 4.36; 0 if no rating
- stars => int, rounded avg_rating, e.g., 4
- format => string (binding)
- user_rating => int, number of stars 1, 2, 3, 4 or 5 (program user)
- user_read_count => int (program user)
- user_num_owned => int (program user)
- user_date_read => Time::Piece (program user)
- user_date_added => Time::Piece (program user)
- ra_user_shelves => string[] reference
- url => string
- img_url => string
- review_id => string
- year => int (original publishing date)
- year_edit => int (edition publishing date)
- rh_author => %user reference

%user attributes:
- id => string
- name => string, "Firstname Lastname"
- name_lf => string, "Lastname, Firstname"
- residence => string (might require login)
- age => int (might require login)
- num_books => int, books shelfed, not books written (even if is_author == 1)
- is_friend => bool
- is_author => bool
- is_female => bool
- is_private => bool
- is_staff => bool, true if user is a Goodreads.com employee
- is_mainstream => bool, currently guessed from the number of ratings for any book, is_author == 1
- url => string, URL to the user's profile page
- works_url => string, URL to the author's distinct works (is_author == 1)
- img_url => string
- user_min_rating => int, requires is_author == 1
- user_max_rating => int, requires is_author == 1
- user_avg_rating => float, e.g., 3.3 (user of the program), requires is_author == 1; value depends on the shelves involved
- _seen => int, incremented if user already exists in a load-target structure

%review attributes:
- id => string
- rh_user => %user reference
- book_id => string
- rating => int, with 0 meaning no rating, "added" or "marked it as abandoned" or something similar
- rating_str => string, representation of rating, e.g., 3/5 as "[*** ]", or "[TTT ]" if there's additional text, or "[ttt ]" if not longer than 160 chars
- text => string
- date => Time::Piece
- url => string, URL to the full-text review

%group attributes:
- id => string
- name => string
- url => string
- img_url => string
- num_members => int

%comment attributes:
- text => string
- rh_to_user => %user reference, the addressed user
- rh_review => %review reference, the addressed review; undefined if not a comment on a review (but on a group, another user's status, a book list, ...)
- rh_book => %book reference, undefined if rh_review is undefined and vice versa
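
Because any of these attributes can be undef, access them defensively; a minimal sketch (the fallback values are arbitrary):

    # %book from one of the read functions below:
    my $title = $book{title} // '(unknown)';
    my $year  = $book{year}  // 0;
    my $name  = $book{rh_author} ? $book{rh_author}{name} : '(no author info)';
    printf "%s (%d) by %s\n", $title, $year, $name;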

- returns a sanitized, valid Goodreads user ID, or kills the current process with an error message
- returns the given shelf name if valid; returns a shelf which includes all books if no name is given; kills the current process with an error message if the name is malformed
- gisbaduser(): returns true if the given user or author is blacklisted and would slow down any analysis
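
gisbaduser() can also be used to filter your own load targets; a sketch (the %authors hash is a hypothetical load target, and passing an ID is an assumption):

    for my $id (keys %authors)
    {
        delete $authors{$id} if gisbaduser( $id );
    }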

gmeter(): generates and returns a CLI progress-indicator function $f, with $f->( 20 ) adding 20 to the last value and printing the sum like "40 unit_str". Given a second (max value) argument, as in $f->( 10, 100 ), it prints a percentage without any unit: "10%". Given a modern terminal, the text remains at the same position if the progress function is called multiple times.
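
A sketch of both calling styles; passing the unit string as an argument to gmeter() is an assumption derived from the "unit_str" placeholder above:

    my $meter = gmeter( 'books' );   # unit string assumed here
    $meter->( 20 );                  # prints "20 books"
    $meter->( 20 );                  # prints "40 books" at the same position

    my $perc = gmeter();
    $perc->( 10, 100 );              # prints "10%"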

glogin():
- some Goodreads.com pages are only accessible by authenticated members
- some Goodreads.com pages are optimized for authenticated members (e.g., 200 books instead of 30 books per request)
- usermail => string
- userpass => string
- r_userid => string ref: set to the user ID if the variable is empty/undef [optional]
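
A login sketch with the documented parameters, reading credentials from hypothetical environment variables:

    my $userid;
    glogin( usermail => $ENV{GOOD_MAIL},   # hypothetical variable names
            userpass => $ENV{GOOD_PASS},
            r_userid => \$userid );        # filled in because it starts out undef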

Changing one or multiple library-scope parameters:
- ignore_errors => bool: disables retries for [ERROR] and [CRIT]; the process just keeps going with the next step
- maxretries => int: sets the number of retries when there is an error; critical issues are retried indefinitely (if ignore_errors is false)
- retrydelay_secs => int
- cache_days => int: sets the number of days that a resource can be loaded from local storage. Scraping Goodreads.com is a very slow process; scraped documents can be cached if you don't need them "fresh", e.g., during development time, in long-running sessions (cheap recovery on crash, power blackout or pauses), or when experimenting with parameters
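
A sketch, assuming the option setter is the library's gsetopt() function (the name is not shown in this excerpt and is an assumption):

    gsetopt( ignore_errors   => 0,
             maxretries      => 3,
             retrydelay_secs => 60,
             cache_days      => 7 );   # reuse scraped documents for a week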

%book greadbook( $book_id )

%user greaduser( $user_id, $prefer_author = 0 )
- there can be a different user and author with the same ID (2456: Joana vs Chuck Palahniuk); if there's no user but an author, Goodreads would redirect to the author page with the same ID, and this function would return the author
- if ambiguous, you can set the $prefer_author flag
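
With an ambiguous ID, the flag decides which record is returned; a short sketch using the ID from the example above:

    my %member = greaduser( '2456' );      # Joana, in the example above
    my %author = greaduser( '2456', 1 );   # prefer the author: Chuck Palahniuk
    print $author{name}, "\n" if $author{is_author};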

Reading all group memberships of the given user into rh_into:
- from_user_id => string
- rh_into => hash reference (id => %group, ...)
- on_group => sub( %group ) [optional]
- on_progress => sub, see gmeter() [optional]
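
A sketch, assuming this reader is called greadusergp() (the function heading was lost in this excerpt, so the name is an assumption):

    my %groups;
    greadusergp( from_user_id => $userid,
                 rh_into      => \%groups,
                 on_progress  => gmeter( 'groups' ));

    print "\n$_->{name}\n" for values %groups;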

greadshelf(): reads a list of books (and/or authors) present in the given shelves of the given user:
- from_user_id => string
- ra_from_shelves => string-array reference with shelf names
- rh_into => hash reference (id => %book, ...) [optional]
- rh_authors_into => hash reference (id => %user, ...) [optional]; this parameter is for convenience and also replaces the former greadauthors() function. It's not required in order to access author data, as author data is available from the book data too: $book->{rh_author}->{...}
- on_book => sub( %book ) [optional]
- on_progress => sub, see gmeter() [optional]
- doesn't add users to rh_authors_into when gisbaduser() is true
- sets the user_XXX and is_mainstream fields in each author item
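
A sketch of a typical shelf scan with the documented parameters (shelf names and targets are illustrative):

    my (%books, %authors);
    greadshelf( from_user_id    => $userid,
                ra_from_shelves => [ 'read', 'to-read' ],
                rh_into         => \%books,
                rh_authors_into => \%authors,
                on_progress     => gmeter( 'books' ));

    printf "\n%d books by %d authors\n", scalar keys %books, scalar keys %authors;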

Reading the names of all shelves of the given user:
- from_user_id => string
- ra_into => array reference
- ra_exclude => array reference: won't add the given names to the result [optional]
- Precondition: glogin()
- Postcondition: the result includes 'read', 'to-read', 'currently-reading', but doesn't include '#ALL#'

DEPRECATED: use greadshelf() with the rh_authors_into parameter.
Gets a list of authors whose books are present in the given shelves of the given user:
- from_user_id => string
- ra_from_shelves => string-array reference with shelf names
- rh_into => hash reference (id => %user, ...) [optional]
- on_progress => sub, see gmeter() [optional]
- if you need both author and book data, use greadshelf(), which also populates the author property of every book
- skips authors where gisbaduser() is true
- sets the user_XXX and is_mainstream fields in each author item

Reading the Goodreads.com list of books written by the given author:
- author_id => string
- limit => int: number of books to read into rh_into
- rh_into => hash reference (id => %book, ...)
- on_book => sub( %book ) [optional]
- on_progress => sub, see gmeter() [optional]
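
A sketch, assuming this reader is called greadauthorbk() (the name was lost in this excerpt, so it's an assumption):

    my %author_books;
    greadauthorbk( author_id   => '2456',   # e.g., the author ID from above
                   limit       => 100,
                   rh_into     => \%author_books,
                   on_progress => gmeter( 'books' ));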

Loading ratings (no text), reviews (text), "to-read", "added" etc.; you can filter later or via the on_filter parameter:
- rh_for_book => hash reference (%book), see greadbook()
- rh_into => hash reference (id => %review, ...)
- since => Time::Piece [optional]
- on_filter => sub( %review ), return 0 to drop [optional]
- on_progress => sub, see gmeter() [optional]
- dict_path => string: path to a dictionary file (1 word per line) [optional]
- text_minlen => int: overwrites the on_filter argument [optional, default 0]:
  0 = no text filtering,
  n = specified minimum length (see also the GOOD_USEFUL_REVIEW_LEN constant)
- rigor => int [optional, default 2]:
  level 0 = search newest reviews only (max 300 ratings),
  level 1 = search with a combination of filters (max 5400 ratings),
  level 2 = like 1, plus dict-search if more than 3000 ratings, with a stall-time of 2 minutes,
  level n = like 1, plus dict-search with a stall-time of n minutes
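
A sketch, assuming the reviews reader is called greadreviews() (an assumption), combined with greadbook():

    my %book = greadbook( '5759' );   # illustrative ID again
    my %reviews;
    greadreviews( rh_for_book => \%book,
                  rh_into     => \%reviews,
                  text_minlen => 120,     # fulltext reviews only
                  rigor       => 2,
                  on_progress => gmeter( 'reviews' ));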

Querying Goodreads.com for the friends and followees list of the given user:
- rh_into => hash reference (id => %user, ...)
- from_user_id => string
- on_user => sub( %user ): return false to exclude the user from rh_into [optional]
- on_progress => sub, see gmeter() [optional]
- discard_threshold => number: don't add anything to rh_into if the number of friends/followees exceeds this limit [optional]; use this to drop degenerated accounts which would just add noise to the data
- incl_authors => bool [optional, default 1]
- incl_friends => bool [optional, default 1]
- incl_followees => bool [optional, default 1]
- Precondition: glogin()
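
A sketch, assuming this function is called greadfolls() (an assumption; parameters as documented, glogin() called beforehand):

    my %followed;
    greadfolls( from_user_id => $userid,
                rh_into      => \%followed,
                incl_authors => 0,        # members only, skip followed authors
                on_progress  => gmeter( 'members' ));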

Reading a list of all comments posted by the given user on Goodreads.com (not a conversation by multiple users on some topic):
- from_user_id => string
- ra_into => array reference (%comment, ...) [optional]
- limit => int: stop after reading N comments [optional, default 0]
- on_progress => sub, see gmeter() [optional]

Building a social-network graph (nodes and edges) around the given user:
- from_user_id => string
- rh_into_nodes => hash reference (id => %user, ...)
- ra_into_edges => array reference ({from => id, to => id}, ...)
- ignore_nhood_gt => int: ignore users with a neighbourhood > N [optional, default 1000]; such users just add noise to the data and waste computing time
- depth => int [optional, default 1]
- incl_authors => bool [optional, default 0]
- incl_friends => bool [optional, default 1]
- incl_followees => bool [optional, default 1]
- on_progress => sub({ done => int, count => int, perc => int, depth => int }) [optional]
- on_user => sub( %user ): return false to exclude the user [optional]
- Precondition: glogin()
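
A sketch, assuming this graph reader is called gsocialnet() (an assumption) and that the progress callback receives the documented values as a hash reference:

    my %nodes;
    my @edges;
    gsocialnet( from_user_id  => $userid,
                rh_into_nodes => \%nodes,
                ra_into_edges => \@edges,
                depth         => 2,       # friends of friends
                on_progress   => sub
                {
                    my ($rh) = @_;        # { done, count, perc, depth }
                    printf "depth %d: %3d%%\r", $rh->{depth}, $rh->{perc};
                });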

Reading the Goodreads.com list of authors who are similar to the given author:
- rh_into => hash reference (id => %user, ...)
- author_id => string
- on_progress => sub, see gmeter() [optional]
- increments the '_seen' counter of an author if it is already present in %$rh_into

Searching the Goodreads.com database for books that match a given phrase:
- ra_into => array reference (%book, ...)
- phrase => string with space-separated keywords
- is_exact => bool [optional, default 0]
- ra_order_by => array reference: property names from %book [optional, default: 'stars', 'num_ratings', 'year']
- num_ratings => int: only list books with at least N ratings [optional, default 0]
- on_progress => sub, see gmeter() [optional]
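
A sketch, assuming the search function is called gsearch() (an assumption; parameters as documented):

    my @results;
    gsearch( ra_into     => \@results,
             phrase      => 'linux kernel',
             num_ratings => 50,                          # skip obscure books
             ra_order_by => [ 'stars', 'num_ratings' ],
             on_progress => gmeter( 'books' ));

    printf "%-50s %.2f\n", $_->{title}, $_->{avg_rating} for @results;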

string amz_book_html( %book )
- returns the HTML body of an Amazon article page

Returns a string with HTML boilerplate code for a table-based report:
- $title: HTML title and table caption
- $ra_cols: column titles with sort/alignment markup, e.g., [ "Normal", ">Sort ASC", "<Sort DESC", "!Not sortable/searchable", "Right-Aligned:", ">Sort ASC, right-aligned:", ":Centered:" ]
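
The marker characters encode sortability and alignment; a sketch, assuming the function is called ghtmlhead() (an assumption) and the markers work as listed above:

    # '>' sort ascending, '<' sort descending, '!' not sortable/searchable,
    # trailing ':' right-aligned, ':' on both sides centered:
    my $html = ghtmlhead( 'Books of My Friends',
                          [ 'Author', '!Cover', '<Rating:', ':Year:' ] );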

Returns a string with HTML boilerplate code for a table-based report (the closing counterpart of the above).

Always use this when generating HTML reports, in order to prevent cross-site scripting (XSS) attacks through malicious text on the Goodreads.com website.

Printing a year-based histogram for the given hash on the terminal:
- rh_from => hash reference (id => %any, ...)
- date_key => string: name of the Time::Piece component of any hash item [optional, default 'date']
- start_year => int [optional, default 2007]
- title => string [optional, default '...reviews...']
- bar_width => int [optional, default 40]
- bar_char => char [optional, default '#']
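
A sketch, assuming the function is called ghistogram() (an assumption), fed with the %review items loaded in an earlier sketch (they carry a 'date' Time::Piece):

    ghistogram( rh_from  => \%reviews,
                date_key => 'date',
                title    => 'Reviews per year' );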

string _amz_url( %book )
- Amazon URL for the given book; requires at least {isbn => string}

URL for a page with a list of books (not all books):
- "&print=true" allows 200 items per page with a single request, which is a huge speed improvement over loading books from the "normal" view with a maximum of 20 books per request. Showing 100 books in the normal view is oddly realized by 5 AJAX requests on the Goodreads.com website.
- "&per_page" in print view can be any number if you work with your own shelf, otherwise max 200 in print view; it is ignored in the non-print view; per_page > 20 requires access with a cookie, see glogin()
- "&view=table" puts all book data into the code, although invisible (display=none)
- "&sort=rating" is important for `friendrated.pl` with its book limit: some users have read 9000+ books, and scraping would take forever. We sort lower-rated books to the end and can just scrape the first pages: even those with 9000+ books haven't top-rated more than 2700 books.
- "&shelf" supports the intersection "shelf1%2Cshelf2" (comma)
- Warning: changes to the URL structure will bust the file-cache

URL for a page with a list of the people $user is following:
- Warning: changes to the URL structure will bust the file-cache

URL for a page with a list of people befriended by $user_id:
- "&sort=date_added" (as opposed to 'last online') avoids moving targets while reading page by page
- "&skip_mutual_friends=false" because we're not doing this just for me
- Warning: changes to the URL structure will bust the file-cache

string _revs_url( $book_id, $str_sort_newest_oldest = undef, $search_text = undef, $rating = undef, $is_text_only = undef, $page_number = 1 )
- "&sort=newest" and "&sort=oldest" reduce the number of reviews for some reason (also observable on the Goodreads website), so use them only if really needed (default is &sort=default)
- "&search_text=example" invalidates the sort-order argument
- "&rating=5"
- "&text_only=true" returns just 1 page; you might get more text reviews without this flag
- the maximum number of retrievable pages is 10 (300 reviews), see https://www.goodreads.com/topic/show/18937232-why-can-t-we-see-past-page-10-of-book-s-reviews?comment=172163745#comment_172163745
- seems less throttled, though not for text search
- a page number > N just returns the same page, so there is no easy stop criterion; not sure if there's more than one page, though
- "&q=" is URL-encoded, e.g., linux+%40+"hase (linux @ "hase)

%book _extract_book( $book_page_html_str )

%user _extract_user( $user_page_html_str )

%user _extract_author( $user_page_html_str )

bool _extract_books( $rh_books, $rh_authors, $on_book_fn, $on_progress_fn, $shelf_tableview_html_str )
- $rh_books: (id => %book, ...)
- $rh_authors: (id => %user, ...)
- returns 0 if no books, 1 if books, 2 if error

- $rh_books: (id => %book, ...)
- $r_limit: is counted down to zero
- returns 0 if no books, 1 if books, 2 if error

bool _extract_followees( $rh_users, $on_progress_fn, $incl_authors, $discard_threshold, $following_page_html_str )
- $rh_users: (user_id => %user, ...)
- returns 0 if no followees, 1 if followees, 2 if error

bool _extract_friends( $rh_users, $on_progress_fn, $incl_authors, $discard_threshold, $friends_page_html_str )
- $rh_users: (user_id => %user, ...)
- returns 0 if no friends, 1 if friends, 2 if error

- converts Unicode codepoints such as \u003c (to '<')

bool _extract_revs( $rh_revs, $on_progress_fn, $filter_fn, $since_time_piece, $reviews_xhr_html_str )
- $rh_revs: (review_id => %review, ...)
- returns 0 if no reviews, 1 if reviews, 2 if error

bool _extract_similar_authors( $rh_into, $author_id_to_skip, $on_progress_fn, $similar_page_html_str )
- returns 0 if no authors, 1 if authors, 2 if error
- result pages sometimes have a different number of items: p1: 20, p2: 16, p3: 19
- the website says "about 75 results" but shows 70 (checked manually), so we fake "100%" to the progress-indicator function at the end, otherwise it would stop at "93%"

- $ra_books: (%book, ...)
- returns 0 if no books, 1 if books, 2 if error

- returns 0 if no groups, 1 if groups, 2 if error

- Example:
    my $csrftok = _extract_csrftok( _html( _user_url( $uid ) ) );
    $curl->setopt( $curl->CURLOPT_HTTPHEADER, [ "X-CSRF-Token: ${csrftok}" ] );

Returns $_ENO_XXX constants:
- warn if sign-in page (https://www.goodreads.com/user/sign_in) or in-page message
- warn if "page unavailable, Goodreads request took too long"
- warn if "page not found"
- error if page unavailable: "An unexpected error occurred. We will investigate this problem as soon as possible"
- error if over capacity (TODO UNTESTED): "<?>Goodreads is over capacity.</?> <?>You can never have too many books, but Goodreads can sometimes have too many visitors. Don't worry! We are working to increase our capacity.</?> <?>Please reload the page to try again.</?> <a ...>get the latest on Twitter</a>" https://pbs.twimg.com/media/DejvR6dUwAActHc.jpg https://pbs.twimg.com/media/CwMBEJAUIAA2bln.jpg https://pbs.twimg.com/media/CFOw6YGWgAA1H9G.png (with title)
- error if maintenance mode (TODO UNTESTED): "<?>Goodreads is down for maintenance.</?> <?>We expect to be back within minutes. Please try again soon!<?> <a ...>Get the latest on Twitter</a>" https://pbs.twimg.com/media/DgKMR6qXUAAIBMm.jpg https://i.redditmedia.com/-Fv-2QQx2DeXRzFBRKmTof7pwP0ZddmEzpRnQU1p9YI.png
- error if website temporarily unavailable (TODO UNTESTED): "Our website is currently unavailable while we make some improvements to our service. We'll be open for business again soon, please come back shortly to try again. <?> Thank you for your patience." (No Alice error) https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/hostedimages/1404319071i/10224522.png

- updates "_session_id2" for X-CSRF-Token, "csid", "u" (user?), "p" (password?)
- Sets default options for GET, POST, PUT, DELETE

Returns the HTML body of a web document:
- caches documents (if $can_cache is true)
- retries on errors