This repository has been archived by the owner on Mar 9, 2023. It is now read-only.

1042 lines (573 loc) · 28.5 KB

Goodscrapes.pod

NAME

Goodscrapes - Goodreads.com HTML-API

VERSION

  • Updated: 2022-01-21

  • Since: 2014-11-05

COMPARED TO THE OFFICIAL API

  • focuses on analysing, not updating info on GR

  • less limited, e.g., reading shelves and reviews of other members: Goodscrapes can scrape thousands of fulltext reviews.

  • the official API is slow too; API users are even second-class citizens

  • theoretically this library is more likely to break, but Goodreads progresses very, very slowly: nothing actually broke between 2014 (when I started this) and 2019; their API actually seems to change more often than their web pages; they can and do disable API functions without most people noticing, but they cannot easily disable important webpages that we use too. There are unit tests to detect markup changes on the scraped Goodreads.com website.

  • this library grew with every new use case and program; it retries operations on Goodreads.com errors, which are not rare (over capacity, exceptions, etc.); it has seen a lot of flawed data such as wrong review dates ("Jan 01, 1010"), which broke Time::Piece.

  • Goodreads "isn't eating its own dog food" https://www.goodreads.com/topic/show/18536888-is-the-public-api-maintained-at-all#comment_number_1

LIMITATIONS

  • slow: a version with concurrent AnyEvent::HTTP requests was only marginally faster, so I stuck with the simpler code; it doesn't actually matter due to Amazon's and Goodreads' request throttling. You can only speed things up significantly with a pool of work-sharing computers and unique IP addresses...

  • just text pattern matching, no ECMAScript execution or DOM parsing with a headless renderer (so far sufficient and faster). Regex is not meant for HTML parsing, and an HTML parser would have been easier from time to time; I would use one today. However, regular expressions proved good enough for goodreads.com, given that user-generated content is very restricted and cannot easily confuse the regex patterns. The regex code is small too. We just look at the server response as text with some features which mark the start and end of a value of interest.
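The "start and end features" idea can be sketched roughly like this. The `extract_between` helper, the markers and the HTML snippet are made up for illustration; they are neither Goodscrapes code nor actual Goodreads markup.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between a start and an end marker,
# or undef if the markers are not found.
sub extract_between
{
    my ( $html, $start, $end ) = @_;
    return $html =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

# Made-up markup, for illustration only
my $html = '<h1 id="bookTitle">  Brave New World </h1>'
         . '<span itemprop="ratingValue">3.99</span>';

my $title  = extract_between( $html, '<h1 id="bookTitle">',     '</h1>'   );
my $rating = extract_between( $html, 'itemprop="ratingValue">', '</span>' );

$title =~ s/^\s+|\s+$//g if defined $title;   # trim surrounding whitespace

print "title=$title rating=$rating\n";
```

The non-greedy `(.*?)` stops at the first end marker, which is what keeps such patterns small as long as the page markup stays predictable.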

HOW TO USE

  • for real-world usage examples see Andre's Goodreads Toolbox. There are unit tests in the "t" directory, too. Tests are good (up-to-date) tutorials and might help in understanding the still terse API documentation.

  • _ prefix means private function or constant (use in module only)

  • ra prefix means array reference, rh prefix means hash reference

  • on prefix or fn suffix means function variable

  • constants are uppercase, functions lowercase

  • Goodscrapes code in your program is usually recognizable by the 'g' or 'GOOD' prefix in the function or constant name

  • common internal abbreviations: pfn = progress function, bfn = book handler function, pag = page number, nam = name, au = author, bk = book, uid = user id, bid = book id, aid = author id, rat = rating, tit = title, q = query string, slf = shelf name, shv = shelves names, t0 = start time of an operation, ret = return code, tmp = temporary helper variable, gp = group, gid = group id, us = user

AUTHOR

https://github.com/andre-st/

DATA STRUCTURES

Note

  • never cast 'id' to int or use a %d format string, even though it contains digits only; compare IDs as strings

  • don't expect all attributes to be set (they may be undef); this depends on the info available on the scraped page
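A small sketch of both notes, using made-up IDs and a made-up %book item. It only shows general Perl behaviour (numeric vs. string comparison, defaulting undef values), not anything specific to Goodscrapes internals.

```perl
use strict;
use warnings;

# Made-up IDs: numerically equal, but different strings
my $id_a = '0123';
my $id_b = '123';

my $same_as_number = ( $id_a == $id_b ) ? 1 : 0;   # 1: numeric comparison conflates them
my $same_as_string = ( $id_a eq $id_b ) ? 1 : 0;   # 0: string comparison keeps them apart

# Made-up %book item where 'num_pages' was not available on the scraped page
my %book  = ( id => '123', title => 'Some Title' );
my $pages = $book{num_pages} // 'n/a';             # default for missing attributes

printf "%s: %s pages\n", $book{id}, $pages;        # %s for the id, not %d
```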

%book

  • id => string

  • title => string

  • isbn => string

  • isbn13 => string

  • num_pages => int

  • num_reviews => int

  • num_ratings => int 103 for example

  • avg_rating => float 4.36 for example, 0 if no rating

  • stars => int rounded avg_rating, e.g., 4

  • format => string (binding)

  • user_rating => int number of stars 1,2,3,4 or 5 (program user)

  • user_read_count => int (program user)

  • user_num_owned => int (program user)

  • user_date_read => Time::Piece (program user)

  • user_date_added => Time::Piece (program user)

  • ra_user_shelves => string[] reference

  • url => string

  • img_url => string

  • review_id => string

  • year => int (original publishing date)

  • year_edit => int (edition publishing date)

  • rh_author => %user reference

%user

  • id => string

  • name => string "Firstname Lastname"

  • name_lf => string "Lastname, Firstname"

  • residence => string (might require login)

  • age => int (might require login)

  • num_books => int books shelved, not books written (even if is_author == 1)

  • is_friend => bool

  • is_author => bool

  • is_female => bool

  • is_private => bool

  • is_staff => bool true if user is a Goodreads.com employee

  • is_mainstream => bool currently guessed from the number of ratings for any book; requires is_author == 1

  • url => string URL to the user's profile page

  • works_url => string URL to the author's distinct works (is_author == 1)

  • img_url => string

  • user_min_rating => int requires is_author == 1

  • user_max_rating => int requires is_author == 1

  • user_avg_rating => float 3.3 for example (user of the program), requires is_author == 1, value depends on the shelves involved

  • _seen => int incremented if user already exists in a load-target structure

%review

  • id => string

  • rh_user => %user reference

  • book_id => string

  • rating => int with 0 meaning no rating, "added" or "marked it as abandoned" or something similar

  • rating_str => string representation of rating, e.g., 3/5 as "[*** ]" or "[TTT ]" if there's additional text, or "[ttt ]" if not longer than 160 chars

  • text => string

  • date => Time::Piece

  • url => string full text review
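One possible reading of the rating_str notation, as a sketch. The function name `rating_str_sketch` is hypothetical, and padding the string to five slots is an assumption; the library may format it differently.

```perl
use strict;
use warnings;

# Hypothetical formatter:
#   '*' = rating without text
#   't' = rating with text of up to 160 chars
#   'T' = rating with longer text
sub rating_str_sketch
{
    my ( $rating, $text ) = @_;
    my $len  = length( $text // '' );
    my $char = $len == 0 ? '*' : $len > 160 ? 'T' : 't';
    return '[' . ( $char x $rating ) . ( ' ' x ( 5 - $rating ) ) . ']';
}

print rating_str_sketch( 3 ), "\n";                 # [***  ]
print rating_str_sketch( 3, 'x' x 200 ), "\n";      # [TTT  ]
print rating_str_sketch( 5, 'short note' ), "\n";   # [ttttt]
```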

%group

  • id => string

  • name => string

  • url => string

  • img_url => string

  • num_members => int

%comment

  • text => string

  • rh_to_user => %user reference, addressed user

  • rh_review => %review reference, addressed review, undefined if not a comment on a review (but on a group, another user's status, book list, ...)

  • rh_book => %book reference, undefined if rh_review is undefined and vice versa

PUBLIC ROUTINES

string gverifyuser( $user_id_to_verify )

  • returns a sanitized, valid Goodreads user id or kills the current process with an error message

string gverifyshelf( $name_to_verify )

  • returns the given shelf name if valid

  • returns a shelf which includes all books if no name given

  • kills the current process with an error message if name is malformed

bool gisbaduser( $user_or_author_id )

  • returns true if the given user or author is blacklisted and would slow down any analysis

sub gmeter( $unit_str = '' )

  • generates and returns a CLI progress indicator function $f, with $f->( 20 ) adding 20 to the last value and printing the sum like "40 unit_str". Given a second (max value) argument $f->( 10, 100 ), it will print a percentage without any unit: "10%". Given a modern terminal, the text remains at the same position if the progress function is called multiple times.
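The closure behaviour described for gmeter() could look roughly like the following sketch. `make_meter` is a hypothetical stand-in that returns the text instead of printing it, so it says nothing about how gmeter() handles the terminal.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a gmeter()-like factory
sub make_meter
{
    my $unit = shift // '';
    my $sum  = 0;                       # state kept between calls
    return sub {
        my ( $add, $max ) = @_;
        $sum += $add;
        return ( defined( $max ) && $max > 0 )
            ? sprintf( '%d%%', $sum * 100 / $max )   # percentage, no unit
            : "$sum $unit";                          # running sum with unit
    };
}

my $f      = make_meter( 'books' );
my $first  = $f->( 20 );        # "20 books"
my $second = $f->( 20 );        # "40 books"

my $g    = make_meter();
my $perc = $g->( 10, 100 );     # "10%"

print "$first, $second, $perc\n";
```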

void glogin({ ... })

  • some Goodreads.com pages are only accessible by authenticated members

  • some Goodreads.com pages are optimized for authenticated members (e.g. get 200 books vs 30 books per request)

  • usermail => string

  • userpass => string

  • r_userid => string ref set user ID if variable is empty/undef [optional]

void gsetopt({ ... })

  • change one or multiple library-scope parameters

  • ignore_errors => bool disables retries for [ERROR] and [CRIT]; the process just keeps going with the next step

  • maxretries => int sets number of retries when there is an error, critical issues are retried indefinitely (if ignore_errors is false)

  • retrydelay_secs => int

  • cache_days => int sets the number of days that a resource can be loaded from the local storage. Scraping Goodreads.com is a very slow process; scraped documents can be cached if you don't need them "fresh" during development time or long running sessions (cheap recovery on crash, power blackout or pauses), or when experimenting with parameters
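How maxretries, retrydelay_secs and ignore_errors might interact can be sketched as below. `with_retries` is a hypothetical wrapper and its control flow is an assumption; it does not model the indefinite retries on critical issues.

```perl
use strict;
use warnings;

# Hypothetical retry wrapper; option names mirror the gsetopt() keys
sub with_retries
{
    my ( $action, %opt ) = @_;
    my $maxretries = $opt{maxretries}      // 3;
    my $delay      = $opt{retrydelay_secs} // 0;

    for my $attempt ( 0 .. $maxretries )
    {
        my $result = eval { $action->() };
        return $result unless $@;           # success
        last if $opt{ignore_errors};        # no retries, caller moves on
        sleep $delay if $delay && $attempt < $maxretries;
    }
    return;                                 # gave up
}

# Made-up flaky operation: fails twice, then succeeds
my $calls = 0;
my $flaky = sub { die "over capacity\n" if ++$calls < 3; return 'page body' };

my $body = with_retries( $flaky, maxretries => 5, retrydelay_secs => 0 );
print "got '$body' after $calls calls\n";   # got 'page body' after 3 calls
```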

%book greadbook( $book_id )

%user greaduser( $user_id, $prefer_author = 0 )

  • there can be a different user and author with the same ID (2456: Joana vs Chuck Palahniuk); if there's no user but an author, Goodreads would redirect to the author page with the same ID and this function would return the author

  • if ambiguous you can set the $prefer_author flag

void greadusergp({ ... })

  • reads all group memberships of the given user into rh_into

  • from_user_id => string

  • rh_into => hash reference (id => %group,...)

  • on_group => sub( %group ) [optional]

  • on_progress => sub see gmeter() [optional]

void greadshelf({ ... })

  • reads a list of books (and/or authors) present in the given shelves of the given user

  • from_user_id => string

  • ra_from_shelves => string-array reference with shelf names

  • rh_into => hash reference (id => %book,...) [optional]

  • rh_authors_into => hash reference (id => %user,...) [optional]; this parameter is for convenience and also replaces the former greadauthors() function. It's not required to access author data as author data is available from the book data too: $book->{rh_author}->{...}

  • on_book => sub( %book ) [optional]

  • on_progress => sub see gmeter() [optional]

  • doesn't add users to rh_authors_into when gisbaduser() is true

  • sets the user_XXX and is_mainstream fields in each author item

void greadshelfnames({ ... })

  • reads the names of all shelves of the given user

  • from_user_id => string

  • ra_into => array reference

  • ra_exclude => array reference won't add given names to the result [optional]

  • Precondition: glogin()

  • Postcondition: result includes 'read', 'to-read', 'currently-reading', but doesn't include '#ALL#'

void _update_author_stats(rh_from_books)

  • sets the user_XXX and is_mainstream fields in each author item

void greadauthors({ ... })

  • DEPRECATED: use greadshelf() with rh_authors_into parameter

  • gets a list of authors whose books are present in the given shelves of the given user

  • from_user_id => string

  • ra_from_shelves => string-array reference with shelf names

  • rh_into => hash reference (id => %user,...) [optional]

  • on_progress => sub see gmeter() [optional]

  • If you need authors and books data, then use greadshelf which also populates the author property of every book

  • skips authors where gisbaduser() is true

  • sets the user_XXX and is_mainstream fields in each author item

void greadauthorbk({ ... })

  • reads the Goodreads.com list of books written by the given author

  • author_id => string

  • limit => int number of books to read into rh_into

  • rh_into => hash reference (id => %book,...)

  • on_book => sub( %book ) [optional]

  • on_progress => sub see gmeter() [optional]

void greadreviews({ ... })

  • loads ratings (no text), reviews (text), "to-read", "added" etc; you can filter later or via on_filter parameter

  • rh_for_book => hash reference %book, see greadbook()

  • rh_into => hash reference (id => %review,...)

  • since => Time::Piece [optional]

  • on_filter => sub( %review ), return 0 to drop [optional]

  • on_progress => sub see gmeter() [optional]

  • dict_path => string path to a dictionary file (1 word per line) [optional]

  • text_minlen => int overwrites on_filter argument [optional, default 0 ]

    0  =  no text filtering
    n  =  specified minimum length (see also GOOD_USEFUL_REVIEW_LEN constant)
  • rigor => int [optional, default 2]

    level 0   = search newest reviews only (max 300 ratings)
    level 1   = search with a combination of filters (max 5400 ratings)
    level 2   = like 1 plus dict-search if more than 3000 ratings with stall-time of 2 minutes
    level n   = like 1 plus dict-search with stall-time of n minutes

void greadfolls({ ... })

  • queries Goodreads.com for the friends and followees list of the given user

  • rh_into => hash reference (id => %user,...)

  • from_user_id => string

  • on_user => sub( %user ) return false to exclude user from $rh_into [optional]

  • on_progress => sub see gmeter() [optional]

  • discard_threshold => number don't add anything to $rh_into if the number of folls exceeds this limit [optional]; use this to drop degenerate accounts which would just add noise to the data

  • incl_authors => bool [optional, default 1]

  • incl_friends => bool [optional, default 1]

  • incl_followees => bool [optional, default 1]

  • Precondition: glogin()

void greadcomments({ ... })

  • reads a list of all comments posted from the given user on goodreads.com; it does not read a conversation by multiple users on some topic

  • from_user_id => string

  • ra_into => array reference (%comment,...) [optional]

  • limit => int stop after reading N comments [optional, default 0 ]

  • on_progress => sub see gmeter() [optional]

void gsocialnet({ ... })

  • from_user_id => string

  • rh_into_nodes => hash reference (id => %user,...)

  • ra_into_edges => array reference ({from => id, to => id},...)

  • ignore_nhood_gt => int ignore users with a neighbourhood > N [optional, default 1000]; such users just add noise to the data and waste computing time

  • depth => int [optional, default 1]

  • incl_authors => bool [optional, default 0]

  • incl_friends => bool [optional, default 1]

  • incl_followees => bool [optional, default 1]

  • on_progress => sub({ done => int, count => int, perc => int, depth => int }) [optional]

  • on_user => sub( %user ) return false to exclude user [optional]

  • Precondition: glogin()
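To illustrate the nodes/edges output shape, here is a sketch of a breadth-first traversal over made-up in-memory data. The `crawl` function, the meaning given to depth, and the decision to keep an oversized user as a node without expanding it are all assumptions, not the library's actual behaviour.

```perl
use strict;
use warnings;

# Made-up "who follows whom" data standing in for scraped pages
my %folls_of = (
    '1' => [ '2', '3' ],
    '2' => [ '3', '4' ],
    '3' => [ 'x1', 'x2', 'x3', 'x4', 'x5' ],   # oversized neighbourhood
    '4' => [],
);

# Hypothetical traversal filling structures shaped like
# rh_into_nodes (id => %user) and ra_into_edges ({from => id, to => id})
sub crawl
{
    my ( $start, $depth, $ignore_nhood_gt, $rh_nodes, $ra_edges ) = @_;
    my @queue = ( [ $start, 0 ] );
    $rh_nodes->{$start} = { id => $start };

    while( my $item = shift @queue )
    {
        my ( $id, $d ) = @$item;
        next if $d >= $depth;                       # depth limit reached
        my @nhood = @{ $folls_of{$id} // [] };
        next if @nhood > $ignore_nhood_gt;          # too noisy, don't expand

        for my $to ( @nhood )
        {
            push @$ra_edges, { from => $id, to => $to };
            next if exists $rh_nodes->{$to};        # already known
            $rh_nodes->{$to} = { id => $to };
            push @queue, [ $to, $d + 1 ];
        }
    }
}

my ( %nodes, @edges );
crawl( '1', 2, 3, \%nodes, \@edges );

printf "%d nodes, %d edges\n", scalar( keys %nodes ), scalar( @edges );   # 4 nodes, 4 edges
```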

void greadsimilaraut({ ... })

  • reads the Goodreads.com list of authors who are similar to the given author

  • rh_into => hash reference (id => %user,...)

  • author_id => string

  • on_progress => sub see gmeter() [optional]

  • increments '_seen' counter of each author if already in %$rh_into

void gsearch({ ... })

  • searches the Goodreads.com database for books that match a given phrase

  • ra_into => array reference (%book,...)

  • phrase => string with space separated keywords

  • is_exact => bool [optional, default 0]

  • ra_order_by => array reference property names from %book [optional, default: 'stars', 'num_ratings', 'year']

  • num_ratings => int only list books with at least N ratings [optional, default 0]

  • on_progress => sub see gmeter() [optional]
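The effect of ra_order_by and num_ratings can be pictured with a multi-key sort over made-up results. `order_by` is a hypothetical helper, and descending order for every key is an assumption.

```perl
use strict;
use warnings;

# Made-up search results
my @books = (
    { title => 'A', stars => 4, num_ratings => 120, year => 2001 },
    { title => 'B', stars => 5, num_ratings =>  40, year => 1999 },
    { title => 'C', stars => 4, num_ratings => 900, year => 2010 },
    { title => 'D', stars => 4, num_ratings => 900, year => 2015 },
);

# Hypothetical multi-key sort: earlier keys win, later keys break ties
sub order_by
{
    my ( $ra_books, @keys ) = @_;
    return sort {
        my $cmp = 0;
        for my $k ( @keys )
        {
            $cmp = ( $b->{$k} // 0 ) <=> ( $a->{$k} // 0 );
            last if $cmp;
        }
        $cmp;
    } @$ra_books;
}

my @ordered = order_by( \@books, qw( stars num_ratings year ) );
my @popular = grep { $_->{num_ratings} >= 100 } @ordered;   # at least N ratings

print join( ' ', map { $_->{title} } @ordered ), "\n";   # B D C A
print join( ' ', map { $_->{title} } @popular ), "\n";   # D C A
```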

string amz_book_html( %book )

  • HTML body of an Amazon article page

PUBLIC REPORT-GENERATION HELPERS

string ghtmlhead( $title, $ra_cols )

  • returns a string with HTML boilerplate code for a table-based report (header part)

  • $title: HTML title, Table caption

  • $ra_cols: [ "Normal", ">Sort ASC", "<Sort DESC", "!Not sortable/searchable", "Right-Aligned:", ">Sort ASC, right-aligned:", ":Centered:" ]

string ghtmlfoot()

  • returns a string with HTML boilerplate code for a table-based report (footer part)

string ghtmlsafe($string)

  • always use this when generating HTML reports in order to prevent cross site scripting attacks (XSS) through malicious text on the Goodreads.com website
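Why escaping matters can be shown with a minimal sketch. `html_escape` is a hypothetical illustration of entity escaping in general, not the implementation of ghtmlsafe(); in reports built with this library you would call ghtmlsafe() itself.

```perl
use strict;
use warnings;

# Hypothetical escaper, for illustration only
sub html_escape
{
    my $s   = shift // '';
    my %ent = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                '"' => '&quot;', "'" => '&#39;' );
    $s =~ s/([&<>"'])/$ent{$1}/g;
    return $s;
}

# Made-up malicious review text
my $evil = q{<script>alert("x")</script>};
my $cell = '<td>' . html_escape( $evil ) . '</td>';

print "$cell\n";   # the script tag arrives in the report as harmless text
```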

void ghistogram({ ... })

  • prints a year-based histogram for the given hash on the terminal

  • rh_from => hash reference (id => %any,...)

  • date_key => string name of the Time::Piece component of any hash item [optional, default 'date']

  • start_year => int [optional, default 2007]

  • title => string [optional, default '...reviews...']

  • bar_width => int [optional, default 40]

  • bar_char => char [optional, default '#']
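A rough sketch of a year-based histogram over items with a Time::Piece component. `histogram_lines` is hypothetical, returns lines instead of printing, ignores start_year and title, and its output layout is an assumption.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: counts items per year and scales bars to bar_width
sub histogram_lines
{
    my ( $rh_from, %opt ) = @_;
    my $date_key  = $opt{date_key}  // 'date';
    my $bar_width = $opt{bar_width} // 40;
    my $bar_char  = $opt{bar_char}  // '#';

    my %count;
    for my $item ( values %$rh_from )
    {
        next unless defined $item->{$date_key};   # attribute may be missing
        $count{ $item->{$date_key}->year }++;
    }

    my ( $max ) = sort { $b <=> $a } values %count;
    my @lines;
    for my $year ( sort { $a <=> $b } keys %count )
    {
        my $bar = $bar_char x int( $count{$year} / $max * $bar_width );
        push @lines, sprintf( '%d %s %d', $year, $bar, $count{$year} );
    }
    return @lines;
}

# Made-up review dates
my @dates = qw( 2019-03-01 2019-07-15 2020-01-02 2020-02-03
                2020-06-30 2020-12-24 2021-11-30 );
my ( %reviews, $n );
$reviews{ 'r' . ++$n } = { date => Time::Piece->strptime( $_, '%Y-%m-%d' ) } for @dates;

my @lines = histogram_lines( \%reviews, bar_width => 8 );
print "$_\n" for @lines;
```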

PRIVATE URL-GENERATION ROUTINES

string _amz_url( %book )

  • Requires at least {isbn=>string}

string _shelf_url( $user_id, $shelf_name, $page_number = 1 )

  • URL for a page with a list of books (not all books)

  • "&print=true" allows 200 items per page with a single request, which is a huge speed improvement over loading books from the "normal" view with max 20 books per request. Showing 100 books in normal view is, oddly, implemented as 5 AJAX requests on the Goodreads.com website.

  • "&per_page" in print-view can be any number if you work with your own shelf, otherwise max 200 if print view; ignored in non-print view; per_page>20 requires access with a cookie, see glogin()

  • "&view=table" puts all book data in code, although invisible (display=none)

  • "&sort=rating" is important for `friendrated.pl` with its book limit: Some users read 9000+ books and scraping would take forever. We sort lower-rated books to the end and could just scrape the first pages: Even those with 9000+ books haven't top-rated more than 2700 books.

  • "&shelf" supports intersection "shelf1%2Cshelf2" (comma)

  • Warning: changes to the URL structure will bust the file-cache
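The parameters above could be combined along these lines. `shelf_url_sketch` is hypothetical; the base path and the `page` parameter name are assumptions, and shelf names are not URL-encoded here apart from the joining comma.

```perl
use strict;
use warnings;

# Hypothetical URL builder using the query parameters discussed above
sub shelf_url_sketch
{
    my ( $user_id, $ra_shelves, $page ) = @_;
    $page //= 1;
    my $shelves = join '%2C', @$ra_shelves;   # "%2C" (comma) = shelf intersection
    return "https://www.goodreads.com/review/list/$user_id"   # assumed base path
         . "?shelf=$shelves&page=$page"
         . '&print=true&per_page=200&view=table&sort=rating';
}

my $url = shelf_url_sketch( '12345', [ 'read', 'sci-fi' ], 2 );
print "$url\n";
```

The user ID is interpolated as a string, in line with the note on IDs in DATA STRUCTURES.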

string _followees_url( $user_id, $page_number = 1 )

  • URL for a page with a list of the people $user is following

  • Warning: changes to the URL structure will bust the file-cache

string _friends_url( $user_id, $page_number = 1 )

  • URL for a page with a list of people who are friends with $user_id

  • "&sort=date_added" (as opposed to 'last online') avoids moving targets while reading page by page

  • "&skip_mutual_friends=false" because we're not doing this just for me

  • Warning: changes to the URL structure will bust the file-cache

string _book_url( $book_id )

string _user_url( $user_id, $is_author = 0 )

string _revs_url( $book_id, $str_sort_newest_oldest = undef, $search_text = undef, $rating = undef, $is_text_only = undef, $page_number = 1 )

string _rev_url( $review_id )

string _author_books_url( $user_id, $page_number = 1 )

string _author_followings_url( $author_id, $page_number = 1 )

string _similar_authors_url( $author_id )

  • page number > N just returns the same page, so there is no easy stop criterion; not sure if there's more than one page, though

string _search_url( $phrase_str, $page_number = 1 )

  • "&q=" URL-encoded, e.g., linux+%40+"hase (linux @ "hase)

string _user_groups_url( $user_id, $page_number = 1 )

string _group_url( $group_id )

string _comments_url( $user_id, $page_number = 1 )

PRIVATE HTML-EXTRACTION ROUTINES

%book _extract_book( $book_page_html_str )

%user _extract_user( $user_page_html_str )

%user _extract_author( $user_page_html_str )

bool _extract_books( $rh_books, $rh_authors, $on_book_fn, $on_progress_fn, $shelf_tableview_html_str )

  • $rh_books: (id => %book,...)

  • $rh_authors: (id => %user,...)

  • returns 0 if no books, 1 if books, 2 if error

bool _extract_author_books( $rh_books, $r_limit, $on_book_fn, $on_progress_fn, $html_str )

  • $rh_books: (id => %book,...)

  • $r_limit: is counted to zero

  • returns 0 if no books, 1 if books, 2 if error

bool _extract_followees( $rh_users, $on_progress_fn, $incl_authors, $discard_threshold, $following_page_html_str )

  • $rh_users: (user_id => %user,...)

  • returns 0 if no followees, 1 if followees, 2 if error

bool _extract_friends( $rh_users, $on_progress_fn, $incl_authors, $discard_threshold, $friends_page_html_str )

  • $rh_users: (user_id => %user,...)

  • returns 0 if no friends, 1 if friends, 2 if error

bool _extract_comments( $ra, $on_progress, $comment_history_html_str )

string _conv_uni_codepoints( $string )

  • converts Unicode codepoint escapes such as \u003c into the corresponding characters
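What such a conversion does can be sketched as follows. `conv_uni_codepoints_sketch` is a hypothetical re-implementation for illustration and only handles four-digit \uXXXX escapes.

```perl
use strict;
use warnings;

# Hypothetical converter: replaces each \uXXXX escape with its character
sub conv_uni_codepoints_sketch
{
    my $s = shift;
    $s =~ s/\\u([0-9a-fA-F]{4})/chr( hex( $1 ) )/ge;
    return $s;
}

# Single quotes keep the backslashes literal, as in a raw server response
my $raw   = '\u003cb\u003eHello\u003c/b\u003e';
my $plain = conv_uni_codepoints_sketch( $raw );

print "$plain\n";   # <b>Hello</b>
```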

string _dec_entities( $string )

$value _require_arg( $name, $value )

string _trim( $string )

bool _extract_revs( $rh_revs, $on_progress_fn, $filter_fn, $since_time_piece, $reviews_xhr_html_str )

  • $rh_revs: (review_id => %review,...)

  • returns 0 if no reviews, 1 if reviews, 2 if error

bool _extract_similar_authors( $rh_into, $author_id_to_skip, $on_progress_fn, $similar_page_html_str )

  • returns 0 if no authors, 1 if authors, 2 if error

bool _extract_search_books( $ra_books, $on_progress_fn, $search_result_html_str )

  • result pages sometimes have different number of items: P1: 20, P2: 16, P3: 19

  • website says "about 75 results" but shows 70 (I checked that manually). So we fake "100%" to the progress indicator function at the end, otherwise it stops with "93%".

  • ra_books: (%book,...)

  • returns 0 if no books, 1 if books, 2 if error

bool _extract_user_groups( $rh_into, $on_group_fn, $on_progress_fn, $groups_html_str )

  • returns 0 if no groups, 1 if groups, 2 if error

string _extract_csrftok( $html )

Example: my $csrftok = _extract_csrftok( _html( _user_url( $uid ) ) ); $curl->setopt( $curl->CURLOPT_HTTPHEADER, [ "X-CSRF-Token: ${csrftok}",

PRIVATE I/O PLUMBING SUBROUTINES

int _check_page( $any_html_str )

void _updcookie( $string_with_changed_fields )

updates "_session_id2" for X-CSRF-Token, "csid", "u" (user?), "p" (password?)

void _setcurlopts( $curl_ref , $url_str )

Sets default options for GET, POST, PUT, DELETE

string _html( $url, $warn_level = $_ENO_WARN, $can_cache = 1 )

  • HTML body of a web document

  • caches documents (if $can_cache is true)

  • retries on errors
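A sketch of URL-keyed file caching, which also shows why a changed URL structure misses the cache. Everything here is an assumption for illustration: `cached_html`, the `fetch` stand-in, the MD5-based file name and the freshness check are not taken from the library.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );
use File::Temp  qw( tempdir );

my $cache_dir  = tempdir( CLEANUP => 1 );
my $cache_days = 31;
my $fetches    = 0;

# Hypothetical stand-in for the real network request
sub fetch { $fetches++; return "<html>body of $_[0]</html>" }

sub cached_html
{
    my ( $url, $can_cache ) = @_;
    my $path = "$cache_dir/" . md5_hex( $url ) . '.html';   # cache key = URL text

    if( $can_cache && -e $path && ( -M $path ) < $cache_days )
    {
        open( my $in, '<', $path ) or die "Cannot read $path: $!";
        local $/;                       # slurp the whole file
        return scalar <$in>;
    }

    my $html = fetch( $url );
    if( $can_cache )
    {
        open( my $out, '>', $path ) or die "Cannot write $path: $!";
        print {$out} $html;
        close $out;
    }
    return $html;
}

cached_html( 'https://example.org/shelf?page=1', 1 );   # fetched and stored
cached_html( 'https://example.org/shelf?page=1', 1 );   # served from the cache
cached_html( 'https://example.org/shelf?p=1',    1 );   # different URL text: fetched again

print "fetches: $fetches\n";   # fetches: 2
```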