Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HYC-1950 - Dimensions Ingest Failed Downloads QA #1111

Merged
merged 27 commits into from
Aug 12, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
edd1626
encode pdf url to handle non ascii
davidcam-src Jul 24, 2024
2e7686e
retrieve pdfs from wiley library api for certain ingest urls
davidcam-src Jul 25, 2024
cb76039
conditional for wiley publications
davidcam-src Aug 1, 2024
839445e
make reports include pdf linkout, specified error message for potenti…
davidcam-src Aug 2, 2024
1c6f75f
typo and test for fetching from wiley library url
davidcam-src Aug 2, 2024
8279b22
tests for pdfs retrieved with wiley library api
davidcam-src Aug 2, 2024
7d0fce9
mock env variables for dimensions ingest service tests
davidcam-src Aug 2, 2024
372bb18
pr changes to logging and report html
davidcam-src Aug 6, 2024
656abcb
pr changes: remove instance variable, introduce delays betweeen downl…
davidcam-src Aug 6, 2024
bec689d
linting
davidcam-src Aug 6, 2024
badd8d4
setting article depositor
davidcam-src Aug 6, 2024
eea20f6
updating tests, fixing syntax
davidcam-src Aug 6, 2024
6dd17af
pr changes: delay time and static depositor onyen
davidcam-src Aug 6, 2024
52fb682
syntax
davidcam-src Aug 6, 2024
bf95e67
using config instead of env variables, assume api url content types a…
davidcam-src Aug 7, 2024
58f5391
check depositors in ingest service spec
davidcam-src Aug 8, 2024
f1c73be
remove deduplication, modify query for testing
davidcam-src Aug 9, 2024
86acb14
smaller sample for manual testing, more robust logs for the ingest he…
davidcam-src Aug 9, 2024
4601007
testing a mixed sample again
davidcam-src Aug 9, 2024
d5127ca
reintroduce deduplication, remove comments
davidcam-src Aug 9, 2024
6b3ed30
update ingest service test
davidcam-src Aug 9, 2024
74356ae
logging, deduplication toggle for testing
davidcam-src Aug 9, 2024
cfe4391
removing hardcoded sample
davidcam-src Aug 9, 2024
0823db5
removing debug logs
davidcam-src Aug 12, 2024
bf71af9
add config argument
davidcam-src Aug 12, 2024
9f6deb9
addressing text rendering issues in email
davidcam-src Aug 12, 2024
1f87eaf
alternate value for nil linkout
davidcam-src Aug 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 64 additions & 16 deletions app/services/tasks/dimensions_ingest_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,12 @@ module Tasks
class DimensionsIngestService
include Tasks::IngestHelper
attr_reader :admin_set, :depositor
UNC_GRID_ID = 'grid.410711.2'

def initialize(config)
@config = config
admin_set_title = @config['admin_set']
@download_delay = @config['download_delay'] || 2
@wiley_tdm_api_token = @config['wiley_tdm_api_token']
@admin_set = ::AdminSet.where(title: admin_set_title)&.first
raise(ActiveRecord::RecordNotFound, "Could not find AdminSet with title #{admin_set_title}") unless @admin_set.present?

Expand All @@ -29,7 +30,7 @@ def ingest_publications(publications)
rescue StandardError => e
publication.delete('pdf_attached')
res[:failed] << publication.merge('error' => [e.class.to_s, e.message])
Rails.logger.error("Error ingesting publication '#{publication['title']}'")
Rails.logger.error("Error ingesting publication '#{publication['title']}' with Dimensions ID: #{publication['id']}")
Rails.logger.error [e.class.to_s, e.message, *e.backtrace].join($RS)
end
end
Expand All @@ -40,7 +41,6 @@ def process_publication(publication)
article = article_with_metadata(publication)
create_sipity_workflow(work: article)
pdf_path = extract_pdf(publication)

if pdf_path
pdf_file = attach_pdf_to_work(article, pdf_path, @depositor)
pdf_file.update(permissions_attributes: group_permissions(@admin_set))
Expand Down Expand Up @@ -68,6 +68,7 @@ def populate_article_metadata(article, publication)
def set_basic_attributes(article, publication)
article.title = [publication['title']]
article.admin_set = @admin_set
article.depositor = @config['depositor_onyen']
article.creators_attributes = publication['authors'].map.with_index { |author, index| [index, author_to_hash(author, index)] }.to_h
article.funder = publication['funders']&.map { |funder| funder['name'] }
article.date_issued = publication['date']
Expand Down Expand Up @@ -96,7 +97,6 @@ def set_identifiers(article, publication)
article.issn = publication['issn'].presence
end


def author_to_hash(author, index)
hash = {
'name' => "#{[author['last_name'], author['first_name']].compact.join(', ')}",
Expand Down Expand Up @@ -130,39 +130,87 @@ def parse_page_numbers(publication)
end

def extract_pdf(publication)
pdf_url = publication['linkout']? publication['linkout'] : nil
pdf_url = publication['linkout'] ? publication['linkout'] : nil
# Set pdf_attached to false by default
publication['pdf_attached'] = false
unless publication['linkout']
headers = {}

unless pdf_url
Rails.logger.warn('Failed to retrieve PDF. Publication does not have a linkout URL.')
return nil
end
# Use the Wiley Online Library text data mining API to retrieve the PDF if it's a Wiley publication
if pdf_url.include?('hindawi.com') || pdf_url.include?('wiley.com')
Rails.logger.info('Detected a Wiley affiliated publication, attempting to retrieve PDF with their API.')
encoded_doi = URI.encode_www_form_component(publication['doi'])
encoded_url = "https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}"
headers['Wiley-TDM-Client-Token'] = "#{@wiley_tdm_api_token}"
else
# Use the encoded linkout URL from dimensions otherwise
encoded_url = URI.encode(pdf_url)
end
download_pdf(encoded_url, publication, headers)
end

def download_pdf(encoded_url, publication, headers)
begin
response = HTTParty.head(pdf_url)
# Verify the content type is PDF; raise an error if not.
if response.code == 200
raise "Incorrect content type: '#{response.headers['content-type']}'" unless response.headers['content-type']&.include?('application/pdf')
# Enforce a delay before making the request
sleep @download_delay
# Assume API URLs are valid and skip the content type check to avoid rate limiting
if !is_api(encoded_url)
# Verify the content type of the PDF before downloading
response = HTTParty.head(encoded_url, headers: headers)
if response.code == 200
# Log a warning if the content type is not a PDF
raise "Incorrect content type: '#{response.headers['content-type']}'" unless response.headers['content-type']&.include?('application/pdf')
else
# Log a warning if the response code is not 200
Rails.logger.warn("Received a non-200 response code (#{response.code}) when making a HEAD request to the PDF URL: #{encoded_url}")
end
else
# Provide a warning and proceed with GET in case the HEAD request failed
Rails.logger.warn("Received a non-200 response code (#{response.code}) when making a HEAD request to the PDF URL: #{pdf_url}")
Rails.logger.info("Skipping content type check for API URL: #{encoded_url}")
end
# Attempt to retrieve the PDF from the encoded URL
pdf_response = HTTParty.get(encoded_url, headers: headers)
if pdf_response.code != 200
wiley_rate_exceeded = headers.keys.include?('Wiley-TDM-Client-Token') && pdf_response&.body.match?(/rate/i)
# Retry the request after a delay if the Wiley-TDM API rate limit is exceeded
if wiley_rate_exceeded
delay_time = @download_delay * 15
Rails.logger.warn("Wiley-TDM API rate limit exceeded. Retrying request in #{delay_time} seconds.")
# Retry the request after a delay if the Wiley-TDM API rate limit is exceeded
sleep delay_time
pdf_response = HTTParty.get(encoded_url, headers: headers)
end

# If the second attempt also fails, or if there's another error, raise an error
if pdf_response.code != 200
e_message = "Failed to download PDF: HTTP status '#{pdf_response.code}'"
# Include specific error message for potential Wiley-TDM API rate limiting (pdf_response.code != 200 and the Wiley API response body mentions rate limiting)
e_message += ' (Exceeded Wiley-TDM API rate limit)' if wiley_rate_exceeded
raise e_message
end
end

pdf_response = HTTParty.get(pdf_url)
raise "Failed to download PDF: HTTP status '#{pdf_response.code}'" unless pdf_response.code == 200
raise "Incorrect content type: '#{pdf_response.headers['content-type']}'" unless pdf_response.headers['content-type']&.include?('application/pdf')

# Generate a unique filename for the PDF and save it to the temporary storage directory
storage_dir = ENV['TEMP_STORAGE']
time_stamp = Time.now.strftime('%Y%m%d%H%M%S%L')
filename = "downloaded_pdf_#{time_stamp}.pdf"
file_path = File.join(storage_dir, filename)

# Write the PDF to the file system and mark the publication as having a PDF attached
File.open(file_path, 'wb') { |file| file.write(pdf_response.body) }
publication['pdf_attached'] = true
file_path # Return the file path
rescue StandardError => e
Rails.logger.error("Failed to retrieve PDF from URL '#{pdf_url}'. #{e.message}")
Rails.logger.error("Failed to retrieve PDF from URL '#{encoded_url}'. #{e.message}")
nil
end
end

def is_api(encoded_url)
encoded_url.include?('api')
end
end
end
9 changes: 3 additions & 6 deletions app/services/tasks/dimensions_query_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@ class DimensionsPublicationQueryError < StandardError

attr_accessor :dimensions_total_count

def initialize
def initialize(config)
@config = config
@dimensions_total_count = 0
end

Expand Down Expand Up @@ -42,12 +43,9 @@ def query_dimensions(options = {})
if response.success?
# Merge the new publications with the existing set
all_publications.merge(process_response(response))
# total_count = response['_stats']['total_count']
self.dimensions_total_count = response.parsed_response['_stats']['total_count']
cursor += page_size
# End the loop if the cursor exceeds the total count
# WIP: Break if cursor is greater than or equal to 100 for testing purposes
# break if cursor >= total_count || cursor >= 100
break if cursor >= self.dimensions_total_count
# Reset the retry count if the query is successful
retries[0] = 0
Expand Down Expand Up @@ -107,8 +105,7 @@ def post_query(query_string, token)
def process_response(response)
parsed_body = JSON.parse(response.body)
# To disable deduplication during testing, comment out the next line and uncomment the following line. (Makes it easier to conduct repeated tests of the ingest process.)
publications = deduplicate_publications(parsed_body['publications'])
# publications = parsed_body['publications']
publications = @config['deduplicate'] ? deduplicate_publications(parsed_body['publications']) : parsed_body['publications']
Rails.logger.info("Dimensions API returned #{parsed_body['publications'].size} publications.")
Rails.logger.info("Unique Publications after Deduplicating: #{publications.size}.")
publications
Expand Down
4 changes: 2 additions & 2 deletions app/services/tasks/dimensions_reporting_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -28,11 +28,11 @@ def generate_report
def extract_publication_info
publication_info = {successfully_ingested: [], failed_to_ingest: []}
@ingested_publications[:ingested].map do |publication|
publication_item = { title: publication['title'], id: publication['id'], url: "#{ENV['HYRAX_HOST']}/concern/articles/#{publication['article_id']}?locale=en", pdf_attached: publication['pdf_attached'] ? 'Yes' : 'No' }
publication_item = { title: publication['title'], id: publication['id'], linkout: publication['linkout'], url: "#{ENV['HYRAX_HOST']}/concern/articles/#{publication['article_id']}?locale=en", pdf_attached: publication['pdf_attached'] ? 'Yes' : 'No' }
publication_info[:successfully_ingested] << publication_item
end
@ingested_publications[:failed].map do |publication|
publication_info[:failed_to_ingest] << { title: publication['title'], id: publication['id'], error: "#{publication['error'][0]} - #{publication['error'][1]}" }
publication_info[:failed_to_ingest] << { title: publication['title'], id: publication['id'], linkout: publication['linkout'], error: "#{publication['error'][0]} - #{publication['error'][1]}" }
end
publication_info
end
Expand Down
21 changes: 12 additions & 9 deletions app/services/tasks/ingest_helper.rb
Original file line number Diff line number Diff line change
Expand Up @@ -13,15 +13,18 @@ def attach_xml_to_work(work, file_path, depositor)

def attach_file_set_to_work(work:, file_path:, user:, visibility:)
file_set_params = { visibility: visibility }
Rails.logger.info("Attaching file_set for #{file_path} to DOI: #{work.identifier.first}")
file_set = FileSet.create
actor = Hyrax::Actors::FileSetActor.new(file_set, user)
actor.create_metadata(file_set_params)
file = File.open(file_path)
actor.create_content(file)
actor.attach_to_work(work, file_set_params)
file.close

begin
file_set = FileSet.create
actor = Hyrax::Actors::FileSetActor.new(file_set, user)
actor.create_metadata(file_set_params)
file = File.open(file_path)
actor.create_content(file)
actor.attach_to_work(work, file_set_params)
file.close
rescue StandardError => e
Rails.logger.error("Error attaching file_set for new work with DOI: #{work.identifier.first} and file_path: #{file_path}")
Rails.logger.error [e.class.to_s, e.message, *e.backtrace].join($RS)
end
file_set
end

Expand Down
18 changes: 18 additions & 0 deletions app/views/dimensions_report_mailer/dimensions_report_email.erb
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 20%;">Dimensions ID</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 30%;">URL</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Attached</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Link</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 15%;">Error</th>
</tr>
</thead>
Expand All @@ -27,6 +28,14 @@
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:id] %></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><a href="<%= publication[:url] %>"><%= publication[:url] %></a></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:pdf_attached] %></td>
<!-- Render PDF Link Text Conditionally -->
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">
<% if publication[:linkout].blank? %>
N/A
<% else %>
<a href="<%= publication[:linkout].gsub('"', '\"') %>">Link</a>
<% end %>
</td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">N/A</td>
</tr>
<% end %>
Expand All @@ -41,6 +50,7 @@
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 20%;">Dimensions ID</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 30%;">URL</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Attached</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Link</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 15%;">Error</th>
</tr>
</thead>
Expand All @@ -51,6 +61,14 @@
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:id] %></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">N/A</td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">N/A</td>
<!-- Render PDF Link Text Conditionally -->
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">
<% if publication[:linkout].blank? %>
N/A
<% else %>
<a href="<%= publication[:linkout].gsub('"', '\"') %>">Link</a>
<% end %>
</td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:error] %></td>
</tr>
<% end %>
Expand Down
7 changes: 5 additions & 2 deletions lib/tasks/dimensions_ingest.rake
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,13 @@ namespace :dimensions do
EARLIEST_DATE = '1970-01-01'

desc 'Ingest metadata and publications from Dimensions'
task :ingest_publications, [:admin_set, :start_date, :end_date] => :environment do |t, args|
task :ingest_publications, [:admin_set, :start_date, :end_date, :deduplicate] => :environment do |t, args|
Rails.logger.info "[#{Time.now}] starting dimensions publications ingest"
is_cron_job = false

start_date = args[:start_date]
end_date = args[:end_date]
deduplicate = args[:deduplicate].present? ? args[:deduplicate].downcase == 'true' : true

# Ensure start_date and end_date are provided, or neither are provided
if args[:start_date].present? && !args[:end_date].present?
Expand All @@ -23,10 +24,12 @@ namespace :dimensions do
config = {
'admin_set' => args[:admin_set],
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'],
'wiley_tdm_api_token' => ENV['WILEY_TDM_API_TOKEN'],
'deduplicate' => deduplicate
}

# Query and ingest publications
query_service = Tasks::DimensionsQueryService.new
query_service = Tasks::DimensionsQueryService.new(config)
ingest_service = Tasks::DimensionsIngestService.new(config)
publications = ingest_service.ingest_publications(query_service.query_dimensions(start_date: start_date, end_date: end_date))
report = Tasks::DimensionsReportingService.new(publications, query_service.dimensions_total_count, { start_date: start_date, end_date: end_date }, is_cron_job).generate_report
Expand Down
13 changes: 4 additions & 9 deletions spec/mailers/dimensions_report_mailer_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,8 @@
let(:config) {
{
'admin_set' => 'Open_Access_Articles_and_Book_Chapters',
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
'depositor_onyen' => 'admin',
'download_delay' => 0
}
}
let(:dimensions_ingest_test_fixture) do
Expand Down Expand Up @@ -76,14 +77,6 @@
ingested_publications
end

# Override the depositor onyen for the duration of the test
around do |example|
dimensions_ingest_depositor_onyen = ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'] = 'admin'
example.run
ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'] = dimensions_ingest_depositor_onyen
end

describe 'dimensions_report_email' do
let(:mail) { DimensionsReportMailer.dimensions_report_email(report) }

Expand All @@ -104,12 +97,14 @@
expect(mail.body.encoded).to include(publication[:title])
.and include(publication[:id])
.and include(publication[:url])
.and include(publication[:linkout] || 'N/A')
.and include(publication[:pdf_attached])
end
report[:failed_to_ingest_rows].each do |publication|
expect(mail.body.encoded).to include(publication[:title])
.and include(publication[:id])
.and include(publication[:error])
.and include(publication[:linkout] || 'N/A')
end
end

Expand Down
16 changes: 15 additions & 1 deletion spec/mailers/previews/dimensions_report_mailer_preview.rb
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,8 @@ def dimensions_report_email
test_publications = JSON.parse(dimensions_ingest_test_fixture)['publications']
config = {
'admin_set' => 'default',
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'],
'wiley_tdm_api_token' => ENV['WILEY_TDM_API_TOKEN']
}

dimensions_ingest_service = Tasks::DimensionsIngestService.new(config)
Expand All @@ -32,6 +33,19 @@ def dimensions_report_email
pub.merge('error' => ['Test error', 'Test error message'])
end

# Test conditional rendering of linkout for even publications in both failed and ingested arrays
ingested_publications[:failed].each_with_index do |pub, index|
if index.even?
pub['linkout'] = nil
end
end

ingested_publications[:ingested].each_with_index do |pub, index|
if index.even?
pub['linkout'] = nil
end
end

dimensions_reporting_service = Tasks::DimensionsReportingService.new(ingested_publications, FIXED_DIMENSIONS_TOTAL_COUNT, { start_date: TEST_START_DATE, end_date: TEST_END_DATE }, FALSE)
report = dimensions_reporting_service.generate_report
DimensionsReportMailer.dimensions_report_email(report)
Expand Down
Loading
Loading