Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HYC-1950 - Dimensions Ingest Failed Downloads QA #1111

Merged
merged 27 commits into from
Aug 12, 2024
Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
edd1626
encode pdf url to handle non ascii
davidcam-src Jul 24, 2024
2e7686e
retrieve pdfs from wiley library api for certain ingest urls
davidcam-src Jul 25, 2024
cb76039
conditional for wiley publications
davidcam-src Aug 1, 2024
839445e
make reports include pdf linkout, specified error message for potenti…
davidcam-src Aug 2, 2024
1c6f75f
typo and test for fetching from wiley library url
davidcam-src Aug 2, 2024
8279b22
tests for pdfs retrieved with wiley library api
davidcam-src Aug 2, 2024
7d0fce9
mock env variables for dimensions ingest service tests
davidcam-src Aug 2, 2024
372bb18
pr changes to logging and report html
davidcam-src Aug 6, 2024
656abcb
pr changes: remove instance variable, introduce delays betweeen downl…
davidcam-src Aug 6, 2024
bec689d
linting
davidcam-src Aug 6, 2024
badd8d4
setting article depositor
davidcam-src Aug 6, 2024
eea20f6
updating tests, fixing syntax
davidcam-src Aug 6, 2024
6dd17af
pr changes: delay time and static depositor onyen
davidcam-src Aug 6, 2024
52fb682
syntax
davidcam-src Aug 6, 2024
bf95e67
using config instead of env variables, assume api url content types a…
davidcam-src Aug 7, 2024
58f5391
check depositors in ingest service spec
davidcam-src Aug 8, 2024
f1c73be
remove deduplication, modify query for testing
davidcam-src Aug 9, 2024
86acb14
smaller sample for manual testing, more robust logs for the ingest he…
davidcam-src Aug 9, 2024
4601007
testing a mixed sample again
davidcam-src Aug 9, 2024
d5127ca
reintroduce deduplication, remove comments
davidcam-src Aug 9, 2024
6b3ed30
update ingest service test
davidcam-src Aug 9, 2024
74356ae
logging, deduplication toggle for testing
davidcam-src Aug 9, 2024
cfe4391
removing hardcoded sample
davidcam-src Aug 9, 2024
0823db5
removing debug logs
davidcam-src Aug 12, 2024
bf71af9
add config argument
davidcam-src Aug 12, 2024
9f6deb9
addressing text rendering issues in email
davidcam-src Aug 12, 2024
1f87eaf
alternate value for nil linkout
davidcam-src Aug 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
63 changes: 52 additions & 11 deletions app/services/tasks/dimensions_ingest_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,11 @@ module Tasks
class DimensionsIngestService
include Tasks::IngestHelper
attr_reader :admin_set, :depositor
UNC_GRID_ID = 'grid.410711.2'

def initialize(config)
@config = config
admin_set_title = @config['admin_set']
@download_delay = @config['download_delay'] || 2
@admin_set = ::AdminSet.where(title: admin_set_title)&.first
raise(ActiveRecord::RecordNotFound, "Could not find AdminSet with title #{admin_set_title}") unless @admin_set.present?

Expand Down Expand Up @@ -68,6 +68,7 @@ def populate_article_metadata(article, publication)
def set_basic_attributes(article, publication)
article.title = [publication['title']]
article.admin_set = @admin_set
article.depositor = ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
bbpennel marked this conversation as resolved.
Show resolved Hide resolved
article.creators_attributes = publication['authors'].map.with_index { |author, index| [index, author_to_hash(author, index)] }.to_h
article.funder = publication['funders']&.map { |funder| funder['name'] }
article.date_issued = publication['date']
Expand Down Expand Up @@ -130,39 +131,79 @@ def parse_page_numbers(publication)
end

def extract_pdf(publication)
pdf_url = publication['linkout']? publication['linkout'] : nil
pdf_url = publication['linkout'] ? publication['linkout'] : nil
# Set pdf_attached to false by default
publication['pdf_attached'] = false
unless publication['linkout']
headers = {}

unless pdf_url
Rails.logger.warn('Failed to retrieve PDF. Publication does not have a linkout URL.')
return nil
end
# Use the Wiley Online Library text data mining API to retrieve the PDF if it's a Wiley publication
if pdf_url.include?('hindawi.com') || pdf_url.include?('wiley.com')
Rails.logger.info('Detected a Wiley affiliated publication, attempting to retrieve PDF with their API.')
encoded_doi = URI.encode_www_form_component(publication['doi'])
encoded_url = "https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}"
headers['Wiley-TDM-Client-Token'] = "#{ENV['WILEY_TDM_API_TOKEN']}"
else
# Use the encoded linkout URL from dimensions otherwise
encoded_url = URI.encode(pdf_url)
end
download_pdf(encoded_url, publication, headers)
end

def download_pdf(encoded_url, publication, headers)
begin
response = HTTParty.head(pdf_url)
# Verify the content type is PDF; raise an error if not.
# Enforce a delay before making the request
sleep @download_delay
# Verify the content type of the PDF before downloading
response = HTTParty.head(encoded_url, headers: headers)
if response.code == 200
# Log a warning if the content type is not a PDF
raise "Incorrect content type: '#{response.headers['content-type']}'" unless response.headers['content-type']&.include?('application/pdf')
else
# Provide a warning and proceed with GET in case the HEAD request failed
Rails.logger.warn("Received a non-200 response code (#{response.code}) when making a HEAD request to the PDF URL: #{pdf_url}")
# Log a warning if the response code is not 200
Rails.logger.warn("Received a non-200 response code (#{response.code}) when making a HEAD request to the PDF URL: #{encoded_url}")
end

pdf_response = HTTParty.get(pdf_url)
raise "Failed to download PDF: HTTP status '#{pdf_response.code}'" unless pdf_response.code == 200
# Attempt to retrieve the PDF from the encoded URL
pdf_response = HTTParty.get(encoded_url, headers: headers)
if pdf_response.code != 200
wiley_rate_exceeded = headers.keys.include?('Wiley-TDM-Client-Token') && pdf_response&.body.match?(/rate/i)
# Retry the request after a delay if the Wiley-TDM API rate limit is exceeded
if wiley_rate_exceeded
delay_time = 30
bbpennel marked this conversation as resolved.
Show resolved Hide resolved
Rails.logger.warn("Wiley-TDM API rate limit exceeded. Retrying request in #{delay_time} seconds.")
# Retry the request after a delay if the Wiley-TDM API rate limit is exceeded
sleep delay_time
pdf_response = HTTParty.get(encoded_url, headers: headers)
end

# If the second attempt also fails, or if there's another error, raise an error
if pdf_response.code != 200
e_message = "Failed to download PDF: HTTP status '#{pdf_response.code}'"
# Include specific error message for potential Wiley-TDM API rate limiting (pdf_response.code != 200 and the Wiley API response body mentions rate limiting)
e_message += ' (Exceeded Wiley-TDM API rate limit)' if wiley_rate_exceeded
raise e_message
end
end
raise "Incorrect content type: '#{pdf_response.headers['content-type']}'" unless pdf_response.headers['content-type']&.include?('application/pdf')

# Generate a unique filename for the PDF and save it to the temporary storage directory
storage_dir = ENV['TEMP_STORAGE']
time_stamp = Time.now.strftime('%Y%m%d%H%M%S%L')
filename = "downloaded_pdf_#{time_stamp}.pdf"
file_path = File.join(storage_dir, filename)

# Write the PDF to the file system and mark the publication as having a PDF attached
File.open(file_path, 'wb') { |file| file.write(pdf_response.body) }
publication['pdf_attached'] = true
file_path # Return the file path
rescue StandardError => e
Rails.logger.error("Failed to retrieve PDF from URL '#{pdf_url}'. #{e.message}")
Rails.logger.error("Failed to retrieve PDF from URL '#{encoded_url}'. #{e.message}")
nil
end
end

end
end
4 changes: 2 additions & 2 deletions app/services/tasks/dimensions_reporting_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -28,11 +28,11 @@ def generate_report
def extract_publication_info
publication_info = {successfully_ingested: [], failed_to_ingest: []}
@ingested_publications[:ingested].map do |publication|
publication_item = { title: publication['title'], id: publication['id'], url: "#{ENV['HYRAX_HOST']}/concern/articles/#{publication['article_id']}?locale=en", pdf_attached: publication['pdf_attached'] ? 'Yes' : 'No' }
publication_item = { title: publication['title'], id: publication['id'], linkout: publication['linkout'] || 'None', url: "#{ENV['HYRAX_HOST']}/concern/articles/#{publication['article_id']}?locale=en", pdf_attached: publication['pdf_attached'] ? 'Yes' : 'No' }
publication_info[:successfully_ingested] << publication_item
end
@ingested_publications[:failed].map do |publication|
publication_info[:failed_to_ingest] << { title: publication['title'], id: publication['id'], error: "#{publication['error'][0]} - #{publication['error'][1]}" }
publication_info[:failed_to_ingest] << { title: publication['title'], id: publication['id'], linkout: publication['linkout'] || 'None', error: "#{publication['error'][0]} - #{publication['error'][1]}" }
end
publication_info
end
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 20%;">Dimensions ID</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 30%;">URL</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Attached</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Link</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 15%;">Error</th>
</tr>
</thead>
Expand All @@ -27,6 +28,7 @@
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:id] %></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><a href="<%= publication[:url] %>"><%= publication[:url] %></a></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:pdf_attached] %></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><a href="<%= publication[:linkout].gsub('"', '\"') %>">Link</a></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">N/A</td>
</tr>
<% end %>
Expand All @@ -41,6 +43,7 @@
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 20%;">Dimensions ID</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 30%;">URL</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Attached</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 5%;">PDF Link</th>
<th style="border: 2px solid #dddddd; text-align: left; padding: 10px; background-color: #f2f2f2; width: 15%;">Error</th>
</tr>
</thead>
Expand All @@ -51,6 +54,7 @@
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:id] %></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">N/A</td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;">N/A</td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:linkout] %></td>
<td style="border: 2px solid #dddddd; text-align: left; padding: 10px;"><%= publication[:error] %></td>
</tr>
<% end %>
Expand Down
5 changes: 4 additions & 1 deletion spec/mailers/dimensions_report_mailer_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,8 @@
let(:config) {
{
'admin_set' => 'Open_Access_Articles_and_Book_Chapters',
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'],
'download_delay' => 0
}
}
let(:dimensions_ingest_test_fixture) do
Expand Down Expand Up @@ -104,12 +105,14 @@
expect(mail.body.encoded).to include(publication[:title])
.and include(publication[:id])
.and include(publication[:url])
.and include(publication[:linkout])
.and include(publication[:pdf_attached])
end
report[:failed_to_ingest_rows].each do |publication|
expect(mail.body.encoded).to include(publication[:title])
.and include(publication[:id])
.and include(publication[:error])
.and include(publication[:linkout])
end
end

Expand Down
50 changes: 44 additions & 6 deletions spec/services/tasks/dimensions_ingest_service_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,8 @@
let(:config) {
{
'admin_set' => 'Open_Access_Articles_and_Book_Chapters',
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
'depositor_onyen' => ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'],
'download_delay' => 0
}
}
let(:dimensions_ingest_test_fixture) do
Expand All @@ -27,10 +28,7 @@
FactoryBot.create(:workflow_state, workflow_id: workflow.id, name: 'deposited')
end
let(:pdf_content) { File.binread(File.join(Rails.root, '/spec/fixtures/files/sample_pdf.pdf')) }
let(:test_publications) do
JSON.parse(dimensions_ingest_test_fixture)['publications']
end

let(:test_publications) { JSON.parse(dimensions_ingest_test_fixture)['publications'] }

before do
ActiveFedora::Cleaner.clean!
Expand All @@ -55,9 +53,12 @@
# Override the depositor onyen for the duration of the test
around do |example|
dimensions_ingest_depositor_onyen = ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN']
wiley_tdm_client_token = ENV['WILEY_TDM_API_TOKEN']
ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'] = 'admin'
ENV['WILEY_TDM_API_TOKEN'] = 'test-token'
example.run
ENV['DIMENSIONS_INGEST_DEPOSITOR_ONYEN'] = dimensions_ingest_depositor_onyen
ENV['WILEY_TDM_API_TOKEN'] = wiley_tdm_client_token
end

describe '#initialize' do
Expand Down Expand Up @@ -186,7 +187,7 @@

describe '#extract_pdf' do
it 'extracts the PDF from the publication' do
publication = test_publications.first
publication = test_publications.last
pdf_path = service.extract_pdf(publication)
expect(File.exist?(pdf_path)).to be true
expect(publication['pdf_attached']).to be true
Expand Down Expand Up @@ -230,6 +231,43 @@
expect(publication['pdf_attached']).to be true
expect(File.exist?(pdf_path)).to be true
end

context 'when the publication linkout url is a Wiley Online Library URL' do
let (:test_doi) { '10.1002/cne.24143' }
let (:encoded_doi) { URI.encode_www_form_component(test_doi) }
let (:wiley_test_publication) do
publication = test_publications.first
publication['linkout'] = "https://onlinelibrary.wiley.com/doi/pdfdirect/#{test_doi}"
publication['doi'] = test_doi
publication
end

before do
allow(service).to receive(:download_pdf).and_call_original
stub_request(:head, "https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}")
.to_return(status: 200, headers: { 'Content-Type' => 'application/pdf' })
end

it 'attempts to download a PDF using their API' do
stub_request(:get, "https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}")
.to_return(body: pdf_content, status: 200, headers: { 'Content-Type' => 'application/pdf' })
pdf_path = service.extract_pdf(wiley_test_publication)
expect(service).to have_received(:download_pdf).with(
"https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}",
wiley_test_publication,
'Wiley-TDM-Client-Token' => ENV['WILEY_TDM_API_TOKEN']
)
end

it 'raises a unique error if the Wiley API returns a non 200 status code with rate limit message in the body after a retry' do
rate_limit_error = '"{""fault"":{""faultstring"":""Rate limit quota violation. Quota limit exceeded. Identifier : 3c6dd2e9-faf5-4017-9039-b09388fc5a08"",""detail"":{""errorcode"":""policies.ratelimit.QuotaViolation""}}}"'
stub_request(:get, "https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}")
.to_return(body: rate_limit_error, status: 403, headers: { 'Content-Type' => 'application/pdf' })
expect(Rails.logger).to receive(:warn).with('Wiley-TDM API rate limit exceeded. Retrying request in 30 seconds.')
expect(Rails.logger).to receive(:error).with("Failed to retrieve PDF from URL 'https://api.wiley.com/onlinelibrary/tdm/v1/articles/#{encoded_doi}'. Failed to download PDF: HTTP status '403' (Exceeded Wiley-TDM API rate limit)")
service.extract_pdf(wiley_test_publication)
end
end
end

describe '#article_with_metadata' do
Expand Down
Loading