Update ct.gov connector and parser for ct.gov API v2
Refactor to use the v2 API and update URLs that referenced the "classic" site for study pages.

Connector changes:
1. Recurse over the API's pages of results. It is no longer an option to return more than 100 studies at once when searching by location and similar criteria.
2. Show status messages for both current "page" and overall.
3. Fix #clear, which had not been updated to delete TrialSubgroups when those were added (hence was failing to delete Trials with a reference error).
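
The paging described in item 1 can be sketched in isolation. This is a minimal illustration, not the connector's code: it uses a loop rather than the recursion the commit describes, `collect_studies`, `FAKE_PAGES`, and `FETCH_FAKE_PAGE` are hypothetical names, and the fake pages stand in for the real HTTP call so the sketch runs offline. The `studies` and `nextPageToken` keys are the ones the connector reads from the response.

```ruby
# Hypothetical helper: gathers studies from every page of a token-paginated API.
# `fetch_page` is any callable that takes a page token (nil for the first page)
# and returns a Hash with "studies" and, when more pages exist, "nextPageToken".
def collect_studies(fetch_page)
  studies = []
  page_token = nil
  loop do
    payload = fetch_page.call(page_token)
    studies.concat(payload["studies"] || [])
    page_token = payload["nextPageToken"]
    # Stop when the API does not hand back a token for another page.
    break if page_token.nil? || page_token.empty?
  end
  studies
end

# Assumption: canned pages replacing the real request, for illustration only.
FAKE_PAGES = {
  nil  => { "studies" => [{ "id" => "A" }, { "id" => "B" }], "nextPageToken" => "p2" },
  "p2" => { "studies" => [{ "id" => "C" }] }
}.freeze

FETCH_FAKE_PAGE = ->(token) { FAKE_PAGES.fetch(token) }

puts collect_studies(FETCH_FAKE_PAGE).map { |study| study["id"] }.inspect
```

A loop avoids growing the call stack on very large result sets; the recursive form in the commit behaves the same way for a modest number of pages.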

Parser changes:
1. Major changes to switch from XML/xpath to the new JSON format for API responses.
2. Don't re-fetch data for each study individually. The connector loads study data one "page" at a time and passes each study's data to the parser at initialization (each parser instance handles the data for a single study).
3. Change location matching from exact to substring (via regex). This fixes an issue where e.g. a location of "University of Minnesota/Cancer Center" would not match if the location in the site settings is "University of Minnesota". This was causing data for studies to be incorrectly omitted.
4. New contacts algorithm, since the V2 API no longer has "contact" and "backup contact".
5. Misc. updates for API changes to case in enumerated values, naming and nesting, etc.
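
The difference between exact and substring matching in item 3 can be shown with a small sketch. `location_matches?` is a hypothetical helper, not the parser's actual method, and escaping the configured name with `Regexp.escape` is an assumption about how one might guard against regex metacharacters in site settings.

```ruby
# Hypothetical helper: true when the configured site location appears
# anywhere inside the facility name reported for a study.
def location_matches?(facility, site_location)
  # Regexp.escape keeps characters such as "(" or "." in the configured
  # name from being interpreted as regex syntax.
  pattern = Regexp.new(Regexp.escape(site_location))
  pattern.match?(facility.to_s)
end

facility = "University of Minnesota/Cancer Center"
site = "University of Minnesota"

puts facility == site                   # exact comparison misses the study
puts location_matches?(facility, site)  # substring comparison finds it
```

Substring matching is more permissive, so a short or generic site setting could match facilities that were not intended.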

Spec changes:
Added many tests and updated existing ones for ct.gov parser.

Rake task changes:
Update for new private connector API.

README changes:
Include update notes.
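
The upgrade notes in this commit state that `Connectors::Ctgov#load` now expects dates as YYYY-MM-DD instead of MM/DD/YYYY. For custom tasks that still hold old-style date strings, a conversion could look like the sketch below. `to_iso_date` is a hypothetical helper and is not part of the connector.

```ruby
require "date"

# Hypothetical helper: converts an old-style MM/DD/YYYY string into the
# ISO YYYY-MM-DD form. Date.strptime raises if the input does not match.
def to_iso_date(us_date)
  Date.strptime(us_date, "%m/%d/%Y").iso8601
end

puts to_iso_date("07/18/2024")
```

A task might then pass the converted values along, for example `load(to_iso_date("07/01/2024"), to_iso_date("07/18/2024"))` on a connector instance.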

uncomment line and update docker-compose
machinehum committed Jul 18, 2024
1 parent 0a17cbe commit 9241dcc
Showing 7 changed files with 722 additions and 587 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -10,9 +10,11 @@ Contact the StudyFinder team at studyfinder@umn.edu if you:
- Have any questions about StudyFinder, or
- Want to learn more about updates or enhancements of the tool.

## Upgrade notes for 2.1
## Upgrade notes for 2.2
The built-in clinicaltrials.gov connector has been transitioned fully to the clinicaltrials.gov V2 API. This includes two breaking changes in the private API for the ctgov connector.

The main page carousel/video feature was an accessibility and usability issue, and has been replaced with a three-wide panel of "featured studies". These can be configured in the admin panel, where the carousel configuration formerly was.
1. In `Connectors::Ctgov#load(start_date,end_date)` the start and end dates must now be in ISO format YYYY-MM-DD (the old format was MM/DD/YYYY). Any custom tasks that directly call this method should be updated.
2. `Connectors::Ctgov#load(start_date,end_date)` now calls `Connectors::Ctgov#process` itself to recurse through the V2 API's paged results. Formerly, `load` and `process` had to be called separately in that order. Remove any direct calls to `process` in order to avoid a redundant re-processing of the last "page" of data from the API.

## Development

2 changes: 1 addition & 1 deletion app/views/studies/_clinicaltrialsgov_button.html.erb
@@ -1,5 +1,5 @@
<% if Trial.is_nct_number?(study.nct_id) %>
<a class="btn btn-school btn-more-info" href="https://www.clinicaltrials.gov/ct2/show/study/<%= study.nct_id%>" onclick="track('send', 'event', 'ctgov', 'click', {'nct_id':'<%= study.nct_id %>'});" target="_blank">
<a class="btn btn-school btn-more-info" href="https://www.clinicaltrials.gov/study/<%= study.nct_id%>" onclick="track('send', 'event', 'ctgov', 'click', {'nct_id':'<%= study.nct_id %>'});" target="_blank">
<i class="fa-solid fa-info-circle"></i>
See this study on ClinicalTrials.gov
</a>
1 change: 0 additions & 1 deletion docker-compose.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
version: '3'
services:
elasticsearch:
image: elasticsearch:8.10.2
119 changes: 70 additions & 49 deletions lib/connectors/ctgov.rb
@@ -5,77 +5,98 @@ class Ctgov

def initialize
@system_info = SystemInfo.current
@parser_id = Parser.find_by({ klass: 'Parsers::Ctgov'}).id

if @system_info.nil?
raise "There is no system info associated. Please run the seeds file, or add the info in the system administration section."
end
end

def load(start_date=nil, end_date=nil)
start_load_time = Time.now

url = "https://clinicaltrials.gov/ct2/results/download_studies?locn=#{ERB::Util.url_encode(@system_info.search_term)}"
@parser_id = Parser.find_by({ klass: 'Parsers::Ctgov'}).id
@location = @system_info.search_term
@page_token = nil
@payload = nil
@start_date = 'MIN'
@end_date = 'MAX'
@start_load_time = nil
@total_count = nil
@count = 0
end

if !start_date.nil? and !end_date.nil?
puts "Loading clinicaltrials.gov results for #{@system_info.search_term} ... from #{start_date} to #{end_date}"
url = url + "&lup_s=#{ERB::Util.url_encode(start_date)}&lup_e=#{ERB::Util.url_encode(end_date)}"
else
puts "Loading all clinicaltrials.gov results for #{@system_info.search_term} ..."
def study_filters
q = {
'query.locn' => "AREA[LocationFacility]#{@location} AND AREA[LocationStatus]RECRUITING",
'query.term' => "AREA[LastUpdatePostDate]RANGE[#{@start_date},#{@end_date}]",
countTotal: true,
pageSize: 100,
format: "json"
}
# API only wants a pageToken arg at all if we are actually asking for one.
if !@page_token.blank?
q[:pageToken] = @page_token
end

puts "Search URL: #{url}"
# @zipfile = Tempfile.new('file')
# @zipfile.binmode
return q
end

dirname = "#{Rails.root}/tmp/"
unless File.directory?(dirname)
FileUtils.mkdir_p(dirname)
end
def studies_page
response = HTTParty.get(
"https://clinicaltrials.gov/api/v2/studies",
query: self.study_filters
)
@payload = JSON.parse(response.body || "{}")
@total_count ||= @payload.dig('totalCount')
puts "Retrieved page (#{@page_token})"
end

FileUtils.rm_rf("#{dirname}search_result.zip")
File.open("#{dirname}search_result.zip", "w+") do |f|
f.write(HTTParty.get(url).body)
end
# @zipfile.write(HTTParty.get(url).body)
# @zipfile.close
def load(start_date="MIN", end_date="MAX")
puts "Adding/Updating trials in the database. If it is a full reload it's going to be awhile... Maybe get some coffee? :)"
@start_date = start_date
@end_date = end_date
@start_load_time ||= Time.now

puts "Extracting trials from zip file"
extract()
end_load_time = Time.now
self.studies_page

puts "Time elapsed #{(end_load_time - start_load_time)} seconds"
end
# Process the studies we just received, and ...
self.process
# ... recurse if there's another page.

def extract
start_load_time = Time.now
extract_zip()
end_load_time = Time.now
if @payload.dig("nextPageToken")
@page_token = @payload.dig("nextPageToken")
else
@page_token = nil
end

puts "Zip time elapsed: #{(end_load_time - start_load_time)}"
return true
if @page_token.blank?
puts "clinicaltrials.gov load COMPLETE."
else
puts "Now we'll load page #{@payload.dig("nextPageToken")}"
@payload = nil
self.load(@start_date,@end_date)
end
end

def process
start_load_time = Time.now
count = 0
puts "Adding/Updating trials in the database. If it is a full reload it's going to be awhile... Maybe get some coffee? :)"

Dir.glob("#{Rails.root}/tmp/trials/*.xml") do |file|
p = Parsers::Ctgov.new( file.gsub("#{Rails.root}/tmp/trials/", "").gsub(".xml", ""), @parser_id)
p.load(file)
page_start_load_time = Time.now
page_count = 0
puts "Processing page (#{@page_token})"

@payload.dig('studies').each do |study|
@id = study.dig('protocolSection', 'identificationModule', 'nctId')
p = Parsers::Ctgov.new(@id, @parser_id, study)
puts "Processing: #{@id} (#{@count + 1} of #{@total_count})"
p.process
count = count + 1
page_count = page_count + 1
@count = @count + 1
end
end_load_time = Time.now
page_end_load_time = Time.now

puts "Logging update to updaters table. Processed #{count} records."
puts "Logging update to updaters table."
Updater.create({
parser_id: @parser_id,
num_updated: count
num_updated: page_count
})

puts "Process time elapsed: #{(end_load_time - start_load_time)} seconds"
puts "Page time elapsed: #{(page_end_load_time - page_start_load_time)} seconds for #{page_count} records."
puts "Total process elapsed: #{(page_end_load_time - @start_load_time)} seconds for #{@count} records."
return true
end

@@ -86,8 +107,9 @@ def clear
TrialLocation.delete_all
TrialKeyword.delete_all
Location.delete_all
Trial.delete_all
TrialSubgroup.delete_all
TrialCondition.delete_all
Trial.delete_all
end

def site_nct_ids
@@ -103,7 +125,6 @@ def cleanup_stray_trials
end

def nct_ids_for_location(location, page_token = nil)
csc = 'M Health Fairview Clinics and Surgery Center'
ids = []
q = {
'query.locn' => "SEARCH[Location](AREA[LocationFacility]#{location} AND AREA[LocationStatus]RECRUITING)",