diff --git a/R/utils.R b/R/utils.R
index c97b78e..53e489b 100644
--- a/R/utils.R
+++ b/R/utils.R
@@ -97,7 +97,7 @@ gen_categories_df <- function(data) {
   return(data_sum)
 }
 
-clean_podcast_df <- function(podcast_dup_df) {
+clean_podcast_df <- function(podcast_dup_df, dev_mode = FALSE) {
   df <- podcast_dup_df |>
     tibble::as_tibble() |>
     mutate(newestItemPubdate = na_if(newestItemPubdate, 0),
@@ -137,5 +137,7 @@ clean_podcast_df <- function(podcast_dup_df) {
     dplyr::select(-newestItemPubdate, -oldestItemPubdate, -createdOn, -lastUpdate) |>
     dplyr::select(imageUrl, podcastGuid, title, url, lastUpdate_p, newestEnclosureDuration, newestItemPubdate_p, oldestItemPubdate_p, episodeCount, everything())
 
+  if (dev_mode) df <- dplyr::slice(df, 1:100)
+
   return(df)
 }
diff --git a/index.qmd b/index.qmd
index 433ccc1..f74e4d5 100644
--- a/index.qmd
+++ b/index.qmd
@@ -1,12 +1,14 @@
 ---
-title: "My Dashboard"
+title: "PodcastIndex Dashboard"
 format:
   dashboard:
     logo: assets/img/brand-icon.svg
     nav-buttons:
       - github
       - icon: mastodon
-        href: https://podcastindex.social
+        href: https://podcastindex.social/@rpodcast
+      - icon: twitter
+        href: https://twitter.com/theRcast
       - icon: broadcast-pin
         href: https://podcastindex.org
     theme: [cosmo, custom.scss]
@@ -23,6 +25,7 @@ params:
 
 ```{r}
 #| context: setup
+#| label: setup-chunk
 
 # load packages
 library(reactable)
@@ -40,7 +43,7 @@ source("R/fct_tables.R")
 pointblank_object <- podcastdb_pointblank_object(url = params$pointblank_object_path, dev_mode = params$dev_mode)
 podcast_dup_df <- podcastdb_dupdf_object(url = params$podcast_dup_df_path, dev_mode = params$dev_mode)
 analysis_metrics_df <- podcastdb_analysisdf_object(url = params$podcast_analysis_df_path, dev_mode = params$dev_mode)
-podcast_db_date <- podcastdb_log_object(root_url = params$podcast_log_path, date = "2024-03-11") |> date_report()
+podcast_db_date <- podcastdb_log_object(root_url = params$podcast_log_path, date = as.character(lubridate::today())) |> date_report()
 ```
 
 # Duplicates
@@ -100,7 +103,7 @@ pb_extracts <- get_data_extracts(pointblank_object)
 pb_extracts <- purrr::map_at(
   pb_extracts,
   c('1', '3', '4', '7'),
-  ~clean_podcast_df(.x)
+  ~clean_podcast_df(.x, dev_mode = params$dev_mode)
 )
 ```
 
@@ -112,6 +115,6 @@ pointblank_table(pointblank_object, report_date = podcast_db_date, extracts = pb
 ```
 :::
 
-# About
+# Methodology
 
-Add more here
\ No newline at end of file
+{{< include methods.md >}}
\ No newline at end of file
diff --git a/methods.md b/methods.md
new file mode 100644
index 0000000..332a823
--- /dev/null
+++ b/methods.md
@@ -0,0 +1,29 @@
+

Introduction

+
+The [Podcast Index](https://podcastindex.org) is an independent and open catalog of podcast feeds serving as the backbone of what is referred to as the Podcasting 2.0 initiative. The data contained in the Podcast Index is available through a robust [REST API](https://podcastindex-org.github.io/docs-api/#overview--libraries) as well as a [SQLite database](https://public.podcastindex.org/podcastindex_feeds.db.tgz) updated every week.
+
+In previous episodes of [Podcasting 2.0](https://podcastindex.org/podcast/920666), Dave Jones lamented that duplicate podcast entries in the Podcast Index can cause annoying issues for many podcast apps and other services relying on the integrity of the index. Seeing an opportunity to help this amazing project, I sent a boost to the show in [episode 156](https://podverse.fm/episode/hLh98zHNo) to offer up a new solution powered by the R statistical computing language for identifying potential duplicates alongside other data quality issues. Hence the objectives of this dashboard are to highlight potential duplicate podcast entries and to run quality assessments of the index that surface other potential issues.
+
+

Duplicates Analysis

+
+The methodology used to identify candidate duplicates is a technique called [record linkage](https://rpubs.com/ahmademad/RecordLinkage). At a high level, record linkage evaluates possible pairwise combinations of records from two data sets (or in the case of de-duplication, a single data set compared to itself) and determines whether a given pair is likely to originate from the same entity. For a data set the size of the Podcast Index, it would be impractical to perform the analysis on all pairwise combinations of records, because the number of pairs grows quadratically with the number of records. Next we describe the techniques for pruning the possible record combinations and the criteria used to categorize a given pair as a possible duplicate. Within R, we leverage the [`{reclin2}`](https://github.com/djvanderlaan/reclin2) package, which strikes a nice balance between performance and logical workflow for performing probabilistic record linkage and de-duplication.
+
+
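To make the scale concrete, here is a small arithmetic sketch. The record counts are illustrative assumptions, not figures taken from the Podcast Index, and `n_pairs()` is a hypothetical helper name.

```r
# Number of distinct unordered pairs among n records: n * (n - 1) / 2.
# n_pairs() is a hypothetical helper used only for illustration.
n_pairs <- function(n) n * (n - 1) / 2

# Illustrative record counts (assumed, not measured from the index).
sizes <- c(1e3, 1e5, 1e6)
pair_counts <- n_pairs(sizes)

# Print without scientific notation so the growth is easy to read.
print(data.frame(records = format(sizes, scientific = FALSE),
                 pairs = format(pair_counts, scientific = FALSE, big.mark = ",")))
```

Doubling the number of records roughly quadruples the number of pairs, which is why pruning candidate pairs before scoring matters.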

Reducing Candidate Pairs

+
+To reduce the candidate record pairs supplied to the record linkage analyses, we apply a technique called **blocking**, which requires a pair of records to agree on one or more variables before it can be moved to further analysis. For this analysis, we are using the **title** and **content hash** variables as the blocking variables.
+
+
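The idea behind blocking can be sketched in base R as a self-join on the blocking variables. This is only an illustration, not the dashboard's pipeline: the toy data, the column names `id`, `title`, and `chash`, and the helper `block_pairs()` are all hypothetical, and we assume here that a pair must agree on both blocking variables.

```r
# Toy feed table (hypothetical values and column names).
feeds <- data.frame(
  id    = 1:5,
  title = c("R Weekly", "R Weekly", "Linux Talk", "R Weekly", "Linux Talk"),
  chash = c("aaa", "aaa", "bbb", "ccc", "bbb"),
  stringsAsFactors = FALSE
)

# Build candidate pairs: join the table to itself on the blocking
# variables, then keep each unordered pair once (id.x < id.y).
block_pairs <- function(df, on) {
  pairs <- merge(df, df, by = on, suffixes = c(".x", ".y"))
  pairs <- pairs[pairs$id.x < pairs$id.y, , drop = FALSE]
  rownames(pairs) <- NULL
  pairs
}

candidates <- block_pairs(feeds, on = c("title", "chash"))
print(candidates)

# Compare with the number of pairs when no blocking is applied.
cat("pairs with blocking:", nrow(candidates),
    "| pairs without blocking:", choose(nrow(feeds), 2), "\n")
```

In this toy table only records that share both a title and a content hash survive as candidates, so the scoring step sees far fewer pairs than the full set.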

Comparing Pairs

+
+With the candidate pairs available, the next step is to derive a similarity score between the records in a given pair based on a set of variables common between the records. Based on advice from the Pod Sage Dave Jones, we use the following variables for comparison:
+
+* URL
+* Newest Enclosure URL
+* Image URL
+
+The statistical method used to derive the similarity score is the [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) metric, which is a great fit for the URL variables. Expressed as a similarity, the metric produces a score ranging from 0 (no match in any of the string characters) to 1 (perfect match between the strings). The algorithm can be customized with a threshold value that gives a cutoff for determining if the two strings are a likely match. For this analysis we use a threshold of 0.95, but this is up for discussion: a lower threshold flags more candidate duplicate groupings at the cost of more false positives, while a higher threshold flags fewer. This tradeoff requires further attention going forward.
+
+With the Jaro-Winkler score calculated, only the pairs with a score of 0.95 or above will be retained for further evaluation.
+
+
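For readers who want to see how the score behaves, the following is a minimal base R sketch of the Jaro-Winkler similarity with the conventional prefix weight of 0.1. It is written for illustration only; the dashboard relies on `{reclin2}` rather than this function, and the name `jaro_winkler()` and the example URLs are hypothetical.

```r
# Jaro-Winkler similarity between two strings (illustrative implementation).
jaro_winkler <- function(a, b, p = 0.1) {
  if (a == b) return(1)
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(0)

  # Characters count as matching only within this sliding window.
  window <- max(0, floor(max(n1, n2) / 2) - 1)
  matched1 <- logical(n1)
  matched2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched1)
  if (m == 0) return(0)

  # Transpositions: matched characters that appear in a different order.
  t <- sum(s1[matched1] != s2[matched2]) / 2
  jaro <- (m / n1 + m / n2 + (m - t) / m) / 3

  # Winkler adjustment: reward a shared prefix of up to four characters.
  k <- 0
  for (i in seq_len(min(4, n1, n2))) {
    if (s1[i] == s2[i]) k <- k + 1 else break
  }
  jaro + k * p * (1 - jaro)
}

# Hypothetical URLs that differ only by a trailing slash.
score <- jaro_winkler("https://example.com/feed.xml",
                      "https://example.com/feed.xml/")
threshold <- 0.95
cat("score:", round(score, 4), "| retained:", score >= threshold, "\n")
```

Because URLs for the same feed often share a long common prefix, small differences such as a trailing slash still score close to 1, while unrelated URLs fall well below the cutoff.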

Derive Duplicate Groups

+
+Once the candidate pairs are pruned with the threshold cutoff, the last step is to organize the potential duplicate records into groups. The dashboard presents each of these groups with the ability to drill down within each group and inspect the records that were considered duplicates.
\ No newline at end of file
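One common way to turn retained pairs into groups is to treat each pair as a link and collect records that are connected directly or through other records. The sketch below does this with a small union-find in base R. Whether the dashboard's pipeline groups records in exactly this way is an assumption, and `group_duplicates()` and the toy ids are hypothetical.

```r
# Assign a group number to every record, given the retained pairs.
# ids: all record ids; id_x / id_y: the two sides of each retained pair.
group_duplicates <- function(ids, id_x, id_y) {
  parent <- seq_along(ids)

  # Follow parent pointers until reaching the representative of a group.
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Merge the groups of the two records in each pair.
  for (k in seq_along(id_x)) {
    root_x <- find_root(match(id_x[k], ids))
    root_y <- find_root(match(id_y[k], ids))
    if (root_x != root_y) parent[max(root_x, root_y)] <- min(root_x, root_y)
  }

  roots <- vapply(seq_along(ids), find_root, integer(1))
  data.frame(id = ids, group = match(roots, unique(roots)))
}

# Toy example: pairs (1,2), (2,4), and (3,5) were retained; record 6 was not paired.
groups <- group_duplicates(ids = 1:6, id_x = c(1, 2, 3), id_y = c(2, 4, 5))
print(groups)

# Keep only groups with more than one member, i.e. the candidate duplicates.
dup_groups <- groups[groups$group %in% names(which(table(groups$group) > 1)), ]
print(dup_groups)
```

Records 1, 2, and 4 end up in the same group even though 1 and 4 were never compared directly, which is the behaviour a drill-down view of duplicate groups would rely on.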