fix filtering in `prepare_financial_data()` #32

cjyetman · 2024-07-09T09:12:57Z

The problem was that unwanted Equity rows were getting through the filters on the financial data. Equity rows that have NA or 0 for unit_share_price or current_shares_outstanding end up with an NA or 0 value for market capitalization (unit_share_price * current_shares_outstanding), which leads to an NA or Inf value for "ownership weight" when "value invested" is divided by "market_cap". Since market capitalization is required for the ownership weight calculation, we should filter out rows where it cannot be properly calculated.
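As a minimal sketch of the arithmetic described above (base R only; the numbers are made up for illustration and are not taken from the real financial data), a zero or NA in either input propagates into the ownership weight:

```r
# Illustrative values only; not taken from the real financial data.
value_invested <- 1000
unit_share_price <- c(10, 0, NA, 10)
current_shares_outstanding <- c(500, 500, 500, 0)

# market cap is price times shares, so a 0 or NA in either input carries through
market_cap <- unit_share_price * current_shares_outstanding
market_cap
#> [1] 5000    0   NA    0

# dividing by a market cap of 0 gives Inf; dividing by NA gives NA
ownership_weight <- value_invested / market_cap
ownership_weight
#> [1] 0.2 Inf  NA Inf
```

Only the first row, where both inputs are positive, gives a usable ownership weight.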

library(dplyr)

financial_data <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/financial_data.rds")
entity_info <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/entity_info.rds")

financial_data_no_mkt_cap <- financial_data %>%
  group_by(factset_entity_id, asset_type) %>%
  filter(
    asset_type == "Equity",
    sum(current_shares_outstanding_all_classes) == 0 | sum(unit_share_price) == 0 
  ) %>%
  left_join(entity_info, by = "factset_entity_id") %>%
  filter(!is.na(ar_company_id))

financial_data_no_mkt_cap
#> # A tibble: 23 × 15
#> # Groups:   factset_entity_id, asset_type [22]
#>    isin     unit_share_price current_shares_outst…¹ asset_type factset_entity_id
#>    <chr>               <dbl>                  <dbl> <chr>      <chr>            
#>  1 US92554…            0                 4701690000 Equity     065FWN-E         
#>  2 US02341…           38.8                        0 Equity     05DZHZ-E         
#>  3 US03420…            0                 4919209650 Equity     006JZS-E         
#>  4 US05858…            0.383                      0 Equity     05J60L-E         
#>  5 US29562…            0                  215271000 Equity     09NNL5-E         
#>  6 US39322…            0                  499989996 Equity     05LWKW-E         
#>  7 US69338…            0                43123215171 Equity     002TYS-E         
#>  8 US91349…            0                86188033465 Equity     0045KD-E         
#>  9 US24379…            0                  229374605 Equity     001B1Q-E         
#> 10 AU00000…            1.20                       0 Equity     0K17S2-E         
#> # ℹ 13 more rows
#> # ℹ abbreviated name: ¹current_shares_outstanding_all_classes
#> # ℹ 10 more variables: company_name <chr>, country_of_domicile <chr>,
#> #   bics_sector_code <chr>, bics_sector <chr>,
#> #   security_bics_subgroup_code <chr>, security_bics_subgroup <chr>,
#> #   security_mapped_sector <chr>, ar_company_id <chr>, credit_parent_id <chr>,
#> #   credit_parent_ar_company_id <chr>

These filters were originally introduced in https://github.com/RMI-PACTA/archive.pacta.data.preparation/pull/226 (sorry @jdhoffa, not trying to point fingers, just want to be super clear about the provenance). It looks like the filters added there did not have any additional intent; they simply did not work completely as expected.

I'm proposing to modify the filter with a dplyr::case_when(), which is easier to interpret and maintain. To understand the logic, see the code below. The goals are:

for Bonds, keep all rows
for Others, keep all rows
for Funds, keep only rows that do not have NA for the adj_price
for Equity, keep only rows that have a positive value (non-NA, and > 0) for both adj_price and adj_shares_outstanding

In the filtered result, Equity should have only one row: the one where both adj_price and adj_shares_outstanding have a positive value.

library(dplyr)

bonds <- tibble(
  asset_type = c("Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

equity <- tibble(
  asset_type = c("Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

funds <- tibble(
  asset_type = c("Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

others <- tibble(
  asset_type = c("Others", "Others", "Others", "Others", "Others", "Others", "Others", "Others", "Others"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

test <- bind_rows(bonds, equity, funds, others)

test %>% print(n = 40)
#> # A tibble: 36 × 3
#>    asset_type adj_price adj_shares_outstanding
#>    <chr>          <dbl>                  <dbl>
#>  1 Bonds              1                      1
#>  2 Bonds              1                      0
#>  3 Bonds              1                     NA
#>  4 Bonds              0                      1
#>  5 Bonds              0                      0
#>  6 Bonds              0                     NA
#>  7 Bonds             NA                      1
#>  8 Bonds             NA                      0
#>  9 Bonds             NA                     NA
#> 10 Equity             1                      1
#> 11 Equity             1                      0
#> 12 Equity             1                     NA
#> 13 Equity             0                      1
#> 14 Equity             0                      0
#> 15 Equity             0                     NA
#> 16 Equity            NA                      1
#> 17 Equity            NA                      0
#> 18 Equity            NA                     NA
#> 19 Funds              1                      1
#> 20 Funds              1                      0
#> 21 Funds              1                     NA
#> 22 Funds              0                      1
#> 23 Funds              0                      0
#> 24 Funds              0                     NA
#> 25 Funds             NA                      1
#> 26 Funds             NA                      0
#> 27 Funds             NA                     NA
#> 28 Others             1                      1
#> 29 Others             1                      0
#> 30 Others             1                     NA
#> 31 Others             0                      1
#> 32 Others             0                      0
#> 33 Others             0                     NA
#> 34 Others            NA                      1
#> 35 Others            NA                      0
#> 36 Others            NA                     NA

test %>% 
  filter(
    case_when(
      asset_type == "Bonds" ~ TRUE,
      asset_type == "Others" ~ TRUE,
      asset_type == "Funds" ~ !is.na(.data$adj_price),
      asset_type == "Equity" ~ .data$adj_price > 0 & .data$adj_shares_outstanding > 0
    )
  ) %>% 
  print(n = 40)
#> # A tibble: 25 × 3
#>    asset_type adj_price adj_shares_outstanding
#>    <chr>          <dbl>                  <dbl>
#>  1 Bonds              1                      1
#>  2 Bonds              1                      0
#>  3 Bonds              1                     NA
#>  4 Bonds              0                      1
#>  5 Bonds              0                      0
#>  6 Bonds              0                     NA
#>  7 Bonds             NA                      1
#>  8 Bonds             NA                      0
#>  9 Bonds             NA                     NA
#> 10 Equity             1                      1
#> 11 Funds              1                      1
#> 12 Funds              1                      0
#> 13 Funds              1                     NA
#> 14 Funds              0                      1
#> 15 Funds              0                      0
#> 16 Funds              0                     NA
#> 17 Others             1                      1
#> 18 Others             1                      0
#> 19 Others             1                     NA
#> 20 Others             0                      1
#> 21 Others             0                      0
#> 22 Others             0                     NA
#> 23 Others            NA                      1
#> 24 Others            NA                      0
#> 25 Others            NA                     NA

Note that .data$adj_price > 0 & .data$adj_shares_outstanding > 0 can return NAs, but dplyr::filter() drops rows where the result is NA (just as it does for FALSE), so only rows where the result is TRUE are kept.

library(dplyr)

c(1, 0, NA) > 0
#> [1]  TRUE FALSE    NA

tibble(x = LETTERS[1:3]) %>% filter(c(1, 0, NA) > 0)
#> # A tibble: 1 × 1
#>   x    
#>   <chr>
#> 1 A
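A related consequence, sketched here under the assumption that an asset type outside the four listed ones could appear in the data: dplyr::case_when() returns NA when no condition matches, so the filter would drop such rows as well. "Loans" below is a made-up asset type used only to show the fall-through case.

```r
library(dplyr)

# "Loans" is a made-up asset type, used only to show the fall-through case.
asset_type <- c("Bonds", "Loans")

# no condition matches "Loans", so case_when() returns NA for it
keep <- case_when(
  asset_type == "Bonds" ~ TRUE
)
keep
#> [1] TRUE   NA

# filter() treats that NA like FALSE, so only the "Bonds" row survives
nrow(filter(tibble(asset_type = asset_type), keep))
#> [1] 1
```

If rows with an unlisted asset type were meant to be kept instead, a final catch-all condition such as TRUE ~ TRUE would be needed; whether that is desirable here is not something this PR states.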

Modifying @Antoine-Lalechere's example code to use the new filter (and adjusting the column names appropriately) should yield no rows:

library(dplyr)

financial_data <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/financial_data.rds")
entity_info <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/entity_info.rds")

financial_data_no_mkt_cap <- financial_data %>%
  filter(
    case_when(
      asset_type == "Bonds" ~ TRUE,
      asset_type == "Others" ~ TRUE,
      asset_type == "Funds" ~ !is.na(.data$unit_share_price),
      asset_type == "Equity" ~ .data$unit_share_price > 0 & .data$current_shares_outstanding_all_classes > 0
    )
  ) %>% 
  group_by(factset_entity_id, asset_type) %>%
  filter(
    asset_type == "Equity",
    sum(current_shares_outstanding_all_classes) == 0 | sum(unit_share_price) == 0 
  ) %>%
  left_join(entity_info, by = "factset_entity_id") %>%
  filter(!is.na(ar_company_id))

financial_data_no_mkt_cap
#> # A tibble: 0 × 15
#> # Groups:   factset_entity_id, asset_type [0]
#> # ℹ 15 variables: isin <chr>, unit_share_price <dbl>,
#> #   current_shares_outstanding_all_classes <dbl>, asset_type <chr>,
#> #   factset_entity_id <chr>, company_name <chr>, country_of_domicile <chr>,
#> #   bics_sector_code <chr>, bics_sector <chr>,
#> #   security_bics_subgroup_code <chr>, security_bics_subgroup <chr>,
#> #   security_mapped_sector <chr>, ar_company_id <chr>, credit_parent_id <chr>,
#> #   credit_parent_ar_company_id <chr>
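The goals listed earlier can also be written down as assertions on the toy grid. This is only a sketch: keep_rows_with_usable_price_data() is a hypothetical helper that wraps the proposed filter for illustration, not a function in the package, and the grid is built with base expand.grid() rather than the four tibbles above.

```r
library(dplyr)

# Hypothetical helper (not part of the package): wraps the proposed filter.
keep_rows_with_usable_price_data <- function(data) {
  data %>%
    filter(
      case_when(
        asset_type == "Bonds" ~ TRUE,
        asset_type == "Others" ~ TRUE,
        asset_type == "Funds" ~ !is.na(adj_price),
        asset_type == "Equity" ~ adj_price > 0 & adj_shares_outstanding > 0
      )
    )
}

# every combination of 1 / 0 / NA for both columns, for each asset type
grid <- expand.grid(
  adj_shares_outstanding = c(1, 0, NA),
  adj_price = c(1, 0, NA),
  asset_type = c("Bonds", "Equity", "Funds", "Others"),
  stringsAsFactors = FALSE
)

kept <- keep_rows_with_usable_price_data(grid)

# number of surviving rows per asset type
n_kept <- table(kept$asset_type)

stopifnot(
  n_kept[["Bonds"]] == 9,   # all rows kept
  n_kept[["Others"]] == 9,  # all rows kept
  n_kept[["Funds"]] == 6,   # only rows with a non-NA price
  n_kept[["Equity"]] == 1   # only the row with positive price and shares
)
```

The expected counts match the 25-row result shown for the toy data earlier in this description.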

codecov · 2024-07-09T09:15:50Z

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 18.82%. Comparing base (fa6e801) to head (d3d185a).

Files	Patch %	Lines
R/prepare_financial_data.R	0.00%	5 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #32   +/-   ##
=======================================
  Coverage   18.82%   18.82%           
=======================================
  Files          35       35           
  Lines        1126     1126           
=======================================
  Hits          212      212           
  Misses        914      914


jdhoffa · 2024-07-09T09:41:09Z

@cjyetman thanks as always for the careful investigation here, and the comprehensive explanation in this PR. Really solid example of what a good PR should look like, I appreciate it.

Reviewing now.

jdhoffa

lgtm

jdhoffa · 2024-07-09T09:41:46Z

R/prepare_financial_data.R

+#' @importFrom dplyr case_when
+#'


NB: Agree that case_when is a more appropriate solution here, and much more interpretable for future us.

Antoine-Lalechere

LGTM

fix filtering in prepare_financial_data()

951e6a2

import dplyr::case_when

d3d185a

cjyetman requested review from jdhoffa and Antoine-Lalechere July 9, 2024 09:40

jdhoffa approved these changes Jul 9, 2024


Antoine-Lalechere approved these changes Jul 9, 2024


jdhoffa merged commit 0b3d197 into main Jul 9, 2024
9 checks passed

jdhoffa deleted the financial-data-filtering-fix branch July 9, 2024 10:38

This was referenced Jul 9, 2024

config: point pa2024ch to data prep with new weo 2023 scenarios RMI-PACTA/workflow.transition.monitor#333

Merged

Build/336 update peer data RMI-PACTA/workflow.transition.monitor#340

Merged
