Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix filtering in prepare_financial_data() #32

Merged
merged 2 commits into from
Jul 9, 2024

Conversation

cjyetman
Copy link
Member

@cjyetman cjyetman commented Jul 9, 2024

The problem was that unwanted Equity rows were getting through the filters of the financial data. Equity rows that have NA or 0 for unit_share_price or current_shares_outstanding end up with a NA or 0 value for market capitalization (unit_share_price * current_shares_outstanding), which leads to an NA or Inf value for "ownership weight" when "value invested" is divided by "market_cap". Since "market cap" is necessary information for the ownership weight calculation, we should filter out rows where market capitalization cannot be properly calculated.

library(dplyr)

financial_data <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/financial_data.rds")
entity_info <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/entity_info.rds")

financial_data_no_mkt_cap <- financial_data %>%
  group_by(factset_entity_id, asset_type) %>%
  filter(
    asset_type == "Equity",
    sum(current_shares_outstanding_all_classes) == 0 | sum(unit_share_price) == 0 
  ) %>%
  left_join(entity_info, by = "factset_entity_id") %>%
  filter(!is.na(ar_company_id))

financial_data_no_mkt_cap
#> # A tibble: 23 × 15
#> # Groups:   factset_entity_id, asset_type [22]
#>    isin     unit_share_price current_shares_outst…¹ asset_type factset_entity_id
#>    <chr>               <dbl>                  <dbl> <chr>      <chr>            
#>  1 US92554…            0                 4701690000 Equity     065FWN-E         
#>  2 US02341…           38.8                        0 Equity     05DZHZ-E         
#>  3 US03420…            0                 4919209650 Equity     006JZS-E         
#>  4 US05858…            0.383                      0 Equity     05J60L-E         
#>  5 US29562…            0                  215271000 Equity     09NNL5-E         
#>  6 US39322…            0                  499989996 Equity     05LWKW-E         
#>  7 US69338…            0                43123215171 Equity     002TYS-E         
#>  8 US91349…            0                86188033465 Equity     0045KD-E         
#>  9 US24379…            0                  229374605 Equity     001B1Q-E         
#> 10 AU00000…            1.20                       0 Equity     0K17S2-E         
#> # ℹ 13 more rows
#> # ℹ abbreviated name: ¹​current_shares_outstanding_all_classes
#> # ℹ 10 more variables: company_name <chr>, country_of_domicile <chr>,
#> #   bics_sector_code <chr>, bics_sector <chr>,
#> #   security_bics_subgroup_code <chr>, security_bics_subgroup <chr>,
#> #   security_mapped_sector <chr>, ar_company_id <chr>, credit_parent_id <chr>,
#> #   credit_parent_ar_company_id <chr>

These filters were originally introduced in https://github.com/RMI-PACTA/archive.pacta.data.preparation/pull/226 (sorry @jdhoffa, not trying to point fingers, just want to be super clear about the provenance). It looks like the filters added there did not have any additional intent, they simply did not work completely as expected.


I'm propsing to modify the filter with a dplyr::case_when() which is easier to interpret and maintain. To understand the logic, see the below code. The goals are:

  • for Bonds, keep all rows
  • for Other, keep all rows
  • for Funds, keep only rows that do not have NA for the adj_price
  • for Equity, keep only rows that have a positive value (non-NA, and > 0) for both adj_price and adj_shares_outstanding

The rows of Equity in the filtered result should only have one row where both adj_price and adj_shares_outstanding have a positive value.

library(dplyr)

bonds <- tibble(
  asset_type = c("Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds", "Bonds"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

equity <- tibble(
  asset_type = c("Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity", "Equity"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

funds <- tibble(
  asset_type = c("Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds", "Funds"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

others <- tibble(
  asset_type = c("Others", "Others", "Others", "Others", "Others", "Others", "Others", "Others", "Others"),
  adj_price = c(1, 1, 1, 0, 0 , 0, NA, NA, NA),
  adj_shares_outstanding = c(1, 0, NA, 1, 0, NA, 1, 0, NA)
)

test <- bind_rows(bonds, equity, funds, others)

test %>% print(n = 40)
#> # A tibble: 36 × 3
#>    asset_type adj_price adj_shares_outstanding
#>    <chr>          <dbl>                  <dbl>
#>  1 Bonds              1                      1
#>  2 Bonds              1                      0
#>  3 Bonds              1                     NA
#>  4 Bonds              0                      1
#>  5 Bonds              0                      0
#>  6 Bonds              0                     NA
#>  7 Bonds             NA                      1
#>  8 Bonds             NA                      0
#>  9 Bonds             NA                     NA
#> 10 Equity             1                      1
#> 11 Equity             1                      0
#> 12 Equity             1                     NA
#> 13 Equity             0                      1
#> 14 Equity             0                      0
#> 15 Equity             0                     NA
#> 16 Equity            NA                      1
#> 17 Equity            NA                      0
#> 18 Equity            NA                     NA
#> 19 Funds              1                      1
#> 20 Funds              1                      0
#> 21 Funds              1                     NA
#> 22 Funds              0                      1
#> 23 Funds              0                      0
#> 24 Funds              0                     NA
#> 25 Funds             NA                      1
#> 26 Funds             NA                      0
#> 27 Funds             NA                     NA
#> 28 Others             1                      1
#> 29 Others             1                      0
#> 30 Others             1                     NA
#> 31 Others             0                      1
#> 32 Others             0                      0
#> 33 Others             0                     NA
#> 34 Others            NA                      1
#> 35 Others            NA                      0
#> 36 Others            NA                     NA

test %>% 
  filter(
    case_when(
      asset_type == "Bonds" ~ TRUE,
      asset_type == "Others" ~ TRUE,
      asset_type == "Funds" ~ !is.na(.data$adj_price),
      asset_type == "Equity" ~ .data$adj_price > 0 & .data$adj_shares_outstanding > 0
    )
  ) %>% 
  print(n = 40)
#> # A tibble: 25 × 3
#>    asset_type adj_price adj_shares_outstanding
#>    <chr>          <dbl>                  <dbl>
#>  1 Bonds              1                      1
#>  2 Bonds              1                      0
#>  3 Bonds              1                     NA
#>  4 Bonds              0                      1
#>  5 Bonds              0                      0
#>  6 Bonds              0                     NA
#>  7 Bonds             NA                      1
#>  8 Bonds             NA                      0
#>  9 Bonds             NA                     NA
#> 10 Equity             1                      1
#> 11 Funds              1                      1
#> 12 Funds              1                      0
#> 13 Funds              1                     NA
#> 14 Funds              0                      1
#> 15 Funds              0                      0
#> 16 Funds              0                     NA
#> 17 Others             1                      1
#> 18 Others             1                      0
#> 19 Others             1                     NA
#> 20 Others             0                      1
#> 21 Others             0                      0
#> 22 Others             0                     NA
#> 23 Others            NA                      1
#> 24 Others            NA                      0
#> 25 Others            NA                     NA

Note that .data$adj_price > 0 & .data$adj_shares_outstanding > 0 can return NAs, but dplyr::filter()drops rows where the result isNA(versusTRUEorFALSE`).

library(dplyr)

c(1, 0, NA) > 0
#> [1]  TRUE FALSE    NA

tibble(x = LETTERS[1:3]) %>% filter(c(1, 0, NA) > 0)
#> # A tibble: 1 × 1
#>   x    
#>   <chr>
#> 1 A

Modifying @Antoine-Lalechere's example code to use the new filter (and adjusting the column names appropriately), should yield no rows...

library(dplyr)

financial_data <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/financial_data.rds")
entity_info <- readRDS("~/data/pactadatadev/workflow-data-preparation-outputs/2023Q4_20240701T114132Z/entity_info.rds")

financial_data_no_mkt_cap <- financial_data %>%
  filter(
    case_when(
      asset_type == "Bonds" ~ TRUE,
      asset_type == "Others" ~ TRUE,
      asset_type == "Funds" ~ !is.na(.data$unit_share_price),
      asset_type == "Equity" ~ .data$unit_share_price > 0 & .data$current_shares_outstanding_all_classes > 0
    )
  ) %>% 
  group_by(factset_entity_id, asset_type) %>%
  filter(
    asset_type == "Equity",
    sum(current_shares_outstanding_all_classes) == 0 | sum(unit_share_price) == 0 
  ) %>%
  left_join(entity_info, by = "factset_entity_id") %>%
  filter(!is.na(ar_company_id))

financial_data_no_mkt_cap
#> # A tibble: 0 × 15
#> # Groups:   factset_entity_id, asset_type [0]
#> # ℹ 15 variables: isin <chr>, unit_share_price <dbl>,
#> #   current_shares_outstanding_all_classes <dbl>, asset_type <chr>,
#> #   factset_entity_id <chr>, company_name <chr>, country_of_domicile <chr>,
#> #   bics_sector_code <chr>, bics_sector <chr>,
#> #   security_bics_subgroup_code <chr>, security_bics_subgroup <chr>,
#> #   security_mapped_sector <chr>, ar_company_id <chr>, credit_parent_id <chr>,
#> #   credit_parent_ar_company_id <chr>

Copy link

codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 18.82%. Comparing base (fa6e801) to head (d3d185a).

Files Patch % Lines
R/prepare_financial_data.R 0.00% 5 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #32   +/-   ##
=======================================
  Coverage   18.82%   18.82%           
=======================================
  Files          35       35           
  Lines        1126     1126           
=======================================
  Hits          212      212           
  Misses        914      914           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jdhoffa
Copy link
Member

jdhoffa commented Jul 9, 2024

@cjyetman thanks as always for the careful investigation here, and the comprehensive explanation in this PR. Really solid example of what a good PR should look like, I appreciate it.

Reviewing now.

Copy link
Member

@jdhoffa jdhoffa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Comment on lines +13 to +14
#' @importFrom dplyr case_when
#'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB: Agree that case_when is a more appropriate solution here, and much more interpretable for future us.

Copy link

@Antoine-Lalechere Antoine-Lalechere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jdhoffa jdhoffa merged commit 0b3d197 into main Jul 9, 2024
9 checks passed
@jdhoffa jdhoffa deleted the financial-data-filtering-fix branch July 9, 2024 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants