
inpatient days mismatch #36

Open
2miatran opened this issue Apr 26, 2021 · 2 comments

@2miatran

Hello, when running the model, I found that the value of Inpatient Days does not match what I see in the original claims input file, e.g. patients with no inpatient visits have 24 inpatient days, or vice versa. After debugging, the problem seems to lie in the part where inpatient_days is created with claim_df['personId'] as its index: this only picks the values of date_diff whose index label equals a personId.

    preprocessed_df['# of Admissions (12M)'] = inpatient_rows.groupby('personId').admitDate.nunique()
    date_diff = pd.to_timedelta(inpatient_rows['dischargeDate'].dt.date - inpatient_rows['admitDate'].dt.date)
    inpatient_days = pd.Series(date_diff.dt.days, index=claim_df['personId'])
    preprocessed_df['Inpatient Days'] = inpatient_days.groupby('personId').sum()

Example of date_diff:
date_diff.dt.days
10 8
29 2
53 2
56 9
60 2
..
1333281 3
1333325 2 --> if there were a personId == 1333325, that patient's inpatient days would be 2, even though this label is a row index of claim_df and is unrelated to personId.
1333336 10
1333337 5
1333340 5
Length: 74609, dtype: int64


The claim_df and demo_df were set up as suggested:

  • demo_df has one unique row for each patient, with age and gender.
  • claim_df has one or more rows for each patient (only patients with claims are included).

Please let me know if you have any suggestions. Thank you.
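The mismatch described above can be reproduced on a toy table. This is a minimal sketch, not the project's preprocessing code: the `inpatient` flag column and all values are made up for illustration, and the length of stay is computed with a simplified date subtraction. It relies on the pandas behavior that passing a Series as data together with `index=` reindexes the data by label.

```python
import pandas as pd

# Toy claims table. The 'inpatient' flag column is hypothetical; the real
# preprocessing may select inpatient rows differently.
claim_df = pd.DataFrame({
    "personId": [5, 5, 0, 2],
    "inpatient": [True, False, True, False],
    "admitDate": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
    "dischargeDate": pd.to_datetime(
        ["2020-01-04", "2020-02-01", "2020-03-11", "2020-04-01"]),
})
inpatient_rows = claim_df[claim_df["inpatient"]]

# Length of stay per inpatient row. The index holds claim_df row labels
# (0 and 2 here), not person ids.
days = (inpatient_rows["dischargeDate"] - inpatient_rows["admitDate"]).dt.days

# Series data plus index=... is a label lookup: every personId is searched
# for among the row labels of `days`.
misaligned = pd.Series(days, index=claim_df["personId"])
buggy_total = misaligned.groupby(level=0).sum()

# Grouping the row-aligned values by the personId column keeps each stay
# paired with the patient it belongs to.
expected_total = days.groupby(inpatient_rows["personId"]).sum()

print(buggy_total)
print(expected_total)
```

In this toy data, person 2 has no inpatient rows but is credited with the 10-day stay stored at row label 2, while person 5, who does have a 3-day stay, ends up with 0.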
@DaveDeCaprio
Contributor

If you make this change, does it work correctly?

inpatient_days = pd.Series(date_diff.dt.days, index=inpatient_rows['personId'])
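One thing that may be worth checking with a change like this: when the data argument is itself a Series, pandas still looks values up by label, so the result depends on whether person ids happen to coincide with row labels. The sketch below uses a made-up two-row table (the `days` column stands in for `date_diff.dt.days`) and compares the label lookup with attaching the raw values in row order. It is an illustration under those assumptions, not the fix that was adopted in the project.

```python
import pandas as pd

# Hypothetical inpatient rows. The index holds claim_df row labels,
# not person ids.
inpatient_rows = pd.DataFrame(
    {"personId": [5, 0], "days": [3, 10]},
    index=[0, 2],
)
days = inpatient_rows["days"]

# Series data + index=...: values are looked up by label, so personId 0
# picks up the value stored at row label 0 and personId 5 finds nothing.
by_label = pd.Series(days, index=inpatient_rows["personId"])

# Raw values + index=...: values are attached in row order, with no lookup.
by_position = pd.Series(days.to_numpy(), index=inpatient_rows["personId"])
inpatient_days = by_position.groupby(level=0).sum()

print(by_label)
print(inpatient_days)
```

An equivalent option that avoids building an intermediate Series is `days.groupby(inpatient_rows["personId"]).sum()`, which groups the row-aligned values directly.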

@2miatran
Author

Thanks, I have already modified the code in the meantime, but I was wondering whether this affects how the test set's "inpatient days" feature was created (if it was created the same way) and then used to generate the risk_score distribution, as described here:

risk_score - The percentile that indicates where this prediction lies in the distribution of predictions on the test set. A value of 95 indicates that the prediction was higher than 95% of the test population, which was designed to be representative of the overall US population.
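To make the concern concrete, here is a rough sketch of how a percentile of this kind could be computed. It is an assumption for illustration, not the library's actual implementation: the function name `percentile_rank` and the test-set predictions are invented.

```python
from bisect import bisect_left


def percentile_rank(prediction, test_predictions):
    """Percent of test-set predictions strictly below `prediction`."""
    ordered = sorted(test_predictions)
    below = bisect_left(ordered, prediction)
    return 100.0 * below / len(ordered)


# Made-up test-set predictions, for illustration only.
test_predictions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
score = percentile_rank(0.95, test_predictions)
print(score)
```

Under this reading, the reference distribution comes entirely from predictions on the test set, so a feature computed incorrectly there would shift the percentile assigned to every new prediction.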

Additionally, we observed this difference, but just to confirm: does the xgboost_all_age model give a higher risk_score than the xgboost model, which was trained on Medicare members only? Have you compared the risk_score of the two models on the same population, for example Medicare members?

Thank you.
