feat/sample data for demos #96

Closed
wants to merge 1 commit into from

K-Beicher
Contributor

@K-Beicher K-Beicher commented Jan 10, 2024

This PR adds three CSV files created with Synthea for demonstrating Seedcase (Sprout for now).

Link to data description: https://github.com/synthetichealth/synthea/wiki/CSV-File-Data-Dictionary

@K-Beicher K-Beicher requested a review from a team as a code owner January 10, 2024 14:28
@K-Beicher K-Beicher linked an issue Jan 10, 2024 that may be closed by this pull request
@signekb signekb changed the title add three csv files feat/sample data for demos Jan 10, 2024
Member

@lwjohnst86 lwjohnst86 left a comment

Not sure this is the best repo for this. It will definitely need to be easily accessible in the sprout repo.

I made some comments on the data files. Could you also add a README or some file with instructions on what you did to get that data?

Member

This data we don't need. You can delete this.

Contributor Author

I brought it over so that we would have a small data set that we can use to work on primary and foreign keys. I agree that we don't need that type of data, but the rest of the files are too big to bring over.

Member

There are a lot of variables in here that we don't need at all, like name, address, driver's license, SSN, passport, race, birthplace, lat, and lon. Basically any really sensitive/specific "personally identifying" information, since Sprout isn't designed around those use cases.
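
As a rough illustration of dropping those variables, here is a minimal sketch using Python's standard `csv` module. The column names in `SENSITIVE_COLUMNS` are assumptions based on the Synthea CSV data dictionary, and `drop_columns` is a hypothetical helper, not something that exists in this PR; check both against the actual header row.

```python
import csv
import io

# Assumed column names (from the Synthea CSV data dictionary); adjust them
# to match the header row of the real file.
SENSITIVE_COLUMNS = {
    "SSN", "DRIVERS", "PASSPORT", "PREFIX", "FIRST", "LAST", "SUFFIX",
    "MAIDEN", "RACE", "BIRTHPLACE", "ADDRESS", "LAT", "LON",
}


def drop_columns(source, target, columns_to_drop):
    """Copy CSV rows from source to target, omitting the named columns.

    source and target are open text file objects. Returns the kept columns.
    """
    reader = csv.DictReader(source)
    kept = [name for name in reader.fieldnames if name not in columns_to_drop]
    # extrasaction="ignore" makes the writer skip the dropped columns.
    writer = csv.DictWriter(
        target, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Small in-memory example; real use would pass files opened with newline="".
example = io.StringIO("Id,SSN,GENDER,LAT\n1,999-00-0000,F,42.0\n")
cleaned = io.StringIO()
drop_columns(example, cleaned, SENSITIVE_COLUMNS)
print(cleaned.getvalue())
```

Columns that are not listed in `SENSITIVE_COLUMNS` pass through unchanged, so the set can be extended as more variables are judged unnecessary.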

Member

This is fairly ok, but the dataset is much too big. Most likely, the ID and Encounter variables are taking up most of the space. Is there any way to use smaller/simpler ID formats? It looks like UUIDs or GUIDs are used, which are excessive for our purposes here.
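
One possible way to get simpler IDs would be a post-processing step outside Synthea that swaps each long ID for a small integer, using the same mapping in every file so that primary and foreign keys still line up. The sketch below is only an illustration under that assumption; `make_id_mapper` and `shorten_ids` are hypothetical helpers, and the `Id` and `PATIENT` column names are assumed from the Synthea CSV data dictionary.

```python
from itertools import count


def make_id_mapper():
    """Return a function that maps each distinct long ID to a small integer."""
    mapping = {}
    counter = count(1)

    def shorten(long_id):
        # Assign the next integer the first time an ID is seen, then reuse it.
        if long_id not in mapping:
            mapping[long_id] = next(counter)
        return mapping[long_id]

    return shorten


def shorten_ids(rows, column, shorten):
    """Return copies of rows with the given column replaced by short IDs."""
    return [{**row, column: shorten(row[column])} for row in rows]


# Placeholder values stand in for the long IDs in the real files.
to_short = make_id_mapper()
patients = [{"Id": "long-id-a"}, {"Id": "long-id-b"}]
encounters = [{"PATIENT": "long-id-b"}, {"PATIENT": "long-id-a"}]

short_patients = shorten_ids(patients, "Id", to_short)
short_encounters = shorten_ids(encounters, "PATIENT", to_short)
```

Because both tables share one mapper, an encounter that pointed at a given patient before the change still points at the same patient afterwards.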

Contributor Author

Unfortunately no. The only way to bring it down is to go down in number of participants. I suspect that if we ran it with only 1000 patients we'd get a more manageable data set.

@K-Beicher
Contributor Author

This has been superseded by the creation of the Data repo.

@K-Beicher K-Beicher closed this Jan 15, 2024
@K-Beicher
Contributor Author

@lwjohnst86 will you remove the branch associated with this, please?

@lwjohnst86
Member

Yes for sure!

Development

Successfully merging this pull request may close these issues.

Potential example dataset to use for testing/demo'ing seedcase
2 participants