Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for User Stats Pipeline to Handle Both Scenarios (New and No New Data) #1006

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

TeachMeTW
Copy link
Contributor

@TeachMeTW TeachMeTW commented Dec 24, 2024

Summary

This PR addresses a fix to ensure that the user stats pipeline behaves as expected in two specific scenarios:

  1. When no new data is added: The stats should remain unchanged after re-running the pipeline.
  2. When new data is added: The stats should update correctly to reflect the new data.

While implementing the fix, I identified the need for unit tests to ensure these behaviors are consistently verified.

Changes Made

  • Added checks for both scenarios to improve pipeline reliability and prevent unexpected stat updates.

Unit Tests

  1. Scenario (i): Stats remain unchanged when the pipeline is re-run without new data.
  2. Scenario (ii): Stats update correctly when new data is added, ensuring consistency with expected values.

The logic for these tests can be based on the example methods provided:

  • testGetAndStoreUserStatsSecondRunNoNewData: Verifies no changes to stats when re-run without new data.
  • testGetAndStoreUserStatsNewData: Verifies correct updates when new data is added.

See #1005

@TeachMeTW TeachMeTW changed the title Added unit tests to verify that the stats are generated in both cases… Fix for User Stats Pipeline to Handle Both Scenarios (New and No New Data) Dec 25, 2024
@shankari
Copy link
Contributor

shankari commented Jan 6, 2025

@TeachMeTW just to clarify, this PR does not incorporate the fix to the two use cases - that was fixed earlier.
This PR adds two new test cases, and passes in the trip_key as an input instead of hardcoding analysis/confirmed_trip

  • Please split the two unrelated changes into two separate commits
  • In the current two commits that you have, the first one adds testGetAndStoreUserStatsNewData, and the second one removes it. Please clean up the commit history to avoid unnecessary churn.

@TeachMeTW TeachMeTW force-pushed the Add-Further-Tests-To-User-Stats branch 2 times, most recently from 7e613fb to 19118ca Compare January 6, 2025 17:18
@TeachMeTW
Copy link
Contributor Author

@shankari I thought incorporating the fix meant fixing the unit test, hence the two new test cases.

I have split the two unrelated changes into two commits and cleaned up the commit history as requested.

@shankari
Copy link
Contributor

shankari commented Jan 6, 2025

This is correct, so I am happy to merge it.
But you also need to test what happens when these entries are not available in the profile.
e.g. between the time that the user registers and the pipeline runs for the first time, if the dashboard tries to access it, what happens?

@shankari
Copy link
Contributor

shankari commented Jan 6, 2025

I thought incorporating the fix meant fixing the unit test, hence the two new test cases.

Well, the fix is already done, but I guess you can argue that if it is not tested, it is not done 😄

@shankari
Copy link
Contributor

shankari commented Jan 6, 2025

Aha! After I pushed c77c5c5 related to using the input key, the tests started failing
https://github.com/e-mission/e-mission-server/actions/runs/12638957212/job/35216284230

with this error

web-server-1  | ======================================================================
web-server-1  | FAIL: testGetAndStoreUserStats (analysisTests.intakeTests.TestUserStat.TestUserStats)
web-server-1  | Test get_and_store_user_stats for the user to ensure that user statistics
web-server-1  | ----------------------------------------------------------------------
web-server-1  | Traceback (most recent call last):
web-server-1  |   File "/src/e-mission-server/emission/tests/analysisTests/intakeTests/TestUserStat.py", line 81, in testGetAndStoreUserStats
web-server-1  |     self.assertEqual(profile["total_trips"], expected_total_trips,
web-server-1  | AssertionError: 8 != 5 : Expected total_trips to be 5, got 8
web-server-1  | 
web-server-1  | ----------------------------------------------------------------------

@TeachMeTW if a test fails, you cannot arbitrarily modify the tests so that it passes. That breaks the entire point of having automated tests, which is to catch regressions.

We need to investigate and indentify why we are getting different results for confirmed and composite trips.
Also, the change in the number of expected trips was part of the key_list change and should have been included in that commit and not in the commit related to expanding the unit tests.

@TeachMeTW do you have an ETA for the investigation to be complete??

@TeachMeTW
Copy link
Contributor Author

@shankari I am looking into it at the moment

@TeachMeTW
Copy link
Contributor Author

@shankari That is because we WERE using confirmed_trip rather than composite_trip

        total_trips = ts.find_entries_count(key_list=["analysis/confirmed_trip"])
        labeled_trips = ts.find_entries_count(
            key_list=["analysis/confirmed_trip"],
            extra_query_list=[{'data.user_input': {'$ne': {}}}]
        )

During your refactor, you modified the statement to use composite_trip whereas the get_store_user_stats was hardcoded to confirmed_trip hence the mismatch:

        try:
            run_intake_pipeline_for_user(uuid, skip_if_no_new_data)
            with ect.Timer() as gsr:
                logging.info("*" * 10 + "UUID %s: storing user stats " % uuid + "*" * 10)
                print(str(arrow.now()) + "*" * 10 + "UUID %s: storing user stats " % uuid + "*" * 10)
                eaurs.get_and_store_user_stats(uuid, "analysis/composite_trip")
            esds.store_pipeline_time(uuid, 'STORE_USER_STATS',
                                time.time(), gsr.elapsed)

@TeachMeTW
Copy link
Contributor Author

@shankari By changing the internal get_and_store_user_stat to analysis/composite_trip rather than analysis/confirmed_trips, it matches with the 8 expected. Hence I believe 8 is the ideal value and not 5.

@TeachMeTW
Copy link
Contributor Author

@shankari Based on my investigation:
The discrepancy arises due to a change in how trips are categorized during the refactor. Initially, the get_and_store_user_stats function was hardcoded to work with analysis/confirmed_trip. During the refactor, the key was updated to analysis/composite_trip. This change impacts the test because the expected_total_trips was calculated based on the count of confirmed_trip, but the actual computation now uses composite_trip. The two trip categories have different definitions and inclusions, which led to the discrepancy in the trip count.

Why I believe 8 is the Correct Value:

  • After updating the get_and_store_user_stats function to use analysis/composite_trip, the actual result of 8 trips aligns with the broader trip categorization introduced in the refactor.
  • This indicates that the value of 8 is correct under the new logic, reflecting the intended behavior of the refactored function.

Proposed Solution:

  • Modify the test description or comments to explicitly state the use of composite_trip to avoid confusion in the future.

Next Steps:

I will update the test to align with the composite_trip logic and include detailed comments to document the reasoning. Also going to modify the commit message so. Let me know if you have further concerns or if there’s anything else you’d like me to address.

@shankari
Copy link
Contributor

shankari commented Jan 6, 2025

@TeachMeTW the data model is such that we should have a 1:1 mapping between confirmed and composite trips.
We create composite trips by taking confirmed trips and adding on the other components of trips.

By changing the internal get_and_store_user_stat to analysis/composite_trip rather than analysis/confirmed_trips, it matches with the 8 expected. Hence I believe 8 is the ideal value and not 5.

This is not expected and needs more explanation

the actual result of 8 trips aligns with the broader trip categorization introduced in the refactor.

I am not sure what this means. **This mismatch is not expected. **. You can look at the code that creates composite objects - it reads the confirmed trips and creates composite trips from them.

For the record, I don't need the summary here: #1006 (comment) the earlier comments are sufficient

@TeachMeTW
Copy link
Contributor Author

TeachMeTW commented Jan 6, 2025

@shankari

@TeachMeTW the data model is such that we should have a 1:1 mapping between confirmed and composite trips. We create composite trips by taking confirmed trips and adding on the other components of trips.

By changing the internal get_and_store_user_stat to analysis/composite_trip rather than analysis/confirmed_trips, it matches with the 8 expected. Hence I believe 8 is the ideal value and not 5.

This is not expected and needs more explanation

I will look into it.

@TeachMeTW
Copy link
Contributor Author

@shankari So as we

should have a 1:1 mapping between confirmed and composite trips. We create composite trips by taking confirmed trips
and adding on the other components of trips.

Does that mean confirmed trips is the ground truth in terms of length/amount?

@TeachMeTW
Copy link
Contributor Author

I added debugging and logging with my changes and got to this conclusion:

INFO:root:User UUID-xyz-abc-bla-bla-bla Summary:
INFO:root:Confirmed Places: 9
INFO:root:Confirmed Trips: 5
INFO:root:Composite Trips: 5

@TeachMeTW
Copy link
Contributor Author

Changes:

triplikeEntries = list(ts.find_entries([esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY], time_query=time_query))

to

triplikeEntries = list(ts.find_entries([esda.CONFIRMED_TRIP_KEY], time_query=time_query))

Removed esda.CONFIRMED_UNTRACKED_KEY to ensure that only Confirmed Trips are processed for creating Composite Trips.



statuscheck["curr_confirmed_trip_count"] = db.count_documents({
   "metadata.key": esda.CONFIRMED_TRIP_KEY,
   "user_id": user_id
})

Introduced a new count to track the number of Confirmed Trips per user.


if needs_hack:
  assert statuscheck["curr_composite_trip_count"] == 0, "Current composite trip count is not zero."
  estbt.BuiltinTimeSeries.update(ecwe.Entry(ct))

Added an assertion to verify that the current composite trip count is zero before updating, preventing duplication and ensuring alignment.



[esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY] to [esda.CONFIRMED_TRIP_KEY]
Eliminated the inclusion of esda.CONFIRMED_UNTRACKED_KEY to prevent Composite Trips from being created based on untracked trip data.


Questions

@shankari I just had a question in regards to the 1:1 mapping? If we are adding to it, how is it keeping a 1:1 ratio? I am not sure if ommiting the untracked_key is the good way to go about it. Let me know your thoughts before I push this commit. Currently the test works with 5 as the expected trip. The debug statement was above.

@shankari
Copy link
Contributor

shankari commented Jan 7, 2025

@TeachMeTW first, identify the problem, then think about potential fixes, and then make the changes.

Aha! This is the problem. Good job identifying it.

triplikeEntries = list(ts.find_entries([esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY], time_query=time_query))

One potential fix is indeed to change the triplikeEntries. But the app also uses composite trips (which is why we want to keep it warm), so changing this data structure will break untracked time.

Do you have thoughts for an alternate fix? I can think of at least one other option. Please think through all the options, and list their pros and cons first.

Please do not make changes without thinking through several pros and cons

@shankari
Copy link
Contributor

shankari commented Jan 7, 2025

One option is also just to take the larger number, but change the label to "trips and untracked time", but I am concerned that will confuse the admins.

@TeachMeTW
Copy link
Contributor Author

@shankari I will take a look at these options and explore the pros and cons as suggested; will see if I can uncover a pssobile fix

@TeachMeTW
Copy link
Contributor Author

TeachMeTW commented Jan 7, 2025

@shankari

Here are some options, let me know your thoughts. This is based on how I understood the issue at hand:

Add a metadata tag

  • When creating a composite trip, add metadata indicating whether it originated from a confirmed trip or an untracked period.
  • When counting or comparing confirmed trips to composite trips, filter out composite trips that originated from untracked data.

Pros:

  • Selective Counting: Allows accurate matching of confirmed trips to their composite counterparts without losing untracked trip processing.
  • Flexibility: You can still generate composite trips for untracked data, but they won't interfere with the confirmed trip count.
  • Transparency: Admins can later view details about which composites came from untracked time if needed.

Cons:

  • Additional Complexity: Requires adding a metadata field and updating filtering logic.
  • Implementation Overhead: Slight changes to data structure and queries are necessary.

Maintain Separate Counts and Labels

  • Continue creating composite trips from both confirmed and untracked data.
  • Adjust the reporting interface to display two distinct numbers:
  • One for confirmed trips (and their composites).
  • Another for untracked time composites.

Pros:

  • Clarity in Data: Admins can see how many composites come from confirmed trips versus untracked periods.
  • No Data Filtering Needed: No need to change the creation logic or add metadata; just adjust the reporting.

Cons:

  • It gets muddy by multiple similar metrics, potentially leading to confusion.

Change Labeling to "Trips and Untracked Time"

  • Leave the data structure as is.
  • Simply change the labels in the admin reports to reflect that the count includes both trips and untracked time.

Pros:

  • Minimal Changes: Only label/text changes; no changes to logic or structure.
  • Quick Fix: Fast to implement

Cons:

  • Misleading Metric: The label might still be confusing or misleading to admins, as it lumps two different kinds of data together.
  • Lack of Granularity: Does not resolve the underlying issue of distinguishing between confirmed and untracked data.

Thoughts:

  • Option 3 might be the fastest and simplest
  • Option 1 might be the better way in terms of providing more information
    What do you suggest I proceed with?

@shankari
Copy link
Contributor

shankari commented Jan 7, 2025

@TeachMeTW

  1. this still fairly generic and what I would expect from ChatGPT if I said "tell me now to solve a problem where I need to filter two types of trips". I would encourage you to carefully look at the current implementation instead of saying "add a metadata tag"
    • What would it take to add the tag?
    • How much filtering of the logic is needed?
      You are writing code within the context of an existing, complex codebase. You want to start with seeing what exists and how your solution can integrate with it, instead of starting de novo.
  2. If you think that option 1 is best, but it is too complex, you need to know how complex so you can weigh the options.

@TeachMeTW
Copy link
Contributor Author

TeachMeTW commented Jan 7, 2025

@TeachMeTW

1. this still fairly generic and what I would expect from ChatGPT if I said "tell me now to solve a problem where I need to filter two types of trips". I would encourage you to carefully **look at the current implementation** instead of saying "add a metadata tag"
   
   * What would it take to add the tag?
   * How much filtering of the logic is needed?
     You are writing code within the context of an existing, complex codebase. You want to start with seeing what exists and how your solution can integrate with it, instead of starting de novo.

2. If you think that option 1 is best, but it is too complex, you need to know **how** complex so you can weigh the options.

@shankari I believe from reading the code, we already have a dict for composite_trip_data which as keys like locations, sections, and we already have metadata which has an origin_key which is from ct[metadata][key] which has stuff like confirmed_trip and confirmed_untracked and whatnot based on my digging.

I would think just to do add to it by doing `composite_trip_entry['metadata']['origin_type'] = tracked or untracked. But the thing is we already have confirmed_untracked so is there an unconfirmed untracked or unconfirmed tracked? I am just a bit confused on how composite_trips and the other keys are determined. I will look into it.

Now with that, we can modify the db.count_document() to filter for metadata.origin_type

ie:

metadata.key: esda.COMPOSITE_TRIP_KEY
metadata.origin_type: confirmed

So:

What would it take to add the tag?

I believe my iteration above would suffice by adding to the composite trip entry so that line

How much filtering of the logic is needed?

We could use the count_documents and filter with the new metadata.origin so itd be like:

db.count_documents({
       metadata.key: esda.COMPOSITE_TRIP_KEY
       metadata.origin_type: confirmed
})

Would you say this explanation and train of thought suffices -- not to be always asking for reassurances that is.

@TeachMeTW
Copy link
Contributor Author

@TeachMeTW first, identify the problem, then think about potential fixes, and then make the changes.

Aha! This is the problem. Good job identifying it.


triplikeEntries = list(ts.find_entries([esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY], time_query=time_query))

One potential fix is indeed to change the triplikeEntries.

Do you have thoughts for an alternate fix? I can think of at least one other option. Please think through all the options, and list their pros and cons first.

Please do not make changes without thinking through several pros and cons

@shankari I also wanted to clarify by what you mean by "But the app also uses composite trips (which is why we want to keep it warm), so changing this data structure will break untracked time."

How are we keeping it warm and how will changing the structure break things?

@TeachMeTW
Copy link
Contributor Author

Well thinking about it more -- it begs the question what things are effected with this change. Furthermore, why wasn't this 1:1 issue caught before? Does this mean it was broken already? If so why didn't the tests catch it?

@TeachMeTW
Copy link
Contributor Author

Since I saw that we have origin_keys already, I was thinking modifying filtering to:

composite_trip_count = db.count_documents({
    "metadata.key": esda.COMPOSITE_TRIP_KEY,
    "user_id": user_id,
    "metadata.origin_key": esda.CONFIRMED_TRIP_KEY # (or hardcode it for this manner i suppose)
})

We could filter by confirmed this way?

But this seems too good to be true. How do you suggest me testing this @shankari -- what depends on the composite objects? I know as you said the phone does and I don't want to affect that and whatnot

@TeachMeTW
Copy link
Contributor Author

We set composite_trip_entry["metadata"]["origin_key"] = origin_key in def create_composite_trip(ts, ct): but we never use it in the composite_trip creation. Perhaps we could utilize this

@TeachMeTW
Copy link
Contributor Author

@shankari Rereading the code, I don't seem to understand by what this means:

# composite trips are created from both confirmed trips and cleaned untracked trips

if composite is created from BOTH confirmed trips AND cleaned untracked trips hence the ts.find_entries([esda.CONFIRMED_TRIP_KEY, esda.CONFIRMED_UNTRACKED_KEY]), I’m struggling to see how the 1:1 relationship between confirmed trips and composite trips is maintained.

From my understanding, if both confirmed trips and untracked trips are being combined into composites, shouldn’t the number of composite trips exceed the number of confirmed trips? Or am I missing something in how the relationship is established?

@JGreenlee the git blame refers to 9d41f1d where include untracked time in composite trip creation.

@TeachMeTW
Copy link
Contributor Author

@shankari
In essence, would total trips be the larger of confirmed trips and composites or would total trips just be confirmed trips?

@JGreenlee
Copy link
Contributor

JGreenlee commented Jan 8, 2025

n_confirmed = number of confirmed_trip
n_untracked = number of confirmed_untracked
n_composite = number of composite_trip

n_confirmed to n_composite is not necessarily 1:1. Every confirmed trip has a corresponding composite trip, and every untracked has a corresponding composite trip. They do not overlap.
Thus, n_confirmed + n_untracked = n_composite
Presumably in your test, n_confirmed=5, n_untracked=3, and n_composite=8

Since composite_trips may have been created from either confirmed_trips or confirmed_untrackeds, they have an extra property metadata.origin_key to disambiguate:

logging.info("End place type for trip is %s" % type(ct['data']['end_place']))
composite_trip_data = copy.copy(ct["data"])
origin_key = ct["metadata"]["key"]
logging.debug("Origin key for trip %s is %s" % (ct["_id"], origin_key))
composite_trip_data["locations"] = get_locations_for_confirmed_trip(ct)
composite_trip_data["confirmed_trip"] = ct["_id"]
composite_trip_data["start_confirmed_place"] = eaum.get_confirmed_place_for_confirmed_trip(ct, "start_place")
composite_trip_data["end_confirmed_place"] = eaum.get_confirmed_place_for_confirmed_trip(ct, "end_place")
composite_trip_data["sections"] = get_sections_for_confirmed_trip(ct)
composite_trip_entry = ecwe.Entry.create_entry(ct["user_id"], "analysis/composite_trip", composite_trip_data)
composite_trip_entry["metadata"]["origin_key"] = origin_key
ts.insert(composite_trip_entry)

I can explain in more detail if you want to get on a call later

@JGreenlee
Copy link
Contributor

I can also provide context for

How are we keeping it warm and how will changing the structure break things?

@shankari
Copy link
Contributor

shankari commented Jan 8, 2025

@TeachMeTW here's a concrete example working more independently. If you look at this PR, you have asked me 4 questions in ~ 12 hours! Neither Jack nor I can respond at that rate and still get our own work done. The goal is for you to slow down, read the code, explore the data and figure it out yourself.

It is more efficient for you to ask us when you have a question, but it is less efficient for us.

You have been added to the internal issue where we are discussing scalability changes.
You have access to the full codebase.
You have access to a dataset and have been taught how to load it.
Poke around at all of those and answer your own questions, including by making modifications if necessary.

Learning how to read existing code is a critical skill.
A good intermediate goal for you would be to slow down and investigate and limit yourself to one question a day.
Continue to record the result of your investigations in the issue, but ask the questions of yourself and answer them yourself! 😄

shankari added a commit that referenced this pull request Jan 8, 2025
… instead of hardcoding analysis/confirmed_trip"

This reverts commit c77c5c5.

Reverting because this broke the tests.
#1006 (comment)

This is because the composite trips include both actual trips and untracked time.
#1006 (comment)

We don't want to store untracked time in the user stats or show it to users, so
we can't push this to a real production system, where people check the admin
dashboard, just yet. The fix is simple, but will take @TeachMeTW some time to
figure out. So let's revert the change for now.
@TeachMeTW TeachMeTW force-pushed the Add-Further-Tests-To-User-Stats branch from 19118ca to a0e51c0 Compare January 8, 2025 19:39
@TeachMeTW
Copy link
Contributor Author

TeachMeTW commented Jan 8, 2025

@shankari After clearing my thoughts, rereading and building off from my theories and approaches from yesterday, it seems that Option 1 -- the metadata was ALREADY in the codebase so it was a simple fix once I clarified with @JGreenlee with what composite trips are and why and how they are used in conjunction with the app.

My notes:

Untracked:

  • Time when phone is off or not tracking location, gaps in the location spanning x time, any point in time for a user there should be a place, trip, or untracked — so there’s no overlap. So people can see the gaps. At any time there will always be one of the three
  • The ground truth is confirmed trips
  • We want to query with composite_trips — what gets sent to the phone (has place, trip or untracked) — created from untracked or places — embedded

Keeping composite warm:

  • Pipeline runs every hour
  • We want composite_trips to be queried so it stays warm

Filter by origin_key

Total trips = confirmed trips

Thus, the solution should just be filtering by origin which I suspected yesterday, and I did so by adding an extra query: extra_query_list=[{'metadata.origin_key': 'analysis/confirmed_trip'}]

@TeachMeTW
Copy link
Contributor Author

Now the tests passes and we keep a 1:1 ratio WITHOUT changing the existing structure. Furthermore, I edited the existing commits to ensure no commit muddle/churn

@shankari
Copy link
Contributor

shankari commented Jan 8, 2025

that Option 1 -- the metadata was ALREADY in the codebase so it was a simple fix

Yes. this was my point with

I can think of at least one other option

Again, the goal here is not to write a ton of code. The goal is to understand the existing system, and to make changes that are consistent with it.

once I clarified with @JGreenlee with what composite trips are and why and how they are used in conjunction with the app.

I'm glad you were able to get clarity from Jack on this, and I appreciate Jack taking the time to help you, but I don't want you to ask Jack 5 questions in 12 hours either - that won't let him do any work.

@TeachMeTW TeachMeTW force-pushed the Add-Further-Tests-To-User-Stats branch 5 times, most recently from 4c359d0 to 1df808d Compare January 9, 2025 20:24
…: (i) when there is new data and (ii) when there is no new data. Changed the test to track ts on both cases and simplified.
@TeachMeTW TeachMeTW force-pushed the Add-Further-Tests-To-User-Stats branch from 1df808d to 9255c5d Compare January 10, 2025 03:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready for review by Shankari
Development

Successfully merging this pull request may close these issues.

3 participants