Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Analyzer: Add time_bin argument to area_to_pandas #130

Closed
wants to merge 7 commits into from

Conversation

EwoutH
Copy link
Contributor

@EwoutH EwoutH commented Sep 14, 2024

This PR introduces time-binned analysis capabilities to the link_to_pandas and area_to_pandas methods in the Analyzer class, while also addressing an issue with traffic volume calculation. These changes allow for more flexible, granular, and accurate analysis of traffic data over time.

Changes

  1. Added time_bin argument to link_to_pandas:

    • When time_bin is None, the method behaves as before.
    • When time_bin is specified, data is aggregated into time bins.
    • Performance is improved for time-binned analysis due to reduced iterations.
  2. Updated area_to_pandas to work with time-binned data:

    • Now uses the time-binned data from link_to_pandas when time_bin is specified.
    • Calculates area statistics for each time bin.
  3. Fixed traffic volume calculation in area_to_pandas:

    • Now correctly counts vehicles entering the area from outside.
    • Also counts vehicles starting their trip within the area.
    • Avoids double-counting vehicles passing through multiple links within the area.
  4. Data validation:

    • traffic_volume now represents the volume per bin instead of cumulative.
    • total_travel_time is calculated per bin, which is more logical for a binned approach.

Notes

  • The test suite has been updated to reflect the changes in traffic volume calculation.
  • The new approach provides a more accurate representation of traffic patterns over time.

Part of #120.

TODO

  • Validate data
  • Add argument to link_to_pandas
  • Fix traffic volume calculation

@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 14, 2024

Okay, it finally starts to look like something.

1. Added time_bin argument to link_to_pandas

Added the optional argument time_bin argument to link_to_pandas. Default behavior is now restored, and it's a bit faster since it only need to iterate over a lower amount of slices instead of over all individual timesteps.

2. Validated the data.

Previous strategy, with the timestep as in #123 and #128, it looked like this:

image

With this PR it looks like this:
image

Note that:

  • traffic_volume looks like it's higher. This could be a behavior change, since
  • total_travel_time is now per bin instead of cumulative, which seems more logical when going for a binned approach.

3. Tests fail

The test fails because we now set "area" as the index name (from the inputted area_names). We can either change that or update the tests, since I find "area" quite an logical index.

@EwoutH EwoutH marked this pull request as ready for review September 14, 2024 11:33
@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 14, 2024

@toruseo This is now ready for an initial review, before I optimize further.

@toruseo
Copy link
Owner

toruseo commented Sep 16, 2024

Thanks for your hard work.

This time_bin approach is very cool! Let's go this way.

The test fails because we now set "area" as the index name (from the inputted area_names). We can either change that or update the tests, since I find "area" quite an logical index.

For this issue, I want to ensure the backward compatibility. Maybe the index name for non time_bin mode can be just an integer (i.e., drop line 1519).

Another related issue is that the index for area_to_pandas is different from the others. This might be confusing. Is it possible to make these (time_bin, area) indexing a optional mode to be enabled by optional argument? Edit: At least, the internal variable Analyzer.df_area should have a consistent data structure. I think something like below would be suitable

s.df_area = pd.DataFrame(result)
if index_by_area:
    if time_bin is not None:
        return s.df_area.set_index(["time_bin", "area"])
    else:
        return s.df_area.set_index(["area"])        

Similar indexing mode could be added to link_to_pandas as well.

As a minor improvement, it is better to accept any numbers (including float) as time_bin argument as it is in seconds. Just doing time_bin=int(time_bin) in the beginning will be sufficient.

@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 16, 2024

Awesome, I will implement your suggestions later today.

@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 16, 2024

Implemented and tested!

Copy link
Owner

@toruseo toruseo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I forgot to submit the review

uxsim/analyzer.py Show resolved Hide resolved
@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 17, 2024

I don't really get the test error. It seems a value diff this time?

@toruseo
Copy link
Owner

toruseo commented Sep 18, 2024

This seems to be a real issue.

image

image

This means that your method says the traffic volume was 9310, whereas the actual number of all vehicles was 6880. Maybe double-counting or something.

Pls verify your method very carefully. Traffic volume and other stats in time_bin mode need to be verified as well. And can you include your test code (based on a small, tractable scenario like one in test_area_stats()) as well?

@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 24, 2024

A huge advantage of calculating traffic_volume based on Link data, is that you can turn off vehicle data logging completely and still get those 6 metrics. That can save an order of magnitude in memory usage in both running the simulation and saving data from it afterwards.

I will try to get the count deviation bug fixed this week.

This commit addresses an issue with traffic volume calculation in the
area_to_pandas method of the Analyzer class. Previously, the method was
overcounting vehicles, resulting in inflated traffic volume numbers for
areas in the network.

The main changes include:
1. Modified the process_area function to more accurately count vehicles:
   - Now counts vehicles entering the area from outside
   - Also counts vehicles starting their trip within the area
   - Avoids double-counting vehicles passing through multiple links within the area

2. Updated the traffic volume calculation logic:
   - Uses two separate queries to identify entering vehicles and
     vehicles starting their trips within the area
   - Sums these two values to get the total traffic volume
@EwoutH
Copy link
Contributor Author

EwoutH commented Sep 30, 2024

@toruseo I fixed the remaining issues (see 2dca14f for some details), all tests now pass!

@toruseo
Copy link
Owner

toruseo commented Oct 3, 2024

can you add test for time_bin mode as I instructed?

Validity of Analyzer must be very carefully checked, because if its output is corrupted, entire results of the research will be seriously damaged.

This commit modifies the link_to_pandas function in the analyzer to ensure
that the last time bin is always included in the output, even if it's not
complete. The change addresses an issue where the final time bin was being
omitted when the simulation end time was exactly divisible by the time bin size.

- Use math.ceil instead of integer division to calculate the number of bins
- This ensures that there's always a bin that includes the final time step
- Improves consistency in time-based analysis, especially for simulations
  where the last time bin is significant
@EwoutH
Copy link
Contributor Author

EwoutH commented Oct 3, 2024

Of course, I understand the importance of tests. I just wanted to check first if you were happy about the implementation before moving forward.

I fixed a nasty bug in which the last time bin was excluded sometimes in f543c34 and added tests in ced235e.

@EwoutH EwoutH mentioned this pull request Oct 3, 2024
@toruseo
Copy link
Owner

toruseo commented Oct 8, 2024

sorry, I missed your last message. Let me check

@toruseo
Copy link
Owner

toruseo commented Oct 8, 2024

strange. There may be some bugs. I used the following code for testing in a tractable small scale scenario.

from uxsim import *

W = World(
    name="",
    deltan=5, 
    tmax=2000, 
    print_mode=1, save_mode=1, show_mode=1,
    random_seed=0
)

W.addNode("orig", 0, 0)
W.addNode("mid1", 1, 1)
W.addNode("mid2", 2, 2)
W.addNode("dest", 3, 3)
link1 = W.addLink("link1", "orig", "mid1", length=1000, free_flow_speed=20, jam_density=0.2)
link2 = W.addLink("link2", "mid1", "mid2", length=1000, free_flow_speed=20, jam_density=0.2, capacity_out=0.4)
link3 = W.addLink("link3", "mid2", "dest", length=1000, free_flow_speed=20, jam_density=0.2)
t1 = 400
t2 = 800
W.adddemand("orig", "dest", 100, t1, 0.8)
W.adddemand("orig", "dest", t1, t2, 0.4)
W.adddemand("orig", "dest", t2, 1000, 0.1)

W.exec_simulation()

W.analyzer.print_simple_stats()

df = W.analyzer.link_to_pandas(time_bin=500)
display(df)

print(sum(df["traffic_volume"][df["link"]=="link1"]), sum(df["traffic_volume"][df["link"]=="link2"]), sum(df["traffic_volume"][df["link"]=="link3"]))

df2 = W.analyzer.area_to_pandas([["orig","mid1","mid2","dest"]])
display(df2)
print(df2["traffic_volume"])

The total number of vehicles is 420, so print(sum(df["traffic_volume"][df["link"]=="link1"]), sum(df["traffic_volume"][df["link"]=="link2"]), sum(df["traffic_volume"][df["link"]=="link3"])) should print 420 420 420, and print(df2["traffic_volume"]) should print 420.
But I got the following results:

image

Can you check?

Edit: BTW, in the main branch version, W.analyzer.area_to_pandas([["orig","mid1","mid2","dest"]]) return the following dataframe. Thus, this function is very broken.
image

@EwoutH
Copy link
Contributor Author

EwoutH commented Oct 8, 2024

Thanks for getting back.

Unfortunately, I don’t have time to work on this in the near future. If you are able, feel free to continue work on this PR.

My gut tells me it might be either A) some time indexing bug, like not including the first or last time step, B) something to do with when the vehicles get counted on a link, or C) something to do with how the areas are split and links the edges behaving weird,

@toruseo
Copy link
Owner

toruseo commented Oct 8, 2024

okay, I leave this in pending status

@toruseo toruseo marked this pull request as draft October 8, 2024 07:42
@toruseo
Copy link
Owner

toruseo commented Nov 13, 2024

This is related to #143 (comment)
But this is too good to close, so I keep it for future updates.

@EwoutH
Copy link
Contributor Author

EwoutH commented Nov 13, 2024

I don't see me working further on it, but I have successfully used this code in my thesis. So feel free to continue it.

@toruseo
Copy link
Owner

toruseo commented Nov 14, 2024

The current error may be small, but I value perfection for this kind of stats computation. I also dont see me working on this, so let me close this one.

@toruseo toruseo closed this Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants