Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] Added intergration with openwisp-monitoring #488

Merged
merged 51 commits into from
Oct 30, 2024

Conversation

pandafy
Copy link
Member

@pandafy pandafy commented Aug 31, 2023

@pandafy pandafy force-pushed the radius-monitoring branch 3 times, most recently from b07090d to d6cd4ad Compare August 31, 2023 12:30
@coveralls
Copy link

coveralls commented Aug 31, 2023

Coverage Status

coverage: 98.657% (+0.002%) from 98.655%
when pulling 36e7b01 on radius-monitoring
into 0dd6504 on master.

@pandafy pandafy force-pushed the radius-monitoring branch 3 times, most recently from 2387c29 to 0794151 Compare August 31, 2023 13:09
@nemesifier nemesifier changed the title [feature] Added intergraion with openwisp-monitoring [feature] Added intergration with openwisp-monitoring Sep 1, 2023
@pandafy pandafy force-pushed the radius-monitoring branch 2 times, most recently from d5667db to 92bb4e4 Compare September 13, 2023 10:50
Copy link
Member

@nemesifier nemesifier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are failing on Django 3.2.0 here.

Copy link
Member

@nemesifier nemesifier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still failing, I restarted the build to find out whether it was a temporary failure because the error looked weird (had to do with a non-existent time).

nemesifier
nemesifier previously approved these changes Oct 2, 2023
Copy link
Member

@nemesifier nemesifier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build passed, let's keep testing this a bit more before merging.

Copy link
Member

@nemesifier nemesifier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problem found during testing: the count of total users should be taken from the SQL DB because influxdb doesn't know. The goal of the total users chart is to show the total amount of users registered in the system at any given time.

Copy link
Member

@nemesifier nemesifier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add more space between date filter and the pie charts.

Can we change the default colors of the RADIUS charts to differentiate between traffic charts?
They look too similar to the traffic charts and do not stand out.
Eg: (total: #3A3335, #FDF0D5, #C6D8D3, #F0544F, #D81E5B, #048A81, #343E8A)

I noticed that if the called-station-id of the session does not match exactly the mac address of a device, the total session and session traffic charts do not count these sessions, why? I think this is wrong, it can lead to loss of important data (eg: if conversion of called-station-id doesn't work temporarily or when traffic is performed by NASes which are not registered in OW).

Let's also move the radius sessions tab in the device page to a similar position as the WiFi sessions.

@pandafy
Copy link
Member Author

pandafy commented Feb 26, 2024

We found these shortcomings during the review

  • Write total number of users registered on the systems to influxdb once a day.
    No need to write the data point on every user registration.
  • Change the colors in the charts.

@pandafy
Copy link
Member Author

pandafy commented Mar 12, 2024

Summary of Changes:

The current implementation faced a challenge wherein the charts for "User Registration" and "Total User Registration" started from zero, displaying the number of users registered since the feature deployment rather than the total count of users registered on the system.

To address this, the proposed solution was to write the total number of registered users to InfluxDB once a day, eliminating the need for a data point write on every user registration event.

Solution Exploration:

In the latest iteration, I delved into the feasibility of the proposed solution. Initial investigations revealed a necessity to write metric points every hour instead of once a day, as the chart resolution for "last 7 days" draws a point for each hour.

Implementation Details:

To accommodate this, the following logic was implemented in Python:

def write_user_registration_metrics():
    now = timezone.now()
    metric_data = []
    # Get the registration data for the past hour.
    # The query returns a tuple of organization_id, registration_method and
    # count of users who registered with that organization and method.
    registered_users = (
        RegisteredUser.objects.filter(
            user__openwisp_users_organizationuser__created__lte=now,
            user__openwisp_users_organizationuser__created__gt=now
            - timezone.timedelta(hours=1),
        )
        .values_list('user__openwisp_users_organizationuser__organization_id', 'method')
        .annotate(count=Count('user', distinct=True))
    )
    for org_id, registration_method, count in registered_users:
        registration_method = clean_registration_method(registration_method)
        metric, _ = Metric._get_or_create(
            configuration='user_signups_new',
            name='User SignUps',
            key='user_signups_new',
            object_id=None,
            content_type=None,
            extra_tags={
                'organization_id': str(org_id),
                'method': registration_method,
            },
        )
        metric_data.append((metric, {'value': count, 'time': now}))
    Metric.batch_write(metric_data)

Caveats Identified:

However, two caveats were encountered:

  1. The logic wrote a metric point for each organization when a user registered with multiple organizations, causing the total user count to be misrepresented.
  2. Users without a corresponding RegisteredUser object were not considered, such as manually created superusers, further contributing to data inconsistency.

Decision and Refinement:

After discussion with @nemesifier, it was concluded that the current implementation, which allows a DISTINCT operation on user IDs, resolves the first caveat effectively.

To address the second caveat, it was decided to update the logic to write a metric point for all users created. For users without a RegisteredUser object, the metric point will be assigned an "unspecified" method.

Resolution:

This adjustment resolves the mismatch between the total user count displayed on charts and the user's changelist page.

Furthermore, to rectify the initial problem, a celery task will be added to retroactively write metric points for old data from the database. This celery task would get triggered on the post_migration signal, thus aligning the implementation with the desired outcome.

@pandafy
Copy link
Member Author

pandafy commented Mar 14, 2024

I introduced a task to retroactively write RADIUS metrics from the database to InfluxDB. This task aimed to populate InfluxDB with historical data, ensuring comprehensive metric tracking.

Task Implementation:

The task, outlined below, iterates through user data to extract relevant metrics:

@shared_task
def post_migrate_retroactive_write():
    """
    This task retroactively writes the RADIUS metrics
    to InfluxDB using the data present in the database.
    """
    metric_data = []
    for user in (
        User.objects.select_related('registered_user')
        .only('id', 'registered_user__method')
        .iterator()
    ):
        try:
            registration_method = user.registered_user.method
        except RegisteredUser.DoesNotExist:
            registration_method = 'unspecified'
        registration_method = clean_registration_method(registration_method)
        org_query = list(
            user.openwisp_users_organizationuser.values_list(
                'organization_id',
                'created',
            )
        )
        if not org_query:
            # This user is not a member of any organization.
            # Use "None" for organization_id for this user.
            org_query = [(None, user.date_joined)]
        for (org_id, time) in org_query:
            metric = _get_user_signup_metric(
                organization_id=str(org_id), registration_method=registration_method
            )
            metric_data.append(
                # (metric, {'value': sha1_hash(str(user.id)), 'time': time})
                (metric, {'value': user.username, 'time': time})
            )
    Metric.batch_write(metric_data)

Encountered Caveats:

During testing with real-world data, two significant challenges were identified:

Retention Policy Limitations: InfluxDB's retention policy imposes constraints on storing data beyond defined limits. Consequently, storing metrics from the installation's inception becomes infeasible. Moreover, InfluxDB's continuous deletion of older data introduces inconsistency in total user values over time.

Chart Display Limitations: The chart displays totals based on the selected time range exclusively. For instance, if there are a total of 60 users – 59 registered within the last year and one registered within the last two years – the chart would only depict 59 users when the time range is set to "last 365 days."

Resolution Considerations:

In response to the encountered caveats, a proposal was made to implement a separate retention period for this data. While this would address the first caveat, the second remains a concern.

@nemesifier suggested reverting to the previous implementation, where the number of users registered with each method for each organization is recorded in InfluxDB every hour. To resolve the caveats with this approach:

  1. The logic wrote a metric point for each organization when a user registered with multiple organizations, causing the total user count to be misrepresented.

The total count of registered users will be recorded in InfluxDB with a special value (e.g., __all__) for the organization_id field. When superusers view the charts, the metrics will be filtered using this value, allowing them to see the total registered user count. Additionally, the number of registered users for each organization will be recorded separately, ensuring accurate counts when the chart is filtered by an organization or viewed by non-superusers.

We would write the number of registered users for each organization separately. This values would be used when the chart is filtered by an organization or when a non-superuser is viewing the chart.

  1. Users without a corresponding RegisteredUser object were not considered, such as manually created superusers, further contributing to data inconsistency.

If a user does not have a related RegisteredUser object, then use "unspecified" method for that user. User the special value as organization_id for such users.

@pandafy pandafy force-pushed the radius-monitoring branch 3 times, most recently from 8a8cb1c to 4a61e80 Compare October 30, 2024 17:37
Copy link
Member

@nemesifier nemesifier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready! 🏁 🔥 🚀

@nemesifier nemesifier merged commit dcc5165 into master Oct 30, 2024
15 checks passed
@nemesifier nemesifier deleted the radius-monitoring branch October 30, 2024 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature] RADIUS Pie Charts [feature] RADIUS Timeseries Charts
3 participants