An existing agent may get disconnected when a new agent is added #2065

Open
staskysel opened this issue Oct 11, 2022 · 0 comments

Sometimes, when removing/adding agents, especially when several agents are added in quick succession, the oldest connected agent may lose its connection.
When this happens, the following is logged to ossec.log:

ossec-remoted: INFO: Assigning counter for agent HOSTNAME: 'NNNNN:NNNN'.
ossec-remoted: INFO: Assigning sender counter: NNNNN:NNNN <-- this is a normal message by itself, but it is suspicious that the numbers are the same as in the "Assigning counter" message
ossec-remoted(NNNN): ERROR: Unable to send message to 'N'.
ossec-remoted(NNNN): ERROR: Error creating encrypted message.
ossec-remoted(NNNN): ERROR: Incorrectly formatted message from 'X.X.X.X'.
ossec-remoted: ERROR: Unable to send file 'merged.mg' to agent

where:

  • "X.X.X.X" is the IP address of the disconnected agent.
  • "N" is agent's ID, present in the agent list
  • "NNNNN:NNNN" are unusually high (for an agent counter) numbers, equal or close to those in queue/rids/sender_counter file on server

Further, the counter file /var/ossec/queue/rids/N on the server also contains NNNNN:NNNN, while sender_counter on the agent remains normal (much smaller, similar to its value before the failure).

After some time, manage_agents -l | head begins to show "Disconnected" for the first agent in the list.

In short, the sender counter is written into an agent's message counter file queue/rids/N on the server instead of into the queue/rids/sender_counter file.

It is evident from the code that remoted:

  • is multithreaded
  • keeps the file pointers for the agents in an array, referenced from the keys structure by its keyentries field
  • keeps the array size in the keysize field
  • keeps the file pointer for queue/rids/sender_counter in the same array (it is the last element)
  • uses the following to write the sender counter (see the simplified sketch after this list):
    fprintf(keys->keyentries[keys->keysize]->fp, "%u:%u:", global, local);
  • sets keysize to 0 in the OS_FreeKeys() function (and then sleeps for 1 second)
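
For reference, here is a simplified sketch of that layout as I read it; the declarations are abbreviated and the names follow os_crypto/shared/keys.h and os_crypto/shared/msgs.c, while the real structures contain more fields:

#include <stdio.h>

/* Abbreviated sketch only; the real definitions in os_crypto/shared/keys.h
   contain more fields. */
typedef struct _keyentry {
    FILE *fp;               /* per-agent counter file, queue/rids/<id>      */
} keyentry;

typedef struct _keystore {
    keyentry **keyentries;  /* agents live in slots 0..keysize-1; slot
                               [keysize] holds queue/rids/sender_counter    */
    unsigned int keysize;   /* number of agents; OS_FreeKeys() resets it
                               to 0 and then sleeps for one second          */
} keystore;

/* Write path of StoreSenderCounter() in os_crypto/shared/msgs.c: it always
   indexes the slot after the last agent.  If another thread has just reset
   keys->keysize to 0, the write goes to keyentries[0]->fp, i.e. the counter
   file of the first agent in the array. */
static void store_sender_counter(keystore *keys, unsigned int global,
                                 unsigned int local)
{
    fprintf(keys->keyentries[keys->keysize]->fp, "%u:%u:", global, local);
}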

In theory this opens a race condition: one thread may be about to write the sender counter while another has just set keys->keysize to zero, so the write in fact goes to the wrong file: the counter file of the first agent in the array. Then, within one second, the sleeping thread wakes up and reads back the wrong counter.
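
The same pattern can be shown outside OSSEC. The following is a hypothetical standalone demo (not OSSEC code) of this race class: one thread keeps writing through slots[size] while another briefly resets size to 0, so the writer occasionally lands in slot 0 instead of the sentinel slot:

/* Hypothetical demo of the race pattern, unrelated to the OSSEC sources.
   Build with:  gcc -pthread -O2 race_demo.c -o race_demo                   */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define AGENTS 4

/* volatile only forces the compiler to re-read size on every iteration;
   it does not make the code thread-safe (that is the point).               */
static volatile unsigned int size = AGENTS;  /* analog of keys->keysize     */
static int slots[AGENTS + 1];                /* slot [AGENTS] = sender counter */

static void *writer(void *arg)               /* analog of StoreSenderCounter() */
{
    (void)arg;
    for (int i = 1; i <= 20000000; i++)
        slots[size] = i;                     /* unsynchronized index read   */
    return NULL;
}

static void *reloader(void *arg)             /* analog of OS_FreeKeys()     */
{
    (void)arg;
    for (int i = 0; i < 100; i++) {
        size = 0;                            /* keys are being rebuilt      */
        usleep(100);
        size = AGENTS;
        usleep(100);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, writer, NULL);
    pthread_create(&b, NULL, reloader, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* A non-zero slots[0] means the writer hit the first agent's slot.     */
    printf("slots[0] = %d (corrupted if non-zero)\n", slots[0]);
    return 0;
}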
In practice, within remoted itself, a debugger shows that the agent's counter file is indeed modified in StoreSenderCounter():

Hardware watchpoint 6: -location *(keys->keyentries[0]->fp->_IO_write_ptr)

Old value = 49 '1'
New value = 57 '9'
0x00007ffff5ea50a0 in __GI__IO_default_xsputn (f=f@entry=0x5555557dee80, data=<optimized out>, n=n@entry=5) at genops.c:451
451                     *p++ = *s++;

(gdb) bt
#0  0x00007ffff5ea50a0 in __GI__IO_default_xsputn (f=f@entry=0x5555557dee80, data=<optimized out>, n=n@entry=5) at genops.c:451
#1  0x00007ffff5ea2a92 in _IO_new_file_xsputn (f=0x5555557dee80, data=<optimized out>, n=5) at fileops.c:1347
#2  0x00007ffff5e72ec5 in _IO_vfprintf_internal (s=s@entry=0x5555557dee80, format=<optimized out>, ap=ap@entry=0x7ffff25856b8) at vfprintf.c:1634
#3  0x00007ffff5f3e325 in ___fprintf_chk (fp=0x5555557dee80, flag=1, format=<optimized out>) at fprintf_chk.c:36
#4  0x000055555555c502 in fprintf (__fmt=0x5555555755db "%u:%u:", __stream=<optimized out>) at /usr/include/bits/stdio2.h:97
#5  StoreSenderCounter (local=NNNN, global=NNNNN, keys=0x5555557d08e0 <keys>) at os_crypto/shared/msgs.c:147
#6  CreateSecMSG (keys=keys@entry=0x5555557d08e0 <keys>,
    msg=msg@entry=0x7ffff258a4e0 "who cares"...,
    msg_encrypted=msg_encrypted@entry=0x7ffff2588860 "binary stuff here", <incomplete sequence \317>, id=id@entry=317) at os_crypto/shared/msgs.c:478
#7  0x0000555555558fcf in send_msg (agentid=agentid@entry=317,
    msg=msg@entry=0x7ffff258a4e0 "who cares"...) at remoted/sendmsg.c:101
#8  0x0000555555558315 in send_file_toagent (agentid=agentid@entry=317, name=<optimized out>, sum=0x5555557d3740 "an md5 sum here") at remoted/manager.c:290
#9  0x0000555555558bc3 in read_controlmsg (
    msg=0x7ffff258a9f3 "more binary stuff here"..., agentid=317) at remoted/manager.c:372
#10 wait_for_msgs (none=<optimized out>) at remoted/manager.c:507
#11 0x00007ffff78c4ea5 in start_thread (arg=0x7ffff258b700) at pthread_create.c:307
#12 0x00007ffff5f26b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The bug is triggered rather seldom, but it can be quite annoying in environments where agents are created and removed automatically and frequently.

While researching the problem, I came up with an artificial procedure that reproduces the bug more reliably:

  1. Pick a recently added (that is, not the oldest) agent and remove it on the server with manage_agents -r.
  2. Restart the corresponding agent (on the client). This causes the agent to send a lot of messages to the server, while the previous step has made the agent unknown to the server, so remoted calls check_keyupdate() in its main thread for every message. It is better to check with tcpdump whether messages are really being sent, because sometimes they are not.
  3. Start a loop touching client.keys every 0.5 seconds. This makes check_keyupdate() proceed to OS_FreeKeys():
    while true; do sudo touch /var/ossec/etc/client.keys; sleep 0.5; done
  4. Pick a couple of recently added agents. Remove them, wait 10 seconds, and add them again. The bug should be triggered; otherwise repeat this step 1-2 times.
    Removal and addition involve removing/adding the agent package on the client, removing/adding the agent on the server with manage_agents, and restarting the OSSEC service on the server. OSSEC also pushes the agent configuration at this point, which may be important (see the backtrace).

A fix is proposed in PR #2064.
