Sometimes, when removing or adding agents, especially when adding several agents in quick succession, the oldest connected agent may lose its connection.
When this happens, the following is logged to `ossec.log`:

```
ossec-remoted: INFO: Assigning counter for agent HOSTNAME: 'NNNNN:NNNN'.
ossec-remoted: INFO: Assigning sender counter: NNNNN:NNNN
ossec-remoted(NNNN): ERROR: Unable to send message to 'N'.
ossec-remoted(NNNN): ERROR: Error creating encrypted message.
ossec-remoted(NNNN): ERROR: Incorrectly formatted message from 'X.X.X.X'.
ossec-remoted: ERROR: Unable to send file 'merged.mg' to agent
```

The "Assigning sender counter" line is normally benign, but here it is suspicious that its numbers are the same as in the "Assigning counter" message.
where:

- "X.X.X.X" is the IP address of the disconnected agent;
- "N" is the agent's ID, present in the agent list;
- "NNNNN:NNNN" are unusually high numbers (for an agent counter), equal or close to those in the `queue/rids/sender_counter` file on the server.

Furthermore, the counter file `/var/ossec/queue/rids/N` on the server also contains NNNNN:NNNN, while `sender_counter` on the agent remains normal (much smaller, similar to its value before the failure).
After some time, `manage_agents -l | head` begins to show "Disconnected" for the first agent in the list.

In short: the sender counter is written into the agent's message counter file `queue/rids/N` on the server, instead of into the `queue/rids/sender_counter` file.
It is evident from the code that remoted:

- is multithreaded;
- keeps the file pointers for the agents in an array, referenced from the `keys` structure by its `keyentries` field;
- keeps the array size in the `keysize` field;
- keeps the file pointer for `queue/rids/sender_counter` in the same array, as its last element;
- writes the sender counter with `fprintf(keys->keyentries[keys->keysize]->fp, "%u:%u:", global, local);`
- sets `keysize` to 0 in the `OS_FreeKeys()` function (and then sleeps for 1 second).
In theory, this opens a race condition: one thread may be about to write the sender counter while another has just set `keys->keysize` to zero, so the write actually goes to the wrong file: the counter file of the first agent in the array. Then, within one second, the sleeping thread wakes up and reads the wrong counter.
In practice, the debugger shows that the agent's counter file is modified in `StoreSenderCounter()`:
```
Hardware watchpoint 6: -location *(keys->keyentries[0]->fp->_IO_write_ptr)
Old value = 49 '1'
New value = 57 '9'
0x00007ffff5ea50a0 in __GI__IO_default_xsputn (f=f@entry=0x5555557dee80, data=<optimized out>, n=n@entry=5) at genops.c:451
451         *p++ = *s++;
(gdb) bt
#0  0x00007ffff5ea50a0 in __GI__IO_default_xsputn (f=f@entry=0x5555557dee80, data=<optimized out>, n=n@entry=5) at genops.c:451
#1  0x00007ffff5ea2a92 in _IO_new_file_xsputn (f=0x5555557dee80, data=<optimized out>, n=5) at fileops.c:1347
#2  0x00007ffff5e72ec5 in _IO_vfprintf_internal (s=s@entry=0x5555557dee80, format=<optimized out>, ap=ap@entry=0x7ffff25856b8) at vfprintf.c:1634
#3  0x00007ffff5f3e325 in ___fprintf_chk (fp=0x5555557dee80, flag=1, format=<optimized out>) at fprintf_chk.c:36
#4  0x000055555555c502 in fprintf (__fmt=0x5555555755db "%u:%u:", __stream=<optimized out>) at /usr/include/bits/stdio2.h:97
#5  StoreSenderCounter (local=NNNN, global=NNNNN, keys=0x5555557d08e0 <keys>) at os_crypto/shared/msgs.c:147
#6  CreateSecMSG (keys=keys@entry=0x5555557d08e0 <keys>,
    msg=msg@entry=0x7ffff258a4e0 "who cares"...,
    msg_encrypted=msg_encrypted@entry=0x7ffff2588860 "binary stuff here", <incomplete sequence \317>, id=id@entry=317) at os_crypto/shared/msgs.c:478
#7  0x0000555555558fcf in send_msg (agentid=agentid@entry=317,
    msg=msg@entry=0x7ffff258a4e0 "who cares"...) at remoted/sendmsg.c:101
#8  0x0000555555558315 in send_file_toagent (agentid=agentid@entry=317, name=<optimized out>, sum=0x5555557d3740 "an md5 sum here") at remoted/manager.c:290
#9  0x0000555555558bc3 in read_controlmsg (
    msg=0x7ffff258a9f3 "more binary stuff here"..., agentid=317) at remoted/manager.c:372
#10 wait_for_msgs (none=<optimized out>) at remoted/manager.c:507
#11 0x00007ffff78c4ea5 in start_thread (arg=0x7ffff258b700) at pthread_create.c:307
#12 0x00007ffff5f26b0d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
```
The bug is triggered rather seldom, but can be quite annoying in environments where agents are created and removed automatically and often.
While researching the problem, I created an artificial procedure that helps reproduce the bug more reliably:

1. Pick a recently added (that is, not the oldest) agent and remove it on the server with `manage_agents -r`.
2. Restart the corresponding agent (on the client). This causes the agent to send a lot of messages to the server, while the previous step makes the agent unknown to the server, so remoted calls `check_keyupdate()` in its main thread for every message. It is better to check with tcpdump whether the messages are really being sent, because sometimes they are not.
3. Start a loop touching the keys file every 0.5 s; this makes `check_keyupdate()` proceed to `OS_FreeKeys()`: `while true; do sudo touch /var/ossec/etc/client.keys; sleep 0.5; done`
4. Pick a couple of recently added agents. Remove them, wait 10 seconds, and add them again. The bug will be triggered; otherwise, repeat this step 1-2 times.

Removal and addition involve removing/adding the agent package on the client, removing/adding the agent on the server with `manage_agents`, and restarting the OSSEC service on the server. OSSEC also pushes the agent configuration; this may be important (see the backtrace).
and restarting OSSEC service on server. OSSEC is also pushing agent configuration, this may be important (see the backtrace).A fix is proposed in PR #2064.
The text was updated successfully, but these errors were encountered: