Webdav failed uploads leave 0 length files in metadata #7724

Open
ageorget opened this issue Dec 20, 2024 · 8 comments

@ageorget

Hi,

I found many zero-length files in the namespace that correspond to failed uploads (3500 files for ATLAS and CMS, 500 for LHCb).
The files are correctly deleted from the pools but not from the namespace:

select * from t_inodes where itype=32768 and isize=0 and icrtime < CURRENT_TIMESTAMP-INTERVAL '24 hours' order by icrtime desc;
               ipnfsid                | itype | imode | inlink | iuid | igid | isize | iio |           ictime           |           iatime           |           imtime           |          icrtime           | igeneration | iaccess_latency | iretention_policy |  inumber   | iqos_policy | iqos_state
--------------------------------------+-------+-------+--------+------+------+-------+-----+----------------------------+----------------------------+----------------------------+----------------------------+-------------+-----------------+-------------------+------------+-------------+------------
 0000179591B66B7245B5A7737413750F4F47 | 32768 |   420 |      1 | 3033 |  119 |     0 |   2 | 2024-12-19 06:51:27.989+01 | 2024-12-19 06:51:27.989+01 | 2024-12-19 06:51:27.989+01 | 2024-12-19 06:51:27.989+01 |           0 |                 |                   | 1288503194 |             |
 00005DA951F022A3419CAE8F2BE86FFC31B8 | 32768 |   420 |      1 | 3033 |  119 |     0 |   2 | 2024-12-19 06:46:03.511+01 | 2024-12-19 06:46:03.511+01 | 2024-12-19 06:46:03.511+01 | 2024-12-19 06:46:03.511+01 |           0 |                 |                   | 1288499582 |             |
 0000843FC0B40E324342889C8B334B9C53F8 | 32768 |   420 |      1 | 3033 |  119 |     0 |   2 | 2024-12-19 06:45:47.725+01 | 2024-12-19 06:45:47.725+01 | 2024-12-19 06:45:47.725+01 | 2024-12-19 06:45:47.725+01 |           0 |                 |                   | 1288499339 |             |
 0000C648F112A32E4C9EB0F24E695181872D | 32768 |   420 |      1 | 3327 |  124 |     0 |   2 | 2024-12-18 14:21:26.529+01 | 2024-12-18 14:21:26.529+01 | 2024-12-18 14:21:26.529+01 | 2024-12-18 14:21:26.529+01 |           0 |                 |                   | 1288099252 |             |

Nothing is left in t_location_trash, so the cleaner is working well.

Failed transfers from the pool logs:

Dec 19 06:51:28 ccdcatli344 dcache@ccdcatli344-pool-cms-hpssdata-li344a-Domain[65368]: 19 Dec 2024 06:51:28 (pool-cms-hpssdata-li344a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmR8NFLA webdav-ccdcatli345 PoolAcceptFile 0000179591B66B7245B5A7737413750F4F47] Transfer failed: Connection lost before end of file.
Dec 19 06:51:28 ccdcatli344 dcache@ccdcatli344-pool-cms-hpssdata-li344a-Domain[65368]: 19 Dec 2024 06:51:28 (pool-cms-hpssdata-li344a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmR8NFLA webdav-ccdcatli345 PoolAcceptFile 0000179591B66B7245B5A7737413750F4F47] Transfer failed in post-processing: File size mismatch (expected=497992534, actual=0)
Dec 19 06:51:28 ccdcatli344 dcache@ccdcatli344-pool-cms-hpssdata-li344a-Domain[65368]: 19 Dec 2024 06:51:28 (pool-cms-hpssdata-li344a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmR8NFLA webdav-ccdcatli345 PoolAcceptFile 0000179591B66B7245B5A7737413750F4F47] Failed to read file size: java.nio.file.NoSuchFileException: /data/pool-cms-hpssdata-li344a/pool/data/0000179591B66B7245B5A7737413750F4F47

Dec 19 06:46:06 ccdcatli415 dcache@ccdcatli415-pool-cms-hpssdata-li415a-Domain[104743]: 19 Dec 2024 06:46:06 (pool-cms-hpssdata-li415a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmQu1/Dg webdav-ccdcatli345 PoolAcceptFile 00005DA951F022A3419CAE8F2BE86FFC31B8] Transfer failed: Connection lost before end of file.
Dec 19 06:46:06 ccdcatli415 dcache@ccdcatli415-pool-cms-hpssdata-li415a-Domain[104743]: 19 Dec 2024 06:46:06 (pool-cms-hpssdata-li415a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmQu1/Dg webdav-ccdcatli345 PoolAcceptFile 00005DA951F022A3419CAE8F2BE86FFC31B8] Transfer failed in post-processing: File size mismatch (expected=424527680, actual=0)
Dec 19 06:46:06 ccdcatli415 dcache@ccdcatli415-pool-cms-hpssdata-li415a-Domain[104743]: 19 Dec 2024 06:46:06 (pool-cms-hpssdata-li415a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmQu1/Dg webdav-ccdcatli345 PoolAcceptFile 00005DA951F022A3419CAE8F2BE86FFC31B8] Failed to read file size: java.nio.file.NoSuchFileException: /data/pool-cms-hpssdata-li415a/pool/data/00005DA951F022A3419CAE8F2BE86FFC31B8

Dec 18 14:30:47 ccdcatli416 dcache@ccdcatli416-pool-atlas-dq2-li416a-Domain[100801]: 18 Dec 2024 14:30:47 (pool-atlas-dq2-li416a) [door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAYpi0py7lg webdav-ccdcatli367 PoolAcceptFile 0000C648F112A32E4C9EB0F24E695181872D] Transfer failed: No connection from client after 300 seconds. Giving up.
Dec 18 14:30:47 ccdcatli416 dcache@ccdcatli416-pool-atlas-dq2-li416a-Domain[100801]: 18 Dec 2024 14:30:47 (pool-atlas-dq2-li416a) [door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAYpi0py7lg webdav-ccdcatli367 PoolAcceptFile 0000C648F112A32E4C9EB0F24E695181872D] Transfer failed in post-processing: File size mismatch (expected=692252, actual=0)
Dec 18 14:30:47 ccdcatli416 dcache@ccdcatli416-pool-atlas-dq2-li416a-Domain[100801]: 18 Dec 2024 14:30:47 (pool-atlas-dq2-li416a) [door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAYpi0py7lg webdav-ccdcatli367 PoolAcceptFile 0000C648F112A32E4C9EB0F24E695181872D] Failed to read file size: java.nio.file.NoSuchFileException: /data/pool-atlas-dq2-li416a/pool/data/0000C648F112A32E4C9EB0F24E695181872D
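
To cross-check the namespace query against the pools, the "File size mismatch" lines can be reduced to a list of pnfsids. The following is only a minimal sketch: the regular expression is derived from the log lines quoted above, and it assumes the pnfsid is always the 36-hex-digit token after "PoolAcceptFile". The function name `empty_failed_uploads` is made up for this example.

```python
import re

# Matches the "File size mismatch" line from the pool logs above.
# Assumption: the pnfsid is the 36-hex-digit token after "PoolAcceptFile".
MISMATCH = re.compile(
    r"\((?P<pool>[^)]+)\) \[.*?PoolAcceptFile (?P<pnfsid>[0-9A-F]{36})\] "
    r"Transfer failed in post-processing: "
    r"File size mismatch \(expected=(?P<expected>\d+), actual=(?P<actual>\d+)\)"
)


def empty_failed_uploads(lines):
    """Yield (pool, pnfsid, expected_size) for failed uploads that wrote 0 bytes."""
    for line in lines:
        match = MISMATCH.search(line)
        if match and int(match.group("actual")) == 0:
            yield match.group("pool"), match.group("pnfsid"), int(match.group("expected"))


# One line copied from the log above, with the syslog prefix dropped for brevity.
sample = [
    "19 Dec 2024 06:51:28 (pool-cms-hpssdata-li344a) [door:webdav-ccdcatli345@webdav-ccdcatli345Domain:AAYpmR8NFLA webdav-ccdcatli345 PoolAcceptFile 0000179591B66B7245B5A7737413750F4F47] Transfer failed in post-processing: File size mismatch (expected=497992534, actual=0)",
]
for pool, pnfsid, expected in empty_failed_uploads(sample):
    print(pool, pnfsid, expected)
```

The resulting pnfsids can then be compared with the ipnfsid column returned by the t_inodes query.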
@mksahakyan
Contributor

Hi @ageorget, could you please tell us which dCache version you are running? We will also need the billing log history so that we can reproduce the issue.

Thanks
cheers
Marina

@DmitryLitvintsev
Member

I would like to add: from the billing records, we are interested in any information about these pnfsids
(for example, grep '00005DA951F022A3419CAE8F2BE86FFC31B8' *). The suspicion is that the file removal request is never sent to, or never received by, PnfsManager. In other words, these files have never been in the trash table.
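
When several pnfsids need checking at once, the same search could be scripted. This is a rough sketch, assuming the billing records are plain-text files in a single directory; the function name `find_pnfsid_records` and the directory argument are hypothetical and need to be adapted to the local setup.

```python
from pathlib import Path


def find_pnfsid_records(billing_dir, pnfsids):
    """Return {pnfsid: [(file name, line), ...]} for billing lines mentioning each pnfsid."""
    hits = {pnfsid: [] for pnfsid in pnfsids}
    for path in sorted(Path(billing_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file contains odd bytes.
        with path.open(errors="replace") as handle:
            for line in handle:
                for pnfsid in pnfsids:
                    if pnfsid in line:
                        hits[pnfsid].append((path.name, line.rstrip("\n")))
    return hits
```

A pnfsid that maps to an empty list has no billing record at all in the scanned files.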

@ageorget
Author

ageorget commented Jan 8, 2025

Hi,
We use dCache 9.2.14.

For example, the pool log for pnfsid 00000BF71BB4C16F42BD92F5F53828EF5D7A:

Jan 07 10:27:07 ccdcatli416 dcache@ccdcatli416-pool-lhcb-dst-li416a-Domain[5854]: 07 Jan 2025 10:27:07 (pool-lhcb-dst-li416a) [door:webdav-ccdcatli346@webdav-ccdcatli346Domain:AAYrGkdndvg webdav-ccdcatli346 PoolAcceptFile 00000BF71BB4C16F42BD92F5F53828EF5D7A] Transfer failed: No connection from client after 300 seconds. Giving up.
Jan 07 10:27:07 ccdcatli416 dcache@ccdcatli416-pool-lhcb-dst-li416a-Domain[5854]: 07 Jan 2025 10:27:07 (pool-lhcb-dst-li416a) [door:webdav-ccdcatli346@webdav-ccdcatli346Domain:AAYrGkdndvg webdav-ccdcatli346 PoolAcceptFile 00000BF71BB4C16F42BD92F5F53828EF5D7A] Transfer failed in post-processing: File size mismatch (expected=37095206, actual=0)
Jan 07 10:27:07 ccdcatli416 dcache@ccdcatli416-pool-lhcb-dst-li416a-Domain[5854]: 07 Jan 2025 10:27:07 (pool-lhcb-dst-li416a) [door:webdav-ccdcatli346@webdav-ccdcatli346Domain:AAYrGkdndvg webdav-ccdcatli346 PoolAcceptFile 00000BF71BB4C16F42BD92F5F53828EF5D7A] Failed to read file size: java.nio.file.NoSuchFileException: /data/pool-lhcb-dst-li416a/pool/data/00000BF71BB4C16F42BD92F5F53828EF5D7A

The billing records:

billing=> select * from billinginfo where pnfsid='00000BF71BB4C16F42BD92F5F53828EF5D7A' order by datestamp desc;
-[ RECORD 1 ]--+----------------------------------------------------------------------------------------
client         | 2001:630:54:10:82f6:db04:0:0
initiator      | door:webdav-ccdcatli346@webdav-ccdcatli346Domain:AAYrGkdndvg:1736241727396000
isnew          | t
protocol       | Https-1.1
transfersize   | 0
fullsize       | 0
storageclass   | lhcb:lhcb@osm
connectiontime | 300018
action         | transfer
cellname       | pool-lhcb-dst-li416a@ccdcatli416-pool-lhcb-dst-li416a-Domain
datestamp      | 2025-01-07 10:27:07.474+01
errorcode      | 10006
errormessage   | No connection from client after 300 seconds. Giving up.
pnfsid         | 00000BF71BB4C16F42BD92F5F53828EF5D7A
transaction    | pool:pool-lhcb-dst-li416a@ccdcatli416-pool-lhcb-dst-li416a-Domain:1736242027474-1337382
p2p            | f
fqan           | 
mappeduid      | 3437
mappedgid      | 155
owner          | lhcbgrid

billing=> select * from doorinfo where pnfsid='00000BF71BB4C16F42BD92F5F53828EF5D7A' order by datestamp desc;
-[ RECORD 1 ]--+-----------------------------------------------------------------------------------------------------------------------
client         | 2001:630:54:10:82f6:db04:0:0
mappedgid      | 155
mappeduid      | 3437
owner          | lhcbgrid
path           | /pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/failover/lhcb/LHCb/Collision24/RD.DST/00257670/0015/00257670_00158272_1.rd.dst
queuedtime     | 0
connectiontime | 300085
action         | request
cellname       | webdav-ccdcatli346@webdav-ccdcatli346Domain
datestamp      | 2025-01-07 10:27:07.481+01
errorcode      | 10006
errormessage   | No connection from client after 300 seconds. Giving up.
pnfsid         | 00000BF71BB4C16F42BD92F5F53828EF5D7A
transaction    | door:webdav-ccdcatli346@webdav-ccdcatli346Domain:AAYrGkdndvg:1736241727396000
fqan           | 
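
As a sanity check on the two records, the numeric parts of the transaction IDs look like epoch timestamps. The sketch below assumes that the suffix of the door ID is in microseconds and the leading number of the pool transaction is in milliseconds; the datestamp column is consistent with that reading, but it is an inference from these two records, not documented behaviour.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the "transaction" fields above.
door_session_us = 1736241727396000    # assumed: microseconds since the Unix epoch
pool_transaction_ms = 1736242027474   # assumed: milliseconds since the Unix epoch

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
cet = timezone(timedelta(hours=1))    # the records use a +01 offset

# Integer arithmetic via timedelta avoids floating-point rounding.
started = (epoch + timedelta(microseconds=door_session_us)).astimezone(cet)
failed = (epoch + timedelta(milliseconds=pool_transaction_ms)).astimezone(cet)
gap_ms = pool_transaction_ms - door_session_us // 1000

print(started.isoformat(timespec="milliseconds"))
print(failed.isoformat(timespec="milliseconds"))
print(gap_ms)
```

Under these assumptions the gap between the two values is about 300 seconds, in line with the connectiontime values and the "No connection from client after 300 seconds" error.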

The three Kafka events generated:
00000BF71BB4C16F42BD92F5F53828EF5D7A.txt
They include the remove event generated by the pool, with the message "Transfer failed and replica is empty".

In the door access logs, I only see PROPFIND and HEAD requests:

level=WARN ts=2025-01-07T10:22:06.326+0100 event=org.dcache.webdav.request request.method=PROPFIND request.url=https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/failover/lhcb/LHCb/Collision24/RD.DST/00257670/0015/00257670_00158272_1.rd.dst response.code=404 response.reason="Not Found" socket.remote=[2001:630:54:10:82f6:db04::]:45850 user-agent="libdavix/0.8.6 libcurl/8.10.1" user.dn="CN=5411875404,CN=3680153033,CN=8760452408,CN=5946495324,CN=Robot: LHCb offline productions,CN=693025,CN=lbprods,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=3437:155[w90v1x72:wXAwPZCu] duration=9
level=WARN ts=2025-01-07T10:22:06.471+0100 event=org.dcache.webdav.request request.method=HEAD request.url=https://ccdavlhcb.in2p3.fr:2880/pnfs/in2p3.fr/data/lhcb/LHCb-Disk/lhcb/failover/lhcb/LHCb/Collision24/RD.DST/00257670/0015/00257670_00158272_1.rd.dst response.code=404 response.reason="Not Found" socket.remote=[2001:630:54:10:82f6:db04::]:45858 user-agent="libdavix/0.8.6 libcurl/8.10.1" user.dn="CN=5411875404,CN=3680153033,CN=8760452408,CN=5946495324,CN=Robot: LHCb offline productions,CN=693025,CN=lbprods,OU=Users,OU=Organic Units,DC=cern,DC=ch" user.mapped=3437:155[w90v1x72:wXAwPZCu] duration=9
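
To filter these access-log entries by method or response code, the key=value format can be split with the standard library. This is only a sketch: it assumes values are either bare tokens or double-quoted strings, as in the lines above, and the helper name `parse_access_line` is made up for this example.

```python
import shlex


def parse_access_line(line):
    """Split a key=value access-log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


# Abridged copy of the HEAD entry above (URL, socket and user fields omitted).
sample = (
    'level=WARN ts=2025-01-07T10:22:06.471+0100 event=org.dcache.webdav.request '
    'request.method=HEAD response.code=404 response.reason="Not Found" '
    'user-agent="libdavix/0.8.6 libcurl/8.10.1" duration=9'
)
fields = parse_access_line(sample)
print(fields["request.method"], fields["response.code"], fields["response.reason"])
```

Counting the request.method values for a given request.url would show directly whether any PUT or COPY request reached the door for the affected path.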

@mksahakyan
Contributor

OK, you will need to upgrade to 9.2.18; this issue has been fixed there.

https://www.dcache.org/old/downloads/1.9/release-notes-9.2.shtml#release9.2.18

thanks
cheers

@ageorget
Author

ageorget commented Jan 9, 2025

Only one RemoteTransferManager is running on this instance for now.
Does 9.2.18 also fix the issue with the cleaner?

@mksahakyan
Contributor

The fix in version 9.2.18 is indirect.
The issue could just as well occur with a single TransferManager when two transfers were initiated at the same time.
If a transfer failed and both transfers shared the same ID, the namespace entry for one of the failed transfers could not be deleted.
There should be no issues with the cleaner itself.
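
To illustrate the kind of collision described here, the following toy model shows how two transfers that share one ID can leave a namespace entry behind. It is purely illustrative and is not dCache code: the class, its methods and the ID values are all invented for this sketch.

```python
class ToyTransferRegistry:
    """Toy model: maps a transfer ID to the namespace entry to remove on failure."""

    def __init__(self):
        self._by_id = {}
        self.namespace = set()

    def start(self, transfer_id, pnfsid):
        self.namespace.add(pnfsid)
        # A duplicate ID silently overwrites the earlier transfer's entry.
        self._by_id[transfer_id] = pnfsid

    def fail(self, transfer_id):
        # Cleanup can only remove what the registry still remembers.
        pnfsid = self._by_id.pop(transfer_id, None)
        if pnfsid is not None:
            self.namespace.discard(pnfsid)


registry = ToyTransferRegistry()
registry.start(1, "file-A")   # two transfers started at the same time...
registry.start(1, "file-B")   # ...that end up with the same ID
registry.fail(1)
registry.fail(1)
print(registry.namespace)     # "file-A" is left behind as an orphan
```

In this model, the orphaned entry never reaches any cleanup path, which would match the observation that the files never appear in the trash table.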

@ageorget
Author

Ah OK, thanks. I will try the upgrade then.
Do I need to upgrade only the WebDAV doors, or the WebDAV doors and the TransferManager?

@mksahakyan
Contributor

Upgrading only the WebDAV doors should be enough.

cheers
