This repository has been archived by the owner on Jan 30, 2024. It is now read-only.

Releases: PanDAWMS/pilot

73.18

29 May 12:17
  • Added merged_lhef._0.events-new and Process to the list of files that are removed before the job log is created. Discussed in JIRA ticket: https://its.cern.ch/jira/browse/ATLASJT-391

  • Now using function getmtime() instead of getctime() in the looping job killer; problem reported by David Cameron
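
The difference between the two calls can be illustrated with a minimal sketch. On Unix, `os.path.getctime()` reports the last metadata change, while `os.path.getmtime()` reports the last content modification, so the latter is the better indicator of whether a job is still writing output. The helper names below are hypothetical and are not taken from the pilot code.

```python
import os
import tempfile
import time


def seconds_since_modified(path, now=None):
    """Seconds since the file content was last modified (based on mtime)."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def seconds_since_changed(path, now=None):
    """The same calculation based on ctime, shown only for comparison."""
    now = time.time() if now is None else now
    return now - os.path.getctime(path)


# Create a scratch file and pretend its content was last written an hour ago.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name
an_hour_ago = time.time() - 3600
os.utime(demo_path, (an_hour_ago, an_hour_ago))

mtime_age = seconds_since_modified(demo_path)  # roughly one hour
ctime_age = seconds_since_changed(demo_path)   # close to zero
os.remove(demo_path)
```

In this sketch a ctime-based check would consider the file fresh even though its content is an hour old.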

73.17

15 May 09:55

Pilot 1 was updated this morning to fix a problem in the looping job algorithm that affected some user jobs: getting the modification time failed for files that no longer exist. It is not clear how those files disappear, since they are known to exist less than a second before the modification time is read.
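
One way to tolerate files that vanish between being listed and being inspected is sketched below. This is an illustration under assumptions, not the pilot's actual code: the function name `most_recent_mtime` is hypothetical, and the sketch simply skips any path whose modification time can no longer be read.

```python
import os


def most_recent_mtime(paths):
    """Return (path, mtime) for the most recently modified existing file.

    Paths that have disappeared are skipped. Returns None if no path
    could be inspected.
    """
    latest = None
    for path in paths:
        try:
            mtime = os.path.getmtime(path)
        except OSError:
            # The file vanished after it was listed; ignore it.
            continue
        if latest is None or mtime > latest[1]:
            latest = (path, mtime)
    return latest
```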

73.16

08 May 15:31

Looping job update:

  • Fixed a bug in the looping job killer, which now correctly identifies the most recently modified file (discovered by R. Walker)

Wrong architecture:

  • Added new error code 1248, ERR_WRONGARCHITECTURE: "Job built on wrong architecture"
  • Added new error code 1249, ERR_RUNGENFAILURE: "RunGen failure (consult log file)"

Max work dir size:

No shell code returned for proxy error

  • Added NOPROXY error code to list in shellExitCode converter (discovered by P. Svirin)

73.15

28 Mar 09:17
  • Containers update for Pilot 2 that has implications on Pilot 1
  • Pilot 1 discards any @-syntax that might be present in cmtConfig from job definition (used to communicate ALRB_USER_PLATFORM to atlasSetup)
  • Updated the Rucio client location detection function (used in combination with list_replicas())
  • Now setting the site name using 'gstat' instead of 'gocname' as the latter is not always correctly set
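
The cmtConfig handling mentioned above could look roughly like the following sketch. It assumes that the "@-syntax" is a suffix beginning at the first `@` and that everything before it should be kept. Both the function name and the example value are hypothetical.

```python
def strip_at_syntax(cmtconfig):
    """Return cmtconfig without an '@...' suffix, if one is present.

    Assumption: the part before the first '@' is what should be kept.
    """
    return cmtconfig.split("@", 1)[0]


# Hypothetical example value, for illustration only.
cleaned = strip_at_syntax("x86_64-slc6-gcc62-opt@centos7")
```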

Contributions from A. Anisenkov, P. Nilsson

73.14

19 Mar 09:32
  • The location detection function used with list_replicas() has been updated and should now be IPv6 compatible. The previous version did not work on IPv6-only WNs.

73.13

06 Mar 14:41

Exception handling:

  • Added exception handling to the findProcessesInGroup() function to avoid a crash when the ps command fails (which led to 'invalid literal for int()')
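
The kind of defensive parsing described here can be sketched as follows. This is not the body of findProcessesInGroup(); the helper `parse_pids` is hypothetical and only shows how a `ValueError` from `int()` can be caught when the ps output contains a header or an error message instead of a process id.

```python
def parse_pids(ps_output):
    """Extract integer PIDs from the first column of ps-like output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            pids.append(int(fields[0]))
        except ValueError:
            # Header line or error text from a failed ps command; skip it
            # instead of crashing with 'invalid literal for int()'.
            continue
    return pids
```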

Debug update (requested by R. Walker)

  • Pilot now looks for both .log and log. file patterns when searching for the most recently updated log file
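
A minimal sketch of such a search is shown below. The glob patterns `*.log*` and `*log.*` are assumptions standing in for the ".log" and "log." patterns mentioned above, and the function name is hypothetical.

```python
import glob
import os


def latest_log_file(directory, patterns=("*.log*", "*log.*")):
    """Return the most recently modified file matching any pattern, or None."""
    candidates = set()
    for pattern in patterns:
        candidates.update(glob.glob(os.path.join(directory, pattern)))
    files = [path for path in candidates if os.path.isfile(path)]
    if not files:
        return None
    # The newest file is the one with the largest modification time.
    return max(files, key=os.path.getmtime)
```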

Event service related

  • Fixed a bug related to the reporting of corrupt files

Nordugrid patch (already released on Nordugrid)

  • Moved back offending rucio imports from module header to where they are used (rucio_sitemover)

Code contributions from W. Guan, P. Nilsson.

73.12

05 Feb 16:01
  • Checksum verification update. For stage-out checksum verification, the pilot now uses the checksum value from the XML, which is calculated right after the
    payload finishes, rather than a value recalculated after the stage-out has finished. This change makes it possible to detect
    any file corruption that happens between the end of payload execution and the end of stage-out.

  • Now reporting corrupted ES files

  • Rucio logger update; previously after an exception the rucio log was not propagated correctly
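
The checksum verification idea from the first item above can be sketched as follows. The choice of Adler-32 is an assumption (the release note does not name the algorithm), and both function names are hypothetical. The point of the sketch is that the reference value is computed once, right after the payload finishes, and is then compared with the checksum of the transferred copy.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)


def verify_stage_out(reference_checksum, transferred_checksum):
    """Compare the post-payload reference with the transferred file's checksum."""
    return reference_checksum == transferred_checksum
```

If the local file were recalculated only after the stage-out, a corruption that happened in between would produce matching checksums and go unnoticed.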

Code contributions from Wen Guan, Tomas Javurek, Paul Nilsson.

73.11

17 Jan 13:07

Corrected benchmark termination (a missing break). Previously the benchmark loop could not be terminated if the command ran for longer than the allowed time (due to a stuck benchmark command which is unrelated to the pilot). This led to large stdout in some jobs.
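
A polling loop with an explicit exit on time-out might look like the sketch below. It is not the pilot's benchmark code; the function name and parameters are hypothetical, and the command in the usage example is a stand-in for a stuck benchmark.

```python
import subprocess
import sys
import time


def run_with_time_limit(command, max_seconds, poll_interval=0.1):
    """Run command; return True if it finished in time, False if it was killed."""
    process = subprocess.Popen(command)
    deadline = time.time() + max_seconds
    timed_out = False
    while process.poll() is None:
        if time.time() > deadline:
            timed_out = True
            process.kill()
            break  # leave the polling loop once the time limit is exceeded
        time.sleep(poll_interval)
    process.wait()  # reap the process in both cases
    return not timed_out


# Stand-in for a stuck benchmark command: sleeps far longer than allowed.
stuck_command = [sys.executable, "-c", "import time; time.sleep(30)"]
finished_in_time = run_with_time_limit(stuck_command, max_seconds=0.3)
```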

73.10

14 Dec 11:34
  • Stage-out verification has been switched off for OS transfers. The site mover inherits from the Rucio site mover, which recently added this type of verification; this caused problems with OSs and led to problems with sim jobs at P1
  • Time-outs have been removed from Rucio API transfer calls since they led to problems on the server side

I also squeezed in an update that avoids port 9256, which might randomly be opened by a TCP server, to prevent problems with local monitoring, which uses this port. Requested by A. De Salvo
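
One possible way to avoid a reserved port when the operating system picks the port is sketched below. The names `RESERVED_PORTS` and `open_server_socket` are hypothetical, and the pilot may well solve this differently.

```python
import socket

RESERVED_PORTS = {9256}  # port used by local monitoring, per the note above


def open_server_socket(reserved=RESERVED_PORTS, max_attempts=10):
    """Return a listening TCP socket on an OS-chosen port not in reserved."""
    for _ in range(max_attempts):
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        if server.getsockname()[1] not in reserved:
            server.listen(1)
            return server
        # The OS happened to hand out a reserved port; release it and retry.
        server.close()
    raise RuntimeError("no acceptable port found")
```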

Code contributions from W. Guan, T. Javurek, P. Nilsson.

73.8

10 Dec 10:33

The davs update from the previous pilot version was rolled back since it caused some problems with direct access on a couple of sites (but later fixed in AGIS by Rod Walker). If there is a need to continue testing with davs, a special pilot version can be provided.