PerServ - Network management


Theory of operation

Hydra accesses each unique acs in parallel on a polled basis to perform transfers. The time to complete a transfer cycle is therefore determined by the slowest responder, not by the sum of all responders' times.
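
The timing claim can be illustrated with a small arithmetic sketch. The function names and the per-acs durations are invented for illustration only; this is not Hydra's code.

```python
def serial_time(durations):
    """Total time if each acs were polled one after another."""
    return sum(durations)


def parallel_time(durations):
    """Total time when every acs is polled at once: the slowest one dominates."""
    return max(durations, default=0)


# Hypothetical transfer times, in seconds, for a 4 acs network.
durations = [12, 3, 40, 7]
print(serial_time(durations))    # 62
print(parallel_time(durations))  # 40
```

With these made-up numbers, the cycle finishes in 40 seconds rather than 62.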

The destination computer (city) walks the directory tree of the source computer. Each node of the directory tree uses a parallel http/https connection to retrieve the contents of that directory node. Suppose a directory node contains 5 subdirectories and 3 files. The 5 subdirectories are added to a queue of nodes to query. If any of the 3 files does not match its local counterpart, a read of the mismatched remote file is queued as well. If a file or directory is present locally but missing on the remote, the local file or directory is deleted immediately.
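
The handling of a single directory node can be sketched as follows. This is a minimal illustration, not Hydra's implementation: the function plan_node, the use of a number as the comparison key, and the sample names are all assumptions. The sketch also assumes that a file missing locally counts as a mismatch.

```python
from collections import deque


def plan_node(remote_dirs, remote_files, local_dirs, local_files):
    """Compare one remote directory node with its local counterpart.

    remote_files and local_files map a file name to a comparison key
    (a size or checksum, say; the real comparison is not specified here).
    Returns (dirs_to_query, files_to_read, local_entries_to_delete).
    """
    dirs_to_query = sorted(remote_dirs)
    files_to_read = sorted(
        name for name, key in remote_files.items()
        if local_files.get(name) != key
    )
    to_delete = sorted(
        (set(local_dirs) - set(remote_dirs))
        | (set(local_files) - set(remote_files))
    )
    return dirs_to_query, files_to_read, to_delete


# A node with 5 subdirectories and 3 files, one of which mismatches.
remote_dirs = ["d1", "d2", "d3", "d4", "d5"]
remote_files = {"a.txt": 10, "b.txt": 20, "c.txt": 30}
local_dirs = ["d1", "d2", "d3", "d4", "d5", "old"]
local_files = {"a.txt": 10, "b.txt": 99, "c.txt": 30, "stale.txt": 5}

queue = deque()
dirs, reads, deletes = plan_node(remote_dirs, remote_files, local_dirs, local_files)
queue.extend(("dir", d) for d in dirs)
queue.extend(("file", f) for f in reads)
print(len(queue))  # 6: five directories plus one mismatched file
print(deletes)     # ['old', 'stale.txt']
```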

Clearly, the queue of directories and files can build up quickly. To avoid overwhelming the remote computer with many transfer connections, the number of connections to each acs is limited to 2. This limit matters when the remote computer is also central to a network and is accessed by many computers at once.
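
One common way to cap concurrent connections is a worker pool of fixed size. The sketch below assumes a pool per acs and uses a fake fetch function that only sleeps; run_transfers, fake_fetch and stats are hypothetical names, and nothing here is taken from Hydra's source.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS_PER_ACS = 2  # the limit stated in the text


def run_transfers(transfers, fetch):
    """Run the queued transfers for one acs, at most two at a time."""
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS_PER_ACS) as pool:
        return list(pool.map(fetch, transfers))


# Demonstration: a fake fetch that records the peak number of
# transfers running at the same moment.
stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def fake_fetch(item):
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # stand-in for network time
    with lock:
        stats["active"] -= 1
    return item


results = run_transfers(range(10), fake_fetch)
print(stats["peak"] <= MAX_CONNECTIONS_PER_ACS)  # True
```

The remaining eight transfers simply wait in the pool's queue, which is the backlog behavior described above.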

After a city computer has completed its transfers, it sleeps for 5 minutes and then starts the transfer process over again.

Net Stat Report


Fig. 1 Net stat report
The Net Stat report is a section of the Admin >> Dashboard page. We have marked the report with numbers 1-10 to enable line-by-line discussion.

  • line 1 shows whether the present computer is "acquiring" data or "sleeping". The duration and recency of the most recent complete transfer are shown. These figures do not reflect the transfer currently in progress until that transfer has completed.
  • line 2 identifies the computer the statistics are for.
  • line 3 identifies the 10 minute period of peak activity. Every 10 minutes, a stat record is written to the log file (see below). When a computer is rebooted, the duration is higher than in steady state, so the first 30 minutes after a reboot are treated differently from the time past the 30 minute mark. Past 30 minutes, the reported peak ignores activity within the first 30 minutes.

    The util percentage attempts to characterize the loading on the bandwidth. Sometimes it will be an impossibly high number (e.g. 500%). In such a case, many transfers were taking place simultaneously. The util percentage is best viewed as an upper limit on loading.

  • lines 5 and 7-10 illustrate the transfer process for a 4 acs network. There are 5 number groups. The most recent group reads "198 (2, 0, 2, 0)". This means 198 transfers of either directory or file contents are presently queued up (the backlog), waiting for a free connection. The first position, containing 2, means there are 2 transfers in progress at the first acs - line 7. The second position, containing 0, means the transfer is complete for the second acs - line 8. The third and fourth positions, in turn, give the queue/transfer status for the third and fourth acs - lines 9 and 10.

    By refreshing the browser on the dashboard display page, we can watch the transfer evolve. More typically, we see "0 (0, 0, 0, 0)", because it is more common to take the snapshot when Hydra is resting than when it is acquiring.

The backlogs illustrated here are 198, 125, 134, 126, 94. 198 is the most recent backlog, taken within the last 6 seconds, and 94 is the oldest, taken 24 seconds before that.
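
If you want to track these number groups outside the dashboard, a group can be split into its parts with a few lines of Python. The parser below is a sketch that assumes the on-screen format matches the quoted example exactly; parse_snapshot is a hypothetical helper, not part of PerServ.

```python
import re

# Matches text such as "198 (2, 0, 2, 0)".
SNAPSHOT_RE = re.compile(r"^\s*(\d+)\s*\(([^)]*)\)\s*$")


def parse_snapshot(text):
    """Return (backlog, [transfers in progress per acs])."""
    match = SNAPSHOT_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized snapshot: {text!r}")
    backlog = int(match.group(1))
    per_acs = [int(part) for part in match.group(2).split(",") if part.strip()]
    return backlog, per_acs


backlog, per_acs = parse_snapshot("198 (2, 0, 2, 0)")
print(backlog)  # 198
print(per_acs)  # [2, 0, 2, 0]
```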


Statistics Log Entries


Fig. 2 Net stat log entries
In Fig. 2, we have used a tool to show the summary log file lines in the @data/arch/debug/dair.log file. These lines are written every 10 minutes whether there is network activity or not. Over a 24 hour period, there will be 144 such records in the log file. This is the log record referred to in line 3 of the Net Stat report above.

The meaning of each item is:

  • cm_c - common count. The number of receive buffers and replies combined.
  • cm_d - common duration. The number of seconds spent receiving or transmitting data. Can exceed 600 seconds in a 10 minute sampling interval due to overlap.
  • rc_s - receive size. Number of bytes received.
  • rp_s - reply size. Number of bytes sent in reply to incoming request.
  • util - utilization upper limit. cm_d expressed as a percentage of the sampling interval. This is the same util discussed under line 3 above.
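
The relationship between cm_d and util, and the count of 144 records per day, can be checked with a short calculation. The function name util_percent and the sample cm_d values are illustrative assumptions; only the 600 second interval and the formula come from the definitions above.

```python
SAMPLING_INTERVAL_S = 600  # one stat record every 10 minutes


def util_percent(cm_d, interval=SAMPLING_INTERVAL_S):
    """cm_d (seconds) expressed as a percentage of the sampling interval."""
    return 100.0 * cm_d / interval


print(util_percent(300))   # 50.0
print(util_percent(3000))  # 500.0, possible only because transfers overlap

records_per_day = 24 * 60 // 10
print(records_per_day)     # 144
```

A value of 500% therefore means that the overlapping transfers together accumulated 3000 seconds of activity within a single 600 second interval.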


Managing multiple computers


Fig. 3 Color scheme settings
In the admin >> Set_colors page, you can change the color scheme of a computer's pages. Not every color changes (esp. those in the image files), but those defined programmatically do. You can assign different color schemes to individual computers or to groups of computers, which makes it easier to track which computer or group you have open in a web browser at any one time.

Additionally, the title of each web page contains the computer's name as the first element so that scanning a browser's tab list identifies the computer connected.

Troubleshooting

Setting up a network can be especially challenging. Start by opening the log files visible to you on each computer you control. Verify that outgoing requests from the city computer are reaching the intended computer. You can use a web browser to mimic the city computer's request to help isolate where a problem is.

Causes of an inability to reach the intended computer include a wrong IP address, a firewall, antivirus software, etc. Once these are eliminated, the problem is likely somewhere in the acs: protocol, domain, port, password. Examine each carefully. It is all too easy to write http instead of https.
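
When checking protocol, domain and port by eye, it can help to split the address into its parts. The sketch below assumes the address is written as an ordinary URL, which may not match how an acs is actually stored; describe_url and the example.com addresses are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Return the protocol, domain and effective port of a URL."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "domain": parts.hostname,
        # Fall back to the scheme's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
    }


print(describe_url("http://example.com/"))
# {'protocol': 'http', 'domain': 'example.com', 'port': 80}
print(describe_url("https://example.com:8443/"))
# {'protocol': 'https', 'domain': 'example.com', 'port': 8443}
```

Seeing "http" and port 80 printed where https was intended makes that particular slip easy to spot.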

You can use the Net Stat report in a new setup to verify that data transfers are taking place. It is less useful for troubleshooting an existing active network because of all the "noise", but an existing network typically has working examples to copy or compare against.

After connection is established, the next likely problems are: (1) unintended file erasure and (2) no files transferred.

Unintended file erasure should NOT happen if you followed the advice at the end of the Hydra App page to set up a "test" directory. With this protection in place, the likely correction is to the src_top_dir parameter in the capitol file.

Nothing happening could be merely a symptom of the earlier problem above: being unable to reach the intended computer. Once you are past that point, an apparent lack of transfers usually means that files *are* being transferred, just not where you expect. The likely correction is to the dest_top_dir parameter in the capitol file.

Finally, there is the special case of files overwritten in error. In most cases, you can recover from this via your own backup procedures. For files you create manually in PerServ or Hydra, you may have no prior backup of your own. Whenever you run PS_LAUNCH (not PS_RUN), a backup of your @data directory is written within arch/fileback/[S], where S is one of 1, 2, 3, 4 and today's day of the week (Mon, ..., Sun).