JP Vossen on 20 Apr 2010 21:58:43 -0700

Re: [PLUG] Advice needed on collecting files from FTP site

> Date: Tue, 20 Apr 2010 15:38:00 -0400
> From: Mike Leone <>
> So here's my situation - I have a Linux server set up in a DMZ, running 
> VSFTP. Each FTP account is chrooted. We will be using this for vendors 
> to send us invoices, etc.

FTP, yuck...  (Had to be said.)

> The FTPing part is working fine. The chrooting is working fine. What I 
> need to do now, is to have a method of sweeping through all these home 
> folders; collect any new files; zip them all together; and FTP them 
> inbound to the trusted part of my LAN. And then delete the file, once 
> it's been FTPed in.
> And there I am stuck. :-) I'm sure that's something simple to set up in 
> a script, but I'm not a scripting guy. Not on Linux, and only very 
> little in Windows (although I can figure out how to do this as a windows 
> CMD file).

I'd encourage you to pick up some shell scripting; it will vastly 
increase what you can do as a sysadmin.  The basics are nothing more 
than a DOS batch file, except less buggy and arguably more quirky, 
though there are some DOS batch constructs that boggle the mind.

For learning the bash shell, I recommend _Learning the bash Shell_ 
currently in 3rd edition, though 4th is being worked on.  And if you 
like cookbooks, I may modestly suggest the _bash Cookbook_.  :-)

If you want free/on-line stuff, search for "Bash Guide for Beginners", 
"Advanced Bash-Scripting Guide" and similar.

> So: if anybody knows of a program that already does that sort of pruning 
> and collecting of files, that would be a start. Or a sample script that 
> does something similar, I could maybe fumble my way through.
> This is running on Ubuntu at the moment; eventually it will go on a 
> server running Red Hat Enterprise.

That's an interesting one, and there are a few points I haven't seen 
addressed yet.

1) You need to consider what happens if you do your 
collect/zip/move/delete part while someone is uploading a file.  Sure 
you can run your part in the middle of the night, and that'll work fine 
until someone works late, in a different time zone, or automates their 
part too.

2) A DMZ machine should have very strictly limited ability to connect 
*in* to the LAN, else what's the point.  So having that machine initiate 
the connection into the LAN is sub-optimal.

3) If you run from cron on the DMZ machine, you really need to allow 
email from that machine for cases where the job messes up.  But per #2, 
you don't want it going in to the LAN.  So you can send plain-text email 
out to a mail relay, and then come back in, which possibly leaks info, 
or you can do something else.

So, I'd do something like this.

First, I'd write a script that looks for files in the right place, then 
for each one it finds waits a few secs to see if the file gets bigger. 
If it does, it's still being written, so skip it.  That won't account 
for network delays longer than our wait time, but you've gotta draw the 
line somewhere.  I'd zip -9m the files, which will delete them *if* the 
zip works (and the files are writable).  And I'd keep the last few ZIP 
files, just in case.  This is the DMZ side script (below).
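
To make the "does it get bigger" test concrete, here is a minimal sketch 
of just that one check, pulled out on its own.  The name is_stable is 
made up for illustration, and it assumes GNU stat (stat -c '%s'), which 
is what Ubuntu and Red Hat ship.  Treat it as a starting point, not a 
finished tool.

```bash
#!/bin/bash
# Sketch only: is_stable is a hypothetical helper, not part of the script below.
# Assumes GNU coreutils 'stat' (as found on Ubuntu and Red Hat).

is_stable() {
    # Usage: is_stable FILE [WAIT_SECS]
    # Returns 0 if FILE's size in bytes is unchanged after WAIT_SECS,
    # 1 if the size changed, 2 if the file is missing or unreadable.
    local file="$1" wait_secs="${2:-5}" before after

    [ -r "$file" ] || return 2
    before=$(stat -c '%s' "$file") || return 2
    sleep "$wait_secs"
    after=$(stat -c '%s' "$file") || return 2    # It may have vanished

    [ "$before" = "$after" ]
}

# Example use (the path is a placeholder):
#   is_stable "/home/ftp/vendor1/invoice 42.pdf" 5 && echo "safe to zip"
```

Comparing byte counts rather than block counts catches slow uploads that 
add less than one disk block during the wait.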

I'd create a NON-ROOT user on the DMZ machine, make sure it had 
read-write perms where it needed them, and give it the DMZ side script 
and an SSH key.

Then, I'd put another trivial script on some trusted machine on the LAN, 
that has working cron and email, and I'd set up a cron job for that 
script.  Call this one the LAN side script.  Or, just do it all in-line in 
cron, which'll work for a while, until you start adding more features.

Part 1 of the LAN side script is to SSH into the DMZ as the right user and 
actually run the DMZ side script (using password-less keys or better yet 
'keychain' & SSH agent).  That avoids allowing the DMZ machine in, since 
you are already going out.  And it avoids having to deal with cron and 
email on the DMZ server (though you probably really do want email to 
work for log monitoring and other cron jobs).

Part 2 of the LAN side script is to actually download the file.  But how 
do you know what file to download, if we're naming them with 
CCYYMMDD_HHMMSS and keeping archives?  There are a few ways to deal with 
that, but maybe the simplest and most brute force solution is just to 
rsync all of them, which also gives you a bit of a backup.  Using the 
'rsync --delete' flag will keep the local side cleaned up too.
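
If the cron one-liner outgrows itself, the two parts could be sketched 
as a small script along these lines.  Every host name, path and variable 
here is a placeholder I made up, and DRY_RUN defaults to 1 so that it 
only prints what it would do.  Note that rsync's --delete only takes 
effect on a directory transfer, so this sketch pulls the remote 
directory with an include filter for *.zip instead of a bare wildcard.

```bash
#!/bin/bash
# Sketch of a LAN side script.  All values below are hypothetical placeholders.
DMZ_LOGIN="${DMZ_LOGIN:-snagger@dmz.example.com}"   # user@host on the DMZ box
SSH_KEY="${SSH_KEY:-/path/to/key/file}"
REMOTE_SCRIPT="${REMOTE_SCRIPT:-/remote/path/to/dmz_side_script}"
REMOTE_DIR="${REMOTE_DIR:-./}"                      # Where the ZIPs land
LOCAL_DIR="${LOCAL_DIR:-/home/user/snagged/}"
DRY_RUN="${DRY_RUN:-1}"                             # 1 = print, don't execute

run() {
    # With DRY_RUN=1 just print the command, otherwise execute it.
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

pull_zips() {
    # Part 1: run the DMZ side script over SSH; stop if it fails.
    run ssh -i "$SSH_KEY" "$DMZ_LOGIN" "$REMOTE_SCRIPT" || return 1

    # Part 2: mirror only the ZIP files.  -a implies -r, which --delete
    # needs.  Local files that don't match *.zip are excluded, so
    # --delete leaves them alone.
    run rsync -a --delete --include='*.zip' --exclude='*' \
        -e "ssh -i $SSH_KEY" "$DMZ_LOGIN:$REMOTE_DIR" "$LOCAL_DIR"
}

pull_zips
```

Run it as-is to see the two commands, then set DRY_RUN=0 once they look 
right.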

So, to pull it all together:


LAN side cron job (all on one line):
# Need "passwordless" SSH working first!
# USER@DMZ is a placeholder for the DMZ login (user@host)
... ssh -i /path/to/key/file USER@DMZ '/remote/path/to/' && rsync --delete -e ssh USER@DMZ:*.zip /home/user/snagged/

DMZ side script (will probably get mangled by the MTAs and MUAs):

#!/bin/bash -
# Snag some files and package them up in a ZIP file

TREE='/home/ftp/'  # Must be read-write by user, so ZIP can read and delete
LAST_RUN="$HOME/snag_files.last_run"   # Must be writable by user
SLEEP_SECS='5'     # Wait between file checks.
   # If you have a lot of files to process, this will add up fast...
ZIP_FILE="$HOME/snagged_$(date '+%Y%m%d_%H%M%S').zip"
MAX_ZIPS_TO_KEEP='5'  # Keep this many previous ZIP files, just in case

# Define functions

function _file_size {
     # "Utility" function to return file size
     # Called like: now=$(_file_size "$file")
     # Returns: file size
     # We've already made sure the file exists and is readable, so...
     local file="$1"

     stat -c '%s' "$file"   # Size in bytes (GNU stat), finer than 'ls -s' blocks
} # end of function _file_size

function _shift_by {
     # Shift or remove a given number of items from the top or front of a
     # list, such that you can then perform an action on whatever is left.
     # For example, list some files or directories, then keep only the
     # top 10.
     # It is CRITICAL that you pass the items in order, since all this
     # does is remove the number of entries you specify from the front or
     # top of the list.
     # You should experiment with echo or mv before using rm!
     # Called like:  _shift_by <# to keep> <ls command, or whatever>
     # For example:
     #      rm -rf $(_shift_by $MAX_BUILD_DIRS_TO_KEEP $(ls -rd backup.20*))
     # Returns:  shifted list

     # If $1 is zero or greater than $#, the positional parameters are
     # not changed.  In this case that is a BAD THING!
     if (( $1 == 0 || $1 > ($# - 1) )); then
         echo ''
     else
         # Remove the number of dirs to keep from the list, plus 1 for the
         # 'number of dirs to keep' argument itself.
         shift $(( $1 + 1 ))

         # Return whatever is left
         echo "$*"
     fi
} # end of function _shift_by

# Main()

# Find the files, and make sure they aren't still being written
for file in $(find $TREE -newer $LAST_RUN -type f); do
     # Make sure the file exists and is readable.  Since we just found it,
     # it *should* be, but check anyway...
     [ -r "$file" ] && {
         now=$(_file_size "$file")    # File size?
         sleep $SLEEP_SECS            # Wait a bit
         later=$(_file_size "$file")  # File size again?

         # If the file isn't any bigger, I guess it isn't still being written
         [ "$now" = "$later" ] && files_to_zip="$files_to_zip $file"
     }
done

# IF we have any files to process:
[ "$files_to_zip" ] && {
     # Zip them up (this will barf on files with spaces)
     # -9 = max compression, -m = move them into ZIP (i.e., delete original)
     # Note this KEEPs paths.  -j to junk 'em, but that risks file name
     # collisions
     echo zip -9m $ZIP_FILE $files_to_zip && {
         # IF the zip worked, remove old ZIP files
         zip_files_to_nuke=$( \
           _shift_by $MAX_ZIPS_TO_KEEP $(ls -1r ${ZIP_FILE//_*./*.}) )
         [ "$zip_files_to_nuke" ] && echo rm -rf $zip_files_to_nuke
     }
}

*** UNTESTED ***

As noted all of that code is untested.  Also, the script has two 'echo' 
commands in the place it would actually do something.  Fiddle with it 
and make sure it works if you try to use it, then remove the echos.
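
The script builds one space-separated string of names, which is why it 
barfs on files with spaces.  If that turns out to matter, one possible 
fix is to collect the names into a bash array fed by 'find -print0'.  
This is only a sketch: collect_files is a name I made up, and it leaves 
out the -newer test and the size check to stay short.

```bash
#!/bin/bash
# Sketch only: collect file names safely, even with spaces in them.
# collect_files is a hypothetical name; it is not part of the script above.

collect_files() {
    # Usage: collect_files DIR
    # Fills the global array 'files_to_zip' with every regular file
    # found under DIR, one array element per file name.
    local dir="$1" file
    files_to_zip=()
    while IFS= read -r -d '' file; do
        files_to_zip+=("$file")
    done < <(find "$dir" -type f -print0)
}

# Example use (still with 'echo' in front, as in the script):
#   collect_files "$TREE"
#   (( ${#files_to_zip[@]} )) && echo zip -9m "$ZIP_FILE" "${files_to_zip[@]}"
```

Quoting "${files_to_zip[@]}" hands zip each name as a single argument, 
spaces and all.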

For some primitive sanity checking try: bash -n {script}
For debugging the script try: bash -x {script}
Once it works chmod it executable.

Good luck & hope this is useful,
JP Vossen, CISSP            |:::======|
My Account, My Opinions     |=========|
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.