Paul.L.Snyder on 25 Feb 2004 20:36:02 -0000



Re: [PLUG] rationalizing .Mac web pages


"gabriel rosenkoetter" <gr@eclipsed.net> wrote on 02/25/2004 08:23:23 AM:

> On Wed, Feb 25, 2004 at 01:52:57AM -0500, Paul wrote:
> > Where are the images stored?  Can't you tell wget to just grab *.JPG files?
> How would you go about finding out the urls for the "*.JPG files"?
> Note that wget explicitly does NOT parse the files it retrieves.

("Paul", above, is <gyoza@comcast.net>, not me.)

Here's a four-line perl filter that will de-javascript
a .mac photo album web page piped to it.

I googled for a random .mac photo album, and found the following
link, which I used to test the script:

  http://homepage.mac.com/toj/PhotoAlbum6.html

Here's the de.mac filter:

#!/usr/bin/perl
local $/;
$_=<>;
while(s@(HREF="javascript:openSlideShow\((\d+)\)[^>]+)(.*slides\[\2\] = new Slide\(')([^']+)@HREF="\4"\3\4@gis){};
print;

This works as a one-liner, as well, if you quote your single
quotes carefully.  We slurp the whole file into a variable so
we can match across lines to grab the URL of the full
picture.

Also, here's a hacky shell script (demacwrap) to grab a .Mac
photo album URL and feed it through de.mac:

#!/bin/sh
# Remove any stale copy so wget saves under the plain name, not name.1
if [ -e "${1##*/}" ]; then rm "${1##*/}"; fi
wget -kq "$1"
de.mac "${1##*/}"
rm "${1##*/}"

Wget's -k switch is broken.  -k changes all relative links in
the downloaded file to absolute links, but doesn't work correctly
with the -O switch - it always looks for the default filename,
which is the name of the file on the web server (possibly with a
numeric extension if the file already exists, thus the first rm).
We /should/ be able simply to use 'wget -kqO- | de.mac', but it
doesn't work.
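
One way around the filename problem, sketched here as an assumption
rather than something I have run against mac.com, is to let wget use
its default filename but inside an empty scratch directory, so there
is never an existing file to collide with.  The function name
"demacwrap_tmp" and the DEMAC variable are made up for this example,
and mktemp -d is assumed to be available (it is not strictly POSIX).

```sh
# Sketch only: demacwrap variant that downloads into a scratch directory.
# "demacwrap_tmp" and $DEMAC are hypothetical names; DEMAC defaults to de.mac.
demacwrap_tmp() {
    url=$1
    name=${url##*/}                 # default filename wget will save under
    tmp=$(mktemp -d) || return 1    # empty directory, so no name.1 collisions
    # Run wget in a subshell so the caller's working directory is untouched.
    if (cd "$tmp" && wget -kq "$url"); then
        ${DEMAC:-de.mac} "$tmp/$name"
        status=$?
    else
        status=1
    fi
    rm -rf "$tmp"
    return $status
}
```

Usage would be the same as for demacwrap, for example
"demacwrap_tmp URL > rational.html".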

Run it with, for example:

demacwrap http://homepage.mac.com/toj/PhotoAlbum6.html > rational.html

The result is a local file called rational.html.  All the graphics
point back to mac.com's servers.  The thumbnails are in place and
link to the full image instead of a JS function. You can view this
local file and only download the pictures you wish to see more
closely, or you can now use wget to download all linked files.
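
For the download-everything case, here is a minimal sketch that pulls
the .jpg URLs out of the rationalized page so they can be handed to
wget.  It assumes every URL of interest sits inside double quotes, as
HREF and SRC attribute values normally do, so it will pick up the
thumbnails as well as the full images.  The function name "list_jpgs"
is made up for this example.

```sh
# Sketch only: print each distinct double-quoted value ending in .jpg
# (any case) found in the given HTML file.  "list_jpgs" is a hypothetical name.
list_jpgs() {
    # Splitting on double quotes puts every attribute value on its own line.
    tr '"' '\n' < "$1" | grep -i '\.jpg$' | sort -u
}

# Usage (needs network access; wget reads the URL list from stdin):
#   list_jpgs rational.html | wget -q -i -
```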

Slicker would be to implement this as a web proxy filter.  You
might be able to coax muffin into doing something like this, with
some work - another tactic would be to try rewriting the URL for
the openSlideShowWindow.js file to a copy on your local machine
that behaves the way you desire.

pls

(See attached file: de.mac)(See attached file: demacwrap)
