Parsing a web page for Flash files using ColdFusion
A friend of mine likes the Flash movies found on websites all over the place. After several movies he liked disappeared from the web, he started copying them on to his local hard disk. That's not a problem: all he has to do is view the source of the web page that shows the Flash file, find the link to the Flash file, write an HTML page containing that link, then right-click it and save the .swf file on to his local computer.
The problem is he's not great at HTML, so finding the links to the SWF files among all the other HTML, then writing a page with the right link in, is a bit tricky. So, eventually, I wrote him a little script in ColdFusion which searches through a page he gives it and tries to build links to any Flash files it finds references to. He can then save them himself.
Below is an explanation of what I built and why I did things in certain ways.
I wrote this code for ColdFusion Server 5, but I've tried it with CF MX as well and it works OK. You should be able to use it on CF v4.5, but the code uses CFHTTP, which causes memory leaks in 4.5, so don't use it too much or you'll have to stop and restart the CF services to get them working again. Also, for CFHTTP to work you're going to need to allow ColdFusion access through your firewall to the outside world from the server it runs on.
What's going to happen:
We're going to download the original web page that shows the Flash movie. Then we're going to look through the code for references to Flash .swf files and try to build them in to direct links to the Flash files.
Downloading the content
First, use CFHTTP to download the web page. In the example below, I'm taking the URL as it's submitted from a form, in a field called 'sourceurl'.
<CFHTTP URL = "#FORM.sourceurl#"
METHOD = "GET"
ResolveURL = "Yes"
TIMEOUT = "120">
Then, put the downloaded page in to a variable.
<CFSET ProcessedCode = CFHTTP.FileContent>
The rest of the script is all to do with finding and processing the .swf references, so we'll wrap it in a simple find that looks for them. The Else section just gives a message that there are no .swf files found.
<CFIF FindNoCase('.swf', ProcessedCode, 1) GT 0>
(More code will be going here)
<CFELSE>
<CFSET OutputMsg = 'No .swf files listed in source. Bad luck.'>
</CFIF>
Within the (More code...) section will go the rest of the processing for found .swf files:
Processing the downloaded content
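First, we set the variable 'ProcessStart', which keeps track of where in the downloaded content the search should start from. We also set 'OutputMsg' to an empty string, as links get added on to the end of it later and it needs to exist before that happens.
<CFSET ProcessStart = 1>
<CFSET OutputMsg = ''>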
Now, set up a loop to process the content looking for .swf references ten times (you can increase this to whatever you want.)
<CFLOOP FROM="1" TO="10" INDEX="OverLoop">
Set up an empty variable ready for the first URL containing a .swf reference.
<CFSET ThisURL = ''>
If a reference to '.swf' is found at or after the current position (starting at the 1 we set earlier), do the rest of the processing. This saves doing the processing if there are no more '.swf's to be found.
<CFIF FindNoCase('.swf', ProcessedCode, ProcessStart) GT 0>
Now we can strip out the URL for the .swf from the rest of the HTML we've downloaded. To start doing this, we're going to grab a chunk of HTML starting 116 characters back from the '.swf' that's been found, which is a nominal amount giving us 120 characters in all, and should be long enough to get the whole URL. If 116 characters before the '.swf' works out as less than 1, we'll set it to 1 so the command we're going to use in a moment won't get upset.
<CFSET StartURL = FindNoCase('.swf', ProcessedCode, ProcessStart) - 116>
<CFIF StartURL LT 1>
<CFSET StartURL = 1>
</CFIF>
Using Mid(), cut out a chunk of HTML 120 characters long, ending in '.swf':
<CFSET ChunkCode = Mid(ProcessedCode, StartURL, 120)>
Now we want to find the start of the URL by looking for " or ', but there's a chance that could be the start or end of another URL within the chunk of code we've cut out. So, we reverse the order of the characters in the code, then search for the " or ', which in the reversed order will be the start of the URL we're trying to extract.
<CFSET ChunkCode = Reverse(ChunkCode)>
Set a variable of what could be the start of the URL, as you can't get FindOneOf to search for '"' and "'" at the same time without setting them within a variable.
<CFSET findthese = "'" & '"'>
If the start of the URL is found, it can continue extracting the URL. If not, it will stop looking for this URL and re-set where to start looking in the file, presuming it's on a broken link.
<CFIF FindOneOf(findthese, ChunkCode, 1) GT 0>
Extract the code from the start of the URL to the end of the chunk of code we've already pulled from the main downloaded content using Mid(). The '-1' is necessary on the end of the FindOneOf() section, otherwise you'll still have the " or ' at the start of the URL.
<CFSET ThisURL = Mid(ChunkCode, 1, FindOneOf(findthese, ChunkCode, 1)-1)>
Turn the extracted URL around the right way using Reverse():
<CFSET ThisURL = Reverse(ThisURL)>
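To see why the reversal helps, here is a small sketch of the same steps run on a made-up chunk. The sample HTML is an assumption for illustration only. In the real script the chunk comes from the Mid() call on the downloaded page.

```cfml
<!--- Hypothetical sample chunk: an unrelated link first, then the .swf reference at the end --->
<CFSET ChunkCode = '<a href="other.html">x</a><embed src="movies/intro.swf'>
<CFSET findthese = "'" & '"'>
<!--- Reversed, the chunk starts with the .swf URL backwards, so the first quote found is the one that opens that URL --->
<CFSET Reversed = Reverse(ChunkCode)>
<CFSET QuotePos = FindOneOf(findthese, Reversed, 1)>
<CFIF QuotePos GT 0>
  <!--- Take everything before that quote, then turn it the right way round again --->
  <CFSET ThisURL = Reverse(Mid(Reversed, 1, QuotePos - 1))>
  <CFOUTPUT>#ThisURL#</CFOUTPUT>
</CFIF>
```

With this sample, the quotes around 'other.html' are never reached, so the output should be movies/intro.swf.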
If the URL is a relative (internal to the site) one, we can put the right path in front of it to match it to the site we downloaded it from. We find out whether it's internal by looking for 'http://' at the start.
<CFIF Mid(ThisURL, 1, 7) IS NOT 'http://'>
Use the FORM.sourceurl value that was submitted to work out the domain (the http://www.whatever.com part) and any directories that should go in front of the filename.
If the URL starts with '/', it's relative to the root of the site, so only the website address goes at the start.
<CFIF Mid(ThisURL, 1, 1) IS '/'>
<CFSET SitePath = ListGetAt(FORM.sourceurl, 1, "/")>
<CFSET SitePath = SitePath & "//" & ListGetAt(FORM.sourceurl, 2, "/")>
Otherwise, it's relative to the directory the HTML file is in, so put the domain and path in.
<CFELSE>
<CFSET SitePath = Reverse(FORM.sourceurl)>
<CFSET SitePath = ListRest(SitePath, "/")>
<CFSET SitePath = Reverse(SitePath) & '/'>
</CFIF>
Put the site path in front of the URL, then close the IF that checked for 'http://'.
<CFSET ThisURL = SitePath & ThisURL>
</CFIF>
If the URL was relative, it should now be a full URL.
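As a worked example of the relative URL handling, the sketch below runs the same logic on made-up values. 'SourceURL' is a hypothetical stand-in for FORM.sourceurl, and both sample addresses are assumptions, not taken from a real site.

```cfml
<!--- Hypothetical values standing in for FORM.sourceurl and an extracted relative URL --->
<CFSET SourceURL = 'http://www.example.com/games/page.html'>
<CFSET ThisURL = 'flash/game.swf'>
<CFIF Mid(ThisURL, 1, 7) IS NOT 'http://'>
  <CFIF Mid(ThisURL, 1, 1) IS '/'>
    <!--- Relative to the site root: keep only 'http:' and the host name --->
    <CFSET SitePath = ListGetAt(SourceURL, 1, "/") & "//" & ListGetAt(SourceURL, 2, "/")>
  <CFELSE>
    <!--- Relative to the page: drop the file name, keep the directory --->
    <CFSET SitePath = Reverse(ListRest(Reverse(SourceURL), "/")) & '/'>
  </CFIF>
  <CFSET ThisURL = SitePath & ThisURL>
</CFIF>
<CFOUTPUT>#ThisURL#</CFOUTPUT>
```

With these values the result should be http://www.example.com/games/flash/game.swf. Changing the sample ThisURL to '/flash/game.swf' should give http://www.example.com/flash/game.swf instead.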
Build the URL in to a message to be output when the script has finished. This builds a URL referencing the .swf so the link can be right-clicked and the .swf saved.
<CFSET OutputMsg = OutputMsg & '<p><a href="#ThisURL#">#ThisURL#</a></p>'>
Close the IF that looked for the start of a URL in the extracted chunk of HTML.
</CFIF>
Set the place to start looking for the next '.swf' as just past where the current one was found.
<CFSET ProcessStart = FindNoCase('.swf', ProcessedCode, ProcessStart) + 4>
Finally, close the IF that checked whether another '.swf' could be found, then close the loop.
</CFIF>
</CFLOOP>
And that's it. This should extract links for up to ten .swf files from the HTML page you give it.
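If you'd rather not pick a fixed number of passes, one possible variation is a conditional loop that carries on for as long as another '.swf' is found. This is only a sketch of the idea, not part of the script described above. The sample page and the 'SwfPos' and 'SwfCount' variables are made up for illustration.

```cfml
<!--- Hypothetical sample page with two .swf references --->
<CFSET ProcessedCode = '<embed src="one.swf"><embed src="two.swf">'>
<CFSET SwfCount = 0>
<CFSET ProcessStart = 1>
<CFSET SwfPos = FindNoCase('.swf', ProcessedCode, ProcessStart)>
<!--- Keep going while another '.swf' is found, rather than doing a fixed ten passes --->
<CFLOOP CONDITION="SwfPos GT 0">
  <CFSET SwfCount = SwfCount + 1>
  <!--- (the URL extraction steps would go here) --->
  <CFSET ProcessStart = SwfPos + 4>
  <CFSET SwfPos = FindNoCase('.swf', ProcessedCode, ProcessStart)>
</CFLOOP>
<CFOUTPUT>#SwfCount# references found</CFOUTPUT>
```

The trade-off is that a page with a very large number of references will be processed in full, where the fixed loop gives a hard upper limit.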
Here's a working version of the code, with a form added in to make it easy to use: zip of working page
It has occurred to me that if the start of the chunk of code that gets cut out using Mid() is set to 1 because it worked out as less than 1, then when you chop out 120 characters of code, it won't end with '.swf' any more. I'll need to put in some extra code to set that 120 to something smaller, so the rest can still hold true.
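One way that extra code might look is to work out the chunk length from the two positions instead of hard-coding 120. This is a sketch under that assumption, not tested against the original page. 'SwfPos' and 'ChunkLength' are hypothetical variable names, and the sample HTML is made up.

```cfml
<!--- Hypothetical sample page where '.swf' appears fewer than 116 characters in --->
<CFSET ProcessedCode = '<embed src="intro.swf" width="400" height="300">'>
<CFSET ProcessStart = 1>
<CFSET SwfPos = FindNoCase('.swf', ProcessedCode, ProcessStart)>
<CFSET StartURL = SwfPos - 116>
<CFIF StartURL LT 1>
  <CFSET StartURL = 1>
</CFIF>
<!--- The chunk runs from StartURL to the last character of '.swf' (SwfPos + 3), so it always ends in '.swf' --->
<CFSET ChunkLength = SwfPos + 4 - StartURL>
<CFSET ChunkCode = Mid(ProcessedCode, StartURL, ChunkLength)>
<CFOUTPUT>#HTMLEditFormat(ChunkCode)#</CFOUTPUT>
```

When StartURL hasn't been reset to 1, the length still comes out as 120, so the rest of the script behaves as before. With the sample above, the chunk is the first 21 characters of the page, ending at 'intro.swf'.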
Paul Silver. October 2003