On Mirroring FA
12 years ago
So let's say you're a FA user, as indeed you probably are. You are subscribed to a number of artists, you want to build a spank bank, but FA as a service is notoriously unreliable, so you think: "Hey, I know, I'll just download everything to my hard drive, that way the next time FA goes down I can continue my masturbation uninterrupted!" Then you look at a few things: the load and render time for a page showing you just the thumbnails of your submissions, after the banner ads, CSS, JS, and thumbnails are pulled down and the scripts run, is between 2 and 3 seconds. (Source: Firefox+HTTPFox, my laptop on 500 KB/s wifi, accessing http://www.furaffinity.net/msg/submissions at about 8:27PM EST on 8/23.) Then you have to look at each thumbnail, decide whether you want to download it for 'later perusal in detail', click through to its individual submission page, click the 'download' link to get your dog dicks in throbbingly full resolution and achingly realistic color depth, then save that image to your hard drive. All of these steps take time, and are boring and tedious.
Therefore: write a program to do it. Computers don't get bored, can respond as quickly as new information comes in with none of that silly mousing and clicking, and have access to more information, and more precise information, than a human would ever want to take into account to get the job done.
So, you set out to download your complete collection of dog dicks... /automatically/. With Python, 'cause why not, long phallic muscular binding things are cool. So you go forth unto the world and ask Google: "How do I automate interacting with a website, using Python?" You'd get a variety of answers, ranging from UBot, to Selenium, to Py.Mechanize, to doing it like a man with your bare hands and Requests, to realizing that Python is unnecessary and doing it like a UNIX-using man with wget and a custom parser written in Perl and butterfly waggles, to... etc. The point is that, using any of these tools, or a hypothetical unholy combination of them, you somehow convince FA that your bot acts enough like you (read: submits your username and password to the right form, saves the authentication cookie that it gets back, and supplies it to FA on later requests, along with whatever hidden fields and anti-bot cleverness the site devs probably haven't included) to let your bot download /msg/submissions, and /msg/submissions/36, yea unto an infinite expanse of animal-people in flagrante delicto.
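Here's a rough sketch of that log-in-and-keep-the-cookie step with Requests. The login URL and the form field names ("name", "pass") are guesses on my part - peek at the actual login form's HTML for the real action URL and input names before trusting them:

import requests

session = requests.Session()    # the Session object keeps cookies across requests for us

LOGIN_URL = "https://www.furaffinity.net/login/"    # assumed endpoint
credentials = {
    "name": "your_username",    # assumed form field name
    "pass": "your_password",    # assumed form field name
}

# Post the credentials; whatever auth cookie comes back is stored on the
# session and sent automatically with every later request.
resp = session.post(LOGIN_URL, data=credentials)
resp.raise_for_status()

# From here on, session.get() requests are made "as you":
inbox = session.get("https://www.furaffinity.net/msg/submissions/")
print(inbox.status_code, len(inbox.text))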
Your program now has the raw text of /msg/submissions. How does it go from that, to having an image on your hard drive? For lack of a better idea, I would suggest that it follow the same process that a human would: click on each thumbnail, follow the download link, save the file behind the download link to disk. So, given that the human in question is looking only at the HTML source of the submissions page, they first need to figure out what the thumbnail looks like. We already know that it'll be an image-based hyperlink - it's a thumbnail, after all - so we can look for instances of that with a regular expression looking something like "<a href="*">*<img * />*</a>" (Don't quote me, that's almost certainly not a functional regex, it's there purely to get the idea across.). Lo and behold, we get a number of matches that look like
<a href="/view/11435851/"><img alt="" src="//t.facdn.net/11435851@200-1377303446.jpg"/><i class="icon" title="Click for description"></i></a>
From this we know that the submission in question can be viewed at www.furaffinity.net/view/11435851, and we know where to pull down the thumbnail image - it lives at http://t.facdn.net/11435851@200-1377303446.jpg. The thumbnail's filename is interesting, as well - it has the submission number, the thumbnail resolution, and a strange number that just so happens to be a valid UNIX-style timestamp (it yields 8/23/2013, 7:17:26 EST). I presume that it's the most recent time the owner uploaded the image, but it's not terribly important for our purposes.
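As a sketch of how the bot might pull those two URLs out of each match, here's some Python built around a regex shaped like the anchor above. FA's real markup may differ in detail, so treat the pattern as illustrative rather than gospel:

import re

# Matches anchors shaped like the one quoted above: a /view/NNNN/ link
# wrapping an <img> whose src points at t.facdn.net.
THUMB_RE = re.compile(
    r'<a href="(/view/\d+/)">\s*<img[^>]+src="(//t\.facdn\.net/[^"]+)"',
    re.IGNORECASE,
)

def extract_submissions(page_html):
    """Return a list of (submission_page_url, thumbnail_url) pairs."""
    results = []
    for view_path, thumb_src in THUMB_RE.findall(page_html):
        results.append((
            "https://www.furaffinity.net" + view_path,
            "https:" + thumb_src,    # the src is protocol-relative; make it absolute
        ))
    return results

sample = ('<a href="/view/11435851/"><img alt="" '
          'src="//t.facdn.net/11435851@200-1377303446.jpg"/>'
          '<i class="icon" title="Click for description"></i></a>')
print(extract_submissions(sample))
# [('https://www.furaffinity.net/view/11435851/',
#   'https://t.facdn.net/11435851@200-1377303446.jpg')]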
So we know what the submission's page is. Let's tell our bot to load it up, then move to the next step: saving the file to our hard drive. Looking at the page's source with an eye for anything useful, we can see code to load some more JS at runtime, the markup for the site's navigation, some dropdown menus, some ads... and then we get to a comment that calls out a block of code as a "Media embed container". Presumably this does interesting things depending on what exactly the artist uploaded: an image, a Flash animation, a text document, a PDF, an audio file, etc. In this case the container has a script that fires when the user clicks on the submitted image, toggling the displayed image between files sourced from small_url and full_url, where small_url is http://t.facdn.net/11435851@400-1377303446.jpg, and full_url is http://d.facdn.net/art/kraytsao/137.....ytsao_live.jpg. I'm going to make a wild guess here and say that this is our pay dirt: URLs from which our browser may fetch either the partial-res or full-res image, for display to us, the user. We can be a little paranoid here and check that the image is served on request even after logging out, and yes indeedy, it gets served right up.
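Here's a sketch of those last two steps, assuming the page really does carry a full_url assignment like the one above: grep it out of the submission page's HTML, then stream the file to disk. The session argument is the logged-in Requests session from the earlier sketch; as noted, the image server doesn't seem to care whether you're logged in, so plain requests works too:

import os
import re
import requests

# Looks for an assignment like: full_url = "//d.facdn.net/art/..."
FULL_URL_RE = re.compile(r'full_url\s*=\s*["\']([^"\']+)["\']')

def find_full_url(submission_html):
    """Return the full-res image URL from a submission page, or None."""
    match = FULL_URL_RE.search(submission_html)
    if not match:
        return None
    url = match.group(1)
    return "https:" + url if url.startswith("//") else url

def save_image(image_url, session=None, out_dir="downloads"):
    """Stream the file at image_url to disk, named after the last path segment."""
    os.makedirs(out_dir, exist_ok=True)
    filename = image_url.rsplit("/", 1)[-1]
    getter = session if session is not None else requests    # logged-in session is optional
    resp = getter.get(image_url, stream=True)
    resp.raise_for_status()
    with open(os.path.join(out_dir, filename), "wb") as f:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            f.write(chunk)
    return filename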
In summary: download /msg/submissions and all the following pages, to get the URL for the page of each submission. Download the file at that URL, and pull the URL of the full-res image out of it. Save the file at that URL to your hard drive, and the job's done. Quick, easy, simple.
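The one piece not yet sketched is walking from one page of submissions to the next. The sketch below assumes, based on the /msg/submissions/36 example earlier, that each page links onward with an href of the form /msg/submissions/<number>/ - verify that against the real markup, since this pattern would also happily match a "previous page" link if one came first:

import re

# Assumed shape of the link to the next page of the inbox.
NEXT_PAGE_RE = re.compile(r'href="(/msg/submissions/\d+/?)"')

def walk_submission_pages(session, start="/msg/submissions/", max_pages=50):
    """Yield the HTML of each submissions page, following next-page links."""
    base = "https://www.furaffinity.net"
    seen = set()
    path = start
    for _ in range(max_pages):
        if path in seen:
            break    # stop if the site links back to a page we've already seen
        seen.add(path)
        html = session.get(base + path).text
        yield html
        match = NEXT_PAGE_RE.search(html)
        if not match:
            break    # no onward link found; we're done
        path = match.group(1)

# Usage, with the logged-in session from the first sketch:
#   for page_html in walk_submission_pages(session):
#       for view_url, thumb_url in extract_submissions(page_html):
#           ...fetch view_url, find_full_url(), save_image()...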
See part 2 for how to do all of the above without being a dick.
The other way is to get FA to run File Transfer Protocol server software. It's better for them because it doesn't have all the bulshiite overhead of Apache. All that markup is useless and bandwidth-hogging anyway.
You know FTP; it's been around since before the USA Department of Defense's ARPAnet research that turned into the internet. Your web browser can use FTP to download images and href targets anyway.
Just realise that neither the thumbnails nor the href images contain metadata like tags or comments. It's all the Structured Query Language database searches, in addition to image retrieval, that slow the final webpage down.