On Mirroring FA Politely, Scalably, Throttlably, and Fault Tolerantly
Posted 12 years ago

In Part 1 I discussed the mechanics of writing a web crawler, tailored for FA, that would download each of your submissions and save them to your machine without you having to lift a finger - beyond writing it, of course, but the time and attention cost there is more than amortized over doing the process by hand for several thousand images. What Part 2 will discuss isn't how you figure out which file you want to download and then download it, but how to do all of that in a way that minimizes disruption to the site's normal operation.
Running a website isn't free, and it's even less free at the scale at which FA operates. They lease space to physically place the servers in, buy power to feed them and more power to handle the resulting waste heat, and most importantly they pay for bandwidth, generally priced per terabyte of total traffic into their servers and out into the world. Further, their computational resources are finite: assuming we know the URLs for all umpty-thousand pictures we want, submitting a request for each of them simultaneously would tax their servers far more heavily than any single user is expected to. In short, interacting with the site, whether by hand or through automation, costs FA money, and if we're not careful our program can easily impact the site's usability for everyone else.
So, then, how much usage is okay? At one extreme, we rent some time on a botnet, have a thousand computers each submit N/1000 requests simultaneously, and then saturate our connection pulling the results back from the botnet, all without guilt. Unfortunately, this would constitute a DDoS on FA, and is therefore not an option. At the other extreme, we don't download anything, ever - but then we don't get our private copy of our submissions, which is the whole point of the exercise. So the question to ask isn't 'what's reasonable?' in the abstract, but 'how far beyond a single person's normal use is still reasonable?'
This leads naturally to the definition of 'normal use'. Doing the bot's job by hand, we first open /msg/submissions. Our browser sends a request to the site, which runs a PHP script to generate a result and return it to us. We can get a sense for how much of an imposition this is by looking at how long the script has to run to do this, and how many queries it makes to the database to do so. Taking my check just now as indicative, the page footer tells me "Page generated in 0.045 seconds [ 70.8% PHP, 29.2% SQL ] (13 queries)" - the script ran for about 30 milliseconds, and made 13 database queries in that time, which cost about another 15 milliseconds. What we get from this is that generating the source code for /msg/submissions is a cheap operation, one that we can do at high frequency without much guilt.
"But wait," you cry, "way back in Part 1 you claimed that the page took between two and three seconds to load completely! And looking at my handy-dandy tool HTTPFox, I see that the request for the first page is completed in 250 milliseconds! Where is all that time going?" The answer is that it gets spent pulling down other resources - your browser fetches the page source, which tells it everything it needs to complete rendering the page, which for /msg/submissions is a CSS file, 38 image files, 2 javascript files, and then whatever else the page's scripting or the browser's extensions might pull in from third parties. Each image weighs in at 8 to 12 KB on average, and takes, in my experiment, between 1200 and 1800 ms to completely download. All of this, however, is wasted on our bot - recall that it only needs to load the submissions page at all in order to parse out the location of each submission page, and we only need the submission page because we can't figure out how to turn the submission ID directly into a URL for the full-resolution image.
So, then, we see already that we have two different intertwined operations: getting the URLs of the files to download, and actually saving the files themselves to disk. A request for each kind of resource costs a different mix of resources to fulfill: generating a page takes mostly CPU time, running the page script and executing database queries, but doesn't generate much traffic over the wire; serving up a static image is computationally very easy if the server is configured correctly, but easily incurs hundreds of KB of bandwidth.
So, then, let's model a normal user's load as follows: they load /msg/submissions; this incurs 30 ms of script time, 15 ms of DB time, and 300 KB of image downloads, and completes in 3 seconds. Then, being quick on the draw, they click a thumbnail, costing 6 ms of CPU time and 14 ms of DB time over 22 queries, plus 130 KB for the image and (about) 100 KB for UI images, completing in 3.2 s. They then immediately click the image to get its full-res version, costing (virtually) no CPU or DB time and another 350 KB of bandwidth, completed in about 1 second. Return to Submissions, repeat, apply bandwidth savings from browser cache. In total, a single user's normal load, worst case, is 36 ms of script time, 30 ms of DB time (over 35 queries), and 750 KB of image bandwidth, over 7.5 s.
Calculating our bot's equivalent load, then, is straightforward: a submissions page request costs about 45 ms of server time, and our modeled user incurs about 65 ms of server time every 7.5 s, so parity works out to one submissions page request every 7.5 × 45/65 ≈ 5.2 seconds. Similarly, loading the server with nothing but individual item pages (about 20 ms each) gives one request every 2.3 seconds. For bulk image downloads, normal use is almost exactly 100 KB/s. An equivalent rate of image requests naturally depends on the size of the images requested; a very brief survey of full-res images gives a guesstimated average file size of perhaps 200 KB.
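If you want to redo this arithmetic against your own measurements, it boils down to a few lines of Python; the constants below are just the rough numbers from the model above, not anything authoritative.

    # Back-of-the-envelope rate calculation, using the rough measurements above.
    USER_SERVER_MS = 65.0         # script + DB time per modeled user cycle
    USER_CYCLE_S = 7.5            # duration of one modeled user cycle
    SUBMISSIONS_PAGE_MS = 45.0    # 30 ms PHP + 15 ms SQL per /msg/submissions
    ITEM_PAGE_MS = 20.0           # 6 ms PHP + 14 ms SQL per submission page

    # Interval at which bot requests impose the same server load as one user.
    submissions_interval = USER_CYCLE_S * SUBMISSIONS_PAGE_MS / USER_SERVER_MS  # ~5.2 s
    item_interval = USER_CYCLE_S * ITEM_PAGE_MS / USER_SERVER_MS                # ~2.3 s
    image_bandwidth_kbps = 750.0 / USER_CYCLE_S                                 # ~100 KB/s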
All of this is very, very bad modelling. It doesn't account for users blocking image downloads, or for aborting a page load/render, or for spending more than zero time examining an image before returning to the Submissions page, or for returning to the Submissions page before a single piece's page has finished loading, or for caching at the browser or ISP level, or any number of other alternative use cases, all of which would have to be considered to declare anything like an 'average load per concurrent user per second'. It does, however, serve handily as an order-of-magnitude estimate: a single user making 10 submissions page requests per second is clearly abusing the service, and a single user making 1 request every 10 seconds is requesting pages more slowly than usual.
(Factoring in that our bot isn't loading other parts of FA nearly as heavily as a normal user would, in order to justify a different rate of requests of each type, is completely valid. Scheduling our bot to run outside of peak use times is an additional factor. I personally wouldn't feel terribly guilty about requesting 1 page/s of either type, and running a separate image-fetching script capped at 100 KB/s, at any time of the day.)
All of this, then, leads naturally to a pipelined architecture for our bot, with two stages - page fetcher and image fetcher. We already know where we start: at /msg/submissions. We use the freshly-fetched Submissions page to get the URL for the next Submissions page and look for the absence of a Next link to know when we're at the last page. Each Submissions page gives a number of individual page URLs, each of which yields an image URL. Each page source request takes about 250ms to complete, during which our computer is idle; parsing each page should take well under 100ms.
We can very cheaply generate the URL for the next submissions page: take the last image ID from the previous submissions page, add '@36' to the end, concatenate that onto '/msg/submissions/' and request it; this means that the input to our page fetcher won't easily run dry. We might use a queue to track which pages we've requested and which pages we should request next, and when - /msg/submissions starts the chain, then the 36 links from that page are pushed onto the queue, and the next Submissions URL is pushed on after them. The image URLs found by the page fetcher are then pushed into a second queue, drained by a loop that fetches those URLs at a polite average bandwidth.
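A minimal sketch of that two-queue bookkeeping, with the fetch and parse helpers left as stand-ins (their names, and the exact URL formats, are assumptions rather than anything FA guarantees):

    # Two queues: pages whose HTML we still need, and image URLs awaiting download.
    from collections import deque

    page_queue = deque(["/msg/submissions"])   # seeded with the first Submissions page
    image_queue = deque()

    def process_next_page(fetch_html, parse_submissions, parse_item):
        # fetch_html, parse_submissions, and parse_item are hypothetical helpers:
        # fetch a URL's source, pull item links plus the next-page link from a
        # Submissions page, and pull the full-res URL from an item page.
        url = page_queue.popleft()
        html = fetch_html(url)
        if url.startswith("/msg/submissions"):
            item_urls, next_page = parse_submissions(html)
            page_queue.extend(item_urls)          # the ~36 /view/<id>/ links
            if next_page:                         # no Next link means the last page
                page_queue.append(next_page)
        else:
            image_queue.append(parse_item(html))  # hand the image URL to stage two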
Implementing a rate-limited page fetcher can be done in Python with time.sleep() and the other utilities in the time module - record when the thread makes each request, make the next page request, parse it, and then sleep() until the next valid request time. A way of implementing a bandwidth-limited image fetcher without digging into network drivers might be to fetch the next URL, divide its size by the allotted bandwidth to get how long the download 'should' have taken at the polite rate, then sleep() for the remainder of that period - so you get fast individual image downloads but, averaged over time, aren't stressing FA's servers more than you want.
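As a sketch, assuming the Requests library from Part 1 and the intervals estimated above, both loops simply sleep off whatever is left of their polite period after the real work finishes:

    import time
    import requests

    PAGE_INTERVAL = 5.0         # seconds between page requests
    IMAGE_BANDWIDTH = 100000    # target long-run average, in bytes per second

    def fetch_pages_politely(session, urls):
        # Rate-limited page fetcher: at most one request per PAGE_INTERVAL.
        for url in urls:
            started = time.time()
            yield url, session.get(url).text
            leftover = PAGE_INTERVAL - (time.time() - started)
            if leftover > 0:
                time.sleep(leftover)

    def fetch_images_politely(session, image_urls, save):
        # Bandwidth-limited image fetcher: each file downloads at full speed, then
        # we sleep until size/IMAGE_BANDWIDTH seconds have passed, so the average
        # rate stays at or below the cap.
        for url in image_urls:
            started = time.time()
            data = session.get(url).content
            save(url, data)
            leftover = len(data) / IMAGE_BANDWIDTH - (time.time() - started)
            if leftover > 0:
                time.sleep(leftover)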
Miscellaneous notes:
The easiest, politest way of doing all this may simply be to email the admins and ask for a DVD with your submissions on it. If I were going to mirror the entirety of FA, that would definitely be cleanest, and I'm sure they'd appreciate being able to run such a hefty operation on their own terms.
If your download period is going to run for a long time, don't forget to account for new submissions coming in as you delve deeper into history. This might be implemented by having every 10th or 30th submissions request be for the root, and then only adding the new image pages to the page request queue.
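One way to sketch that, reusing the hypothetical page_queue and parser from earlier and keeping a set of pages we've already seen:

    RECHECK_EVERY = 30          # every Nth page request, go back to the root
    seen_urls = set()

    def maybe_recheck_root(request_count, fetch_html, parse_submissions):
        if request_count % RECHECK_EVERY != 0:
            return
        item_urls, _ = parse_submissions(fetch_html("/msg/submissions"))
        for item_url in item_urls:
            if item_url not in seen_urls:         # only queue genuinely new arrivals
                seen_urls.add(item_url)
                page_queue.append(item_url)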
With the above scheme, you're going to accumulate image URLs faster than you use them. With a lot of URLs, this could exceed the size of your machine's memory, or cause other unpleasantness. And for long-running operations, an interruption of any one dependency shouldn't mean you restart the entire process.
What about fault tolerance? You don't want to have to refill the entire pipeline and retrace all your steps if your machine crashes, or your disk fills up, or you have to pause operations because FA went down. I hear that Redis databases are popular for handling large queues with persistence.
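For instance, a minimal sketch with the redis-py client, assuming a Redis server running locally; the key name is arbitrary:

    import redis

    r = redis.Redis()    # local server, default port

    def enqueue_image(url):
        r.rpush("fa:image_queue", url)      # persisted, so it survives a crash

    def next_image():
        raw = r.lpop("fa:image_queue")      # None once the queue is empty
        return raw.decode() if raw is not None else None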
Building a searchable index based on image metadata - from the file itself, the file hash, the FA keywords and tags, the post date, number of commenters, sentiment analysis of comments, etc - might be cool. And you already have all that info at your fingertips since you've loaded the page; might be a good idea to stash a compressed copy of the page along with the image for later processing.
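Stashing the page is cheap; a sketch with the standard gzip module, with the naming scheme purely illustrative:

    import gzip

    def stash_page(submission_id, page_html):
        # Keep a gzipped copy of the submission page next to its image for
        # later metadata mining.
        with gzip.open("%s.html.gz" % submission_id, "wt", encoding="utf-8") as f:
            f.write(page_html)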
On Mirroring FA
Posted 12 years ago

So let's say you're an FA user, as indeed you probably are. You are subscribed to a number of artists, you want to build a spank bank, but FA as a service is notoriously unreliable, so you think: "Hey, I know, I'll just download everything to my hard drive, that way the next time FA goes down I can continue my masturbation uninterrupted!" Then you look at a few things: the load and render time for a page showing you just the thumbnails of your submissions, after the banner ads, CSS, JS, and thumbnails are pulled down and the scripts run, is between 2 and 3 seconds. (Source: Firefox+HTTPFox, my laptop on 500 KB/s wifi, accessing http://www.furaffinity.net/msg/submissions at about 8:27PM EST on 8/23.) Then you have to look at each thumbnail, decide whether you want to download it for 'later perusal in detail', click through to its individual submission page, click the 'download' link to get your dog dicks in throbbingly full resolution and achingly realistic color depth, then save that image to your hard drive. All of these steps take time, and are boring and tedious.
Therefore: write a program to do it. Computers don't get bored, can respond as quickly as new information comes in with none of that silly mousing and clicking, and have access to more information, and more precise information, than a human would ever want to take into account to get the job done.
So, you set out to download your complete collection of dog dicks... /automatically/. With Python, 'cause why not, long phallic muscular binding things are cool. So you go forth unto the world, and ask Google: "How do I automate interacting with a website, using Python?" You'd get a variety of answers, ranging from UBot, to Selenium, to Py.Mechanize, to doing it like a man with your bare hands and Requests, to realizing that Python is unnecessary and doing it like a UNIX-using man with wget and a custom parser written in Perl and butterfly waggles, to... etc. The point is that, using any of these tools, or a hypothetical unholy combination of them, you somehow convince FA that your bot acts enough like you (read: submits your username and password to the right form, saves the authentication cookie that it gets back and supplies it to FA on later requests, along with whatever hidden fields and anti-bot cleverness the site devs probably haven't included) to let your bot download /msg/submissions, and /msg/submissions/36, yea unto an infinite expanse of animal-people in flagrante delicto.
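For the Requests route, the skeleton looks something like the following - note that the login URL, form field names, and any hidden fields here are guesses, not the real ones; you'd need to inspect the actual login form to fill them in:

    import requests

    session = requests.Session()
    # Hypothetical form fields -- check the real login page for the actual names
    # and for any hidden anti-bot fields that need to be echoed back.
    session.post("http://www.furaffinity.net/login/", data={
        "name": "your_username",
        "pass": "your_password",
    })
    # The session object stores whatever cookies came back and re-sends them:
    submissions_html = session.get("http://www.furaffinity.net/msg/submissions/").text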
Your program now has the raw text of /msg/submissions. How does it go from that, to having an image on your hard drive? For lack of a better idea, I would suggest that it follow the same process that a human would: click on each thumbnail, follow the download link, save the file behind the download link to disk. So, given that the human in question is looking only at the HTML source of the submissions page, they first need to figure out what the thumbnail looks like. We already know that it'll be an image-based hyperlink - it's a thumbnail, after all - so we can look for instances of that with a regular expression looking something like "<a href="*">*<img * />*</a>" (Don't quote me, that's almost certainly not a functional regex, it's there purely to get the idea across.). Lo and behold, we get a number of matches that look like
<a href="/view/11435851/"><img alt="" src="//t.facdn.net/11435851@200-1377303446.jpg"/><i class="icon" title="Click for description"></i></a>
From this we know that the submission in question can be viewed at www.furaffinity.net/view/11435851, and we know where to pull down the thumbnail image - it lives at http://t.facdn.net/11435851@200-1377303446.jpg . The thumbnail's filename is interesting, as well - it has the submission number, the resolution, and a strange number that just so happens to be a valid UNIX-style timestamp (it yields 8/23/2013, 7:17:26 EST). I presume that it's the most recent time the owner uploaded the file, but it's not terribly important for our purposes.
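As a working-ish version of that idea, keyed to the markup shown above (a proper HTML parser would be sturdier if the markup ever changes):

    import re

    # Matches the thumbnail anchors shown above and captures the /view/<id>/ path
    # and the numeric submission ID.
    VIEW_LINK = re.compile(r'<a href="(/view/(\d+)/)"><img[^>]*src="//t\.facdn\.net/[^"]+"')

    def extract_submissions(page_html):
        return [("http://www.furaffinity.net" + path, int(sub_id))
                for path, sub_id in VIEW_LINK.findall(page_html)]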
So we know what the submission's page is. Let's tell our bot to load it up, then move to the next step: saving the file to our hard drive. Looking at the page's source with an eye for anything useful, we can see code to load some more JS at runtime, the markup for the site's navigation, some dropdown menus, some ads... and then we get to a comment that calls out a block of code as a "Media embed container". Presumably this does interesting things depending on what exactly the artist uploaded: an image, a flash animation, a text document, a PDF, an audio file, etc - in this case the Container has a script that's fired when the user clicks on the submitted image, toggling the displayed image between files sourced from small_url and full_url, where small_url is http://t.facdn.net/11435851@400-1377303446.jpg, and full_url is http://d.facdn.net/art/kraytsao/137.....ytsao_live.jpg . I'm going to make a wild guess here and say that this is our pay dirt: URLs from which our browser may fetch either the partial-res or full-res image, for display to us, the user. We can be a little paranoid here and check that the image is served on request even after logging out, and yes indeedy, it gets served right up.
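A hedged sketch of fishing full_url out of that script block - the exact formatting FA emits may differ, so treat the pattern as a starting point rather than a guarantee:

    import re

    # Accepts either `full_url = "..."` or `"full_url": "..."` style assignments.
    FULL_URL = re.compile(r'full_url["\']?\s*[:=]\s*["\']([^"\']+)["\']')

    def extract_full_res(page_html):
        match = FULL_URL.search(page_html)
        return match.group(1) if match else None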
In summary: download /msg/submissions and all the following pages, to get the URL of each submission's page. Download the page at each of those URLs, and pull the URL of the full-res image out of it. Save the file at that URL to your hard drive, and the job's done. Quick, easy, simple.
See part 2 for how to do all of the above without being a dick.