On Mirroring FA Politely, Scalably, Throttlably, and Fault Tolerantly
In Part 1 I discussed the mechanics of writing a web crawler, tailored for FA, that would download each of your submissions and save them to your machine without you having to lift a finger - beyond writing it, of course, but that time and attention cost is more than amortized compared to doing the process by hand for several thousand images. Part 2 isn't about how you figure out which file you want to download and then download it, but about how to do so in a way that minimizes disruption to the site's normal operation.
Running a website isn't free, and it's even less free at the scale at which FA operates. They lease space to physically house the servers, buy power to feed them and more power to handle the resulting waste heat, and most importantly they pay for bandwidth, generally priced per terabyte of total traffic into their servers and out into the world. Further, their computational resources are finite: assuming we know the URLs for all umpty-thousand pictures we want to download, submitting a request for each of them simultaneously would tax their servers much more heavily than any single user is expected to. In short, interacting with the site, whether by hand or through automation, costs FA money, and if we're not careful our program can easily impact the site's usability for everyone else.
So, then, how much usage is okay? At one extreme, we rent some time on a botnet and have a thousand computers each submit N/1000 requests simultaneously, then saturate our own connection pulling the results back from the botnet without guilt. Unfortunately, this would constitute a DDoS on FA, and is therefore not an option. At the other extreme, we never download anything - but then we don't get our private copy of our submissions, which is the whole point of the exercise. So let's ask not what's reasonable in the abstract, but how far beyond a single person's normal use we can go and still call it reasonable.
This leads naturally to the definition of 'normal use'. Doing the bot's job by hand, we first open /msg/submissions. Our browser sends a request to the site, which runs a PHP script to generate a result and return it to us. We can get a sense of how much of an imposition this is by looking at how long the script runs and how many queries it makes to the database along the way. Taking my check just now as indicative, the page footer tells me "Page generated in 0.045 seconds [ 70.8% PHP, 29.2% SQL ] (13 queries)" - the script ran for about 30 milliseconds and caused 13 database queries in that time, which cost about another 15 milliseconds. What we get from this is that generating the source code for /msg/submissions is a cheap operation, one we can perform at high frequency without much guilt.
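Incidentally, that footer is easy to scrape programmatically, so the bot can log how much work each of its requests caused. A small sketch - the regular expression just matches the footer string quoted above, and the helper is my own illustration rather than anything FA provides:

import re

# Matches the "Page generated in ..." footer quoted above.
FOOTER_RE = re.compile(
    r"Page generated in ([\d.]+) seconds \[ ([\d.]+)% PHP, ([\d.]+)% SQL \] \((\d+) queries\)"
)

def server_cost(page_html):
    """Return (php_ms, sql_ms, query_count) parsed from the page footer, or None."""
    m = FOOTER_RE.search(page_html)
    if not m:
        return None
    total_s, php_pct, sql_pct, queries = map(float, m.groups())
    return total_s * 1000 * php_pct / 100, total_s * 1000 * sql_pct / 100, int(queries)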
"But wait," you cry, "way back in Part 1 you claimed that the page took between two and three seconds to load completely! And looking at my handy-dandy tool HTTPFox, I see that the request for the first page is completed in 250 milliseconds! Where is all that time going?" The answer is that it gets spent pulling down other resources - your browser fetches the page source, which tells it everything it needs to complete rendering the page, which for /msg/submissions is a CSS file, 38 image files, 2 javascript files, and then whatever else the page's scripting or the browser's extensions might pull in from third parties. Each image weighs in at 8 to 12 KB on average, and takes, in my experiment, between 1200 and 1800 ms to completely download. All of this, however, is wasted on our bot - recall that it only needs to load the submissions page at all in order to parse out the location of each submission page, and we only need the submission page because we can't figure out how to turn the submission ID directly into a URL for the full-resolution image.
So we already see that we have two different intertwined operations: getting the URLs of the files to download, and actually saving the files themselves to disk. A request for each kind of resource costs different amounts of different resources to fulfill: generating a page takes mostly CPU time, in running the page script and executing database queries, but doesn't generate a lot of traffic over the wire; serving up a static image is computationally very easy if the server is configured correctly, but easily incurs hundreds of KB of bandwidth.
So, then, let's model a normal user's load as follows: they load /msg/submissions; this incurs 30 ms of script time, 15 ms of DB time, and 300 KB of image downloads, and completes in 3 seconds. Then, being quick on the draw, they click a thumbnail, costing 6 ms of CPU time and 14 ms of DB time over 22 queries, plus 130 KB for the image and about 100 KB for UI images, completing in 3.2 s. They then immediately click the image to get its full-res version, costing virtually no CPU or DB time and another 350 KB of bandwidth, completed in about 1 second. Return to Submissions, repeat, apply bandwidth savings from browser cache. In total, a single user's normal load, worst case, is 36 ms of script time, 30 ms of DB time (over 35 queries), and 750 KB of image bandwidth, over 7.5 s.
Calculating our bot's equivalent load, then, is straightforward: dividing the 45 ms of server time per submissions-page request into the 65 ms of server time a normal user generates per 7.5 s gives one submissions-page request every 5.19 seconds. Similarly, loading the server with nothing but individual item pages at an equivalent rate works out to one request every 2.3 seconds. For bulk image downloads, normal use is almost exactly 100 KB/s. An equivalent rate of image requests naturally depends on the size of the images requested; a very brief survey of full-res images gives a guesstimated average file size of perhaps 200 KB.
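As a sanity check, here's that arithmetic as a few lines of Python; the constants are just the measurements from the model above, so the output is only as good as the model:

# Rough per-action costs measured above, in milliseconds of server time.
SUBMISSIONS_PAGE_MS = 45     # 30 ms PHP + 15 ms SQL
ITEM_PAGE_MS = 20            # 6 ms PHP + 14 ms SQL
USER_BUDGET_MS = 65          # server time a normal user consumes...
USER_CYCLE_S = 7.5           # ...over one browse cycle
USER_BANDWIDTH_KB = 750      # image bytes pulled in that same cycle

# One request type at a time, matched to the normal user's server-time budget.
print(USER_CYCLE_S / (USER_BUDGET_MS / SUBMISSIONS_PAGE_MS))  # ~5.2 s between submissions pages
print(USER_CYCLE_S / (USER_BUDGET_MS / ITEM_PAGE_MS))         # ~2.3 s between item pages
print(USER_BANDWIDTH_KB / USER_CYCLE_S)                       # ~100 KB/s of image bandwidth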
All of this is very, very bad modelling. It doesn't account for users blocking image downloads, or for aborting a page load/render, or for spending more than zero time examining an image before returning to the Submissions page, or for returning to the Submissions page before a single piece's page has finished loading, or for caching at the browser or ISP level, or any number of other alternative use cases, all of which would have to be considered in order to declare anything like an 'average load per concurrent user per second'. It does, however, serve handily as an order-of-magnitude estimate: a single user making 10 submissions-page requests per second is clearly abusing the service, and a single user making 1 request every 10 seconds is requesting pages more slowly than usual.
(It's completely valid to factor in that our bot isn't loading other parts of FA nearly as heavily as a normal user would, and to justify a different rate for each type of request on that basis. Scheduling the bot to run outside of peak use times is an additional factor. Personally, I wouldn't feel terribly guilty about requesting 1 page/s of either type, with a separate image-fetching script capped at 100 KB/s, at any time of the day.)
All of this, then, leads naturally to a pipelined architecture for our bot, with two stages - page fetcher and image fetcher. We already know where we start: at /msg/submissions. We use the freshly-fetched Submissions page to get the URL for the next Submissions page and look for the absence of a Next link to know when we're at the last page. Each Submissions page gives a number of individual page URLs, each of which yields an image URL. Each page source request takes about 250ms to complete, during which our computer is idle; parsing each page should take well under 100ms.
We can very cheaply generate the URL for the next submissions page: take the last image ID from the previous submissions page, append '@36' to it, concatenate that onto '/msg/submissions/', and request it; this means the input to our page fetcher won't easily run dry. We might use a queue to track which pages we've requested and which pages we should request, and when - /msg/submissions starts the chain, then the 36 links from that page are pushed onto the queue, and the next Submissions URL is pushed on after those. The image URLs found by the page fetcher are then pushed into a second queue, drained by a loop that fetches those URLs at a polite average bandwidth.
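For concreteness, here's roughly how that bookkeeping might look in Python. This is a sketch of the scheme described above, not finished code: parse_submission_links() and parse_full_res_url() are placeholders for whatever parsing you built in Part 1, and I'm assuming the former hands back (submission ID, item page URL) pairs.

from collections import deque

BASE = "https://www.furaffinity.net"

page_queue = deque([BASE + "/msg/submissions/"])  # page sources still to fetch
image_queue = deque()                             # full-res image URLs awaiting download

def handle_submissions_page(html):
    # parse_submission_links() stands in for the parsing built in Part 1;
    # assume it returns (submission_id, item_page_url) pairs in page order.
    entries = parse_submission_links(html)
    page_queue.extend(url for _id, url in entries)
    if entries:  # an empty page means there was no Next link: we've hit the end
        # The '@36' continuation trick described above.
        page_queue.append(BASE + "/msg/submissions/%d@36/" % entries[-1][0])

def handle_item_page(html):
    # parse_full_res_url() is likewise a stand-in from Part 1.
    image_queue.append(parse_full_res_url(html))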
Implementing a rate-limited page fetch can be done in Python with time.sleep() and the other utilities in that module: if the last request made by the thread was more than X milliseconds ago, save the current time, make the next page request, parse it, and sleep() until the next valid request time. One way of implementing a bandwidth-limited image fetcher without digging into network drivers is to fetch the next URL, divide its size by the allotted bandwidth to get the time that download 'deserves', then sleep() for the remainder of that period - so you get fast individual downloads while, averaged over time, not stressing FA's servers more than you intend.
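A minimal sketch of both throttles, assuming Python 3's standard urllib and the queues from the sketch above (entries are full URLs, and save() is whatever writes bytes to disk); the constants are the polite rates argued for earlier:

import time
import urllib.request

PAGE_INTERVAL_S = 1.0              # at most one page request per second
IMAGE_BANDWIDTH_BPS = 100 * 1024   # ~100 KB/s, averaged over time

def fetch(url):
    # Plain blocking fetch; any HTTP client would do here.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def drain_pages(page_queue, handle_page):
    # Rate limiter: one request per PAGE_INTERVAL_S, with parsing time
    # counted against the interval rather than added on top of it.
    while page_queue:
        started = time.monotonic()
        handle_page(fetch(page_queue.popleft()))
        time.sleep(max(0.0, PAGE_INTERVAL_S - (time.monotonic() - started)))

def drain_images(image_queue, save):
    # Bandwidth limiter: a download of N bytes is entitled to
    # N / IMAGE_BANDWIDTH_BPS seconds; sleep off whatever the download
    # itself didn't use, so the long-run average stays at the cap.
    while image_queue:
        started = time.monotonic()
        data = fetch(image_queue.popleft())
        save(data)
        time.sleep(max(0.0, len(data) / IMAGE_BANDWIDTH_BPS
                            - (time.monotonic() - started)))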
Miscellaneous notes:
The easiest, politest way of doing all this may simply be to email the admins and ask for a DVD with your submissions on it. If I were going to mirror the entirety of FA, that would definitely be cleanest, and I'm sure they'd appreciate being able to run such a hefty operation on their own terms.
If your download period is going to run for a long time, don't forget to account for new submissions coming in as you delve deeper into history. This might be implemented by having every 10th or 30th submissions request be for the root, and then only adding the new image pages to the page request queue.
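One way that periodic re-check might look, reusing fetch() and parse_submission_links() from the sketches above; already_queued is a set of submission IDs you'd want to keep around for de-duplication anyway:

def maybe_recheck_root(pages_fetched, page_queue, already_queued):
    # Every 30th page request, revisit the front of /msg/submissions and pick
    # up anything posted since we started paging backwards through history.
    RECHECK_EVERY = 30
    if pages_fetched % RECHECK_EVERY != 0:
        return
    html = fetch("https://www.furaffinity.net/msg/submissions/")
    for sub_id, url in parse_submission_links(html):
        if sub_id not in already_queued:
            already_queued.add(sub_id)
            page_queue.append(url)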
With the above scheme, you're going to accumulate image URLs faster than you use them. With enough URLs, this could exceed your machine's memory, or cause other unpleasantness. And for long-running operations, an interruption in any dependency shouldn't mean restarting the entire process.
What about fault tolerance? You don't want to have to refill the entire pipeline and retrace all your steps if your machine crashes, or your disk fills up, or you have to pause operations because FA went down. I hear that Redis databases are popular for handling large queues with persistence.
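For what it's worth, a persistent queue along those lines might look like this with the redis-py client - the key name is made up, and this assumes a Redis server running locally:

import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_image(url):
    # LPUSH/BRPOP gives a simple FIFO that survives our process dying, and
    # Redis itself can be configured to persist its data to disk.
    r.lpush("fa:image_urls", url)

def next_image(timeout_s=60):
    # Blocks up to timeout_s waiting for work; returns None if the queue stays empty.
    item = r.brpop("fa:image_urls", timeout=timeout_s)
    return item[1].decode() if item else None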
Building a searchable index based on image metadata - from the file itself, the file hash, the FA keywords and tags, the post date, number of commenters, sentiment analysis of comments, etc - might be cool. And you already have all that info at your fingertips since you've loaded the page; might be a good idea to stash a compressed copy of the page along with the image for later processing.
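Stashing the page source is cheap; a sketch using the standard library's gzip (the directory layout here is just one arbitrary choice):

import gzip
import os

def stash_page(submission_id, page_bytes, directory="pages"):
    # Keep a compressed copy of the submission page's source so its metadata
    # can be re-parsed later without asking FA for the page a second time.
    os.makedirs(directory, exist_ok=True)
    with gzip.open(os.path.join(directory, "%s.html.gz" % submission_id), "wb") as f:
        f.write(page_bytes)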