chan imgs for JULY rjm !dMYYvF5Blo!!xq4
hey kids, here's the month end img dump from the boards listed below.

notes: subchan.be added, and not4chan down for past (?) weeks T__T;

real(2gigs):
http://upload.deadfrog.us/%5Bhbatch.org%5D%5Bchanimgs_real%5D%5B2006.07%5D.zip.torrent
hentai(4gigs):
http://upload.deadfrog.us/%5Bhbatch.org%5D%5Bchanimgs_hentai%5D%5B2006.07%5D.zip.torrent

"http://may.not4chan.org/ss/imgboard.html";
"http://may.not4chan.org/l/imgboard.html";
"http://orz.4chan.org/d/imgboard.html";
"http://orz.4chan.org/e/imgboard.html";
"http://orz.4chan.org/h/imgboard.html";
"http://img.4chan.org/s/imgboard.html";
>> Anonymous
Thanks OP,
ran it through dup checks?
any personal filter criteria?
>> rjm !dMYYvF5Blo!!xq4
>>116036
nope, it grabs everything it sees on the thread listing, and its pages. in other words it won't get images embedded within a post because i haven't coded it to actually hit the "reply" button, and get all those embedded images. that's the only limitation.
>> Anonymous
The comment for the hentai-pack says "real images, not drawn images". That is not true right?
>> rjm !dMYYvF5Blo!!xq4
>>116213
where?
>> Gearbox80 !B/rXlv6YcU
Then you are missing stuff. I created a VBScript version in the middle of the month and have been testing it as I get chances. It goes to all the sub pages, and pulls all the thread links. Then it checks to see the last updated date of the thread to see if it needs to pull the main thread again.

It is smart enough to pull the name of the image that 4chan gave it along with the real name of the image from inside the parens. That means it can name it correctly. Also, it stores some other stuff in the table just because I felt like it and it was easy to parse out. Those things are the actual post text itself that the image was posted in (I want to use it at some point to create thread titles when the thread doesn't have one) and some other junk. As of this second there are a little over 48,000 images that were pulled.

It also sorts by the thread title if there is one, or by threadid so all the related images are grouped together in folders.

The reason I started the project is because you didn't include the gif or wallpaper boards, which rock.

I did it in VBScript because I know VBScript, and I can force it not to download images so it doesn't waste bandwidth. It only downloads the text of the page if it is updated, then stores the image data and queues it for later download. Also, since I check the main pages, I can run it in the middle of the night when the bandwidth usage for 4chan is lower. Don't want to mess up the browsing experience for anyone else. :)

I'm still working some bugs out of it, trying to get it scheduled and keep it from crashing. Once that is done I'll probably add a torrent or 2 to what you already have. I love downloading your monthly torrents because it saves 4chan the 4 gigs of bandwidth so someone else can use it :)
>> Anonymous
>>116340
WTF? You have 3.16 terabytes of free space!? How big is the drive?
>> Anonymous
>>116341

I don't see a "TB" anywhere on that screenshot. Check your vision.
>> Anonymous
>>116355
NO U

(see status bars)
>> Anonymous
>>116340
share script?
doubtful but worth a shot anyway
>> Anonymous
That's not really a lot of free space. A lot of us have many TB's now, what with the size increases of hdd's. I just spent 2 weeks of pay and bought some 750gb Seagates, that's almost 2tb right there.
>> Gearbox80 !B/rXlv6YcU
>>116341
It is actually a new 16 disk 250 GB raid5 set, and is almost entirely free (~3.5 TB). I filled my old one.

>>116399
Yep, disk space is cheaper than heck. I used 250 GB drives because they are only about $0.25 per gig while the 750 GB drives are ~$0.50 per gig.


>>116377
Sorry, but I do not plan on sharing it. If I did then 200+ peeps would be downloading the entirety of 4chan and sucking up all the bandwidth for the hundreds of thousands of other users.

However I have absolutely no problem with torrenting the images and the data, but my upload is slower than heck. I think I will be seeding the 'everything' torrent until 2007 to hit 100%.
>> Anonymous
>>116400

Enjoy your aids when 2 of those drives fail, and they will. Just like Zimmer, I GUARANTEE IT.
>> rjm !dMYYvF5Blo!!xq4
>>116340
i'm happy with my script so far.

are you saying my script consumes bandwidth? i grab the page source, and do url extraction with regex's, in other words no images are dl'd until i verify they aren't already in the dump folder.
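
A rough sketch of the url extraction step described above, in Python rather than the script's own language. The pattern is an assumption: it supposes full-size images are linked under a `/src/` path with a numeric file name, which this thread does not confirm, and `extract_image_urls` is a made-up name for illustration.

```python
import re

# Hypothetical pattern: links to full-size images under a /src/ path.
# Thumbnails (assumed to live elsewhere) are not matched.
IMG_URL_RE = re.compile(
    r'href="(http://[^"]+/src/\d+\.(?:jpg|jpeg|png|gif))"', re.IGNORECASE
)

def extract_image_urls(page_html):
    """Return unique full-size image URLs in the order they appear."""
    seen = set()
    urls = []
    for url in IMG_URL_RE.findall(page_html):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Only the page text is read here; nothing is downloaded at this stage.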

also i run every 2 hours because otherwise boards might get pushed through that i won't even see if i just run daily.

i guess i'll make a faq page on the script and how it works so people don't worry that i'm hammering.
>> Gearbox80 !B/rXlv6YcU
>>116491
Oh god no, I'm not saying it consumes bandwidth at all. (Well obviously it does just like mine, but in the long run it saves tons of bandwidth) It is an excellent opportunity for us to share the load of downloading (and sharing) instead of sucking it all from 4chan. I think your script rocks, and has rocked. I have downloaded the last 5 (I think 5 anyway) monthly batches. I was just not happy because you skipped 2 boards that I love, Gif and wallpaper, so I wrote my own to include those boards.

However I am saying that you are missing stuff based on your explanation. If you only run every 2 hours from the main pages (basically without going into the thread) then you only see the last few posts made to a thread. If there are 40 images posted in ~20 minutes, you will only pull 3 of them (the last 3) unless I misunderstand.

>>116408

yes, that will suck. It will happen sooner or later too; I agree with you. I am working on getting a very large tape drive, or another backup setup. I don't have the $ for that now though.
>> rjm !dMYYvF5Blo!!xq4
>>116492
yes you're correct, i will probably make it so that it goes into the reply, and grabs all new images. that would be cool if it increases the torrent.

also this prompted me to make a page, i'll probably include a little link in future releases so i don't have to explain this stuff all the time:
http://07261982.net/chanimgsfaq.html
>> Gearbox80 !B/rXlv6YcU
>>116493
Well here are some hints on how mine works. first, it is table based, so I can store data and come back to it.

There are 3 tables. A board table, a thread table, and a post table.

The board table is a list of all the url's of the main boards. for example, the 'torrent' section has 6 entries (for pages 0-5).

What the script does is go to each of those 6 pages and pull the html of the page, just like yours. (images are not downloaded.) On each page, it looks at all the threads and pulls the link from inside every hyperlink that says '[reply]' as the outer text. It then checks to see if that thread is in the next table, the threads table.

If the thread is in that table, then the script checks the time on the last post that is showing on the main page. If the newest post is > the current updated date, then it marks that thread as needing an update. Otherwise, it skips that thread. It does this for whatever boards I have turned on.

Once all the main pages are scraped, it starts pulling the text of threads that need to be updated. It does the same thing as the main board, it pulls only the html, not the images.

It goes through the text and looks for image links. Then it parses out the link to the image and compares it to the third table, the post table.
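
The three tables and the "does this thread need an update" check could be sketched roughly like this. This is not the actual VBScript: SQLite stands in for whatever database is really used, every table and column name is invented for illustration, and inserting a thread that is not in the table yet is an assumption about what happens in that case.

```python
import sqlite3

# Hypothetical schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS boards (
    url TEXT PRIMARY KEY            -- one row per index page of a board
);
CREATE TABLE IF NOT EXISTS threads (
    thread_url   TEXT PRIMARY KEY,  -- link taken from the [reply] hyperlink
    last_updated TEXT NOT NULL,     -- time of the newest post seen so far
    needs_update INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE IF NOT EXISTS posts (
    image_url  TEXT PRIMARY KEY,    -- link to the full-size image
    thread_url TEXT NOT NULL REFERENCES threads(thread_url),
    downloaded INTEGER NOT NULL DEFAULT 0
);
"""

def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def note_thread(conn, thread_url, newest_post_time):
    """Flag a thread for a re-pull if it is new or has a newer post.

    Times are ISO-8601 strings, so plain string comparison follows time order.
    Returns True when the thread was flagged.
    """
    row = conn.execute(
        "SELECT last_updated FROM threads WHERE thread_url = ?", (thread_url,)
    ).fetchone()
    if row is None:
        # Assumption: an unseen thread is stored and flagged right away.
        conn.execute(
            "INSERT INTO threads (thread_url, last_updated, needs_update) "
            "VALUES (?, ?, 1)",
            (thread_url, newest_post_time),
        )
        return True
    if newest_post_time > row[0]:
        conn.execute(
            "UPDATE threads SET last_updated = ?, needs_update = 1 "
            "WHERE thread_url = ?",
            (newest_post_time, thread_url),
        )
        return True
    return False
```

A second pass would then pull only the threads where `needs_update` is 1, compare each image link against `posts`, and clear the flag afterwards.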

>> Gearbox80 !B/rXlv6YcU
that way, I only need to run it once a day, as all boards besides /b/ will fit > 1 day of rotation even for threads with no replies.

The last thing I want to do is figure out a way to see if the image has already been downloaded before downloading it. This would require a checksum match and lookup to drop it and save bandwidth. Now the cool thing is that 4chan has an md5 tag on the image. The sucky thing is I haven't yet figured out if that md5 sum is for the thumbnail or the final image. If for the final image, then I can save even more bandwidth by never even downloading a duplicate, and being able to determine dupes months after they were originally posted. I just have to find the time to write a program (or find one) that can create md5 checksums so I can compare them to the ones that are already listed in the HTML.
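
For the checksum part, a minimal sketch in Python using the standard `hashlib` module. How the md5 value in the HTML is encoded is not stated in this thread, so the function returns both the hex and the base64 form as a guess at the likely candidates; `md5_of_file` is a name made up for this example.

```python
import base64
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return (hex_digest, base64_digest) of a file's MD5, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large images never sit in memory whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest(), base64.b64encode(h.digest()).decode("ascii")
```

Running this on one already-downloaded full image and on its thumbnail, then comparing both results with the value in the page, would answer the thumbnail-or-final-image question.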
>> rjm !dMYYvF5Blo!!xq4
>>116498
hmm, perhaps you chose your method due to limitations of vbscript, though i don't know vbscript, (though i do know vba quite well).
i say this because it can be done in a far simpler manner using php using 2 steps:
*1 url extraction via regex's, and
*2 check if file is already in dump directory, don't download it if it is.

with this, the need for a database and for an efficient way to check whether an image needs to be downloaded is taken care of by the OS, greatly reducing moving parts and increasing dependability.
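
Step 2 above, sketched in Python instead of php: the dump directory itself acts as the record of what has been fetched. The function name and the flat single-folder layout are assumptions for illustration, not the real script.

```python
import os
from urllib.parse import urlsplit

def urls_to_download(urls, dump_dir):
    """Keep only URLs whose file name is not already present in dump_dir."""
    pending = []
    for url in urls:
        name = os.path.basename(urlsplit(url).path)
        # The filesystem lookup replaces a database query.
        if name and not os.path.exists(os.path.join(dump_dir, name)):
            pending.append(url)
    return pending
```

The trade-off is that this only catches a repeat of the same file name; the same picture reposted under a new name would be fetched again.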

the only reason i could see that as a problem is if vbscript has some inability to do basic file functions, though i doubt that is the case.
>> Anonymous
Should I be worried if the only working folder in the hentai archive is /d?
>> rjm !dMYYvF5Blo!!xq4
>>116628
i can only guess that you haven't downloaded the whole thing yet, or haven't used 7zip to unzip it.. though winzip should work too. (it is a standard zip)
>> Gearbox80 !B/rXlv6YcU
>>116510
True, it can be easier. But I specialize in databases, so I do most of my stuff that way because it is simpler for me.

Also, I did say I pull and match more information. I pull the post contents so I can create a title (basically put it into a category) or the title itself. I also pull the correct file name and rename the file as they are downloaded. Both of these would be difficult using plain PHP unless I created a temporary storage location for the file information (AKA a flat file database)
>> rjm !dMYYvF5Blo!!xq4
i see, but am i missing something plainly obvious? all i see is the 10 digit numerical filenames for the images, where do you get the original name?
>> lumpy
What brand of drives do you use Gearbox? I personally am tired of WD drives dying on me because of heat issues. Lost a bunch of stuff AGAIN, including the hard to find viper ova uncensored set.
Not to mention 50 gigs of other things.
Any suggestions on drive fixing? Problem I'm having is the damn thing isn't read correctly by the bios.
>> Anonymous
I see that you have yet again failed at fulfilling many people's previous requests to make this one uTorrent friendly.

I'm yet again pissed that I can't deselect the things I don't want.
20 Internets have been taken from you.
>> Mediacl
Gearbox

you never did respond back with your results. The results of the scan comparing 'everything' to what you originally collected. How did that go?
>> Gearbox80 !B/rXlv6YcU
>>116841
Hit reply and you will see the original file name right next to the resolution. The one that you posted in the first post in this thread is '4chanup.png'
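
Pulling the original name out of that line could look roughly like the sketch below. The exact text layout is an assumption (stored name, then size, resolution and original name in parentheses), and the sample file number in the usage line is made up; only '4chanup.png' comes from this thread.

```python
import re

# Assumed layout: "File : <stored name>-(<size>, <W>x<H>, <original name>)"
FILE_LINE_RE = re.compile(
    r"File\s*:\s*(?P<stored>\S+?)-\((?P<size>[^,]+),\s*"
    r"(?P<w>\d+)x(?P<h>\d+),\s*(?P<original>[^)]+)\)"
)

def parse_file_line(text):
    """Return (stored_name, original_name), or None if no original name is shown."""
    m = FILE_LINE_RE.search(text)
    if m is None:
        return None
    return m.group("stored"), m.group("original").strip()
```

For example, `parse_file_line("File : 1154380800123.png-(45 KB, 800x600, 4chanup.png)")` gives the stored name paired with `4chanup.png`, while a line without the third field gives `None`.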

>>116855
Mostly Seagate, but I have some WD and Maxtors. I prefer Seagate because I have never had one crap out on me. Had a few Maxtor and WD drives fail, but never a Seagate one. Probably just luck.

>>116915
Still downloading; at 83.3% complete. Leechers are taking all the bandwidth, I am only getting 30k down with spikes to 50. ETA: 6 days 5 hours.
>> Mediacl
That is just not right! I'll do what I can to speed your download along. 83.3% and in the US, is that correct? what are the four last numbers in your ip address?
>> Anonymous
>>116399
>> Anonymous
>>116399
It sucks not being 'most people'
>> Gearbox80 !B/rXlv6YcU
>>116997
Don't worry about it, I can wait :) I wouldn't have time to do a match until sometime next week anyway. I need to write a program that will recursively unzip files so I can compare them anyway, and I haven't even started on that. No way in hell I am going to sit there and unzip >10,000 files manually.
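
A recursive unzip like that could be sketched in a few lines of Python with the standard `zipfile` module. The function name and the "one folder per archive, named after the archive" layout are choices made for this example, not anything decided in the thread.

```python
import os
import zipfile

def unzip_tree(root):
    """Extract every .zip under root into a folder named after the archive.

    Archives that turn up inside extracted folders are extracted too.
    Returns the number of archives extracted.
    """
    done = set()
    count = 0
    found = True
    while found:  # repeat until a full pass finds no new archive
        found = False
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if not name.lower().endswith(".zip") or path in done:
                    continue
                target = os.path.splitext(path)[0]
                with zipfile.ZipFile(path) as zf:
                    zf.extractall(target)
                done.add(path)
                count += 1
                found = True
    return count
```

The original archives are left in place, so the disk needs room for both the zips and the extracted files.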

>>117002
Yay linux :) I wish I knew linux better. Windows Server licences are megabucks.
>> rjm !dMYYvF5Blo!!xq4
>>116986
i see! tnx for showing me that.

>>116856
i have said this before, and i will add it to my faq page, but i will not release as individual files, the seeding would bomb in less than a few days. if you don't want it don't dl it. but seperating into 'real' and 'hentai' is the best compromise, that way the torrent is large enough that it's just under dvd size, and also allows users to choose just hentai, or get the other alternative content which includes real images, and some off boards, AND it helps keep seeders cause we don't have fags like you come and grab 150 mb, then run off after you've got what you want without seeding.

in short i'll be keeping it the way it is.
>> rjm !dMYYvF5Blo!!xq4
>>116856
AND IT IS utorrent friendly dumbfuck what do you think i'm fucking seeding it with
>> Anonymous
>>117008
why make it so complicated? just copy all the archives to one folder, then select all, then unzip into their own directory. 5 minutes of cut and pasting followed by 15 hours of unzipping. boom, you're done.
>> Anonymous
>>117024
HAY DUMBFUCK I MEAN DUMBFUCK THAT DUMBFUCK I HAVE TO PUT UP WITH GIGS OF DICKGIRLS JUST SO I CAN GET THE HENTAI DUMBFUCK DUMBFUCK
>> Anonymous
>>117023
Mmmkay then.

Well, in that case, you don't sound like you want that many seeders in the first place. I'm very certain that this would attract many more people if they had the choice of selecting the various genres, instead of being forced to waste quota on stuff they don't actually want/find disgusting.
Not everyone has a wide taste in genre like you might...

Good job on doing this though. Unfortunately you won't have my support yet, but I'll happily wait until you actually try and torrent it in the way that allows people to select and deselect boards and then see if you lose heaps of seeders by doing so.
>> rjm !dMYYvF5Blo!!xq4
>>117096
alright, fine.. you got past my hatred.
next release will be all boards zipped individually, in a single large torrent, no more real/hentai. i'll put it on the faq that people should use utorrent to dl what they want.

>.> sigh..
>> Anonymous
>>117109
Excellent, I'll be looking forward to it =)