>> |
Gearbox80
!B/rXlv6YcU
>>116493 Well, here are some hints on how mine works. First, it is table-based, so I can store data and come back to it.
There are 3 tables: a board table, a thread table, and a post table.
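For illustration, the three tables could look something like this in SQLite. This is only a sketch: the post does not say which database is used, and every column name here (`enabled`, `last_updated`, `needs_update`, and so on) is an assumption, not the actual layout.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS boards (
    url     TEXT PRIMARY KEY,              -- one row per board page URL
    name    TEXT NOT NULL,                 -- e.g. 'torrent'
    enabled INTEGER NOT NULL DEFAULT 1     -- whether this board is turned on
);
CREATE TABLE IF NOT EXISTS threads (
    url          TEXT PRIMARY KEY,         -- link taken from the '[reply]' hyperlink
    board_url    TEXT NOT NULL REFERENCES boards(url),
    last_updated TEXT,                     -- time of the newest post already stored
    needs_update INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE IF NOT EXISTS posts (
    image_url  TEXT PRIMARY KEY,           -- image link found in the thread text
    thread_url TEXT NOT NULL REFERENCES threads(url)
);
"""

def open_db(path=":memory:"):
    """Open (or create) the database and make sure the three tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping the state in tables like this is what lets the script stop and pick up again later without rescraping everything.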
The board table is a list of all the URLs of the main boards. For example, the 'torrent' section has 6 entries (for pages 0-5).
What the script does is go to each of those 6 pages and pull the HTML of the page, just like yours. (Images are not downloaded.) On each page, it looks at all the threads and pulls the link from every hyperlink whose text is '[reply]'. It then checks to see if that thread is in the next table, the threads table.
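A rough sketch of the '[reply]' link step, using only Python's standard `html.parser`. The class and function names are made up for this example, and the match is case-insensitive as a guess, since the exact markup of the board pages is not shown here.

```python
from html.parser import HTMLParser

class ReplyLinkParser(HTMLParser):
    """Collects the href of every <a> whose visible text is '[reply]'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: ignore case and surrounding whitespace in the link text.
            if "".join(self._text).strip().lower() == "[reply]":
                self.links.append(self._href)
            self._href = None

def reply_links(page_html):
    """Return the thread links found on one board page."""
    parser = ReplyLinkParser()
    parser.feed(page_html)
    parser.close()
    return parser.links
```

Each link returned would then be looked up in the threads table.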
If the thread is in that table, then the script checks the time on the last post that is showing on the main page. If the newest post is newer than the thread's current updated date, it marks that thread as needing an update. Otherwise, it skips that thread. It does this for whichever boards I have turned on.
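The update check comes down to one timestamp comparison. Here is a minimal sketch; the function name and the in-memory dict are stand-ins for the threads table, and treating a thread that is not in the table yet as needing an update is an assumption, since that case is not spelled out.

```python
from datetime import datetime

def needs_update(known_threads, thread_url, newest_post_time):
    """Decide whether a thread should be pulled again.

    known_threads maps thread URL -> stored 'updated' datetime
    (a stand-in for the threads table).
    """
    last_updated = known_threads.get(thread_url)
    if last_updated is None:
        # Assumption: a thread we have never stored needs a first pull.
        return True
    # Only pull again if the main page shows a post newer than what we have.
    return newest_post_time > last_updated

# Example usage with made-up data.
known = {"/torrent/res/100.html": datetime(2010, 5, 1, 12, 0)}
stale = needs_update(known, "/torrent/res/100.html", datetime(2010, 5, 2, 9, 30))
fresh = needs_update(known, "/torrent/res/100.html", datetime(2010, 5, 1, 12, 0))
```

In this example `stale` is true and `fresh` is false, so only threads with new activity get fetched.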
Once all the main pages are scraped, it starts pulling the text of threads that need to be updated. As with the main board pages, it pulls only the HTML, not the images.
It goes through the text and looks for image links. Then it parses out the link to the image and compares it to the third table, the post table.
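The image-link step might be sketched like this. The regex, the list of file extensions, and the function name are all assumptions, and `known` is just a set standing in for the post table. Returning only the links that are not already stored is a guess at what the comparison is for.

```python
import re

# Assumption: image links are hrefs ending in a common image extension.
IMAGE_LINK = re.compile(r'href="([^"]+\.(?:jpg|jpeg|png|gif))"', re.IGNORECASE)

def new_image_links(thread_html, known):
    """Return image links in the thread text that are not in `known` yet."""
    seen = set(known)
    fresh = []
    for url in IMAGE_LINK.findall(thread_html):
        if url not in seen:
            seen.add(url)      # also skips repeats within the same thread
            fresh.append(url)
    return fresh
```

Only the link text is handled here; nothing is downloaded at this step.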
|