The New Wonderland Archive

Discuss the games (no level solutions or off-topic, please).

Moderators: ~xpr'd~, tyteen4a03, Stinky, Emerald141, Qloof234, jdl

tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am

Re: Tyteen's spider bot progress

Post by tyteen4a03 » Tue Jul 31, 2012 10:44 pm

VirtLands wrote:congrats. you're such a brain with that code, (which I don't even know what is displayed there),
it's some kind of Python i guess?

I can see bits and pieces of webpages being slowly
digested by that spider thing you created.

(The only programming I know is with Blitz. %@?_)

I suppose in the future, you can ultimately make some (offline) search
program for us, where when you input an author or level name, that it
returns all kinds of stuff, I suppose.

I'm optimistic. I think you'll finish this in about a week.
Keep up the good work.

tyteen4a03 wrote:The spider is working...
The big bit displayed was the topics, with the format "topicid": (topicname, topicStarterUID). It will be passed on to other parts to crawl the posts inside each topic.

The offline search part is actually going to be in the new website - I guess you can call that offline search :lol: (but the most accurate term is off-site search)
and the duck went moo

Beep bloop
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

Re: Tyteen's spider bot progress

Post by VirtLands » Tue Jul 31, 2012 11:02 pm

Maybe you can make an offline version too, one that you can give us to download...
Last edited by VirtLands on Wed Aug 08, 2012 7:51 pm, edited 2 times in total.

Re: Tyteen's spider bot progress

Post by tyteen4a03 » Tue Jul 31, 2012 11:17 pm

VirtLands wrote:
tyteen4a03 wrote:The offline search part is actually going to be in the new website - I guess you can call that offline search :lol: (but the most accurate term is off-site search)
Maybe you can make an offline version too, one that you can give us to
download; its info will be finite (dated up to a point).
All offline programs work faster than online.

Now that you've got the bot code off to a good start, I was wondering...
How will you prevent duplication of search efforts,
..what I mean is, how will you keep from searching the same links over and over?

For example, Link A links to several (perhaps 10 others), and those links will
eventually link back to Link A, in some strange round loop.

(Well, I'm going to lala land, will be back in a while...)
It won't, because the database backend for this bot has a unique key feature - it prevents the same post/topic/profile from being archived over and over again. The bot also does not follow links - it simply grabs the information it needs, then moves on.
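
The unique-key idea can be sketched as follows. This is a minimal illustration only: the thread does not say which database the bot uses, so SQLite, the `posts` table, and the `archive_post` helper are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema: the real bot's tables are not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, topic_id INTEGER, body TEXT)"
)

def archive_post(post_id, topic_id, body):
    """Store a post once; return True if it was new, False if already archived."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO posts (post_id, topic_id, body) VALUES (?, ?, ?)",
        (post_id, topic_id, body),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and the row was ignored
    return cur.rowcount == 1

first = archive_post(101, 7, "hello")
second = archive_post(101, 7, "hello again")  # same key, silently skipped
total = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Because the key is enforced by the database itself, the spider does not need its own "have I seen this?" bookkeeping for archived rows.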

An offline version will depend on whether I can spare time to work on a GUI application (which has been a bit of a failure for me). Development of this will also depend on whether MS wants to use this website to handle future level submissions - if that's the case, another application wouldn't be necessary. If the community really does want this application, work on it will start after the website's development finishes.

Post by tyteen4a03 » Fri Aug 03, 2012 6:54 pm

Another status update...

I have moved on to grabbing posts from the forum, which required some special configuration... anybody who can guess what I did gets a (virtual) cookie. :wink:
Attachment: guessnum.png (15.91 KiB)
StinkerSquad01
Rainbow AllStar
Posts: 4251
Joined: Mon Aug 09, 2010 3:39 am

Post by StinkerSquad01 » Fri Aug 03, 2012 7:08 pm

Well, you changed the date to the post number(?).

Post by tyteen4a03 » Fri Aug 03, 2012 7:25 pm

StinkerSquad01 wrote:Well, you changed the date to the post number(?).
It's not the post number...

(hint: Compare my picture to what you see on the main page)

Unix Time Stamps

Post by VirtLands » Fri Aug 03, 2012 7:33 pm

Image
tyteen4a03 wrote:They are Unix timestamps. They represent the number of seconds since the Unix epoch (1st January 1970 00:00:00 GMT) and are a very convenient time format, since you can convert them into any displayable time format.
Image Image Image
Thanx.
Last edited by VirtLands on Thu Aug 16, 2012 9:02 pm, edited 6 times in total.

Post by StinkerSquad01 » Fri Aug 03, 2012 7:51 pm

I was thinking that as well..

Post by tyteen4a03 » Sat Aug 04, 2012 8:23 am

VirtLands wrote:
Image

My wildest guess is that you've conveniently converted the date-time into
a special numeric format for easier sorting.

Does anyone see a pattern here? (I can't).

Fri Aug 03, 2012 11:33 am -- 1343947903 -- VirtLands
Fri Jul 13, 2012 8:17 am -- 1342196228
Sat Jul 14, 2012 5:15 pm -- 1342314949
Fri Aug 03, 2012 6:51 am -- 1344005460
Thu Aug 02, 2012 2:30 pm -- 1343946627
Thu Jul 19, 2012 5:25 pm -- 1342747557 -- hex:5008B3A5

I'll get back to you on this.
They are Unix timestamps. They represent the number of seconds since the Unix epoch (1st January 1970 00:00:00 GMT) and are a very convenient time format, since you can convert them into any displayable time format.

(and yes, they can also be used for sorting)
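
As a small illustration of both points (conversion and sorting), here is a sketch using Python's standard `datetime` module on the values quoted above. The helper name `to_utc` is made up for the example, and the output is in UTC, so it will differ from whatever time zone the forum displays.

```python
from datetime import datetime, timezone

# Timestamps quoted earlier in the thread.
stamps = [1343947903, 1342196228, 1342314949, 1344005460, 1343946627, 1342747557]

def to_utc(ts):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# Plain integers sort chronologically with no date parsing at all.
oldest_first = sorted(stamps)

# Any display format can be produced from the same number.
iso_form = to_utc(1342747557).isoformat()
as_hex = format(1342747557, "X")  # the hex form mentioned in the quoted list
```

Here `iso_form` is `2012-07-20T01:25:57+00:00` and `as_hex` is `5008B3A5`, matching the hex value in the quoted list.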

Unix Time Format

Post by VirtLands » Sun Aug 05, 2012 4:46 am

Aha. I was so close.

Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:04 pm, edited 2 times in total.

Re: Unix Time Format

Post by tyteen4a03 » Sun Aug 05, 2012 4:57 am

VirtLands wrote:Aha. I was so close.

( I was hunting for date-time formats, and it never occurred to me that it's Unix. )

Out of curiosity, I may compare the Unix to other formats to see 'complexity' vs 'convenience'.

Let us know of your eventual progress.
Here's some coffee to keep you going.
Thanks! I have had insomnia lately, so I can't focus on anything I do.

Here are some code snippets to show you the current progress. They might not be the tidiest, but they at least work.

Following this snippet requires knowledge of Python OOP and of the Scrapy and Beautiful Soup libraries.

Code:

    def parseTopics(self, response):
        soup = BeautifulSoup(response.body)
        # Find topic information
        topics = []
        for link, profileBit in zip(soup.find_all("a", attrs={"class": "topictitle"}),
            soup.find_all("span", attrs={"class": "name"})):
            if link["href"].split("=")[1] in self.topics_to_ignore: # Make sure Announcements and Stickies are not scanned twice while making sure they do get scanned at least once
                continue
            aTopic = Topic()
            aTopic["forumID"] = response.meta["forumid"]
            aTopic["topicID"] = link["href"].split("=")[1]
            aTopic["topicName"] = link.string
            aTopic["posterID"] = profileBit.a["href"].split("=")[2]
            topics.append(aTopic)
            if link.previous_sibling is not None and link.previous_sibling.string in ["Announcement:", "Sticky:"]: # previous_sibling is a property, not a method. We've scanned this topic before, let's skip it in the future
                self.topics_to_ignore.append(link["href"].split("=")[1])
        # Figure out if there's tomorrow
        hasMultiplePages = soup.find_all("td", align="right", valign="bottom", nowrap="nowrap")
        if hasMultiplePages:
            hasNextPage = hasMultiplePages[0].find_all("a", text="Next")
            if hasNextPage:
                yield Request((self.root_domain + "/" + hasNextPage[0]["href"]), callback=self.parseTopics, meta={"forumid": response.meta["forumid"]})
        for t in topics:
            if t["topicID"] not in ignoreList:
                yield Request((self.root_domain + "/" + "viewtopic.php?t=" + t["topicID"]), callback=self.parsePosts, meta={"topicID": t["topicID"]})
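
The `split("=")` calls in the snippet above depend on the order of the URL parameters. As a hedged alternative, here is a sketch using the standard library's URL parsing (Python 3 module names; in Python 2 the same functions live in the `urlparse` module). The helper `query_param` and the example URLs are invented for illustration and are not part of the spider.

```python
from urllib.parse import urlparse, parse_qs

def query_param(href, name):
    """Return the first value of a query parameter, or None if it is absent."""
    values = parse_qs(urlparse(href).query).get(name)
    return values[0] if values else None

# Hypothetical links in the style the forum produces.
topic_id = query_param("viewtopic.php?t=12345&sid=abc123", "t")
user_id = query_param("profile.php?mode=viewprofile&u=42", "u")
missing = query_param("viewtopic.php?t=12345", "u")

# For comparison: a positional split picks up the next parameter's name too.
naive = "viewtopic.php?t=12345&sid=abc123".split("=")[1]
```

With a trailing `sid` parameter the positional split yields `12345&sid`, while the parsed lookup still returns `12345`.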

Code is complex

Post by VirtLands » Sun Aug 05, 2012 5:35 am

Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 8:30 pm, edited 5 times in total.

Re: Code is complex

Post by tyteen4a03 » Sat Aug 11, 2012 6:30 pm

VirtLands wrote:Hmmm, this post hasn't been updated in a while. Could be it's turning into a cobweb site.
Yes, I haven't had time to work on it for a while.

Here's a very important bit of the spider - post scraping. This 70-line snippet scrapes post information and attachments, and initiates the scraping of post content (I will explain later why that is a separate process), the user profile, and (of course) the next page.

It's 2:30 AM here now, I need to go to sleep.

Code:

    def parsePosts(self, response):
        """
        Parse post content.
        """
        soup = BeautifulSoup(response.body)
        def aName(tag):
            # Attribute values are strings, so check for digits rather than int
            return tag.name == "a" and tag.get("name", "").isdigit()
        def aHref(tag):
            return tag.name == "a" and tag.get("href", "").startswith("profile.php?mode=viewprofile&u=")
        def spanClass(tag):
            # Beautiful Soup 4 returns the class attribute as a list
            return tag.name == "span" and "postdetails" in tag.get("class", []) and (tag.string or "").startswith("Posted: ")
        def spanClassPostBody(tag):
            return tag.name == "span" and \
                   "postbody" in tag.get("class", []) and not \
                   (tag.string or "").startswith("<br />_________________<br />")
        def attachURL(tag):
            return tag.name == "a" and tag.get("href", "").startswith("download.php?id=")

        def determineContentFetchMode(postid):
            if soup.find("a", href="posting.php?mode=editpost&p=" + postid):
                return "edit"
            elif soup.find("a", href="posting.php?mode=quote&p=" + postid):
                return "quote"
            else:
                return "raw"

        # Find posts information
        posts = []
        attachments = []
        for (pid, userid, username, posttime, content) in zip(
            soup.find_all(aName), # Post ID anchors
            soup.find_all(aHref), # User profile links
            soup.find_all("span", attrs={"class": "name"}), # Username spans
            soup.find_all(spanClass), # "Posted: <timestamp>" spans
            soup.find_all(spanClassPostBody), # Post body
        ):
            postID = pid["name"]
            aPost = Post()
            aPost["postID"] = postID
            aPost["topicID"] = response.meta["topicID"]
            # str.strip() removes a set of characters, not a prefix, so split as parseTopics does
            aPost["posterID"] = userid["href"].split("=")[2]
            aPost["postTime"] = posttime.string[len("Posted: "):][:10] # Timestamps are always 10 digits
            attachTable = content.find_next_sibling("table", attrs={"class": "attachtable"})
            # Attachment?
            if attachTable:
                anAttachment = Attachment()
                anAttachment["postID"] = postID
                anAttachment["originalFilename"] = attachTable.find("span", attrs={"class": "gen"}) # The original name
                anAttachment["displayFilename"] = attachTable.find(attachURL)
                attachments.append(anAttachment)
            # Initiate Post content scraping
            mode = determineContentFetchMode(postID)
            if mode != "raw":
                yield Request((self.root_domain + "/" + "posting.php?mode=" +
                               ("editpost" if mode == "edit" else "quote") +
                               "&p=" + postID),
                    callback=self.parsePostContent,
                    meta={"postID": postID, "content": None, "mode": mode})
            # Initiate User scraping
            if username.b.string not in self.users_scanned:
                yield Request((self.root_domain + "/" + userid["href"]),
                    callback=self.parseUser)
            posts.append(aPost)
        # Scrapy expects individual items, not a list
        for item in posts + attachments:
            yield item
        # Figure out if there's tomorrow
        hasMultiplePages = soup.find("td", align="left", valign="bottom", colspan=2)
        if hasMultiplePages:
            hasNextPage = hasMultiplePages.find("a", text="Next")
            if hasNextPage:
                # The next page of a topic holds more posts, so it comes back to this method
                yield Request((self.root_domain + "/" + hasNextPage["href"]),
                callback=self.parsePosts,
                meta={"topicID": response.meta["topicID"]})

crawlers and stuff

Post by VirtLands » Sat Aug 11, 2012 9:30 pm

Good work: :)

Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:07 pm, edited 3 times in total.

Teleport Pro

Post by VirtLands » Sat Aug 11, 2012 11:26 pm

Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 8:53 pm, edited 2 times in total.

Post by tyteen4a03 » Sun Aug 12, 2012 3:55 am

I don't want to work with regex (they are a pain in the butt), and all those customizations just hurt my brain.

And the code I'm writing is open-source (well, will be soon, haven't got time to upload it to GitHub yet)

For now, the best help would be to grab me coffee. :P

Zoolz

Post by VirtLands » Mon Aug 13, 2012 2:55 am

I had an idea that if you ...

finish this project and wish to share your hard earned data with us
then you can upload it to the following.

Image Create an account on Zoolz: http://goo.gl/6D4uT
or
Image SkyDrive: https://skydrive.live.com/
Last edited by VirtLands on Mon Aug 13, 2012 7:32 pm, edited 2 times in total.

Zoolz link for downloaded http://www.pcpuzzle.com/forum/

Post by VirtLands » Mon Aug 13, 2012 4:51 am

Image Image Image

Uploaded zipped form 240 MB (incomplete, but substantial.)
Zoolz link: http://zlz.me/yq5i9
Last edited by VirtLands on Thu Aug 16, 2012 8:56 pm, edited 2 times in total.
jdl
Rainbow SuperStar
Posts: 2894
Joined: Fri Jun 06, 2008 8:37 pm
Location: West Virginia, USA

Re: Zoolz link for downloaded http://www.pcpuzzle.com/forum/

Post by jdl » Mon Aug 13, 2012 12:24 pm

VirtLands wrote:Surprisingly it only amounts to 240 MB, yet it states "complete".
{ 8878 files in 1058 folders }, maybe it's just a very, very compact format.
I think I found out why this is. I just tested out your download, and for example, for the old off-topic it says "Goto page 1, 2, 3 ... 41, 42, 43". Pages 4-40 have not been archived. It appears that all the "in-between-pages" for all the subforums are not downloaded at all.
ImageImage
TheCracksOverhead#9565 | Oops, uh oh.

Post by tyteen4a03 » Mon Aug 13, 2012 1:11 pm

search.php does not work because it's a dynamic page.

I also have my own storage space - I don't use cloud file services.

I also want to clarify why making my own spider is better than archiving all the pages and then mining data out of them - it saves disk space. Because pages are scanned and mined on-the-fly, almost no disk space is needed to store unnecessary HTML files (which are a lot of overhead).
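
The on-the-fly approach can be sketched roughly like this. It is an illustration under assumptions, not the spider's code: it uses the standard library's `html.parser` instead of Scrapy and Beautiful Soup so that it runs on its own, and `TitleGrabber`, `mine`, and the sample page are invented for the example.

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collects the text inside <a class="topictitle"> links."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and ("class", "topictitle") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.titles.append(data)

def mine(pages):
    """Yield extracted records page by page; the raw HTML is never written to disk."""
    for html in pages:
        grabber = TitleGrabber()
        grabber.feed(html)
        for title in grabber.titles:
            yield {"topicName": title}

# Hypothetical page fragment standing in for a downloaded forum page.
pages = ['<a class="topictitle" href="viewtopic.php?t=1">Hello</a><p>lots of markup</p>']
records = list(mine(pages))
```

Only the small dictionaries survive each iteration; the page text itself can be discarded as soon as it has been parsed.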

offline and so fine

Post by VirtLands » Mon Aug 13, 2012 7:29 pm

I forgot to mention (c):

(a) It gets stuck when it wanders onto http://www.midnightsynergy.com/
(b) The search function ( http://www.pcpuzzle.com/forum/search.php ) doesn't work.
(c) Contains no attachments, and therefore no levels & customMedia
______________________________________________________

Thanks to JDL for his studying of the download.
I thought there was something fishy about it only being 240 MB.
Last edited by VirtLands on Thu Aug 16, 2012 4:31 am, edited 1 time in total.

Post by tyteen4a03 » Mon Aug 13, 2012 11:36 pm

Yes, I blacklisted the login page.

The spider mines the topic lists, post lists, user profiles and attachments of specific forums.
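
A blacklist plus a list of forums to mine could look something like the sketch below. Everything in it is assumed for illustration: the names `BLACKLISTED_PAGES`, `FORUMS_TO_MINE` and `should_visit`, the forum IDs, and the `viewforum.php?f=` URL form are not taken from the spider.

```python
from urllib.parse import urlparse, parse_qs

BLACKLISTED_PAGES = {"login.php"}  # pages the spider must never request
FORUMS_TO_MINE = {"3", "7"}        # hypothetical forum IDs

def should_visit(url):
    """Decide whether a URL is worth requesting at all."""
    parts = urlparse(url)
    page = parts.path.rsplit("/", 1)[-1]
    if page in BLACKLISTED_PAGES:
        return False
    if page == "viewforum.php":
        # Only forums on the list are mined
        forum_id = parse_qs(parts.query).get("f", [None])[0]
        return forum_id in FORUMS_TO_MINE
    return True

login_ok = should_visit("http://www.pcpuzzle.com/forum/login.php")
wanted = should_visit("http://www.pcpuzzle.com/forum/viewforum.php?f=3")
unwanted = should_visit("http://www.pcpuzzle.com/forum/viewforum.php?f=99")
```

Filtering before the request is made also keeps the spider from logging itself out by visiting the login page.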

Offline Explorer Enterprise

Post by VirtLands » Thu Aug 16, 2012 2:20 am

I see. Image Image Image
-----------------------------------------------------------------------
Last edited by VirtLands on Thu Aug 16, 2012 9:30 pm, edited 3 times in total.

Re: Offline Explorer Enterprise

Post by tyteen4a03 » Thu Aug 16, 2012 3:02 am

VirtLands wrote:I see.

Well, I tried the demo of Offline Explorer Enterprise, and tried so
many ways to set up "URL omissions" so that it won't log me out
of
http://www.pcpuzzle.com/forum/

But, I could never get it to download attachments,
can only get it to download the regular html stuff, (+images).

I basically gave up on the attachments option.

Looks like we'll have to get someone to download all
the attachments for us. Any volunteers? :)
-----------------------------------------------------------------------
So, how much progress have you made with your data mining ?
You need to log in to download attachments.

Today's the last day that I'm not free. Work will start tomorrow.

no login for Offline Explorer

Post by VirtLands » Thu Aug 16, 2012 8:58 pm

Thanks. Though my efforts to log in did not result in an "apparent log-in",
I've resigned myself to the fact that I shall never download attachments. Some day you'll send us the life-line, I'll gladly wait.


Image Image Image
[ Don't worry about the sharks,
they've been around doing some house-cleaning, eating up unnecessary posts. ]

Image
[ cover art to Bobb Trimble's 1983 recording with The Crippled Dog Band, released July 26th 2011 on Yoga Records ]

Web Boomerang

Post by VirtLands » Fri Aug 17, 2012 5:53 pm

Image Image Image
[ Time to start worrying about the sharks, they're coming after you.. ]


I changed my mind about Web Boomerang. It's awful.
Last edited by VirtLands on Fri Aug 17, 2012 7:45 pm, edited 3 times in total.

Post by tyteen4a03 » Fri Aug 17, 2012 6:25 pm

That site still exists? o.o

Anyways, work has resumed and I expect another update very soon. Hopefully I will be able to put the spider to work for the first time.

SID

Post by VirtLands » Fri Aug 17, 2012 7:40 pm

I'm temporarily back on the 'rack. (HTTrack)

I just learned what a SID is:
Whenever one logs onto a forum (such as this), we are provided
with a SID.

For example, my SID is :

sid=786b9ecdf00278##################

AHa!. Did you really think I'd tell you my SID? (I covered it with #'s).

So, folks, never give out your SID.

Definition and examples of SID:

http://kb.iu.edu/data/aotl.html

A SID contains:

User and group security descriptors
48-bit ID authority
Revision level
Variable sub-authority values
____________________________________________

Post by tyteen4a03 » Sat Aug 18, 2012 4:45 am

SID means PHP Session ID. Every time you visit another page, it is refreshed in the database. There is no real harm in giving out your PHP Session ID, as hackers can't really do anything with it.

(But if an exploit has been found in the software, this will not be the case - I won't explain here)
LexieTheFox
Rainbow Master
Posts: 769
Joined: Mon Sep 27, 2010 11:51 pm

Re: HTTrack project 20GB

Post by LexieTheFox » Sat Aug 18, 2012 5:06 am

VirtLands wrote:The following attachment is a sorted list of member names, ID's, emails,
compiled from 241 member webpages. Click on the attached download.

Enjoy. Image Image Image
Ok um... I'm not ok with my ID or email being given out on that mirror website. Please, when it's up, display my name ONLY.
Rawr.
Fear me, I bite! >:3