The New Wonderland Archive
Posted: Tue Jul 24, 2012 1:17 pm
by tyteen4a03
In an effort to organize levels and adventures better, I am starting a new project - The New Wonderland Archive.
The project has two parts:
1. Mirroring pcpuzzle.com/forum. I found a very useful program called HTTrack, which allows automatic mirroring of websites. I am currently mirroring the forum with it, and hopefully it will generate some useful results. After all the files have been downloaded, I will parse the data using Python and store it in a database-happy format (see the sketch after this list).
2. Recreating a proper Level Exchange website, powered by PHP. All current comments and pictures will be uploaded to this website.
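As a rough idea of what "database-happy" could look like in practice, here is a minimal Python sketch that walks a mirrored copy of the forum and loads post text into SQLite. The folder name, file pattern, table layout and CSS class are hypothetical - the real phpBB markup would need inspecting first.
Code:
# Minimal sketch: parse mirrored topic pages and store post text in SQLite.
# Paths, table layout and the "postbody" class are assumptions, not the
# forum's confirmed markup.
import sqlite3
from pathlib import Path
from bs4 import BeautifulSoup

conn = sqlite3.connect("archive.db")
conn.execute("""CREATE TABLE IF NOT EXISTS posts
                (topic TEXT, body TEXT)""")

for page in Path("mirror").rglob("viewtopic*.html"):  # hypothetical layout
    soup = BeautifulSoup(page.read_text(errors="ignore"), "html.parser")
    topic = soup.title.string if soup.title else page.name
    # phpBB themes usually wrap each post body in a div; class name varies.
    for body in soup.find_all("div", class_="postbody"):
        conn.execute("INSERT INTO posts VALUES (?, ?)",
                     (topic, body.get_text(" ", strip=True)))

conn.commit()
conn.close()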
This New Wonderland Archive will provide features that this forum lacks, including on-the-fly level archive download, level-with-CustomData archive download, support for files larger than 800kb, and Level Series support. It will also have just-for-fun features, such as Level Of The Month, Level Submission Statistics, and others.
Due to the vast complexity brought by the lack of the actual database data, this project will take several months to a year to complete. Most of the project time will be spent unpacking data and running data integrity checks.
Before continuing with the project, I would like to collect opinions from the members who will be using it the most:
Do you think it's a good idea? If the community feedback is positive, development will begin. I will also look for help along the way, so stay tuned.
Thanks for reading, and have a good day.
(and before you ask, in this post, "Level" stands for both RTW levels and WA Adventures.)
Posted: Tue Jul 24, 2012 1:44 pm
by Sammy_P
Ooh, fun idea! Maybe I could help.
Posted: Tue Jul 24, 2012 2:05 pm
by Wonderman109
Sounds like a good idea!
How do we start it?

Posted: Tue Jul 24, 2012 2:18 pm
by Technos72
My answer to good idea:

Posted: Tue Jul 24, 2012 3:30 pm
by Nobody
Wait, you're going to be taking the WA adventures and stuff and moving them to some random other site? D:
Posted: Tue Jul 24, 2012 3:35 pm
by tyteen4a03
Nobody wrote:Wait, you're going to be taking the WA adventures and stuff and moving them to some random other site? D:
The purpose of this is to allow people to download levels and organize them in a better way; I don't see how this is "taking stuff to other sites". If MS wants, I can even offer forum and game integration (although I cannot express how much I hate phpBB).
phpBB
Posted: Tue Jul 24, 2012 6:22 pm
by VirtLands
Tyteen loves phpBB, and so do I.

Posted: Tue Jul 24, 2012 7:12 pm
by jdl
VirtLands wrote:tyteen4a03 wrote: The purpose of this is to allow people to download levels and organize them in a better way ... (although I cannot express how much I love phpBB)
http://www.httrack.com/ is very interesting.
Yeah I love phpBB too; I don't know any of it.
I can see where this HTTrack has lots of potential, ultimately allowing you to create a powerful program for searching through everything.
How long will it take to download the entire Discussion B to a drive? 
Will it automatically download attachments too?
There has got to be an easier way than using Python to decipher it.
(Python is a pain.)
Please further define the following:
A: On-the-fly level archive download
B: Level-with-CustomData archive download
You have my vote on this.
Nice edit.
Although I'm not tyteen, on-the-fly level archive download probably means that you don't have to register for anything to download (like you have to do for attachments on this forum for example) and CustomData probably means a custom textures/models/etc directory for RTW and WA. Again I'm not sure. Also, I do believe it downloads attachments as well.

Posted: Tue Jul 24, 2012 7:21 pm
by tyteen4a03
Level Archive Download means a zip file containing all levels is zipped up on-the-fly and made available for download.
Level with CustomData Archive Download means the zip file comes with all the CustomData it requires.
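To illustrate what "on-the-fly" means here: the archive is built in memory at request time rather than kept pre-zipped on disk. The real site will do this in PHP; the sketch below only shows the idea in Python, with a hypothetical folder name and file extension.
Code:
# Sketch of an on-the-fly archive: nothing is pre-packaged, the zip is
# assembled in memory from whatever level files are present.
import io
import zipfile
from pathlib import Path

def build_archive(level_dir="levels"):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for level in Path(level_dir).glob("*.wlv"):  # hypothetical extension
            archive.write(level, arcname=level.name)
    buffer.seek(0)
    return buffer  # ready to stream to the browser as a download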
HTTrack will automatically download attachments too - currently 772 MB of data have been downloaded (I ran it for 3:45, with at most 4 active connections so as not to overload the server; this is a software setting). I expect the full download to be above 2GB, however I suspect that only 1/7 of it will be useful.
Python is not a pain once you know how to use it - in fact it's very powerful and is useful for data manipulation.
Posted: Tue Jul 24, 2012 7:41 pm
by VirtLands
okay, let us know the progress.
Have you started yet?
Posted: Tue Jul 24, 2012 7:46 pm
by tyteen4a03
VirtLands wrote:
okay, let us know the progress.
Have you started yet?
Yes, the mirroring has started. Website design has not, however.
HTTrack
Posted: Tue Jul 24, 2012 9:07 pm
by VirtLands
tyteen4a03 wrote:
Yes, the mirroring has started. Website design has not, however.

tyteen4a03 wrote:I have instructed HTTrack to ignore all special pages and non-level forums (hopefully they do work).
Posted: Tue Jul 24, 2012 9:09 pm
by tyteen4a03
I have instructed HTTrack to ignore all special pages and non-level forums (hopefully they do work).
Site mirroring snapshot
Posted: Tue Jul 24, 2012 9:23 pm
by VirtLands
Posted: Tue Jul 24, 2012 9:32 pm
by tyteen4a03
OK, just figured out that I can't make it crawl member-only content just yet. I asked in the forums and will hopefully get an answer soon.
EDIT: Figured out that the login cookie only stays during the first capture. All other pages will appear as not logged in.
tyteen4a03's HTTrack
Posted: Tue Jul 24, 2012 10:10 pm
by VirtLands
UPDATE:
I don't know if my PC will be able to handle all the data.
Project terminated.
Posted: Wed Jul 25, 2012 6:00 am
by tyteen4a03
Yes, you need to turn off these special pages before they eat you.
Nutters hasn't been updated for years; what I am trying to grab are the level comments and anything that hasn't been archived since its closure.
Posted: Wed Jul 25, 2012 10:57 am
by Dark Drago
Good luck!
Here's a checklist if you want:
A: WA stuff
I. LINKS
1. Links to various hubs
2. Links to various editors
3. A separate page for every user's contribution links
4. Links to various editor tools
5. Links to unofficial guide
6. Links to resources
II. DOWNLOAD
1. Custom level textures
2. Custom icons
3. Custom models
4. The WOP archive
5. Custom models
6. Custom object textures
7. Custom water textures
8. Custom resources like spectrum of mpbefs
9. Custom gates, buttons
III. EXTRA
1. Random things like idea generator, color id calculator
2. Wonder Magazine
3. Wonderland Fanfiction
4. Wonderland Illustrations
5. Wonderland Comics
Posted: Wed Jul 25, 2012 11:34 am
by tyteen4a03
WonderWiki will fill the place for textual resources.
(and yes, my webhost is still refusing to unlock the main page. Hrrumph.)
speedtest.net
Posted: Wed Jul 25, 2012 8:44 pm
by VirtLands
Posted: Wed Jul 25, 2012 9:05 pm
by Yzfm
Dark Drago wrote:Good luck!
Here's a checklist if you want:
A: WA stuff
I. LINKS
1. Links to various hubs
2. Links to various editors
3. A separate page for every user's contribution links
4. Links to various editor tools
5. Links to unofficial guide
6. Links to resources
II. DOWNLOAD
1. Custom level textures
2. Custom icons
3. Custom models
4. The WOP archive
5. Custom models
6. Custom object textures
7. Custom water textures
8. Custom resources like spectrum of mpbefs
9. Custom gates, buttons
III. EXTRA
1. Random things like idea generator, color id calculator
2. Wonder Magazine
3. Wonderland Fanfiction
4. Wonderland Illustrations
5. Wonderland Comics
Good ideas, but WM and Wonderland illustrations are not WA...
This site will be excellent anyway...
Posted: Fri Jul 27, 2012 10:46 pm
by tyteen4a03
I finally figured out why HTTrack's login is not persistent - I didn't make HTTrack ignore the logout link.
The archiving process has now resumed.
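For anyone doing the same thing with their own crawler, the general idea is simply to never request anything that looks like a logout URL, otherwise the session cookie gets invalidated mid-crawl. A small hypothetical Python sketch of such a filter (the pattern is a guess at phpBB's usual logout link, not taken from the real forum):
Code:
import re

# phpBB-style logout links usually carry "mode=logout" in the query string;
# following one ends the session, so the crawler must skip such URLs.
LOGOUT_PATTERN = re.compile(r"mode=logout", re.IGNORECASE)

def should_fetch(url):
    return LOGOUT_PATTERN.search(url) is None

assert should_fetch("viewtopic.php?t=123")
assert not should_fetch("login.php?mode=logout&sid=abc123")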
HTTrack progress
Posted: Sat Jul 28, 2012 7:04 pm
by VirtLands
How many megabytes?
Re: HTTrack progress
Posted: Sun Jul 29, 2012 8:57 am
by tyteen4a03
VirtLands wrote:Clever. How many megabytes have you successfully downloaded?
I can guess that the entire project size is about 2 to 4 gigabytes.
Actually, more than that - the current size of the downloaded files is 7.84GB (with only about 250MB of unusable files). My estimate is over 20GB of files in total.
However, I had to abort this method of mirroring the website because it causes too much overhead, and the bot logged itself out. Again.

It also failed to recognize file names. (That isn't really HTTrack's fault; the forum's attachments are member-only, so they're guarded by a PHP page, and HTTrack can't tell who's who.)
I will soon be writing a Python bot that grabs the information instead. It will create much less overhead than HTTrack does now, and will be much more efficient (I can instruct it to ignore everything but posts and attachments).
Until the bot is finished, the archiving process is paused. I might start the website design process soon, but I think I might have to do it all by myself, because not many people here know HTML+CSS or PHP+SQL (which is what will drive the website).
So... wish me luck.
EDIT: Just found a Python framework called Scrapy, which is built on top of my favourite networking framework - Twisted. I will be digging into this framework for the bot.
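As a rough idea of where this could go, here is a bare-bones spider sketch using Scrapy's Spider class; the spider name, start URL and selectors are placeholders rather than the real forum's, and the API shown is the current one rather than the 2012 version.
Code:
# Placeholder Scrapy spider: collects topic titles/links from a forum index
# page and follows the "Next" link. Selectors are guesses at phpBB markup.
import scrapy

class TopicSpider(scrapy.Spider):
    name = "wonderland_topics"
    start_urls = ["http://example.com/forum/viewforum.php?f=1"]

    def parse(self, response):
        # One item per topic row on the index page.
        for link in response.css("a.topictitle"):
            yield {
                "title": link.css("::text").get(),
                "url": response.urljoin(link.attrib["href"]),
            }
        # Follow the "Next" page link, if there is one.
        next_href = response.xpath("//a[contains(text(), 'Next')]/@href").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)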
PCpuzzle Members List Attachment
Posted: Sun Jul 29, 2012 10:55 pm
by VirtLands
I decided to remove the attachment PCPuzzle Members List.TXT, since it contains email data.

Posted: Mon Jul 30, 2012 5:35 pm
by tyteen4a03
This is where Scrapy comes in - I am writing a bot that scrapes Topics, Posts, Attachments and Profile Information. It's going to be hard work (I am still trying to understand how to use it), but in the end I think it will be worth it.
I will also release the source of this bot later on to the general public on GitHub - maybe it will help the others too.
(Oh, and Gigabyte, not Gigabit.)
scraping by with scrapy
Posted: Tue Jul 31, 2012 5:46 am
by VirtLands
Re: scraping by with scrapy
Posted: Tue Jul 31, 2012 7:38 am
by tyteen4a03
VirtLands wrote:I downloaded scrapy, and I read some of the scrapy.PDF. It's complicated.
There has got to be an easier way. Good luck.
It requires some Python programming, yes. Right now, I can get it to extract the topic list. Hopefully I will be able to do more later on.
Posted: Tue Jul 31, 2012 9:35 pm
by tyteen4a03
Excuse the double-post, but I feel that the progress made absolutely warrants it.
I've given up on XPath (the built-in filtering system, also a W3C recommendation) and switched to BeautifulSoup (a Python library that processes HTML files) instead, because XPath was too confusing. Now I can fetch the topic list correctly for one page.
This is good progress, because it means I have figured out how to play around with the tags and work out which is which. I'm pretty sure I will be able to finish the bot really soon.
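For the curious, the topic-list extraction boils down to something like the following BeautifulSoup sketch; the class name is a guess at the forum's markup, not the real thing:
Code:
# Pull topic titles and links out of one forum index page with BeautifulSoup.
# "topictitle" is an assumed phpBB class name.
from bs4 import BeautifulSoup

def extract_topics(html):
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for link in soup.find_all("a", class_="topictitle"):
        topics.append({"title": link.get_text(strip=True),
                       "url": link.get("href")})
    return topics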
Next challenge: actually finishing the bot (the Next page bit turns out to be very easy, kudos to anybody who gets the following code:)
Code:
# Hacky solution to find out if there's tomorrow
if string.endswith("Next</a>"):
    return Request(string.strip("</a>").strip('<a href="').strip('"Next'), callback=self.handleTomorrow)
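(A side note for anyone copying that snippet: str.strip() removes any of the listed characters from each end, not a literal substring, so it only works by accident. A safer sketch of the same "is there a Next page" check, pulling the href out with a plain regular expression instead of chained strip() calls:)
Code:
# Safer version of the "is there a Next page" check: extract the href with a
# regular expression rather than relying on strip()'s character-set behaviour.
import re

NEXT_LINK = re.compile(r'<a href="([^"]+)">Next</a>')

def next_page_url(html):
    match = NEXT_LINK.search(html)
    return match.group(1) if match else None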
Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 10:39 pm
by VirtLands
Tyteen makes spider progress...
