The New Wonderland Archive
Posted: Tue Jul 24, 2012 1:17 pm
by tyteen4a03
In an effort to organize levels and adventures better, I am starting a new project - The New Wonderland Archive.
The project has two parts:
1. Mirroring pcpuzzle.com/forum. I found a very useful program called HTTrack, which allows automatic mirroring of websites. I am currently mirroring the forum with it, and hopefully it will generate some useful results. After all the files have been downloaded, I will parse the data using Python and store it in a database-happy format (see the sketch after this list).
2. Recreating a proper Level Exchange website, powered by PHP. All current comments and pictures will be uploaded to this website.
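As a rough idea of what "database-happy" could look like in practice, here is a minimal Python sketch that walks a mirrored copy of the forum and loads post text into SQLite. The folder name, file pattern, table layout and CSS class are hypothetical - the real phpBB markup would need inspecting first.
Code:
# Minimal sketch: parse mirrored topic pages and store post text in SQLite.
# Paths, table layout and the "postbody" class are assumptions, not the
# forum's confirmed markup.
import sqlite3
from pathlib import Path
from bs4 import BeautifulSoup

conn = sqlite3.connect("archive.db")
conn.execute("""CREATE TABLE IF NOT EXISTS posts
                (topic TEXT, body TEXT)""")

for page in Path("mirror").rglob("viewtopic*.html"):  # hypothetical layout
    soup = BeautifulSoup(page.read_text(errors="ignore"), "html.parser")
    topic = soup.title.string if soup.title else page.name
    # phpBB themes usually wrap each post body in a div; class name varies.
    for body in soup.find_all("div", class_="postbody"):
        conn.execute("INSERT INTO posts VALUES (?, ?)",
                     (topic, body.get_text(" ", strip=True)))

conn.commit()
conn.close()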
This New Wonderland Archive will provide features that this forum lacks, including on-the-fly level archive download, level-with-CustomData archive download, support for files larger than 800kb, and Level Series support. It will also have just-for-fun features, such as Level Of The Month, Level Submission Statistics, and others.
Due to the vast complexity brought by the lack of the actual database data, this project will take several months to a year to complete. Most of the project time will be spent unpacking data and running data integrity checks.
Before continuing with the project, I would like to collect opinions from the members who will be using it the most:
Do you think it's a good idea? If the community feedback is positive, development will begin. I will also look for help along the way, so stay tuned.
Thanks for reading, and have a good day.
(and before you ask, in this post, "Level" stands for both RTW levels and WA Adventures.)
Posted: Tue Jul 24, 2012 1:44 pm
by Sammy_P
Ooh, fun idea! Maybe I could help.
Posted: Tue Jul 24, 2012 2:05 pm
by Wonderman109
Sounds like a good idea!
How do we start it?

Posted: Tue Jul 24, 2012 2:18 pm
by Technos72
My answer to good idea:

Posted: Tue Jul 24, 2012 3:30 pm
by Nobody
Wait, you're going to be taking the WA adventures and stuff and moving them to some random other site? D:
Posted: Tue Jul 24, 2012 3:35 pm
by tyteen4a03
Nobody wrote:Wait, you're going to be taking the WA adventures and stuff and moving them to some random other site? D:
The purpose of this is to allow people to download levels and organize them in a better way; I don't see how this is "taking stuff to other sites". If MS wants, I can even offer forum and game integration (although I cannot express how much I hate phpBB).
phpBB
Posted: Tue Jul 24, 2012 6:22 pm
by VirtLands
Tyteen loves phpBB, and so do I.

Posted: Tue Jul 24, 2012 7:12 pm
by jdl
VirtLands wrote:tyteen4a03 wrote: The purpose of this is to allow people to download levels and organize them in a better way ... (although I cannot express how much I love phpBB)
http://www.httrack.com/ is very interesting.
Yeah I love phpBB too; I don't know any of it.
I can see where this HTTrack has lots of potential, ultimately allowing you to create a powerful program for searching through everything.
How long will it take to download the entire Discussion B to a drive? 
Will it automatically download attachments too?
There has got to be an easier way than using Python to decipher it.
(Python is a pain.)
Please further define the following:
A: On-the-fly level archive download
B: Level-with-CustomData archive download
You have my vote on this.
Nice edit.
Although I'm not tyteen, on-the-fly level archive download probably means that you don't have to register for anything to download (like you have to do for attachments on this forum for example) and CustomData probably means a custom textures/models/etc directory for RTW and WA. Again I'm not sure. Also, I do believe it downloads attachments as well.

Posted: Tue Jul 24, 2012 7:21 pm
by tyteen4a03
Level Archive Download means a zip file containing all levels is zipped up on-the-fly and made available for download.
Level with CustomData Archive Download means the zip file comes with all the CustomData it requires.
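To illustrate what "on-the-fly" means here: the archive is built in memory at request time rather than kept pre-zipped on disk. The real site will do this in PHP; the sketch below only shows the idea in Python, with a hypothetical folder name and file extension.
Code:
# Sketch of an on-the-fly archive: nothing is pre-packaged, the zip is
# assembled in memory from whatever level files are present.
import io
import zipfile
from pathlib import Path

def build_archive(level_dir="levels"):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for level in Path(level_dir).glob("*.wlv"):  # hypothetical extension
            archive.write(level, arcname=level.name)
    buffer.seek(0)
    return buffer  # ready to stream to the browser as a download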
HTTrack will automatically download attachments too - currently 772 MB of data have been downloaded (I ran it for 3:45, with at most 4 active connections so as not to overload the server; this is a software setting). I expect the full download to be above 2GB, however I suspect that only 1/7 of it will be useful.
Python is not a pain once you know how to use it - in fact it's very powerful and is useful for data manipulation.
Posted: Tue Jul 24, 2012 7:41 pm
by VirtLands
okay, let us know the progress.
Have you started yet?
Posted: Tue Jul 24, 2012 7:46 pm
by tyteen4a03
VirtLands wrote:
okay, let us know the progress.
Have you started yet?
Yes, the mirroring has started. Website design has not, however.
HTTrack
Posted: Tue Jul 24, 2012 9:07 pm
by VirtLands
tyteen4a03 wrote:
Yes, the mirroring has started. Website design has not, however.

tyteen4a03 wrote:I have instructed HTTrack to ignore all special pages and non-level forums (hopefully they do work).
Posted: Tue Jul 24, 2012 9:09 pm
by tyteen4a03
I have instructed HTTrack to ignore all special pages and non-level forums (hopefully they do work).
Site mirroring snapshot
Posted: Tue Jul 24, 2012 9:23 pm
by VirtLands
Posted: Tue Jul 24, 2012 9:32 pm
by tyteen4a03
OK, just figured out that I can't make it crawl member-only content just yet. I asked in the forums and will hopefully get an answer soon.
EDIT: Figured out that the login cookie only stays during the first capture. All other pages will appear as not logged in.
tyteen4a03's HTTrack
Posted: Tue Jul 24, 2012 10:10 pm
by VirtLands
UPDATE:
I don't know if my PC will be able to handle all the data.
Project terminated.
Posted: Wed Jul 25, 2012 6:00 am
by tyteen4a03
Yes, you need to turn off these special pages before they eat you.
Nutters hasn't been updated for years; what I am trying to grab are the level comments and anything that hasn't been archived since its closure.
Posted: Wed Jul 25, 2012 10:57 am
by Dark Drago
Good luck!
Here's a checklist if you want:
A: WA stuff
I. LINKS
1. Links to various hubs
2. Links to various editors
3. A separate page for every user's contribution links
4. Links to various editor tools
5. Links to unofficial guide
6. Links to resources
II. DOWNLOAD
1. Custom level textures
2. Custom icons
3. Custom models
4. The WOP archive
5. Custom models
6. Custom object textures
7. Custom water textures
8. Custom resources like spectrum of mpbefs
9. Custom gates, buttons
III. EXTRA
1. Random things like idea generator, color id calculator
2. Wonder Magazine
3. Wonderland Fanfiction
4. Wonderland Illustrations
5. Wonderland Comics
Posted: Wed Jul 25, 2012 11:34 am
by tyteen4a03
WonderWiki will fill the place for textual resources.
(and yes, my webhost is still refusing to unlock the main page. Hrrumph.)
speedtest.net
Posted: Wed Jul 25, 2012 8:44 pm
by VirtLands
Posted: Wed Jul 25, 2012 9:05 pm
by Yzfm
Dark Drago wrote:Good luck!
Here's a checklist if you want:
A: WA stuff
I. LINKS
1. Links to various hubs
2. Links to various editors
3. A separate page for every user's contribution links
4. Links to various editor tools
5. Links to unofficial guide
6. Links to resources
II. DOWNLOAD
1. Custom level textures
2. Custom icons
3. Custom models
4. The WOP archive
5. Custom models
6. Custom object textures
7. Custom water textures
8. Custom resources like spectrum of mpbefs
9. Custom gates, buttons
III. EXTRA
1. Random things like idea generator, color id calculator
2. Wonder Magazine
3. Wonderland Fanfiction
4. Wonderland Illustrations
5. Wonderland Comics
Good ideas, but WM and Wonderland illustrations are not WA...
This site will be excellent anyway...
Posted: Fri Jul 27, 2012 10:46 pm
by tyteen4a03
I finally figured out why HTTrack's login is not persistent - I didn't make HTTrack ignore the logout link.
The archiving process has now resumed.
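For anyone doing the same thing with their own crawler, the general idea is simply to never request anything that looks like a logout URL, otherwise the session cookie gets invalidated mid-crawl. A small hypothetical Python sketch of such a filter (the pattern is a guess at phpBB's usual logout link, not taken from the real forum):
Code:
import re

# phpBB-style logout links usually carry "mode=logout" in the query string;
# following one ends the session, so the crawler must skip such URLs.
LOGOUT_PATTERN = re.compile(r"mode=logout", re.IGNORECASE)

def should_fetch(url):
    return LOGOUT_PATTERN.search(url) is None

assert should_fetch("viewtopic.php?t=123")
assert not should_fetch("login.php?mode=logout&sid=abc123")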
HTTrack progress
Posted: Sat Jul 28, 2012 7:04 pm
by VirtLands
How many megabytes?
Re: HTTrack progress
Posted: Sun Jul 29, 2012 8:57 am
by tyteen4a03
VirtLands wrote:Clever. How many megabytes have you successfully downloaded?
I can guess that the entire project size is about 2 to 4 gigabytes.
Actually, more than that - the current size of the downloaded files is 7.84GB (with only about 250MB of unusable files). My estimate is over 20GB of files in total.
However, I had to abort this method of mirroring the website because it causes too much overhead, and the bot logged itself out. Again.

It also failed to recognize file names. (That isn't really HTTrack's fault; the forum's attachments are member-only, so they're guarded by a PHP page, and HTTrack can't tell who's who.)
I will soon be writing a Python bot that grabs the information instead. It will create much less overhead than HTTrack does now, and will be much more efficient (I can instruct it to ignore everything but posts and attachments).
Until the bot is finished, the archiving process is paused. I might start the website design process soon, but I think I might have to do it all by myself, because not many people here know HTML+CSS or PHP+SQL (which is what will drive the website).
So... wish me luck.
EDIT: Just found a Python framework called Scrapy, which is built on top of my favourite networking framework - Twisted. I will be digging into this framework for the bot.
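As a rough idea of where this could go, here is a bare-bones spider sketch using Scrapy's Spider class; the spider name, start URL and selectors are placeholders rather than the real forum's, and the API shown is the current one rather than the 2012 version.
Code:
# Placeholder Scrapy spider: collects topic titles/links from a forum index
# page and follows the "Next" link. Selectors are guesses at phpBB markup.
import scrapy

class TopicSpider(scrapy.Spider):
    name = "wonderland_topics"
    start_urls = ["http://example.com/forum/viewforum.php?f=1"]

    def parse(self, response):
        # One item per topic row on the index page.
        for link in response.css("a.topictitle"):
            yield {
                "title": link.css("::text").get(),
                "url": response.urljoin(link.attrib["href"]),
            }
        # Follow the "Next" page link, if there is one.
        next_href = response.xpath("//a[contains(text(), 'Next')]/@href").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)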
PCpuzzle Members List Attachment
Posted: Sun Jul 29, 2012 10:55 pm
by VirtLands
I decided to remove the attachment PCPuzzle Members List.TXT, since it contains email data.

Posted: Mon Jul 30, 2012 5:35 pm
by tyteen4a03
This is where Scrapy comes in - I am writing a bot that scrapes Topics, Posts, Attachments and Profile Information. It's going to be hard work (I am still trying to understand how to use it), but in the end I think it will be worth it.
I will also release the source of this bot later on to the general public on GitHub - maybe it will help the others too.
(Oh, and Gigabyte, not Gigabit.)
scraping by with scrapy
Posted: Tue Jul 31, 2012 5:46 am
by VirtLands
Re: scraping by with scrapy
Posted: Tue Jul 31, 2012 7:38 am
by tyteen4a03
VirtLands wrote:I downloaded scrapy, and I read some of the scrapy.PDF. It's complicated.
There has got to be an easier way. Good luck.
It requires some Python programming, yes. Right now, I can get it to extract the topic list. Hopefully I will be able to do more later on.
Posted: Tue Jul 31, 2012 9:35 pm
by tyteen4a03
Excuse the double-post, but I feel that the progress made absolutely warrants it.
I've given up on XPath (the built-in filtering system, also a W3C recommendation) and switched to BeautifulSoup (a Python library that processes HTML files) instead, because XPath was too confusing. Now I can fetch the topic list correctly for one page.
This is good progress, because it means I have figured out how to play around with the tags and work out which is which. I'm pretty sure I will be able to finish the bot really soon.
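For the curious, the topic-list extraction boils down to something like the following BeautifulSoup sketch; the class name is a guess at the forum's markup, not the real thing:
Code:
# Pull topic titles and links out of one forum index page with BeautifulSoup.
# "topictitle" is an assumed phpBB class name.
from bs4 import BeautifulSoup

def extract_topics(html):
    soup = BeautifulSoup(html, "html.parser")
    topics = []
    for link in soup.find_all("a", class_="topictitle"):
        topics.append({"title": link.get_text(strip=True),
                       "url": link.get("href")})
    return topics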
Next challenge: actually finishing the bot (the Next page bit turns out to be very easy, kudos to anybody who gets the following code:)
Code:
# Hacky solution to find out if there's tomorrow
if string.endswith("Next</a>"):
    return Request(string.strip("</a>").strip('<a href="').strip('"Next'), callback=self.handleTomorrow)
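(A side note for anyone copying that snippet: str.strip() removes any of the listed characters from each end, not a literal substring, so it only works by accident. A safer sketch of the same "is there a Next page" check, pulling the href out with a plain regular expression instead of chained strip() calls:)
Code:
# Safer version of the "is there a Next page" check: extract the href with a
# regular expression rather than relying on strip()'s character-set behaviour.
import re

NEXT_LINK = re.compile(r'<a href="([^"]+)">Next</a>')

def next_page_url(html):
    match = NEXT_LINK.search(html)
    return match.group(1) if match else None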
Tyteen's spider bot progress
Posted: Tue Jul 31, 2012 10:39 pm
by VirtLands
Tyteen makes spider progress...
