The New Wonderland Archive

User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

The New Wonderland Archive

Post by tyteen4a03 » Tue Jul 24, 2012 1:17 pm

As an effort to organize levels and adventures better, I am starting a new project - The New Wonderland Archive.

The project has two parts:
1. Mirroring pcpuzzle.com/forum. I found a very useful program called HTTrack, which allows automatic mirroring of websites. I am currently mirroring the forum with it, and hopefully it will generate some useful results. After all files have been downloaded, I will parse the data using Python and store it in a database-friendly format.

2. Recreating a proper Level Exchange website, powered by PHP. All current comments and pictures will be uploaded to this website.
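
A minimal sketch of the "parse with Python and store in a database" step from part 1, using only the standard library. Everything specific here is an assumption made for illustration: the `postbody` class name, the `posts` table layout, and the `store_posts` helper are hypothetical, and the real forum markup may well differ.

```python
import sqlite3
from html.parser import HTMLParser


class PostTextParser(HTMLParser):
    """Collects the text of every <div class="postbody"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished post texts, in page order
        self._depth = 0      # > 0 while inside a postbody div
        self._chunks = []    # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1             # nested div inside a post
        elif ("class", "postbody") in attrs:
            self._depth = 1              # a new post starts here
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:         # the post's own div just closed
                self.posts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def store_posts(conn, topic_id, html):
    """Parse one mirrored page and insert its posts; returns how many were stored."""
    parser = PostTextParser()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (topic_id INTEGER, position INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(topic_id, i, body) for i, body in enumerate(parser.posts)],
    )
    conn.commit()
    return len(parser.posts)
```

Calling `store_posts(sqlite3.connect("archive.db"), topic_id, page_html)` once per mirrored page would fill the table; SQLite is used here only because it needs no server, not because the project has committed to it.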

This New Wonderland Archive will provide features that this forum lacks, including on-the-fly level archive download, Level-with-CustomData archive download, support for files larger than 800 KB, and Level Series support. It will also include just-for-fun features, such as Level Of The Month, Level Submission Statistics, and others.

Because I don't have access to the actual database, the work is considerably more complex, and this project will take several months to a year to complete. Most of that time will be spent unpacking data and checking its integrity.

Before continuing with the project, I would like to collect opinions from members, who will be using this project the most: Do you think it's a good idea? If the community feedback is positive, development will begin. I will also look for help along the way, so stay tuned.

Thanks for reading, and have a good day. 8)

(and before you ask, in this post, "Level" stands for both RTW levels and WA Adventures.)
and the duck went moo

Beep bloop
User avatar
Sammy_P
Rainbow SuperStar
Posts: 2905
Joined: Fri May 11, 2007 9:01 pm
Location: he/they land
Contact:

Post by Sammy_P » Tue Jul 24, 2012 1:44 pm

Ooh, fun idea! Maybe I could help.
User avatar
Wonderman109
Rainbow MegaStar
Posts: 3593
Joined: Thu Jun 28, 2012 11:25 pm

Post by Wonderman109 » Tue Jul 24, 2012 2:05 pm

Sounds like a good idea! 8)

How do we start it :?:
Not really around much these years.
User avatar
Technos72
Rainbow MegaStar
Posts: 3227
Joined: Thu Nov 26, 2009 12:20 am
Location: Usa
Contact:

Post by Technos72 » Tue Jul 24, 2012 2:18 pm

My answer to good idea:
Image
User avatar
Nobody
Rainbow Spirit Chaser
Posts: 5545
Joined: Thu Aug 21, 2008 5:52 pm

Post by Nobody » Tue Jul 24, 2012 3:30 pm

Wait, you're going to be taking the WA adventures and stuff and moving them to some random other site? D:
i should change my signature to be rude to people who hate pictures of valves
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Tue Jul 24, 2012 3:35 pm

Nobody wrote:Wait, you're going to be taking the WA adventures and stuff and moving them to some random other site? D:
The purpose of this is to let people download levels and organize them in a better way; I don't see how this is "taking stuff to other sites". If MS wants, I can even offer forum and game integration (although I cannot express how much I hate phpBB).
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

phpBB

Post by VirtLands » Tue Jul 24, 2012 6:22 pm

Tyteen loves phpBB, and so do I. :shock:
Last edited by VirtLands on Sat Aug 18, 2012 5:45 pm, edited 2 times in total.
User avatar
jdl
Rainbow SuperStar
Posts: 2894
Joined: Fri Jun 06, 2008 8:37 pm
Location: West Virginia, USA
Contact:

Post by jdl » Tue Jul 24, 2012 7:12 pm

VirtLands wrote:
tyteen4a03 wrote:The purpose of this is to allow people to download levels and organize them in a better way,
... (although I cannot express how much I love phpBB Image)

http://www.httrack.com/ is very interesting.

Yeah I love phpBB too; I don't know any of it.

I can see where this HTTrack has lots of potential, ultimately
allowing you to create a powerful program for searching through
everything.

How long will it take to download the entire Discussion B to a drive? :shock:
Will it automatically download attachments too?

There has got to be an easier way than using Python to decipher it.
(Python is a pain.)

Please further define the following:

A: On-the-fly level archive download
B: Level-with-CustomData archive download

You have my vote on this.
Nice edit. :P

Although I'm not tyteen, on-the-fly level archive download probably means that you don't have to register for anything to download (like you have to do for attachments on this forum for example) and CustomData probably means a custom textures/models/etc directory for RTW and WA. Again I'm not sure. Also, I do believe it downloads attachments as well. :)
TheCracksOverhead#9565 | Oops, uh oh.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Tue Jul 24, 2012 7:21 pm

On-the-fly Level Archive Download means that all the levels are zipped into a single file on the fly and offered for download.

Level with CustomData Archive Download means the zip file comes with all CustomData it requires.
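
The two download types described above could be sketched like this. It is only an illustration in Python (the site itself is planned in PHP), and the `build_level_archive` helper, the file names, and the `CustomData/` folder layout inside the zip are assumptions, not the project's actual design.

```python
import io
import zipfile


def build_level_archive(levels, custom_data=None):
    """Return the bytes of a zip built in memory.

    levels and custom_data are dicts mapping a path inside the archive
    to that file's bytes. custom_data is optional.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in levels.items():
            archive.writestr(name, data)
        # Level-with-CustomData download: bundle the extra files as well.
        for name, data in (custom_data or {}).items():
            archive.writestr("CustomData/" + name, data)
    return buffer.getvalue()
```

Because the archive is assembled in a `BytesIO` buffer, nothing has to be written to disk before the response is sent; for very large selections a streaming approach would probably be a better fit.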

HTTrack will automatically download attachments too. So far 772 MB of data has been downloaded (I ran it for 3:45 with at most 4 active connections so as not to overload the server; this is a software setting :( ). I expect the total to be above 2 GB, though I suspect only about 1/7 of it will be useful.
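
For comparison, the same "at most 4 active connections" limit can be expressed in a hand-written Python downloader with a fixed-size thread pool. This is a sketch, not HTTrack's mechanism; `fetch_all` and the `fetch` callable passed to it are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 4  # the cap mentioned in the post


def fetch_all(urls, fetch):
    """Call fetch(url) for every URL with at most MAX_CONNECTIONS running at once.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        return list(pool.map(fetch, urls))
```

The pool size is the only throttle here; a polite crawler would usually also add a delay between requests.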

Python is not a pain once you know how to use it - in fact it's very powerful and is useful for data manipulation.
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

Post by VirtLands » Tue Jul 24, 2012 7:41 pm

Image okay, let us know the progress.
Have you started yet?
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Tue Jul 24, 2012 7:46 pm

VirtLands wrote:Image okay, let us know the progress.
Have you started yet?
Yes, the mirroring has started. Website design has not, however.
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

HTTrack

Post by VirtLands » Tue Jul 24, 2012 9:07 pm

tyteen4a03 wrote:Image Yes, the mirroring has started. Website design has not, however. Image
tyteen4a03 wrote:I have instructed HTTrack to ignore all special pages and non-level forums (hopefully they do work).
Last edited by VirtLands on Tue Jul 24, 2012 9:49 pm, edited 3 times in total.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Tue Jul 24, 2012 9:09 pm

I have instructed HTTrack to ignore all special pages and non-level forums (hopefully the filters do work).
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

SiteMirroringsnapshot

Post by VirtLands » Tue Jul 24, 2012 9:23 pm

Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:42 pm, edited 2 times in total.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Tue Jul 24, 2012 9:32 pm

OK, just figured out that I can't make it crawl member-only content just yet. I asked in the forums and will hopefully get an answer soon.

EDIT: Figured out that the login cookie only stays during the first capture. All other pages will appear as not logged in.
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

tyteen4a03's HTTrack

Post by VirtLands » Tue Jul 24, 2012 10:10 pm

UPDATE:

I don't know if my PC will be able to handle all the data.

project terminated. Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:44 pm, edited 2 times in total.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Wed Jul 25, 2012 6:00 am

Yes, you need to turn off these special pages before they eat you. :lol:

Nutters hasn't been updated for years; what I am trying to grab is the level comments and anything that hasn't been archived since its closure.
User avatar
Dark Drago
Rainbow Master
Posts: 919
Joined: Mon Apr 09, 2012 7:50 am

Post by Dark Drago » Wed Jul 25, 2012 10:57 am

Good luck!
Here's a checklist if you want:

A: WA stuff

I. LINKS

1. Links to various hubs
2. Links to various editors
3. A separate page for every user's contribution links
4. Links to various editor tools
5. Links to unofficial guide
6. Links to resources

II. DOWNLOAD
1. Custom level textures
2. Custom icons
3. Custom models
4. The WOP archive
5. Custom models
6. Custom object textures
7. Custom water textures
8. Custom resources like spectrum of mpbefs
9. Custom gates, buttons

III. EXTRA
1. Random things like idea generator, color id calculator
2. Wonder Magazine
3. Wonderland Fanfiction
4. Wonderland Illustrations
5. Wonderland Comics
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Wed Jul 25, 2012 11:34 am

WonderWiki will fill the place for textual resources.

(and yes, my webhost is still refusing to unlock the main page. Hrrumph.)
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

speedtest.net

Post by VirtLands » Wed Jul 25, 2012 8:44 pm

tyteen4a03 wrote:Yes, you need to turn off these special pages before they eat you. Image Nutters hasn't been updated for years.
(ha,ha) Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:45 pm, edited 2 times in total.
User avatar
Yzfm
Rainbow Master
Posts: 853
Joined: Mon Jul 02, 2012 11:35 am

Post by Yzfm » Wed Jul 25, 2012 9:05 pm

Dark Drago wrote:Good luck!
Here's a checklist if you want:

Good ideas, but WM and Wonderland illustrations are not WA...

This site will be excellent anyway...
Previous Adventure:Time Out
Latest Adventure:Please Don't Feed The Dinosaurs!
Upcoming Adventure: History Lessons
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Fri Jul 27, 2012 10:46 pm

I finally figured out why HTTrack's login is not persistent - I didn't make HTTrack ignore the logout link. :oops:

The archiving process has now resumed.
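
The logout problem comes down to one rule: a logged-in crawler must never follow the link that ends its own session. A small Python sketch of such a check follows; the two URL shapes it recognises (`login.php?logout=true` and `ucp.php?mode=logout`) are assumed phpBB-style examples, and the helper names are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def is_logout_link(url):
    """True if following this URL would probably end the crawler's session."""
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    # Assumed logout URL shapes: ...?logout=true or ...?mode=logout
    return "logout" in query or query.get("mode") == ["logout"]


def links_to_follow(urls):
    """Drop logout links and keep everything else, preserving order."""
    return [u for u in urls if not is_logout_link(u)]
```

A mirroring tool's own exclusion filters do the same job; the point is simply that the check has to run on every link before it is queued.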
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

HTTrack progress

Post by VirtLands » Sat Jul 28, 2012 7:04 pm

How many megabytes ?
Last edited by VirtLands on Sat Aug 18, 2012 5:44 pm, edited 2 times in total.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Re: HTTrack progress

Post by tyteen4a03 » Sun Jul 29, 2012 8:57 am

VirtLands wrote:Clever. How many megabytes have you successfully downloaded?

I can guess that the entire project size is about 2 to 4 gigabytes.
Actually, more than that - the files downloaded so far total 7.84 GB (of which only about 250 MB is unusable). My estimate for the whole thing is over 20 GB.

However, I had to abandon this method of mirroring the website because it was causing too much overhead and the bot logged itself out. Again. :evil: It also failed to recognize file names. (This isn't really HTTrack's fault: the forum's attachments are member-only, so they are served through a PHP page, and HTTrack can't tell which file is which.)

I will soon be writing a Python bot that grabs the information instead. It will create much less overhead than HTTrack does, and will be much more efficient (I can instruct it to ignore everything but posts and attachments).
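
"Ignore everything but posts and attachments" amounts to an allowlist on the script each URL points at. A rough sketch, assuming (hypothetically) that topics live under `viewtopic.php` and attachments under `download.php`; the real script names would need checking against the forum.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical script names; adjust to whatever the forum actually uses.
WANTED = {"viewtopic.php": "posts", "download.php": "attachment"}


def classify(url):
    """Return 'posts', 'attachment', or None for pages the bot should skip."""
    script = posixpath.basename(urlparse(url).path)
    return WANTED.get(script)
```

Anything that returns `None` (member lists, search pages, profile pages and so on) is never requested, which is where the saving in overhead would come from.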

Until the bot's finished, the archiving process is paused. I might start the website design process soon, but I think I might have to do this all by myself because not many people here know HTML+CSS or PHP+SQL (which is what will drive the website).

So... wish me luck. :)

EDIT: Just found a Python framework called Scrapy, which is built on top of my favourite networking framework - Twisted. I will be digging into this framework for the bot.
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

PCpuzzle Members List Attachment

Post by VirtLands » Sun Jul 29, 2012 10:55 pm


I decided to remove the attachment which is PCPuzzle Members List.TXT,
since it contains email data.
Image


Image Image
Last edited by VirtLands on Sat Aug 18, 2012 6:46 pm, edited 4 times in total.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Mon Jul 30, 2012 5:35 pm

This is where Scrapy comes in - I am writing a bot that scrapes Topics, Posts, Attachments and Profile Information. It's going to be hard work (I am still trying to understand how to use it), but in the end I think it will be worth it.

I will also release the source of this bot later on to the general public on GitHub - maybe it will help the others too.

(Oh, and Gigabyte, not Gigabit.)
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

scraping by with scrapy

Post by VirtLands » Tue Jul 31, 2012 5:46 am

Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:52 pm, edited 1 time in total.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Re: scraping by with scrapy

Post by tyteen4a03 » Tue Jul 31, 2012 7:38 am

VirtLands wrote:I downloaded scrapy, and I read some of the scrapy.PDF. It's complicated.
There has got to be an easier way. Good luck.
It requires some Python programming, yes. Right now, I can get it to extract the topic list. Hopefully I will be able to do more later on.
User avatar
tyteen4a03
Rainbow AllStar
Posts: 4386
Joined: Wed Jul 12, 2006 7:16 am
Contact:

Post by tyteen4a03 » Tue Jul 31, 2012 9:35 pm

Excuse the double-post, but I feel that the progress made is absolutely worth it.

I've given up on XPath (the built-in filtering system, also a W3C recommendation) and switched to BeautifulSoup (a Python library that processes HTML files) instead, because XPath was too confusing. Now I can fetch the topic list correctly from a single page.
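
To show what "fetching the topic list" involves, here is a sketch that uses the standard library's `html.parser` rather than BeautifulSoup, purely so it runs without extra installs; it is not the bot's actual code. It assumes topic links are marked up as `<a class="topictitle" href="...">`, which is a guess at the forum's markup.

```python
from html.parser import HTMLParser


class TopicListParser(HTMLParser):
    """Collects (href, title) for every <a class="topictitle"> link."""

    def __init__(self):
        super().__init__()
        self.topics = []
        self._href = None   # set while inside a topic link
        self._text = []     # title fragments of the current link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "topictitle":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.topics.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_topics(html):
    """Return the list of (href, title) pairs found on one forum page."""
    parser = TopicListParser()
    parser.feed(html)
    return parser.topics
```

With BeautifulSoup the same selection is a one-line query, which is presumably why it felt easier than XPath here.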

This is good progress because it means I have figured out how to work with the tags and tell which is which. I'm pretty sure that I will be able to finish the bot real soon.

Next challenge: actually finishing the bot (the Next page bit turns out to be very easy, kudos to anybody who gets the following code:)

Code:

# Hacky solution to find out if there's tomorrow
if string.endswith("Next</a>"):
    return Request(string.strip("</a>").strip('<a href="').strip('"Next'), callback=self.handleTomorrow)
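
One caveat about the snippet above: `str.strip()` removes any of the given characters from both ends, not the given substring, so the chained calls only work when the URL's first and last characters happen not to be in those sets. A less fragile sketch of the same "is there a next page?" check is below. It is a stand-alone illustration, not the bot's code: the `next_page_url` helper and the exact shape of the pagination link are assumptions.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Matches an anchor whose visible text is exactly "Next" and captures its href.
NEXT_LINK = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>\s*Next\s*</a>', re.IGNORECASE)


def next_page_url(html, base_url):
    """Return the absolute URL of the "Next" page, or None on the last page."""
    match = NEXT_LINK.search(html)
    if match is None:
        return None
    # unescape() turns &amp; inside the attribute back into a plain &.
    return urljoin(base_url, unescape(match.group(1)))
```

In a Scrapy spider the returned URL would be wrapped in a `Request` with the page-handling callback, as the original snippet does.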
Attachments
spider.png
The spider working.
spider.png (103.33 KiB) Viewed 8562 times
User avatar
VirtLands
Rainbow Master
Posts: 756
Joined: Thu Dec 29, 2005 1:49 am

Tyteen's spider bot progress

Post by VirtLands » Tue Jul 31, 2012 10:39 pm

Tyteen makes spider progress... Image Image Image
Last edited by VirtLands on Thu Aug 16, 2012 9:57 pm, edited 3 times in total.