*******************************************************************************************************
*
*   JWikiDocs
*
*   Copyright (C) 2007 by
*
*   Xuan-Hieu Phan
*   Email: hieuxuan@ecei.tohoku.ac.jp
*          pxhieu@gmail.com
*   URL:   http://www.hori.ecei.tohoku.ac.jp/~hieuxuan
*
*   Graduate School of Information Sciences,
*   Tohoku University
*
*******************************************************************************************************

Directory structure:
--------------------

JWikiDocs
|---data                                    # data crawled from Wikipedia
|   |---Test
|   |   |---artificial-intelligence         # sample crawl with topic "Artificial Intelligence"
|   |   |   |---data                        # crawled data is saved here (each file is a document)
|   |   |   |---graph.dot                   # auto-generated for building a graph (for future use)
|   |   |   |---option.txt                  # option file (prepared BY USERS)
|   |   |   |---retrievallog.txt            # crawling log
|   |   |---information-retrieval           # sample crawl with topic "Information Retrieval"
|   |   |   |---data                        # crawled data is saved here (each file is a document)
|   |   |   |---graph.dot                   # auto-generated for building a graph (for future use)
|   |   |   |---option.txt                  # option file (prepared BY USERS)
|   |   |   |---retrievallog.txt            # crawling log
|   |   |---machine-learning                # sample crawl with topic "Machine Learning"
|   |   |   |---data                        # crawled data is saved here (each file is a document)
|   |   |   |---graph.dot                   # auto-generated for building a graph (for future use)
|   |   |   |---option.txt                  # option file (prepared BY USERS)
|   |   |   |---retrievallog.txt            # crawling log
|   |   |---natural-language-processing     # sample crawl with topic "Natural Language Processing"
|   |   |   |---data                        # crawled data is saved here (each file is a document)
|   |   |   |---graph.dot                   # auto-generated for building a graph (for future use)
|   |   |   |---option.txt                  # option file (prepared BY USERS)
|   |   |   |---retrievallog.txt            # crawling log
|   |   |---world-wide-web                  # sample crawl with topic "World Wide Web"
|   |       |---data                        # crawled data is saved here (each file is a document)
|   |       |---graph.dot                   # auto-generated for building a graph (for future use)
|   |       |---option.txt                  # option file (prepared BY USERS)
|   |       |---retrievallog.txt            # crawling log
|   |
|   |---(your OWN DIRECTORIES)              # create your own directories for new crawls
|
|---lib                                     # libraries
|   |---htmlparser                          # HTMLParser (see http://htmlparser.sourceforge.net/)
|   |   |---htmlparser.jar                  # Java bytecode of HTMLParser
|   |   |---src.zip                         # Java source code of HTMLParser
|   |---jwikidocs.jar                       # build output
|---src                                     # source code
|   |---jwikidocs
|   |   |---Engine.java                     # Wikipedia crawling engine
|   |   |---JWikiDocs.java                  # main program
|   |   |---Option.java                     # option class (object) of JWikiDocs
|   |   |---URLChecker.java                 # checks for valid Wikipedia URLs
|   |   |---URLElement.java
|   |---Makefile                            # detailed make file
|---Makefile                                # make file for compiling/building JWikiDocs
|---README                                  # this file

How to build/compile JWikiDocs:
-------------------------------

Go to the root directory of JWikiDocs and type:

    $ make clean    # to clean any previous outputs
    $ make all      # to compile JWikiDocs

How to test JWikiDocs:
----------------------

Go to the root directory of JWikiDocs or to JWikiDocs/src and type:

    $ make test     # to crawl and download data from Wikipedia
                    # the sample crawls are:
                    # "Artificial Intelligence", "Information Retrieval", "Machine Learning",
                    # "Natural Language Processing", and "World Wide Web"

After crawling, go to the data directory of a sample crawl
(e.g., JWikiDocs/data/Test/artificial-intelligence/data) to see the outputs:

JWikiDocs
|---data
|   |---Test
|   |   |---artificial-intelligence
|   |   |   |---data                        <= THIS DIRECTORY

Notes: open JWikiDocs/src/Makefile to see the command lines of the sample/test crawls.

How to crawl Wikipedia with JWikiDocs:
--------------------------------------

To crawl Wikipedia from a starting Wikipedia URL (called the seed URL), for example
Microsoft (http://en.wikipedia.org/wiki/Microsoft), follow the steps below:

1) Create your own directory:

   The directory can be created anywhere, but JWikiDocs/data/Microsoft is the recommended location.

2) Create the option file (JWikiDocs/data/Microsoft/option.txt) and insert two lines:

    totalPages=100
    seedURL=http://en.wikipedia.org/wiki/Microsoft

   where:
   - totalPages: the maximum number of Wikipedia documents that will be downloaded.
                 You can set this to any number you want.
   - seedURL:    the starting Wikipedia URL.

   Notes: you can modify the variable "maxDepth" (default value = 4) in the source code file
          (JWikiDocs/src/jwikidocs/Option.java) to increase or decrease the maximum
          crawling depth.

3) Run JWikiDocs. Go to the root directory of JWikiDocs and type:

    $ java -classpath lib/htmlparser/htmlparser.jar:lib/jwikidocs.jar jwikidocs.JWikiDocs -d data/Microsoft

   Notes: if you encounter an "Out of Memory" error, you can increase the Java heap size
          with the following command:

    $ java -Xmx512M -classpath lib/htmlparser/htmlparser.jar:lib/jwikidocs.jar jwikidocs.JWikiDocs -d data/Microsoft

4) After crawling, go to JWikiDocs/data/Microsoft to see the outputs.

--------------------------------------- END --------------------------------------
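
The option file from step 2 of the crawling instructions is a list of "key=value" lines. The sketch below shows one way such a file could be read. It is only an illustration: the class name OptionFileSketch and the method parse are hypothetical, and the actual parsing in JWikiDocs/src/jwikidocs/Option.java may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JWikiDocs: reads "key=value" option lines.
public class OptionFileSketch {

    // Parses option text; blank lines and lines without a key before '=' are skipped.
    public static Map<String, String> parse(String text) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');   // split on the first '=' only
            if (trimmed.isEmpty() || eq <= 0) {
                continue;
            }
            String key = trimmed.substring(0, eq).trim();
            String value = trimmed.substring(eq + 1).trim();
            options.put(key, value);
        }
        return options;
    }

    public static void main(String[] args) {
        // Same two lines as the option.txt example in step 2.
        String sample = "totalPages=100\n"
                      + "seedURL=http://en.wikipedia.org/wiki/Microsoft\n";
        Map<String, String> options = parse(sample);
        int totalPages = Integer.parseInt(options.get("totalPages"));
        System.out.println(totalPages + " pages from " + options.get("seedURL"));
    }
}
```

Splitting on the first '=' only keeps the rest of the line intact as the value, so a seed URL that itself contains '=' (for example in a query string) would not be cut short.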