*******************************************************************************************************
*
*    JWikiDocs
*
*	Copyright (C) 2007 by 
*  
*	Xuan-Hieu Phan
*	Email:	hieuxuan@ecei.tohoku.ac.jp
*		pxhieu@gmail.com
*	URL: 	http://www.hori.ecei.tohoku.ac.jp/~hieuxuan
*
*    	Graduate School of Information Sciences,
*	Tohoku University
*
*******************************************************************************************************


Directory structure:
--------------------

    JWikiDocs
    |---data					# data crawled from Wikipedia
    |	|---Test		
    |	|   |---artificial-intelligence		# sample crawling with topic "Artificial Intelligence"
    |	|   |	|---data			# crawled data will be saved here (each file is a document)
    |	|   |	|---graph.dot			# auto-generated for building graph (for future use)		
    |	|   |	|---option.txt			# option file (prepared BY USERS)		
    |	|   |	|---retrievallog.txt		# crawling log		
    |	|   |---information-retrieval		# sample crawling with topic "Information Retrieval"
    |	|   |	|---data			# crawled data will be saved here (each file is a document)
    |	|   |	|---graph.dot			# auto-generated for building graph (for future use)		
    |	|   |	|---option.txt			# option file (prepared BY USERS)		
    |	|   |	|---retrievallog.txt		# crawling log		
    |	|   |---machine-learning		# sample crawling with topic "Machine Learning"
    |	|   |	|---data			# crawled data will be saved here (each file is a document)
    |	|   |	|---graph.dot			# auto-generated for building graph (for future use)		
    |	|   |	|---option.txt			# option file (prepared BY USERS)		
    |	|   |	|---retrievallog.txt		# crawling log		
    |	|   |---natural-language-processing	# sample crawling with topic "Natural Language Processing"
    |	|   |	|---data			# crawled data will be saved here (each file is a document)
    |	|   |	|---graph.dot			# auto-generated for building graph (for future use)		
    |	|   |	|---option.txt			# option file (prepared BY USERS)		
    |	|   |	|---retrievallog.txt		# crawling log		
    |	|   |---world-wide-web			# sample crawling with topic "World Wide Web"
    |	|   	|---data			# crawled data will be saved here (each file is a document)
    |	|   	|---graph.dot			# auto-generated for building graph (for future use)		
    |	|   	|---option.txt			# option file (prepared BY USERS)		
    |	|   	|---retrievallog.txt		# crawling log		
    |	|
    |	|---(your OWN DIRECTORIES)		# create your own directories for new crawls
    |
    |---lib					# libraries
    |	|---htmlparser				# HTMLParser (see http://htmlparser.sourceforge.net/)
    |	|   |---htmlparser.jar			# Java bytecode of HTMLParser
    |	|   |---src.zip				# Java source code of HTMLParser
    |	|---jwikidocs.jar			# compiling/building output
    |---src					# source code
    |	|---jwikidocs
    |	|   |---Engine.java			# Wikipedia crawling engine
    |	|   |---JWikiDocs.java			# main program
    |	|   |---Option.java			# option class (object) of JWikiDocs
    |	|   |---URLChecker.java			# for checking valid Wikipedia URLs
    |	|   |---URLElement.java		
    |	|---Makefile				# detailed make file
    |---Makefile				# make file for compiling/building JWikiDocs
    |---README					# this file
    

How to build/compile JWikiDocs:
-------------------------------

    Go to the root directory of JWikiDocs and type:
    
	$ make clean	# to clean any previous outputs
	$ make all	# to compile JWikiDocs
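
    According to the directory structure above, the build output is lib/jwikidocs.jar. A minimal
    sketch for checking that the build produced it is shown below. The helper name check_build is
    hypothetical and is not part of the JWikiDocs Makefile.

```shell
# check_build ROOT: report whether ROOT/lib/jwikidocs.jar exists.
# check_build is a hypothetical helper, not part of the JWikiDocs Makefile.
check_build() {
    if [ -f "$1/lib/jwikidocs.jar" ]; then
        echo "build OK: $1/lib/jwikidocs.jar"
    else
        echo "build output missing: $1/lib/jwikidocs.jar"
        return 1
    fi
}

# Run from the root directory of JWikiDocs after "make all".
check_build . || true
```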


How to test JWikiDocs:
----------------------

    Go to the root directory of JWikiDocs or to JWikiDocs/src and type:
    
	$ make test	# to crawl and download data from Wikipedia
			# the sample crawls cover these topics:
			# "Artificial Intelligence", "Information Retrieval", "Machine Learning",
			# "Natural Language Processing", and "World Wide Web"

    After crawling, go to one of the data directories (e.g., JWikiDocs/data/Test/artificial-intelligence/data)
    to see the downloaded documents:

	JWikiDocs
	|---data					    
	|   |---Test		
	|   |	|---artificial-intelligence		
	|   |	|   |---data <= THIS DIRECTORY
    
    Note: open JWikiDocs/src/Makefile to see the command lines used for the sample/test crawls.
    
    
How to crawl Wikipedia with JWikiDocs:
--------------------------------------  
    
    To crawl Wikipedia starting from a given Wikipedia URL (called the seed URL), for example
    Microsoft (http://en.wikipedia.org/wiki/Microsoft), follow the steps below:

    1) Create a directory for the crawl. It can be anywhere, but the recommended location is:
       
	JWikiDocs/data/Microsoft

    2) Create the option file (JWikiDocs/data/Microsoft/option.txt) containing these two lines:

	totalPages=100
	seedURL=http://en.wikipedia.org/wiki/Microsoft

	where:
	- totalPages: the maximum number of Wikipedia documents to download. You can
	  set it to any number you want.
	- seedURL: the Wikipedia URL at which the crawl starts.

	Note: to increase or decrease the maximum crawling depth, change the variable "maxDepth"
	(default value = 4) in the source file (JWikiDocs/src/jwikidocs/Option.java) and rebuild
	JWikiDocs with "make all".
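
	Steps 1 and 2 can also be done from the command line. The following is a minimal sketch,
	assuming it is run from the root directory of JWikiDocs. The variable names CRAWL_DIR,
	TOTAL_PAGES, and SEED_URL are illustrative only; JWikiDocs itself reads nothing but the
	option file.

```shell
# Sketch of steps 1 and 2. CRAWL_DIR, TOTAL_PAGES and SEED_URL are
# illustrative variable names, not settings read by JWikiDocs itself.
CRAWL_DIR="${CRAWL_DIR:-data/Microsoft}"
TOTAL_PAGES="${TOTAL_PAGES:-100}"
SEED_URL="${SEED_URL:-http://en.wikipedia.org/wiki/Microsoft}"

# Step 1: create the crawl directory (no error if it already exists).
mkdir -p "$CRAWL_DIR"

# Step 2: write the two option lines described above.
printf 'totalPages=%s\nseedURL=%s\n' "$TOTAL_PAGES" "$SEED_URL" > "$CRAWL_DIR/option.txt"

# Print the option file for a quick visual check.
cat "$CRAWL_DIR/option.txt"
```

	To prepare a different crawl, override the variables before running the snippet, for
	example by setting CRAWL_DIR and SEED_URL to another directory and seed URL.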

    3) Run JWikiDocs: go to the root directory of JWikiDocs and type:

	$ java -classpath lib/htmlparser/htmlparser.jar:lib/jwikidocs.jar jwikidocs.JWikiDocs -d data/Microsoft

	Note: if Java reports an out-of-memory error, increase the maximum heap size (here to 512 MB)
	with the following command:

	$ java -Xmx512M -classpath lib/htmlparser/htmlparser.jar:lib/jwikidocs.jar jwikidocs.JWikiDocs -d data/Microsoft

    4) After crawling, go to JWikiDocs/data/Microsoft to see the output.
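
	To check how many documents were saved, the files in the crawl's data subdirectory can be
	counted. The sketch below assumes that a custom crawl directory has the same layout as the
	sample directories under data/Test (documents in a "data" subdirectory, one file per
	document); this is an assumption, and count_docs is a hypothetical helper name.

```shell
# count_docs DIR: print the number of regular files in DIR/data.
# Assumes (not confirmed for custom crawls) the same layout as the
# sample directories under data/Test. count_docs is a hypothetical helper.
count_docs() {
    find "$1/data" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example: report how many documents the Microsoft crawl has saved.
if [ -d data/Microsoft/data ]; then
    echo "documents downloaded: $(count_docs data/Microsoft)"
fi
```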


--------------------------------------- END --------------------------------------