Including JWikiDocs (NEW) - A Java tool for crawling and downloading Wikipedia documents
Copyright (c) 2006 - 2007 by
Xuan-Hieu Phan (pxhieu at gmail dot com), Graduate School of Information Sciences, Tohoku University
JWebPro: A Java-based Web Processing toolkit that can interact with Google search via Google Web APIs and then process the returned Web documents in a couple of ways. The outputs of JWebPro can serve as inputs for natural language processing, information retrieval, information extraction, Web data mining, online social network extraction/analysis, and ontology development applications. Currently, JWebPro includes the following features:
In the near future, we plan to develop new features as follows:
Generally speaking, this toolkit aims to give researchers and practitioners a convenient framework for interacting with the Web when building applications in information retrieval, information extraction, text/web mining, natural language processing, social network analysis, and many other areas. If you want to process only offline text documents, please visit JTextPro.
This project also includes JWikiDocs (NEW) - a tool for crawling and downloading Wikipedia documents. JWikiDocs provides some features:
JWikiDocs is useful for building Web data collections/corpora for Text/Web Data Mining and Natural Language Processing. See the README file to learn how to compile, test, and run JWikiDocs.
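As a rough illustration of one step that any Wikipedia crawler has to perform, the sketch below extracts internal article links from a downloaded HTML page so that they can be queued for later download. It is not taken from the JWikiDocs source: the class name WikiLinkExtractor, the method extractArticleLinks, the regular expression, and the rule of skipping titles that contain a colon (a heuristic for namespace pages such as File: or Category:) are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JWikiDocs: finds internal article links in a page.
class WikiLinkExtractor {
    // Captures the title in href="/wiki/Title", stopping at a quote, '#' or '?'.
    private static final Pattern WIKI_HREF =
            Pattern.compile("href=\"/wiki/([^\"#?]+)");

    // Returns the distinct article titles in the order they first appear.
    static List<String> extractArticleLinks(String html) {
        Set<String> titles = new LinkedHashSet<>();
        Matcher m = WIKI_HREF.matcher(html);
        while (m.find()) {
            String title = m.group(1);
            // Assumption: a ':' marks a namespace page (File:, Category:, ...), so skip it.
            if (!title.contains(":")) {
                titles.add(title);
            }
        }
        return new ArrayList<>(titles);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/wiki/Text_mining\">Text mining</a> "
                + "<a href=\"/wiki/File:Logo.png\">logo</a> "
                + "<a href=\"/wiki/Text_mining#History\">history</a>";
        System.out.println(extractArticleLinks(html));
    }
}
```

A crawler built along these lines would download each extracted title in turn, up to some depth or page limit, and store the pages as a corpus. The colon heuristic also drops the few real articles whose titles contain a colon, so a more careful filter may be needed in practice.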
How to use JWebPro:
Before using JWebPro, you need a Google account and a client key for accessing Google Web search through the Google Web APIs. To learn more about how JWebPro interacts with the Google search engine, please visit Google SOAP Search APIs. Unfortunately, Google stopped issuing new client keys for the SOAP Search APIs as of December 5, 2006. As a result, only users who already have a client key can use JWebPro. See the README file in the JWebPro source code for how to provide your client key to JWebPro. Note that the current version only supports a direct connection to the Internet, i.e., it assumes you do not need to provide a proxy username/password to reach the outside world. In the future, we will explore ways other than the SOAP Search API to connect to Google search.
Researchers who use these tools to run experiments should include the following citations:
Xuan-Hieu Phan, "JWebPro: A Java-based Web Processing Toolkit", http://jwebpro.sourceforge.net/, 2006.
Xuan-Hieu Phan, "JWikiDocs: A Java-based Wikipedia Crawling Toolkit", http://jwebpro.sourceforge.net/, 2007.
We would like to thank Professor Tu-Bao Ho for providing us with the Penn Treebank data used to train the POS tagging and chunking models. We would also like to thank SourceForge.net for hosting this project.
Last updated: September 04, 2007