JWebPro: A Java-based Web Processing Toolkit

Including JWikiDocs (NEW) - A Java tool for crawling and downloading Wikipedia documents

Xuan-Hieu Phan (pxhieu at gmail dot com), Graduate School of Information Sciences, Tohoku University

JWebPro is a Java-based Web processing toolkit that can interact with Google search via Google Web APIs and then process the returned Web documents in several ways. The outputs of JWebPro can serve as inputs for natural language processing, information retrieval, information extraction, Web data mining, online social network extraction/analysis, and ontology development applications. Currently, JWebPro includes the following features:

Interact with Google search via Google Web APIs
Web crawler that crawls on the Web to download relevant web documents (according to the index list returned by Google search)
Parsing HTML documents (using the library from Htmlparser).
Sentence boundary detection for Web documents (using a maximum entropy classifier trained on the WSJ corpus)
Word tokenization for Web documents
Part-of-speech tagging (using Conditional Random Fields - CRFTagger). Note: the CRFTagger in JWebPro differs from a part-of-speech tagger for plain natural-language text because Web documents are noisier and less grammatical. We continue to improve this over time.
Phrase chunking (also using Conditional Random Fields - CRFChunker).

In the near future, we plan to develop new features as follows:

Open-domain named entity recognition (NER) and relation (among entities) classification in Web documents.
Tracking and visualizing information about entities over time on the Web.
Building online (social) network systems to support business intelligence.

Generally speaking, this toolkit aims to provide researchers and practitioners with a convenient framework for interacting with the Web to build various kinds of applications in information retrieval, information extraction, text/Web mining, natural language processing, social network analysis, and many others. If you want to process only offline text documents, please visit JTextPro.

This project also includes JWikiDocs (NEW) - a tool for crawling and downloading Wikipedia documents. JWikiDocs provides the following features:

Breadth-first crawling and downloading of Wikipedia documents, starting from a set of initial (seed) URLs.
Removing HTML tags, navigation links, and noisy text.
Confining the crawling space by setting the maximum number of retrieved documents or the maximum hyperlink depth.
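
The breadth-first crawl with its two limits can be illustrated with a minimal sketch. This is not the JWikiDocs source code: the class name BfsCrawlSketch, the method crawl, and the parameters maxDocs and maxDepth are hypothetical names chosen for illustration, and an in-memory link graph stands in for actually fetching pages and extracting their hyperlinks. Depth is assumed to be counted in hyperlink hops from a seed, with seeds at depth 0.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical, not JWikiDocs APIs.
class BfsCrawlSketch {

    // A URL paired with its hyperlink distance from the nearest seed.
    private static final class Node {
        final String url;
        final int depth;
        Node(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // linkGraph maps a URL to the URLs it links to (assumption: a stand-in
    // for downloading the page and parsing its outgoing links).
    static List<String> crawl(Map<String, List<String>> linkGraph,
                              List<String> seeds, int maxDocs, int maxDepth) {
        List<String> retrieved = new ArrayList<>();
        Set<String> seen = new HashSet<>();          // avoids revisiting a URL
        Queue<Node> frontier = new ArrayDeque<>();   // FIFO queue gives breadth-first order

        for (String seed : seeds) {
            if (seen.add(seed)) frontier.add(new Node(seed, 0));
        }

        // Stop when the frontier is empty or the document limit is reached.
        while (!frontier.isEmpty() && retrieved.size() < maxDocs) {
            Node current = frontier.remove();
            retrieved.add(current.url);  // a real crawler would download and save here

            // Do not follow links out of pages already at the depth limit.
            if (current.depth >= maxDepth) continue;
            for (String next : linkGraph.getOrDefault(current.url, Collections.emptyList())) {
                if (seen.add(next)) frontier.add(new Node(next, current.depth + 1));
            }
        }
        return retrieved;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("D"),
            "C", List.of("D", "E"),
            "D", List.of("F"));
        System.out.println(crawl(graph, List.of("A"), 10, 1)); // limited by depth
        System.out.println(crawl(graph, List.of("A"), 4, 5));  // limited by document count
    }
}
```

In this sketch, either limit alone is enough to stop the crawl: a small maxDepth keeps the crawl close to the seed pages, while a small maxDocs caps the size of the resulting collection regardless of how far the links reach.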

JWikiDocs is useful for building Web data collections/corpora for Text/Web Data Mining and Natural Language Processing. See the README file to learn how to compile, test, and run JWikiDocs.

Download JWebPro and JWikiDocs

How to use JWebPro:

Before using JWebPro, you need a Google account together with a client key that is used to access Google Web search via Google Web APIs. To learn more about how JWebPro interacts with the Google search engine, please visit Google SOAP Search APIs. Unfortunately, Google stopped issuing new client keys for the SOAP Search APIs on December 5, 2006. As a result, only users who already have a client key can use JWebPro. See the README file in the JWebPro source code for how to provide your client key to JWebPro. Note that the current version supports only a direct connection to the Internet, i.e., your machine must be able to reach the outside world without supplying a proxy username/password. In the future, we will explore ways other than the SOAP Search API to connect to Google search.

Related links:

FlexCRFs: Flexible Conditional Random Fields
GibbsLDA++: A C/C++ and Gibbs Sampling-based Implementation of Latent Dirichlet Allocation (LDA)
CRFTagger: CRF English POS Tagger
CRFChunker: CRF English Phrase Chunker
JTextPro: A Java-based Text Processing Toolkit
JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool

Researchers using this tool for running experiments should include the following citations:

Xuan-Hieu Phan, "JWebPro: A Java-based Web Processing Toolkit", http://jwebpro.sourceforge.net/, 2006.

Xuan-Hieu Phan, "JWikiDocs: A Java-based Wikipedia Crawling Toolkit", http://jwebpro.sourceforge.net/, 2007.

We would like to thank Professor Tu-Bao Ho for providing us with the Penn Treebank data for training the POS tagging and chunking models. We would also like to thank Sourceforge.net for hosting this project.

Last updated: September 04, 2007