Synopsis
Apache Nutch is a robust, scalable web crawling tool that can be integrated with the Python scripting language. It is a good fit whenever your application deals with large volumes of data that need to be crawled.
This chapter introduces Apache Nutch and its installation, and guides you through crawling, parsing, and creating plugins with it. It starts with the basics of installing Apache Nutch, then gradually moves on to crawling a website and creating your own plugin.
Contents
- Introduction to Apache Nutch
- Installing and configuring Apache Nutch
- Crawling your website using the crawl script
- Crawling the Web, the CrawlDb, and URL filters
- Parsing and parse filters
- The Apache Nutch plugin
- Understanding the Nutch Plugin architecture
- Deployment, Sharding, and AJAX Solr with Apache Nutch
- Deployment of Apache Solr
- Sharding using Apache Solr
- Working with AJAX Solr
- Integration of Apache Nutch with Apache Hadoop and Eclipse
- Integrating Apache Nutch with Apache Hadoop
- Configuring Apache Nutch with Eclipse
- Apache Nutch with Gora, Accumulo, and MySQL
- Introduction to Apache Accumulo
- Introduction to Apache Gora
- Use of Apache Gora
- Integration of Apache Nutch with Apache Accumulo
- Integration of Apache Nutch with MySQL