USCDataScience/nutch-analytics

Nutch-Analytics

This is an Apache Spark-based project for analyzing crawls generated by Apache Nutch. The project is still in incubation; for now, its only feature is the CDRv2 dump.

The vision is to continue developing analytical features for Nutch using Spark. This will also intersect with awesome concepts like Machine Learning and Natural Language Processing.

Build and Deploy

mvn clean install

Run Analytics

java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump -m local[*] -s PATH_TO_SEGMENT_FOLDER -o OUTPUT_FILE -l PATH_TO_LINK_DB
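
As a minimal sketch of how the placeholders in the command above might be filled in, the snippet below prints a complete invocation for review. The three paths are hypothetical and only illustrate the shape of the arguments; the `dump_cmd` helper is likewise an illustration, not part of the project. The `-m` value looks like a Spark master URL (`local[*]` runs Spark locally with one worker thread per core), and quoting it keeps shells such as zsh from treating the brackets as a glob pattern. The sketch also assumes it is run from the directory that contains `analytics-1.0.jar`; Maven normally writes build output under `target/`.

```shell
# Hypothetical locations: replace these with your own crawl paths.
SEGMENT_DIR="crawl/segments"    # assumption: folder holding the Nutch segments
LINK_DB="crawl/linkdb"          # assumption: Nutch linkdb folder
OUTPUT_FILE="cdrv2-dump.out"    # assumption: illustrative output file name

# Print the full command so it can be checked before running.
# 'local[*]' is single-quoted so the shell passes it through literally.
dump_cmd() {
  echo java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump \
    -m 'local[*]' -s "$SEGMENT_DIR" -o "$OUTPUT_FILE" -l "$LINK_DB"
}

dump_cmd
```

Once the printed command looks right, removing the `echo` inside `dump_cmd` runs the dump itself.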

Contact Us

If you have any questions or suggestions, please send them to irds-l@mymaillists.usc.edu

Website: http://irds.usc.edu
