Nutch-Analytics

Nutch-Analytics is an Apache Spark-based project for analyzing crawls generated by Apache Nutch. The project is at an early stage, and the CDRv2 dump is the only feature available for now.

The vision is to keep developing analytical features for Nutch using Spark. This work will also intersect with areas such as machine learning and natural language processing.

Build and Deploy

mvn clean install

Run Analytics

java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump -m 'local[*]' -s PATH_TO_SEGMENT_FOLDER -o OUTPUT_FILE -l PATH_TO_LINK_DB
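
The command above can be wrapped in a small script so that each argument has a name. This is a minimal sketch, not part of the project: the variable names, the `run_cdrv2_dump` function, and the `DRY_RUN` switch are hypothetical, and the flag descriptions in the comments are inferred from the placeholder names in the command (`local[*]` is the Spark master URL for local mode using all available cores). The master value is kept in double quotes because some shells, zsh among them, try to expand an unquoted `[*]` as a filename pattern.

```shell
#!/bin/sh
# Hypothetical wrapper around the Cdrv2Dump command shown above.
# Replace the placeholder values with real paths before running.

JAR="analytics-1.0.jar"                             # jar name as given in the command above
MAIN_CLASS="gov.nasa.jpl.analytics.dump.Cdrv2Dump"  # main class from the command above
MASTER="local[*]"                                   # -m: assumed to be the Spark master URL
SEGMENTS="PATH_TO_SEGMENT_FOLDER"                   # -s: placeholder for the Nutch segment folder
OUTPUT="OUTPUT_FILE"                                # -o: placeholder for the output file
LINKDB="PATH_TO_LINK_DB"                            # -l: placeholder for the Nutch link DB

run_cdrv2_dump() {
    # Build the argument list once so printing and running use the same words.
    set -- java -cp "$JAR" "$MAIN_CLASS" \
        -m "$MASTER" -s "$SEGMENTS" -o "$OUTPUT" -l "$LINKDB"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default in this sketch): only print the command.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

run_cdrv2_dump
```

With the defaults the script only prints the assembled command. Setting `DRY_RUN=0` in the environment makes it invoke `java` with the same arguments.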

Contact Us

If you have any questions or suggestions, please send them to irds-l@mymaillists.usc.edu

Website: http://irds.usc.edu
