The contents of the module are the following:
+ pom.xml maven pom file which deals with everything related to compilation and execution of the module
+ src/ java source code of the module
+ Furthermore, the installation process, as described in the README.md, will generate another directory:
target/ it contains binary executable and other directories
Installing MSM requires the following steps:
If you already have installed in your machine JDK8 and MAVEN 3, please go to step 3 directly. Otherwise, follow these steps:
- Install JDK 1.8
If you do not install JDK 1.8 in a default location, you will probably need to configure the PATH in .bashrc or .bash_profile:
export JAVA_HOME=/yourpath/local/java18
export PATH=${JAVA_HOME}/bin:${PATH}
If you use tcsh you will need to specify it in your .login as follows:
setenv JAVA_HOME /usr/java/java18
setenv PATH ${JAVA_HOME}/bin:${PATH}
If you re-login into your shell and run the command
java -version
You should now see that your jdk is 1.8
- Install MAVEN 3
Download MAVEN 3 from
wget http://apache.rediris.es/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz
Now you need to configure the PATH. For Bash Shell:
export MAVEN_HOME=/home/myuser/local/apache-maven-3.0.5
export PATH=${MAVEN_HOME}/bin:${PATH}
For tcsh shell:
setenv MAVEN3_HOME ~/local/apache-maven-3.0.5
setenv PATH ${MAVEN3}/bin:{PATH}
If you re-login into your shell and run the command
mvn -version
You should see reference to the MAVEN version you have just installed plus the JDK 7 that is using.
- Get module source code
git clone https://github.com/Elhuyar/MSM
- Install using maven
cd MSM
mvn clean package
This step will create a directory called target/ which contains various directories and files. Most importantly, there you will find the module executable:
elh-MSM-1.4.0.jar
This executable contains every dependency the module needs, so it is completely portable as long as you have a JVM 1.8 installed.
To install the module in the local maven repository, usually located in ~/.m2/, execute:
mvn clean install
Since version 1.4.0, MSM supports connecting to subscription based digital press media, by means of selenium. User must handle those subscriptions and have valid credentials. Selenium drivers and browsers can be found at: https://www.selenium.dev/ current versions used are:
- chromedriver: inux64-104.0.5112.20
- browser: Version 105.0.5195.19 (Official Build) beta (64-bit)
- selenium-java library: 4.3.0
MSM provides 6 main funcionalities:
-
twitter: Twitter Public stream crawling.
-
feed: Syndication feed crawling (RSS, Atom, ...). Feed types supported by ROME tools (http://rometools.github.io/rome/)
-
influence: looks for the influence of a given list of sources. Klout index for twitter users and PageRank for websites. As of May 2018 Klout index is no longer available.
-
twtUser: asks Twitter for the user profiles of a given list of Twitter users and return their follower and friend information.
-
langid: Language detection for sentences. Used mainly to evaluate langid and optimaize.
-
geocode: Geocoding wrapper for several geocoding APIs (access keys needed for some of them). Given a string it returns its geolocation coordinates.
- You can get general help by calling:
java -jar MSM-1.4.0.jar -help
- twitter: Twitter Public stream crawling. Call Example:
java -jar MSM-1.4.0.jar twitter -c config.cfg -s stout 2>> MSM-twitter.log
Parameters:
java -jar target/MSM-1.4.0.jar twitter -h
usage: MSM-1.4.0.jar twitter [-h] -c CONFIG [-p {terms,users,geo,all}] [-tl] [-s {stout,db,solr}] [-cn CENSUS]
named arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Configuration file that contains the necessary parameters to connect to the twitter public stream for crawling.
-p {terms,users,geo,all}, --params {terms,users,geo,all}
Search parameters to use when connecting to the Twitter Streaming API. Parameter values areeither stored in the database or specified in the configuration file.
- "terms" : terms to track.
- "users" : users to follow.
- "geo" : bounding boxes defining geographic areas
- "all" : check for all of the previous parameters
-tl, --twitterLangid Whether the crawler shall trust twitter to filter languages or not. Default is no.
NOTE 1: languages are defined in the config file.
NOTE 2: before activating this option make sure twitter identifies all languages you are working with, especially in case of less-resourced languages.NOTE 3: even if this option is
active MSM will perform its own language identification, and leverage it with Twitter info.
-s {stout,db,solr}, --store {stout,db,solr}
Whether tweets shall be stored in a database, an Apache solr Index or printed to stdout (default).
- "stout" : standard output
- "db" : database
- "solr" : solr (not implemented yet)
-cn CENSUS, --census CENSUS
Census is used to store tweets by users in a certain geographic area. A path to a file must be given as an argument. The file contains a list of Twitter users. The users and their
tweets will be marked in the database as 'local'. The file must contain one user per line in the following format:
userId<tab>screenName[<tab>Additional Fields]
NOTE: if the value 'db' is give instead of a file path MSM will try to generate the census from the database.
Run java -jar target/MSM-1.4.0.jar (twitter|feed|influence|twtUser|langid|geocode) -help for detail
- feed: Syndication feed crawling (RSS, Atom, ...). Feed types supported by ROME tools (http://rometools.github.io/rome/)
java -jar MSM-1.4.0.jar feed -c config.cfg -u http://feeds2.feedburner.com/example
Parameters:
java -jar target/MSM-1.4.0.jar feed -h
usage: MSM-1.4.0.jar feed [-h] [-c CONFIG] [-t {press,multimedia}] [-ff FFMPEG] [-u URL] [-s {stout,db,solr}] [-cn CENSUS]
named arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Configuration file that contains the necessary parameters to connect to the feed stream for crawling.Only needed if no url is given through the --url parameter.
-t {press,multimedia}, --type {press,multimedia}
The type of feed we are dealing with.
- "press" : written digital press feeds. Standard rss feed parsers are used
- "multimedia" : feed coming from multimedia source (tv/radio) a custom parser is activated in this case.
-ff FFMPEG, --ffmpeg FFMPEG
The path to the ffmpeg executable (the directory containing ffmpeg and ffprobe binaries). Only needed if for multimedia source feeds. Default value is "/usr/bin". Change the value
to match your installation path
-u URL, --url URL URL(s) of the feed(s) we want to crawl. Feeds must be separated with ',' chars.
e.g. : java -jar MSM-1.0.jar feed -u 'url1,url2,...'
-s {stout,db,solr}, --store {stout,db,solr}
Whether mentions shall be stored in a database, an Apache solr Index or printed to stdout (default).
- "stout" : standard output
- "db" : database
- "solr" : solr (not implemented yet)
-cn CENSUS, --census CENSUS
Census is used to store mentions from sources in a certain geographic area. A path to a file must be given as an argument. The file contains a list of source ids (as stored in the
database). The sources and their mentions will be marked in the database as 'local'. The file must contain one user per line in the following format:
userId<tab>sourceName[<tab>Additional Fields]
Run java -jar target/MSM-1.4.0.jar (twitter|feed|influence|twtUser|langid|geocode) -help for details
- influence: looks for the influence of a given list of sources. Klout index for twitter users and PageRank for websites.
java -jar MSM-1.4.0.jar influence -c config.cfg -db 2>> MSM-Influence.log
Parameters:
java -jar target/MSM-1.4.0.jar influence -h
usage: MSM-1.4.0.jar influence [-h] [-s SOURCES] -c CONFIG [-db] [-t {twitter,feed,all}] [-w {null,error,all}] [-l LIMIT]
named arguments:
-h, --help show this help message and exit
-s SOURCES, --sources SOURCES
web domain (pageRank) or twitter screen name (KloutIndex) to look for its influence for.Many sources may be introduced separated by ',' char.If you want to retrieve sources from
the database left this option empty or use the 'db' value
-c CONFIG, --config CONFIG
Configuration file that contains the necessary parameters to connect to Influence APIs and Database you want to interact with the database
-db, --database Whether influences shall be stored in a database or printed to stdout (default). Database parameters must be given in the configuration file.
-t {twitter,feed,all}, --type {twitter,feed,all}
type of the sources to look for its influence for: - "twitter" : sources are twitter user screen names
- "domain" : sources are web domains
- "all" : sources are mixed (default) system will detect the source type for each source
-w {null,error,all}, --which {null,error,all}
which sources to look for its influence for (only for database interaction):
- "null" : sources that have not been yet processed at all
- "error" : sources that have been processed but no influence could be retrieved
- "all" : all sources.
WARNING: this will override values in the database.
WARNING2:Depending on the number of sources in the database this could take a very long time.
-l LIMIT, --limit LIMIT
limit the number of sources processed in the execution (only for database interaction): default is 500
--limit = 0 means no limit is established, and thus the command will atempt to process all sources found in the db (not processed yet).
This parameter is important not to exceed the twitter api rate limits. Increase it at your own risk.
Run java -jar target/MSM-1.4.0.jar (twitter|feed|influence|twtUser|langid|geocode) -help for details
- twtUser: asks Twitter for the user profiles of a given list of Twitter users and return their follower and friend information.
java -jar MSM-1.4.0.jar twtUser -h
usage: MSM-1.4.0.jar twtUser [-h] -c CONFIG [-s {stout,db,solr}] [-l LIMIT] [-o]
named arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Configuration file that contains the necessary parameters to connect to the twitter REST API 1.1.
-s {stout,db,solr}, --store {stout,db,solr}
Whether tweets shall be stored in a database, an Apache solr Index or printed to stdout (default).
- "stout" : standard output
- "db" : database
- "solr" : solr (not implemented yet)
-l LIMIT, --limit LIMIT
limit the number of users processed in the execution (only for database interaction): default is 500
--limit = 0 means no limit is established, and thus the command will atempt to process all sources found in the db (not processed yet).
This parameter is important depending on the number of APIs you have available and your usage rate limits.
-o, --onlyffCount If this flag is active only follower and friend info will be returned, but no follower and friends lists. Returning only follower and friends count is much faster because of the
higher rate limit of the API.
- langid: Language detection for sentences. Used mainly to evaluate langid and optimaize.
java -jar MSM-1.4.0.jar langid -h
usage: MSM-1.4.0.jar langid [-h] [-a {langid,optimaize}] -s STRINGS [-l LANGS] [-tl] [-o] [-t {twitter,longtext}] [-tr] [-c CONFIDENCETHRESHOLD] [-lc]
named arguments:
-h, --help show this help message and exit
-a {langid,optimaize}, --algorithm {langid,optimaize}
algorithm to use for language identification, optimaize or langid. default is langid.
-s STRINGS, --strings STRINGS
string to look for its language, or file containing strings. The language detection unit is the whole fileMany locations may be introduced separated by '::' string (semicolon may
be used inside the location string, that is why they are not used as separators).
-l LANGS, --langs LANGS
list of accepted langs. Use iso-639 codes separated by commas (e.g. --langs=es,eu,en,fr) Default is 'eu,es,en,fr'.
NOTE 1: before activating this option make sure twitter identifies all languages you are working with, especially in case of less-resourced languages.
NOTE 2: even if this option is active MSM will perform its own language identification, and leverage it with Twitter info.
-tl, --twitterLangid Whether the crawler shall trust twitter to filter languages or not. Default is no.
NOTE 1: before activating this option make sure twitter identifies all languages you are working with, especially in case of less-resourced languages.
NOTE 2: even if this option is active MSM will perform its own language identification, and leverage it with Twitter info.
-o, --onlySpecificLanguageProfiles
Do not load all language profiles, only those specified in --langs argument.
-t {twitter,longtext}, --type {twitter,longtext}
which type of texts are we dealing with:
- "twitter" : microbloging messages or short messages
- "longtext" : paragraphs or longer sentences
WARNING: Nothing.
-tr, --train train niew model with the given files in the --strings parameter
WARNING: langs argument value is used as the language name to store the new language profile. WARNING: type is sued to generate short or standard text profile WARNING: profile
is stored in the same place of the input file, with the lang name and "ld_profile" string. e.g. input_es_ldprofile
-c CONFIDENCETHRESHOLD, --confidenceThreshold CONFIDENCETHRESHOLD
Confidence threshold for language identification:
If no candidate achieves the required threshold 'unk' is returned.
If more than one candidate achieves the required threshold the one with the highest probability is returned.
-lc, --lowerCase Convert everything to lower case before doing language identification.
- geocode: Geocoding wrapper for several geocoding APIs (access keys needed for some of them). Given a string it returns its geolocation coordinates.
java -jar MSM-1.4.0.jar geocode -h
usage: MSM-1.4.0.jar geocode [-h] [-s SOURCES] -c CONFIG [-db] [-t {twitter,feed,all}] [-w {unknown,error,all}] [-a {mapquest,mapquest-open,openstreetmaps,googlemaps,LocationIQ,OpenCage,all}] [-l LIMIT]
named arguments:
-h, --help show this help message and exit
-s SOURCES, --sources SOURCES
location to look for its geo-coordinate info.Many locations may be introduced separated by '::' string (semicolon may be used inside the location string, that is why they are not
used as separators).If you want to retrieve sources from the database left this option empty or use the 'db' value
-c CONFIG, --config CONFIG
Configuration file that contains the necessary parameters to connect to the corresponding geocoding service APIsand Database you want to interact with the database
-db, --database Whether influences shall be stored in a database or printed to stdout (default). Database parameters must be given in the configuration file.
-t {twitter,feed,all}, --type {twitter,feed,all}
type of the sources to look for its geolocation for: - "twitter" : sources are twitter user screen names
- "domain" : sources are web domains
- "all" : sources are mixed (default) system will detect the source type for each source
-w {unknown,error,all}, --which {unknown,error,all}
which sources to look for its influence for (only for database interaction):
- "unknown" : sources that have not been yet processed at all
- "error" : sources that have been processed but no geolocation could be retrieved
- "all" : all sources.
WARNING: this will override values in the database.
WARNING1:Depending on the number of sources in the database this could take a very long time. Also be careful about your API key rate limits!
-a {mapquest,mapquest-open,openstreetmaps,googlemaps,LocationIQ,OpenCage,all}, --api {mapquest,mapquest-open,openstreetmaps,googlemaps,LocationIQ,OpenCage,all}
Geocoding is by default multi API, using all the APIs for which the user obtains keys (they must be specified in the config file).
Through this parameter the user may specify a single API to use for geocoding.
BEWARE to set --limit option according to your usage rate limits.
-l LIMIT, --limit LIMIT
limit the number of sources processed in the execution (only for database interaction): default is 1000
--limit = 0 means no limit is established, and thus the command will atempt to process all sources found in the db (not processed yet).
This parameter is important depending on the number of APIs you have available and your usage rate limits.
You can also generate the javadoc of the module by executing:
mvn javadoc:jar
Which will create a jar file core/target/elh-MSM-1.4.0-javadoc.jar
Iñaki San Vicente and Xabier Saralegi
Elhuyar Foundation
{i.sanvicente,x.saralegi}@elhuyar.eus