Future Work

akshay31057 edited this page Jul 2, 2017 · 1 revision

1. Sharding the event_log database backup using Python:

  • Presently, the event_log_backup sub-module lets the administrator and privileged users create a backup file containing the current database records. This feature can be extended with automatic sharding, which partitions the backup and prevents the creation of very large backup files, making them easier to manage.
  • As of now, records are stored in a CSV file on the cloud using the implementation of Feature 5 as mentioned above. If the administrator needs to limit each worksheet to x records, this can be done by switching to a new worksheet once the required number of records is reached.
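
The sharding idea above could be sketched as follows. This is a minimal illustration, not the sub-module's actual code: the function name `write_sharded_backup`, the `max_rows` limit, and the shard file naming are all assumptions, and the records are taken as an in-memory iterable rather than read from the real event_log table.

```python
import csv
from itertools import islice
from pathlib import Path


def write_sharded_backup(records, header, out_dir, max_rows=1000,
                         prefix="event_log_backup"):
    """Write records to numbered CSV shards of at most max_rows rows each.

    All names here are hypothetical. Returns the list of shard paths written.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    rows = iter(records)
    paths = []
    while True:
        # Pull the next chunk lazily so the full backup never sits in memory.
        chunk = list(islice(rows, max_rows))
        if not chunk:
            break
        path = out_dir / f"{prefix}_{len(paths) + 1:04d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)  # repeat the header so each shard stands alone
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

Called with, say, a hypothetical header `("id", "event", "timestamp")` and `max_rows=2`, five records would be split across three files. Capping by row count is the simplest policy; capping by file size would work the same way with a different stopping condition.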

2. Using the PageRank algorithm to visualize the pages with the most hits:

  • The PageRank algorithm, developed by Larry Page and Sergey Brin, can be implemented to display the pages with the most hits with the help of a force-directed graph.
  • The force-directed graph uses the page ranks to determine the diameter of each node.
  • Besides the node diameters, the links between nodes in the graph correspond to the pages that are reachable from the page under consideration.
  • Firstly, a start URL is used as the first node (generally the homepage). The handle for this URL is fetched using the urllib library, and the source code of the page is then extracted and stored in the database. Subsequent URLs are then extracted from the href attributes in the source code and stored in the Pages table. (Note: The database “urldb” has three tables, namely “Pages”, “Webs” and “Links”.) This functionality is coded in the spider.py file.
  • Secondly, the data extracted from the URLs is dumped into the database using the spdump.py file; this data is later used to create the JSON file required for visualisation.
  • Thirdly, the retrieved pages are ranked using the PageRank algorithm: a new rank is calculated for each page and replaces the old one (the initial rank of 1.0). This functionality is implemented in the sprank.py file.
  • Finally, the ranked pages are visualized by creating a JSON file of the pages and their corresponding ranks. Visualization is done using d3.v2.min.js and the force-directed graph code that is readily available on d3js.org. This functionality is implemented in the spjson.py file. The visualization can be accessed by opening the force.html file.
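
The ranking and export steps above can be illustrated with a short sketch. It is not the code in sprank.py or spjson.py: it uses the standard damped power-iteration form of PageRank with ranks that sum to 1 (rather than starting each page at 1.0), reads links from a plain dictionary instead of the urldb tables, and the function names and JSON field names (`id`, `rank`, `size`, `source`, `target`) are assumptions that would have to match whatever force.html expects.

```python
import json


def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    if n == 0:
        return {}

    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


def to_force_graph(links, ranks, min_size=4.0, max_size=20.0):
    """Build a nodes/links structure, scaling node size linearly with rank."""
    if not ranks:
        return {"nodes": [], "links": []}
    pages = sorted(ranks)
    index = {page: i for i, page in enumerate(pages)}
    lowest, highest = min(ranks.values()), max(ranks.values())
    span = (highest - lowest) or 1.0  # avoid dividing by zero when all ranks match
    nodes = [
        {
            "id": page,
            "rank": ranks[page],
            "size": min_size + (max_size - min_size) * (ranks[page] - lowest) / span,
        }
        for page in pages
    ]
    edges = [
        {"source": index[source], "target": index[target]}
        for source in sorted(links)
        for target in links[source]
    ]
    return {"nodes": nodes, "links": edges}


if __name__ == "__main__":
    # Hypothetical three-page site used only for illustration.
    site = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
    print(json.dumps(to_force_graph(site, pagerank(site)), indent=2))
```

In this toy site the homepage receives links from both other pages, so it ends up with the highest rank and therefore the largest node.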

3. Extension of the experimental feature of logging data on the cloud:

  • Currently, this feature allows the user to log data to a spreadsheet on the cloud.
  • However, to use this feature the Python script has to be executed explicitly. The script cannot be executed through the Drupal sub-module because of import issues.
  • As future work, a solution has to be found to execute the Python script through Drupal.