
Graph Summarization #9

Open
Hevia opened this issue Dec 14, 2022 · 2 comments


Hevia commented Dec 14, 2022

Implement graph summarization method similar to: https://github.com/mswellhao/PacSum

Required Tasks:

  1. Tokenize by sentence, and create Sentence nodes that connect to a Document node
  2. Add functionality to SentenceGraph to support sentence/node mapping
  3. Add previous/next sentence relations for sentences in a document
  4. Create sentence similarity relations when sentences meet a threshold (it may or may not be worth saving all edge weights)
  5. Research augmentations that would make this method suitable for multi-document summarization (MDS)
  6. Implement the PacSum extractor algorithm (this might be worth implementing in raw Neo4j as opposed to computing at the API level)

Helpful links
PACSUM extractor code: https://github.com/mswellhao/PacSum/blob/master/code/extractor.py
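Tasks 1–3 above could be prototyped in plain Python before committing to a Neo4j schema. A minimal sketch follows; note that everything in it is an assumption: the in-memory `SentenceGraph` class, its field names, and the regex-based sentence splitter are placeholders (a real implementation would use a proper tokenizer such as nltk's `sent_tokenize` or spaCy).

```python
import re
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for the SentenceGraph (tasks 1-3).
# Names and structure are placeholders, not the actual schema.
@dataclass
class SentenceGraph:
    sentences: list = field(default_factory=list)   # task 2: node id <-> sentence mapping
    next_edges: list = field(default_factory=list)  # task 3: previous/next relations

    def add_document(self, text: str) -> list:
        # Task 1: naive sentence tokenization on ., !, ? followed by whitespace;
        # a real implementation would use a proper sentence tokenizer.
        sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        start = len(self.sentences)
        self.sentences.extend(sents)
        ids = list(range(start, start + len(sents)))
        # Task 3: chain consecutive sentences with next-sentence edges.
        self.next_edges.extend(zip(ids, ids[1:]))
        return ids

g = SentenceGraph()
ids = g.add_document("First sentence. Second sentence! Third?")
print(ids)           # [0, 1, 2]
print(g.next_edges)  # [(0, 1), (1, 2)]
```

Task 4 (similarity edges) would then add weighted relations between any pair of sentence ids whose similarity clears the threshold.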


Hevia commented Dec 16, 2022

So ideally we write this using Cypher + APOC: https://github.com/neo4j-contrib/neo4j-apoc-procedures

Looks like the two functions we need to copy are:

It will be worth writing some pseudocode here; that will help narrow down the Cypher required.


Hevia commented Dec 16, 2022

Looks like this is also important: https://github.com/mswellhao/PacSum/blob/67cc8ad370eac160ede997b7c32eb74907728bf8/code/extractor.py#L107

Algorithm:

Inputs: A list of sentence nodes, beta, lambda1, lambda2

  1. Get the minimum and maximum edge weights
  2. Use those values + a provided beta value to compute the minimum edge threshold
  3. We then compute the forward and backward scores (after playing with the code, I have a better idea of how/why this works)
  4. Add each node's forward and backward scores together (multiplying each respective score by a lambda beforehand), and append the result to a list along with the associated node
  5. PacSum randomly shuffles the list to avoid any bias, sorts the list by the highest scores, and extracts the top K sentences from the shuffled/sorted list
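The five steps above can be sketched in Python roughly as follows. This is a rough reading of the linked extractor code, not a faithful copy: the similarity input `sim[i][j]` is assumed to be precomputed (the real PacSum derives it from BERT or TF-IDF representations), and the shuffle-then-sort tie-breaking mirrors step 5.

```python
import random

def pacsum_extract(sim, beta, lambda1, lambda2, k):
    """Sketch of the 5-step scoring. sim[i][j] is an assumed precomputed
    similarity for the sentence pair (i, j) with i < j."""
    n = len(sim)
    edges = [(i, j, sim[i][j]) for i in range(n) for j in range(i + 1, n)]
    weights = [w for _, _, w in edges]
    # Steps 1-2: threshold from the min/max edge weight and beta.
    lo, hi = min(weights), max(weights)
    threshold = lo + beta * (hi - lo)
    # Step 3: for each edge (i, j), the forward score of the later sentence j
    # and the backward score of the earlier sentence i accumulate the
    # thresholded edge weight.
    forward = [0.0] * n
    backward = [0.0] * n
    for i, j, w in edges:
        forward[j] += w - threshold
        backward[i] += w - threshold
    # Step 4: combine the two scores per node, each scaled by a lambda.
    paired = [(lambda1 * forward[i] + lambda2 * backward[i], i) for i in range(n)]
    # Step 5: shuffle to avoid bias among ties, sort by score, take top k.
    random.shuffle(paired)
    paired.sort(key=lambda p: p[0], reverse=True)
    return sorted(i for _, i in paired[:k])
```

With three sentences where sentence 1 is strongly similar to both neighbours, the extractor picks sentence 1 for k = 1, which matches the intuition that central sentences score highest.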

This will be relatively easy to implement in Python; my concern would be grabbing all the sentence nodes from the associated documents using Cypher.
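For that concern, one way to fetch a document's sentences from Python is via the official neo4j driver. The sketch below is only a guess at what the final graph model would require: the labels, relation, and properties (`:Document`, `:HAS_SENTENCE`, `:Sentence`, `text`, `position`) are all hypothetical placeholders for whatever schema SentenceGraph ends up using.

```python
# Hypothetical Cypher for pulling a document's sentences in order; the
# labels/relations/properties here are assumptions, not the actual schema.
SENTENCES_QUERY = """
MATCH (d:Document {id: $doc_id})-[:HAS_SENTENCE]->(s:Sentence)
RETURN s.text AS text, s.position AS position
ORDER BY s.position
"""

def fetch_sentences(session, doc_id):
    # `session` would be a neo4j.Session from the official Python driver
    # (driver.session()); session.run binds $doc_id and streams records.
    result = session.run(SENTENCES_QUERY, doc_id=doc_id)
    return [record["text"] for record in result]
```

The extractor itself could then stay in Python, with Cypher used only for this retrieval step.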
