Can CoreNLP recognize sentence fragments? #1205

julkhami · 2021-09-13T15:43:50Z

Could any CoreNLP method detect whether a string is a sentence fragment, i.e. part of a correct natural-language sentence that is incomplete on its own, as opposed to a string that is not a grammatical sentence but belongs to some other syntactic type, such as a title or chapter header?

Would it be better to do this with a rule-based approach or with a machine learning algorithm?

Thank you.

AngledLuffa (Maintainer) · 2021-09-13T15:51:19Z

This can certainly be done, for example with the ColumnDataClassifier in CoreNLP or the CNNClassifier in Stanza, but as far as I know we have put zero effort into it.

julkhami (Author) · 2021-09-14T08:56:45Z

Thank you very much. So, for ColumnDataClassifier and the CNNClassifier, do I need to train them on data, have they already been trained, or can they classify sentences without much training? Thank you, Julius

AngledLuffa (Maintainer) · 2021-09-14T17:15:44Z

I can tell you we don't already have any models that specifically detect chapter titles, for example, so you would need to train one. However, it does occur to me that the constituency parser models can distinguish between a statement, a question, and a fragment, although not with 100% accuracy, of course. You'd want to look at the top-level constituent under ROOT for S, SINV, SQ, FRAG, etc. I think it can sometimes also put a noun phrase (NP) at the top. You'd have to further figure out why a particular piece of text is only a FRAG, though, and whether it matches what you're looking for. If you're looking for ways to break down a FRAG into smaller categories, you'd still need to train that.
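A minimal sketch of that top-level check, assuming the parse is available as a bracketed string of the form `(ROOT (LABEL ...))`. The names `top_label`, `coarse_category`, and `CLAUSE_LABELS` are hypothetical helpers, not part of CoreNLP or Stanza, and including SBARQ among the clause labels is an added assumption.

```python
import re
from typing import Optional

# Clause-level Penn Treebank labels treated as "complete" here.
# S, SINV, and SQ come from the comment above; SBARQ is an added assumption.
CLAUSE_LABELS = {"S", "SINV", "SQ", "SBARQ"}

# Matches "(ROOT (" and captures the label that follows.
_TOP = re.compile(r"\(ROOT\s+\(([^\s()]+)")


def top_label(tree: str) -> Optional[str]:
    """Return the label of the first constituent directly under ROOT."""
    match = _TOP.match(tree.strip())
    return match.group(1) if match else None


def coarse_category(tree: str) -> str:
    """Map a bracketed parse to a rough bucket (hypothetical naming)."""
    label = top_label(tree)
    if label is None:
        return "unparsed"
    if label in CLAUSE_LABELS:
        return "clause"
    if label == "FRAG":
        return "fragment"
    return "other:" + label
```

For a hand-written tree such as `(ROOT (FRAG (CC and) (RB then) (DT the)))`, `coarse_category` returns `"fragment"`. Whether a real parser assigns FRAG to a given string still has to be checked on actual output.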

julkhami (Author) · 2021-10-05T07:32:48Z

Thank you.

> You'd want to look at the top level constituent under ROOT for S, SINV, SQ, FRAG, etc.

So the model outputs a parsed syntax tree, and the top-level node gives a classification of the string: "Sentence", "Inverted sentence", "Fragment", etc.? Is that right?

How should I train it? For example, should I just manually write an input and its classification - ("and then the", "fragment") - some number of times? How many? 1000? Or do I have to harvest the data from somewhere? I.e., I could think of a script that could pull in a bunch of examples of sentence fragments from somewhere.

What about something unsupervised? If I gave the algorithm a bunch of lines of text, can I nudge the unsupervised algorithm in different directions until it separates into the kinds of groups I have in mind?

Any chance we could connect via email? You've seemed very willing to help which I appreciate. I've got tons of questions and I'd really like to try my best to write some machine learning NLP scripts. Please let me know. Thanks very much.

AngledLuffa (Maintainer) · 2021-10-12T02:20:26Z

In terms of the parser, I was thinking that looking for FRAG, NP, or other non-S tags would be a good way to identify candidates. There must be a variety of different structures that would turn into those tags. There's also something similar in Stanza, which now has a constituency parser that is a bit more accurate than CoreNLP's. For example, if I parse

A dog

I get back

(ROOT (NP (DT A) (NN dog)))

Throw in a couple of Wheel of Time book titles, although that's easy mode, since they are all "Adj Noun" or "Noun of Noun":

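>>> # Pipeline setup was not shown in the original post; assumed to be along these lines.
>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,pos,constituency")
>>> doc = pipe("The Path of Daggers")
>>> doc.sentences[0].constituency
(ROOT (NP (NP (DT The) (NN Path)) (PP (IN of) (NP (NNPS Daggers)))))

>>> doc = pipe("Crossroads of Twilight")
>>> doc.sentences[0].constituency
(ROOT (NP (NP (NNPS Crossroads)) (PP (IN of) (NP (NNP Twilight)))))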
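The "Adj Noun" / "Noun of Noun" observation suggests that the part-of-speech sequence at the leaves might be a useful extra signal alongside the top-level label. Here is a rough sketch that pulls that sequence out of a bracketed parse string. `tag_sequence` is a hypothetical helper, and it assumes literal parentheses in the text are escaped (e.g. as -LRB-/-RRB-) so they don't confuse the pattern.

```python
import re
from typing import List

# A preterminal looks like "(TAG word)": a tag, one space, one token, no nested parens.
_PRETERMINAL = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tag_sequence(tree: str) -> List[str]:
    """Return the POS tags of the leaves, left to right."""
    return [tag for tag, _word in _PRETERMINAL.findall(tree)]


title = "(ROOT (NP (NP (DT The) (NN Path)) (PP (IN of) (NP (NNPS Daggers)))))"
print(" ".join(tag_sequence(title)))  # DT NN IN NNPS
```

Sequences like this could be fed to a classifier as features, or matched against a small set of hand-written patterns if a rule-based approach turns out to be good enough.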

It's not going to be perfect, of course. You could scrape a bunch of candidates for the classes you care about and train a ColumnDataClassifier (CDC) or a Stanza classifier on them. To be honest, I don't know a good number of training examples... It may also turn out that using the parser is a red herring.
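One way to prepare scraped candidates for training is to write them as tab-separated lines and hold out a test split. This is only a sketch: the label-first column order, the label names, and the helpers `to_tsv_line`, `split_examples`, and `write_tsv` are all assumptions for illustration, so the classifier would need to be configured to match whatever layout you choose.

```python
import random
from typing import Iterable, List, Tuple

Example = Tuple[str, str]  # (label, text)


def to_tsv_line(label: str, text: str) -> str:
    """One training row: the label, a tab, then the text on a single line."""
    clean = " ".join(text.split())  # collapse tabs and newlines inside the text
    return f"{label}\t{clean}"


def split_examples(
    examples: Iterable[Example], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle reproducibly and return (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def write_tsv(examples: Iterable[Example], path: str) -> None:
    """Write one example per line to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_tsv_line(label, text) + "\n")
```

A typical use would be `train, test = split_examples(rows)` followed by `write_tsv(train, "train.tsv")` and `write_tsv(test, "test.tsv")`, with `rows` holding pairs such as `("fragment", "and then the")` or `("title", "The Path of Daggers")`.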
