Replies: 5 comments
-
It is certainly something that can be done, for example with the
ColumnDataClassifier in CoreNLP or the CNNClassifier in Stanza, but as far
as I know we have put zero effort into it.
On Mon, Sep 13, 2021 at 8:44 AM julkhami ***@***.***> wrote:
> Does anyone think any CoreNLP methods could detect if a string seems like
> a fragment of a sentence, i.e. it forms part of a correct natural language
> sentence but it is only part of it, hence incomplete - vs a string which
> may not be a grammatically correct sentence but is some other syntactic
> type, for example, a title or chapter header?
>
> Would it be better to do this with a rule-based approach or with a machine
> learning algorithm?
>
> Thank you.
-
Thank you very much.
So, for the ColumnDataClassifier and the CNNClassifier, do I need to train
them on data, have they already been trained, or can they classify
sentences without needing much training?
Thank you,
Julius
(In reply to John Bauer's message of Mon, Sep 13, 2021 at 5:51 PM.)
-
I can tell you that we don't have any existing models that specifically
detect chapter titles, for example, so you would need to train one.
However, it does occur to me that the constituency parser models can
distinguish between a statement, a question, and a fragment, although not
with 100% accuracy, of course. You'd want to look at the top-level
constituent under ROOT for S, SINV, SQ, FRAG, etc. I think it can sometimes
put a noun phrase (NP) at the top as well. You'd still have to figure out
why a particular piece of text is only a FRAG, though, and whether it
matches what you're looking for. If you want to break FRAG down into
smaller categories, you'd need to train a model for that.
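To make the "top-level constituent under ROOT" check concrete, here is a minimal sketch in Python. It assumes you already have the parser's output as a Penn Treebank-style bracketed string; the function names, the label set, and the example trees are all hypothetical and hand-written for illustration, not real parser output.

```python
import re

# Labels treated as "looks like a full clause". S, SINV and SQ come from the
# discussion above; SBARQ (wh-question) is an extra assumption added here.
SENTENCE_LABELS = {"S", "SINV", "SQ", "SBARQ"}


def top_level_label(bracketed_tree: str) -> str:
    """Return the label of the first constituent directly under ROOT.

    If the tree has no ROOT wrapper, the outermost label is returned.
    """
    # A label is the token that immediately follows an opening parenthesis.
    labels = re.findall(r"\(\s*([^\s()]+)", bracketed_tree)
    if not labels:
        raise ValueError("no constituent labels found")
    if labels[0] == "ROOT":
        if len(labels) < 2:
            raise ValueError("ROOT has no children")
        return labels[1]
    return labels[0]


def looks_like_full_sentence(bracketed_tree: str) -> bool:
    """True when the top-level constituent is one of the clause labels."""
    return top_level_label(bracketed_tree) in SENTENCE_LABELS


if __name__ == "__main__":
    # Hand-written example trees, not actual parser output.
    print(top_level_label("(ROOT (FRAG (CC and) (RB then) (DT the)))"))
    print(looks_like_full_sentence("(ROOT (NP (DT The) (NN Eye)))"))
```

Anything that comes back as FRAG, NP, or another non-clause label would only be a candidate; deciding whether it is a sentence fragment or a title would still need a further step.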
-
Thank you.
So the model outputs a parse tree, and the top-level node gives a classification of the string: "Sentence", "Inverted sentence", "Fragment", etc. Is that right?
How should I train it? For example, should I manually write inputs with their classifications, such as ("and then the", "fragment"), some number of times? How many? 1000? Or do I have to harvest the data from somewhere? For instance, I could imagine a script that pulls in a bunch of examples of sentence fragments.
What about something unsupervised? If I gave the algorithm a bunch of lines of text, could I nudge it in different directions until it separates them into the kinds of groups I have in mind?
Any chance we could connect via email? You've seemed very willing to help, which I appreciate. I've got tons of questions and I'd really like to try my best to write some machine learning NLP scripts. Please let me know. Thanks very much.
-
In terms of the parser, I was thinking that looking for FRAG, NP, or other non-S tags would be a good way to identify candidates. There must be a variety of different structures that end up with those tags. There's also something similar in Stanza, which now has a constituency parser that is a bit more accurate than CoreNLP's. For example, if I parse
I get back
Throw in a couple Wheel of Time book titles, although that's easy mode since they were all
It's not going to be perfect, of course. You could scrape a bunch of candidates for the classes you care about and train a ColumnDataClassifier (CDC) or a Stanza classifier on them. To be honest, I don't know what a good number of examples would be. It may also turn out that using the parser is a red herring.
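As a rough sketch of the "scrape candidates and train a classifier" step: as far as I know, the ColumnDataClassifier reads training data as tab-separated columns, with the column that holds the label chosen in its properties file. The helper below turns (text, label) pairs into such lines. It assumes the label goes in the first column and the text in the second; the function name, label names, and example strings are hypothetical.

```python
def to_tsv_lines(examples):
    """Turn (text, label) pairs into 'label<TAB>text' lines.

    Tabs, newlines, and repeated spaces inside the text are collapsed to
    single spaces so that each example stays on one line with two columns.
    Examples whose text is empty after cleaning are skipped.
    """
    lines = []
    for text, label in examples:
        clean = " ".join(text.split())
        if not clean:
            continue
        lines.append(f"{label}\t{clean}")
    return lines


if __name__ == "__main__":
    # Hypothetical hand-labelled examples; real data would be scraped.
    examples = [
        ("and then the", "fragment"),
        ("The Eye of the World", "title"),
        ("She closed the book.", "sentence"),
    ]
    with open("train.tsv", "w", encoding="utf-8") as handle:
        handle.write("\n".join(to_tsv_lines(examples)) + "\n")
```

Keeping the data in a plain two-column file like this also makes it easy to try a different classifier later if the first one does not work out.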
-
Does anyone think any CoreNLP methods could detect if a string seems like a fragment of a sentence, i.e. it forms part of a correct natural language sentence but it is only part of it, hence incomplete - vs a string which may not be a grammatically correct sentence but is some other syntactic type, for example, a title or chapter header?
Would it be better to do this with a rule-based approach or with a machine learning algorithm?
Thank you.