NAME
Lingua::EN::Sentence - Module for splitting text into sentences.
SYNOPSIS
use Lingua::EN::Sentence;
add_acronyms(' Lt Gen'); ## adding support for 'Lt. Gen.'
my $text = Q[First sentence with some abbreviations, Mr. J. Smith, 2 Jones St. SomeTown Ariz. U.S.A. is an address.
Sentence 2: Sequences like ellipsis ... are handled. Sentence 3, numbered sections such as point 1. are ok.];
my @sentences = $text.sentences;
for @sentences -> $sent {
    say $sent;
}
Output is:
First sentence with some abbreviations, Mr. J. Smith, 2 Jones St. SomeTown Ariz. U.S.A. is an address.
Sentence 2: Sequences like ellipsis ... are handled.
Sentence 3, numbered sections such as point 1. are ok.
DESCRIPTION
The Lingua::EN::Sentence module provides the sentences method, which splits text into its constituent sentences using regular expressions, a list of abbreviations (built-in and user-supplied) and other rules.
Certain well-known exceptions, such as the abbreviations Mr., Calif. and Ave., would otherwise cause incorrect segmentation. Many of these are already built into the module and are handled correctly. Note that abbreviations are case-sensitive.
The add_acronyms function allows you to add custom abbreviations.
ALGORITHM
Before any regex processing, quotations are hidden away; they are reinserted after the sentences have been split. This means that no sentence splitting is attempted between pairs of double quotes. Common cases of full stops that do not denote the end of a sentence are also hidden. These include the dot after the abbreviations mentioned above, acronyms and ellipses.
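The hiding step can be pictured with a small self-contained sketch. This is only an illustration of the idea, not the module's code: the names hide-quotes and restore-quotes, and the placeholder format (an index wrapped in two control characters), are assumptions made for this example.

```raku
# Illustrative sketch only; the helper names and placeholder format are assumptions.
my $OPEN  = "\x[2]";    # hypothetical placeholder delimiters
my $CLOSE = "\x[3]";

sub hide-quotes(Str $text, @stash) {
    # Replace each "..." span with a numbered placeholder and remember the original.
    $text.subst(/ '"' <-["]>* '"' /, -> $m {
        @stash.push(~$m);
        $OPEN ~ @stash.end ~ $CLOSE
    }, :g)
}

sub restore-quotes(Str $text, @stash) {
    # Put each remembered span back in place of its placeholder.
    $text.subst(/ $OPEN (\d+) $CLOSE /, -> $m { @stash[+$m[0]] }, :g)
}

my @stash;
my $hidden = hide-quotes('He said "Stop. Now." and left.', @stash);
say restore-quotes($hidden, @stash);
```

While the quotation is hidden, the full stops inside it are invisible to any later splitting pass, which is why no split happens between a pair of double quotes.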
Basically, I use a 'brute' regular expression to split the text into sentences. (Well, nothing is split yet; I just mark the end of each sentence.) Then I apply a set of rules that decide when an end-of-sentence mark is justified and when it is a mistake. In case of a mistake, the end-of-sentence mark is removed.
What are such mistakes? Cases of abbreviations, for example. I have a list of such abbreviations (please see the "Acronym/Abbreviations list" section), and more general rules (for example, the abbreviations 'i.e.' and 'e.g.' need not be in the list, as a special rule takes care of all single-letter abbreviations).
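The mark-then-unmark approach can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the module's implementation: the marker value, the four-entry abbreviation list and the name split-sketch are all made up for the example, and only the abbreviation rule is shown.

```raku
# Illustrative sketch only: the marker and the abbreviation list are
# assumptions, not the module's actual data.
my $EOS = "\x[1]";                      # hypothetical end-of-sentence marker
my @abbreviations = <Mr Dr St Ariz>;    # hypothetical, case-sensitive list

sub split-sketch(Str $text) {
    # Step 1: "brute" pass. Mark every '.', '!' or '?' followed by whitespace.
    my $marked = $text.subst(/ (<[.!?]>) \s+ /, -> $m { $m[0] ~ $EOS }, :g);

    # Step 2: a mark right after a listed abbreviation is a mistake, so
    # replace it with the space it displaced.
    for @abbreviations -> $abbr {
        $marked = $marked.subst(/ << $abbr '.' $EOS /, $abbr ~ '. ', :g);
    }

    # Step 3: split on the surviving marks and trim each piece.
    return $marked.split($EOS).map(*.trim).grep(*.chars).List;
}

.say for split-sketch('Mr. Smith lives on Jones St. in SomeTown. He is well.');
```

Marking first and correcting afterwards keeps the first regex simple; each exception then becomes a small, independent rule that removes a mark.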
FUNCTIONS
$text.sentences
A very convenient extension to the Perl 6 Str string type, the .sentences method allows us to natively request the sentences in a string, similarly to the Str "words" method.
The sentences method is called on a Str containing the text and returns an array of the sentences that the text has been split into.
Leading and trailing whitespace is trimmed from each returned sentence.
Strings with no alphanumeric characters in them won't be returned as sentences.
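The trimming and filtering described above can be illustrated with plain Str methods. The sketch below does not call the module; the candidate strings are invented, and treating "alphanumeric" as "any Unicode letter or number" is an assumption made for the example.

```raku
# Illustrative sketch; the candidate strings are made up for the example.
my @candidates = '  First sentence.  ', ' ... ', "\tSecond one!\n", '?!';

# Trim each candidate, then keep only those containing a letter or a number.
my @sentences = @candidates.map(*.trim).grep(/ <:L + :N> /);

.say for @sentences;
```

Here the candidates ' ... ' and '?!' are dropped because nothing alphanumeric is left in them after trimming.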
add_acronyms( @acronyms )
This function adds acronyms that are not yet supported by this code. Please see the "Acronym/Abbreviations list" section for the abbreviations already supported by this module.
get_acronyms()
This function will return the defined list of acronyms.
set_acronyms( @my_acronyms )
This function replaces the predefined acronym list with the given list.
get_EOS()
This function returns the value of the string used to mark the end of sentence. You might want to see what it is, and to make sure your text doesn't contain it. You can use set_EOS() to alter the end-of-sentence string to whatever you desire.
set_EOS( $new_EOS_string )
This function alters the end-of-sentence string used to mark the end of sentences.
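One way to make sure the marker never collides with your text is to test a few candidates before choosing one. The helper below is a hypothetical sketch (pick-marker is not part of this module, and the candidate markers are invented); it only uses core Str and list methods.

```raku
# Illustrative helper (hypothetical name): pick the first candidate marker
# that does not occur anywhere in the text.
sub pick-marker(Str $text, @candidates) {
    @candidates.first({ !$text.contains($_) })
}

my $text   = "Some text that happens to contain \x[1] already.";
my $marker = pick-marker($text, ["\x[1]", "\x[1]\x[2]\x[1]", '<<EOS>>']);

# With the module loaded, the chosen marker could then be installed with
# set_EOS($marker), but only when $text.contains(get_EOS()) is true.
say $marker.ords;
```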
Acronym/Abbreviations list
The list has become too long to reproduce in the documentation. Use the get_acronyms() function to retrieve it.
If I come across a good general-purpose list, I'll incorporate it into this module. Feel free to suggest such lists.
Limitations
Some valid cases cannot be detected. For example, "This belongs to John A. Smith" will break after "A.", because it cannot be distinguished from a valid sequence such as "so said I. Next sentence." Conversely, a sentence that ends in an abbreviation, such as "St.", does not cause a split.
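The first ambiguity is easy to see at the pattern level. The sketch below is an illustration, not one of the module's rules: the pattern and the name looks-ambiguous are assumptions. It shows that an initial inside a name and a sentence ending in "I." have exactly the same surface shape.

```raku
# Illustrative pattern (an assumption, not the module's rule): a single
# capital letter, a full stop, whitespace, then a capitalised word.
my $pattern = rx/ << <:Lu> '.' \s+ <:Lu> /;

sub looks-ambiguous(Str $text) { so $text ~~ $pattern }

for 'This belongs to John A. Smith', 'so said I. Next sentence.' -> $text {
    say $text, ' => ', looks-ambiguous($text);
}
```

Both inputs match, so a purely pattern-based rule has to treat them the same way.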
AUTHOR
Deyan Ginev, 2013. Kim Ryan, 2023
Perl5 CPAN author: Shlomo Yona (shlomo@cs.haifa.ac.il)
Released under the same terms as Perl 6; see the LICENSE file for details.