NAME

    Text::HumanComputerWords - Split human and computer words in a
    naturalish manner

VERSION

    version 0.04

SYNOPSIS

     use Text::HumanComputerWords;
     
     my $hcw = Text::HumanComputerWords->new(
       Text::HumanComputerWords->default_perl,
     );
     
     my $text = "this is some text with a url: https://metacpan.org, "
              . "a unix path name: /usr/local/bin "
              . "and a windows path name: c:\\Windows";
     
     foreach my $combo ($hcw->split($text))
     {
       my($type, $word) = @$combo;
       if($type eq 'word')
       {
         # $word is a regular human word
         # this, is, some, etc.
       }
       elsif($type eq 'module')
       {
         # $word looks like a module
       }
       elsif($type eq 'url_link')
       {
         # $word looks like a URL
         # https://metacpan.org,
       }
       elsif($type eq 'path_name')
       {
         # $word looks like a windows or unix filename
         # /usr/local/bin
         # c:\\Windows
       }
     }

DESCRIPTION

    This module extracts human and computer words from text. This is useful
    for checking the validity of these words. Human words can be checked
    for spelling, while "computer" words like URLs can be validated by
    other means. URLs for example could be checked for 404s and module
    names could be checked against a module registry like CPAN.

    The algorithm works as follows (a short sketch follows the list):

    1. The text is split on whitespace (/\s/) into fragments

      A fragment could be either a single computer word, like a URL or a
      module name, or one or more human words. If a fragment doesn't
      contain any word characters (/\w/), it is skipped entirely.

    2. If the fragment is recognized as a computer word, we are done.

      Computer words can be defined any way you want. The default_perl
      method below is reasonable for Perl technical documentation.

    3. Otherwise, the fragment is split into words on the Unicode word
    boundary /\b{wb}/

      After the split, words are identified as those containing word
      characters (/\w/).
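
    For example, with the default_perl rules the URL fragment from the
    SYNOPSIS is reported whole as a single computer word, while ordinary
    fragments are broken into their individual human words. A minimal
    sketch (the exact pairs reported depend on the rules passed to "new"):

     my $hcw = Text::HumanComputerWords->new(
       Text::HumanComputerWords->default_perl,
     );
     
     foreach my $combo ($hcw->split("pretty good: https://metacpan.org"))
     {
       my($type, $word) = @$combo;
       # likely pairs, in input order:
       #   word      pretty
       #   word      good
       #   url_link  https://metacpan.org
     }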

CONSTRUCTOR

 new

     my $hcw = Text::HumanComputerWords->new(@cpu);

    Creates a new instance of the splitter class. The @cpu pairs let you
    specify the logic for identifying "computer" words. The keys are the
    type names and the values are code references that identify those
    words. The following types are special and reserved:

    skip

       Text::HumanComputerWords->new(
         skip => sub ($word) {
           # return true if $word should be skipped entirely
         },
       );

      This is a code reference which should return true if the $word
      should be skipped entirely. The default skip code reference always
      returns false.
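
      For example, a hedged sketch of a skip rule that ignores anything
      that is entirely numeric (this rule is made up for illustration and
      is not part of the defaults):

       Text::HumanComputerWords->new(
         skip => sub ($word) { $word =~ /^[0-9]+$/ },
         Text::HumanComputerWords->default_perl,
       );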

    substitute

       Text::HumanComputerWords->new(
         substitute => sub {
           # the value is passed in as $_ and can be modified
         },
       );

      This allows you to substitute the current word. The main intent here
      is to support splitting CamelCase and snakeCase into separate words,
      so that they can be checked as human words. Example:

       Text::HumanComputerWords->new(
         substitute => sub {
           # this should split both CamelCase and snakeCase
           s/([A-Z]+)/ $1/g if /^[a-z]+$/i && lcfirst($_) ne lc $_;
         },
       ),

    word

       Text::HumanComputerWords->new(
         word => sub ($word) {},  # error
       );

      The word type is reserved for human words, and cannot be overridden.

    The order of the pairs matters, and a type can be specified more than
    once. If a given computer word matches multiple types, it will only be
    reported as the first type that matches. Example:

     Text::HumanComputerWords->new(
       foo_or_bar => sub ($word) { $word eq 'foo' },
       foo_or_bar => sub ($word) { $word eq 'bar' },
     );

METHODS

 default_perl

     my @cpu = Text::HumanComputerWords->default_perl;

    Returns computer word pairs that are reasonable for a technical Perl
    document. These pairs should be passed into "new", optionally with
    extra pairs if you like, for example:

     my $hcw = Text::HumanComputerWords->new(
     
       # this needs to come first so that platypus modules are recognized before
       # non-platypus modules in the default rule set
       platypus_module => sub ($word) { $word =~ /^FFI::Platypus(::[A-Za-z0-9_]+)*$/ },
     
       # the normal Perl rules.
       Text::HumanComputerWords->default_perl,
     
       # this can go anywhere, but we check for it last.
       plus_one => sub ($word) { $word eq '+1' },
     );

    By itself, this returns pairs that will recognize these types:

    path_name

      A file system path. Something that looks like a UNIX or Windows
      filename or directory path.

    url_link

      A URL. The regex used to recognize a URL is naive, so if the URLs
      need to be validated, that should be done separately.

    module

      A Perl module name. Something::Like::This.

 split

     my @pairs = $hcw->split($text);

    This method splits the text into word combo pairs. Each pair is
    returned as an array reference. The first element is the type, and the
    second is the word. The types are as defined when the $hcw object is
    created, plus the word type for human words.
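
    For example, a minimal sketch that keeps only the human words, as you
    might before handing them off to a spell checker:

     my @human_words = map  { $_->[1] }
                       grep { $_->[0] eq 'word' }
                       $hcw->split($text);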

CAVEATS

    Doesn't recognize VMS paths! Oh noes!

    The default_perl method identifies computer "words" with regular
    expressions that are somewhat reasonable, but probably allow a few
    false positives or negatives, and it doesn't do any validation for
    things like URLs or modules. Module names like strict or warnings
    that do not contain a :: cannot be recognized.
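
    If bare pragma names need to be treated as computer words, one
    workaround is to pass an extra pair ahead of the defaults. A hedged
    sketch (the pragma type name and the list of names here are purely
    illustrative):

     my %pragma = map { $_ => 1 } qw( strict warnings utf8 parent );
     
     my $hcw = Text::HumanComputerWords->new(
       pragma => sub ($word) { $pragma{$word} },
       Text::HumanComputerWords->default_perl,
     );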

AUTHOR

    Graham Ollis <plicease@cpan.org>

COPYRIGHT AND LICENSE

    This software is copyright (c) 2021 by Graham Ollis.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.