Difference between revisions of "NLP++"

From Natural Philosophy Wiki
 
(25 intermediate revisions by 2 users not shown)
Line 15: Line 15:
 
| website = {{URL|https://www.visualtext.com}}
 
| website = {{URL|https://www.visualtext.com}}
 
}}
 
}}
'''NLP++''' is a computer language specifically designed for processing unstructured or semi-unstructured text and includes the NLP++ language, as well as the [[Conceptual Grammar]], a hierarchical knowledge base. NLP++ works in conjunction with the Integrated Development Environment [[VisualText]]. It is the only computer language that is dedicated exclusively to [[natural language processing]]
+
'''NLP++''' is a computer programming language for natural language processing created by [[Amnon Meyers]] and [[David de Hilster]] in 1998. It operates on an input text via multiple passes that elaborate a best-first parse tree. It can access and update a hierarchical knowledge base management system (KBMS) called [[Conceptual Grammar]] (CG). NLP++ and CG deploy with an Integrated Development Environment (IDE) called [[VisualText]], which supports rapid development of text analyzers. NLP++ is the only computer language exclusively dedicated to [[natural language processing]].
  
 
== Overview ==
 
== Overview ==
NLP++ is a language specifically designed to read and process text like human readers combining bottom up, island-driven sequential processing in conjunction with a dynamic hierarchical knowledge base called the conceptual grammar. NLP++ can dynamically build knowledge and use stored knowledge in order to aid in the task of understanding written text. It's scope in processing text runs from simple syntactic processing and matching to full-blown natural language understanding. NLP++ is a language, not a toolkit so all analyzers must be created from scratch for each application.
+
NLP++ is a computer language dedicated to building natural language text analyzers. It allows programmers to capture and apply linguistic and world knowledge, emulating processes by which humans read and understand text. NLP++ combines bottom up, island-driven, recursive grammar, and other methods in a multi-pass architecture that operates on one parse tree. It works with a hierarchical knowledge base (KB), called Conceptual Grammar (CG), to dynamically build and use stored knowledge in analyzing text. Applications range from simple syntactic processing to full natural language understanding. VisualText is a developer's environment that exploits NLP++ and CG to rapidly elaborate text analyzers.  Passes and KBs from one analyzer may be exploited to more rapidly construct and tailor new text analyzers.
  
 
=== NLP++ ===
 
=== NLP++ ===
NLP++ is a computer language that takes text, breaks it down into tokens, builds up those tokens into syntactic trees, and builds and uses knowledge stored in the conceptual grammar. The language includes functions, rules, local variables specific to its internal representations of text and knowledge.
+
NLP++ is a computer language that takes text, breaks it down into tokens, builds up those tokens into syntactic trees, and builds and uses knowledge stored in Conceptual Grammar. The language includes functions, rules, operators, and variables specific to its internal representations of text and knowledge.  NLP++ comprises general C or C++-lke programming language constructs, as well as integrally addressing rule matches and the associated knowledge base.
  
 
==== Variables ====
 
==== Variables ====
Variables are written with a single letter and a string name. There are special variable types in NLP++ that are used in specific contexts.
+
Variables are written with a single letter and a string name. Special variable types in NLP++ apply to specific contexts.
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 100: Line 100:
 
|}
 
|}
  
==== RULES ====
+
==== Rules ====
NLP++ has rules for pattern matching. In
+
NLP++ has rules for pattern matching. A rule is written in the form of "@RULES _node <- a b c @@" where "<-" where a, b, and c are match and put under the new node "_node". Here is an example of a rule.
  
 
<syntaxhighlight lang="nlp">
 
<syntaxhighlight lang="nlp">
 
@POST
 
@POST
S("count") = N("$text",1);
+
S("count") = N("$text",2);
 +
S("concept") = makeconcept(G("Counts"),N("$text",1));
 
single();
 
single();
  
 
@RULES
 
@RULES
 
_count <-
 
_count <-
     _xNUM [s]   ### (1)
+
     _xALPHA [s] ### (1)
     count [s]    ### (2)
+
     _xNUM [s]    ### (2)
 
     @@
 
     @@
 
</syntaxhighlight>
 
</syntaxhighlight>
 +
 +
==== Built In Functions ====
 +
NLP++ has built in functions for the following areas:
 +
* Database Functions
 +
* Formatting and I/O Functions
 +
* Knowledge Base Functions
 +
* Math Functions
 +
* Parse Tree Functions
 +
* Special Functions
 +
* Spelling Functions
 +
* String Functions
 +
* Web Functions
 +
 +
==== User Functions ====
 +
NLP++ allows the user to create their own functions in the @@DECL area. These functions can access the syntactic tree and any part of the knowledge base, as well as files on the system.
  
 
=== Conceptual Grammar ===
 
=== Conceptual Grammar ===
Line 119: Line 135:
  
 
=== VisualText ===
 
=== VisualText ===
 +
[[File:VisualText2IDE.png|thumb|VisualText version 2 for Windows]]
 
VisualText is an [[IDE]] that is specifically built to edit, run, and debug NLP++ text analyzers. It includes a text director of texts to process, a special editor for NLP++, text highlighting of matching rules for each sequential pass of rule patterns, and tree visualizations for the syntactic tree as well as the hierarchical knowledge base. It also has the ability to quickly generate rules directly from text.
 
VisualText is an [[IDE]] that is specifically built to edit, run, and debug NLP++ text analyzers. It includes a text director of texts to process, a special editor for NLP++, text highlighting of matching rules for each sequential pass of rule patterns, and tree visualizations for the syntactic tree as well as the hierarchical knowledge base. It also has the ability to quickly generate rules directly from text.
  
Line 125: Line 142:
  
 
=== Amnon Meyers ===
 
=== Amnon Meyers ===
Amnon Meyers worked on previous systems including Vox and the conceptual grammar ... blah blah blah
+
Amnon Meyers got his MS in Computer Science UC Berkeley, MS in Organic Chemistry UC Berkeley, and a BS in biology MIT. Amnon spent over a decade in aerospace NLP R&D (McDonnell Douglas & TRW) in the artificial intelligence groups developing VOX and the conceptual grammar which was developed in LISP. Amnon worked for 5 years at UC Irvine AI Lab on DARPA/Navy projects. In 1997, Amnon secured funding from friends and family to state Text Analysis International in order to create a computer programming language and IDE for creating natural language processing systems.
  
 
=== David de Hilster ===
 
=== David de Hilster ===
David de Hilster first developed island-driven pattern matching rules first in LISP on Xerox 1108 machines at [[Battelle Memorial Institute]] in their artificial intelligence group. He then worked in C for the commercial product called NLQuery from [[Battelle Memorial Institute]]. In the late 1980s, de Hilster developed Verbo, a natural language query system for databases in Portuguese while living in working in Rio de Janeiro Brazil.
+
David de Hilster got his BS in mathematics and MA in linguistics from the Ohio Statue University. De Hilster first developed island-driven pattern matching rules first in LISP on Xerox 1108 machines at [[Battelle Memorial Institute]] in their artificial intelligence group. He then worked in C for the commercial product called NLQuery from [[Battelle Memorial Institute]]. In the late 1980s, de Hilster developed Verbo, a natural language query system for databases in Portuguese while living in working in Rio de Janeiro Brazil.
  
 
=== Collaboration ===
 
=== Collaboration ===
 
In 1990, David de Hilster was hired into the Artificial Intelligence group at McDonnell Douglas in Huntington Beach California where he met Amnon and was tasked to move Amnon's Vox program to C++. The two collaborated, combining the island-driven parsing with the conceptual grammar and coming up with TexUS. Their system was used in the Message Understanding Conferences sponsored by Darp in the early 1990s and they placed third among the participants which included [[MIT]], [[SRI]], [[Carnegie Mellon]] among others.
 
In 1990, David de Hilster was hired into the Artificial Intelligence group at McDonnell Douglas in Huntington Beach California where he met Amnon and was tasked to move Amnon's Vox program to C++. The two collaborated, combining the island-driven parsing with the conceptual grammar and coming up with TexUS. Their system was used in the Message Understanding Conferences sponsored by Darp in the early 1990s and they placed third among the participants which included [[MIT]], [[SRI]], [[Carnegie Mellon]] among others.
  
In the mid 1990s, the two moved to the Aritifical Intelligence Group at Space Park at TRW where the two continued their collaboration, with de Hilster's work inspiring the creation of a new company ISearch which electronically processed resumes. In 1997, de Hilster was hired by ISearch to move their text processing system to the C language.
+
In the mid 1990s, the two moved to the Artificial Intelligence Group at Space Park at TRW where the two continued their collaboration, with de Hilster's work inspiring the creation of a new company ISearch which electronically processed resumes. In 1997, de Hilster was hired by ISearch to move their text processing system to the C language.
  
 
In 1998, Meyers secured funding from friends and family to start Text Analysis International which eventually lured de Hilster to join where the two created and formalized NLP++ and VisualText. The idea was to formalize a computer language that incorporated the pattern matching of de Hilster's, with the Conceptual Grammar knowledge base from Meyers, along with an [[integrated development environment]] specifically tailored to NLP++, its tree structures, and its knowledge base.
 
In 1998, Meyers secured funding from friends and family to start Text Analysis International which eventually lured de Hilster to join where the two created and formalized NLP++ and VisualText. The idea was to formalize a computer language that incorporated the pattern matching of de Hilster's, with the Conceptual Grammar knowledge base from Meyers, along with an [[integrated development environment]] specifically tailored to NLP++, its tree structures, and its knowledge base.
Line 140: Line 157:
  
 
== Open Source ==
 
== Open Source ==
 +
[[File:VSCodeNLP.jpg|thumb|VSCode NLP++ Language Extension]]
 
In December of 2018, NLP++ and VisualText went open source. The company Text Analysis International was dissolved by and it was moved to an open source MIT licensed repository by creators [[Amnon Meyers]] and [[David de Hilster]].
 
In December of 2018, NLP++ and VisualText went open source. The company Text Analysis International was dissolved by and it was moved to an open source MIT licensed repository by creators [[Amnon Meyers]] and [[David de Hilster]].
  
 
=== NLP Engine ===
 
=== NLP Engine ===
The NLP engine is a C++ class and executable that can called by other languages that can call c++ libraries the nlp.exe executable. It currently compiles on Linux and will be available on windows and Mac Os in the near future.
+
The NLP engine is a C++ class and executable that can called by other languages that can call c++ libraries or call the nlp.exe executable. It currently compiles on Linux and will be available on windows and Mac Os in the near future. The nlp.exe executable is called by the NLP++ Language Extension for VScode.
  
 
=== VisualText ===
 
=== VisualText ===
Version 2 of the VisualText IDE written for Microsoft's Windows and is available as a free download from the [http://visualtext.org VisualText website]. Version 3 is also available but was never completed and is not 100 percent functional.
+
All versions of VisualText are now also open source. Version 2 and 3 are no longer supported but heavily used (almost exclusively Version 2). The new VSCode Extension version will be the supported version of the IDE and runs on windows, macs, and Linux.
 +
 
 +
==== Windows Version ====
 +
The C++ code for Version 2 and 3 of VisualText are now open source but they are currently unsupported. They both use commercial C++ libraries from CodeJock making future support in the open source world impossible.
 +
The downloads are still available from the [http://visualtext.org VisualText website]. Version 3 is also available but was never completed and is not 100 percent functional.
 +
 
 +
==== VSCode Language Extension ====
 +
VisualText IDE is now ported to [[VSCode]] as a language extension which runs cross platform. This is now considered to be the current version of the IDE and will continue to be developed and enhanced. It was officially released as a Microsoft VSCode Language Extension on December 22, 2020 whose source code can be found in the [https://github.com/VisualText/vscode-nlp VSCode-NLP repository] on GitHub.
 +
 
 +
The output of the nlp.exe engine and the NLP++ Language Extension for VSCode produce a number of special files for analyzer development. Those files are:
 +
 
 +
{| class="wikitable"
 +
!File Extension
 +
!Description
 +
|-
 +
|.nlp or .pat
 +
|Files containing NLP++ code. The original files are .pat files and are the only files understood by the windows versions of the VisualText IDE.
 +
|-
 +
|.seq
 +
|The analyzer sequence file that store the calling order of the sequence of nlp++ files
 +
|-
 +
|.txxt
 +
|Rules matching highlight files containing double square or curly brackets around matched words and phrases (new to the VSCode NLP++ Language Extension)
 +
|-
 +
|.kb
 +
|Files dumped by the built-in function kbdumptree which is used to save and read in knowledge bases by NLP++
 +
|-
 +
|.kbb
 +
|Conceptual Grammar files with the knowledge pretty-printed for inspection during development
 +
|-
 +
|}
  
VisualText is now being ported to [[VSCode]] as a language extension which runs cross platform. It is planned to be officially released as a Microsoft VSCode Language Extension sometime in the first quarter of 2021. It is currently available as a beta IDE in the VisualText repository on GitHub.
+
The long-term plan is to take ownership of the .nlp extension for NLP++ given it is the only computer language exclusively dedicated to natural languaage.
  
 
== External links ==
 
== External links ==
 
* [https://github.com/VisualText Github open source repository]
 
* [https://github.com/VisualText Github open source repository]
 +
* [https://marketplace.visualstudio.com/items?itemName=dehilster.nlp VSCode NLP++ Language Extension]
 
* [http://www.visualtext.org Official website]
 
* [http://www.visualtext.org Official website]
 
* [http://www.textanalysis.com Legacy website]
 
* [http://www.textanalysis.com Legacy website]

Latest revision as of 07:39, 24 December 2020

NLP++
The NLP++ Logo
Paradigms Natural Language Processing
Designed by Amnon Meyers, David de Hilster
Developer Text Analysis International
First appeared 1998
Platform Cross Platform
OS Most major
Filename extensions .nlp, .pat, .seq, .txxt, .kb
Website www.visualtext.com

NLP++ is a computer programming language for natural language processing created by Amnon Meyers and David de Hilster in 1998. It operates on an input text via multiple passes that elaborate a best-first parse tree. It can access and update a hierarchical knowledge base management system (KBMS) called Conceptual Grammar (CG). NLP++ and CG deploy with an Integrated Development Environment (IDE) called VisualText, which supports rapid development of text analyzers. NLP++ is the only computer language exclusively dedicated to natural language processing.

Overview

NLP++ is a computer language dedicated to building natural language text analyzers. It allows programmers to capture and apply linguistic and world knowledge, emulating the processes by which humans read and understand text. NLP++ combines bottom-up, island-driven, recursive-grammar, and other methods in a multi-pass architecture that operates on a single parse tree. It works with a hierarchical knowledge base (KB), called Conceptual Grammar (CG), to dynamically build and use stored knowledge in analyzing text. Applications range from simple syntactic processing to full natural language understanding. VisualText is a developer's environment that exploits NLP++ and CG to rapidly elaborate text analyzers. Passes and KBs from one analyzer can be reused to construct and tailor new text analyzers more rapidly.

NLP++

NLP++ is a computer language that takes text, breaks it down into tokens, builds up those tokens into syntactic trees, and builds and uses knowledge stored in Conceptual Grammar. The language includes functions, rules, operators, and variables specific to its internal representations of text and knowledge. NLP++ comprises general C- or C++-like programming language constructs, along with constructs that directly address rule matches and the associated knowledge base.

Variables

Variables are written with a single letter and a string name. Special variable types in NLP++ apply to specific contexts.

{| class="wikitable"
!Variable
!Description
!Example
!Scope
|-
|N
|Specific node
|N("$text",2)
|Rules
|-
|S
|Suggested node
|S("count")
|Rules
|-
|X
|Context node and level
|X("concept",3)
|Rules
|-
|G
|Global variable
|G("People")
|Rules & Functions
|-
|L
|Local variable
|L("num")
|Rules & Functions
|-
|}

Regions

There are numerous regions in NLP++ files:

{| class="wikitable"
!Region
!Description
!Position and Scope
|-
|@NODES
|Specifies the nodes to be matched in the @RULES region
|Comes before the @RULES region
|-
|@PATH
|Specifies a specific path in the syntax tree to match
|Comes before the @RULES region
|-
|@CODE
|Specifies a region where NLP++ code is executed outside of a @RULES region
|Region ends with @@CODE
|-
|@DECL
|Declarative area for functions
|Region ends with @@DECL
|-
|@POST
|Specifies a region of post-processing for a rule or rules
|Comes right before the @RULES region
|-
|@PRE
|Specifies a region of pre-processing conditions for a rule or rules
|Comes right before the @RULES region
|-
|@CHECK
|Specifies conditions that rule nodes must satisfy before the rule is accepted
|Comes right before the @POST or @RULES region
|-
|@RULES
|Specifies a region for rules
|Region ends with @@
|-
|}

Rules

NLP++ has rules for pattern matching. A rule is written in the form "@RULES _node <- a b c @@", where the elements a, b, and c to the right of "<-" are matched in sequence and, when they match, are placed under the new node "_node". Here is an example of a rule.

@POST
S("count") = N("$text",2);
S("concept") = makeconcept(G("Counts"),N("$text",1));
single();

@RULES
_count <-
    _xALPHA [s]  ### (1)
    _xNUM [s]    ### (2)
    @@
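
The grouping performed by a rule like the one above can be illustrated outside NLP++. The following is a minimal Python sketch, not NLP++ code and not the engine's actual implementation: the Node class and the apply_rule and post_count helpers are hypothetical names invented for this illustration. It assumes a simple left-to-right scan in which a sequence of nodes whose labels match the rule's elements is replaced by one new node, and a post action then stores the text of element 2 on that new node, in the spirit of S("count") = N("$text",2).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A parse-tree node: label, covered text, children, and variables."""
    label: str
    text: str
    children: list = field(default_factory=list)
    vars: dict = field(default_factory=dict)  # stands in for S("...") values


def apply_rule(nodes, suggested, pattern, post):
    """Scan left to right; wherever the labels match `pattern`, replace the
    matched nodes with one new `suggested` node that has them as children."""
    out, i, n = [], 0, len(pattern)
    while i < len(nodes):
        window = nodes[i:i + n]
        if n and [w.label for w in window] == pattern:
            new = Node(suggested, " ".join(w.text for w in window), window)
            post(new, window)  # plays the role of the @POST region
            out.append(new)
            i += n
        else:
            out.append(nodes[i])
            i += 1
    return out


def post_count(suggested_node, matched):
    # Rule elements are numbered from 1, so element 2 is matched[1].
    suggested_node.vars["count"] = matched[1].text


tokens = [Node("_xALPHA", "apples"), Node("_xNUM", "3"), Node("_xALPHA", "sold")]
result = apply_rule(tokens, "_count", ["_xALPHA", "_xNUM"], post_count)
```

In this sketch, result holds a _count node covering "apples 3" with its count variable set to "3", followed by the unmatched token "sold".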

Built In Functions

NLP++ has built-in functions for the following areas:

  • Database Functions
  • Formatting and I/O Functions
  • Knowledge Base Functions
  • Math Functions
  • Parse Tree Functions
  • Special Functions
  • Spelling Functions
  • String Functions
  • Web Functions

User Functions

NLP++ allows users to create their own functions in the @DECL region, which ends with @@DECL. These functions can access the syntactic tree and any part of the knowledge base, as well as files on the system.

Conceptual Grammar

The conceptual grammar is a hierarchical knowledge base that can be imported and used by NLP++ and also created by NLP++ code and pattern matching. The hierarchy contains concepts and concepts can have attributes and phrases attached to them.
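
The shape of such a hierarchy can be sketched in a few lines of Python. This is only an illustration of the data structure described above, not the Conceptual Grammar implementation or its file format: the Concept class and its make_child and find methods are hypothetical names chosen for this example, as are the sample concept names.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A knowledge-base concept with attributes, phrases, and child concepts."""
    name: str
    attributes: dict = field(default_factory=dict)
    phrases: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

    def make_child(self, name: str) -> "Concept":
        """Return the child called `name`, creating it if it does not exist."""
        if name not in self.children:
            self.children[name] = Concept(name)
        return self.children[name]

    def find(self, *path: str):
        """Walk down the hierarchy along `path`; return None if a step is missing."""
        node = self
        for name in path:
            node = node.children.get(name)
            if node is None:
                return None
        return node


# Build a tiny hierarchy: concept -> Counts -> apples
root = Concept("concept")
counts = root.make_child("Counts")
apples = counts.make_child("apples")
apples.attributes["count"] = 3          # an attribute attached to a concept
apples.phrases.append("red apples")     # a phrase attached to a concept
```

A lookup such as root.find("Counts", "apples") then returns the same object that was created above, which is how stored knowledge would be reused by a later step in this sketch.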

VisualText

VisualText version 2 for Windows

VisualText is an IDE built specifically to edit, run, and debug NLP++ text analyzers. It includes a directory of texts to process, a dedicated editor for NLP++, highlighting of the text matched by rules in each sequential pass, and tree visualizations for the syntactic tree as well as the hierarchical knowledge base. It can also quickly generate rules directly from text.

History

The roots of NLP++ lie with its two creators, Amnon Meyers and David de Hilster, computer programmers who have worked in natural language processing since the early 1980s.

Amnon Meyers

Amnon Meyers received an MS in Computer Science and an MS in Organic Chemistry from UC Berkeley, and a BS in biology from MIT. He spent over a decade in aerospace NLP R&D (McDonnell Douglas and TRW) in artificial intelligence groups, developing VOX and the conceptual grammar, which was developed in LISP. He worked for 5 years at the UC Irvine AI Lab on DARPA/Navy projects. In 1997, he secured funding from friends and family to start Text Analysis International in order to create a computer programming language and IDE for building natural language processing systems.

David de Hilster

David de Hilster received a BS in mathematics and an MA in linguistics from the Ohio State University. He first developed island-driven pattern matching rules in LISP on Xerox 1108 machines in the artificial intelligence group at Battelle Memorial Institute. He then worked in C on NLQuery, a commercial product from Battelle Memorial Institute. In the late 1980s, de Hilster developed Verbo, a natural language query system for databases in Portuguese, while living and working in Rio de Janeiro, Brazil.

Collaboration

In 1990, David de Hilster was hired into the Artificial Intelligence group at McDonnell Douglas in Huntington Beach, California, where he met Amnon and was tasked with moving Amnon's Vox program to C++. The two collaborated, combining island-driven parsing with the conceptual grammar to produce TexUS. Their system was used in the Message Understanding Conferences sponsored by DARPA in the early 1990s, and they placed third among the participants, which included MIT, SRI, and Carnegie Mellon, among others.

In the mid-1990s, the two moved to the Artificial Intelligence Group at Space Park at TRW, where they continued their collaboration. De Hilster's work inspired the creation of a new company, ISearch, which electronically processed resumes. In 1997, de Hilster was hired by ISearch to move their text processing system to the C language.

In 1998, Meyers secured funding from friends and family to start Text Analysis International, which de Hilster eventually joined and where the two created and formalized NLP++ and VisualText. The idea was to formalize a computer language that combined de Hilster's pattern matching with Meyers's Conceptual Grammar knowledge base, along with an integrated development environment specifically tailored to NLP++, its tree structures, and its knowledge base.

For two decades, the technology was privately owned and was licensed by private companies to process medical, social media, historical, and real estate text.

Open Source

VSCode NLP++ Language Extension

In December of 2018, NLP++ and VisualText went open source. The company Text Analysis International was dissolved, and the software was moved to an open source, MIT-licensed repository by its creators, Amnon Meyers and David de Hilster.

NLP Engine

The NLP engine is a C++ class and executable. It can be used from other languages either by calling the C++ library or by invoking the nlp.exe executable. It currently compiles on Linux and will be available on Windows and macOS in the near future. The nlp.exe executable is called by the NLP++ Language Extension for VSCode.

VisualText

All versions of VisualText are now also open source. Versions 2 and 3 are no longer supported but are still heavily used (almost exclusively Version 2). The new VSCode Extension version will be the supported version of the IDE and runs on Windows, macOS, and Linux.

Windows Version

The C++ code for Versions 2 and 3 of VisualText is now open source but is currently unsupported. Both versions use commercial C++ libraries from CodeJock, making future support in the open source world impossible. The downloads are still available from the VisualText website. Version 3 is also available but was never completed and is not 100 percent functional.

VSCode Language Extension

The VisualText IDE has now been ported to VSCode as a language extension that runs cross-platform. This is now considered the current version of the IDE and will continue to be developed and enhanced. It was officially released as a Microsoft VSCode Language Extension on December 22, 2020, and its source code can be found in the VSCode-NLP repository on GitHub.

The nlp.exe engine and the NLP++ Language Extension for VSCode read and produce a number of special files for analyzer development. Those files are:

{| class="wikitable"
!File Extension
!Description
|-
|.nlp or .pat
|Files containing NLP++ code. The original files are .pat files and are the only files understood by the Windows versions of the VisualText IDE.
|-
|.seq
|The analyzer sequence file that stores the calling order of the sequence of NLP++ files
|-
|.txxt
|Rule-match highlight files containing double square or curly brackets around matched words and phrases (new to the VSCode NLP++ Language Extension)
|-
|.kb
|Files dumped by the built-in function kbdumptree, which is used to save and read in knowledge bases by NLP++
|-
|.kbb
|Conceptual Grammar files with the knowledge pretty-printed for inspection during development
|-
|}

The long-term plan is to take ownership of the .nlp extension for NLP++, given that it is the only computer language exclusively dedicated to natural language processing.

External links