This tutorial explains the basic concepts of the NLP editor. The flow created in this tutorial can be imported from sample-flows/tutorial-flow.json and executed by uploading the text file [4Q2006.txt](./sample-data/revenue by division/financial statements/4Q2006.txt) into the Input Documents node.
Under Extractors, drag and drop Input Documents onto the canvas. Configure it with the document 4Q2006.txt. Click Upload, then Close.
Under Extractors, drag Dictionary onto the canvas. Connect its input to the output of Input Documents.
Rename the node to `Division` and enter the terms: `Software`, `Hardware`, `Global Business Services`, and `Global Technology Services`. Click Save.
Select the `Division` node, and click Run.
Similar to the prior step, create a dictionary called `Metric` with a single term `revenue`. Select Lemma Match. Don't forget to click Save.
Create a dictionary `Preposition` with the terms `for` and `from`. Select Ignore case. Click Save.
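
To make the behavior of the three dictionaries concrete, here is a minimal Python sketch of dictionary matching. It is not the NLP editor's implementation: the function `dictionary_matches`, its parameters, and the sample sentence are hypothetical, and Lemma Match is only approximated by accepting an optional trailing "s" (real lemmatization covers many more word forms).

```python
import re

def dictionary_matches(text, terms, ignore_case=False, lemma_match=False):
    """Return (begin, end, matched_text) for each dictionary hit in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # Try longer terms first so multi-word entries win over shorter ones.
    alternatives = sorted(terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in alternatives)
    # Assumption: approximate lemma matching with an optional plural "s".
    suffix = "s?" if lemma_match else ""
    regex = re.compile(r"\b(?:" + pattern + r")" + suffix + r"\b", flags)
    return [(m.start(), m.end(), m.group(0)) for m in regex.finditer(text)]

# Invented sample sentence, not taken from 4Q2006.txt.
sample = "Software revenues grew, while revenues from Hardware declined."

divisions = dictionary_matches(
    sample,
    ["Software", "Hardware", "Global Business Services", "Global Technology Services"],
)
metrics = dictionary_matches(sample, ["revenue"], lemma_match=True)
prepositions = dictionary_matches(sample, ["for", "from"], ignore_case=True)

print([hit[2] for hit in divisions])     # ['Software', 'Hardware']
print([hit[2] for hit in metrics])       # ['revenues', 'revenues']
print([hit[2] for hit in prepositions])  # ['from']
```

Each hit carries its character offsets, which is what later steps need in order to combine matches into sequences.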
Create a sequence that identifies text such as "Software revenues". Under Generation, drag and drop Sequence onto the canvas. Connect its input to the outputs of the nodes `Division` and `Metric`.
Open the sequence, rename it to `RevenueOfDivision1`, and write `(<Division.Division>)<Token>{0,2}(<Metric.Metric>)` under Sequence Pattern. Click Save. Run the sequence to see results.
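
The sequence pattern reads as "a Division match, then zero to two tokens, then a Metric match". A rough Python analogue is sketched below, assuming that tokens are whitespace-delimited words (the editor's own tokenizer may split text differently, for example around punctuation) and that `Metric` matches "revenue" or "revenues". The names `REVENUE_OF_DIVISION_1` and `revenue_of_division_1` are hypothetical.

```python
import re

DIVISION = r"(?:Software|Hardware|Global Business Services|Global Technology Services)"
METRIC = r"(?:revenues?)"
# Stand-in for <Token>{0,2}: at most two intervening whitespace-delimited tokens.
GAP = r"(?:\s+\S+){0,2}?"

REVENUE_OF_DIVISION_1 = re.compile(rf"\b({DIVISION}){GAP}\s+({METRIC})\b")

def revenue_of_division_1(text):
    """Return (full_match, division, metric) for each sequence match."""
    return [(m.group(0), m.group(1), m.group(2))
            for m in REVENUE_OF_DIVISION_1.finditer(text)]

print(revenue_of_division_1("Software revenues grew"))
# [('Software revenues', 'Software', 'revenues')]
print(revenue_of_division_1("Hardware and related revenue fell"))
# [('Hardware and related revenue', 'Hardware', 'revenue')]
print(revenue_of_division_1("Hardware and its related revenue fell"))
# [] because three tokens separate the two matches
```

The two capture groups correspond to the parenthesized parts of the sequence pattern, so each result exposes the division and the metric alongside the full matched text.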
Create another sequence called `RevenueOfDivision2` to identify text such as "revenues from Software". Connect its input to the outputs of the nodes `Metric`, `Preposition`, and `Division`. Modify the Sequence Pattern as: `(<Metric.Metric>)<Token>{0,1}(<Preposition.Preposition>)<Token>{0,2}(<Division.Division>)`. Note: the order in which you connect the inputs of the sequence dictates the initial sequence pattern filled in by default.
Click Save and Run.
Under Generation, drag Union onto the canvas. Connect its inputs to the outputs of `RevenueOfDivision1` and `RevenueOfDivision2`. Rename the union to `RevenueOfDivision`. Click Close and Run.
You will see an error "Union node requires attribute aligned" because the attributes of the two input nodes have different names. You must make the input nodes union compatible by renaming the attributes so that they match.
To do this, open the node `RevenueOfDivision1`, rename the first attribute to `RevenueOfDivision`, and click Save.
Do the same for the node `RevenueOfDivision2`: rename the first attribute to `RevenueOfDivision` and click Save.
Now select the Union node `RevenueOfDivision` and run it. You will see 6 results: one result from `RevenueOfDivision1`, and five results from `RevenueOfDivision2`.
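
The union-compatibility requirement can be illustrated with a small Python sketch. This is a simplified model, not the editor's code: each result is a dictionary that carries only the first attribute of its node, and the helpers `union` and `rename_attribute` as well as the sample values are hypothetical.

```python
def union(*inputs):
    """Concatenate result sets; every input must use the same attribute names."""
    schemas = [set(rows[0].keys()) for rows in inputs if rows]
    if any(schema != schemas[0] for schema in schemas):
        raise ValueError("inputs are not union compatible: attribute names differ")
    return [row for rows in inputs for row in rows]

def rename_attribute(rows, old, new):
    """Return copies of the rows with attribute `old` renamed to `new`."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]

# Invented sample results; only the first attribute of each node is modeled.
seq1 = [{"RevenueOfDivision1": "Software revenues"}]
seq2 = [{"RevenueOfDivision2": "revenues from Hardware"}]

try:
    union(seq1, seq2)
except ValueError as error:
    print(error)  # the attribute names differ, so the union is rejected

aligned1 = rename_attribute(seq1, "RevenueOfDivision1", "RevenueOfDivision")
aligned2 = rename_attribute(seq2, "RevenueOfDivision2", "RevenueOfDivision")
combined = union(aligned1, aligned2)
print(len(combined))  # 2
```

Once both inputs expose an attribute with the same name, their results can be stacked into a single result set.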
Under Extractors, drag RegEx onto the canvas. Name it `Amount` and specify the regular expression as `\$\d+(\.\d+)?\s+billion`.
Click Save, then Run.
The regular expression captures mentions of dollar amounts expressed in billions, such as `$8.6 billion`.
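
The same regular expression can be tried outside the editor. The sketch below uses Python's `re` module on an invented sentence (not taken from 4Q2006.txt); the helper name `find_amounts` is hypothetical, and Python's regex dialect may differ in details from the one the editor uses.

```python
import re

# A dollar sign, digits, an optional decimal part, whitespace, then "billion".
AMOUNT = re.compile(r"\$\d+(\.\d+)?\s+billion")

def find_amounts(text):
    """Return every substring of text that matches the Amount expression."""
    return [m.group(0) for m in AMOUNT.finditer(text)]

sample = "Revenues were $8.6 billion, up from $4.2 billion; costs were $900 million."
print(find_amounts(sample))  # ['$8.6 billion', '$4.2 billion']
```

Amounts in millions are not matched because the expression requires the literal word "billion".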
Create a sequence called `RevenueByDivision` and specify the pattern as `(<RevenueOfDivision.RevenueOfDivision>)<Token>{0,35}(<Amount.Amount>)`. Ensure the name of the first attribute is also `RevenueByDivision`, renaming it if necessary. Click Save and Run.
In the results, notice that some matches overlap: the second result, `revenues from Global Technology Services ... $8.6 billion`, overlaps with the third result, `revenues from Global Technology Services ... $8.6 billion ... $4.2 billion`.
The third result is incorrect, as `$4.2 billion` is the revenue of a different division.
We can remove such overlaps using the Consolidate node.
Under Refinement, drag Consolidate onto the canvas and connect its input to `RevenueByDivision`.
Rename it to `RevenueConsolidated` and configure it using the `NotContainedWithin` policy. Click Save.
Run `RevenueConsolidated`. The incorrect overlapping results have been removed.
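
The effect of the consolidation step can be sketched in Python. This assumes that `NotContainedWithin` drops any result whose span fully contains another result's span, which is consistent with the outcome described above where the longer third result disappears. The function name and the character offsets are hypothetical.

```python
def consolidate_not_contained_within(spans):
    """Keep spans that do not strictly contain another span; drop exact duplicates."""
    unique = sorted(set(spans))

    def contains(outer, inner):
        # outer covers inner completely and is not the same span
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    return [span for span in unique
            if not any(contains(span, other) for other in unique)]

# Hypothetical (begin, end) offsets for three results.
spans = [
    (0, 50),     # a result ending at the first amount
    (0, 80),     # a longer result that runs on to a second amount
    (100, 140),  # an unrelated result elsewhere in the document
]
print(consolidate_not_contained_within(spans))  # [(0, 50), (100, 140)]
```

Under this assumption, spans that merely overlap without one containing the other are both kept; only a span that wraps a shorter one is discarded.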