WIP online docs #93

DerwenAI · Feb 27, 2021 · bdc6f95
1 parent 144e49f
commit bdc6f95

Showing 5 changed files with 183 additions and 182 deletions.
diff --git a/docs/biblio.md b/docs/biblio.md
@@ -8,6 +8,7 @@ Links to online versions of cited works use [DOIs](https://www.doi.org/) when av
 then separately list [open access](https://peerj.com/preprints/3119v1/) URLs obtained
 through <https://github.com/Coleridge-Initiative/RCApi>
 
+
 ## – F –
 
 ### florescuc17
@@ -19,6 +20,20 @@ doi: 10.18653/v1/P17-1102
 open: <https://www.aclweb.org/anthology/P17-1102.pdf>
 
 
+## – H –
+
+### hogan2020knowledge
+
+["Knowledge Graphs"](https://arxiv.org/abs/2003.02320)  
+**Aidan Hogan**, **Eva Blomqvist**, **Michael Cochez**, **Claudia d'Amato**,
+**Gerard de Melo**, **Claudio Gutierrez**, **José Emilio Labra Gayo**,
+**Sabrina Kirrane**, **Sebastian Neumaier**, **Axel Polleres**,
+**Roberto Navigli**, **Axel-Cyrille Ngonga Ngomo**, **Sabbir M. Rashid**,
+**Anisa Rula**, **Lukas Schmelzeisen**, **Juan Sequeda**, **Steffen Staab**,
+**Antoine Zimmermann**  
+*arXiv* (2020)
+
+
 ## – M –
 
 ### mihalcea04textrank
@@ -28,9 +43,11 @@ open: <https://www.aclweb.org/anthology/P17-1102.pdf>
 *EMNLP* (2004)  
 open: <https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf>
 
-### mihalcea16msr
 
-["Single and Multiple Document Summarization with Graph-based Ranking Algorithms"](https://www.youtube.com/watch?v=NvpCFJ0dA8A)  
-**Rada Mihalcea**  
-Microsoft Research on *YouTube* (2016-09-05)
+## – W –
+
+### williams2016
 
+["Text summarization, topic models and RNNs"](http://mike.place/2016/summarization/)  
+**Mike Williams**  
+*PyGotham* (2016-09-25)
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -7,73 +7,35 @@ This material is a work in progress, at "rough draft" stage.
 
 ## – A –
 
-### abstraction layer
-
-A technology implementing a [*separation of concerns*](#separation-of-concerns)
-
-> a way of hiding the working details of a subsystem, allowing the separation of concerns to facilitate interoperability and platform independence
-
-see: <https://en.wikipedia.org/wiki/Abstraction_layer>
-
-### Apache Spark
-
-One of the most popular open source projects for "Big Data" infrastructure,
-based on [[zahariacfss10]](../biblio/#zahariacfss10).
-Spark provides a classic example of [*applicative systems*](#applicative-systems)
-used for data analytics.
-
-See: <https://spark.apache.org/>
-
-### applicative systems
+### abstractive summarization
 
 
 ## – C –
 
-### cloud computing
-
-see: <https://derwen.ai/d/cloud_computing>
-
-### computable content
+### coreference resolution
 
 
 ## – D –
 
-### data context
-
-### data engineering
-
-### data governance
+### deep learning
 
-see: <https://derwen.ai/d/data_governance>
 
-### data science
+## – E –
 
-see: <https://derwen.ai/d/data_science>
+### eigenvector centrality
 
-### data strategy
+### entity linking
 
-### distributed systems
+### extractive summarization
 
 
 ## – G –
 
 ### graph algorithms
 
-### graph database
-
-### graph-based data science
-
 
 ## – K –
 
-### KG
-
-abbr. [knowledge graph](#knowledge-graph)
-
-### KGC
-
-abbr. [Knowledge Graph Conference](#knowledge-graph-conference)
-
 ### knowledge graph
 
 One of the more concise, contemporary definitions is given in
@@ -83,92 +45,33 @@ One of the more concise, contemporary definitions is given in
 
 see: <https://derwen.ai/d/knowledge_graph>
 
-### Knowledge Graph Conference
-
-The annual *Knowledge Graph Conference*
-and its community
-<https://www.knowledgegraph.tech/>
-
-### knowledge graph embedding
 
 
-## – M –
+## – L –
 
-### machine learning
+### language models
 
-see: <https://derwen.ai/d/machine_learning>
+### lemma graph
 
 
 ## – N –
 
-### natural language
-
-see: <https://derwen.ai/d/natural_language>
-
-
-## – O –
-
-### OSFA
-
-abbr. "One size fits all", a common antipattern in technology
-> a description for a product that would fit in all instances
-
-see: <https://en.wikipedia.org/wiki/One_size_fits_all>
+### named entity recognition
 
 
 ## – P –
 
-### probabilistic graph inference
-
-### probabilistic soft logic
-
-A computationally efficient form of [*statistical relational learning*](#statistical-relational-learning)
-described in [[bachbhg17]](../biblio/#bachbhg17)
-
-> We unite three approaches from the randomized algorithms, probabilistic graphical models, and fuzzy logic communities, showing that all three lead to the same inference objective. 
-
-see: <https://psl.linqs.org/>
-
-### property graph
-
-
-## – R –
+### personalized pagerank
 
-### RDF
-
-abbr. *Resource Description Framework*
-<https://www.w3.org/RDF/>
-> a standard model for data interchange on the Web
-
-### reinforcement learning
-
-see: <https://derwen.ai/d/reinforcement_learning>
+### phrase extraction
 
 
 ## – S –
 
-### semantic technologies
-
-### Semantic Web
-
-A proposed evolution of the World Wide Web,
-discussed in retrospect by [[shadbolt06semantic]](../biblio/#shadbolt06semantic),
-and coordinated through the [W3C](#W3C).
-The intent was to move from documents for humans to read,
-to a Web that included data and information for computers to manipulate.
-
-see: <https://en.wikipedia.org/wiki/Semantic_Web> <https://www.w3.org/standards/semanticweb/>
-
-### separation of concerns
-
-### statistical relational learning
-
-see: <https://www.cs.umd.edu/srl-book/>
-
+### semantic relations
 
-## – W –
+## – T –
 
-### W3C
+### textgraphs
 
-abbr. *World Wide Web Consortium* <https://www.w3.org/>
-> an international community where Member organizations, a full-time staff, and the public work together to develop Web standards
+### transformers
diff --git a/docs/overview.md b/docs/overview.md
@@ -1,90 +1,101 @@
 # Overview
 
-## Open Source Integration
+## Lemma Graph
 
-The **kglab** package is mostly about integration.
-On the one hand, there are useful graph libraries, most of which don't
-share much common ground and can often be difficult to use together.
-One the other hand, there are the popular tools used for data science
-and data engineering, with expectation about how to repeat process,
-how to scale and leverage multi-cloud resources, etc.
+Internally, **PyTextRank** constructs a *lemma graph* to represent
+links among the candidate phrases (e.g., unrecognized entities) and
+also links to the supporting language within the text.
 
-Much of the role of **kglab** is to provide abstractions that make
-these integrations simpler, while fitting into the tools and processes
-that are expected by contemporary data teams in industry.
-The following figure shows a *landscape diagram* for how **kglab**
-fits into multiple technology stacks and related workflows:
 
-<a href="../assets/landscape.png" target="_blank"><img src="../assets/landscape.png" width="500" /></a>
+Components in earlier stages of the `spaCy` pipeline produce two
+important kinds of annotations for each token in a parsed
+document:
 
-Items shown in *black* have been implemented, while the items shown in
-*blue* are on our roadmap.
-We include use cases for most of what's
-implemented within the [tutorial](../tutorial/).
+  1. *part-of-speech* tag
+  2. *lemmatized* form
 
+Note that when you have these two annotations plus the disambiguated
+*word sense* (i.e., the meaning of a word based on its context and
+usage) then you can map from a token to a *concept*.
 
-## Just Enough Math, Edition 2
+The gist of the *TextRank* algorithm is to apply a sliding window
+across the tokens within a parsed sentence, constructing a graph from
+the lemmatized tokens where neighbors within the window get linked.
+Each lemma is unique within the lemma graph, such that repeated
+instances collect more links.
 
-To be candid, **kglab** is partly a follow-up edition of 
-[*Just Enough Math*](../biblio/#nathan2014jem)
-– which originally had the elevator pitch: 
+A *centrality* measure gets calculated for each node in the graph,
+then the nouns can be ranked in descending order.
 
-> practical uses of advanced math for business execs (who probably didn't take +3 years of calculus) to understand big data use cases through hands-on coding experience plus case studies, histories of the key innovations and their innovators, and links to primary sources
+An additional pass through the graph uses both *noun chunks* and
+*named entities* to help agglomerate adjacent nouns into ranked
+phrases.
 
-[*JEM*](../biblio/#nathan2014jem) started as a book which –
-thanks to quick thinking by editor Ann Spencer – 
-turned into a popular video+notebook series,
-followed by tutorials, and then a community focused on open source.
-Seven years later the field of 
-[data science](../glossary/#data-science)
-has changed dramatically
-This time around, **kglab** starts as an open source Python library,
-with a notebook-based tutorial at its core,
-focused on a community and their business use cases.
 
-The scope now is about
-[*graph-based data science*](../glossary/#graph-based-data-science),
-and perhaps someday this may spin-out a book or other learning materials.
+## Leveraging Semantic Relations
 
+Generally speaking, any means of enriching the lemma graph prior to
+phrase ranking will tend to improve results.
 
-## How to use these materials
+Possible ways to enrich the lemma graph include
+[*coreference resolution*](http://nlpprogress.com/english/coreference_resolution.html)
+and
+[*semantic relations*](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy).
+The latter can leverage *knowledge graphs* or some form of *thesaurus*
+in the general case.
 
-Following the *JEM* approach, throughout the tutorial you'll find a
-mix of topics:
-data science, business context, AI applications, data management, 
-design, distributed systems – plus explorations of how to leverage
-relatively advanced math, where appropriate.
+For example,
+[WordNet](https://spacy.io/universe/project/spacy-wordnet)
+and
+[DBpedia](https://wiki.dbpedia.org/)
+both provide means for inferring links among entities, and
+purpose-built knowledge graphs can be applied for specific use cases.
+These can help enrich a lemma graph even in cases where links are not
+explicit within the text.
 
-To addresses these topics, this documentation uses a particular
-structure, shown in the following figure:
+Consider a paragraph that mentions `cats` and `kittens` in different
+sentences: an implied semantic relation exists between the two nouns
+since the lemma `kitten` is a hyponym of the lemma `cat` -- such that
+an inferred link can be added between them.
 
-<a href="../assets/learning.png" target="_blank"><img src="../assets/learning.png" width="500" /></a>
 
-To make these materials useful to a wide audience, we've provided
-multiple entry points, depending on what you need:
+## Entity Linking
 
-  * Introduce [concepts](../concepts), exploring the math behind the concepts
-  * Point toward histories, [primary sources](../biblio), and other materials for context
-  * Show [use cases](../use_case) and linking to related case studies for grounding
-  * Practice through [hands-on coding](../tutorial/), based on a progressive example
-  * Clarify terminology with a [glossary](../glossary) for shared definitions
+One of the motivations for **PyTextRank** is to provide support (eventually) for
+[*entity linking*](http://nlpprogress.com/english/entity_linking.html),
+in contrast to the more commonplace usage of
+[*named entity recognition*](http://nlpprogress.com/english/named_entity_recognition.html).
+These approaches can be used together in complementary ways to improve
+the results overall.
 
-Ideally, there should also be two other parts – stay tuned for both:
+This has an additional benefit of linking parsed and annotated
+documents into more structured data, and can also be used to support
+knowledge graph construction.
 
-  * *self-assessments* for personal feedback
-  * the coding examples show lead into a *capstone project*
 
-In any case, the objective for these materials is to help people learn
-how leverage **kglab** effectively, gain confidence working with
-graph-based data science, plus have examples to repurpose for your own
-use cases.
+## Extractive Summarization
 
-Start at any point, whatever is most immediately useful for you.
-The material is hyper-linked together; it may be helpful to run
-JupyterLab for the coding examples in one browser tab, while reading
-this documentation in another browser tab.
+The simple implementation of *extractive summarization* in
+**PyTextRank** was inspired by the
+[[williams2016]](../biblio/#williams2016)
+talk on text summarization.
 
-Again, we're focused on a [community](../#community-resources)
-and pay special attention to their business use cases.
+Note that while **much better** approaches exist for
+[*summarizing text*](http://nlpprogress.com/english/summarization.html),
+questions linger about some of the top contenders -- see:
+[1](https://arxiv.org/abs/1909.03004),
+[2](https://arxiv.org/abs/1906.02243).
+
+Arguably, having alternatives such as **PyTextRank**
+allows for a wider range of cost trade-offs.
+
+
+## Feedback
+
+Let us know if you find this package useful, tell us about use cases, 
+describe what else you would like to see integrated, etc.
+
+We're focused on our [community](../#community-resources) 
+and pay special attention to the business use cases.
 We're also eager to hear your feedback and suggestions for this 
 open source project.
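
Note on the new `docs/overview.md` text: the sliding-window construction and centrality ranking described under "Lemma Graph" can be illustrated with a minimal sketch. This is not PyTextRank's implementation. The function names `build_lemma_graph` and `rank_lemmas`, the `window` and `damping` parameters, and the toy input are all hypothetical, and a plain PageRank power iteration stands in for whichever centrality measure the library actually uses. Input sentences are assumed to be already lemmatized and filtered by part of speech.

```python
def build_lemma_graph(sentences, window=3):
    """Link lemmas that fall within `window` tokens of each other.

    `sentences` is a list of lists of lemma strings (assumed input format).
    Each lemma is a single node, so repeated lemmas accumulate edge weight.
    """
    graph = {}
    for sent in sentences:
        for i, lemma in enumerate(sent):
            graph.setdefault(lemma, {})
            # Pair this lemma with the next `window - 1` tokens in the sentence.
            for neighbor in sent[i + 1 : i + window]:
                if neighbor == lemma:
                    continue  # skip self-links
                graph.setdefault(neighbor, {})
                graph[lemma][neighbor] = graph[lemma].get(neighbor, 0) + 1
                graph[neighbor][lemma] = graph[neighbor].get(lemma, 0) + 1
    return graph


def rank_lemmas(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over an undirected graph.

    Simplification: rank held by isolated nodes is not redistributed.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbor passes on rank in proportion to the edge weight.
            incoming = sum(
                rank[other] * weight / sum(graph[other].values())
                for other, weight in graph[node].items()
            )
            updated[node] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank


# Toy input: three short sentences, already lemmatized.
sentences = [
    ["cat", "chase", "mouse"],
    ["cat", "eat", "mouse"],
    ["dog", "chase", "cat"],
]
graph = build_lemma_graph(sentences, window=3)
ranks = rank_lemmas(graph)
for lemma in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{lemma}\t{ranks[lemma]:.4f}")
```

Because each lemma is one node, a lemma that recurs across sentences collects more and heavier links, which is what pushes it up the ranking.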