Commit: WIP online docs #93
ceteri committed Feb 27, 2021
1 parent 144e49f commit bdc6f95
Showing 5 changed files with 183 additions and 182 deletions.
25 changes: 21 additions & 4 deletions docs/biblio.md
Links to online versions of cited works use [DOIs](https://www.doi.org/) when available,
then separately list [open access](https://peerj.com/preprints/3119v1/) URLs obtained
through <https://github.com/Coleridge-Initiative/RCApi>


## – F –

### florescuc17
["PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents"](https://www.aclweb.org/anthology/P17-1102/)
**Corina Florescu**, **Cornelia Caragea**
*ACL* (2017)
doi: 10.18653/v1/P17-1102
open: <https://www.aclweb.org/anthology/P17-1102.pdf>


## – H –

### hogan2020knowledge

["Knowledge Graphs"](https://arxiv.org/abs/2003.02320)
**Aidan Hogan**, **Eva Blomqvist**, **Michael Cochez**, **Claudia d'Amato**,
**Gerard de Melo**, **Claudio Gutierrez**, **José Emilio Labra Gayo**,
**Sabrina Kirrane**, **Sebastian Neumaier**, **Axel Polleres**,
**Roberto Navigli**, **Axel-Cyrille Ngonga Ngomo**, **Sabbir M. Rashid**,
**Anisa Rula**, **Lukas Schmelzeisen**, **Juan Sequeda**, **Steffen Staab**,
**Antoine Zimmermann**
*arXiv* (2020)


## – M –

### mihalcea04textrank
["TextRank: Bringing Order into Texts"](https://www.aclweb.org/anthology/W04-3252/)
**Rada Mihalcea**, **Paul Tarau**
*EMNLP* (2004)
open: <https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf>

### mihalcea16msr

["Single and Multiple Document Summarization with Graph-based Ranking Algorithms"](https://www.youtube.com/watch?v=NvpCFJ0dA8A)
**Rada Mihalcea**
Microsoft Research on *YouTube* (2016-09-05)

## – W –

### williams2016

["Text summarization, topic models and RNNs"](http://mike.place/2016/summarization/)
**Mike Williams**
*PyGotham* (2016-09-25)
131 changes: 17 additions & 114 deletions docs/glossary.md
This material is a work in progress, at "rough draft" stage.

## – A –

### abstraction layer

A technology implementing a [*separation of concerns*](#separation-of-concerns)

> a way of hiding the working details of a subsystem, allowing the separation of concerns to facilitate interoperability and platform independence

see: <https://en.wikipedia.org/wiki/Abstraction_layer>

### Apache Spark

One of the most popular open source projects for "Big Data" infrastructure,
based on [[zahariacfss10]](../biblio/#zahariacfss10).
Spark provides a classic example of [*applicative systems*](#applicative-systems)
used for data analytics.

see: <https://spark.apache.org/>

### applicative systems
### abstractive summarization


## – C –

### cloud computing

see: <https://derwen.ai/d/cloud_computing>

### computable content
### coreference resolution


## – D –

### data context

### data engineering

### data governance
### deep learning

see: <https://derwen.ai/d/data_governance>

### data science
## – E –

see: <https://derwen.ai/d/data_science>
### eigenvector centrality

### data strategy
### entity linking

### distributed systems
### extractive summarization


## – G –

### graph algorithms

### graph database

### graph-based data science


## – K –

### KG

abbr. [knowledge graph](#knowledge-graph)

### KGC

abbr. [Knowledge Graph Conference](#knowledge-graph-conference)

### knowledge graph

One of the more concise, contemporary definitions is given in
[[hogan2020knowledge]](../biblio/#hogan2020knowledge).
see: <https://derwen.ai/d/knowledge_graph>

### Knowledge Graph Conference

The annual *Knowledge Graph Conference*
and its community
<https://www.knowledgegraph.tech/>

### knowledge graph embedding


## – M –
## – L –

### machine learning
### language models

see: <https://derwen.ai/d/machine_learning>
### lemma graph


## – N –

### natural language

see: <https://derwen.ai/d/natural_language>


## – O –

### OSFA

abbr. "One size fits all", a common antipattern in technology

> a description for a product that would fit in all instances

see: <https://en.wikipedia.org/wiki/One_size_fits_all>
### named entity recognition


## – P –

### probabilistic graph inference

### probabilistic soft logic

A computationally efficient form of [*statistical relational learning*](#statistical-relational-learning)
described in [[bachbhg17]](../biblio/#bachbhg17)

> We unite three approaches from the randomized algorithms, probabilistic graphical models, and fuzzy logic communities, showing that all three lead to the same inference objective.

see: <https://psl.linqs.org/>

### property graph


## – R –
### personalized PageRank

### RDF

abbr. *Resource Description Framework* <https://www.w3.org/RDF/>

> a standard model for data interchange on the Web
### reinforcement learning

see: <https://derwen.ai/d/reinforcement_learning>
### phrase extraction


## – S –

### semantic technologies

### Semantic Web

A proposed evolution of the World Wide Web,
discussed in retrospect by [[shadbolt06semantic]](../biblio/#shadbolt06semantic),
and coordinated through the [W3C](#W3C).
The intent was to move from documents for humans to read,
to a Web that included data and information for computers to manipulate.

see: <https://en.wikipedia.org/wiki/Semantic_Web> <https://www.w3.org/standards/semanticweb/>

### separation of concerns

### statistical relational learning

see: <https://www.cs.umd.edu/srl-book/>

### semantic relations

## – W –
## – T –

### W3C
### textgraphs

abbr. *World Wide Web Consortium* <https://www.w3.org/>

> an international community where Member organizations, a full-time staff, and the public work together to develop Web standards
### transformers
139 changes: 75 additions & 64 deletions docs/overview.md
# Overview

## Open Source Integration
## Lemma Graph

The **kglab** package is mostly about integration.
On the one hand, there are useful graph libraries, most of which don't
share much common ground and can often be difficult to use together.
On the other hand, there are the popular tools used for data science
and data engineering, with expectations about how to make processes
repeatable, how to scale and leverage multi-cloud resources, etc.
Internally, **PyTextRank** constructs a *lemma graph* to represent
links among the candidate phrases (e.g., unrecognized entities) and
also their supporting language within the text.

Much of the role of **kglab** is to provide abstractions that make
these integrations simpler, while fitting into the tools and processes
that are expected by contemporary data teams in industry.
The following figure shows a *landscape diagram* for how **kglab**
fits into multiple technology stacks and related workflows:

<a href="../assets/landscape.png" target="_blank"><img src="../assets/landscape.png" width="500" /></a>
The results from components in earlier stages of the `spaCy` pipeline
produce two important kinds of annotations for each token in a parsed
document:

Items shown in *black* have been implemented, while the items shown in
*blue* are on our roadmap.
We include use cases for most of what's
implemented within the [tutorial](../tutorial/).
1. *part-of-speech* tags
2. *lemmatized* tokens

Note that when you have these two annotations plus the disambiguated
*word sense* (i.e., the meaning of a word based on its context and
usage) then you can map from a token to a *concept*.
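As a rough sketch of that mapping, here are purely illustrative token annotations (made-up values, not output from any real pipeline) keyed by lemma and part-of-speech:

```python
# Hypothetical token annotations, shaped like what earlier pipeline
# components (tagger, lemmatizer) might produce; values are illustrative.
tokens = [
    {"text": "cats", "pos": "NOUN", "lemma": "cat"},
    {"text": "are",  "pos": "AUX",  "lemma": "be"},
    {"text": "cute", "pos": "ADJ",  "lemma": "cute"},
]

def to_concept(token):
    """Key a token by its lemma plus part-of-speech; with a disambiguated
    word sense this key could resolve further, to a concept."""
    return (token["lemma"], token["pos"])

nouns = [to_concept(t) for t in tokens if t["pos"] == "NOUN"]
# nouns == [("cat", "NOUN")]
```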

## Just Enough Math, Edition 2
The gist of the *TextRank* algorithm is to apply a sliding window
across the tokens within a parsed sentence, constructing a graph from
the lemmatized tokens where neighbors within the window get linked.
Each lemma is unique within the lemma graph, such that repeated
instances collect more links.
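That windowing step can be sketched in a few lines of plain Python, as a simplified illustration rather than the package's actual implementation:

```python
from collections import defaultdict

def build_lemma_graph(lemmas, window=3):
    """Sliding-window co-occurrence graph: each lemma links to its
    neighbors within `window` tokens. Repeated lemmas share one node,
    so frequent terms accumulate more links. (Illustrative sketch.)"""
    edges = defaultdict(set)
    for i, lemma in enumerate(lemmas):
        for other in lemmas[i + 1 : i + window]:
            if other != lemma:
                edges[lemma].add(other)
                edges[other].add(lemma)
    return edges

g = build_lemma_graph(["cat", "chase", "mouse", "cat", "sleep"])
# "cat" appears twice, so it collects links from both positions
```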

To be candid, **kglab** is partly a follow-up edition of
[*Just Enough Math*](../biblio/#nathan2014jem)
– which originally had the elevator pitch:
A *centrality* measure gets calculated for each node in the graph,
then the nouns can be ranked in descending order.
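As a rough illustration of that scoring step, here is a simplified, unweighted PageRank-style power iteration over a toy adjacency map (the node names and graph are made up for the example):

```python
def rank_nodes(edges, damping=0.85, iters=50):
    """Power-iteration sketch of an eigenvector-centrality measure,
    similar in spirit to the PageRank scoring used by TextRank."""
    nodes = list(edges)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # each node shares its score evenly among its neighbors
        score = {
            n: (1.0 - damping) / len(nodes)
               + damping * sum(score[m] / len(edges[m]) for m in edges[n])
            for n in nodes
        }
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

# Toy undirected lemma graph, as adjacency sets
edges = {
    "cat":   {"chase", "mouse"},
    "chase": {"cat", "mouse"},
    "mouse": {"cat", "chase", "sleep"},
    "sleep": {"mouse"},
}
ranking = rank_nodes(edges)
# "mouse" has the most links, so it ranks highest
```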

> practical uses of advanced math for business execs (who probably didn't take +3 years of calculus) to understand big data use cases through hands-on coding experience plus case studies, histories of the key innovations and their innovators, and links to primary sources
An additional pass through the graph uses both *noun chunks* and
*named entities* to help agglomerate adjacent nouns into ranked
phrases.

[*JEM*](../biblio/#nathan2014jem) started as a book which –
thanks to quick thinking by editor Ann Spencer –
turned into a popular video+notebook series,
followed by tutorials, and then a community focused on open source.
Seven years later the field of
[data science](../glossary/#data-science)
has changed dramatically.
This time around, **kglab** starts as an open source Python library,
with a notebook-based tutorial at its core,
focused on a community and their business use cases.

The scope now is about
[*graph-based data science*](../glossary/#graph-based-data-science),
and perhaps someday this may spin out a book or other learning materials.
## Leveraging Semantic Relations

Generally speaking, any means of enriching the lemma graph prior to
phrase ranking will tend to improve results.

## How to use these materials
Possible ways to enrich the lemma graph include
[*coreference resolution*](http://nlpprogress.com/english/coreference_resolution.html)
and
[*semantic relations*](https://en.wikipedia.org/wiki/Hyponymy_and_hypernymy).
The latter can leverage *knowledge graphs* or some form of *thesaurus*
in the general case.

Following the *JEM* approach, throughout the tutorial you'll find a
mix of topics:
data science, business context, AI applications, data management,
design, distributed systems – plus explorations of how to leverage
relatively advanced math, where appropriate.
For example,
[WordNet](https://spacy.io/universe/project/spacy-wordnet)
and
[DBpedia](https://wiki.dbpedia.org/)
both provide means for inferring links among entities, and
purpose-built knowledge graphs can be applied for specific use cases.
These can help enrich a lemma graph even in cases where links are not
explicit within the text.

To address these topics, this documentation uses a particular
structure, shown in the following figure:
Consider a paragraph that mentions `cats` and `kittens` in different
sentences: an implied semantic relation exists between the two nouns
since the lemma `kitten` is a hyponym of the lemma `cat` -- such that
an inferred link can be added between them.
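A minimal sketch of adding such an inferred link, using a toy hypernym map that stands in for WordNet or a knowledge graph (all entries below are illustrative):

```python
# Toy hypernym map standing in for WordNet or a knowledge graph;
# entries here are illustrative, not pulled from any real resource.
hypernyms = {"kitten": "cat", "puppy": "dog"}

def inferred_links(lemmas):
    """Yield (hyponym, hypernym) pairs for lemmas related through the
    map, so an edge can be added to the lemma graph even when no link
    is explicit within the text."""
    present = set(lemmas)
    for lemma in present:
        parent = hypernyms.get(lemma)
        if parent in present:
            yield (lemma, parent)

links = list(inferred_links(["cat", "chase", "kitten"]))
# links == [("kitten", "cat")]
```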

<a href="../assets/learning.png" target="_blank"><img src="../assets/learning.png" width="500" /></a>

To make these materials useful to a wide audience, we've provided
multiple entry points, depending on what you need:
## Entity Linking

* Introduce [concepts](../concepts), exploring the math behind the concepts
* Point toward histories, [primary sources](../biblio), and other materials for context
* Show [use cases](../use_case) and link to related case studies for grounding
* Practice through [hands-on coding](../tutorial/), based on a progressive example
* Clarify terminology with a [glossary](../glossary) for shared definitions
One of the motivations for **PyTextRank** is to provide support (eventually) for
[*entity linking*](http://nlpprogress.com/english/entity_linking.html),
in contrast to the more commonplace usage of
[*named entity recognition*](http://nlpprogress.com/english/named_entity_recognition.html).
These approaches can be used together in complementary ways to improve
the results overall.
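The distinction can be sketched as follows; the tiny lookup table and identifier strings below are placeholders for illustration, not a real entity-linking API:

```python
# NER finds and types a span, e.g. ("Rada Mihalcea", "PERSON");
# entity *linking* goes further, resolving the span to a KB identifier.
# This toy KB and its IDs are illustrative placeholders.
kb = {
    "rada mihalcea": "dbpedia:Rada_Mihalcea",
}

def link_entity(mention):
    """Resolve a recognized mention to a knowledge-base identifier,
    or None when no entry matches."""
    return kb.get(mention.lower())

linked = link_entity("Rada Mihalcea")
# linked == "dbpedia:Rada_Mihalcea"
```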

Ideally, there should also be two other parts – stay tuned for both:
This has an additional benefit of linking parsed and annotated
documents into more structured data, and can also be used to support
knowledge graph construction.

* *self-assessments* for personal feedback
* the coding examples should lead into a *capstone project*

In any case, the objective for these materials is to help people learn
how to leverage **kglab** effectively, gain confidence working with
graph-based data science, plus have examples to repurpose for their own
use cases.
## Extractive Summarization

Start at any point, whatever is most immediately useful for you.
The material is hyper-linked together; it may be helpful to run
JupyterLab for the coding examples in one browser tab, while reading
this documentation in another browser tab.
The simple implementation of *extractive summarization* in
**PyTextRank** was inspired by the
[[williams2016]](../biblio/#williams2016)
talk on text summarization.

Again, we're focused on a [community](../#community-resources)
and pay special attention to their business use cases.
Note that while **much better** approaches exist for
[*summarizing text*](http://nlpprogress.com/english/summarization.html),
questions linger about some of the top contenders -- see:
[1](https://arxiv.org/abs/1909.03004),
[2](https://arxiv.org/abs/1906.02243).

Arguably, having alternatives such as **PyTextRank**
allows for a wider range of cost trade-offs.
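To sketch the extractive approach, here is a simplified scorer using toy sentences and made-up phrase scores, not the package's API: sentences containing the top-ranked phrases are kept, in their original order.

```python
def summarize(sentences, phrase_scores, limit=1):
    """Extractive summarization sketch: score each sentence by the
    ranked phrases it contains, then keep the top `limit` sentences
    in their original document order."""
    scored = []
    for idx, sent in enumerate(sentences):
        score = sum(s for p, s in phrase_scores.items() if p in sent.lower())
        scored.append((score, idx, sent))
    top = sorted(scored, reverse=True)[:limit]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]

sentences = [
    "Graph algorithms rank the phrases.",
    "The weather was pleasant.",
]
phrase_scores = {"graph algorithms": 0.9, "phrases": 0.5}
summary = summarize(sentences, phrase_scores)
# summary == ["Graph algorithms rank the phrases."]
```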


## Feedback

Let us know if you find this package useful, tell us about use cases,
describe what else you would like to see integrated, etc.

We're focused on our [community](../#community-resources)
and pay special attention to the business use cases.
We're also eager to hear your feedback and suggestions for this
open source project.