add some documentation of data flow

author: Dafydd Harries <daf@rhydd.org> 2013-05-23 05:31:29 -0400
committer: Dafydd Harries <daf@rhydd.org> 2013-05-23 05:31:29 -0400
commit: ae21ecd58535c030ec1c367bb7715550d9a5015c (patch)
tree: a984381147277770c37fea54b293e5c2c7963684 /README
parent: 473c023c1cf4f08ab489f02ebe5bd6fba90595b3 (diff)
1 files changed, 23 insertions, 0 deletions
diff --git a/README b/README
index e408a09..283766a 100644
--- a/README
+++ b/README
@@ -10,6 +10,25 @@
  * nose (to run test.py)
      <https://nose.readthedocs.org/en/latest/>
 
+## Overview
+
+Data is collected from various sources by the "load" scripts and converted to
+the Pandas library's "data frame" structure, which is somewhat similar to a
+SQL table except that there is no fixed schema. Or to put it another way, it's
+like a sparse grid that has named fields along one axis and numbered rows on
+the other. This approach means that we can import data fairly directly from
+messy sources and work out the details at export time.
+
+These data frames are saved into a pair of HDF (hierarchical data format)
+files, `pkg.h5` and `cp.h5`, which contain general package information and
+copyright/licensing information respectively.
+
+We generate Semantic MediaWiki pages from this data using one of a pair of
+export scripts. `export.py` exports the pages as a directory containing one
+file per page. `export_json.py` exports the pages as a single JSON file. This
+JSON file can be converted to a directory of wiki pages using the
+`json_to_wiki.py` script.
+
 ## Importing data
 
 Loading data from package files:
@@ -56,3 +75,7 @@ JSON output can be converted to wiki pages:
 
 (Output is in "converted" directory.)
 
+## Running the test suite
+
+    $ python test.py
+
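
Note on the Overview text added in this patch: the "sparse grid" description of a data frame can be illustrated with a minimal sketch. This is not code from the repository. The record contents, field names, and the "unknown" placeholder are assumptions made up for illustration; only the `pandas.DataFrame` constructor and `fillna` are real library calls.

```python
import pandas as pd

# Hypothetical records, as a "load" script might produce them from two
# different sources. Each source supplies a different set of fields.
records = [
    {"name": "hello", "version": "2.8"},
    {"name": "grep", "license": "GPL-3+"},
]

# No schema is declared up front: the columns are the union of the keys seen,
# and cells that a source did not provide are stored as NaN.
df = pd.DataFrame(records)

# Details can be settled later, for example at export time. The "unknown"
# placeholder is illustrative only.
exported = df.fillna({"license": "unknown"})
print(exported)
```

Rows are numbered (0, 1) and fields are named, which matches the description in the README. Writing such a frame to an HDF file would go through `DataFrame.to_hdf`, which needs the PyTables package installed; the keys used inside `pkg.h5` and `cp.h5` are not stated in this patch.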