author    Dafydd Harries <daf@rhydd.org>    2013-05-23 05:31:29 -0400
committer Dafydd Harries <daf@rhydd.org>    2013-05-23 05:31:29 -0400
commit    ae21ecd58535c030ec1c367bb7715550d9a5015c (patch)
tree      a984381147277770c37fea54b293e5c2c7963684 /README
parent    473c023c1cf4f08ab489f02ebe5bd6fba90595b3 (diff)
add some documentation of data flow
Diffstat (limited to 'README')
-rw-r--r--    README    23
1 file changed, 23 insertions, 0 deletions
diff --git a/README b/README
index e408a09..283766a 100644
--- a/README
+++ b/README
@@ -10,6 +10,25 @@
* nose (to run test.py)
<https://nose.readthedocs.org/en/latest/>
+## Overview
+
+Data is collected from various sources by the "load" scripts and converted to
+the Pandas library's "data frame" structure, which resembles a SQL table
+without a fixed schema: a sparse grid with named fields along one axis and
+numbered rows along the other. This approach lets us import data fairly
+directly from messy sources and work out the details at export time.
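+
+For illustration only, the resulting structure looks roughly like this (the
+package and field names below are made up, not the project's actual ones):
+
+    import pandas
+
+    # Rows need not share the same fields; missing values simply become NaN,
+    # which is what makes the grid "sparse".
+    frame = pandas.DataFrame([
+        {"package": "foo", "version": "1.0"},
+        {"package": "bar", "license": "GPL-2+"},
+    ])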
+
+These data frames are saved into a pair of HDF (hierarchical data format)
+files, `pkg.h5` and `cp.h5`, which contain general package information and
+copyright/licensing information respectively.
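+
+As a rough, self-contained sketch, a frame is written to and read back from
+one of these files with pandas' HDFStore (the key name here is illustrative,
+not necessarily the one the scripts use):
+
+    import pandas
+
+    frame = pandas.DataFrame({"package": ["foo", "bar"]})
+
+    # Store the frame in pkg.h5 under a key, then load it back.
+    store = pandas.HDFStore("pkg.h5")
+    store["packages"] = frame
+    frame = store["packages"]
+    store.close()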
+
+We generate Semantic MediaWiki pages from this data using one of a pair of
+export scripts. `export.py` exports the pages as a directory containing one
+file per page. `export_json.py` exports the pages as a single JSON file. This
+JSON file can be converted to a directory of wiki pages using the
+`json_to_wiki.py` script.
+
## Importing data
Loading data from package files:
@@ -56,3 +75,7 @@ JSON output can be converted to wiki pages:
(Output is in "converted" directory.)
+## Running the test suite
+
+ $ python test.py
+