From ae21ecd58535c030ec1c367bb7715550d9a5015c Mon Sep 17 00:00:00 2001
From: Dafydd Harries
Date: Thu, 23 May 2013 05:31:29 -0400
Subject: add some documentation of data flow

---
 README | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/README b/README
index e408a09..283766a 100644
--- a/README
+++ b/README
@@ -10,6 +10,25 @@
 
 * nose (to run test.py)
 
+## Overview
+
+Data is collected from various sources by the "load" scripts and converted to
+the Pandas library's "data frame" structure, which is somewhat similar to a
+SQL database except that there is no schema. Or to put it another way, it's
+like a sparse grid that has named fields along one axis and numbered rows on
+the other. This approach means that we can import data fairly directly from
+fairly messy sources and work out the details at export time.
+
+These data frames are saved into a pair of HDF (hierarchical data format)
+files, `pkg.h5` and `cp.h5`, which contain general package information and
+copyright/licensing information respectively.
+
+We generate Semantic MediaWiki pages from this data using one of a pair of
+export scripts. `export.py` exports the pages as a directory containing one
+file per page. `export_json.py` exports the pages as a single JSON file. This
+JSON file can be converted to a directory of wiki pages using the
+`json_to_wiki.py` script.
+
 ## Importing data
 
 Loading data from package files:
@@ -56,3 +75,7 @@ JSON output can be converted to wiki pages:
 
 (Output is in "converted" directory.)
 
+## Running the test suite
+
+    $ python test.py
+
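The overview added above describes the data flow: load scripts feed pandas data frames, which are then written to `pkg.h5` and `cp.h5`. The sketch below illustrates that flow with pandas' standard `DataFrame.to_hdf` and `read_hdf` calls; the column names and sample rows are invented for illustration and are not taken from this patch or the project's load scripts.

    import pandas as pd

    # Rows from a messy source need not share the same fields; the resulting
    # frame is the "sparse grid" described above, with NaN for missing values.
    packages = pd.DataFrame([
        {"name": "foo", "version": "1.0", "homepage": "http://example.org/foo"},
        {"name": "bar", "version": "2.3"},  # no homepage recorded for this row
    ])

    # Persist the frame to an HDF5 file (requires the PyTables package).
    packages.to_hdf("pkg.h5", key="packages", mode="w")

    # At export time the frame is read back and the details are worked out.
    restored = pd.read_hdf("pkg.h5", "packages")
    print(restored)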
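The final conversion step can be pictured as below. This is a hypothetical sketch of what a `json_to_wiki.py`-style script does, not the project's actual code; in particular, the assumption that the JSON file maps page titles to wiki text is a guess about the format rather than something documented in this patch.

    import json
    import os

    def json_to_wiki(json_path, out_dir="converted"):
        # Assumed JSON shape: {"Page title": "wiki text", ...} -- a guess,
        # not the project's documented format.
        with open(json_path) as f:
            pages = json.load(f)
        os.makedirs(out_dir, exist_ok=True)
        for title, text in pages.items():
            # Keep path separators out of file names derived from page titles.
            filename = title.replace("/", "_") + ".wiki"
            with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as out:
                out.write(text)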