Diffstat (limited to 'README')
-rw-r--r--  README  23
1 file changed, 23 insertions, 0 deletions
@@ -10,6 +10,25 @@
 * nose (to run test.py)
   <https://nose.readthedocs.org/en/latest/>
 
+## Overview
+
+Data is collected from various sources by the "load" scripts and converted to
+the Pandas library's "data frame" structure, which is somewhat similar to a
+SQL database except that there is no schema. Or to put it another way, it's
+like a sparse grid that has named fields along one axis and numbered rows on
+the other. This approach means that we can import data fairly directly from
+messy sources and work out the details at export time.
+
+These data frames are saved into a pair of HDF (hierarchical data format)
+files, `pkg.h5` and `cp.h5`, which contain general package information and
+copyright/licensing information respectively.
+
+We generate Semantic MediaWiki pages from this data using one of a pair of
+export scripts. `export.py` exports the pages as a directory containing one
+file per page. `export_json.py` exports the pages as a single JSON file. This
+JSON file can be converted to a directory of wiki pages using the
+`json_to_wiki.py` script.
+
 ## Importing data
 
 Loading data from package files:
@@ -56,3 +75,7 @@ JSON output can be converted to wiki pages:
 
 (Output is in "converted" directory.)
 
+## Running the test suite
+
+    $ python test.py
+
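
For readers new to Pandas, here is a minimal sketch of the "sparse grid" structure the new Overview section describes: named fields along one axis, numbered rows along the other, and no up-front schema, so a record that lacks a field simply leaves a hole in the grid. The column names below are made up for illustration and are not this project's actual fields.

    import pandas as pd

    # Two records with different fields; no schema is declared in advance.
    rows = [
        {"package": "foo", "version": "1.0", "license": "MIT"},
        {"package": "bar", "version": "2.1"},  # no license recorded yet
    ]
    df = pd.DataFrame(rows)

    # The missing cell becomes NaN rather than an error:
    #   package version license
    # 0     foo     1.0     MIT
    # 1     bar     2.1     NaN
    print(df)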
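
The HDF round trip mentioned in the Overview can likewise be sketched with Pandas' built-in HDF5 support (this requires the PyTables package; the key name "packages" is an assumption for illustration, not necessarily the layout actually used in `pkg.h5`):

    import pandas as pd

    df = pd.DataFrame({"package": ["foo", "bar"], "version": ["1.0", "2.1"]})

    # Write the frame into an HDF5 file under a named key, then read it back.
    # The key "packages" is a hypothetical name, not the project's real layout.
    df.to_hdf("pkg.h5", key="packages")
    restored = pd.read_hdf("pkg.h5", "packages")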