From ae21ecd58535c030ec1c367bb7715550d9a5015c Mon Sep 17 00:00:00 2001
From: Dafydd Harries
Date: Thu, 23 May 2013 05:31:29 -0400
Subject: add some documentation of data flow

---
 README | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/README b/README
index e408a09..283766a 100644
--- a/README
+++ b/README
@@ -10,6 +10,25 @@
 
 * nose (to run test.py)
 
+## Overview
+
+Data is collected from various sources by the "load" scripts and converted to
+the Pandas library's "data frame" structure, which is somewhat similar to a
+SQL database except that there is no schema. Or to put it another way, it's
+like a sparse grid that has named fields along one axis and numbered rows on
+the other. This approach means that we can import data fairly directly from
+fairly messy sources and work out the details at export time.
+
+These data frames are saved into a pair of HDF (hierarchical data format)
+files, `pkg.h5` and `cp.h5`, which contain general package information and
+copyright/licensing information respectively.
+
+We generate Semantic MediaWiki pages from this data using one of a pair of
+export scripts. `export.py` exports the pages as a directory containing one
+file per page. `export_json.py` exports the pages as a single JSON file. This
+JSON file can be converted to a directory of wiki pages using the
+`json_to_wiki.py` script.
+
 ## Importing data
 
 Loading data from package files:
@@ -56,3 +75,7 @@ JSON output can be converted to wiki pages:
 
 (Output is in "converted" directory.)
 
+## Running the test suite
+
+    $ python test.py
+
-- 
cgit v1.2.3
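
The overview added by this patch describes a load -> data frame -> HDF -> export
pipeline. The following is a minimal sketch of that flow using pandas; the
DataFrame columns, the HDF key name "packages", and the output filename
packages.json are illustrative assumptions, not names taken from the
repository (writing HDF files also requires the PyTables package).

    import pandas as pd

    # "Load" step: collect records from a messy source into a data frame.
    # Fields missing from a record simply become NaN, so no fixed schema
    # is needed up front ("sparse grid" of named fields vs. numbered rows).
    rows = [
        {"package": "hello", "version": "2.8",
         "homepage": "http://www.gnu.org/software/hello/"},
        {"package": "ed", "version": "1.9"},  # no homepage recorded
    ]
    df = pd.DataFrame(rows)

    # Save step: persist the frame into an HDF file such as pkg.h5
    # (the key name "packages" is an assumption for this example).
    with pd.HDFStore("pkg.h5") as store:
        store["packages"] = df

    # Export step: read the frame back and dump it as JSON, roughly the
    # kind of work export_json.py does before json_to_wiki.py converts
    # the JSON into wiki pages.
    with pd.HDFStore("pkg.h5") as store:
        packages = store["packages"]

    with open("packages.json", "w") as f:
        f.write(packages.to_json(orient="records"))

The missing homepage for "ed" appears as NaN in the frame and as null in the
JSON output, which matches the schema-free description in the overview.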