## Dependencies

* python-debian
* python-nose (to run test.py)
* python-pandas
* pv

## Overview

Data is collected from various sources by the "load" scripts and converted to
the Pandas library's "data frame" structure, which is somewhat similar to a
table in a SQL database except that there is no schema. Or to put it another
way, it's like a sparse grid that has named fields along one axis and numbered
rows on the other. This approach means that we can import data fairly directly
from fairly messy sources and work out the details at export time.

These data frames are saved into a pair of HDF (hierarchical data format)
files, `pkg.h5` and `cp.h5`, which contain general package information and
copyright/licensing information respectively.

We generate Semantic MediaWiki pages from this data using one of a pair of
export scripts. `export.py` exports the pages as a directory containing one
file per page. `export_json.py` exports the list of pages as a single JSON
file named index.json. This JSON file can be converted to a directory of wiki
pages using the `json_to_wiki.py` script.

To import and export all packages, run:

    $ ./runall.sh

## Importing data from Debian

Loading data from package files:

    $ pv .../Packages | python load_packages.py

Packages files can be obtained from Debian mirrors, and are cached by APT in
/var/lib/apt/lists.

Loading package descriptions:

    $ pv .../Translation-en | python load_descriptions.py

Loading data from copyright files:

    $ python load_copyright.py main/*/*/current/copyright | tee cp_import.log

## Exporting data

One package:

    $ python export.py pandoc

All packages, as wiki pages:

    $ python export.py

(Output is in "output" directory.)

All packages, as JSON:

    $ python export_json.py

JSON output can be converted to wiki pages:

    $ python json_to_wiki.py < packages.json

(Output is in "converted" directory.)

## Running the test suite

    $ python test.py
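
## Example: a schema-less data frame

The Overview describes the data frame as a sparse grid with named fields on one axis and numbered rows on the other. The sketch below illustrates that idea only; it is not taken from the load or export scripts. The records, their field values, and the variable names are all hypothetical, and the choice of an empty string as the export default is an assumption made for the example.

```python
import pandas as pd

# Hypothetical records, as a loader might see them: each source supplies
# a different subset of fields.
records = [
    {"Package": "example-a", "Version": "1.0", "Section": "utils"},
    {"Package": "example-b", "Homepage": "http://example.org/b"},
]

# No schema is declared up front; the columns are the union of the fields seen.
frame = pd.DataFrame(records)

# A field that a record lacks is stored as a missing value (NaN).
missing_version = pd.isna(frame.loc[1, "Version"])

# The details can be settled at export time, here by substituting an
# empty string for every missing value (an assumed default).
export_rows = frame.fillna("").to_dict(orient="records")
```

Because nothing is rejected at import time, a messy source only produces more empty cells, and the decision about how to treat them is deferred to the export step.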