## Dependencies

 * python-debian

 * python-nose (to run test.py)
     <https://nose.readthedocs.org/en/latest/>

 * python-pandas
     <http://pandas.pydata.org/>

 * pv

## Overview

Data is collected from various sources by the "load" scripts and converted to
the Pandas library's "data frame" structure, which is somewhat similar to a
table in a SQL database except that there is no fixed schema. To put it another
way, it is like a sparse grid that has named fields along one axis and numbered
rows along the other. This approach means that we can import data fairly
directly from fairly messy sources and work out the details at export time.
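
As a rough illustration of the "sparse grid" idea, the sketch below builds a
data frame from records that do not share the same fields. The records and
field values are made up for the example and are not taken from the load
scripts; only the general Pandas behaviour (missing fields become NaN) is the
point.

```python
import pandas as pd

# Hypothetical records: each source supplies only the fields it happens to have.
records = [
    {"Package": "example-a", "Version": "1.0-1"},
    {"Package": "example-b", "Homepage": "http://example.org/"},
]

# No schema is declared up front; the columns are the union of all field names.
frame = pd.DataFrame(records)

# Fields that a record did not provide are stored as NaN (the "sparse" cells).
missing = frame.isna()

print(frame)
print(missing)
```

Rows are numbered from 0 and the named fields become columns, so a field that
only some sources provide simply stays empty for the others.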

These data frames are saved into a pair of HDF (hierarchical data format)
files, `pkg.h5` and `cp.h5`, which contain general package information and
copyright/licensing information respectively.

We generate Semantic MediaWiki pages from this data using one of a pair
of export scripts. `export.py` exports the pages as a directory
containing one file per page. `export_json.py` exports the list of pages
as a single JSON file named index.json. This JSON file can be converted
to a directory of wiki pages using the `json_to_wiki.py` script.

To import and export all packages, run:

    $ ./doall.sh

## Importing data from Debian

Loading data from package files:

    $ pv .../Packages | python load_packages.py

Packages files can be obtained from Debian mirrors, and are cached by APT in
/var/lib/apt/lists.
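
For orientation, a Packages file is a sequence of paragraphs separated by blank
lines, each made of `Field: value` lines, with continuation lines starting with
whitespace. The sketch below parses that layout using only the standard
library. It is an illustration of the file format, not a description of how
`load_packages.py` works; the function name `iter_stanzas` and the sample data
are invented for the example. In real use, the `debian.deb822` module from
python-debian is probably the better tool.

```python
import io


def iter_stanzas(lines):
    """Yield one dict per blank-line-separated paragraph (hypothetical helper)."""
    fields, last = {}, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current paragraph.
            if fields:
                yield fields
            fields, last = {}, None
        elif line[0] in " \t" and last is not None:
            # Continuation line: append to the most recent field.
            fields[last] += "\n" + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last = name
            fields[name] = value.strip()
    if fields:
        # The last paragraph may not be followed by a blank line.
        yield fields


# Made-up sample in the Packages layout.
SAMPLE = """\
Package: example-a
Version: 1.0-1
Description: first example
 continuation line

Package: example-b
Version: 2.0-1
"""

stanzas = list(iter_stanzas(io.StringIO(SAMPLE)))
for stanza in stanzas:
    print(stanza["Package"], stanza["Version"])
```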

Loading package descriptions:

    $ pv .../Translation-en | python load_descriptions.py

Loading data from copyright files:

    $ python load_copyright.py main/*/*/current/copyright | tee cp_import.log


## Exporting data

One package:

    $ python export.py pandoc

All packages, as wiki pages:

    $ python export.py

(Output is written to the "output" directory.)

All packages, as JSON:

    $ python export_json.py

JSON output can be converted to wiki pages:

    $ python json_to_wiki.py < packages.json

(Output is written to the "converted" directory.)
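
The conversion from a single JSON file to a directory of pages can be pictured
roughly as follows. This is a sketch under a loud assumption: it supposes the
JSON is a mapping from page title to page text, which this README does not
confirm. The helper `write_pages`, the `Entry` template name, and the sample
pages are all hypothetical and are not taken from `json_to_wiki.py`.

```python
import json
import os
import tempfile


def write_pages(pages, out_dir):
    """Write one file per page; returns the paths written (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for title, text in sorted(pages.items()):
        # Assumption: slashes in titles are replaced so each page is one flat file.
        filename = title.replace("/", "_")
        path = os.path.join(out_dir, filename)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        written.append(path)
    return written


# Made-up input in the assumed {title: text} shape.
index = json.loads(
    '{"Example-a": "{{Entry|Name=example-a}}\\n",'
    ' "Example-b": "{{Entry|Name=example-b}}\\n"}'
)

# A temporary directory stands in for the real output location.
out_dir = os.path.join(tempfile.mkdtemp(), "converted")
paths = write_pages(index, out_dir)
print(paths)
```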

## Running the test suite

    $ python test.py