## Dependencies

 * pandas
     <http://pandas.pydata.org/>

 * debian
     <https://alioth.debian.org/projects/pkg-python-debian>

 * nose (to run `test.py`)
     <https://nose.readthedocs.org/en/latest/>
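
These can typically be installed with pip; the PyPI package names here are
assumptions (the `debian` module is packaged as `python-debian`):

    $ pip install pandas python-debian nose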

## Overview

Data is collected from various sources by the "load" scripts and converted to
the Pandas library's "data frame" structure, which is somewhat similar to a
SQL table except that there is no schema. To put it another way, it's like a
sparse grid that has named fields along one axis and numbered rows on the
other. This approach means that we can import data directly from fairly messy
sources and work out the details at export time.
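
For example (a toy illustration, not code from the load scripts), records
with different fields can be stacked into one frame and the gaps simply
become missing values:

    import pandas as pd

    # Two records with different fields; absent cells turn into NaN.
    rows = [
        {"package": "pandoc", "section": "text"},
        {"package": "ghc", "homepage": "http://haskell.org/ghc/"},
    ]
    frame = pd.DataFrame(rows)
    print(frame)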

These data frames are saved into a pair of HDF (hierarchical data format)
files, `pkg.h5` and `cp.h5`, which contain general package information and
copyright/licensing information respectively.
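
The stores can be inspected interactively with pandas (the key name below is
a guess; `store.keys()` lists what is actually present):

    import pandas as pd

    store = pd.HDFStore("pkg.h5")
    print(store.keys())           # names of the data frames in this file
    # frame = store["packages"]   # hypothetical key name
    store.close()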

We generate Semantic MediaWiki pages from this data using one of two export
scripts. `export.py` exports the pages as a directory containing one file per
page. `export_json.py` exports the pages as a single JSON file, which can be
converted to a directory of wiki pages with the `json_to_wiki.py` script.

## Importing data from Debian

Loading data from package files:

    $ pv .../Packages | python load_packages.py

Packages files can be obtained from Debian mirrors, and are cached by APT in
`/var/lib/apt/lists`.
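
A Packages file is a series of RFC 822-style paragraphs, which the `debian`
module can parse directly. As a sketch of the general approach (not
necessarily what `load_packages.py` does):

    import sys

    from debian import deb822

    # Each paragraph behaves like a dict keyed by field name.
    for stanza in deb822.Packages.iter_paragraphs(sys.stdin):
        print(stanza["Package"], stanza.get("Version"))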

Loading package descriptions:

    $ pv .../Translation-en | python load_descriptions.py
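
Translation files use the same paragraph format; each entry carries the
extended description plus a `Description-md5` field that Packages entries
reference. A sketch of reading them (again, not `load_descriptions.py`
itself):

    import sys

    from debian import deb822

    # Map each description's md5 to its English text.
    descriptions = {}
    for stanza in deb822.Deb822.iter_paragraphs(sys.stdin):
        descriptions[stanza["Description-md5"]] = stanza["Description-en"]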

Loading data from copyright files:

    $ python load_copyright.py main/*/*/current/copyright | tee cp_import.log
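
For copyright files in the machine-readable (DEP-5) format, the
`debian.copyright` module can parse the paragraphs; many copyright files are
free-form text, though, so this sketch covers only the structured subset and
is not necessarily how `load_copyright.py` works:

    from debian import copyright

    with open("copyright") as f:
        # Raises NotMachineReadableError for free-form files.
        parsed = copyright.Copyright(f)
        for para in parsed.all_files_paragraphs():
            print(para.files, para.license.synopsis)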

Unfortunately, I don't know of an easy, quick way to get copyright files for
all packages in main if you are not a Debian developer. I obtained them by
logging into powell.debian.org (which hosts packages.debian.org) and running:

    $ cd /srv/packages.debian.org/www/changelogs/pool
    $ tar -zchf ~/copyright.tar.gz main/*/*/current/copyright

## Exporting data

One package:

    $ python export.py Pandoc

All packages, as wiki pages:

    $ python export.py

(Output is in "output" directory.)

All packages, as JSON:

    $ python export_json.py

JSON output can be converted to wiki pages:

    $ python json_to_wiki.py < packages.json

(Output is in "converted" directory.)

## Running the test suite

    $ python test.py