blob: aef4670295bcf1d83237a2b7e9502bece849675c (
plain) (
blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
|
## Dependencies
* pandas
<http://pandas.pydata.org/>
* debian
<https://alioth.debian.org/projects/pkg-python-debian>
* nose (to run test.py)
<https://nose.readthedocs.org/en/latest/>
## Overview
Data is collected from various sources by the "load" scripts and converted to
the Pandas library's "data frame" structure, which is somewhat similar to a
SQL database except that there is no schema. Or to put it another way, it's
like a sparse grid that has named fields along one axis and numbered rows on
the other. This approach means that we can import data fairly direcetly from
fairly messy sources and work out the details at export time.
These data frames are saved into a pair of HDF (hierarchical data format)
files, `pkg.h5` and `cp.h5`, which contain general package information and
copyright/licensing information respectively.
We generate Semantic MediaWiki pages from this data using one of a pair
of export scripts. `export.py` exports the pages as a directory
containing one file per page. `export_json.py` exports the list of pages
as a single JSON file named index.json. This JSON file can be converted
to a directory of wiki pages using the `json_to_wiki.py` script.
## Importing data from debian
Loading data from package files:
$ pv .../Packages python | python load_packages.py
Packages files can be obtained from Debian mirrors, and are cached by APT in
/var/lib/apt/lists.
Loading package descriptions:
$ pv .../Translation-en | python load_descriptions.py
Loading data from copyright files:
$ python load_copyright.py main/*/*/current/copyright | tee cp_import.log
Unfortunately, I don't know of a way to easily and quickly get copyright files
for all packages in main if you are not a Debian developer. I obtained them by
logging into powell.debian.org (which hosts.packages.debian.org) and running:
$ cd /srv/packages.debian.org/www/changelogs/pool
$ tar -zchf ~/copyright.tar.gz main/*/*/current/copyright
## Exporting data
One package:
$ python export.py Pandoc
All packages, as wiki pages:
$ python export.py
(Output is in "output" directory.)
All packages, as JSON:
$ python export_json.py
JSON output can be converted to wiki pages:
$ python json_to_wiki.py < packages.json
(Output is in "converted" directory.)
## Running the test suite
$ python test.py
|