This project contains the source code used for the automatic classification of the texts found in
the acts of the Italian Parliament, starting from the 16th legislature.

References
==========

MySQL dump of the 16th legislature (anonymized user data)
  https://s3.amazonaws.com/op_backup/opp16_anonym.sql.gz

MySQL dump of the 17th legislature (anonymized user data)
  https://s3.amazonaws.com/op_backup/opp17_anonym.sql.gz

Database ER design
  See the files ``docs/opp_model.png`` and ``docs/opp_model.mwb`` (MySQL Workbench)


Contents
========

docs:
  the ER schema, as a low-resolution PNG image;
  the ER schema, as a MySQL Workbench file;
  the SQL queries used for the views that extract texts and categories.
    

Usage
=====

Download the SQL dump, then decompress it and restore it into a MySQL instance::

    wget https://s3.amazonaws.com/op_backup/opp16_anonym.sql.gz
    mysql -uroot -e "create database opp16 default charset utf8;"
    gunzip -c opp16_anonym.sql.gz | mysql -uroot opp16
    
Use the views to extract the texts or the categories for each act::

    select * from atto_texts;

    select * from atto_tags;
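
The same two queries can also be run from Python. The following is a minimal sketch, not the project's own code: ``fetch_view`` is a hypothetical helper that accepts any DB-API 2.0 connection. With the restored dump, that connection would come from a MySQL driver of your choice. The demo at the bottom uses an in-memory SQLite table as a stand-in so that it runs without a database server, and its column names (``atto_id``, ``testo``) are assumptions, not the real view schema.

```python
import sqlite3

# The two views named in this README; anything else is rejected so that
# the view name can be safely interpolated into the query string.
VIEWS = ("atto_texts", "atto_tags")


def fetch_view(connection, view_name):
    """Return all rows of a view as a list of dicts (column name -> value).

    `connection` is any DB-API 2.0 connection object.
    """
    if view_name not in VIEWS:
        raise ValueError("unknown view: %r" % (view_name,))
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT * FROM " + view_name)
        columns = [description[0] for description in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


if __name__ == "__main__":
    # Stand-in for the restored MySQL database: a tiny in-memory table
    # with assumed column names, only to show the call shape.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE atto_texts (atto_id INTEGER, testo TEXT)")
    demo.execute("INSERT INTO atto_texts VALUES (1, 'testo di esempio')")
    for row in fetch_view(demo, "atto_texts"):
        print(row)
```

Returning plain dicts keeps the rows easy to hand over to pandas or to write out as CSV.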
    
Build the test and training sets from these extractions, either directly (SQLAlchemy) or indirectly (CSV export plus pandas, or other means).
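
As an illustration of the indirect route, here is a minimal sketch that uses only the Python standard library. It assumes that each view has been exported to CSV with a header row, and that both exports share an act identifier column. The column names ``atto_id``, ``testo`` and ``tag`` and all three helper functions are hypothetical; adapt them to the actual view definitions in ``docs``.

```python
import csv
import io
import random


def load_rows(csv_file):
    """Read a CSV export (with a header row) into a list of dicts."""
    return list(csv.DictReader(csv_file))


def join_texts_and_tags(texts, tags, key="atto_id", tag_column="tag"):
    """Attach to each text row the list of tags of the same act."""
    tags_by_act = {}
    for row in tags:
        tags_by_act.setdefault(row[key], []).append(row[tag_column])
    return [dict(row, tags=tags_by_act.get(row[key], [])) for row in texts]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly, then split the rows into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Inline stand-ins for the two CSV exports (assumed column names).
    texts_csv = io.StringIO("atto_id,testo\n1,primo\n2,secondo\n3,terzo\n")
    tags_csv = io.StringIO("atto_id,tag\n1,economia\n1,lavoro\n3,sanita\n")
    rows = join_texts_and_tags(load_rows(texts_csv), load_rows(tags_csv))
    train, test = train_test_split(rows, test_fraction=0.34, seed=42)
    print(len(train), "training rows;", len(test), "test rows")
```

A fixed seed keeps the split reproducible between runs. Acts without any tag are kept with an empty tag list, so filtering them out is left as an explicit decision.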