Introduction
|
|
Importing Data with Pandas
|
Use import to load a library and make it available to your own code.
Use the help() function to see the built-in documentation for a library.
Import data into Python with the pandas library.
The info() method of a pandas DataFrame displays a summary of your imported data.
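The points above can be sketched in a few lines. This is a minimal illustration, not the lesson's own dataset: the column names and values are hypothetical, and the CSV is read from an in-memory string so the example runs without a file on disk.

```python
import io

import pandas as pd  # "import" makes the pandas library available as pd

# Hypothetical circulation export; in practice you would pass a file path
# such as pd.read_csv("checkouts.csv") instead of an in-memory string.
csv_text = """patron_id,item_barcode,checkout_date
1001,B0001,2023-01-05
1002,B0002,2023-01-06
1001,B0003,2023-01-09
"""

# read_csv accepts a path or any file-like object
checkouts = pd.read_csv(io.StringIO(csv_text))

# Uncomment to print the built-in documentation for read_csv:
# help(pd.read_csv)

# info() prints the column names, non-null counts, and data types
checkouts.info()
```

Running help(pd) instead would show the documentation for the library as a whole.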
|
Working With Data
|
|
PII and Other Risky Data
|
Personally Identifiable Information (PII) comes in two types.
In a library context, PII 1 is information about a patron (e.g. name, date of birth, library barcode).
PII 2 is information about a patron's activities, along with other information that can be linked back to that patron (e.g. search history, circulation records, access to electronic resources).
By making connections within a pool of data, it is possible to identify specific patrons and their activities.
Limiting the data we collect and how long we retain it can help mitigate these risks.
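To make the linking risk concrete, here is a small sketch with pandas. Both tables, their column names, and every value are invented for illustration. The idea is that a checkout log with names removed can still be joined to a second source that shares a few columns.

```python
import pandas as pd

# Hypothetical checkout log: names removed, but zip code and birth year kept
checkouts = pd.DataFrame({
    "zip_code": ["02139", "02139", "02140"],
    "birth_year": [1985, 1990, 1985],
    "title": ["Title A", "Title B", "Title C"],
})

# Hypothetical second source, e.g. an event sign-up sheet
signups = pd.DataFrame({
    "name": ["Patron One", "Patron Two"],
    "zip_code": ["02139", "02140"],
    "birth_year": [1985, 1985],
})

# Joining on the shared columns links names back to checkout titles
linked = signups.merge(checkouts, on=["zip_code", "birth_year"], how="inner")
print(linked)

# Limiting what is kept: drop the columns that made the link possible
minimal = checkouts.drop(columns=["zip_code", "birth_year"])
print(minimal)
```

Dropping columns is only one way to limit data; which fields are safe to keep depends on what other data sources exist.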
|
Parsing Data with Functions
|
Write functions to efficiently run code you want to reuse.
Functions can make use of other functions - those you import from libraries, as well as those you write yourself.
Well-written and tested functions can reliably do things that might be hard to accomplish by hand.
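As a rough sketch of these points, the two functions below are hypothetical examples rather than code from the lesson. The first uses the date type imported from the standard library, and the second reuses the first. The age brackets are arbitrary assumptions.

```python
from datetime import date


def age_on(birth_date, today):
    """Return the number of whole years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if the birthday has not yet occurred this year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_group(birth_date, today):
    """Map a birth date to a coarse age bracket (brackets are illustrative)."""
    age = age_on(birth_date, today)  # reuse the function written above
    if age < 18:
        return "under 18"
    if age < 65:
        return "18-64"
    return "65+"


print(age_group(date(2010, 6, 1), date(2024, 1, 15)))
```

Because the logic lives in one place, a small set of checks (for example, a birthday that falls tomorrow) can confirm it behaves as intended before it is applied to every row of a dataset.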
|
De-identification
|
De-identification is the process of removing or obscuring PII, such that the remaining information does not identify an individual.
De-identified information can be re-identified, given access to the right information (e.g. the algorithm or pseudonym used for de-identification or sufficient data from other sources about the patrons in the original data).
Anonymization is the process of de-identifying information in such a way that it cannot be re-identified, usually by means of statistical disclosure limitation techniques.
Due to continuous advances in computing technology, full anonymity is difficult (some would say impossible) to guarantee.
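One common way to obscure an identifier is to replace it with a pseudonym. The sketch below uses a keyed hash (HMAC with SHA-256) from the Python standard library. This is one possible technique, not necessarily the one the lesson uses, and the key, barcodes, and field names are hypothetical. It is de-identification rather than anonymization: anyone holding the key can recompute the pseudonyms and re-identify patrons.

```python
import hashlib
import hmac

# Hypothetical secret key; whoever holds it can recompute the pseudonyms
SECRET_KEY = b"replace-with-a-randomly-generated-secret"


def pseudonymize(barcode, key=SECRET_KEY):
    """Replace a barcode with a keyed SHA-256 digest (a pseudonym)."""
    digest = hmac.new(key, barcode.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical circulation records
records = [
    {"barcode": "21234000000001", "title": "Title A"},
    {"barcode": "21234000000002", "title": "Title B"},
    {"barcode": "21234000000001", "title": "Title C"},
]

# The same barcode always maps to the same pseudonym, so a patron's
# records stay linked to each other without exposing the barcode itself.
deidentified = [
    {"patron": pseudonymize(r["barcode"]), "title": r["title"]}
    for r in records
]

for row in deidentified:
    print(row)
```

Note that the titles themselves remain in the output, so the records could still be re-identified with sufficient data from other sources.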
|
Aggregation & Re-identification
|
Data aggregation is the process of combining data in such a way that it no longer refers to specific individuals, but rather reveals insight about groups within the population.
Data which is both de-identified and aggregated can still be valuable for analysis while posing less risk to the privacy of our patrons.
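A minimal sketch of aggregation with pandas follows. The table is invented for illustration, and the column names are assumptions. Each input row stands for one checkout with identifiers already removed, and the output reports only counts per group.

```python
import pandas as pd

# Hypothetical de-identified checkout log: one row per checkout
checkouts = pd.DataFrame({
    "age_group": ["under 18", "18-64", "18-64", "65+", "18-64"],
    "branch": ["Main", "Main", "East", "East", "Main"],
})

# Count checkouts per branch and age group; individual rows disappear
counts = (
    checkouts.groupby(["branch", "age_group"])
    .size()
    .reset_index(name="checkouts")
)
print(counts)
```

The resulting table describes groups rather than individuals, although very small groups can still point to a single patron.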
|