szeitlin/CAMLIS_notes_2019-10-26.md

## CAMLIS_notes_2019-10-26.md

      
**Aleatha Parker-Wood, PhD -- Keynote**
HuMu, Symantec, many security-related patents
BYOD (bring your own device) - bigger attack surface
DLP (data loss prevention) - need more data, but that expands your attack surface
Need strict ACLs - have to avoid letting marketing use data intended only for security models
Encryption is not a magic bullet
"You can't store enough data to protect all your data"
Have to worry about model inversion attacks
"privacy for all users... except the bad ones"
Incremental learning, e.g. Bloom filter
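A minimal sketch of the Bloom filter idea, not from the talk: items are folded into a fixed-size bit array as they arrive, so the raw items never need to be stored. The class name, the sizes, and the salted SHA-256 hashing are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 hash (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for domain in ["evil.example", "phish.example"]:  # made-up items
    seen.add(domain)  # only the bits are kept, not the domain itself
```

A lookup such as `"evil.example" in seen` is always true for items that were added; items that were never added can occasionally also test true, which is the price of the small fixed footprint.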
Sketching - learning summary statistics as you go, e.g. rolling averages, hyperloglog
Robust against threats, low space needs
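A small sketch of "learning summary statistics as you go", assuming Welford's running mean/variance update as the example (the talk only mentioned rolling averages and HyperLogLog by name). The class name and the sample values are made up.

```python
class RunningStats:
    """Running mean and variance in O(1) memory (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for latency_ms in [12.0, 15.0, 11.0, 14.0]:  # made-up stream
    stats.update(latency_ms)  # each value can be dropped after the update
```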
Online learning, e.g. a Hoeffding tree (an online decision tree), or stochastic gradient descent on mini-batches
Harder to do model selection because "you can't go back to the original data" (or have to store some)
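A rough sketch of online learning with SGD on mini-batches, assuming a two-weight logistic regression and a synthetic stream (both are illustrative, not from the talk). Each batch is used for one gradient step and then thrown away, which is why going back for model selection is hard.

```python
import math
import random


def predict(weights, x):
    """Probability of the positive class for a single feature x."""
    z = weights[0] + weights[1] * x
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, batch, lr=0.1):
    """One mini-batch gradient step for logistic regression."""
    grad = [0.0, 0.0]
    for x, label in batch:
        error = predict(weights, x) - label
        grad[0] += error       # bias term
        grad[1] += error * x   # feature term
    return [w - lr * g / len(batch) for w, g in zip(weights, grad)]


def stream_of_batches(num_batches, batch_size, rng):
    # Hypothetical data source: the label is 1 when the feature is positive.
    for _ in range(num_batches):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        yield [(x, 1.0 if x > 0 else 0.0) for x in xs]


rng = random.Random(0)
weights = [0.0, 0.0]
for batch in stream_of_batches(num_batches=500, batch_size=16, rng=rng):
    weights = sgd_step(weights, batch)  # the batch is not kept afterwards
```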
Data poisoning attacks - big area of research, see IEEE Security & Privacy
Differential Privacy - ref Dwork 2006 (not available for free)
Carefully calibrated noise to cover user identity
Add a randomization factor (unbiased error) - the amount is controlled by epsilon, the privacy parameter (described as a "probability correction"), and some variants also add a delta
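A minimal sketch of calibrated noise, assuming the Laplace mechanism on a counting query (the talk did not name a specific mechanism in these notes). The function names and the numbers are illustrative; the noise scale is sensitivity / epsilon, so a smaller epsilon means more noise.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with zero-mean noise scaled to sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)
# Five noisy releases of the same made-up count of 100.
noisy = [private_count(100, epsilon=0.5, rng=rng) for _ in range(5)]
```

Because the noise is zero-mean, averages over many releases stay close to the true value, while any single release hides whether one user was included.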
In theory it sounds great, but she says in practice, not so much
ex. Palpatine is CEO, Padme is Eng, Anakin is Sales
Differencing Attack - Anakin didn't respond to the survey, so Palpatine can easily identify Padme's answers
Solution: use a "high water mark"
tl;dr over-report just a little all the time (with fake data I guess?)
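The differencing attack above in a few lines, as a sketch: the survey scores are made up, and `total_score` stands in for whatever aggregate-only view the survey tool exposes (a hypothetical name).

```python
# Anakin did not respond, so only two answers are in the aggregate.
responses = {"Palpatine": 2, "Padme": 5}  # made-up satisfaction scores


def total_score(data):
    """The only query exposed: a sum over all respondents."""
    return sum(data.values())


# Palpatine knows his own answer and sees the aggregate, so he can subtract.
padme_inferred = total_score(responses) - responses["Palpatine"]
```

With so few respondents, an exact aggregate leaks an individual answer, which is the motivation for padding or perturbing what gets reported.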
2020 Census will use differential privacy; Apple uses it for predictive text
Advantages: prevents overfitting (Abadi 2016), protects against breaches, poisoning, and insider attacks
Disadvantages: suppresses outliers (so not great for anomaly detection or security investigations), choosing epsilon is "a black art", requires more data because it is noisy by design
Private multi-party ML - distributed learning across mutually distrusting systems
ex. predictive text where data stays local to the phone, great for GDPR
Different kinds of solutions:
Secure differential privacy - fast, but accuracy issues
Secure multiparty computation (SMC) - high I/O overhead (due to multi-round communication required), but higher accuracy
Homomorphic encryption - special case of SMC -- never decrypt
Truex et al. (paper coming out soon) - hybrid of differential privacy + homomorphic encryption
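To make the multi-round communication cost of SMC concrete, here is a sketch of a secure sum using additive secret sharing. This is one textbook SMC building block chosen for illustration, not the scheme from the talk or from Truex et al.; `MODULUS` and the function names are assumptions.

```python
import secrets

MODULUS = 2**61 - 1  # assumed; must be larger than any sum we expect


def share(value, num_parties):
    """Split value into random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    n = len(private_values)
    # Round 1: every party splits its value and sends one share to each party.
    all_shares = [share(v, n) for v in private_values]
    # Round 2: every party adds the shares it received and publishes a subtotal.
    subtotals = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # The subtotals reveal the overall sum but no single party's input.
    return sum(subtotals) % MODULUS


total = secure_sum([3, 10, 29])  # made-up private inputs from three parties
```

The result is exact (unlike the noisy differential privacy route), but every party has to exchange messages with every other party, which is where the I/O overhead comes from.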
Papernot et al. PATE - Private Aggregation of Teacher Ensembles. Learn locally on private data, predict on public. Accuracy and efficiency are challenges. Problems on the mathematical side e.g. encryption.
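A sketch of the aggregation step in PATE, under the assumption that each teacher (trained on its own private partition) votes for a label on a public example and Laplace noise is added to the vote counts before taking the winner. The function name, noise scale, and votes are made up.

```python
import random
from collections import Counter


def noisy_vote(teacher_labels, num_classes, noise_scale, rng):
    """Return the label with the highest noise-perturbed vote count."""
    counts = Counter(teacher_labels)
    noisy_counts = []
    for label in range(num_classes):
        # Laplace(0, noise_scale) noise as a difference of two exponentials.
        noise = rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
        noisy_counts.append(counts[label] + noise)
    return max(range(num_classes), key=lambda label: noisy_counts[label])


rng = random.Random(7)
votes = [1] * 95 + [0] * 5  # 100 hypothetical teachers, strong consensus on label 1
student_label = noisy_vote(votes, num_classes=2, noise_scale=1.0, rng=rng)
```

The student model only ever sees these noisy labels on public data, never the private training sets; when teachers are split closely the noise can flip the label, which is part of the accuracy challenge.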
@aleatha
aleatha@humu.com

Felipe Ducau - Describing Malware via Tagging
Sophos - been there 2.5 years
Tokenize information about malware - what type it is, whether it's compressed, etc.
deep learning approaches - binary entropy model
Trained a model on 1 year of data (10M or 76M?), validation set 3M, test set 3.8M
Ended up with 20 features - ref Saxe and Berlin 2015
Multi-head neural nets get 96% coverage
explode to get more tokens, then collapse back down
joint embedding was actually better and more interpretable, comes from computer vision
use the dot product between tag embeddings as the similarity measure (higher = closer) to get relationships between tags
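A toy sketch of comparing tags by dot product in a shared embedding space. The three tags and their 3-d vectors are made up for illustration; the real model learns the embeddings.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical tag embeddings (made-up values, not from the paper).
tag_embeddings = {
    "ransomware": [0.9, 0.1, 0.1],
    "dropper": [0.1, 0.9, 0.2],
    "downloader": [0.2, 0.8, 0.3],
}


def most_similar(tag):
    """Rank the other tags by dot product with `tag`, highest first."""
    query = tag_embeddings[tag]
    scores = [
        (other, dot(query, vec))
        for other, vec in tag_embeddings.items()
        if other != tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("dropper")
```

With these values "downloader" ranks above "ransomware" for the "dropper" query, which is the kind of tag-to-tag relationship the notes refer to.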
mean TPR (true positive rate) of 0.88, overall 0.71
AUC 0.99
pre-print is available on arXiv: https://arxiv.org/abs/1905.06262
ALOHA - Auxiliary Loss Optimization for Hypothesis Augmentation, ref. Rudd et al. 2019
video is here: https://www.usenix.org/conference/usenixsecurity19/presentation/rudd
Can use this approach to cluster threats and prioritize
easy to inspect via t-SNE
tested with known positive controls
they don't have to do any unpacking of the binary with this approach
someone asked why not do topic modeling; he said they tried it and didn't pursue it, but someone should

**Laura Dedic -- Novetta**
CNN-based malware visualization and explainability