Data Science Experience - Data Sciences
The final class in the Data Science Certificate program is a capstone project in which students tackle a realistic problem using realistic (or real) data. Here are a few recent capstone projects:
Our study of the USASpending data demonstrates that identifying important variables related to best spending practices can be used to find trends and contracts that stand out as unusual. We show anomalies in multiple ways: (1) visual comparison of averages, (2) outlier identification of spending patterns, (3) secondary testing for values, (4) a more in-depth analysis of an awarding office, and (5) an application to view multiple results at once.
Anomaly Detection Approaches
MAJ Smith & Dr. John Alt
Initial anomaly detection methods identified ~1,000 anomalous records out of 1.45 million, reducing the records for an analyst to explore manually by over 99%.
Clustering: a technique that puts objects into groups whose members share similar characteristics.
Clustering Methods: K-modes, Hierarchical Agglomerative Clustering (HAC), Partitioning Around Medoids (PAM), Random Forests (supervised and unsupervised), PAM with Dissimilarity Matrix (Daisy)
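To make the medoid-based methods above concrete, here is a minimal sketch of a PAM-style swap search on categorical records. It is an illustration under stated assumptions, not the project's code: the records, the field values, and the function names (`mismatch`, `total_cost`, `pam_swap`) are hypothetical, the starting medoids are simply the first k rows rather than a proper BUILD step, and the dissimilarity is a plain count of differing fields (the kind of simple matching measure that a Gower-type dissimilarity reduces to when every field is nominal).

```python
def mismatch(a, b):
    """Simple matching dissimilarity: number of fields that differ."""
    return sum(x != y for x, y in zip(a, b))


def total_cost(rows, medoids):
    """Sum of each row's dissimilarity to its nearest medoid."""
    return sum(min(mismatch(r, rows[m]) for m in medoids) for r in rows)


def pam_swap(rows, k):
    """PAM-style search: swap a medoid for a non-medoid while cost drops."""
    medoids = list(range(k))          # naive start (assumption, not BUILD)
    best = total_cost(rows, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(rows)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(rows, trial)
                if cost < best:       # keep only strictly better swaps
                    medoids, best, improved = trial, cost, True
    # Assign every row to the index of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: mismatch(r, rows[medoids[j]]))
        for r in rows
    ]
    return medoids, labels, best


# Hypothetical records: (agency, contract type, contract size).
records = [
    ("DoD", "fixed", "small"),
    ("DoD", "fixed", "small"),
    ("DoD", "fixed", "large"),
    ("NASA", "cost_plus", "large"),
    ("NASA", "cost_plus", "large"),
    ("NASA", "cost_plus", "small"),
]
medoids, labels, cost = pam_swap(records, k=2)
print(medoids, labels, cost)
```

Because medoids are actual records, each cluster can be summarized by a real row, and records far from every medoid are natural candidates for an analyst to inspect.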
Neural Network Autoencoder: an unsupervised learning algorithm that applies backpropagation with the target values set equal to the inputs.
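The autoencoder idea can be sketched with a deliberately tiny network. The example below is a toy under stated assumptions, not the model used in the project: it is a linear 2-1-2 network without biases, the data are made-up points on a line, and the names `train_autoencoder` and `reconstruction_error` are hypothetical. It trains with full-batch gradient descent using the inputs as targets, then scores records by reconstruction error, so a record that does not fit the learned pattern reconstructs poorly.

```python
def train_autoencoder(rows, lr=0.05, epochs=2000):
    """Train a linear 2-1-2 autoencoder; the target of each row is itself."""
    w_enc = [0.1, 0.1]  # encoder weights: 2 inputs -> 1 hidden unit
    w_dec = [0.1, 0.1]  # decoder weights: 1 hidden unit -> 2 outputs
    n = len(rows)
    for _ in range(epochs):
        g_enc = [0.0, 0.0]
        g_dec = [0.0, 0.0]
        for x in rows:
            # Forward pass.
            h = w_enc[0] * x[0] + w_enc[1] * x[1]
            x_hat = [w_dec[0] * h, w_dec[1] * h]
            err = [x_hat[i] - x[i] for i in range(2)]  # target == input
            # Backward pass for the squared reconstruction error.
            d_h = sum(2 * err[i] * w_dec[i] for i in range(2))
            for i in range(2):
                g_dec[i] += 2 * err[i] * h / n
                g_enc[i] += d_h * x[i] / n
        for i in range(2):
            w_enc[i] -= lr * g_enc[i]
            w_dec[i] -= lr * g_dec[i]
    return w_enc, w_dec


def reconstruction_error(x, w_enc, w_dec):
    """Squared distance between a record and its reconstruction."""
    h = w_enc[0] * x[0] + w_enc[1] * x[1]
    return sum((w_dec[i] * h - x[i]) ** 2 for i in range(2))


# Hypothetical "normal" records that all follow y = 2x.
normal = [[t, 2 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w_enc, w_dec = train_autoencoder(normal)

worst_normal = max(reconstruction_error(x, w_enc, w_dec) for x in normal)
outlier_score = reconstruction_error([1.0, -2.0], w_enc, w_dec)
print(worst_normal, outlier_score)
```

A practical autoencoder would use more units, nonlinear activations, and a threshold on the error chosen from held-out data. The sketch only shows why reconstruction error works as an anomaly score.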
LCDR Jeff Good & MAJ Gabe Samudio
USAREC recruiters regularly receive partial demographic information from potential recruits. Recruiters and associated professionals need a rapid method to predict unknown demographic and financial information to enhance decision making by reducing the uncertainty of missing information.
Moneyball for Maintainers
MAJ Chris Arnold & MAJ Dan Hudalla
Analytics have revolutionized many industries concerned with individual and team performance. Naval Air Systems Command (NAVAIR) seeks an analytic approach to improve maintenance by using maintenance data to gain insights about maintainer performance, end item reliability, and maintenance efficiency.
NAVAIR is interested in a “Moneyball for Maintenance” approach that allows leaders to clearly see and maximize the performance metrics of their maintainers for the good of the organization.
Analysis of Large Text Data Corpus
Major Joe Moeller, Captain Chris Teska, & Lieutenant Cang Pham
Develop a process to rapidly extract text data in order to classify documents and identify key topics, issues, or questions.
Currently, much time and effort is needed to analyze documents manually for inclusion in senior-level “Decision Packages.” Personnel must organize data into one of three categories: Challenges, Opportunities, or Issues. These categories are then aligned with the current priorities incorporated into AFC Task 5.
- Fully connected graph of sentences
- Based on Cosine Similarity
- PageRank Algorithm on graph
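The three steps above (a fully connected sentence graph, cosine-similarity edge weights, and PageRank over the graph) resemble TextRank-style sentence ranking. Here is a minimal sketch under stated assumptions: the sentences are invented, sentences are represented as raw word counts, and `cosine` and `rank_sentences` are hypothetical helper names rather than the project's code.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with weighted PageRank on a similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    # Fully connected graph: edge weight is cosine similarity, no self-loops.
    weights = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Sentences similar to nothing spread their score uniformly.
        dangling = sum(scores[i] for i in range(n) if out[i] == 0)
        new_scores = []
        for j in range(n):
            inflow = sum(
                scores[i] * weights[i][j] / out[i]
                for i in range(n)
                if out[i] > 0
            )
            new_scores.append(
                (1 - damping) / n + damping * (inflow + dangling / n)
            )
        scores = new_scores
    return scores


# Hypothetical sentences; the last one is unrelated to the others.
sentences = [
    "maintenance data improves aircraft readiness",
    "aircraft readiness depends on maintenance data",
    "maintenance crews track aircraft readiness",
    "the cafeteria serves lunch at noon",
]
scores = rank_sentences(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
print([sentences[i] for i in ranked[:2]])
```

Taking the top-scoring sentences gives an extractive summary. Sentences that share vocabulary with many others rise, while an unrelated sentence such as the last one receives the lowest score.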
Latent Dirichlet Allocation
- Algorithmically identifies topics
- Assesses document-topic fit
- Allows association with multiple topics
- Keywords associated with each topic
- Easy access to document keywords and summaries
- Topic Modelling ranks each document by topic assignment
- Keywords for each topic are identified and easily searched
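The Latent Dirichlet Allocation points above can be illustrated with a small collapsed Gibbs sampler. This is a sketch under stated assumptions, not the project's pipeline: the documents are invented token lists, the hyperparameters `alpha` and `beta` are arbitrary, and `lda_gibbs` is a hypothetical name. It returns a per-document topic distribution (the document-topic fit, which lets a document belong to several topics) and the top keywords for each topic.

```python
import random


def lda_gibbs(docs, n_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling over token-topic assignments."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per doc/topic
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                       # tokens per topic
    assign = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + v * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed document-topic proportions and top keywords per topic.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    keywords = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:3]
        for k in range(n_topics)
    ]
    return theta, keywords


# Hypothetical tokenized documents.
docs = [
    "budget funding cost budget contract".split(),
    "contract cost funding budget".split(),
    "training readiness exercise training unit".split(),
    "unit exercise readiness training".split(),
]
theta, keywords = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(keywords):
    print("topic", k, words)
```

Ranking documents by a column of `theta` orders them by how strongly they fit that topic, and the keyword lists give a searchable label for each topic. Results vary with the seed and hyperparameters, so a real corpus would call for a maintained library and tuning.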
Special thanks to the following advisors on these projects: Dr. J. Alt, LTC B. Wade, MAJ K. Klingensmith, MAJ A. Smith, CPT Sean Clement (TRAC); Dr. C. Darken, Dr. R. Koyak (GSOIS).