Open Source Data Management for Long Tail Data

A customizable and scalable data management system to support any data format and multiple research domains. Catalogs in the clouds.

Flexible Metadata Representation

Support for both user-defined and machine-defined metadata. The system accepts metadata in a flexible representation based on JSON-LD. Users can add metadata entries directly from the UI. Extractors and external clients can attach metadata to files and datasets through the Web service API.
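As a rough illustration of what a JSON-LD based metadata entry might look like, the Python sketch below wraps plain key/value metadata in an envelope with an `@context`. The field names (`agent`, `content`), the `build_metadata_entry` helper, and the `example.org` vocabulary URL are assumptions made for this example, not the confirmed Clowder schema; consult the API documentation for the real format.

```python
import json


def build_metadata_entry(content, agent_name,
                         vocab_url="http://example.org/terms/"):
    """Wrap plain metadata in a JSON-LD style envelope (hypothetical layout)."""
    return {
        # The context maps each plain key to a term in a vocabulary, which is
        # what lets different clients agree on what a key means.
        "@context": {key: vocab_url + key for key in content},
        # Who produced the metadata: a user or a machine (e.g. an extractor).
        "agent": {"name": agent_name},
        # The metadata itself, as ordinary JSON.
        "content": content,
    }


if __name__ == "__main__":
    entry = build_metadata_entry(
        {"species": "Quercus alba", "height_m": 21.5},
        agent_name="example-extractor",
    )
    # The serialized entry is what a client would send in a request body.
    print(json.dumps(entry, indent=2))
```

Because the envelope is ordinary JSON, the same structure can be produced by a person filling in a form in the UI or by an automated client.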

Automatic Metadata Extraction

When new data is added to the system, whether through the web front-end or the Web service API, a cluster of extraction services processes the data to extract relevant metadata and create web-based data visualizations.

Extend the system by creating new extractors to analyze data. Clowder uses the publish-subscribe model and the RabbitMQ broker: when certain events occur in Clowder, such as the upload of a new file, a message is published to notify any listening metadata extractors that a new file is available. Each extractor can then use the public Web service API to analyze the data and write any relevant information back to Clowder.
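The publish-subscribe flow described above can be sketched in a few lines of Python. This is a toy, in-memory stand-in for the broker, not RabbitMQ and not Clowder's extractor framework: the `InMemoryBroker` class, the routing keys, the message fields, and the dictionary used in place of the Web service API are all hypothetical. Glob-style matching is used for brevity; RabbitMQ topic wildcards follow different rules.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


class InMemoryBroker:
    """Toy message broker that delivers messages by routing-key pattern."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, pattern, callback):
        """Register a callback for routing keys matching a glob pattern."""
        self._subscribers[pattern].append(callback)

    def publish(self, routing_key, message):
        """Deliver a message to every matching subscriber; return the count."""
        delivered = 0
        for pattern, callbacks in self._subscribers.items():
            if fnmatchcase(routing_key, pattern):
                for callback in callbacks:
                    callback(message)
                    delivered += 1
        return delivered


def make_word_count_extractor(metadata_store):
    """Build an extractor that writes a word count back for each text file.

    A real extractor would fetch the file and write metadata through the
    Web service API; here a plain dict stands in for both steps.
    """
    def on_new_file(message):
        words = message["text"].split()
        metadata_store[message["file_id"]] = {"word_count": len(words)}
    return on_new_file


if __name__ == "__main__":
    broker = InMemoryBroker()
    store = {}
    broker.subscribe("file.text.*", make_word_count_extractor(store))

    # Uploading a text file notifies the listening extractor.
    broker.publish("file.text.plain",
                   {"file_id": "abc123", "text": "long tail data"})
    # An image upload does not match the pattern, so nothing runs.
    broker.publish("file.image.png", {"file_id": "def456", "text": ""})
    print(store)
```

The point of the pattern is that the publisher does not need to know which extractors exist: new extractors can subscribe later without any change to the code that announces uploads.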

A partial list of extractors is available in the NCSA Bitbucket and in the Brown Dog wiki.

Data Visualizations

To preview the content of large files and visualize the information contained in files and datasets in a meaningful way, Clowder lets developers write JavaScript-based widgets that can be added to files and datasets. These data previews are often added by automatic extractions.

For example, the geospatial extractors watch for shapefiles and GeoTIFF files uploaded to Clowder and then submit the GIS layers to an instance of GeoServer. A custom JavaScript previewer visualizes the data on an interactive map.


If you publish work that uses Clowder, please cite the following paper:

Luigi Marini, Indira Gutierrez-Polo, Rob Kooper, Sandeep Puthanveetil Satheesan, Maxwell Burnette, Jong Lee, Todd Nicholson, Yan Zhao, and Kenton McHenry. 2018. Clowder: Open Source Data Management for Long Tail Data. In Proceedings of the Practice and Experience on Advanced Research Computing (PEARC '18). ACM, New York, NY, USA, Article 40, 8 pages. DOI:

Projects and Partners

  • NARA/NSF OCI – Understanding Data Intensive and CPU Intensive Services to Support Preservation and Reconstruction of Electronic Records
  • NSF CDI – Groupscope: Instrumenting Research on Interaction Networks in Complex Social Contexts
  • NSF EAR – Critical Zone Observatory Network for Intensively Managed Landscapes (IML-CZO)
  • NSF ICER – EarthCube Building Blocks: Collaborative Proposal: A Geo-Semantic Framework for Integrating Long-Tail Data and Models
  • NIH – Immunomodulatory and Regenerative Effects of Mesenchymal Stem Cells on Allografts
  • Illinois-Indiana Sea Grant – Great Lakes Monitoring
  • European Commission – Linking Scientific Computing in Europe and the Eastern Mediterranean
  • XSEDE – Large Scale Video Analytics
  • NSF ACI – CIF21 DIBBs: Brown Dog
  • NSF ACI – Sustainable Environment through Actionable Data (SEAD)
  • NSF ACI – CIF21 DIBBs: T2-C2: Timely and Trusted Curator and Coordinator Data Building Blocks
  • Syngenta
  • NSF OAC – Collaborative Research: CSSI: Framework: Data: Clowder Open Source Customizable Research Data Management, Plus-Plus


A short list of publications related to Clowder:

  • S. Puthanveetil Satheesan, J. Alameda, S. Bradley, M. Dietze, B. Galewsky, G. Jansen, R. Kooper, P. Kumar, J. Lee, R. Marciano, L. Marini, B. S. Minsker, C. M. Navarro, A. Schmidt, M. Slavenas, W. C. Sullivan, B. Zhang, Y. Zhao, I. Zharnitsky, and K. McHenry. 2018. Brown Dog: Making the Digital World a Better Place, a Few Files at a Time. In Proceedings of the Practice and Experience on Advanced Research Computing (PEARC '18). ACM, New York, NY, USA, Article 38, 8 pages. DOI:
    Winner of PEARC '18 best paper award!
  • C. Sophocleous, L. Marini, R. Georgiou, M. Elfarargy, and K. McHenry, “Medici 2: A Scalable Content Management System for Cultural Heritage Datasets,” in Code4Lib, 2017.
  • P. Nguyen, S. Konstanty, T. Nicholson, T. O’Brien, A. Schwartz-Duval, T. Spila, K. Nahrstedt, R. Campbell, I. Gupta, M. Chan, K. McHenry, and N. Paquin, "4CeeD: Real-time Acquisition and Analysis Framework for Materials-related Cyber-Physical Environments," in 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2017.
  • Y. Zhao, E. Black, L. Marini, K. McHenry, N. Kenyon, R. Patil, A. Kajdacsy-Balla, and A. Bartholomew, "Automatic Glomerulus Extraction in Whole Slide Images Towards Computer Aided Diagnosis," 2016 IEEE 12th International Conference on e-Science (e-Science), 2016.
  • S. Padhy, J. Alameda, R. Kooper, R. Liu, S. P. Satheesan, I. Zharnitsky, G. Jansen, M. Dietze, P. Kumar, B. Minsker, C. Navarro, M. Slavenas, W. Sullivan, and K. McHenry, “An Architecture for Automatic Deployment of Brown Dog Services At Scale into Diverse Computing Infrastructures,” in XSEDE, 2016.
  • M. Slavenas, E. Wuerffel, P. Rodriguez, J. Will, and A. Craig, “Image Analysis and Infrastructure Support for Mining the Farm Security Administration – Office of War Information Photography Collection,” in XSEDE, 2016.
  • S. Padhy, L. Diesendruck, R. Kooper, R. Liu, L. Marini, C. Navarro, M. Slavenas, I. Zharnitsky, M. Dietze, P. Kumar, B. Minsker, J. Lee, and K. McHenry, “Autocuration Cyberinfrastructure for Scientific Discovery and Preservation,” in IEEE eScience, 2015.
  • V. Kuhn, A. Craig, M. Simeone, S. P. Satheesan, and L. Marini, “The VAT: Enhanced Video Analysis,” in XSEDE, 2015.
  • M. Poole, N. Lambert, S. Satheesan, A. Das, A. Yahja, and M. Hasegawa-Johnson, “GroupScope: A Framework and Tools for Large Scale Study of Social Processes,” in International Conference on Computational Social Science, 2015.
  • J. Myers, M. Hedstrom, D. Akmon, S. Payette, B. A. Plale, I. Kouper, S. McCaulay, R. McDonald, I. Suriarachchi, A. Varadharaju, P. Kumar, M. Elag, J. Lee, R. Kooper, and L. Marini, “Towards Sustainable Curation and Preservation: The SEAD Project's Data Services Approach,” in 2015 IEEE 11th International Conference on e-Science, 2015.
  • L. Marini, R. Kooper, J. Futrelle, J. Plutchak, A. Craig, T. McLaren, and J. Myers, “Medici: A scalable multimedia environment for research,” in The Microsoft e-Science Workshop, 2010.

Special Thanks

  • Atlassian for kindly giving us an open source license to their software development products that make our daily efforts so much easier
  • Balsamiq Mockups for kindly giving us an open source license for their rapid wireframing tool that makes iterating over designs so much faster and more enjoyable