This repository contains code, resources, and workflows for analyzing patent data using Python, Apache Spark, AWS, and Microsoft Azure services. The objective of this project is to extract actionable insights and trends from patent datasets to aid intellectual property strategies and business decisions.
The workflow follows a four-phase architecture:
- Sourcing: Data is scraped and ingested from major patent repositories such as:
  - Google Patents
  - WIPO
  - USPTO
  - FPO
  - Espacenet
- Storage: Patent data is stored in cloud solutions:
  - Amazon S3
  - Microsoft Azure Blob Storage
- ETL (Extract, Transform, Load):
  - Tools used: Apache Spark (Azure Databricks) and Delta Lake
  - Data pipelines are built using Azure Data Factory to clean and transform data.
  - The Medallion Architecture organizes the data into three layers:
    - Bronze: Raw ingestion
    - Silver: Filtered and clean data
    - Gold: Aggregated and analytics-ready data
- Visualization: Insights are visualized using:
  - Power BI
  - Matplotlib & Seaborn (Python libraries)
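To make the Bronze, Silver, and Gold layers concrete, here is a minimal sketch in plain Python. It is a stand-in for the Spark and Delta Lake pipeline, not the code in `ETL_PROCESS.ipynb`: the record fields (`patent_id`, `title`, `year`), the sample values, and the helper names `to_silver` and `to_gold` are all assumptions made for illustration.

```python
from collections import Counter

# Bronze: hypothetical raw records as they might arrive from scraping.
# Field names and values are assumptions, not the project's actual schema.
bronze = [
    {"patent_id": "US-001", "title": "  Viral vector design ", "year": "2019"},
    {"patent_id": "US-001", "title": "  Viral vector design ", "year": "2019"},  # duplicate
    {"patent_id": "EP-002", "title": "Capsid engineering", "year": "2020"},
    {"patent_id": None, "title": "Missing identifier", "year": "2020"},
    {"patent_id": "WO-003", "title": "Genome editing", "year": "n/a"},  # unparseable year
    {"patent_id": "US-004", "title": "Vaccine platform", "year": "2020"},
]


def to_silver(records):
    """Silver: drop rows without an ID or a numeric year, trim titles, deduplicate by ID."""
    seen = set()
    silver = []
    for rec in records:
        pid = rec.get("patent_id")
        year = str(rec.get("year", "")).strip()
        if not pid or not year.isdigit() or pid in seen:
            continue
        seen.add(pid)
        silver.append({"patent_id": pid, "title": rec["title"].strip(), "year": int(year)})
    return silver


def to_gold(silver_records):
    """Gold: aggregate to the number of patents per year, sorted by year."""
    return dict(sorted(Counter(rec["year"] for rec in silver_records).items()))


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a Spark setting each layer would typically be persisted as its own table, so that downstream steps read from Silver or Gold rather than re-cleaning the raw data.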
- Web Scraping: Patent data is extracted using BeautifulSoup and Python scripts.
- Preprocessing:
  - Data cleaning
  - Parsing XML, JSON, CSV, and PDF formats
- Feature Engineering:
  - Keyword extraction
  - Citation network analysis
- ETL Pipelines: Scalable data processing with Apache Spark.
- Visualizations: Interactive charts for patent trends, keyword frequency, and metrics.
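The feature engineering steps listed above can be sketched with the standard library alone. This is an illustrative simplification, not the project's implementation: keyword extraction is reduced to stop-word-filtered word counts, citation network analysis is reduced to counting incoming citations (in-degree), and the record fields (`abstract`, `cites`), the stop-word list, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical cleaned records; field names and values are illustrative only.
patents = [
    {"patent_id": "P1", "abstract": "Engineered virus vector for gene delivery", "cites": []},
    {"patent_id": "P2", "abstract": "Virus capsid engineering for vaccine delivery", "cites": ["P1"]},
    {"patent_id": "P3", "abstract": "Vaccine vector based on an engineered virus", "cites": ["P1", "P2"]},
]

# A tiny stop-word list for the example; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "based", "for", "of", "on", "the"}


def keyword_counts(records):
    """Count lowercase word tokens across all abstracts, skipping stop words."""
    counts = Counter()
    for rec in records:
        tokens = re.findall(r"[a-z]+", rec["abstract"].lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


def citation_in_degree(records):
    """Count how many times each patent is cited by the other records."""
    in_degree = Counter({rec["patent_id"]: 0 for rec in records})
    for rec in records:
        in_degree.update(rec["cites"])
    return in_degree


keywords = keyword_counts(patents)
cited = citation_in_degree(patents)
print(keywords.most_common(3))
print(cited.most_common(1))
```

In-degree is only the simplest citation metric. A fuller analysis would build a directed graph and could apply measures such as PageRank, which is beyond this sketch.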
Count of Patents by Year
Count of inventors by country
The development of countries' interest in patenting
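The aggregates behind charts like these can be computed in a few lines before the data is handed to Power BI or Matplotlib. The sketch below is a rough illustration under stated assumptions: the rows, the field names (`year`, `country`, `inventors`), and the idea that `country` holds the inventors' country are all hypothetical, and the sample values are made up.

```python
from collections import Counter, defaultdict

# Hypothetical analytics-ready rows; countries, years, and names are invented.
rows = [
    {"year": 2019, "country": "US", "inventors": ["A. Smith", "B. Lee"]},
    {"year": 2019, "country": "FR", "inventors": ["C. Martin"]},
    {"year": 2020, "country": "US", "inventors": ["A. Smith"]},
    {"year": 2020, "country": "FR", "inventors": ["D. Bernard", "E. Petit"]},
    {"year": 2020, "country": "FR", "inventors": ["C. Martin"]},
]

# Count of patents by year.
patents_by_year = Counter(row["year"] for row in rows)

# Count of distinct inventors by country (a name is counted once per country).
inventor_sets = defaultdict(set)
for row in rows:
    inventor_sets[row["country"]].update(row["inventors"])
inventors_by_country = {country: len(names) for country, names in inventor_sets.items()}

# Patents per country per year, to follow each country's interest over time.
interest = defaultdict(Counter)
for row in rows:
    interest[row["country"]][row["year"]] += 1

print(dict(patents_by_year))
print(inventors_by_country)
print({country: dict(years) for country, years in interest.items()})
```

Each resulting mapping corresponds to one chart: a bar chart over years, a bar chart over countries, and one line per country over time.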
├── Analysis of Patents on Virus Engineering.pdf # PDF report on virus engineering patents
├── ETL_PROCESS.ipynb # Notebook for the ETL process
├── Interface_DEMO.rar # Demo interface (compressed file)
├── Patents_Scraping.ipynb # Notebook for web scraping patent data
├── Project_Architecture.png # Architecture diagram for the project
├── Project_Presentation.pdf # Project presentation file
├── Projet_visualizations.pdf # Visualizations and insights in PDF
└── README.md # Project documentation
- Python 3.x
- Apache Spark
- AWS credentials for S3
- Microsoft Azure access
- Clone the repository:
  git clone https://github.com/your_username/patent-analysis.git
  cd patent-analysis
The project leverages patent data from:
- Google Patents
- WIPO
- USPTO
- FPO
- Espacenet
- Data Scraping: Use Patents_Scraping.ipynb to collect and store patent data.
- ETL Process: Run the ETL_PROCESS.ipynb notebook to clean, transform, and prepare the data.
- Visualization: Load the processed data into Power BI or Python notebooks to generate insights.
Contributions are welcome! Follow these steps:
- Fork this repository.
- Create a new branch: git checkout -b feature/new-feature.
- Commit your changes: git commit -m "Add new feature".
- Push to the branch: git push origin feature/new-feature.
- Submit a Pull Request.
For questions, feedback, or collaborations, contact:
Najma El boutaheri Email: najmaelboutaheri@gmail.com
Special thanks to all contributors and the open-source libraries used in this project.