A Beginner’s Guide to Generating Custom Dataset for Domains Where Dataset Is Sparse or Absent

July 28, 2023


A lot of aspects of our life have been transformed by machine learning (ML) models, which allow us to automate tasks as well as enhance our decision-making. An essential cog in this revolutionary wheel is the data that feeds these models. Textual data, specifically in the context of Natural Language Processing (NLP) and Large Language Models (LLMs), is crucial to the development and effectiveness of these systems. The goal of this article is to help beginners create tailored datasets in fields where data may be scarce or non-existent.

Importance of Datasets in ML 

Machine learning models thrive on discerning patterns in data. The higher the quality of the data, the more effective the model becomes at predicting or classifying unseen instances. In machine learning, data is what guides a model towards precise outcomes and makes new solutions possible. This holds particularly true for Natural Language Processing (NLP), where text-based datasets play an integral role in comprehending, interpreting, and producing human language with purpose and relevance.

Challenges of Sparse or Absent Datasets

In machine learning, the unavailability or insufficiency of appropriate datasets in particular fields, such as healthcare, poses a substantial obstacle. This data shortage can profoundly impact a model's effectiveness, giving rise to issues like overfitting, underfitting, and subpar generalisation. For instance, in the healthcare industry, predicting rare diseases or conditions may be difficult due to the lack of comprehensive patient data. The absence of substantial real-world data in such niche areas can obstruct the evolution of bespoke solutions, hindering the advancement of healthcare AI applications designed to detect or predict these rare medical conditions. Such challenges highlight the significance of creating customised datasets. These datasets, which are designed to satisfy specific requirements, provide a more direct path to accurate and reliable machine learning (ML) models.

Significance of Generating Custom Datasets

Understanding the Problem Statement

Embarking on the journey of creating a custom dataset begins with a crucial first step: deciphering the problem statement. The problem statement stands as a cornerstone, determining the purpose for which the dataset will be used. It sketches a detailed outline of the unique requirements and attributes that the dataset should possess, laying a foundation for what the dataset should look like. Understanding the problem statement involves delving into the nuances of the task, identifying the kinds of inputs and outputs that the ML model will need to handle, and recognizing any constraints imposed by the domain or the nature of the problem. It's a process that requires a deep understanding of both the ML model's needs and the specificities of the task or problem, ensuring that the resulting dataset will effectively serve its purpose in the wider machine learning workflow.

Understanding the Specific Requirements & Characteristics of the Dataset

After establishing the problem statement, the next stage involves discerning the unique necessities and traits of the dataset. This process includes taking into account elements like the structure of the data, its intricacy, the field-specific details demanded, and the quantity of data needed for the model's efficient training.

For instance, consider a problem statement centred around the classification of articles into categories. In this scenario, the dataset's specific requirements and characteristics could include:

  • Format of Data: The dataset would need to consist of textual data from articles, potentially including both the title and body of the article.
  • Complexity: Given that the task is about categorising articles, the complexity might reside in the diversity of language used, the length of the articles, and the range of topics covered.
  • Domain-Specific Information: The dataset would need to cover a broad spectrum of categories into which articles might be classified. This could include politics, technology, sports, culture, etc. Therefore, the dataset should contain articles pertaining to these specific domains.
  • Volume of Data: The dataset must be substantial enough to expose the machine learning model to a wide variety of linguistic patterns, topics, and styles to effectively learn and generalise. The exact volume might depend on the complexity of the categories and the variety of the articles.

Solution: Data Generation Using LLMs


Large Language Models (LLMs), such as GPT-4, offer a powerful solution for generating custom datasets. They are capable of producing a range of text outputs in response to specific prompts, which enables the creation of varied datasets that are meticulously tailored to align with unique problem requirements. For our example of article category classification, we can employ GPT-4 to generate textual data representing a wide array of article categories, thereby providing a robust foundational dataset.

However, the Python code below uses 'text-davinci-003', a model offered by OpenAI, instead of GPT-4. This choice is guided by cost, as 'text-davinci' provides a satisfactory balance of cost and performance for our task. You could choose any LLM that suits your requirements and constraints, including an open-source LLM from Hugging Face's Transformers library, which provides a wide range of pre-trained models. Note, however, that the quality of the generated data will vary with each LLM. The choice of model should therefore be influenced by your requirements concerning data quality, budget constraints, and the specific demands of your problem statement.

import openai

# Set your OpenAI API key
openai.api_key = 'your-openai-api-key'

class CategoryArticleGenerator:
    def __init__(self, engine="text-davinci-003", temperature=0.6):
        self.engine = engine
        self.temperature = temperature

    def generate_article(self, category, max_tokens=1000):
        prompt = f"Write an article about {category}:\n"
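        # Legacy Completions interface (openai<1.0 SDK); arguments come from the class settings
        response = openai.Completion.create(
            engine=self.engine,
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=self.temperature,
        )
        return response.choices[0].text.strip()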

if __name__ == "__main__":
    categories = ["Emerging Technologies", "Healthcare Innovations", "Global Politics", "Environmental Conservation"]

    generator = CategoryArticleGenerator()

    for index, category in enumerate(categories):
        print(f"Generating article for category: {category}")
        article = generator.generate_article(category)
        with open(f'article_{index}.txt', 'w') as f:
            f.write(article)  # save the generated text for later use as a dataset entry
        print(f"Article for category '{category}' saved to article_{index}.txt")
        print("\n" + "="*80 + "\n")

In the above code, we define a Python class `CategoryArticleGenerator` that generates article texts based on the given category. The output articles are saved as text files, which can be utilised later as a custom dataset for the article category classification problem.

Limitations and Challenges 

It's noteworthy to mention that while this code serves as a functional example, it is not without its limitations and does not represent a comprehensive solution. For example, it doesn't incorporate any mechanism to ensure the quality or relevance of the generated articles. For real-world applications, we would need a method to validate the output of the model, ensuring the articles generated are contextually appropriate for the given category and meet specified quality benchmarks. Further, the code doesn't factor in potential biases in the generated content. 

We must remember that machine learning models, LLMs included, can unknowingly propagate and amplify biases found in their training data, which could inadvertently introduce bias into our dataset. The simplicity of the prompt may also lead to a less diverse dataset than intended. To improve the diversity of the generated dataset, we could employ more intricate prompts or adjust the 'temperature' parameter to influence the randomness of the model's output. Finally, while 'text-davinci' presents an economically viable option, the quality of data produced might not be on par with more advanced models.
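
One way to act on the prompt-diversity suggestion above is to combine each category with several angles and formats before calling the model. The sketch below only builds prompts and does not call any API. The `ANGLES` and `FORMATS` lists and the `build_prompts` helper are illustrative assumptions, not part of the original code.

```python
import itertools
import random

# Hypothetical prompt ingredients; adjust these for your own domain.
ANGLES = ["recent developments", "common misconceptions", "practical challenges"]
FORMATS = ["news report", "opinion piece", "explainer for beginners"]


def build_prompts(category, n, seed=0):
    """Return n distinct prompts for one category by mixing angles and formats."""
    combos = list(itertools.product(ANGLES, FORMATS))
    rng = random.Random(seed)  # a fixed seed keeps the prompt set reproducible
    rng.shuffle(combos)
    if n > len(combos):
        raise ValueError(f"only {len(combos)} distinct prompts available")
    return [
        f"Write a {fmt} about {category}, focusing on {angle}:\n"
        for angle, fmt in combos[:n]
    ]


prompts = build_prompts("Healthcare Innovations", n=4)
```

Each prompt could then be passed to the generator in place of the single fixed prompt, optionally with a different temperature per call.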

Depending on the unique requirements of your problem, it might be necessary to consider different models, even if they come with a higher cost. Despite these limitations, the illustrated code exemplifies the potential of LLMs in generating custom datasets and serves as a springboard for further refinement and enhancement.

Annotating & Labelling the Dataset

Labelling and annotating a dataset is a vital step that lets models understand the task and make accurate predictions. Once the dataset has been generated for article category classification, the next essential step is annotation and labelling. This means assigning each generated article to its corresponding category, such as 'Health', 'Technology', 'Environment', and so forth. These labels provide the ground truth for supervised learning models, which is instrumental in teaching the models to identify the distinct features that are indicative of each category.

Although this process may be labour-intensive and time-consuming, it is crucial for the successful training of machine learning models. Without these labels, models would be unable to ascertain the task at hand, significantly impacting their ability to make accurate predictions. Furthermore, the quality of these labels directly impacts the performance of the model, emphasising the need for careful and accurate annotation. 

For instance, if an article about 'Blockchain Technology' is incorrectly labelled as 'Health', the model might learn incorrect associations, leading to suboptimal performance and inaccurate predictions. Therefore, a properly annotated and labelled dataset is not just a requirement, but a critical asset for the effectiveness of supervised learning models in tasks such as article category classification.
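
Because each article in this example was generated from a prompt that names its category, that category can serve as a provisional label to be checked by a human reviewer. The sketch below stores articles and labels as JSON Lines. The field names (`text`, `label`, `reviewed`) and both helper functions are assumptions made for illustration.

```python
import json


def write_labelled_dataset(articles, path):
    """Write (category, text) pairs as JSON Lines with a provisional label."""
    with open(path, "w", encoding="utf-8") as f:
        for category, text in articles:
            record = {
                "text": text,
                "label": category,   # provisional: taken from the generation prompt
                "reviewed": False,   # set to True once a human has checked the label
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_labelled_dataset(path):
    """Read the JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For instance, `write_labelled_dataset([("Global Politics", article_text)], "dataset.jsonl")` would write one record, and the `reviewed` flag makes it easy to see which labels still need checking.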

After meticulously labelling and annotating the dataset, the generated articles provide a substantial foundation for our article category classification task. However, taking a closer look at one of the articles generated by our code on healthcare innovations, it's clear that the process of generating custom datasets, while incredibly useful, presents its own set of challenges and limitations. This article, though rich in content, helps illuminate some of the potential pitfalls that might arise during this process.


Healthcare innovations have been a continually evolving process, driven by the need to improve patient care and create more efficient healthcare systems. From the introduction of new medical technologies to the implementation of new models for delivering care, healthcare innovations have helped to revolutionize the way we approach health care.

One of the most significant and recent healthcare innovations is the use of artificial intelligence (AI). AI technology is being used in a variety of ways, from helping diagnose illnesses to providing personalized treatments. AI can help doctors diagnose illnesses more quickly and accurately, as well as helping to reduce healthcare costs. AI is also being used to develop new drugs and treatments, as well as to improve the accuracy of medical imaging.

The use of big data and analytics has also been a major healthcare innovation. By collecting and analyzing large amounts of patient data, healthcare providers can gain valuable insights into patient health, and use this information to make better decisions about patient care. This can help to improve patient outcomes and reduce costs.

Another major healthcare innovation is the use of telemedicine. Telemedicine allows patients to access medical care remotely, without having to travel to a doctor's office. This can be particularly beneficial for those who are unable to leave their homes due to illness or disability. Telemedicine can also help to reduce waiting times and provide more timely access to care.

Finally, healthcare innovations also include the use of mobile health applications. These apps allow patients to access their health information, schedule appointments, and even monitor their health from their smartphones. Mobile health apps can help to make healthcare more accessible and convenient for patients, while also helping to reduce healthcare costs.

Healthcare innovations are continuing to evolve, and will continue to shape the way we approach healthcare in the future. From the use of AI and big data to the introduction of telemedicine and mobile health apps, healthcare innovations are helping to revolutionize the way we deliver care. By staying up to date with the latest healthcare innovations, we can ensure that we are providing the best possible care to our patients.

Data Bias: The LLMs, as proficient as they are in generating text, can still harbour biases, typically inherited from their training data. For instance, our generated article is heavily skewed towards the technological aspects of healthcare, such as AI, big data, telemedicine, and mobile health apps. This could be indicative of a bias within the model, which could be due to the prevalence of technology-related data over other healthcare facets in the model's training data.

Quality of the Data: The quality of the data generated can vary significantly and may not always meet high standards. Despite our healthcare article's overall coherence, it could be lacking in depth or unique insights typically found in articles authored by experts in the field. Furthermore, there is some degree of repetition, underscoring that while LLMs can generate relevant content, the quality may not always be optimal.
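
Repetition of the kind noted above can be flagged automatically before an article enters the dataset. The following is a minimal sketch of one cheap heuristic, the share of word trigrams that occur more than once. The function names and the thresholds (`min_words=150`, `max_repetition=0.2`) are arbitrary assumptions and would need tuning on real output. It does not measure depth or factual accuracy.

```python
import re
from collections import Counter


def repeated_trigram_ratio(text):
    """Fraction of word trigrams that occur more than once in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(trigrams)


def passes_basic_checks(text, min_words=150, max_repetition=0.2):
    """Cheap filter: long enough and not dominated by repeated phrases."""
    n_words = len(text.split())
    return n_words >= min_words and repeated_trigram_ratio(text) <= max_repetition
```

Articles that fail the filter could be regenerated or sent for manual review rather than discarded outright.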

Ethical & Legal Considerations: It's imperative to consider the ethical aspects when generating data. The article generated appears to respect these norms, as it doesn't contain personally identifiable information or violate any copyright laws. However, constant vigilance is needed to ensure these ethical boundaries are consistently maintained.

Scalability Issues: Generating a large dataset can be a resource-intensive and time-consuming task. Although our code successfully generated a few articles, generating thousands more encompassing a wide range of topics could pose a significant challenge.
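
When generating thousands of articles, some API calls are likely to fail temporarily, so a long run benefits from retrying rather than stopping at the first error. The sketch below wraps any generation function in a retry loop with exponential backoff. `generate_with_retry` and its parameters are hypothetical, and real code should catch the specific error types raised by the client library in use.

```python
import time


def generate_with_retry(generate_fn, category, max_attempts=4,
                        base_delay=1.0, sleep=time.sleep):
    """Call generate_fn(category), retrying with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return generate_fn(category)
        except Exception:  # assumption: narrow this to the client's own errors
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

With the class defined earlier, this could be called as `generate_with_retry(generator.generate_article, category)`.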

Although these challenges and limitations should be kept in mind when opting to generate custom datasets using LLMs, the ability to create targeted, rich, and diverse datasets makes it a worthwhile pursuit, especially in domains where relevant datasets are sparse or unavailable.

Benefits of Using LLMs 

To summarise, large language models (LLMs) like GPT-4 can be used to create custom datasets for machine learning. This is useful when existing datasets are insufficient or non-existent. However, there are some challenges associated with this approach, such as bias in generated data, variable data quality, ethical considerations, and scalability issues.

Despite these challenges, the potential benefits of generating custom datasets with LLMs can outweigh the limitations. This is especially true in data-sparse domains. The key is to be mindful of the challenges and to implement the approach thoughtfully and responsibly.

Here are some specific examples of the benefits of using LLMs to create custom datasets:

  • LLMs can generate a variety of data that is not easily available in existing datasets, for example text in different languages, styles, or genres.
  • LLMs can generate data that is more representative of the real world, for example data that reflects the diversity of human experiences.
  • LLMs can generate examples that are more challenging for machine learning models to learn from, which can help to improve the performance of those models.


Overall, the use of LLMs to create custom datasets is a promising approach for machine learning. However, it is important to be aware of the challenges and to implement the approach thoughtfully and responsibly.


Here are some potential references you might find useful for further exploration of the topic:

  1. OpenAI API Documentation: The API documentation provides details on how to generate texts using the GPT-4 model and other models. 
  2. Using large language models (LLMs) to synthesize training data. - Amazon Science
October 18, 2022

A Complete Guide To Customer Acquisition For Startups

Any business is enlivened by its customers. Therefore, a strategy to constantly bring in new clients is an ongoing requirement. In this regard, having a proper customer acquisition strategy can be of great importance.

So, if you are just starting your business, or planning to expand it, read on to learn more about this concept.

The problem with customer acquisition

As an organization, when working in a diverse and competitive market like India, you need to have a well-defined customer acquisition strategy to attain success. However, this is where most startups struggle. Now, you may have a great product or service, but if you are not in the right place targeting the right demographic, you are not likely to get the results you want.

Companies typically respond by investing more in marketing, but if that investment is not channelled properly, it will be futile.

So, the best way out of this dilemma is to have a clear customer acquisition strategy in place.

How can you create the ideal customer acquisition strategy for your business?

  • Define what your goals are

You need to define your goals so that you can meet the revenue expectations you have for the current fiscal year. You need to find a value for the metrics –

  • MRR – Monthly recurring revenue, which tells you the predictable income you can expect each month across all your recurring income channels.
  • CLV – Customer lifetime value, which tells you how much a customer is expected to spend with your business over the duration of your relationship.
  • CAC – Customer acquisition cost, which tells you how much your organization needs to spend, on average, to acquire a new customer.
  • Churn rate – The rate at which customers stop doing business with you.

All these metrics tell you how well you will be able to grow your business and revenue.
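
As a rough illustration, the four metrics can be computed as follows. These are commonly used simplified formulas rather than definitions taken from this article, and the function names and example figures are made up for the sketch.

```python
def monthly_recurring_revenue(monthly_fees):
    """Sum of the recurring fees paid by all active customers in a month."""
    return sum(monthly_fees)


def churn_rate(customers_at_start, customers_lost):
    """Share of customers who left during the period."""
    return customers_lost / customers_at_start


def customer_acquisition_cost(sales_and_marketing_spend, new_customers):
    """Average spend needed to win one new customer."""
    return sales_and_marketing_spend / new_customers


def customer_lifetime_value(avg_monthly_revenue_per_customer, monthly_churn):
    """Simplified CLV: monthly revenue per customer times expected lifetime (1 / churn)."""
    return avg_monthly_revenue_per_customer / monthly_churn


# Hypothetical figures for one month
mrr = monthly_recurring_revenue([500, 500, 1000])
churn = churn_rate(customers_at_start=100, customers_lost=5)
cac = customer_acquisition_cost(sales_and_marketing_spend=20000, new_customers=40)
clv = customer_lifetime_value(avg_monthly_revenue_per_customer=200, monthly_churn=churn)
```

A common rule of thumb is to compare `clv` with `cac`: acquisition is only sustainable when a customer is worth more than it costs to win them.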

  • Identify your ideal customers

You need to understand who your current customers are and who your target customers are. Once you are aware of your customer base, you can focus your energies in that direction and get the maximum sale of your products or services. You can also understand what your customers require through various analytics and markers and address them to leverage your products/services towards them.

  • Choose your channels for customer acquisition

The channels you choose for acquiring customers will determine the scale and rate at which you can expand your business. You could market and sell your products on social media channels like Instagram, Facebook and YouTube, or invest in paid marketing like Google Ads. You need to develop a unique strategy for each of these channels.

  • Communicate with your customers

If you know exactly what your customers have in mind, then you will be able to develop your customer strategy with a clear perspective in mind. You can do it through surveys or customer opinion forms, email contact forms, blog posts and social media posts. After that, you just need to measure the analytics, clearly understand the insights, and improve your strategy accordingly.

Combining these strategies with your long-term business plan will bring results. However, there will be challenges on the way, where you need to adapt as per the requirements to make the most of it. At the same time, introducing new technologies like AI and ML can also solve such issues easily. To learn more about the use of AI and ML and how they are transforming businesses, keep referring to the blog section of E2E Networks.

October 18, 2022

Image-based 3D Object Reconstruction State-of-the-Art and trends in the Deep Learning Era

3D reconstruction is one of the most complex problems tackled by deep learning systems. There has been a great deal of research in this field drawing on computer vision, computer graphics and machine learning, with limited success. More recently, convolutional neural networks (CNNs) have been applied to the problem and have yielded some success.

The Main Objective of the 3D Object Reconstruction

Developing this deep learning technology aims to infer the shape of 3D objects from 2D images. So, to conduct the experiment, you need the following:

  • Well-calibrated cameras that photograph the object from various angles.
  • Large training datasets from which a model can learn to predict the geometry of the object to be reconstructed. These datasets can be collected from a database of images, or they can be collected and sampled from a video.

By using the apparatus and datasets, you will be able to proceed with the 3D reconstruction from 2D datasets.

State-of-the-art Technology Used by the Datasets for the Reconstruction of 3D Objects

The technology used for this purpose needs to stick to the following parameters:

  • Input

Training uses one or multiple RGB images together with the 3D ground truth, and the objects may need to be segmented. The input could be one image, multiple images or even a video stream.

Testing uses the same kinds of input, with objects shown against a uniform background, a cluttered background, or both.

  • Output

The volumetric output will be done in both high and low resolution, and the surface output will be generated through parameterisation, template deformation and point cloud. Moreover, the direct and intermediate outputs will be calculated this way.

  • Network architecture used

The architecture used in training is 3D-VAE-GAN, which has an encoder and a decoder, with TL-Net and conditional GAN. At the same time, the testing architecture is 3D-VAE, which has an encoder and a decoder.

  • Training used

Training choices include the degree of supervision (2D versus 3D supervision, or weak supervision) and the loss functions used. The training procedure may involve adversarial training with joint 2D and 3D embeddings. The network architecture is also extremely important for the speed and quality of the output.

  • Practical applications and use cases

The reconstruction can be produced as a volumetric representation or a surface representation. Powerful computer systems are needed for reconstruction.

Given below are some of the places where 3D Object Reconstruction Deep Learning Systems are used:

  • 3D reconstruction technology can be used in the Police Department for drawing the faces of criminals whose images have been procured from a crime site where their faces are not completely revealed.
  • It can be used for re-modelling ruins at ancient architectural sites. The rubble or the debris stubs of structures can be used to recreate the entire building structure and get an idea of how it looked in the past.
  • They can be used in plastic surgery where the organs, face, limbs or any other portion of the body has been damaged and needs to be rebuilt.
  • It can be used in airport security, where concealed shapes can be used for guessing whether a person is armed or is carrying explosives or not.
  • It can also help in completing DNA sequences.

So, if you are planning to implement this technology, then you can rent the required infrastructure from E2E Networks and avoid investing in it. And if you plan to learn more about such topics, then keep a tab on the blog section of the website

October 18, 2022

A Comprehensive Guide To Deep Q-Learning For Data Science Enthusiasts

For all data science enthusiasts who would love to dig deep, we have composed a write-up about Q-Learning specifically for you. Deep Q-Learning and Reinforcement Learning (RL) are extremely popular these days. Both are commonly implemented with Python libraries such as TensorFlow 2 and OpenAI’s Gym environments.

So, read on to know more.

What is Deep Q-Learning?

Deep Q-Learning uses the principles of Q-learning, but replaces the Q-table with a neural network. The network takes the state as input and outputs an estimated Q-value for every possible action. The agent stores its previous experiences in memory as tuples, commonly ordered as follows:

State > Action > Reward > Next state

Experience replay is the practice of storing these previous experiences and training on random batches drawn from them, which makes neural network training more stable. A separate target network is used to calculate the target Q-values against which the Q-network's predicted Q-values are compared. An agent of this kind can be tried out on environments provided by OpenAI Gym, such as Taxi-v3.
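
To make experience replay concrete, here is a minimal sketch of a replay buffer in plain Python. The class name, the extra `done` flag marking the end of an episode, and the uniform sampling strategy are assumptions chosen for illustration. Deep Q-Learning implementations differ in these details.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Return a random batch of stored experiences, without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

During training, the agent would call `add` after every step and periodically train the network on `sample(batch_size)`, so that consecutive, highly correlated steps are not fed to the network together.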

Now, any understanding of Deep Q-Learning is incomplete without talking about Reinforcement Learning.

What is Reinforcement Learning?

Reinforcement Learning is a branch of ML in which an agent interacts with an environment and learns, through a reward-based system, to choose actions that maximize its rewards. It differs from supervised and unsupervised learning because it does not require labelled input/output pairs, and sub-optimal actions do not need to be explicitly corrected.

Reinforcement learning problems are usually described as a Markov Decision Process (MDP). In an MDP, the next state presented by the environment depends only on the current state and the action the agent takes. At each step the agent observes the state, chooses an action, and receives a reward, and its task is to maximize the rewards it collects. Solving the MDP means finding the optimal policy, that is, the best action to take in each state.

One widely used algorithm for solving an MDP is Q-Learning, which is an extremely important part of data science and machine learning.

What is Q-Learning Algorithm?

Q-Learning learns the value of each action in each state from scratch, through trial and error. It involves defining the parameters, choosing an action in the current state, observing the reward and the next state, and updating a Q-table so that the agent's choices maximize the expected reward.

The 4 steps that are involved in Q-Learning:

  1. Initializing parameters – The learning rate, discount factor and exploration rate are set, and the Q-table, which holds one value for every state-action pair, is initialized (typically to zeros).
  2. Identifying current state – The agent observes the state it is currently in, since the action it takes depends on that state.
  3. Choosing the optimal action and gaining the relevant experience – The agent picks an action for the current state, usually the one with the highest Q-value but occasionally a random one to explore, and observes the reward and next state returned by the environment.
  4. Updating Q-table rewards and next state determination – The Q-value of the state-action pair is moved towards the reward plus the discounted best Q-value of the next state, and the next state becomes the current state for the following step.

In case the Q-table size is huge, then the generation of the model is a time-consuming process. This situation requires Deep Q-learning.
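
The four steps listed above can be sketched with a small tabular example. The environment here is a hypothetical five-state chain invented for illustration (it is not Taxi-v3), and the hyperparameter values are arbitrary assumptions.

```python
import random

N_STATES = 5            # toy chain of states 0..4 (hypothetical environment)
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
GOAL = N_STATES - 1


def step(state, action):
    """Move along the chain; reward 1.0 only on reaching the goal state."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize parameters and a zero-filled Q-table.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False          # Step 2: identify the current state
        while not done:
            # Step 3: choose an action (epsilon-greedy, ties broken at random).
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Step 4: update the Q-table, then move on to the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
```

After training, the "move right" entry should be the larger value in every non-goal row of `q_table`. The table has only `N_STATES * len(ACTIONS)` entries here. For environments with very many states, such a table becomes impractical, which is where a neural network takes over.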

Hopefully, this write-up has provided an outline of Deep Q-Learning and its related concepts. If you wish to learn more about such topics, then keep a tab on the blog section of the E2E Networks website.

October 13, 2022

GAUDI: A Neural Architect for Immersive 3D Scene Generation

The evolution of artificial intelligence in the past decade has been staggering, and now the focus is shifting towards AI and ML systems to understand and generate 3D spaces. As a result, there has been extensive research on manipulating 3D generative models. In this regard, Apple’s AI and ML scientists have developed GAUDI, a method specifically for this job.

An introduction to GAUDI

The creators of GAUDI named it after the famous architect Antoni Gaudí. This AI model uses a camera pose decoder, which enables it to guess the possible camera angles of a scene. The decoder thus makes it possible to predict the 3D canvas from almost every angle.

What does GAUDI do?

GAUDI can perform multiple functions –

  • Extending generative models to 3D scenes has a tremendous effect on ML and computer vision. Pragmatically, such models are highly useful: they can be applied as world models in model-based reinforcement learning and planning, in SLAM (simultaneous localization and mapping), or in 3D content creation.
  • Generative modelling for 3D objects has previously been used for generating scenes with models such as GRAF, pi-GAN and GSN, which incorporate a GAN (Generative Adversarial Network) whose generator encodes radiance fields. Given a point in the 3D space of the scene and a camera pose, a radiance field provides a density scalar and an RGB value for that point, from which an image of the scene can be rendered from a 2D camera view. It does this by imposing 3D datasets on those 2D shots. It isolates various objects and scenes and combines them to render a new scene altogether.
  • GAUDI also avoids GAN pathologies such as mode collapse.
  • GAUDI also uses this to train data on a canonical coordinate system. You can compare it by looking at the trajectory of the scenes.

How is GAUDI applied to the content?

The steps of application for GAUDI have been given below:

  • Each trajectory, which consists of a sequence of posed images from a 3D scene, is encoded into a latent representation. This representation captures the radiance field (what we refer to as the 3D scene) and the camera path in a disentangled way. The latent representations are treated as free parameters, and the problem is formulated as the optimization of a reconstruction objective.
  • This simple training process is then scaled to thousands of trajectories, creating a large number of views. The model can then sample radiance fields entirely from the prior distribution that it has learned.
  • New scenes are thus synthesized by interpolation within the latent space.
  • The approach scales to many 3D scenes that contain thousands of images. During training, there is no issue related to canonical orientation or mode collapse.
  • A novel de-noising optimization technique is used to find latent representations that jointly model the camera poses and the radiance field. The result is state-of-the-art performance in generating 3D scenes across multiple datasets, including generation conditioned on images or text.

To conclude, GAUDI has further capabilities and can also be used for sampling various image and video datasets. It is also likely to make a foray into AR (augmented reality) and VR (virtual reality). With GAUDI in hand, the sky is the limit in the field of media creation. So, if you enjoy reading about the latest developments in the field of AI and ML, then keep a tab on the blog section of the E2E Networks website.
