

Part 1

Advanced Data Preparation in Power BI

Preparing data for analysis is a critical first step in any data project. It ensures that the data is clean, consistent, and structured in a way that makes analysis accurate and efficient. This process involves connecting to data sources, transforming and shaping tables, handling inconsistencies, and applying best practices for data organization.

This guide is designed to be followed in sequence:

Part 1 - Advanced Data Preparation in Power BI

Part 2 - Model the Data with Power BI

Part 3 - Visualization and Report Creation with Power BI

Part 4 - Power BI Report Deployment & Management

*

Task Details

1. Establish a connection to a CSV data source.

2. Change the file path in the source settings if the file is moved to a different location.

3. Shape and transform tables by removing rows and columns and promoting the first row to a header.

4. Apply a user-friendly naming convention.

5. Resolve inconsistencies and handle null values by replacing values and removing duplicates.

6. Evaluate and transform the data types of columns.

7. Learn how to append and merge queries, pivot and unpivot columns, and add conditional columns to prepare data for analysis.

8. Perform data profiling to identify potential problems in the data and resolve errors.

9. Import data from folders and Azure SQL Database (Import or DirectQuery).

*

All files used in this project are located here

Steps

Establish a connection to a CSV data source.

Connect to CSV file

1. In Power BI Desktop, go to the Home tab, then click "Get data."

*

2. Select your CSV file.

*

3. Click "Load" to load the CSV file.

Note: You could also click the "Transform Data" button to open the Power Query Editor and clean the data before loading it.

*

4. You can see the loaded file on the right side.

*
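Every connection you make in the UI is recorded as Power Query M code, which you can inspect later via Transform Data → Advanced Editor. A minimal sketch of the generated Source step (the file path, delimiter, and encoding below are placeholders - yours will differ):

```m
// "Source" step for a CSV connection; path and options are illustrative.
let
    Source = Csv.Document(
        File.Contents("C:\Data\FactCmpySales_2017.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]
    )
in
    Source
```

Knowing where this code lives makes the next section easier to follow: the file path is just a string inside this expression.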

Change the file path in the source settings if the file is moved to a different location.

1. If you move the file to a different location, update the file path as follows:

Go to "Transform data" and choose "Data source settings."

*

2. Click "Change Source..." and choose the new file location.

*

Shape and transform tables by removing rows and columns and promoting the first row to a header.

1. Click on "Transform data" to enter the Power Query editor.

*

2. Delete the first two rows and use the third row as the column header.

Click "Remove Rows," then "Remove Top Rows."

*

3. Specify how many rows to remove, then click OK.

*

4. Promote the first row to a column header.

Note: Use the same method (Remove Top Rows) to remove the remaining blank row.

*

*

5. Remove Column 3 because it is empty.

Right-click the column header and choose Remove.

*

6. To apply your changes, click Apply or Close & Apply if you are done modifying the data.

*

7. You can see that the changes took effect on the right side.

Note: To return to Power Query, click Transform Data.
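In the Advanced Editor, the shaping steps from this section might look roughly like the following M sketch (the step names and the column name Column3 are illustrative):

```m
// Applied Steps for this section, assuming the raw table is in Source.
let
    RemovedTopRows  = Table.Skip(Source, 2),                             // delete the first two rows
    PromotedHeaders = Table.PromoteHeaders(RemovedTopRows, [PromoteAllScalars = true]),
    RemovedBlankRow = Table.Skip(PromotedHeaders, 1),                    // remove the remaining blank row
    RemovedColumns  = Table.RemoveColumns(RemovedBlankRow, {"Column3"})  // drop the empty column
in
    RemovedColumns
```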

User-friendly naming convention.

Use consistent naming conventions in Power BI for table names, column names, and measures.

  • Apply clear, descriptive names that reflect the data’s meaning.
  • Avoid spaces, special characters, and unclear abbreviations.
  • Use a consistent format (for example, PascalCase or camelCase).
  • Keep measure names meaningful and easy to understand for reporting.

Consistent naming improves readability, maintenance, and collaboration within reports and data models.

*
1. Rename the FactCmpySales_2017 table to Sales 2017, and replace underscores in the column names with spaces.

Note: Also delete the extra description "(unique)" from the Order ID column name.

*
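Column renames are recorded as a single M step. A sketch of what the formula bar might show (the old and new names here are examples; renaming the table itself happens in the Queries pane, not in M):

```m
// Replace underscores with spaces and drop the "(unique)" suffix.
Table.RenameColumns(Source, {
    {"order_id (unique)", "Order ID"},
    {"order_date", "Order Date"}
})
```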

Resolve inconsistencies and handle null values by replacing values and removing duplicates.


1. The Sales column should contain only numbers. Remove any text (like the word "Sales") from the column so that changing the data type will not produce an error.

  • Select the Sales column and choose Replace Values.

*

2. In the "Value To Find" field, enter a space followed by the word "Sales".

  • Leave the "Replace With" field empty.

Note: This removes the space and the word "Sales" from every cell that contains them.

*

Note: You can roll back any step by deleting it from the Applied Steps list in the Query Settings pane.

*

3. If you try to change the data type to Whole Number, you will still get an error because the column contains the text "NA".

  • Right-click the Sales column, then change the type to Whole Number.

*

  • Go back one step (delete the last applied step).

*

4. Replace ‘NA’ with null values.

Note: When you enter the word "null" in the Replace With field, Power BI will insert actual null values.

*

Note: You will no longer get an error when changing the data type to Whole Number.

*

5. Remove duplicates.

Removing duplicates in the Order ID column ensures that each order is counted only once. This prevents double counting in your analysis and ensures accurate calculations of sales, totals, and other metrics.

  • Select the Order ID column, then go to Remove Rows and choose Remove Duplicates.

*

6. You can now see that the duplicates have been removed.

*
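Steps 1-6 of this section correspond to a chain of M steps similar to the following sketch (column names are illustrative):

```m
let
    // Remove the " Sales" suffix so the column holds only digits and "NA".
    RemovedSuffix = Table.ReplaceValue(Source, " Sales", "", Replacer.ReplaceText, {"Sales"}),
    // Turn the remaining "NA" text into real nulls.
    ReplacedNA    = Table.ReplaceValue(RemovedSuffix, "NA", null, Replacer.ReplaceValue, {"Sales"}),
    // Now the type change succeeds.
    ChangedType   = Table.TransformColumnTypes(ReplacedNA, {{"Sales", Int64.Type}}),
    // Keep only the first row for each Order ID.
    RemovedDupes  = Table.Distinct(ChangedType, {"Order ID"})
in
    RemovedDupes
```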

7. Remove blank and "NA" values.

  • Click the filter arrow at the right of the column header and uncheck the blank and "NA" entries.

*

8. The data looks a lot cleaner now.

Evaluate and transform the data types of columns.

Power BI automatically detects the data type for each column, making it easier to work with your data and visualize it accurately.

1. Select all columns by pressing Ctrl + A.

  • Then, go to the Transform tab and click Detect Data Types.

The data types are still not detected correctly because of the "NA" values in the Order Date column.

Note: Use the same method to remove or replace the ‘NA’ values with null in this column.

*

2. Now the data looks even cleaner.

Note: After completing this step, click Close & Apply to return to Power BI Desktop.

The best approach would be to replace NA values with null in every column.

*
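If you prefer to replace "NA" in every column at once instead of column by column, one possible M one-liner passes all column names to Table.ReplaceValue:

```m
// Replace the text "NA" with null across all columns of the table.
Table.ReplaceValue(
    Source, "NA", null, Replacer.ReplaceValue,
    Table.ColumnNames(Source)
)
```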

3. You can see that Power BI has correctly detected the data types, indicated by the icons next to each column.

*

Learn how to append and merge queries, pivot and unpivot columns, and add conditional columns to prepare data for analysis.

*
Append queries.

Append queries combine rows from multiple tables into a single table with the same structure.

1. Click New Source, choose Text/CSV, then locate Sales2018 and Sales2019 as the tables to append data from.

*

2. Make sure that the files load successfully.

Note: It is worth mentioning that you can save your project as a .pbix file from the main screen.

*

3. Ensure consistent column names before appending tables.

Before appending, make sure all tables use the same column names. Modify them if needed:

  • Click on Sales 2018
  • Remove the order_date_2 column
  • Rename the remaining column to match the Sales 2017 table (replace the underscore _ with a space)
  • Do the same with Sales 2019

*

4. Append the data after matching the column structure.

  • Choose the base table to which the rows will be appended - in this case, Sales 2017.
  • Click Append Queries on the Home tab.
  • Select the "Three or more tables" option.
  • Add both tables.
  • Click OK.
  • Click Close & Apply.

Note: Append Queries as New will create a new table instead.

Note: All data from the three tables is now combined into the Sales 2017 table.

Note: Rename the Sales 2017 table to Sales, since it now contains data from all years.

*

5. Make sure that you have all data combined in Power BI Desktop.

*
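The append operation itself is a single M call; with the three yearly tables it would look something like:

```m
// Combine rows from all three tables; columns are matched by name.
Table.Combine({#"Sales 2017", #"Sales 2018", #"Sales 2019"})
```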

Merge queries.

Merge Queries is a Power Query operation that combines two tables based on a related column (key), similar to a database join, allowing you to bring matching data from one table into another.

In Merge Queries, the column names do not have to be the same, but both tables must contain a column with matching values that can be used as the key for the join.

1. Click New Source, choose Text/CSV, and load the sales_details table.

*

2. In the sales_details table, remove two empty columns.

*

3. Go to the main Sales table and merge the new table based on a related column (key).

  • Select the sales table
  • Click Merge queries

*

4. Choose the table to merge with, and select the common (key) column.

*

5. After merging, you will see a new column named sales_details.

  • Click the icon on the right of the column header
  • Choose Expand
  • Deselect "Use original column name as prefix"

Note: Deselecting "Use original column name as prefix" keeps merged column names clean and simple, avoiding unnecessary table-name prefixes.

*

6. You should now see three new columns added at the end. Drag and drop these new columns to the beginning.

*

7. Delete the duplicate order_id column, and replace underscores with spaces in the product_id and store_id column names.

Note: You will also need to remove duplicates from the Order ID column.

*

8. Remove duplicates from the Order ID column using Remove Rows → Remove Duplicates.
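The merge and expand steps above map to two M calls, sketched below (the key columns and expanded column names are illustrative):

```m
let
    // Left outer join: keep every Sales row, attach matching sales_details rows.
    Merged   = Table.NestedJoin(
                   Sales, {"Order ID"},
                   sales_details, {"order_id"},
                   "sales_details", JoinKind.LeftOuter),
    // Expand the chosen columns without the table-name prefix.
    Expanded = Table.ExpandTableColumn(
                   Merged, "sales_details",
                   {"product_id", "store_id"},
                   {"product id", "store id"})
in
    Expanded
```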

Unpivot columns.

Unpivot columns: Turn columns into rows (normalize data for analysis).

1. Click New Source, choose Text/CSV, and load the costs table.

*

2. We want all of the costs in one single column.

  • First, promote the first row to a header, as we did before.

*

3. Rename the Year column to Month, then unpivot the two year columns.

  • Select both columns
  • Go to the Transform tab
  • Click Unpivot Columns

Note: After unpivoting in Power BI, two new columns are created: Attribute and Value. You will need to rename these columns to meaningful names for your analysis.

*

After unpivoting


*

4. Rename the Attribute column to Year and the Value column to Cost Amount for better readability.

*
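The unpivot and the renames correspond to M similar to the following sketch:

```m
// Keep Month fixed; turn every other column into Attribute/Value rows.
let
    Unpivoted = Table.UnpivotOtherColumns(Source, {"Month"}, "Attribute", "Value"),
    Renamed   = Table.RenameColumns(Unpivoted, {{"Attribute", "Year"}, {"Value", "Cost Amount"}})
in
    Renamed
```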

Pivot columns

Pivot columns: Turn row data into columns (aggregate values by categories).

1. Select the Type column and click Pivot Column.

  • For the Values Column, choose Cost Amount.

*

2. Here’s what will happen:

Each unique value in the Type column (e.g., costs for employees, costs for rent) will become a new column.

The Month and Year columns will stay as columns.

The corresponding cost amount values will be placed under the new columns for each month-year combination.

You will get a wide table where each type of cost is a separate column, making it easier to compare costs side by side.

Note: This structure is much better for visualizations like line charts, bar charts, or comparing costs across categories and time.

*
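In M, the pivot is a single Table.Pivot call; a sketch (List.Sum is the aggregation applied if a month-year combination has several rows of the same type):

```m
// One new column per distinct value in Type, filled from Cost Amount.
Table.Pivot(
    Source,
    List.Distinct(Source[Type]),
    "Type",
    "Cost Amount",
    List.Sum
)
```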

Add a conditional column (month number).

Use a conditional column to assign a numeric value to each month name (for example, January = 1, February = 2, …, December = 12). This helps sort months in the correct chronological order and improves time-based analysis in reports and visuals.

*

Method 1

1. In the costs table, go to the Add Column tab → Conditional Column.

  • Set the condition as follows:
  • New column name: MonthNo
  • Column name: Month
  • Operator: equals
  • Value: the month name (for example, January)
  • Output: the corresponding month number (for example, 1)
  • Click OK
  • Convert the column to Whole Number

Note: Create a conditional column to convert each month name into its corresponding numeric value. For example, if Month equals January, return 1; if February, return 2; continue this pattern through November, and use Else = 12 to represent December.

This transformation ensures months are sorted in the correct chronological order and improves accuracy for time-based analysis, filtering, and visualizations in reports.

After

*
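The conditional column dialog generates an if/else chain in M, abbreviated here (continue the pattern through November):

```m
// MonthNo: map month names to 1-12; the Else branch covers December.
Table.AddColumn(Source, "MonthNo", each
    if [Month] = "January" then 1
    else if [Month] = "February" then 2
    else if [Month] = "March" then 3
    // ... repeat for April through November ...
    else 12,
    Int64.Type)
```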

Method 2 – Create Month Number Using “Column From Example”

  • Go back two steps to remove the column created previously.
  • Click Column from Examples in the top-left corner of the Add Column tab.
  • Enter the column name: MonthNo.
  • Type the corresponding month number for each row.
  • Click OK.

Note: This method is faster and more efficient than creating a conditional column manually.

*

Perform data profiling to identify potential problems in the data and resolve errors.

This process analyzes column quality, distribution, and statistics to detect issues such as missing values, duplicates, incorrect data types, or unexpected patterns, helping ensure clean and reliable data for analysis and reporting.

*

Data profiling

Data profiling can be performed using three main tools:

Column distribution – Shows how values are spread across a column, helping detect outliers, unexpected ranges, or unusual patterns.

Column quality – Displays the percentage of valid, empty, and error values to quickly identify data issues.

Column profile – Provides detailed statistics for each column, such as minimum, maximum, average, and value frequency, to better understand the data structure and reliability.

*

Column distribution

Column distribution shows how many distinct values exist in a column and how many of them are unique (appear only once).

This helps you quickly detect duplicates, identify unexpected repeated values, and understand the overall variety of data, making it easier to clean and validate the dataset before analysis.

1. Go to the View tab and select Column distribution.

Note: By default, data profiling checks only the first 1,000 rows. To analyze the entire dataset, switch to column profiling based on the entire dataset.

*

Column quality.

Column quality shows the proportion of valid, error, and empty values in a column.

This helps identify data issues so you can clean, correct, or remove problematic entries to ensure accurate analysis.

Column profile

Column profile provides detailed statistics for a column, such as min, max, average, count of values, and value distribution.

This helps understand the data’s structure, detect outliers, and make informed decisions for cleaning and analysis.

Note: You need to select a column to view its column profile.

*

Here’s a simple explanation of what you see in the screenshot:

Column Statistics (left side):

  • Count (1000) – Total number of rows in the column.
  • Error (0) – Number of rows with invalid or problematic data. Here, there are none.
  • Empty (0) – Number of blank or missing values. Here, none are empty.
  • Distinct (231) – Number of different values in the column.
  • Unique (43) – Number of values that appear only once.
  • Empty string (1) – Number of rows with an empty text string (not exactly null).
  • Min (P0499) – The smallest value in the column (alphabetically or numerically depending on type).
  • Max (P0747) – The largest value in the column.

Value Distribution (right side):

  • Shows a bar chart for the most frequent values in the column.
  • The longer the bar, the more times that value appears.
  • For example, P0499, P0131, P0103, P0325 appear most frequently.
  • Values with shorter bars appear less frequently.

What we can do with this:

  • Quickly see if there are any missing or error values.
  • Understand which values are repeated and which are rare.
  • Identify outliers or unusual patterns.
  • Help decide how to clean, filter, or categorize the data for analysis.

*

Resolve data import errors.

Resolving data import errors usually involves identifying why the data failed to load into a system (like Power BI, Excel, SQL, or other analytics platforms) and taking steps to fix it.

*
1. If you get a “Could not find file” error, update the file source:

Go to the data source settings and update the file path to the correct location, as explained at the beginning of this project.

*

2. Data conversion error.

If you try to convert the store_id column to Whole Number, an error may occur.
Click the row with the error to see the specific conversion issue.

*

3. You can either remove the conversion step to undo the changes or search online for a solution to the problem.

*

Importing Data from Folders and Azure SQL Database (Import or DirectQuery).

Load and append files from a folder.

All files used in this project are located here

1. Go to New Source (or Get Data) → Folder → Connect.

*

2. Browse to the sales details data set folder.

*

3. Select “Combine & Transform Data” to append tables.

You will see a list of files in this folder.

*

4. Before appending data, ensure that all files have the same structure.

Note: Switch between files to ensure that all files have the same structure.

*

5. The data has been combined successfully.

Note: Power BI automatically creates helper queries, shown in folders at the top of the Queries pane, to combine the files.

*
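Combine & Transform generates several helper queries; conceptually, the whole operation reduces to M along these lines (the folder path is a placeholder, and the real generated code is more elaborate):

```m
let
    Files    = Folder.Files("C:\Data\sales details data set"),
    CsvOnly  = Table.SelectRows(Files, each [Extension] = ".csv"),
    // Parse each file's binary Content into a table with promoted headers.
    Parsed   = Table.AddColumn(CsvOnly, "Data",
                   each Table.PromoteHeaders(Csv.Document([Content]))),
    // Append all parsed tables into one.
    Combined = Table.Combine(Parsed[Data])
in
    Combined
```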

Connect to Azure Relational SQL database.

The three storage modes in Power BI are:

Import - Data is copied into Power BI. You work with a snapshot of the data, which makes reports fast, but you need to refresh to get updates.

DirectQuery – Power BI queries the source directly. You always see the latest data, but reports may be slower because every interaction queries the database.

Dual mode - allows a table to act as both Import and DirectQuery, using the fastest option automatically while keeping data up to date.

*

Import mode

Import - Data is copied into Power BI. You work with a snapshot of the data, which makes reports fast, but you need to refresh to get updates.

*

1. Go to Get data → type sql → choose Azure SQL database → Connect.

2. Enter the server name and database name, and select Import.

*

3. You will need to provide credentials.

Note: For authentication, prefer one of the following:

  • Windows/AD authentication: Uses your domain credentials and inherits all domain security policies.
  • Microsoft Entra ID (Azure AD) authentication: Supports multi-factor authentication (MFA), conditional access, and secure token-based login.

Important: Database (username/password) authentication is the least secure method.

*

4. Here you can choose tables to load.

*

5. In Model view, you can verify that the table is in Import mode.

Note: You cannot switch from Import to DirectQuery or Dual mode.

*

6. You can also retrieve data by using an SQL statement.

Note: This will retrieve the top 20 rows of the Name and Color columns from the SalesLT.Product table.

*
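Passing a native query to Sql.Database looks like this in M (the server and database names are placeholders):

```m
// Retrieve the top 20 Name/Color rows from SalesLT.Product.
Sql.Database(
    "yourserver.database.windows.net",
    "YourDatabase",
    [Query = "SELECT TOP 20 Name, Color FROM SalesLT.Product"]
)
```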

*

DirectQuery mode.

DirectQuery - Power BI queries the source directly. You always see the latest data, but reports may be slower because every interaction queries the database.

1. Enter the server name and database name, and choose DirectQuery.

*

2. Choose a table to load.

*

3. In Model view, you can verify that the table is in DirectQuery mode.

Note: You can switch from DirectQuery to Import or Dual mode.

*

Conclusion
By properly preparing data for analysis, you can work with reliable and well-structured datasets, making it easier to uncover insights and build meaningful reports. Mastering these steps ensures that your analyses are accurate, efficient, and ready for advanced visualization and modeling.

Written by Kirill.A - Azure & Cybersecurity Consultant at AntusNet

